Skip to content

LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265

Open
alessandralanz wants to merge 5 commits into
lightspeed-core:mainfrom
alessandralanz:lcore-regression
Open

LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265
alessandralanz wants to merge 5 commits into
lightspeed-core:mainfrom
alessandralanz:lcore-regression

Conversation

@alessandralanz

@alessandralanz alessandralanz commented Jun 26, 2026

Copy link
Copy Markdown

Description

Adds an evaluation framework for OKP RAG quality regression testing against the lightspeed-stack. Includes:

  • 97 OKP RAG benchmark conversations covering single-turn knowledge, multi-turn retention, edge cases, and
    negative/off-topic queries
  • Baseline comparison script (compare_against_baseline) with --check-only mode for CI gating
  • System config for PR-gate evaluation using gpt-4o-mini as the judge LLM
  • Version-controlled baseline snapshot (102 evaluations)
  • A/B/C comparison script for comparing multiple evaluation runs

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Unit tests improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Claude (Claude Code CLI)
  • Generated by: N/A

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.

  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

  • make pre-commit passes all checks (pylint, pyright, ruff, black, etc.)

    • GitLab CI pipeline runs 97 conversations producing 506 evaluations against a live lightspeed-stack instance with OKP enabled
    • Baseline comparison script validates current results against stored baseline

Summary by CodeRabbit

  • New Features

    • Added regression gate comparison tools for evaluating current runs against baselines, including a new three-run A/B/C comparison flow.
    • Added support for generating markdown reports and structured output for regression results.
    • Introduced standardized evaluation configuration for scoring, thresholds, logging, visualization, and result export.
  • Bug Fixes

    • Improved handling of missing, incomplete, or invalid evaluation summaries during comparison runs.
    • Added clearer pass/warn/fail outcomes for metric regressions and fallback baseline behavior.

…. Checked against a live OKP image with 85% pass rate
… Guide's dataset sizing and distribution recommendations
…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Adds a regression gate config plus two CLI comparison scripts for baseline and three-run A/B/C evaluation summaries. The scripts compute per-metric deltas, classify PASS/WARN/FAIL outcomes, and can write markdown reports. New pytest helpers and tests cover summary loading, delta logic, verdict selection, and CLI execution.

Changes

Regression gate updates

Layer / File(s) Summary
Gate config and package doc
config/lcore_regression/system-config-pr-gate.yaml, script/regression/__init__.py
Defines the PR gate runtime settings, metric thresholds, result storage, visualization defaults, logging, and the regression package docstring.
Baseline loading and deltas
script/regression/compare_against_baseline.py, tests/script/conftest.py, tests/script/test_compare_against_baseline.py
Loads one summary JSON per directory, computes metric deltas, and adds shared summary fixtures plus loading and delta tests.
Baseline reporting and entrypoint
script/regression/compare_against_baseline.py, tests/script/test_compare_against_baseline.py
Formats the baseline markdown report and wires check-only and fail-on-critical exit handling, with markdown tests.
A/B/C parsing and verdicts
script/regression/compare_abc_runs.py, tests/script/test_compare_abc_runs.py
Parses Run A/B/C inputs, loads summaries, and derives OKP/PR/total PASS/WARN/FAIL verdicts.
A/B/C markdown and integration
script/regression/compare_abc_runs.py, tests/script/test_compare_abc_runs.py
Builds the A/B/C markdown report and exercises CLI output writing, missing-run handling, and baseline fallback.

Sequence Diagram(s)

Baseline comparison flow

sequenceDiagram
  participant Main as main
  participant Loader as find_and_load_summary
  participant Delta as compute_metric_deltas
  participant Report as generate_markdown_summary
  Main->>Loader: main() loads baseline and current summary JSON
  Main->>Delta: main() computes metric deltas
  Main->>Report: main() builds the markdown summary
Loading

A/B/C regression attribution flow

sequenceDiagram
  participant Main as main
  participant Verdict as determine_gate_verdict
  participant Report as generate_abc_markdown_summary
  Main->>Verdict: main() evaluates OKP, PR, and total deltas
  Main->>Report: main() builds the markdown report
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • VladimirKadlec
  • xmican10
  • asamal4
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is clear and matches the main change: adding OKP RAG regression benchmarks and baseline comparison tooling.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@script/regression/compare_abc_runs.py`:
- Around line 89-110: The verdict logic in compare_abc_runs currently ignores
total cumulative critical regressions unless pr_deltas is None, so a split A→B
and B→C failure can still pass. Update the gate in the main decision block to
consider total_has_critical alongside the existing pr_has_critical and
okp_has_critical checks, using the same verdict flow in the compare_abc_runs
function so any critical total regression returns a failing result when
intended.

In `@script/regression/compare_against_baseline.py`:
- Around line 141-157: The status calculation in compare_against_baseline should
not default to PASS when a baseline metric is present but the current run is
missing it. Update the logic around score_delta/status in the comparison flow to
treat “present in baseline, missing in current” as a degraded outcome, and apply
the same handling for pass_rate_delta where relevant. Use the existing
compare_against_baseline metric handling and CRITICAL_METRICS thresholding so
missing current values are reported as WARN or FAIL instead of PASS.
- Around line 232-252: The compare script’s check-only mode is emitting the
baseline/current summary prints before the args.check_only branch, so stdout
contains more than the required single token. In compare_against_baseline.py,
update the control flow around compute_metric_deltas and the check-only handling
so the summary prints are skipped when args.check_only is set, leaving only the
final ok/regression output from the check_only path.

In `@tests/script/test_compare_against_baseline.py`:
- Around line 48-51: The missing-directory test in
test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp
path that may exist on some machines. Update the test to use pytest’s tmp_path
fixture and construct a guaranteed-missing subpath (for example via tmp_path
with a non-created child) before calling find_and_load_summary, so the
FileNotFoundError assertion always exercises the intended path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e9f84c03-82ee-4691-8e2f-cb77a7f724fc

📥 Commits

Reviewing files that changed from the base of the PR and between 6ff47a6 and 5d59325.

📒 Files selected for processing (9)
  • baselines/lcore_regression/current_baseline_summary.json
  • config/lcore_regression/system-config-pr-gate.yaml
  • eval_data/lcore_regression/okp_rag_quality.yaml
  • script/regression/__init__.py
  • script/regression/compare_abc_runs.py
  • script/regression/compare_against_baseline.py
  • tests/script/conftest.py
  • tests/script/test_compare_abc_runs.py
  • tests/script/test_compare_against_baseline.py

Comment on lines +89 to +110
pr_has_critical = _has_critical_regression(pr_deltas)
okp_has_critical = _has_critical_regression(okp_deltas)
total_has_critical = _has_critical_regression(total_deltas)

if pr_deltas is not None and pr_has_critical:
return ("FAIL", "PR caused critical regression (B vs C).")

if okp_has_critical and not pr_has_critical:
return ("WARN", "OKP data caused regression, not the PR (A vs B).")

if total_has_critical and pr_deltas is None:
return ("FAIL", "Critical regression detected vs baseline (A vs C).")

has_any_warn = (
_has_noncritical_regression(okp_deltas)
or _has_noncritical_regression(pr_deltas)
or _has_noncritical_regression(total_deltas)
)
if has_any_warn:
return ("WARN", "Non-critical metrics degraded.")

return ("PASS", "No critical regressions detected.")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

sed -n '1,220p' script/regression/compare_abc_runs.py

Repository: lightspeed-core/lightspeed-evaluation

Length of output: 7579


🏁 Script executed:

python3 - <<'PY'
from pathlib import Path
p = Path('script/regression/compare_abc_runs.py')
text = p.read_text()
for needle in ['def _has_critical_regression', 'def _has_noncritical_regression', 'def determine_gate_verdict']:
    idx = text.find(needle)
    print('\n###', needle, idx)
    if idx != -1:
        start = text.rfind('\n', 0, idx)
        end = text.find('\n\ndef ', idx+1)
        print(text[idx:end if end != -1 else None])
PY

Repository: lightspeed-core/lightspeed-evaluation

Length of output: 2244


Include total_has_critical in the gate verdict. Right now a critical regression split across A→B and B→C can still return PASS, because total_deltas is only considered when pr_deltas is None. If total cumulative quality should gate the run, check total_has_critical here too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@script/regression/compare_abc_runs.py` around lines 89 - 110, The verdict
logic in compare_abc_runs currently ignores total cumulative critical
regressions unless pr_deltas is None, so a split A→B and B→C failure can still
pass. Update the gate in the main decision block to consider total_has_critical
alongside the existing pr_has_critical and okp_has_critical checks, using the
same verdict flow in the compare_abc_runs function so any critical total
regression returns a failing result when intended.

Comment on lines +141 to +157
if baseline_mean is not None and current_mean is not None:
score_delta = current_mean - baseline_mean
else:
score_delta = None

if baseline_pass_rate is not None and current_pass_rate is not None:
pass_rate_delta = current_pass_rate - baseline_pass_rate
else:
pass_rate_delta = None

is_critical = metric in CRITICAL_METRICS
threshold = critical_delta if is_critical else warn_delta

if score_delta is not None and score_delta < -threshold:
status = "FAIL" if is_critical else "WARN"
else:
status = "PASS"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Treat missing current metrics as regressions, not PASS.

Lines 141-157 fall back to PASS whenever the current run omits a metric from the baseline. That lets a broken or disabled evaluator silently pass the gate, including for critical metrics. Please surface “present in baseline, missing in current” as a degraded result instead of a success.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@script/regression/compare_against_baseline.py` around lines 141 - 157, The
status calculation in compare_against_baseline should not default to PASS when a
baseline metric is present but the current run is missing it. Update the logic
around score_delta/status in the comparison flow to treat “present in baseline,
missing in current” as a degraded outcome, and apply the same handling for
pass_rate_delta where relevant. Use the existing compare_against_baseline metric
handling and CRITICAL_METRICS thresholding so missing current values are
reported as WARN or FAIL instead of PASS.

Comment on lines +232 to +252
print(
f"Baseline: {baseline_data['total_evaluations']} evaluations "
f"({baseline_data['timestamp']})"
)
print(
f"Current: {current_data['total_evaluations']} evaluations "
f"({current_data['timestamp']})"
)

deltas = compute_metric_deltas(
baseline_data,
current_data,
args.critical_delta,
args.warn_delta,
)

has_critical_fail = any(d["status"] == "FAIL" and d["is_critical"] for d in deltas)

if args.check_only:
print("regression" if has_critical_fail else "ok")
return 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

--check-only currently emits extra stdout.

Lines 232-239 print the run summaries before the Lines 250-252 check_only branch, so this mode outputs more than just ok/regression. That breaks the documented single-token contract for callers parsing stdout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@script/regression/compare_against_baseline.py` around lines 232 - 252, The
compare script’s check-only mode is emitting the baseline/current summary prints
before the args.check_only branch, so stdout contains more than the required
single token. In compare_against_baseline.py, update the control flow around
compute_metric_deltas and the check-only handling so the summary prints are
skipped when args.check_only is set, leaving only the final ok/regression output
from the check_only path.

Comment on lines +48 to +51
def test_raises_on_missing_directory(self) -> None:
"""Non-existent directory should raise FileNotFoundError."""
with pytest.raises(FileNotFoundError, match="Directory not found"):
find_and_load_summary("/tmp/nonexistent_dir_abc123")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Make the missing-directory test deterministic.

Line 51 hardcodes /tmp/nonexistent_dir_abc123. If that path ever exists on a developer machine or shared runner, this test stops exercising the intended error path. Build it from tmp_path / "does-not-exist" instead.

🧰 Tools
🪛 ast-grep (0.44.0)

[info] 50-50: Do not hardcode temporary file or directory names
Context: "/tmp/nonexistent_dir_abc123"
Note: [CWE-377] Insecure Temporary File.

(hardcoded-tmp-file)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/script/test_compare_against_baseline.py` around lines 48 - 51, The
missing-directory test in test_raises_on_missing_directory is nondeterministic
because it hardcodes a /tmp path that may exist on some machines. Update the
test to use pytest’s tmp_path fixture and construct a guaranteed-missing subpath
(for example via tmp_path with a non-created child) before calling
find_and_load_summary, so the FileNotFoundError assertion always exercises the
intended path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant