LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison by alessandralanz · Pull Request #265 · lightspeed-core/lightspeed-evaluation

alessandralanz · 2026-06-26T21:22:32Z

Description

Adds an evaluation framework for OKP RAG quality regression testing against the lightspeed-stack. Includes:

97 OKP RAG benchmark conversations covering single-turn knowledge, multi-turn retention, edge cases, and
negative/off-topic queries
Baseline comparison script (compare_against_baseline) with --check-only mode for CI gating
System config for PR-gate evaluation using gpt-4o-mini as the judge LLM
Version-controlled baseline snapshot (102 evaluations)
A/B/C comparison script for comparing multiple evaluation runs

Type of change

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

Assisted-by: Claude (Claude Code CLI)
Generated by: N/A

Related Tickets & Documents

Related Issue #
Closes # LCORE-2802

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.
make pre-commit passes all checks (pylint, pyright, ruff, black, etc.)
- GitLab CI pipeline runs 97 conversations producing 506 evaluations against a live lightspeed-stack instance with OKP enabled
- Baseline comparison script validates current results against stored baseline

Summary by CodeRabbit

New Features
- Added regression gate comparison tools for evaluating current runs against baselines, including a new three-run A/B/C comparison flow.
- Added support for generating markdown reports and structured output for regression results.
- Introduced standardized evaluation configuration for scoring, thresholds, logging, visualization, and result export.
Bug Fixes
- Improved handling of missing, incomplete, or invalid evaluation summaries during comparison runs.
- Added clearer pass/warn/fail outcomes for metric regressions and fallback baseline behavior.

…. Checked against a live OKP image with 85% pass rate

… Guide's dataset sizing and distribution recommendations

…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py

coderabbitai · 2026-06-26T21:22:44Z

Walkthrough

Adds a regression gate config plus two CLI comparison scripts for baseline and three-run A/B/C evaluation summaries. The scripts compute per-metric deltas, classify PASS/WARN/FAIL outcomes, and can write markdown reports. New pytest helpers and tests cover summary loading, delta logic, verdict selection, and CLI execution.

Changes

Regression gate updates

Layer / File(s)	Summary
Gate config and package doc `config/lcore_regression/system-config-pr-gate.yaml`, `script/regression/__init__.py`	Defines the PR gate runtime settings, metric thresholds, result storage, visualization defaults, logging, and the regression package docstring.
Baseline loading and deltas `script/regression/compare_against_baseline.py`, `tests/script/conftest.py`, `tests/script/test_compare_against_baseline.py`	Loads one summary JSON per directory, computes metric deltas, and adds shared summary fixtures plus loading and delta tests.
Baseline reporting and entrypoint `script/regression/compare_against_baseline.py`, `tests/script/test_compare_against_baseline.py`	Formats the baseline markdown report and wires check-only and fail-on-critical exit handling, with markdown tests.
A/B/C parsing and verdicts `script/regression/compare_abc_runs.py`, `tests/script/test_compare_abc_runs.py`	Parses Run A/B/C inputs, loads summaries, and derives OKP/PR/total PASS/WARN/FAIL verdicts.
A/B/C markdown and integration `script/regression/compare_abc_runs.py`, `tests/script/test_compare_abc_runs.py`	Builds the A/B/C markdown report and exercises CLI output writing, missing-run handling, and baseline fallback.

Sequence Diagram(s)

Baseline comparison flow

sequenceDiagram
  participant Main as main
  participant Loader as find_and_load_summary
  participant Delta as compute_metric_deltas
  participant Report as generate_markdown_summary
  Main->>Loader: main() loads baseline and current summary JSON
  Main->>Delta: main() computes metric deltas
  Main->>Report: main() builds the markdown summary

A/B/C regression attribution flow

sequenceDiagram
  participant Main as main
  participant Verdict as determine_gate_verdict
  participant Report as generate_abc_markdown_summary
  Main->>Verdict: main() evaluates OKP, PR, and total deltas
  Main->>Report: main() builds the markdown report

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

VladimirKadlec
xmican10
asamal4

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title is clear and matches the main change: adding OKP RAG regression benchmarks and baseline comparison tooling.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@script/regression/compare_abc_runs.py`:
- Around line 89-110: The verdict logic in compare_abc_runs currently ignores
total cumulative critical regressions unless pr_deltas is None, so a split A→B
and B→C failure can still pass. Update the gate in the main decision block to
consider total_has_critical alongside the existing pr_has_critical and
okp_has_critical checks, using the same verdict flow in the compare_abc_runs
function so any critical total regression returns a failing result when
intended.

In `@script/regression/compare_against_baseline.py`:
- Around line 141-157: The status calculation in compare_against_baseline should
not default to PASS when a baseline metric is present but the current run is
missing it. Update the logic around score_delta/status in the comparison flow to
treat “present in baseline, missing in current” as a degraded outcome, and apply
the same handling for pass_rate_delta where relevant. Use the existing
compare_against_baseline metric handling and CRITICAL_METRICS thresholding so
missing current values are reported as WARN or FAIL instead of PASS.
- Around line 232-252: The compare script’s check-only mode is emitting the
baseline/current summary prints before the args.check_only branch, so stdout
contains more than the required single token. In compare_against_baseline.py,
update the control flow around compute_metric_deltas and the check-only handling
so the summary prints are skipped when args.check_only is set, leaving only the
final ok/regression output from the check_only path.

In `@tests/script/test_compare_against_baseline.py`:
- Around line 48-51: The missing-directory test in
test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp
path that may exist on some machines. Update the test to use pytest’s tmp_path
fixture and construct a guaranteed-missing subpath (for example via tmp_path
with a non-created child) before calling find_and_load_summary, so the
FileNotFoundError assertion always exercises the intended path.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e9f84c03-82ee-4691-8e2f-cb77a7f724fc

📥 Commits

Reviewing files that changed from the base of the PR and between 6ff47a6 and 5d59325.

📒 Files selected for processing (9)

baselines/lcore_regression/current_baseline_summary.json
config/lcore_regression/system-config-pr-gate.yaml
eval_data/lcore_regression/okp_rag_quality.yaml
script/regression/__init__.py
script/regression/compare_abc_runs.py
script/regression/compare_against_baseline.py
tests/script/conftest.py
tests/script/test_compare_abc_runs.py
tests/script/test_compare_against_baseline.py

coderabbitai · 2026-06-26T21:28:47Z

+    pr_has_critical = _has_critical_regression(pr_deltas)
+    okp_has_critical = _has_critical_regression(okp_deltas)
+    total_has_critical = _has_critical_regression(total_deltas)
+
+    if pr_deltas is not None and pr_has_critical:
+        return ("FAIL", "PR caused critical regression (B vs C).")
+
+    if okp_has_critical and not pr_has_critical:
+        return ("WARN", "OKP data caused regression, not the PR (A vs B).")
+
+    if total_has_critical and pr_deltas is None:
+        return ("FAIL", "Critical regression detected vs baseline (A vs C).")
+
+    has_any_warn = (
+        _has_noncritical_regression(okp_deltas)
+        or _has_noncritical_regression(pr_deltas)
+        or _has_noncritical_regression(total_deltas)
+    )
+    if has_any_warn:
+        return ("WARN", "Non-critical metrics degraded.")
+
+    return ("PASS", "No critical regressions detected.")


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

sed -n '1,220p' script/regression/compare_abc_runs.py

Repository: lightspeed-core/lightspeed-evaluation

Length of output: 7579

🏁 Script executed:

python3 - <<'PY' from pathlib import Path p = Path('script/regression/compare_abc_runs.py') text = p.read_text() for needle in ['def _has_critical_regression', 'def _has_noncritical_regression', 'def determine_gate_verdict']: idx = text.find(needle) print('\n###', needle, idx) if idx != -1: start = text.rfind('\n', 0, idx) end = text.find('\n\ndef ', idx+1) print(text[idx:end if end != -1 else None]) PY

Repository: lightspeed-core/lightspeed-evaluation

Length of output: 2244

Include total_has_critical in the gate verdict. Right now a critical regression split across A→B and B→C can still return PASS, because total_deltas is only considered when pr_deltas is None. If total cumulative quality should gate the run, check total_has_critical here too.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@script/regression/compare_abc_runs.py` around lines 89 - 110, The verdict logic in compare_abc_runs currently ignores total cumulative critical regressions unless pr_deltas is None, so a split A→B and B→C failure can still pass. Update the gate in the main decision block to consider total_has_critical alongside the existing pr_has_critical and okp_has_critical checks, using the same verdict flow in the compare_abc_runs function so any critical total regression returns a failing result when intended.

coderabbitai · 2026-06-26T21:28:48Z

+        if baseline_mean is not None and current_mean is not None:
+            score_delta = current_mean - baseline_mean
+        else:
+            score_delta = None
+
+        if baseline_pass_rate is not None and current_pass_rate is not None:
+            pass_rate_delta = current_pass_rate - baseline_pass_rate
+        else:
+            pass_rate_delta = None
+
+        is_critical = metric in CRITICAL_METRICS
+        threshold = critical_delta if is_critical else warn_delta
+
+        if score_delta is not None and score_delta < -threshold:
+            status = "FAIL" if is_critical else "WARN"
+        else:
+            status = "PASS"


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Treat missing current metrics as regressions, not PASS.

Lines 141-157 fall back to PASS whenever the current run omits a metric from the baseline. That lets a broken or disabled evaluator silently pass the gate, including for critical metrics. Please surface “present in baseline, missing in current” as a degraded result instead of a success.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@script/regression/compare_against_baseline.py` around lines 141 - 157, The status calculation in compare_against_baseline should not default to PASS when a baseline metric is present but the current run is missing it. Update the logic around score_delta/status in the comparison flow to treat “present in baseline, missing in current” as a degraded outcome, and apply the same handling for pass_rate_delta where relevant. Use the existing compare_against_baseline metric handling and CRITICAL_METRICS thresholding so missing current values are reported as WARN or FAIL instead of PASS.

coderabbitai · 2026-06-26T21:28:48Z

+    print(
+        f"Baseline: {baseline_data['total_evaluations']} evaluations "
+        f"({baseline_data['timestamp']})"
+    )
+    print(
+        f"Current:  {current_data['total_evaluations']} evaluations "
+        f"({current_data['timestamp']})"
+    )
+
+    deltas = compute_metric_deltas(
+        baseline_data,
+        current_data,
+        args.critical_delta,
+        args.warn_delta,
+    )
+
+    has_critical_fail = any(d["status"] == "FAIL" and d["is_critical"] for d in deltas)
+
+    if args.check_only:
+        print("regression" if has_critical_fail else "ok")
+        return 0


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

--check-only currently emits extra stdout.

Lines 232-239 print the run summaries before the Lines 250-252 check_only branch, so this mode outputs more than just ok/regression. That breaks the documented single-token contract for callers parsing stdout.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@script/regression/compare_against_baseline.py` around lines 232 - 252, The compare script’s check-only mode is emitting the baseline/current summary prints before the args.check_only branch, so stdout contains more than the required single token. In compare_against_baseline.py, update the control flow around compute_metric_deltas and the check-only handling so the summary prints are skipped when args.check_only is set, leaving only the final ok/regression output from the check_only path.

coderabbitai · 2026-06-26T21:28:48Z

+    def test_raises_on_missing_directory(self) -> None:
+        """Non-existent directory should raise FileNotFoundError."""
+        with pytest.raises(FileNotFoundError, match="Directory not found"):
+            find_and_load_summary("/tmp/nonexistent_dir_abc123")


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Make the missing-directory test deterministic.

Line 51 hardcodes /tmp/nonexistent_dir_abc123. If that path ever exists on a developer machine or shared runner, this test stops exercising the intended error path. Build it from tmp_path / "does-not-exist" instead.

🧰 Tools

🪛 ast-grep (0.44.0)

[info] 50-50: Do not hardcode temporary file or directory names
Context: "/tmp/nonexistent_dir_abc123"
Note: [CWE-377] Insecure Temporary File.

(hardcoded-tmp-file)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/script/test_compare_against_baseline.py` around lines 48 - 51, The missing-directory test in test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp path that may exist on some machines. Update the test to use pytest’s tmp_path fixture and construct a guaranteed-missing subpath (for example via tmp_path with a non-created child) before calling find_and_load_summary, so the FileNotFoundError assertion always exercises the intended path.

alessandralanz added 5 commits June 26, 2026 16:24

Add OKP RAG quality benchmarks and system config for LCORE regression…

ff73802

…. Checked against a live OKP image with 85% pass rate

Add baseline comparison script for regression gating

7e29a9a

Revised benchmarks to adhere to Evaluation Data Collection Standard +…

74768c9

… Guide's dataset sizing and distribution recommendations

Add version-controlled baseline for regression gating

5e1f772

Add three run A/B/C comparison script and regression tests. Attribute…

5d59325

…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265

LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265
alessandralanz wants to merge 5 commits into
lightspeed-core:mainfrom
alessandralanz:lcore-regression

alessandralanz commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

alessandralanz commented Jun 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Tools used to create PR

Related Tickets & Documents

Checklist before requesting a review

Testing

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Baseline comparison flow

A/B/C regression attribution flow

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

alessandralanz commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading