LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265
LCORE-2802: Add OKP RAG quality regression benchmarks and baseline comparison#265alessandralanz wants to merge 5 commits into
Conversation
…. Checked against a live OKP image with 85% pass rate
… Guide's dataset sizing and distribution recommendations
…s quality regression to PR code changes vs OKP data changes using pairwise A/B/C comparisons and shared test helpers have been moved to conftest.py
WalkthroughAdds a regression gate config plus two CLI comparison scripts for baseline and three-run A/B/C evaluation summaries. The scripts compute per-metric deltas, classify PASS/WARN/FAIL outcomes, and can write markdown reports. New pytest helpers and tests cover summary loading, delta logic, verdict selection, and CLI execution. ChangesRegression gate updates
Sequence Diagram(s)Baseline comparison flowsequenceDiagram
participant Main as main
participant Loader as find_and_load_summary
participant Delta as compute_metric_deltas
participant Report as generate_markdown_summary
Main->>Loader: main() loads baseline and current summary JSON
Main->>Delta: main() computes metric deltas
Main->>Report: main() builds the markdown summary
A/B/C regression attribution flowsequenceDiagram
participant Main as main
participant Verdict as determine_gate_verdict
participant Report as generate_abc_markdown_summary
Main->>Verdict: main() evaluates OKP, PR, and total deltas
Main->>Report: main() builds the markdown report
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@script/regression/compare_abc_runs.py`:
- Around line 89-110: The verdict logic in compare_abc_runs currently ignores
total cumulative critical regressions unless pr_deltas is None, so a split A→B
and B→C failure can still pass. Update the gate in the main decision block to
consider total_has_critical alongside the existing pr_has_critical and
okp_has_critical checks, using the same verdict flow in the compare_abc_runs
function so any critical total regression returns a failing result when
intended.
In `@script/regression/compare_against_baseline.py`:
- Around line 141-157: The status calculation in compare_against_baseline should
not default to PASS when a baseline metric is present but the current run is
missing it. Update the logic around score_delta/status in the comparison flow to
treat “present in baseline, missing in current” as a degraded outcome, and apply
the same handling for pass_rate_delta where relevant. Use the existing
compare_against_baseline metric handling and CRITICAL_METRICS thresholding so
missing current values are reported as WARN or FAIL instead of PASS.
- Around line 232-252: The compare script’s check-only mode is emitting the
baseline/current summary prints before the args.check_only branch, so stdout
contains more than the required single token. In compare_against_baseline.py,
update the control flow around compute_metric_deltas and the check-only handling
so the summary prints are skipped when args.check_only is set, leaving only the
final ok/regression output from the check_only path.
In `@tests/script/test_compare_against_baseline.py`:
- Around line 48-51: The missing-directory test in
test_raises_on_missing_directory is nondeterministic because it hardcodes a /tmp
path that may exist on some machines. Update the test to use pytest’s tmp_path
fixture and construct a guaranteed-missing subpath (for example via tmp_path
with a non-created child) before calling find_and_load_summary, so the
FileNotFoundError assertion always exercises the intended path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e9f84c03-82ee-4691-8e2f-cb77a7f724fc
📒 Files selected for processing (9)
baselines/lcore_regression/current_baseline_summary.jsonconfig/lcore_regression/system-config-pr-gate.yamleval_data/lcore_regression/okp_rag_quality.yamlscript/regression/__init__.pyscript/regression/compare_abc_runs.pyscript/regression/compare_against_baseline.pytests/script/conftest.pytests/script/test_compare_abc_runs.pytests/script/test_compare_against_baseline.py
| pr_has_critical = _has_critical_regression(pr_deltas) | ||
| okp_has_critical = _has_critical_regression(okp_deltas) | ||
| total_has_critical = _has_critical_regression(total_deltas) | ||
|
|
||
| if pr_deltas is not None and pr_has_critical: | ||
| return ("FAIL", "PR caused critical regression (B vs C).") | ||
|
|
||
| if okp_has_critical and not pr_has_critical: | ||
| return ("WARN", "OKP data caused regression, not the PR (A vs B).") | ||
|
|
||
| if total_has_critical and pr_deltas is None: | ||
| return ("FAIL", "Critical regression detected vs baseline (A vs C).") | ||
|
|
||
| has_any_warn = ( | ||
| _has_noncritical_regression(okp_deltas) | ||
| or _has_noncritical_regression(pr_deltas) | ||
| or _has_noncritical_regression(total_deltas) | ||
| ) | ||
| if has_any_warn: | ||
| return ("WARN", "Non-critical metrics degraded.") | ||
|
|
||
| return ("PASS", "No critical regressions detected.") |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
sed -n '1,220p' script/regression/compare_abc_runs.pyRepository: lightspeed-core/lightspeed-evaluation
Length of output: 7579
🏁 Script executed:
python3 - <<'PY'
from pathlib import Path
p = Path('script/regression/compare_abc_runs.py')
text = p.read_text()
for needle in ['def _has_critical_regression', 'def _has_noncritical_regression', 'def determine_gate_verdict']:
idx = text.find(needle)
print('\n###', needle, idx)
if idx != -1:
start = text.rfind('\n', 0, idx)
end = text.find('\n\ndef ', idx+1)
print(text[idx:end if end != -1 else None])
PYRepository: lightspeed-core/lightspeed-evaluation
Length of output: 2244
Include total_has_critical in the gate verdict. Right now a critical regression split across A→B and B→C can still return PASS, because total_deltas is only considered when pr_deltas is None. If total cumulative quality should gate the run, check total_has_critical here too.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@script/regression/compare_abc_runs.py` around lines 89 - 110, The verdict
logic in compare_abc_runs currently ignores total cumulative critical
regressions unless pr_deltas is None, so a split A→B and B→C failure can still
pass. Update the gate in the main decision block to consider total_has_critical
alongside the existing pr_has_critical and okp_has_critical checks, using the
same verdict flow in the compare_abc_runs function so any critical total
regression returns a failing result when intended.
| if baseline_mean is not None and current_mean is not None: | ||
| score_delta = current_mean - baseline_mean | ||
| else: | ||
| score_delta = None | ||
|
|
||
| if baseline_pass_rate is not None and current_pass_rate is not None: | ||
| pass_rate_delta = current_pass_rate - baseline_pass_rate | ||
| else: | ||
| pass_rate_delta = None | ||
|
|
||
| is_critical = metric in CRITICAL_METRICS | ||
| threshold = critical_delta if is_critical else warn_delta | ||
|
|
||
| if score_delta is not None and score_delta < -threshold: | ||
| status = "FAIL" if is_critical else "WARN" | ||
| else: | ||
| status = "PASS" |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Treat missing current metrics as regressions, not PASS.
Lines 141-157 fall back to PASS whenever the current run omits a metric from the baseline. That lets a broken or disabled evaluator silently pass the gate, including for critical metrics. Please surface “present in baseline, missing in current” as a degraded result instead of a success.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@script/regression/compare_against_baseline.py` around lines 141 - 157, The
status calculation in compare_against_baseline should not default to PASS when a
baseline metric is present but the current run is missing it. Update the logic
around score_delta/status in the comparison flow to treat “present in baseline,
missing in current” as a degraded outcome, and apply the same handling for
pass_rate_delta where relevant. Use the existing compare_against_baseline metric
handling and CRITICAL_METRICS thresholding so missing current values are
reported as WARN or FAIL instead of PASS.
| print( | ||
| f"Baseline: {baseline_data['total_evaluations']} evaluations " | ||
| f"({baseline_data['timestamp']})" | ||
| ) | ||
| print( | ||
| f"Current: {current_data['total_evaluations']} evaluations " | ||
| f"({current_data['timestamp']})" | ||
| ) | ||
|
|
||
| deltas = compute_metric_deltas( | ||
| baseline_data, | ||
| current_data, | ||
| args.critical_delta, | ||
| args.warn_delta, | ||
| ) | ||
|
|
||
| has_critical_fail = any(d["status"] == "FAIL" and d["is_critical"] for d in deltas) | ||
|
|
||
| if args.check_only: | ||
| print("regression" if has_critical_fail else "ok") | ||
| return 0 |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
--check-only currently emits extra stdout.
Lines 232-239 print the run summaries before the Lines 250-252 check_only branch, so this mode outputs more than just ok/regression. That breaks the documented single-token contract for callers parsing stdout.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@script/regression/compare_against_baseline.py` around lines 232 - 252, The
compare script’s check-only mode is emitting the baseline/current summary prints
before the args.check_only branch, so stdout contains more than the required
single token. In compare_against_baseline.py, update the control flow around
compute_metric_deltas and the check-only handling so the summary prints are
skipped when args.check_only is set, leaving only the final ok/regression output
from the check_only path.
| def test_raises_on_missing_directory(self) -> None: | ||
| """Non-existent directory should raise FileNotFoundError.""" | ||
| with pytest.raises(FileNotFoundError, match="Directory not found"): | ||
| find_and_load_summary("/tmp/nonexistent_dir_abc123") |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Make the missing-directory test deterministic.
Line 51 hardcodes /tmp/nonexistent_dir_abc123. If that path ever exists on a developer machine or shared runner, this test stops exercising the intended error path. Build it from tmp_path / "does-not-exist" instead.
🧰 Tools
🪛 ast-grep (0.44.0)
[info] 50-50: Do not hardcode temporary file or directory names
Context: "/tmp/nonexistent_dir_abc123"
Note: [CWE-377] Insecure Temporary File.
(hardcoded-tmp-file)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/script/test_compare_against_baseline.py` around lines 48 - 51, The
missing-directory test in test_raises_on_missing_directory is nondeterministic
because it hardcodes a /tmp path that may exist on some machines. Update the
test to use pytest’s tmp_path fixture and construct a guaranteed-missing subpath
(for example via tmp_path with a non-created child) before calling
find_and_load_summary, so the FileNotFoundError assertion always exercises the
intended path.
Description
Adds an evaluation framework for OKP RAG quality regression testing against the lightspeed-stack. Includes:
negative/off-topic queries
compare_against_baseline) with--check-onlymode for CI gatinggpt-4o-minias the judge LLMType of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Please provide detailed steps to perform tests related to this code change.
How were the fix/results from this change verified? Please provide relevant screenshots or results.
make pre-commitpasses all checks (pylint, pyright, ruff, black, etc.)lightspeed-stackinstance with OKP enabledSummary by CodeRabbit
New Features
Bug Fixes