Refresh PolicyBench references and rankings#54

Open
MaxGhenis wants to merge 1 commit into main from policybench-refresh-refs-20260523

MaxGhenis commented May 23, 2026

Contributor

Summary

  • refresh US and UK snapshot reference outputs with current PolicyEngine packages while preserving the frozen 2026 scenario manifests
  • update the default Claude Opus benchmark alias from 4.7 to 4.8, with a legacy 4.7 app fallback for pre-rerun result files
  • regenerate analysis summaries, dashboard data, and manifest hashes on top of current main
  • refresh row/case audit annotations against the new wrong-row set
  • pin @policyengine/ui-kit to the merged GitHub commit with the light-mode text contrast fix, so this PR does not depend on an npm publish
  • make the removed US SPM input compatibility adjustment reproducible in reference-outputs metadata
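As an illustration of the "frozen manifest" and "manifest hashes" items above, here is a minimal sketch of how a manifest hash can confirm that a scenario manifest is unchanged after regeneration. The function names and the canonical-JSON scheme are assumptions for illustration; the hashing approach PolicyBench actually uses is not shown in this PR description.

```python
import hashlib
import json


def manifest_sha256(manifest: dict) -> str:
    """Hash a manifest via canonical JSON so key order does not affect the result.

    Hypothetical helper: the repository's real hashing scheme may differ.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def manifest_unchanged(manifest: dict, recorded_hash: str) -> bool:
    """Return True if the manifest still matches a previously recorded hash."""
    return manifest_sha256(manifest) == recorded_hash
```

Sorting keys before hashing means that re-serializing a manifest with a different key order would not register as a change, while any edit to a scenario value would.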

PolicyEngine refs

  • policyengine.py 4.10.0
  • policyengine-us 1.705.1 at 7a7791f7a71e53629ff7b682a6960f3ab3a9e594
  • policyengine-uk 2.88.22 at 7445869cfed59248be53778588856c2d688b34be
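A reviewer who wants to confirm that a local environment matches the versions listed above could use something like the following sketch. It relies on the standard-library `importlib.metadata`; the helper name `find_version_mismatches` is hypothetical, and the distribution name `policyengine` for policyengine.py is an assumption. It compares version strings only and does not verify the commit hashes.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Versions copied from the list above. The distribution name "policyengine"
# for policyengine.py is an assumption.
EXPECTED_VERSIONS = {
    "policyengine": "4.10.0",
    "policyengine-us": "1.705.1",
    "policyengine-uk": "2.88.22",
}


def find_version_mismatches(
    expected: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatched package to (expected, installed).

    The installed value is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling `find_version_mismatches(EXPECTED_VERSIONS)` in the project environment returns an empty dict when all three versions match.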

Verification

  • uv run --no-sync pytest -q tests/test_scenarios.py tests/test_snapshot_artifacts.py tests/test_analysis.py
  • uv run --no-sync pytest -q tests/test_eval_no_tools.py tests/test_chunked_eval.py tests/test_cli_helpers.py
  • uv run --no-sync ruff check policybench/cli.py policybench/policyengine_runtime.py policybench/scenarios.py tests/test_scenarios.py
  • uv run --no-sync ruff check policybench/config.py policybench/eval_no_tools.py tests/test_eval_no_tools.py tests/test_chunked_eval.py
  • bun run lint
  • bun test tests
  • bun run build
  • GitHub CI: lint, test, and app passed
  • Vercel preview passed

vercel Bot commented May 23, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| policybench-site | Ready | Preview, Comment | Jun 7, 2026 7:26pm |
