Review fixes: drop 0.3-floor metric, add paper bootstrap CIs, harden + clean up by MaxGhenis · Pull Request #58 · PolicyEngine/policybench

MaxGhenis · 2026-06-04T19:32:46Z

Summary

A detailed review of the repo (Python core, app, paper, docs, tests) followed by fixes across every area, plus an independent review-fix cycle. 8 commits, +689/−882. All tests, lint, app build, and the paper render are green.

What changed

Remove the legacy 0.3-floor impact metric — HOUSEHOLD_IMPACT_SCORE_FLOOR=0.3 was a superseded weighting scheme, still computed and written to impact_summary_by_model.csv but no longer consumed by the leaderboard, app, or paper. Removed the two analysis.py functions and all wiring, prompt_mode_comparison._impact_comparison, the app impactScore type, and the README/RESULTS/docs references. Renamed the paper's vestigial impact_floor_sensitivity() → weighting_sensitivity() / tbl-impact-floor → tbl-weighting-sensitivity and fixed its caption.

Add household-bootstrap confidence intervals to the paper (paper-only) — the headline scores are means over a 100-household sample, but the manuscript reported no uncertainty. Added analysis.bootstrap_headline_cis() (resample households with replacement, paired across models, percentile CI + bootstrap rank ranges) with unit tests, and wired CI + rank-range columns into the leaderboards plus an "Uncertainty" section. CIs are a manuscript artifact only — the app/leaderboard still show point estimates (and the app's dead bootstrap-interval scaffolding + ± stability render were removed). In the frozen snapshot the UK field is a statistical dead heat (all models overlap the leader).

Harden ground-truth / spec / eval — fixed an operator-precedence trap in the impact-weight gate, added a strict spec-resolution sentinel (so a renamed PE variable fails loudly instead of silently scoring as a +1 amount), removed dead code, and documented the load-bearing "explanation trailer is canonical" override.

App cleanup — removed dead code/deps, replaced hardcoded hex with ui-kit tokens, fixed a UK $0 currency bug and a drifting "frontier only" tooltip.

CLI / docs reproducibility — fixed a RESULTS.md loop that aborted on Claude (--parallel 4), filled runbook gaps (gemini-3.5-flash, a row-repair section), added a $POLICYBENCH_CACHE_DIR override and single-country export-full-run auto-detect, and added CLI tests.

Population-weight drift sentinel — population_weights.json is the headline weight source; added a sentinel that fails if it drifts from the spec/tax-year (stable across the in-progress microdata refresh), plus a clearer sampling-guard error.

Drop EXPERIMENTAL_MODELS for production — collapsed EXPERIMENTAL_MODELS + RUNNABLE_MODELS to a single canonical MODELS.

Verification

pytest → 289 passed (was 273; +16 new tests), ruff check/format clean
App: eslint --max-warnings=0, tsc --noEmit, bun test (7), next build all pass
Paper: every cell + inline expression executes against the frozen snapshot
Independent subagent reviews (Python+paper, app, docs/tests) + a review-fix cycle round — no actionable findings remain

No models were rerun and no microdata/frozen artifacts were regenerated (per in-progress microdata refresh); everything validated with sentinels against the frozen snapshot.

Deferred — for decision (intentionally not changed)

2024-vs-2026 input period: scenarios extract inputs at the dataset's 2024 period but label/score them at TY2026, so benchmark incomes sit ~9% below 2026-representative and below the year the weights use. Affects every household + ground truth and is entangled with the microdata refresh — needs an explicit call.
population_weights.json is stale (5,923 vs 9,343 positive-weight households vs the installed build). The sentinel now guards spec/year drift, but the values need regenerating as part of the microdata refresh.
load_excluded_household_ids is country-blind (narrow cross-country ID-collision risk) and UK prompts include obscure wealth columns — flagged, not fixed (need a design call).
retry_predictions.csv vs .csv.gz: the code writes .csv; the paper reads .csv.gz. Runbook left matching the code; the paper-side mismatch is flagged.
Eval-runtime improvements (explanation-retry loop, all-null tool-call routing): real improvements but they change untested runtime behavior, so deferred rather than shipped unvalidated.
app/src/data.json still carries an orphaned impactScore key — unread, drops out on the next export.

The household-equal impact score (HOUSEHOLD_IMPACT_SCORE_FLOOR = 0.3) was a superseded weighting scheme. The headline ranking is the within-1% weighted hit rate and the secondary "sensitivity" family is the floor-free bounded / equal / budget weighting, so the 0.3-floor path was dead: still computed and written to impact_summary_by_model.csv, but no longer consumed by the leaderboard, the app, or the paper (its impact_floor_sensitivity() helper already computes the floor-free comparison). Removes: - config.HOUSEHOLD_IMPACT_SCORE_FLOOR - analysis.household_equal_impact_scores / household_impact_summary_by_model and all impact_summary wiring (analyze_no_tools, render_markdown_report, build_dashboard_payload impactScore, export_analysis CSV) - prompt_mode_comparison._impact_comparison - tests for the removed functions and impact_summary artifacts - app ModelStat.impactScore / impactCountryScores types - README / RESULTS / docs references to impact_summary_by_model.csv Paper: renames the vestigial impact_floor_sensitivity() -> weighting_sensitivity() and tbl-impact-floor -> tbl-weighting-sensitivity, fixes the table caption to match the actual row order/terminology (household, equal, budget), and drops the now-meaningless "floor-free" qualifier. The frozen app/src/data.json still carries an orphaned impactScore key; it is unread and will drop out on the next legitimate export (microdata refresh in progress), so it is intentionally left untouched here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The leaderboard scores are means over a 100-household sample per country, so the rankings carry real sampling uncertainty that the manuscript previously did not quantify (and docs/site_improvements_scope.md wrongly claimed it did). analysis.py: - household_headline_scores(): per-(model, scenario) population-weighted hit score for one headline metric — the household-level building block whose equal-household mean reproduces weighted_hit_rate_scores_by_model. - bootstrap_headline_cis(): resamples the households with replacement (paired across models, household carried with its full output vector), returning per-model point, 95% percentile CI, and bootstrap rank range. - Unit tests cover consistency with the headline path, CI bracketing, zero vs positive width, reproducibility, and the exact-match metric. paper: - tbl-us-top / tbl-uk-top gain "Within 1% 95% CI" and "Within-1% rank" columns; the top-three prose carries the CI; a new Uncertainty subsection documents the bootstrap (10,000 resamples, household unit) and reports, dynamically, how many models overlap the leader's interval. In the frozen snapshot the UK field is effectively a tie (all models overlap the leader). - CIs are a manuscript artifact only; the leaderboard/app continue to show point estimates, and site_improvements_scope.md now records that as a decision. Validated by executing every paper cell and inline expression against the frozen snapshot (no model reruns). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Defensive correctness fixes from the review (no scoring-behavior change): - ground_truth.py: parenthesize the impact-weight masking gate `(output binary) or (aggregation == any_positive)` — the previous `A and B or C` parsed as `(A and B) or C` and only happened to be correct for current outputs. Remove the dead DERIVED_HOUSEHOLD_BOOLEAN_VARIABLES fallback (both school-meals outputs already resolve via benchmark_specs.json). - spec.py: add a sentinel test asserting every headline output id (US + UK) resolves to a real spec or person template. metric_type_for_output / net_income_sign_for_output silently default unknown ids to amount/+1, so a variable renamed by the in-progress model bump would otherwise mis-score instead of failing loudly. Fix the stale parse_person_output docstring example. - eval_no_tools.py: document that the terminal `value = X` explanation line is canonical and overrides a disagreeing structured/tool value (load-bearing for every scored row); fix the stale "every 100 rows" checkpoint docstring; rename MAX_RETRIES -> MAX_ATTEMPTS (it bounds attempts, not retries). - reparse_predictions.py: correct the docstring — reparsing can replace an existing value with the canonical explanation trailer, not only fill missing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Frontend cleanup from the review (verified: eslint --max-warnings=0, tsc --noEmit, bun test 7/7, and next build all pass): - Remove the run-to-run stability render (fmtRunStability "% +/- std over N runs") and its Methodology copy. Confidence intervals / uncertainty belong in the paper only, and these were dead anyway (no run data is shipped). - Remove the dead "bootstrap intervals" scaffolding in lib/sensitivity.ts (buildRows/buildAllRows/ScoreRow) — the app never renders intervals. - Replace hardcoded hex fallbacks in theme.css with ui-kit color tokens. - Drop the stale grok-4.20 PENDING_MODELS entry (now in data.json) and the now-empty pending-rows UI. - Fix a UK currency bug in ScenarioExplorer (zero-reference string used a literal "$0"); derive the "frontier only" tooltip from FRONTIER_MODELS so it stops drifting (now includes DeepSeek V4 Pro). - Remove unused code/deps: ExplanationTooltip.tsx, UI_COLORS/MODEL_COLORS and orphaned color constants, scoring.ts prefix/suffix tables, format.ts formatDollars, and the recharts/motion/simple-icons/@icons-pack deps plus their vestigial CSS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- RESULTS.md: the quick-run loop sent Claude models through eval-no-tools-chunked at --parallel 4, which aborts (Claude must run --parallel 1 --model-parallel 1). Split into a serial Claude loop and a parallel non-Claude loop, matching docs/runbook.md. - docs/runbook.md: add gemini-3.5-flash (in config.MODELS and the frozen snapshot) to the Gemini group and the default-set list, fix that list's indentation, note newer config.MODELS entries (DeepSeek), and add a row-level "5b. Repair Individual Broken Rows" section — the manuscript's Appendix A depends on repair-failed-rows, which the runbook previously omitted. - cache.py: honor $POLICYBENCH_CACHE_DIR and an explicit cache_dir argument so chunked/resumed runs launched from different working directories share one disk cache instead of silently starting fresh. - full_run_export.export_full_run: when --country is omitted, auto-detect the populated country subdirectories instead of hardcoding both us and uk (a single-country run previously raised FileNotFoundError), with a clear error when none are found. - Add CLI-helper tests (_parse_models / _parse_programs / _slice_scenarios guards) and export auto-detect tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Reviewer flagged that population_weights.json is the headline-ranking weight source but nothing detects when it drifts out of sync with the spec/tax year (values are also stale vs the in-progress microdata refresh, which is expected and handled by regeneration, not here). - Add a drift sentinel asserting both countries' weight artifacts target TAX_YEAR and that the actual loader (matching_population_weight_series) covers the current headline output groups for every weighting kind. This guards the stable id/tax-year contract without pinning weight values or household counts, so it survives the microdata refresh but fails loudly if a spec rename leaves the scorer silently falling back to benchmark-row weights. - _sample_household_ids: when sampling by weight without replacement, require at least n positive-weight households and raise a clear error instead of numpy's cryptic "Fewer non-zero entries in p than size". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Independent re-review of the app and docs/tests diffs surfaced three actionable items (all others were confirmed clean): - app/theme.css: the previous fallbacks var(--color-orange-700) / var(--color-red-700) are not defined in @policyengine/ui-kit (there is no orange or red scale), making them invalid fallbacks — strictly worse than the hex they replaced. The primary --text-warning / --text-error / --text-success tokens are all defined by the ui-kit theme, so drop the fallbacks entirely (no hardcoded hex, no dangling token). - docs/runbook.md: row_repair.py writes row_repair_attempts.csv uncompressed, not .csv.gz — correct the documented artifact name. - tests/test_cli_helpers.py: _parse_models validates/returns from RUNNABLE_MODELS, not MODELS; assert against RUNNABLE_MODELS so the test does not silently rely on EXPERIMENTAL_MODELS being empty. - tests/test_analysis.py: add a deterministic rank-range assertion (the all-1.0 model can never be ranked below first) to exercise the bootstrap rank machinery directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

EXPERIMENTAL_MODELS was the (empty) shelf for models runnable via --model but excluded from default runs/leaderboard, and RUNNABLE_MODELS existed only to merge it with MODELS. With the benchmark going to production there is one canonical model set, so collapse to MODELS everywhere: - config.py: remove EXPERIMENTAL_MODELS and RUNNABLE_MODELS. - cli.py / row_repair.py / chunked_eval.py: validate and resolve against MODELS. - tests: assert against MODELS; drop the now-moot "deepseek not in EXPERIMENTAL_MODELS" assertions (kept the public-default id checks). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel · 2026-06-04T19:32:51Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policybench-site	Ready	Preview, Comment	Jun 4, 2026 10:10pm

The EXPERIMENTAL_MODELS removal left one prose echo of the deleted concept in a test docstring. Reword to describe current behavior (DeepSeek V4 are public default models) rather than the removed history. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MaxGhenis and others added 8 commits June 4, 2026 07:41

vercel Bot deployed to Preview June 4, 2026 19:34 View deployment

vercel Bot deployed to Preview June 4, 2026 22:10 View deployment

MaxGhenis merged commit 0ea540b into main Jun 4, 2026
5 checks passed

MaxGhenis deleted the policybench-review-fixes branch June 4, 2026 22:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Review fixes: drop 0.3-floor metric, add paper bootstrap CIs, harden + clean up#58

Review fixes: drop 0.3-floor metric, add paper bootstrap CIs, harden + clean up#58
MaxGhenis merged 9 commits into
mainfrom
policybench-review-fixes

MaxGhenis commented Jun 4, 2026

Uh oh!

vercel Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MaxGhenis commented Jun 4, 2026

Summary

What changed

Verification

Deferred — for decision (intentionally not changed)

Related

Uh oh!

vercel Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vercel Bot commented Jun 4, 2026 •

edited

Loading