Skip to content

Review fixes: drop 0.3-floor metric, add paper bootstrap CIs, harden + clean up#58

Merged
MaxGhenis merged 9 commits into
mainfrom
policybench-review-fixes
Jun 4, 2026
Merged

Review fixes: drop 0.3-floor metric, add paper bootstrap CIs, harden + clean up#58
MaxGhenis merged 9 commits into
mainfrom
policybench-review-fixes

Conversation

@MaxGhenis

Copy link
Copy Markdown
Contributor

Summary

A detailed review of the repo (Python core, app, paper, docs, tests) followed by fixes across every area, plus an independent review-fix cycle. 8 commits, +689/−882. All tests, lint, app build, and the paper render are green.

What changed

Remove the legacy 0.3-floor impact metricHOUSEHOLD_IMPACT_SCORE_FLOOR=0.3 was a superseded weighting scheme, still computed and written to impact_summary_by_model.csv but no longer consumed by the leaderboard, app, or paper. Removed the two analysis.py functions and all wiring, prompt_mode_comparison._impact_comparison, the app impactScore type, and the README/RESULTS/docs references. Renamed the paper's vestigial impact_floor_sensitivity()weighting_sensitivity() / tbl-impact-floortbl-weighting-sensitivity and fixed its caption.

Add household-bootstrap confidence intervals to the paper (paper-only) — the headline scores are means over a 100-household sample, but the manuscript reported no uncertainty. Added analysis.bootstrap_headline_cis() (resample households with replacement, paired across models, percentile CI + bootstrap rank ranges) with unit tests, and wired CI + rank-range columns into the leaderboards plus an "Uncertainty" section. CIs are a manuscript artifact only — the app/leaderboard still show point estimates (and the app's dead bootstrap-interval scaffolding + ± stability render were removed). In the frozen snapshot the UK field is a statistical dead heat (all models overlap the leader).

Harden ground-truth / spec / eval — fixed an operator-precedence trap in the impact-weight gate, added a strict spec-resolution sentinel (so a renamed PE variable fails loudly instead of silently scoring as a +1 amount), removed dead code, and documented the load-bearing "explanation trailer is canonical" override.

App cleanup — removed dead code/deps, replaced hardcoded hex with ui-kit tokens, fixed a UK $0 currency bug and a drifting "frontier only" tooltip.

CLI / docs reproducibility — fixed a RESULTS.md loop that aborted on Claude (--parallel 4), filled runbook gaps (gemini-3.5-flash, a row-repair section), added a $POLICYBENCH_CACHE_DIR override and single-country export-full-run auto-detect, and added CLI tests.

Population-weight drift sentinelpopulation_weights.json is the headline weight source; added a sentinel that fails if it drifts from the spec/tax-year (stable across the in-progress microdata refresh), plus a clearer sampling-guard error.

Drop EXPERIMENTAL_MODELS for production — collapsed EXPERIMENTAL_MODELS + RUNNABLE_MODELS to a single canonical MODELS.

Verification

  • pytest → 289 passed (was 273; +16 new tests), ruff check/format clean
  • App: eslint --max-warnings=0, tsc --noEmit, bun test (7), next build all pass
  • Paper: every cell + inline expression executes against the frozen snapshot
  • Independent subagent reviews (Python+paper, app, docs/tests) + a review-fix cycle round — no actionable findings remain

No models were rerun and no microdata/frozen artifacts were regenerated (per in-progress microdata refresh); everything validated with sentinels against the frozen snapshot.

Deferred — for decision (intentionally not changed)

  • 2024-vs-2026 input period: scenarios extract inputs at the dataset's 2024 period but label/score them at TY2026, so benchmark incomes sit ~9% below 2026-representative and below the year the weights use. Affects every household + ground truth and is entangled with the microdata refresh — needs an explicit call.
  • population_weights.json is stale (5,923 vs 9,343 positive-weight households vs the installed build). The sentinel now guards spec/year drift, but the values need regenerating as part of the microdata refresh.
  • load_excluded_household_ids is country-blind (narrow cross-country ID-collision risk) and UK prompts include obscure wealth columns — flagged, not fixed (need a design call).
  • retry_predictions.csv vs .csv.gz: the code writes .csv; the paper reads .csv.gz. Runbook left matching the code; the paper-side mismatch is flagged.
  • Eval-runtime improvements (explanation-retry loop, all-null tool-call routing): real improvements but they change untested runtime behavior, so deferred rather than shipped unvalidated.
  • app/src/data.json still carries an orphaned impactScore key — unread, drops out on the next export.

Related

Design-token accessibility fix split upstream: PolicyEngine/policyengine-ui-kit#37 (light --text-tertiary WCAG AA). Once that's published and the dep bumped, the app's muted-text override can be removed.

🤖 Generated with Claude Code

MaxGhenis and others added 8 commits June 4, 2026 07:41
The household-equal impact score (HOUSEHOLD_IMPACT_SCORE_FLOOR = 0.3) was a
superseded weighting scheme. The headline ranking is the within-1% weighted hit
rate and the secondary "sensitivity" family is the floor-free bounded / equal /
budget weighting, so the 0.3-floor path was dead: still computed and written to
impact_summary_by_model.csv, but no longer consumed by the leaderboard, the app,
or the paper (its impact_floor_sensitivity() helper already computes the
floor-free comparison).

Removes:
- config.HOUSEHOLD_IMPACT_SCORE_FLOOR
- analysis.household_equal_impact_scores / household_impact_summary_by_model and
  all impact_summary wiring (analyze_no_tools, render_markdown_report,
  build_dashboard_payload impactScore, export_analysis CSV)
- prompt_mode_comparison._impact_comparison
- tests for the removed functions and impact_summary artifacts
- app ModelStat.impactScore / impactCountryScores types
- README / RESULTS / docs references to impact_summary_by_model.csv

Paper: renames the vestigial impact_floor_sensitivity() -> weighting_sensitivity()
and tbl-impact-floor -> tbl-weighting-sensitivity, fixes the table caption to
match the actual row order/terminology (household, equal, budget), and drops the
now-meaningless "floor-free" qualifier.

The frozen app/src/data.json still carries an orphaned impactScore key; it is
unread and will drop out on the next legitimate export (microdata refresh in
progress), so it is intentionally left untouched here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The leaderboard scores are means over a 100-household sample per country, so the
rankings carry real sampling uncertainty that the manuscript previously did not
quantify (and docs/site_improvements_scope.md wrongly claimed it did).

analysis.py:
- household_headline_scores(): per-(model, scenario) population-weighted hit
  score for one headline metric — the household-level building block whose
  equal-household mean reproduces weighted_hit_rate_scores_by_model.
- bootstrap_headline_cis(): resamples the households with replacement
  (paired across models, household carried with its full output vector),
  returning per-model point, 95% percentile CI, and bootstrap rank range.
- Unit tests cover consistency with the headline path, CI bracketing, zero vs
  positive width, reproducibility, and the exact-match metric.

paper:
- tbl-us-top / tbl-uk-top gain "Within 1% 95% CI" and "Within-1% rank" columns;
  the top-three prose carries the CI; a new Uncertainty subsection documents the
  bootstrap (10,000 resamples, household unit) and reports, dynamically, how
  many models overlap the leader's interval. In the frozen snapshot the UK field
  is effectively a tie (all models overlap the leader).
- CIs are a manuscript artifact only; the leaderboard/app continue to show point
  estimates, and site_improvements_scope.md now records that as a decision.

Validated by executing every paper cell and inline expression against the frozen
snapshot (no model reruns).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Defensive correctness fixes from the review (no scoring-behavior change):

- ground_truth.py: parenthesize the impact-weight masking gate
  `(output binary) or (aggregation == any_positive)` — the previous
  `A and B or C` parsed as `(A and B) or C` and only happened to be correct
  for current outputs. Remove the dead DERIVED_HOUSEHOLD_BOOLEAN_VARIABLES
  fallback (both school-meals outputs already resolve via benchmark_specs.json).
- spec.py: add a sentinel test asserting every headline output id (US + UK)
  resolves to a real spec or person template. metric_type_for_output /
  net_income_sign_for_output silently default unknown ids to amount/+1, so a
  variable renamed by the in-progress model bump would otherwise mis-score
  instead of failing loudly. Fix the stale parse_person_output docstring example.
- eval_no_tools.py: document that the terminal `value = X` explanation line is
  canonical and overrides a disagreeing structured/tool value (load-bearing for
  every scored row); fix the stale "every 100 rows" checkpoint docstring; rename
  MAX_RETRIES -> MAX_ATTEMPTS (it bounds attempts, not retries).
- reparse_predictions.py: correct the docstring — reparsing can replace an
  existing value with the canonical explanation trailer, not only fill missing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Frontend cleanup from the review (verified: eslint --max-warnings=0, tsc
--noEmit, bun test 7/7, and next build all pass):

- Remove the run-to-run stability render (fmtRunStability "% +/- std over N
  runs") and its Methodology copy. Confidence intervals / uncertainty belong in
  the paper only, and these were dead anyway (no run data is shipped).
- Remove the dead "bootstrap intervals" scaffolding in lib/sensitivity.ts
  (buildRows/buildAllRows/ScoreRow) — the app never renders intervals.
- Replace hardcoded hex fallbacks in theme.css with ui-kit color tokens.
- Drop the stale grok-4.20 PENDING_MODELS entry (now in data.json) and the
  now-empty pending-rows UI.
- Fix a UK currency bug in ScenarioExplorer (zero-reference string used a
  literal "$0"); derive the "frontier only" tooltip from FRONTIER_MODELS so it
  stops drifting (now includes DeepSeek V4 Pro).
- Remove unused code/deps: ExplanationTooltip.tsx, UI_COLORS/MODEL_COLORS and
  orphaned color constants, scoring.ts prefix/suffix tables, format.ts
  formatDollars, and the recharts/motion/simple-icons/@icons-pack deps plus
  their vestigial CSS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- RESULTS.md: the quick-run loop sent Claude models through
  eval-no-tools-chunked at --parallel 4, which aborts (Claude must run
  --parallel 1 --model-parallel 1). Split into a serial Claude loop and a
  parallel non-Claude loop, matching docs/runbook.md.
- docs/runbook.md: add gemini-3.5-flash (in config.MODELS and the frozen
  snapshot) to the Gemini group and the default-set list, fix that list's
  indentation, note newer config.MODELS entries (DeepSeek), and add a row-level
  "5b. Repair Individual Broken Rows" section — the manuscript's Appendix A
  depends on repair-failed-rows, which the runbook previously omitted.
- cache.py: honor $POLICYBENCH_CACHE_DIR and an explicit cache_dir argument so
  chunked/resumed runs launched from different working directories share one
  disk cache instead of silently starting fresh.
- full_run_export.export_full_run: when --country is omitted, auto-detect the
  populated country subdirectories instead of hardcoding both us and uk (a
  single-country run previously raised FileNotFoundError), with a clear error
  when none are found.
- Add CLI-helper tests (_parse_models / _parse_programs / _slice_scenarios
  guards) and export auto-detect tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reviewer flagged that population_weights.json is the headline-ranking weight
source but nothing detects when it drifts out of sync with the spec/tax year
(values are also stale vs the in-progress microdata refresh, which is expected
and handled by regeneration, not here).

- Add a drift sentinel asserting both countries' weight artifacts target
  TAX_YEAR and that the actual loader (matching_population_weight_series) covers
  the current headline output groups for every weighting kind. This guards the
  stable id/tax-year contract without pinning weight values or household counts,
  so it survives the microdata refresh but fails loudly if a spec rename leaves
  the scorer silently falling back to benchmark-row weights.
- _sample_household_ids: when sampling by weight without replacement, require at
  least n positive-weight households and raise a clear error instead of numpy's
  cryptic "Fewer non-zero entries in p than size".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Independent re-review of the app and docs/tests diffs surfaced three actionable
items (all others were confirmed clean):

- app/theme.css: the previous fallbacks var(--color-orange-700) /
  var(--color-red-700) are not defined in @policyengine/ui-kit (there is no
  orange or red scale), making them invalid fallbacks — strictly worse than the
  hex they replaced. The primary --text-warning / --text-error / --text-success
  tokens are all defined by the ui-kit theme, so drop the fallbacks entirely
  (no hardcoded hex, no dangling token).
- docs/runbook.md: row_repair.py writes row_repair_attempts.csv uncompressed,
  not .csv.gz — correct the documented artifact name.
- tests/test_cli_helpers.py: _parse_models validates/returns from
  RUNNABLE_MODELS, not MODELS; assert against RUNNABLE_MODELS so the test does
  not silently rely on EXPERIMENTAL_MODELS being empty.
- tests/test_analysis.py: add a deterministic rank-range assertion (the all-1.0
  model can never be ranked below first) to exercise the bootstrap rank
  machinery directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
EXPERIMENTAL_MODELS was the (empty) shelf for models runnable via --model but
excluded from default runs/leaderboard, and RUNNABLE_MODELS existed only to
merge it with MODELS. With the benchmark going to production there is one
canonical model set, so collapse to MODELS everywhere:

- config.py: remove EXPERIMENTAL_MODELS and RUNNABLE_MODELS.
- cli.py / row_repair.py / chunked_eval.py: validate and resolve against MODELS.
- tests: assert against MODELS; drop the now-moot "deepseek not in
  EXPERIMENTAL_MODELS" assertions (kept the public-default id checks).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 4, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
policybench-site Ready Ready Preview, Comment Jun 4, 2026 10:10pm

Request Review

The EXPERIMENTAL_MODELS removal left one prose echo of the deleted concept in a
test docstring. Reword to describe current behavior (DeepSeek V4 are public
default models) rather than the removed history.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit 0ea540b into main Jun 4, 2026
5 checks passed
@MaxGhenis MaxGhenis deleted the policybench-review-fixes branch June 4, 2026 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant