Review fixes: drop 0.3-floor metric, add paper bootstrap CIs, harden + clean up#58
Merged
Conversation
The household-equal impact score (HOUSEHOLD_IMPACT_SCORE_FLOOR = 0.3) was a superseded weighting scheme. The headline ranking is the within-1% weighted hit rate and the secondary "sensitivity" family is the floor-free bounded / equal / budget weighting, so the 0.3-floor path was dead: still computed and written to impact_summary_by_model.csv, but no longer consumed by the leaderboard, the app, or the paper (its impact_floor_sensitivity() helper already computes the floor-free comparison). Removes: - config.HOUSEHOLD_IMPACT_SCORE_FLOOR - analysis.household_equal_impact_scores / household_impact_summary_by_model and all impact_summary wiring (analyze_no_tools, render_markdown_report, build_dashboard_payload impactScore, export_analysis CSV) - prompt_mode_comparison._impact_comparison - tests for the removed functions and impact_summary artifacts - app ModelStat.impactScore / impactCountryScores types - README / RESULTS / docs references to impact_summary_by_model.csv Paper: renames the vestigial impact_floor_sensitivity() -> weighting_sensitivity() and tbl-impact-floor -> tbl-weighting-sensitivity, fixes the table caption to match the actual row order/terminology (household, equal, budget), and drops the now-meaningless "floor-free" qualifier. The frozen app/src/data.json still carries an orphaned impactScore key; it is unread and will drop out on the next legitimate export (microdata refresh in progress), so it is intentionally left untouched here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The leaderboard scores are means over a 100-household sample per country, so the rankings carry real sampling uncertainty that the manuscript previously did not quantify (and docs/site_improvements_scope.md wrongly claimed it did). analysis.py: - household_headline_scores(): per-(model, scenario) population-weighted hit score for one headline metric — the household-level building block whose equal-household mean reproduces weighted_hit_rate_scores_by_model. - bootstrap_headline_cis(): resamples the households with replacement (paired across models, household carried with its full output vector), returning per-model point, 95% percentile CI, and bootstrap rank range. - Unit tests cover consistency with the headline path, CI bracketing, zero vs positive width, reproducibility, and the exact-match metric. paper: - tbl-us-top / tbl-uk-top gain "Within 1% 95% CI" and "Within-1% rank" columns; the top-three prose carries the CI; a new Uncertainty subsection documents the bootstrap (10,000 resamples, household unit) and reports, dynamically, how many models overlap the leader's interval. In the frozen snapshot the UK field is effectively a tie (all models overlap the leader). - CIs are a manuscript artifact only; the leaderboard/app continue to show point estimates, and site_improvements_scope.md now records that as a decision. Validated by executing every paper cell and inline expression against the frozen snapshot (no model reruns). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Defensive correctness fixes from the review (no scoring-behavior change): - ground_truth.py: parenthesize the impact-weight masking gate `(output binary) or (aggregation == any_positive)` — the previous `A and B or C` parsed as `(A and B) or C` and only happened to be correct for current outputs. Remove the dead DERIVED_HOUSEHOLD_BOOLEAN_VARIABLES fallback (both school-meals outputs already resolve via benchmark_specs.json). - spec.py: add a sentinel test asserting every headline output id (US + UK) resolves to a real spec or person template. metric_type_for_output / net_income_sign_for_output silently default unknown ids to amount/+1, so a variable renamed by the in-progress model bump would otherwise mis-score instead of failing loudly. Fix the stale parse_person_output docstring example. - eval_no_tools.py: document that the terminal `value = X` explanation line is canonical and overrides a disagreeing structured/tool value (load-bearing for every scored row); fix the stale "every 100 rows" checkpoint docstring; rename MAX_RETRIES -> MAX_ATTEMPTS (it bounds attempts, not retries). - reparse_predictions.py: correct the docstring — reparsing can replace an existing value with the canonical explanation trailer, not only fill missing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Frontend cleanup from the review (verified: eslint --max-warnings=0, tsc --noEmit, bun test 7/7, and next build all pass): - Remove the run-to-run stability render (fmtRunStability "% +/- std over N runs") and its Methodology copy. Confidence intervals / uncertainty belong in the paper only, and these were dead anyway (no run data is shipped). - Remove the dead "bootstrap intervals" scaffolding in lib/sensitivity.ts (buildRows/buildAllRows/ScoreRow) — the app never renders intervals. - Replace hardcoded hex fallbacks in theme.css with ui-kit color tokens. - Drop the stale grok-4.20 PENDING_MODELS entry (now in data.json) and the now-empty pending-rows UI. - Fix a UK currency bug in ScenarioExplorer (zero-reference string used a literal "$0"); derive the "frontier only" tooltip from FRONTIER_MODELS so it stops drifting (now includes DeepSeek V4 Pro). - Remove unused code/deps: ExplanationTooltip.tsx, UI_COLORS/MODEL_COLORS and orphaned color constants, scoring.ts prefix/suffix tables, format.ts formatDollars, and the recharts/motion/simple-icons/@icons-pack deps plus their vestigial CSS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- RESULTS.md: the quick-run loop sent Claude models through eval-no-tools-chunked at --parallel 4, which aborts (Claude must run --parallel 1 --model-parallel 1). Split into a serial Claude loop and a parallel non-Claude loop, matching docs/runbook.md. - docs/runbook.md: add gemini-3.5-flash (in config.MODELS and the frozen snapshot) to the Gemini group and the default-set list, fix that list's indentation, note newer config.MODELS entries (DeepSeek), and add a row-level "5b. Repair Individual Broken Rows" section — the manuscript's Appendix A depends on repair-failed-rows, which the runbook previously omitted. - cache.py: honor $POLICYBENCH_CACHE_DIR and an explicit cache_dir argument so chunked/resumed runs launched from different working directories share one disk cache instead of silently starting fresh. - full_run_export.export_full_run: when --country is omitted, auto-detect the populated country subdirectories instead of hardcoding both us and uk (a single-country run previously raised FileNotFoundError), with a clear error when none are found. - Add CLI-helper tests (_parse_models / _parse_programs / _slice_scenarios guards) and export auto-detect tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reviewer flagged that population_weights.json is the headline-ranking weight source but nothing detects when it drifts out of sync with the spec/tax year (values are also stale vs the in-progress microdata refresh, which is expected and handled by regeneration, not here). - Add a drift sentinel asserting both countries' weight artifacts target TAX_YEAR and that the actual loader (matching_population_weight_series) covers the current headline output groups for every weighting kind. This guards the stable id/tax-year contract without pinning weight values or household counts, so it survives the microdata refresh but fails loudly if a spec rename leaves the scorer silently falling back to benchmark-row weights. - _sample_household_ids: when sampling by weight without replacement, require at least n positive-weight households and raise a clear error instead of numpy's cryptic "Fewer non-zero entries in p than size". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Independent re-review of the app and docs/tests diffs surfaced three actionable items (all others were confirmed clean): - app/theme.css: the previous fallbacks var(--color-orange-700) / var(--color-red-700) are not defined in @policyengine/ui-kit (there is no orange or red scale), making them invalid fallbacks — strictly worse than the hex they replaced. The primary --text-warning / --text-error / --text-success tokens are all defined by the ui-kit theme, so drop the fallbacks entirely (no hardcoded hex, no dangling token). - docs/runbook.md: row_repair.py writes row_repair_attempts.csv uncompressed, not .csv.gz — correct the documented artifact name. - tests/test_cli_helpers.py: _parse_models validates/returns from RUNNABLE_MODELS, not MODELS; assert against RUNNABLE_MODELS so the test does not silently rely on EXPERIMENTAL_MODELS being empty. - tests/test_analysis.py: add a deterministic rank-range assertion (the all-1.0 model can never be ranked below first) to exercise the bootstrap rank machinery directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
EXPERIMENTAL_MODELS was the (empty) shelf for models runnable via --model but excluded from default runs/leaderboard, and RUNNABLE_MODELS existed only to merge it with MODELS. With the benchmark going to production there is one canonical model set, so collapse to MODELS everywhere: - config.py: remove EXPERIMENTAL_MODELS and RUNNABLE_MODELS. - cli.py / row_repair.py / chunked_eval.py: validate and resolve against MODELS. - tests: assert against MODELS; drop the now-moot "deepseek not in EXPERIMENTAL_MODELS" assertions (kept the public-default id checks). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
The EXPERIMENTAL_MODELS removal left one prose echo of the deleted concept in a test docstring. Reword to describe current behavior (DeepSeek V4 are public default models) rather than the removed history. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A detailed review of the repo (Python core, app, paper, docs, tests) followed by fixes across every area, plus an independent review-fix cycle. 8 commits, +689/−882. All tests, lint, app build, and the paper render are green.
What changed
Remove the legacy 0.3-floor impact metric —
HOUSEHOLD_IMPACT_SCORE_FLOOR=0.3was a superseded weighting scheme, still computed and written toimpact_summary_by_model.csvbut no longer consumed by the leaderboard, app, or paper. Removed the twoanalysis.pyfunctions and all wiring,prompt_mode_comparison._impact_comparison, the appimpactScoretype, and the README/RESULTS/docs references. Renamed the paper's vestigialimpact_floor_sensitivity()→weighting_sensitivity()/tbl-impact-floor→tbl-weighting-sensitivityand fixed its caption.Add household-bootstrap confidence intervals to the paper (paper-only) — the headline scores are means over a 100-household sample, but the manuscript reported no uncertainty. Added
analysis.bootstrap_headline_cis()(resample households with replacement, paired across models, percentile CI + bootstrap rank ranges) with unit tests, and wired CI + rank-range columns into the leaderboards plus an "Uncertainty" section. CIs are a manuscript artifact only — the app/leaderboard still show point estimates (and the app's dead bootstrap-interval scaffolding +±stability render were removed). In the frozen snapshot the UK field is a statistical dead heat (all models overlap the leader).Harden ground-truth / spec / eval — fixed an operator-precedence trap in the impact-weight gate, added a strict spec-resolution sentinel (so a renamed PE variable fails loudly instead of silently scoring as a
+1amount), removed dead code, and documented the load-bearing "explanation trailer is canonical" override.App cleanup — removed dead code/deps, replaced hardcoded hex with ui-kit tokens, fixed a UK
$0currency bug and a drifting "frontier only" tooltip.CLI / docs reproducibility — fixed a
RESULTS.mdloop that aborted on Claude (--parallel 4), filled runbook gaps (gemini-3.5-flash, a row-repair section), added a$POLICYBENCH_CACHE_DIRoverride and single-countryexport-full-runauto-detect, and added CLI tests.Population-weight drift sentinel —
population_weights.jsonis the headline weight source; added a sentinel that fails if it drifts from the spec/tax-year (stable across the in-progress microdata refresh), plus a clearer sampling-guard error.Drop
EXPERIMENTAL_MODELSfor production — collapsedEXPERIMENTAL_MODELS+RUNNABLE_MODELSto a single canonicalMODELS.Verification
pytest→ 289 passed (was 273; +16 new tests),ruff check/formatcleaneslint --max-warnings=0,tsc --noEmit,bun test(7),next buildall passNo models were rerun and no microdata/frozen artifacts were regenerated (per in-progress microdata refresh); everything validated with sentinels against the frozen snapshot.
Deferred — for decision (intentionally not changed)
population_weights.jsonis stale (5,923 vs 9,343 positive-weight households vs the installed build). The sentinel now guards spec/year drift, but the values need regenerating as part of the microdata refresh.load_excluded_household_idsis country-blind (narrow cross-country ID-collision risk) and UK prompts include obscure wealth columns — flagged, not fixed (need a design call).retry_predictions.csvvs.csv.gz: the code writes.csv; the paper reads.csv.gz. Runbook left matching the code; the paper-side mismatch is flagged.app/src/data.jsonstill carries an orphanedimpactScorekey — unread, drops out on the next export.Related
Design-token accessibility fix split upstream: PolicyEngine/policyengine-ui-kit#37 (light
--text-tertiaryWCAG AA). Once that's published and the dep bumped, the app's muted-text override can be removed.🤖 Generated with Claude Code