Pin US PolicyBench runs to policyengine.py 4.14.2 by MaxGhenis · Pull Request #59 · PolicyEngine/policybench

MaxGhenis · 2026-06-07T19:56:03Z

Summary

pin the PolicyBench Python dependency to the verified latest policyengine[us]==4.14.2 and remove the direct policyengine-us GitHub commit pin
read matching PolicyEngine bundle provenance from the raw policyengine.py release manifest so metadata does not depend on importing top-level policyengine
route US population-weight generation through the PolicyBench runtime wrapper and record the certified dataset URI, model/data versions, build ID, and sha256
continue excluding fsla_overtime_premium from prompt/situation inputs under the latest policyengine-us where it is now discovered as an input
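
The overtime exclusion in the last item can be pictured as a simple filter over discovered input names. This is only a sketch: `EXCLUDED_INPUTS`, `promptable_inputs`, and the sample variable list are hypothetical names for illustration, not PolicyBench's actual code.

```python
# Hypothetical names for illustration; PolicyBench's real implementation may differ.
EXCLUDED_INPUTS = frozenset({"fsla_overtime_premium"})


def promptable_inputs(discovered):
    """Return discovered input names minus excluded ones, preserving order."""
    return [name for name in discovered if name not in EXCLUDED_INPUTS]


# Sample list assumed for the example; only fsla_overtime_premium comes from this PR.
discovered = ["employment_income", "fsla_overtime_premium", "age"]
print(promptable_inputs(discovered))
```

Keeping the exclusion as an explicit set means an upstream change in which variables count as inputs does not silently change what is prompted.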

Sentinel verification

Authoritative latest-version checks:

PyPI JSON for policyengine -> 4.14.2
python3 -m pip index versions policyengine -> latest 4.14.2
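
For reference, the PyPI JSON check could be scripted roughly as follows. It assumes the standard PyPI JSON endpoint, where `info.version` holds the latest release; the helper names are illustrative and not part of PolicyBench.

```python
import json
from urllib.request import urlopen

PYPI_URL = "https://pypi.org/pypi/{package}/json"


def latest_version(payload):
    """Extract the latest release version from a PyPI JSON payload."""
    return payload["info"]["version"]


def fetch_latest_version(package, timeout=10.0):
    """Query PyPI for the latest version of a package (needs network access)."""
    with urlopen(PYPI_URL.format(package=package), timeout=timeout) as resp:
        return latest_version(json.load(resp))


def is_pinned_to_latest(pinned, payload):
    """True when the pinned version equals the latest version in the payload."""
    return pinned == latest_version(payload)


# Offline sample shaped like the relevant part of the PyPI response.
sample = {"info": {"name": "policyengine", "version": "4.14.2"}}
print(is_pinned_to_latest("4.14.2", sample))
```

Calling `fetch_latest_version("policyengine")` would perform the live lookup; the pure helpers are split out so the comparison can be tested without network access.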

Resolved local environment:

policyengine==4.14.2
policyengine-us==1.715.2
policyengine-core==3.26.1
raw US manifest bundle_id=us-4.14.2, data package policyengine-us-data==1.115.5, default dataset enhanced_cps_2024, URI hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5@d47fb5475144260a75467d2f2e22b2d5d53d4d57
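
The dataset URI above carries a repository, a file name, and a pinned revision. A minimal parser sketch, assuming the layout `hf://<owner>/<repo>/<file>@<revision>` (inferred from this one URI; `DatasetRef` and `parse_dataset_uri` are hypothetical names):

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    repo_id: str
    filename: str
    revision: str


def parse_dataset_uri(uri):
    """Split an hf:// URI into repo id, file name, and pinned revision."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    # Everything after the last "@" is treated as the revision.
    location, sep, revision = uri[len(prefix):].rpartition("@")
    if not sep or not revision:
        raise ValueError("URI is not pinned to a revision")
    # Assumption: the first two path segments are owner and repository.
    parts = location.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected owner/repo/file in {uri!r}")
    owner, repo, filename = parts
    return DatasetRef(f"{owner}/{repo}", filename, revision)


ref = parse_dataset_uri(
    "hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
    "@d47fb5475144260a75467d2f2e22b2d5d53d4d57"
)
print(ref)
```

Rejecting URIs without a revision keeps an unpinned dataset reference from being recorded as certified provenance.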

Local sentinels:

uv run python -m policybench.cli reference-outputs --country us --num-scenarios 3 --seed 42 --output results/local/sentinels/policyengine_4_14_2/us/reference_outputs.csv --scenario-manifest-output results/local/sentinels/policyengine_4_14_2/us/scenarios.csv
uv run python -m policybench.cli population-weights --country us --output results/local/sentinels/policyengine_4_14_2/us/population_weights.json

The population-weight sentinel recorded source_household_rows=41314, positive_weight_households=9343, and all household/aggregate/equal weight vectors summed to 1.
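
A check of the "sums to 1" property might look like the sketch below. The function names and toy weights are assumptions for illustration; the actual layout of `population_weights.json` is not shown in this PR.

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = math.fsum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def sums_to_one(weights, tol=1e-9):
    """True when the weights sum to 1 within an absolute tolerance."""
    return math.isclose(math.fsum(weights), 1.0, abs_tol=tol)


def positive_count(weights):
    """Count entries with strictly positive weight."""
    return sum(1 for w in weights if w > 0)


# Toy data: four source rows, two of which carry positive weight.
raw = [0.0, 2.0, 6.0, 0.0]
normalized = normalize(raw)
print(normalized, sums_to_one(normalized), positive_count(normalized))
```

Using `math.fsum` with a tolerance avoids false failures from floating-point rounding when tens of thousands of small weights are added.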

Checks

uv run pytest tests/test_policyengine_runtime.py tests/test_population_weights.py -q
uv run pytest tests/test_scenarios.py::test_formula_overtime_premium_is_not_prompted_or_sent_to_policyengine -q
uv run pytest -m "not slow" --tb=short -q
uv run ruff format --check .
uv run ruff check .

vercel · 2026-06-07T19:56:11Z

Project	Deployment	Actions	Updated (UTC)
policybench-site	Ready	Preview, Comment	Jun 7, 2026 7:57pm

MaxGhenis · 2026-06-07T20:18:22Z

Paid US sentinel completed locally against the 3-household policyengine_4_14_2 manifest.

Models run, one-household chunks, serial:

gpt-5.4-nano
gemini-3.5-flash
grok-4.1-fast
claude-haiku-4.5

Structural result:

264 combined prediction rows (66 per model)
0 missing predictions
0 missing explanations
0 provider error rows
analysis export succeeded under results/local/sentinels/policyengine_4_14_2/us/paid_smoke/analysis
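
The structural counts above (4 models at 66 rows each, giving 264 rows) could be verified with something like the following. The column names `model`, `prediction`, `explanation`, and `error` are assumed for the sketch and may not match the real prediction files.

```python
from collections import Counter


def structural_summary(rows, expected_models, rows_per_model):
    """Summarize row counts and missing fields across combined prediction rows."""
    counts = Counter(row["model"] for row in rows)
    return {
        "total_rows": len(rows),
        "models_with_wrong_count": sorted(
            m for m in expected_models if counts.get(m, 0) != rows_per_model
        ),
        "missing_predictions": sum(1 for r in rows if r.get("prediction") is None),
        "missing_explanations": sum(1 for r in rows if not r.get("explanation")),
        "provider_errors": sum(1 for r in rows if r.get("error")),
    }


models = ["gpt-5.4-nano", "gemini-3.5-flash", "grok-4.1-fast", "claude-haiku-4.5"]
# Synthetic rows standing in for the real combined predictions.
rows = [
    {"model": m, "prediction": 0.0, "explanation": "placeholder", "error": None}
    for m in models
    for _ in range(66)
]
summary = structural_summary(rows, models, rows_per_model=66)
print(summary)
```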

Usage/cost summary:

gpt-5.4-nano: $0.006231, 30.6s
gemini-3.5-flash: $0.069912, 35.4s
grok-4.1-fast: $0.022298, 18.5s
claude-haiku-4.5: $0.201590, 204.9s
total: $0.300031
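
The total can be re-derived from the per-model figures. This sketch uses `Decimal` so that the six-decimal dollar amounts add exactly:

```python
from decimal import Decimal

# Per-model costs in USD, copied from the summary above.
costs = {
    "gpt-5.4-nano": Decimal("0.006231"),
    "gemini-3.5-flash": Decimal("0.069912"),
    "grok-4.1-fast": Decimal("0.022298"),
    "claude-haiku-4.5": Decimal("0.201590"),
}

total = sum(costs.values(), Decimal("0"))
most_expensive = max(costs, key=costs.get)
print(f"total: ${total}, most expensive: {most_expensive}")
```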

Provider metadata notes:

gpt-5.4-nano resolved as gpt-5.4-nano-2026-03-17
gemini-3.5-flash resolved as gemini-3.5-flash
claude-haiku-4.5 resolved as claude-haiku-4-5-20251001
grok-4.1-fast rows reported provider-resolved model grok-4.3; worth checking before treating that alias as a distinct full-run model.
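
One way to flag an alias like the grok case automatically is a prefix heuristic: treat dots and hyphens as equivalent, then require the resolved name to equal the requested alias or extend it with a suffix. This is an illustrative heuristic, not something PolicyBench is known to implement.

```python
def canonical(name):
    """Lowercase and treat dots as hyphens so 4.5 and 4-5 compare equal."""
    return name.lower().replace(".", "-")


def alias_matches(requested, resolved):
    """True when the resolved name is the alias itself or the alias plus a suffix."""
    req, res = canonical(requested), canonical(resolved)
    return res == req or res.startswith(req + "-")


# Requested alias -> provider-resolved model, as reported above.
resolved_models = {
    "gpt-5.4-nano": "gpt-5.4-nano-2026-03-17",
    "gemini-3.5-flash": "gemini-3.5-flash",
    "claude-haiku-4.5": "claude-haiku-4-5-20251001",
    "grok-4.1-fast": "grok-4.3",
}

mismatches = [
    alias for alias, resolved in resolved_models.items()
    if not alias_matches(alias, resolved)
]
print(mismatches)
```

A dated suffix passes the check, while a different base name such as `grok-4.3` is reported for manual review.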

Non-blocking stdout warning: LiteLLM/Pydantic serializer warnings appeared for Gemini/xAI/Claude response serialization, but all chunks wrote complete rows and analysis parsed them.

Pin PolicyBench US runs to policyengine.py 4.14.2

8efa697

vercel Bot deployed to Preview on June 7, 2026, 19:57

MaxGhenis wants to merge 1 commit into codex/policybench-lingering-fixes from codex/policybench-latest-policyengine-us
