Pin US PolicyBench runs to policyengine.py 4.14.2 #59

Open
MaxGhenis wants to merge 1 commit into codex/policybench-lingering-fixes from codex/policybench-latest-policyengine-us

Conversation

@MaxGhenis

Contributor

Summary

  • pin the PolicyBench Python dependency to the verified latest policyengine[us]==4.14.2 and remove the direct policyengine-us GitHub commit pin
  • read matching PolicyEngine bundle provenance from the raw policyengine.py release manifest so metadata does not depend on importing top-level policyengine
  • route US population-weight generation through the PolicyBench runtime wrapper and record the certified dataset URI, model/data versions, build ID, and sha256
  • continue excluding fsla_overtime_premium from prompt/situation inputs, since the latest policyengine-us now discovers it as an input
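The exclusion in the last bullet can be pictured as a filter over the discovered input names. This is a minimal sketch, not PolicyBench's actual code: `EXCLUDED_INPUTS` and `promptable_inputs` are hypothetical names, and the example input list is illustrative.

```python
# Hypothetical sketch: drop excluded variables from discovered model inputs.
EXCLUDED_INPUTS = frozenset({"fsla_overtime_premium"})


def promptable_inputs(discovered: list[str]) -> list[str]:
    """Return discovered input names, minus excluded ones, in original order."""
    return [name for name in discovered if name not in EXCLUDED_INPUTS]


# Illustrative input list; the real discovered set comes from policyengine-us.
discovered = ["employment_income", "fsla_overtime_premium", "age"]
kept = promptable_inputs(discovered)
```

Keeping the exclusion set in one place means a newly discovered input only changes prompts when it is deliberately allowed through.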

Sentinel verification

Authoritative latest-version checks:

  • PyPI JSON for policyengine -> 4.14.2
  • python3 -m pip index versions policyengine -> latest 4.14.2
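The PyPI JSON check can be reproduced with the standard library. A minimal sketch, assuming the public endpoint `https://pypi.org/pypi/<package>/json` and its `info.version` field; the helper names are illustrative and not part of PolicyBench.

```python
import json
from urllib.request import urlopen

PYPI_JSON_URL = "https://pypi.org/pypi/{package}/json"


def latest_version_from_payload(payload: dict) -> str:
    """Extract the latest release version from a PyPI JSON API payload."""
    return payload["info"]["version"]


def fetch_latest_version(package: str, timeout: float = 10.0) -> str:
    """Query PyPI for the latest published version (requires network access)."""
    url = PYPI_JSON_URL.format(package=package)
    with urlopen(url, timeout=timeout) as response:
        return latest_version_from_payload(json.load(response))


def pin_is_latest(pinned: str, payload: dict) -> bool:
    """True when the pinned version equals the latest version in the payload."""
    return pinned == latest_version_from_payload(payload)
```

Calling `fetch_latest_version("policyengine")` performs the live lookup; the payload-based helpers allow the comparison to be tested offline.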

Resolved local environment:

  • policyengine==4.14.2
  • policyengine-us==1.715.2
  • policyengine-core==3.26.1
  • raw US manifest bundle_id=us-4.14.2, data package policyengine-us-data==1.115.5, default dataset enhanced_cps_2024, URI hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5@d47fb5475144260a75467d2f2e22b2d5d53d4d57
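To record the dataset revision separately from its path, the URI can be split on its trailing `@`. A minimal sketch, assuming the layout `hf://<owner>/<repo>/<path>@<revision>` inferred from the single URI above; `DatasetRef` and `parse_hf_uri` are hypothetical names.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    repo: str      # "<owner>/<repo>"
    path: str      # file path inside the repo
    revision: str  # pinned commit hash


def parse_hf_uri(uri: str) -> DatasetRef:
    """Split an hf:// URI of the assumed form hf://owner/repo/path@revision."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    body, sep, revision = uri[len(prefix):].rpartition("@")
    if not sep or not revision:
        raise ValueError(f"URI has no pinned revision: {uri!r}")
    parts = body.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected owner/repo/path in {uri!r}")
    owner, name, path = parts
    return DatasetRef(repo=f"{owner}/{name}", path=path, revision=revision)


ref = parse_hf_uri(
    "hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
    "@d47fb5475144260a75467d2f2e22b2d5d53d4d57"
)
```

Rejecting URIs without a revision keeps an unpinned dataset from being recorded as certified.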

Local sentinels:

  • uv run python -m policybench.cli reference-outputs --country us --num-scenarios 3 --seed 42 --output results/local/sentinels/policyengine_4_14_2/us/reference_outputs.csv --scenario-manifest-output results/local/sentinels/policyengine_4_14_2/us/scenarios.csv
  • uv run python -m policybench.cli population-weights --country us --output results/local/sentinels/policyengine_4_14_2/us/population_weights.json

The population-weight sentinel recorded source_household_rows=41314 and positive_weight_households=9343; the household, aggregate, and equal weight vectors each summed to 1.
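The sum-to-1 property can be checked with a tolerance rather than exact equality. A minimal sketch with hypothetical helper names and toy weights; it does not reflect how PolicyBench builds or validates its weight vectors.

```python
import math


def normalize(weights: list[float]) -> list[float]:
    """Scale non-negative weights so that they sum to 1."""
    total = math.fsum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def sums_to_one(weights: list[float], tol: float = 1e-9) -> bool:
    """Check that a weight vector sums to 1 within an absolute tolerance."""
    return math.isclose(math.fsum(weights), 1.0, abs_tol=tol)


def count_positive(weights: list[float]) -> int:
    """Count entries with strictly positive weight."""
    return sum(1 for w in weights if w > 0)


# Toy example: three source rows, two of which carry positive weight.
raw = [2.0, 0.0, 6.0]
shares = normalize(raw)
```

`math.fsum` limits floating-point accumulation error, which matters when summing tens of thousands of small shares.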

Checks

  • uv run pytest tests/test_policyengine_runtime.py tests/test_population_weights.py -q
  • uv run pytest tests/test_scenarios.py::test_formula_overtime_premium_is_not_prompted_or_sent_to_policyengine -q
  • uv run pytest -m "not slow" --tb=short -q
  • uv run ruff format --check .
  • uv run ruff check .

@vercel

vercel Bot commented Jun 7, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
policybench-site | Ready | Preview, Comment | Jun 7, 2026 7:57pm

@MaxGhenis

Contributor Author

Paid US sentinel completed locally against the 3-household policyengine_4_14_2 manifest.

Models run serially, in one-household chunks:

  • gpt-5.4-nano
  • gemini-3.5-flash
  • grok-4.1-fast
  • claude-haiku-4.5

Structural result:

  • 264 combined prediction rows (66 per model)
  • 0 missing predictions
  • 0 missing explanations
  • 0 provider error rows
  • analysis export succeeded under results/local/sentinels/policyengine_4_14_2/us/paid_smoke/analysis
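The structural counts above could be derived from the combined prediction rows along these lines. A minimal sketch: the column names `model`, `prediction`, `explanation`, and `error` are assumptions, not the actual PolicyBench schema.

```python
from collections import Counter


def _is_blank(value) -> bool:
    """Treat None and whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def structural_summary(rows: list[dict]) -> dict:
    """Summarize row counts and missing fields for combined prediction rows."""
    return {
        "total_rows": len(rows),
        "rows_per_model": dict(Counter(row["model"] for row in rows)),
        "missing_predictions": sum(_is_blank(r.get("prediction")) for r in rows),
        "missing_explanations": sum(_is_blank(r.get("explanation")) for r in rows),
        "provider_errors": sum(not _is_blank(r.get("error")) for r in rows),
    }


# Toy rows with hypothetical columns, including deliberately broken entries.
rows = [
    {"model": "a", "prediction": "1", "explanation": "x", "error": ""},
    {"model": "a", "prediction": "", "explanation": "x", "error": ""},
    {"model": "b", "prediction": "2", "explanation": None, "error": "timeout"},
]
summary = structural_summary(rows)
```

For the run reported here, a clean result would show 264 total rows, 66 per model, and zeros for the three failure counters.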

Usage/cost summary:

  • gpt-5.4-nano: $0.006231, 30.6s
  • gemini-3.5-flash: $0.069912, 35.4s
  • grok-4.1-fast: $0.022298, 18.5s
  • claude-haiku-4.5: $0.201590, 204.9s
  • total: $0.300031
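The reported total can be re-derived from the per-model figures. A small sketch using `Decimal` so that the six-decimal dollar amounts add exactly; the dictionary layout is illustrative.

```python
from decimal import Decimal

# Per-model cost (USD) and wall-clock seconds, copied from the summary above.
usage = {
    "gpt-5.4-nano": (Decimal("0.006231"), Decimal("30.6")),
    "gemini-3.5-flash": (Decimal("0.069912"), Decimal("35.4")),
    "grok-4.1-fast": (Decimal("0.022298"), Decimal("18.5")),
    "claude-haiku-4.5": (Decimal("0.201590"), Decimal("204.9")),
}

total_cost = sum((cost for cost, _ in usage.values()), Decimal("0"))
total_seconds = sum((secs for _, secs in usage.values()), Decimal("0"))

# Compare against the total reported in the comment.
matches_reported_total = total_cost == Decimal("0.300031")
```

Binary floats would also land close to the reported figure, but exact decimal arithmetic avoids spurious mismatches in the last digit.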

Provider metadata notes:

  • gpt-5.4-nano resolved as gpt-5.4-nano-2026-03-17
  • gemini-3.5-flash resolved as gemini-3.5-flash
  • claude-haiku-4.5 resolved as claude-haiku-4-5-20251001
  • grok-4.1-fast rows reported provider-resolved model grok-4.3; worth checking before treating that alias as a distinct full-run model.
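A check like the following could flag alias mismatches such as the grok one automatically. A minimal sketch under one assumption: a resolved name is consistent when it starts with the requested alias once `.` and `_` are normalized to `-`. The function names are hypothetical.

```python
import re


def _normalize(name: str) -> str:
    """Lowercase and map '.' and '_' to '-' so dated variants compare cleanly."""
    return re.sub(r"[._]", "-", name.lower())


def alias_matches(requested: str, resolved: str) -> bool:
    """True when the provider-resolved model name extends the requested alias."""
    return _normalize(resolved).startswith(_normalize(requested))


# Requested alias -> provider-resolved model, as reported above.
resolved_models = {
    "gpt-5.4-nano": "gpt-5.4-nano-2026-03-17",
    "gemini-3.5-flash": "gemini-3.5-flash",
    "claude-haiku-4.5": "claude-haiku-4-5-20251001",
    "grok-4.1-fast": "grok-4.3",
}
mismatched = [
    alias for alias, resolved in resolved_models.items()
    if not alias_matches(alias, resolved)
]
```

A prefix rule is deliberately strict: it accepts dated snapshots of the same model but flags any resolution to a differently named model for manual review.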

Non-blocking stdout warning: LiteLLM/Pydantic serializer warnings appeared for Gemini/xAI/Claude response serialization, but all chunks wrote complete rows and analysis parsed them.
