Compare PolicyBench to Vals AI's Public Benefits Bench in related work by MaxGhenis · Pull Request #62 · PolicyEngine/policybench

MaxGhenis · 2026-06-09T22:01:08Z

Why

Vals AI released Public Benefits Bench on June 9, 2026 with the Center for Civic Futures and Code for America. It is the closest public-benefits benchmark to PolicyBench, so the paper's related work should position against it.

What

New related-work paragraph presenting the two benchmarks as complementary: Public Benefits Bench grades conversational SNAP guidance with expert rubrics and an LLM judge, with tools in play, while PolicyBench scores no-tool numeric estimation across tax and transfer outputs in two countries against microsimulation references. The paragraph notes that the conclusions are consistent: the best Public Benefits Bench configuration passes 71.7% of rubric criteria, and that benchmark reports that "no general-purpose AI model performs well enough to be trusted with SNAP benefits guidance".
@vals2026publicbenefits bib entry.
Re-rendered policybench.pdf and the web export from the unchanged 2026-05-20 frozen snapshot (prose-only change; figures and tables are byte-identical), and bumped the /paper iframe cache-buster.
Documented render-venv requirements in paper/README.md (ipykernel/nbformat/nbclient + the policybench-paper kernelspec, and the gotcha where a kernelspec pointing at another checkout's venv silently renders that checkout's policybench code).
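The kernelspec gotcha in the last item can be guarded against with a small pre-render check. The sketch below is not taken from paper/README.md; it assumes only the standard Jupyter `kernel.json` layout, in which `argv[0]` is the interpreter that launches the kernel. The function names are hypothetical, and the location of the policybench-paper `kernel.json` varies by machine (`jupyter kernelspec list` prints it).

```python
import json
import os
from pathlib import Path


def kernelspec_interpreter(kernel_json: Path) -> Path:
    """Return the interpreter path recorded as argv[0] in a kernel.json file."""
    spec = json.loads(kernel_json.read_text(encoding="utf-8"))
    return Path(spec["argv"][0])


def kernel_belongs_to_checkout(kernel_json: Path, checkout_root: Path) -> bool:
    """Check that the kernelspec launches a Python located under checkout_root."""
    # Normalize without following symlinks: a venv's python is usually a symlink
    # to an interpreter outside the checkout, so resolving it would raise a
    # false alarm.
    interpreter = Path(os.path.abspath(kernelspec_interpreter(kernel_json)))
    root = Path(os.path.abspath(checkout_root))
    # is_relative_to compares whole path components (Python 3.9+), so a sibling
    # checkout such as "policybench-other" does not match "policybench".
    return interpreter.is_relative_to(root)
```

Running this before a render and failing when it returns `False` would turn the silent wrong-checkout render into a loud error. It assumes the venv lives inside the checkout, which may not hold for every setup.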

Verification

Every cited figure was checked against the benchmark page in three separate fetches; a "13 models evaluated" figure from the first fetch turned out to be wrong (the page says 12), so the paragraph omits the model count.
Rendered PDF inspected visually: the paragraph lands on p. 3, the citation resolves to "(Vals AI 2026)", and the bibliography entry is present.
bun run lint passes for the app change.
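The three-fetch check in the first item amounts to keeping a figure only when every fetch agrees on it. A minimal sketch of that rule follows; how the fetches were actually compared is not stated in this PR, so the `reconcile` function, the key names, and the per-fetch values for the model count beyond "first fetch said 13, page says 12" are illustrative assumptions.

```python
def reconcile(fetches):
    """Split extracted figures into those every fetch agrees on and those that conflict."""
    agreed, conflicts = {}, {}
    keys = set().union(*(fetch.keys() for fetch in fetches))
    for key in sorted(keys):
        values = [fetch.get(key) for fetch in fetches]
        # A figure missing from any fetch counts as a conflict, not as agreement.
        if values[0] is not None and all(v == values[0] for v in values):
            agreed[key] = values[0]
        else:
            conflicts[key] = values
    return agreed, conflicts


# Illustrative inputs built from figures quoted in this PR.
fetches = [
    {"scenarios": 459, "best_score_pct": 71.7, "models_evaluated": 13},
    {"scenarios": 459, "best_score_pct": 71.7, "models_evaluated": 12},
    {"scenarios": 459, "best_score_pct": 71.7, "models_evaluated": 12},
]
agreed, conflicts = reconcile(fetches)
```

Under this rule the model count lands in `conflicts` and stays out of the paragraph, which matches the decision described above.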

🤖 Generated with Claude Code

Public Benefits Bench (Vals AI with the Center for Civic Futures and Code for America, released 2026-06-09) grades free-text SNAP guidance against expert rubrics with an LLM judge and varies tool access; PolicyBench scores no-tool numeric estimation against microsimulation references. The new related-work paragraph positions the two as complementary and notes their consistent headline conclusions. All cited figures (459 scenarios, 230-question test set, 80.6% judge agreement, +7.6pp multi-turn / +6.9pp web search, 71.7% best score) were verified against the benchmark page in three separate fetches.

Re-renders the manuscript PDF and web export from the unchanged 2026-05-20 frozen snapshot, bumps the web cache-buster, and documents the render venv requirements (ipykernel/nbformat/nbclient and the policybench-paper kernelspec) in paper/README.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

vercel · 2026-06-09T22:01:11Z


Project	Deployment	Actions	Updated (UTC)
policybench-site	Ready	Preview, Comment	Jun 9, 2026 10:15pm


vercel Bot deployed to Preview June 9, 2026 22:03 View deployment

Update rendered paper artifact hashes in snapshot manifest

39636d7

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

vercel Bot deployed to Preview June 9, 2026 22:15 View deployment

MaxGhenis mentioned this pull request Jun 10, 2026

Make the paper render hermetic; render-check it in CI #64

Merged

MaxGhenis merged commit 74f4ee0 into main Jun 10, 2026
5 checks passed

MaxGhenis deleted the paper-vals-comparison branch June 10, 2026 18:56

MaxGhenis merged 2 commits into main from paper-vals-comparison
