Compare PolicyBench to Vals AI's Public Benefits Bench in related work #62

Merged
MaxGhenis merged 2 commits into main from paper-vals-comparison on Jun 10, 2026
@MaxGhenis

Contributor

Why

On June 9, 2026, Vals AI released Public Benefits Bench with the Center for Civic Futures and Code for America. It is the closest public-benefits benchmark to PolicyBench, so the paper's related work should position against it.

What

  • New related-work paragraph comparing the two benchmarks as complementary. Public Benefits Bench grades conversational SNAP guidance against expert rubrics with an LLM judge, with tools in play; PolicyBench scores no-tool numeric estimation across tax and transfer outputs in two countries against microsimulation references. The paragraph also notes that the two reach consistent conclusions: their best configuration passes 71.7% of rubric criteria, and they conclude that "no general-purpose AI model performs well enough to be trusted with SNAP benefits guidance".
  • @vals2026publicbenefits bib entry.
  • Re-rendered policybench.pdf and the web export from the unchanged 2026-05-20 frozen snapshot (prose-only change; figures and tables are byte-identical), and bumped the /paper iframe cache-buster.
  • Documented render-venv requirements in paper/README.md (ipykernel/nbformat/nbclient + the policybench-paper kernelspec, and the gotcha where a kernelspec pointing at another checkout's venv silently renders that checkout's policybench code).
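
The kernelspec gotcha in the last bullet can be caught before rendering with a small check. The sketch below is an illustration, not the procedure documented in paper/README.md: it assumes the kernelspec is a standard Jupyter `kernel.json` whose `argv[0]` is the interpreter path, and the function names and the paths in the usage note are hypothetical.

```python
import json
from pathlib import Path


def kernelspec_interpreter(kernel_json: Path) -> Path:
    """Return the interpreter recorded as argv[0] in a Jupyter kernel.json."""
    spec = json.loads(kernel_json.read_text())
    return Path(spec["argv"][0])


def points_into_venv(kernel_json: Path, venv_dir: Path) -> bool:
    """True if the kernelspec launches an interpreter inside venv_dir.

    Symlinks are deliberately not resolved: a venv's python is usually a
    symlink to a base interpreter, and resolving it would hide which
    checkout's venv the kernelspec actually points at.
    """
    interpreter = kernelspec_interpreter(kernel_json)
    return interpreter.absolute().is_relative_to(venv_dir.absolute())
```

A possible use, with hypothetical paths: locate the spec directory with `jupyter kernelspec list`, then call `points_into_venv(spec_dir / "kernel.json", Path(".venv"))` from the checkout being rendered and stop if it returns `False`. `Path.is_relative_to` requires Python 3.9 or later.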

Verification

  • Every cited figure was checked against the benchmark page in three separate fetches; a "13 models evaluated" figure from the first fetch turned out to be wrong (the page says 12), so the paragraph omits the model count.
  • Rendered PDF inspected visually — the paragraph lands on p. 3 with the citation resolving to "(Vals AI 2026)" and the bibliography entry present.
  • bun run lint passes for the app change.
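
The three-fetch check in the first bullet amounts to keeping only the figures on which every fetch agrees. A minimal sketch of that idea follows; the `reconcile` helper and the dictionary keys are hypothetical, and the per-fetch values in the usage note are illustrative, since the PR only states that the first fetch reported 13 models while the page says 12.

```python
def reconcile(fetches: list[dict]) -> tuple[dict, dict]:
    """Split figures into those all fetches agree on and those in dispute."""
    agreed, disputed = {}, {}
    # Consider every figure name seen in any fetch.
    keys = set().union(*fetches)
    for key in sorted(keys):
        # A fetch that did not report the figure contributes None.
        values = [fetch.get(key) for fetch in fetches]
        if len(set(values)) == 1 and values[0] is not None:
            agreed[key] = values[0]
        else:
            disputed[key] = values
    return agreed, disputed
```

With three illustrative fetches that all report `scenarios: 459` and `best_score_pct: 71.7` but report `models_evaluated` as 13, 12, and 12, `reconcile` returns the first two figures as agreed and flags `models_evaluated` as disputed. That matches the choice made here of leaving the model count out of the paragraph.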

🤖 Generated with Claude Code

Public Benefits Bench (Vals AI with the Center for Civic Futures and
Code for America, released 2026-06-09) grades free-text SNAP guidance
against expert rubrics with an LLM judge and varies tool access;
PolicyBench scores no-tool numeric estimation against microsimulation
references. The new related-work paragraph positions the two as
complementary and notes their consistent headline conclusions.

All cited figures (459 scenarios, 230-question test set, 80.6% judge
agreement, +7.6pp multi-turn / +6.9pp web search, 71.7% best score)
were verified against the benchmark page in three separate fetches.

Re-renders the manuscript PDF and web export from the unchanged
2026-05-20 frozen snapshot, bumps the web cache-buster, and documents
the render venv requirements (ipykernel/nbformat/nbclient and the
policybench-paper kernelspec) in paper/README.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| policybench-site | Ready | Preview, Comment | Jun 9, 2026 10:15pm |

MaxGhenis merged commit 74f4ee0 into main on Jun 10, 2026
5 checks passed
MaxGhenis deleted the paper-vals-comparison branch on June 10, 2026 at 18:56