Compare PolicyBench to Vals AI's Public Benefits Bench in related work #62
Merged
Public Benefits Bench (Vals AI with the Center for Civic Futures and Code for America, released 2026-06-09) grades free-text SNAP guidance against expert rubrics with an LLM judge and varies tool access; PolicyBench scores no-tool numeric estimation against microsimulation references. The new related-work paragraph positions the two as complementary and notes their consistent headline conclusions.

All cited figures (459 scenarios, 230-question test set, 80.6% judge agreement, +7.6pp multi-turn / +6.9pp web search, 71.7% best score) were verified against the benchmark page in three separate fetches.

Re-renders the manuscript PDF and web export from the unchanged 2026-05-20 frozen snapshot, bumps the web cache-buster, and documents the render venv requirements (ipykernel/nbformat/nbclient and the policybench-paper kernelspec) in paper/README.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Why
Vals AI released Public Benefits Bench on June 9, 2026 with the Center for Civic Futures and Code for America. It is the closest public-benefits benchmark to PolicyBench, so the paper's related work should position PolicyBench against it.
What
- Related-work paragraph comparing the two benchmarks, plus the `@vals2026publicbenefits` bib entry.
- Re-rendered `policybench.pdf` and the web export from the unchanged 2026-05-20 frozen snapshot (prose-only change; figures and tables are byte-identical), and bumped the `/paper` iframe cache-buster.
- Documented the render venv requirements in `paper/README.md` (ipykernel/nbformat/nbclient + the `policybench-paper` kernelspec, and the gotcha where a kernelspec pointing at another checkout's venv silently renders that checkout's `policybench` code).
Verification
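One way to back the claim that figures and tables are byte-identical is to hash the rendered assets before and after the re-render. The following is a minimal sketch, not the check actually run for this PR: the helper names `sha256_of` and `changed_files`, and the directory names in the usage note, are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(before: Path, after: Path, pattern: str = "*") -> list[str]:
    """List relative paths that differ between two render directories.

    A file counts as changed if its bytes differ or if it exists on only
    one side. An empty result means the matched files are byte-identical.
    """

    def digests(root: Path) -> dict[str, str]:
        return {
            p.relative_to(root).as_posix(): sha256_of(p)
            for p in root.rglob(pattern)
            if p.is_file()
        }

    old, new = digests(before), digests(after)
    return sorted(
        name for name in old.keys() | new.keys() if old.get(name) != new.get(name)
    )
```

Assuming the previous and new renders are kept in two directories (say `render_before/` and `render_after/`, both made-up names), `changed_files(Path("render_before"), Path("render_after"), "*.png")` returning `[]` would indicate that no figure changed.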
`bun run lint` passes for the app change.
🤖 Generated with Claude Code
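The kernelspec gotcha documented in `paper/README.md` could be caught before rendering by checking which interpreter the kernelspec launches. This is a rough sketch under stated assumptions: it relies only on the standard Jupyter `kernel.json` layout, where the first `argv` entry is the interpreter, and the helper names `kernelspec_interpreter` and `uses_venv` are hypothetical rather than part of the repository.

```python
import json
import os
from pathlib import Path


def kernelspec_interpreter(kernel_json: Path) -> Path:
    """Return the interpreter a kernelspec launches (its first argv entry)."""
    spec = json.loads(kernel_json.read_text(encoding="utf-8"))
    return Path(spec["argv"][0])


def uses_venv(kernel_json: Path, venv_dir: Path) -> bool:
    """Return True if the kernelspec's interpreter lives inside venv_dir.

    Paths are compared without resolving symlinks, because a venv's
    interpreter is often a symlink to a base Python outside the venv.
    A bare command such as "python" in argv is reported as False.
    """
    interpreter = kernelspec_interpreter(kernel_json)
    venv_root = Path(os.path.abspath(venv_dir))
    return interpreter.is_absolute() and interpreter.is_relative_to(venv_root)
```

The directory holding the `policybench-paper` kernelspec's `kernel.json` can be found with `jupyter kernelspec list`. If `uses_venv(kernel_json, Path(".venv"))` is false for the current checkout (the `.venv` location is an assumption), the render may be executing another checkout's `policybench` code.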