feat: add LisanBench environment by Alfianfc · Pull Request #1486 · PrimeIntellect-ai/verifiers

Alfianfc · 2026-05-29T18:18:28Z

Summary

Add a packaged lisanbench environment for the Algora LisanBench bounty.
Port the core LisanBench word-chain task: start from a given word and generate the longest possible chain, where adjacent words must have Levenshtein distance 1, every word must be a valid English word, and no word may repeat.
Add reward functions for correct start word, edit-distance transition validity, valid-prefix length, duplicate avoidance, and comma-list formatting.
Cache the dwyl/english-words dictionary on first use with a small offline fallback for smoke tests.
Document quickstart and source/bounty references.
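The chain rules listed above can be sketched in a few lines. This is a minimal illustration, not the code from this PR: the names is_one_edit_apart and valid_prefix_length are hypothetical, and the lexicon is assumed to be a plain set of lowercase words. The start word is only compared against the expected start here, not looked up in the lexicon.

```python
def is_one_edit_apart(a: str, b: str) -> bool:
    """Return True when the Levenshtein distance between a and b is exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: everything after the mismatch must be identical.
        return a[i + 1:] == b[i + 1:]
    # Insertion or deletion: dropping b[i] must yield a.
    return a[i:] == b[i + 1:]


def valid_prefix_length(chain: list[str], start: str, lexicon: set[str]) -> int:
    """Count the words in the longest rule-abiding prefix of chain."""
    if not chain or chain[0] != start:
        return 0
    seen = {chain[0]}
    length = 1
    for prev, word in zip(chain, chain[1:]):
        # Stop at the first repeat, unknown word, or invalid transition.
        if word in seen or word not in lexicon or not is_one_edit_apart(prev, word):
            break
        seen.add(word)
        length += 1
    return length
```

For example, with a lexicon containing cat, cot, cog, dog, and dig, the chain cat, cot, cog, dog, dig, cat has a valid prefix of five words, because the final cat is a repeat.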

Verification

uv run --no-dev ruff check environments/lisanbench
uv pip install -e environments/lisanbench
uv run --no-dev python environments/lisanbench/lisanbench.py
uv run --no-dev python - <<'PY' ... vf.load_environment('lisanbench') + reward smoke checks ... PY
CHANGED_ENVS=lisanbench uv run --no-dev pytest tests/test_envs.py -q --tb=short was attempted, but the Windows host cannot execute the test's hard-coded /bin/bash subprocess path.

Algora bounty: https://algora.io/PrimeIntellect-ai/bounties/dDffD24XfkQUaR7a
Reference implementation: https://github.com/voice-from-the-outer-world/lisan-bench

Note

Low Risk
Self-contained new environment under environments/lisanbench with no changes to core auth, data pipelines, or shared runtime beyond documentation.

Overview
Adds a new installable lisanbench single-turn environment and documents it in the environments index.

The model must extend a given starting word into a comma-separated English word chain where each step has Levenshtein distance 1, words are dictionary-valid, and repeats are forbidden. Scoring uses a weighted rubric (start word, transition validity, valid-prefix length capped at 25 words, no duplicates, list-only formatting). The English lexicon is loaded from dwyl/english-words into ~/.cache/verifiers/lisanbench/ on first use, with a small embedded fallback when download or read fails. The package includes load_environment, default starting-word tasks, and eval defaults in pyproject.toml.
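The cache-then-fallback loading described above might look roughly like the following. It is a sketch under stated assumptions, not the PR's implementation: load_lexicon, FALLBACK_WORDS, and WORDS_URL are hypothetical names, the fallback words are illustrative, and the raw-file URL is an assumed location for words_alpha.txt in the dwyl/english-words repository.

```python
from pathlib import Path
from urllib.request import urlopen

# Illustrative stand-in for the small embedded fallback word set.
FALLBACK_WORDS = frozenset({"cat", "cot", "cog", "dog"})

# Assumed download location; verify against the dwyl/english-words repository.
WORDS_URL = "https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt"


def load_lexicon(cache_file: Path, url: str = WORDS_URL) -> frozenset[str]:
    """Load the word list from cache_file, downloading it once if missing."""
    try:
        if not cache_file.exists():
            cache_file.parent.mkdir(parents=True, exist_ok=True)
            with urlopen(url, timeout=30) as response:
                data = response.read()  # read fully before touching the cache
            cache_file.write_bytes(data)
        words = cache_file.read_text(encoding="utf-8").split()
        # An empty file is treated like a failed read.
        return frozenset(w.lower() for w in words) or FALLBACK_WORDS
    except (OSError, ValueError):
        # Covers network errors, unreadable files, and decoding errors.
        return FALLBACK_WORDS
```

A caller would pass something like Path.home() / ".cache" / "verifiers" / "lisanbench" / "words_alpha.txt", matching the cache directory named in the overview.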

Reviewed by Cursor Bugbot for commit 700bf83. Bugbot is set up for automated code reviews on this repo.

Note

Add LisanBench single-turn word-chain environment

Adds a new lisanbench environment in lisanbench.py implementing a word-chain task where a model must produce a comma-separated chain of English words each differing by edit distance 1 from the previous.
Loads a word dictionary from a cached copy of words_alpha.txt (stored at ~/.cache/verifiers/lisanbench/words_alpha.txt), falling back to a small in-module word set on download failure.
Scores completions with a weighted rubric (total weight 6.5) covering: correct starting word, valid-link ratio, valid-prefix length (capped at 25), duplicate avoidance, and list-like formatting.
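To make the weighted rubric concrete, here is a plain-Python sketch of how five component scores could be combined. Only the total weight of 6.5 and the cap of 25 come from the summary above; the per-component weights, the division by the cap, and the names WEIGHTS, prefix_score, and weighted_total are placeholders chosen for illustration, and the real split in lisanbench.py may differ.

```python
PREFIX_CAP = 25  # cap on valid-prefix length, as stated in the summary

# Placeholder weights: only their 6.5 total is taken from the summary.
WEIGHTS = {
    "start_word": 1.0,
    "valid_links": 2.0,
    "valid_prefix": 2.0,
    "no_duplicates": 1.0,
    "format": 0.5,
}


def prefix_score(valid_prefix_len: int) -> float:
    """Map a valid-prefix length to [0, 1], saturating at the cap (assumed scaling)."""
    return min(max(valid_prefix_len, 0), PREFIX_CAP) / PREFIX_CAP


def weighted_total(scores: dict[str, float]) -> float:
    """Weighted sum of component scores; missing components count as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
```

Under these placeholder weights, a completion that scores 1.0 on every component reaches the full 6.5, and a chain longer than 25 valid words earns no more prefix credit than one of exactly 25.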
Exposes a load_environment() function returning a vf.SingleTurnEnv and documents the environment in environments/README.md.

Macroscope summarized 700bf83.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-05-29T18:24:17Z

+        return 0.0
+    has_explanation_markers = bool(re.search(r"\b(reason|because|explanation|steps?)\b", completion, re.I))
+    comma_like = "," in completion or "->" in completion or "→" in completion
+    return 1.0 if comma_like and not has_explanation_markers else 0.5 if comma_like else 0.0


Reward functions receive message list, not string

High Severity

All five reward functions annotate completion as str and pass it directly to extract_word_chain(), which calls completion.lower(). However, the Verifiers framework passes completion as state["completion"] — a list[dict[str, str]] of message dicts, not a string. Every other environment in the repo (e.g. reverse_text, wordle, mmmu) correctly uses parser.parse_answer(completion) to extract text first. Calling .lower() on a list raises AttributeError, which the rubric silently catches, returning 0.0 for every reward function. The format_reward function also directly applies re.search() and "," in completion to the list, compounding the issue. The environment appears to run but all rewards are always 0.0, making it completely non-functional for evaluation and training.
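One way to address the issue is to normalize completion to text before any string operation. The sketch below uses a hypothetical helper, completion_to_text, as a dependency-free stand-in for the parser.parse_answer(completion) pattern the review mentions; it is not the Verifiers API. The scoring logic follows the quoted diff, while the empty-text guard is an assumption, since the condition before the first return 0.0 is not visible in the hunk.

```python
import re


def completion_to_text(completion) -> str:
    """Return the last assistant message's content, or the string itself."""
    if isinstance(completion, str):
        return completion
    for message in reversed(completion):
        if message.get("role") == "assistant":
            return str(message.get("content") or "")
    return ""


def format_reward(completion) -> float:
    """Score list-only formatting on extracted text rather than the raw message list."""
    text = completion_to_text(completion)
    if not text.strip():  # assumed guard; the original condition is not shown
        return 0.0
    has_explanation_markers = bool(
        re.search(r"\b(reason|because|explanation|steps?)\b", text, re.I)
    )
    comma_like = "," in text or "->" in text or "→" in text
    if not comma_like:
        return 0.0
    return 0.5 if has_explanation_markers else 1.0
```

With this shape, format_reward accepts both a bare string and a chat-style message list, so the same function works in unit tests and under the rubric.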

Additional Locations (1)

environments/lisanbench/lisanbench.py#L163-L176


feat: add LisanBench environment

700bf83

cursor Bot reviewed May 29, 2026

Alfianfc wants to merge 1 commit into PrimeIntellect-ai:main from Alfianfc:feat/lisanbench-env
