feat: add SciCode environment #1487

Open
Alfianfc wants to merge 1 commit into PrimeIntellect-ai:main from Alfianfc:feat/scicode-env


Alfianfc commented May 29, 2026

Summary

  • Add a packaged scicode environment for the Algora SciCode bounty.
  • Load SciCode1/SciCode from Hugging Face and convert each scientific research substep into a SingleTurnEnv coding prompt.
  • Preserve SciCode prompt structure: prior step context, next step description, required dependencies, exact function/class header, and response guidelines.
  • Add static reward functions for valid Python syntax, expected function/class implementation, fenced Python code block formatting, no top-level examples/tests/prints/asserts, background comment, and return statements.
  • Include a small fallback sample for offline smoke tests and document quickstart/source links.
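One of the static checks listed above, "expected function/class implementation", could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function name `defines_expected_name` and its signature are hypothetical, and it takes already-extracted code as a plain string.

```python
import ast


def defines_expected_name(code: str, expected_name: str) -> float:
    """Hypothetical static check: 1.0 if `code` parses and defines
    `expected_name` as a top-level function or class, else 0.0."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Code that does not parse cannot define anything.
        return 0.0
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.name == expected_name:
                return 1.0
    return 0.0
```

Walking the AST rather than matching `def name(` with a regex avoids false positives from comments and string literals that happen to contain the header text.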

Verification

  • uv run --no-dev ruff check environments/scicode
  • uv pip install -e environments/scicode
  • uv run --no-dev python environments/scicode/scicode.py
  • uv run --no-dev python - <<'PY' ... vf.load_environment('scicode') + reward smoke checks ... PY
  • CHANGED_ENVS=scicode uv run --no-dev pytest tests/test_envs.py -q --tb=short was attempted, but the Windows host cannot execute the test's hard-coded /bin/bash subprocess path.

Algora bounty: https://algora.io/PrimeIntellect-ai/bounties/AG9a7bN3dkaFcVL3
Reference implementation: https://github.com/scicode-bench/SciCode
Dataset: https://huggingface.co/datasets/SciCode1/SciCode


Note

Low Risk
Additive example environment and docs; no changes to core auth, training, or shared runtime paths.

Overview
Adds a new scicode installable environment for SciCode-style scientific coding substeps, and lists it in the environments README under SingleTurnEnv examples.

The environment loads SciCode1/SciCode (with a small local fallback when HF is unavailable) and expands each problem into per-step question / answer rows with SciCode-style prompts (prior steps, dependencies, exact header). It exposes load_environment as a SingleTurnEnv with a weighted rubric of static checks only: syntax, expected def/class name, fenced Python, no tests/prints/asserts, a # Background: comment, and a return. It performs no HDF5 numeric verification.
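The per-step expansion described above might look roughly like the sketch below. The field names (`sub_steps`, `description`, `header`, `dependencies`) and the choice to store the header as the `answer` are assumptions made for illustration; the actual dataset schema and prompt wording in the PR may differ.

```python
def expand_problem(problem: dict) -> list[dict]:
    """Hypothetical expansion of one problem into per-step rows.

    Each row's question carries the descriptions of all earlier steps,
    the current step, the shared dependencies, and the exact header.
    """
    rows = []
    prior_descriptions = []
    for step in problem["sub_steps"]:
        context = "\n\n".join(prior_descriptions)
        question = (
            f"Previous steps:\n{context or '(none)'}\n\n"
            f"Next step:\n{step['description']}\n\n"
            f"Dependencies:\n{problem.get('dependencies', '')}\n\n"
            f"Header:\n{step['header']}"
        )
        # Assumption: the answer column holds the expected header.
        rows.append({"question": question, "answer": step["header"]})
        prior_descriptions.append(step["description"])
    return rows
```

Appending the current description only after the row is built keeps each prompt limited to strictly earlier steps.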

Reviewed by Cursor Bugbot for commit f24882d. Bugbot is set up for automated code reviews on this repo.

Note

Add SciCode environment for scientific research coding substep evaluation

  • Adds a new scicode environment that loads the SciCode1/SciCode dataset and presents individual substeps as single-turn prompts.
  • Implements prompt construction that chains prior substep descriptions and builds a structured instruction for the current step, including the function header and expected return.
  • Defines a static rubric with six reward functions: syntax validity, correct function/class name, fenced code block presence, no top-level test/debug statements, background comment, and return statement presence.
  • Falls back to a small inline problem list if the remote dataset cannot be loaded.
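The fallback behaviour in the last bullet can be illustrated with a minimal sketch. The names `load_problems`, `remote_loader`, and `FALLBACK_PROBLEMS` are hypothetical; in the real environment the loader would presumably wrap a Hugging Face `datasets.load_dataset("SciCode1/SciCode")` call, which is left out here so the example runs offline.

```python
from typing import Callable, Iterable

# Hypothetical inline sample used only when the remote dataset is unreachable.
FALLBACK_PROBLEMS = [
    {
        "problem_id": "fallback-1",
        "sub_steps": [
            {"description": "Return the square of x.", "header": "def square(x):"}
        ],
    }
]


def load_problems(remote_loader: Callable[[], Iterable[dict]]) -> list[dict]:
    """Try the remote loader first; fall back to the inline sample on any error."""
    try:
        return list(remote_loader())
    except Exception:
        # Network errors, missing packages, and auth failures all land here.
        return list(FALLBACK_PROBLEMS)
```

Passing the loader in as a callable keeps the fallback path testable without network access.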

Macroscope summarized f24882d.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


        ast.parse(code)
        return 1.0
    except SyntaxError:
        return 0.0

Reward functions receive Messages list, not string

High Severity

All reward functions declare completion: str and operate on it directly as a string (calling extract_python_code(completion), re.search(..., completion), etc.). The Verifiers framework, however, passes state["completion"], which is a Messages list (e.g. [{"role": "assistant", "content": "..."}]). Other environments correctly call parser.parse_answer(completion) to extract the text first. Because extract_python_code checks "```" not in completion (always True for a list of dicts) and then calls completion.strip(), every call raises an AttributeError. The rubric's exception handler catches it, so every reward silently returns 0.0.
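A minimal sketch of normalising the completion before any string operations is shown below. The helper `completion_to_text` is hypothetical and standalone; the review's suggested route, parser.parse_answer(completion), is what other environments use and is probably the better fit inside the framework.

```python
def completion_to_text(completion) -> str:
    """Hypothetical helper: accept either a plain string or a list of
    chat messages and return the last assistant message's text."""
    if isinstance(completion, str):
        return completion
    texts = []
    for message in completion:
        if message.get("role") == "assistant":
            content = message.get("content")
            if isinstance(content, str):
                texts.append(content)
    # An empty string lets downstream checks score 0.0 without raising.
    return texts[-1] if texts else ""
```

Each reward function would then call this helper first and only afterwards run code extraction and regex checks on the returned string.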


    _ = answer, kwargs
    code = strip_imports(extract_python_code(completion))
    banned = [r"\bassert\b", r"if\s+__name__\s*==", r"print\s*\(", r"pytest", r"unittest"]
    return 0.0 if any(re.search(pattern, code) for pattern in banned) else 1.0

Missing word boundary causes false matches on "print"

Low Severity

The banned pattern r"print\s*\(" lacks a word boundary (\b), so it matches substrings within identifiers like fingerprint(, blueprint(, or sprint(. In scientific computing code (chemistry, bioinformatics), fingerprint is a plausible function name. This causes valid code to be incorrectly penalized with a 0.0 reward.
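The difference can be checked directly with the two patterns side by side. The constant names `LOOSE_PRINT` and `STRICT_PRINT` are illustrative only; the first pattern is the one quoted from the diff, the second adds the leading `\b` the review suggests.

```python
import re

# Pattern as quoted in the diff: no word boundary before "print".
LOOSE_PRINT = re.compile(r"print\s*\(")
# Variant with a leading word boundary, as the review suggests.
STRICT_PRINT = re.compile(r"\bprint\s*\(")

sample = "fp = fingerprint(mol)"
loose_hit = LOOSE_PRINT.search(sample) is not None    # matches inside "fingerprint("
strict_hit = STRICT_PRINT.search(sample) is not None  # no boundary between "r" and "p"
```

Note that `\b` still matches after a dot, so a method call such as `logger.print(` would remain banned under the stricter pattern.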

