benchmark with repl #2569

Open
Eugene Yurtsev (eyurtsev) wants to merge 5 commits into langchain-ai:main from eyurtsev:eugene/update_evals_repl

Eugene Yurtsev (eyurtsev) commented Apr 8, 2026

quick benchmark with repl

github-actions bot added labels on Apr 8, 2026: dependencies (Pull requests that update a dependency file), evals (Evaluation suite and Harbor integration), internal (User is a member of the `langchain-ai` GitHub organization), repl (REPL sandbox package), size: S (50-199 LOC)
Eugene Yurtsev (eyurtsev) marked this pull request as ready for review April 8, 2026 20:38
Copilot AI review requested due to automatic review settings April 8, 2026 20:38

Copilot AI left a comment

Pull request overview

This PR appears to wire the new REPL middleware into the evals suite for benchmarking, while also strengthening the REPL system prompt guidance and extending the evals pytest reporter output with total runtime.

Changes:

  • Add stronger REPL language guidance + a full example program to the REPL system prompt (and update prompt tests/snapshots accordingly).
  • Add total_duration_s reporting to the evals pytest reporter and cover it with a new unit test.
  • Update evals dependencies/lockfiles to include langchain-repl / langchain-quickjs and pydantic-monty, and switch the relational tool-usage eval to use ReplMiddleware.
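
As a rough illustration of the `total_duration_s` change, a pytest plugin can time the whole session with the `pytest_sessionstart` and `pytest_sessionfinish` hooks. The sketch below is not the code in `pytest_reporter.py`; the class name `DurationReporter`, the injectable `clock` parameter, and the two-decimal rounding are assumptions made for the example.

```python
import json
import time


class DurationReporter:
    """Hypothetical pytest plugin that records total session runtime."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the timing logic can be tested
        # without actually waiting.
        self._clock = clock
        self._start = None
        self.summary = {}

    def pytest_sessionstart(self, session):
        # pytest calls this hook once, before collection and execution.
        self._start = self._clock()

    def pytest_sessionfinish(self, session, exitstatus):
        # pytest calls this hook once, after the whole run has finished.
        elapsed = self._clock() - self._start
        self.summary["total_duration_s"] = round(elapsed, 2)

    def to_json(self):
        # Serialize the summary payload, e.g. for a report file.
        return json.dumps(self.summary, sort_keys=True)
```

An instance of such a class could be registered from a `conftest.py` with `config.pluginmanager.register(...)`. A unit test can then drive the two hooks directly with a fake clock, which is roughly the kind of coverage the second bullet describes.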

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| libs/repl/langchain_repl/middleware.py | Expands REPL system prompt constraints and examples. |
| libs/repl/tests/unit_tests/test_system_prompt.py | Updates assertions for new prompt content. |
| libs/repl/tests/unit_tests/smoke_tests/snapshots/langchain_repl_system_prompt_no_tools.md | Snapshot update for new prompt wording/examples. |
| libs/repl/tests/unit_tests/smoke_tests/snapshots/langchain_repl_system_prompt_mixed_foreign_functions.md | Snapshot update for new prompt wording/examples. |
| libs/repl/pyproject.toml | Adds pydantic-monty dependency. |
| libs/repl/uv.lock | Lockfile update for repl package dependencies. |
| libs/evals/tests/evals/pytest_reporter.py | Adds total_duration_s to the session summary payload + terminal output. |
| libs/evals/tests/unit_tests/test_pytest_reporter.py | Adds coverage ensuring total duration is written to the report and terminal output. |
| libs/evals/tests/evals/test_tool_usage_relational.py | Switches relational eval agent creation to ReplMiddleware (currently conflicts with existing tool-call expectations). |
| libs/evals/pyproject.toml | Adds deepagents, langchain-repl, langchain-quickjs deps and uv sources. |
| libs/evals/uv.lock | Lockfile update for evals package deps, including local editables + quickjs. |
| libs/evals/EVAL_CATALOG.md | Updates catalog line links to match shifted test line numbers. |


@@ -435,7 +437,8 @@ def _create_agent(model: BaseChatModel):
    """Create agent."""
    return create_deep_agent(
        model=model,

Copilot AI Apr 8, 2026

create_deep_agent is no longer passed tools=RELATIONAL_TOOLS, but the evals below still assert direct tool calls like list_user_ids, get_user_email, etc. With the current setup, only the repl tool is added by ReplMiddleware (foreign functions are not registered as agent tools), so these expectations will fail. Either keep passing tools=RELATIONAL_TOOLS (and optionally add the middleware) or update the scorer expectations to match repl tool calls + REPL code execution semantics.

Suggested change
-        model=model,
+        model=model,
+        tools=RELATIONAL_TOOLS,

Copilot uses AI. Check for mistakes.
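
To make the mismatch described in this comment concrete, here is a minimal sketch of a scorer-style check that compares expected tool names against the tool calls an agent actually made. It is not the eval suite's scorer: the helper `missing_tool_calls`, the dict shape of a call, and the `code` argument of the `repl` call are all assumptions for illustration. Only the tool names `list_user_ids`, `get_user_email`, and `repl` come from the comment above.

```python
def missing_tool_calls(expected, observed_calls):
    """Return expected tool names that never appear in the observed calls."""
    observed_names = {call["name"] for call in observed_calls}
    return [name for name in expected if name not in observed_names]


# Tool names taken from the review comment; the call payloads are made up.
expected = ["list_user_ids", "get_user_email"]

# Direct tool calling: each function shows up as its own tool call.
direct = [
    {"name": "list_user_ids", "args": {}},
    {"name": "get_user_email", "args": {"user_id": 1}},
]

# REPL-only setup: the agent issues a single `repl` call and the functions
# run inside the submitted code, so their names never appear as tool calls.
via_repl = [
    {"name": "repl", "args": {"code": "get_user_email(list_user_ids()[0])"}},
]

print(missing_tool_calls(expected, direct))    # []
print(missing_tool_calls(expected, via_repl))  # ['list_user_ids', 'get_user_email']
```

Under these assumptions, an expectation written against direct tool calls fails as soon as the same work is routed through one `repl` call. That is why the comment offers two ways out: pass the tools again, or rewrite the expectations around REPL calls.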
import pytest
from deepagents import create_deep_agent
from langchain_core.tools import ToolException, tool
from langchain_quickjs import QuickJSMiddleware

Copilot AI Apr 8, 2026

QuickJSMiddleware is imported but only referenced in a commented-out line, so it will be flagged as an unused import (ruff F401) and fail lint. Remove the import or re-enable usage so the symbol is referenced at runtime.

Suggested change
- from langchain_quickjs import QuickJSMiddleware

Comment thread libs/repl/pyproject.toml
requires-python = ">=3.11,<4.0"
dependencies = [
"deepagents",
"pydantic-monty>=0.0.9",

Copilot AI Apr 8, 2026

Repo contribution guidelines prohibit adding dependencies to pyproject.toml without explicit maintainer permission. This PR adds pydantic-monty here; please confirm you have approval or remove the dependency.

Suggested change
- "pydantic-monty>=0.0.9",

Comment thread libs/repl/pyproject.toml
requires-python = ">=3.11,<4.0"
dependencies = [
"deepagents",
"pydantic-monty>=0.0.9",

Copilot AI Apr 8, 2026

pydantic-monty is added as a runtime dependency, but there are no references to it anywhere under libs/repl/ (no imports/usages found). If this dependency is required transitively or for future work, it should be justified; otherwise remove it to avoid carrying an unused dependency.

Suggested change
- "pydantic-monty>=0.0.9",

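
One way to double-check a "no imports/usages found" claim like the one above is to scan the package's Python sources for imports of the module. The sketch below uses only the standard library; the helper names are hypothetical, and the import name `pydantic_monty` is an assumption about how the `pydantic-monty` distribution is imported.

```python
import ast
from pathlib import Path


def source_imports(source, module):
    """Return True if `source` has an absolute import of `module` or a submodule."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names = [node.module or ""]
        else:
            continue
        if any(n == module or n.startswith(module + ".") for n in names):
            return True
    return False


def files_importing(root, module):
    """List the .py files under `root` whose source imports `module`."""
    return [
        path
        for path in sorted(Path(root).rglob("*.py"))
        if source_imports(path.read_text(encoding="utf-8"), module)
    ]
```

Calling `files_importing("libs/repl", "pydantic_monty")` from the repository root would return an empty list if the dependency is unused in source. A static scan like this misses dynamic imports (for example via `importlib`) and usage that is only transitive, so an empty result is a hint rather than proof.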
Comment thread libs/evals/pyproject.toml
Comment on lines +47 to +49
"deepagents",
"langchain-repl",
"langchain-quickjs",

Copilot AI Apr 8, 2026

Repo contribution guidelines prohibit adding dependencies to pyproject.toml without explicit maintainer permission. This PR adds deepagents, langchain-repl, and langchain-quickjs to the project dependencies; please confirm you have approval or remove/revert these dependency changes.

Suggested change
- "deepagents",
- "langchain-repl",
- "langchain-quickjs",
