Skip to content

Fix structured output leaks in tool-calling loops#5897

Merged
lorenzejay merged 16 commits into
mainfrom
codex/fix-oss-47-structured-output-tools
May 27, 2026
Merged

Fix structured output leaks in tool-calling loops#5897
lorenzejay merged 16 commits into
mainfrom
codex/fix-oss-47-structured-output-tools

Conversation

@lorenzejay
Copy link
Copy Markdown
Collaborator

@lorenzejay lorenzejay commented May 21, 2026

Note

Medium Risk
Changes core agent executor LLM call behavior for agents that combine tools and response_format; mitigated by targeted logic and new unit tests, but affects production kickoff paths.

Overview
Agents with tools and a task response format no longer ask the LLM for structured output on every step in ReAct or native tool loops—response_model is cleared whenever original_tools is set, and native tool rounds always pass response_model=None. Structured shaping is intended for the final answer path, not mid-loop tool turns.

Post-run formatting in Agent and LiteAgent now tries model_validate_json on the raw output first, then falls back to the existing Converter path if validation fails.

Test/VCR plumbing normalizes Bedrock regional hosts for cassette matching, redacts x-amz-security-token, skips recording HTTP ≥400 responses, and refreshes Bedrock/Anthropic cassettes (e.g. Claude Sonnet 4.6). New unit tests assert response_model is not passed during tool loops.

Reviewed by Cursor Bugbot for commit c853196. Bugbot is set up for automated code reviews on this repo. Configure here.

Summary by CodeRabbit

  • Bug Fixes

    • Structured-output validation now tries direct model validation first, then falls back to converter parsing; response-model enforcement is disabled during tool-calling paths and ReAct loops.
  • Tests

    • New unit tests verify structured-output behavior when tools are active.
  • Chores

    • Updated recorded LLM fixtures and bumped Gemini/Anthropic model references used in tests.

Review Change Stack

@linear
Copy link
Copy Markdown

linear Bot commented May 21, 2026

OSS-47

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Executor code now disables structured response_model when tools are involved; core and lite agents attempt direct JSON validation via response_format.model_validate_json(...) (catching ValidationError) before falling back to Converter-based parsing. Test cassettes and a unit test were updated for Gemini 2.5 and tool-driven flows.

Changes

Response Model Handling and Structured Output Parsing

Layer / File(s) Summary
Executor response_model conditioning for tool-aware paths
lib/crewai/src/crewai/agents/crew_agent_executor.py, lib/crewai/src/crewai/experimental/agent_executor.py
Both executor implementations compute an effective_response_model that becomes None when original_tools are present in ReAct loops; native tool loops always pass response_model=None to LLM calls to disable structured output mode during tool-calling.
Structured output parsing with ValidationError handling
lib/crewai/src/crewai/agent/core.py, lib/crewai/src/crewai/lite_agent.py
Both agent implementations import ValidationError and refactor response-format parsing to first attempt response_format.model_validate_json(raw_output), catching and ignoring validation errors; only when that fails do they fall back to Converter(...).to_pydantic() with existing converter error handling.
Test coverage for response_model in tool scenarios
lib/crewai/tests/agents/test_agent_executor.py
Adds StructuredResult BaseModel and two new unit tests asserting that response_model is not forwarded to the underlying LLM response call during tool-driven ReAct and native tool-calling flows.
Cassette updates for Gemini structured output with tools
lib/crewai/tests/cassettes/llms/google/test_gemini_agent_kickoff_structured_output_with_tools.yaml, lib/crewai/tests/cassettes/llms/google/test_gemini_crew_structured_output_with_tools.yaml, lib/crewai/tests/llms/google/test_google.py
Test cassettes and model config updated to gemini-2.5-flash: generation configs now include thinkingConfig with thought generation, model responses include expanded thought text, and final structured content is recorded inline as text rather than via structured function calls; headers and client metadata refreshed.
Cassette updates for Anthropic flows
lib/crewai/tests/cassettes/llms/anthropic/test_anthropic_agent_kickoff_structured_output_with_tools.yaml, lib/crewai/tests/llms/anthropic/test_anthropic.py
Anthropic cassette updated to reflect claude-sonnet-4-6 tool-calling traces and refreshed response/headers for the same calculation flow; the test instantiation model name updated accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement

Suggested reviewers

  • greysonlalonde

Poem

🐰 I hopped through agents, quiet and spry,

Switched off models when tools passed by,
ValidationError guarded the gate,
Converter waited to validate,
Gemini hummed a thoughtful reply.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix structured output leaks in tool-calling loops' clearly and concisely summarizes the main objective of the PR, which is to prevent structured output/response models from being incorrectly sent during tool-calling loops.
Docstring Coverage ✅ Passed Docstring coverage is 93.33% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/fix-oss-47-structured-output-tools

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
lib/crewai/tests/agents/test_agent_executor.py (2)

263-263: ⚡ Quick win

Use .get() for more robust assertion.

Same concern as line 241: direct dictionary access with call_args.kwargs["response_model"] will raise KeyError if the key doesn't exist. Using .get() makes the test more robust.

🔧 Proposed refactor for robustness
-        assert get_llm_response_mock.call_args.kwargs["response_model"] is None
+        assert get_llm_response_mock.call_args.kwargs.get("response_model") is None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai/tests/agents/test_agent_executor.py` at line 263, The test is
asserting get_llm_response_mock.call_args.kwargs["response_model"] which will
raise KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (i.e., replace
direct indexing with call_args.kwargs.get("response_model")) so the test asserts
that the value is None and won't error if the key is absent.

241-241: ⚡ Quick win

Use .get() for more robust assertion.

Direct dictionary access with call_args.kwargs["response_model"] will raise KeyError if the key doesn't exist. Using .get() makes the test more robust and provides clearer failure messages if the implementation changes to omit the parameter entirely.

🔧 Proposed refactor for robustness
-        assert get_llm_response_mock.call_args.kwargs["response_model"] is None
+        assert get_llm_response_mock.call_args.kwargs.get("response_model") is None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai/tests/agents/test_agent_executor.py` at line 241, The test uses
direct dict access get_llm_response_mock.call_args.kwargs["response_model"],
which will raise KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (e.g., assert
get_llm_response_mock.call_args.kwargs.get("response_model") is None) so the
test is robust to the key being absent and yields clearer failure messages while
still verifying the intended None value for response_model.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@lib/crewai/tests/agents/test_agent_executor.py`:
- Line 263: The test is asserting
get_llm_response_mock.call_args.kwargs["response_model"] which will raise
KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (i.e., replace
direct indexing with call_args.kwargs.get("response_model")) so the test asserts
that the value is None and won't error if the key is absent.
- Line 241: The test uses direct dict access
get_llm_response_mock.call_args.kwargs["response_model"], which will raise
KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (e.g., assert
get_llm_response_mock.call_args.kwargs.get("response_model") is None) so the
test is robust to the key being absent and yields clearer failure messages while
still verifying the intended None value for response_model.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4139e894-c081-432d-a14d-3b0314c8f948

📥 Commits

Reviewing files that changed from the base of the PR and between c3ef622 and 4b08a83.

📒 Files selected for processing (5)
  • lib/crewai/src/crewai/agent/core.py
  • lib/crewai/src/crewai/agents/crew_agent_executor.py
  • lib/crewai/src/crewai/experimental/agent_executor.py
  • lib/crewai/src/crewai/lite_agent.py
  • lib/crewai/tests/agents/test_agent_executor.py

Comment thread lib/crewai/src/crewai/agent/core.py Fixed
Comment thread lib/crewai/src/crewai/agent/core.py Fixed
Comment thread lib/crewai/src/crewai/lite_agent.py Fixed
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/crewai/scripts/structured_output_with_tools_runner.py`:
- Around line 44-46: Replace the direct print call with the project-approved
logger: after calling StructuredOutputFlow().kickoff() and obtaining result (and
its result.pydantic), log the output via the configured logging facility instead
of print; locate the main block where StructuredOutputFlow.kickoff() is invoked
and swap the print(result.pydantic) for a logger call (e.g., use the module or
package logger) so the script satisfies the Ruff T201 lint rule.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a8037508-a1b0-47e9-9ba3-41a6c9aa0258

📥 Commits

Reviewing files that changed from the base of the PR and between 4b08a83 and 1700a38.

📒 Files selected for processing (3)
  • lib/crewai/scripts/structured_output_with_tools_runner.py
  • lib/crewai/src/crewai/agent/core.py
  • lib/crewai/src/crewai/lite_agent.py

Comment thread lib/crewai/scripts/structured_output_with_tools_runner.py Outdated
Comment thread lib/crewai/scripts/structured_output_with_tools_runner.py Outdated
- Changed the model from "claude-3-5-haiku-20241022" to "claude-sonnet-4-6" in the test setup.
- Updated the request and response formats in the YAML test cassette to reflect the new tool structure and improved content formatting.
- Adjusted the expected response body to match the new output format from the assistant, including changes in tool usage and response details.
- Increased rate limit values in the response headers for better testing scenarios.
- Introduced a custom matcher for Bedrock requests to normalize regional endpoints, ensuring consistent behavior across AWS regions.
- Updated the VCR configuration to utilize the new matcher.
- Adjusted test cassette to replace the original Bedrock endpoint with a placeholder for improved testing consistency.
- Modified response body and headers in the test cassette to reflect updated expected values.
content=f"Failed to parse output into response format after retries: {e.message}",
color="yellow",
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LiteAgent still leaks structured output

Medium Severity

This PR disables passing response_model during tool loops in CrewAgentExecutor and the experimental AgentExecutor, but LiteAgent._invoke_loop still passes the full response_model on every get_llm_response call even when tools are configured. Structured-output enforcement can still run alongside ReAct tool steps in LiteAgent, which is the leak the rest of the change set is meant to prevent.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ab93679. Configure here.

Comment thread conftest.py Outdated
Comment on lines +292 to +325
"match_on": ["method", "scheme", "host", "port", "path"],
"match_on": ["method", "scheme", "bedrock_host", "port", "path"],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to make the cassettes more flaky

Comment thread conftest.py
"""Register custom VCR matchers for each test cassette session."""
vcr.register_matcher("bedrock_host", bedrock_host_matcher)


Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bedrock VCR matcher unused

Medium Severity

A custom bedrock_host VCR matcher is registered in pytest_recording_configure, but vcr_config still uses default match_on with host. Replay therefore compares raw regional Bedrock hostnames instead of the normalized placeholder, so cassettes recorded in one AWS region fail in CI or locally when another region is used.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit efed5fc. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 88bd2b9. Configure here.

Comment thread conftest.py
status_code = response.get("status", {}).get("code")
if isinstance(status_code, int) and status_code >= 400:
# Avoid persisting auth/model errors when re-recording without valid AWS creds.
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filtering all 400+ responses breaks error-path cassettes

Medium Severity

The new _filter_response_headers logic unconditionally drops all HTTP responses with status code >= 400 from VCR cassette recordings, not just Bedrock auth errors. Any test that exercises error-handling paths (rate limiting, validation errors, 404s, etc.) across any provider will silently lose those recorded responses. When tests replay, the missing interaction will cause CannotSendRequest or cassette-mismatch failures. The comment says this is to "avoid persisting auth/model errors when re-recording without valid AWS creds," but the filter applies globally to all providers and all error types.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 88bd2b9. Configure here.

@lorenzejay lorenzejay requested a review from greysonlalonde May 27, 2026 16:15
@lorenzejay lorenzejay merged commit a1033e4 into main May 27, 2026
56 checks passed
@lorenzejay lorenzejay deleted the codex/fix-oss-47-structured-output-tools branch May 27, 2026 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants