Fix structured output leaks in tool-calling loops by lorenzejay · Pull Request #5897 · crewAIInc/crewAI

lorenzejay · 2026-05-21T22:33:37Z

Note

Medium Risk
Changes core agent executor LLM call behavior for agents that combine tools and response_format; mitigated by targeted logic and new unit tests, but affects production kickoff paths.

Overview
Agents with tools and a task response format no longer ask the LLM for structured output on every step in ReAct or native tool loops—response_model is cleared whenever original_tools is set, and native tool rounds always pass response_model=None. Structured shaping is intended for the final answer path, not mid-loop tool turns.

Post-run formatting in Agent and LiteAgent now tries model_validate_json on the raw output first, then falls back to the existing Converter path if validation fails.

Test/VCR plumbing normalizes Bedrock regional hosts for cassette matching, redacts x-amz-security-token, skips recording HTTP ≥400 responses, and refreshes Bedrock/Anthropic cassettes (e.g. Claude Sonnet 4.6). New unit tests assert response_model is not passed during tool loops.

^{Reviewed by Cursor Bugbot for commit c853196. Bugbot is set up for automated code reviews on this repo. Configure here.}

Summary by CodeRabbit

Bug Fixes
- Structured-output validation now tries direct model validation first, then falls back to converter parsing; response-model enforcement is disabled during tool-calling paths and ReAct loops.
Tests
- New unit tests verify structured-output behavior when tools are active.
Chores
- Updated recorded LLM fixtures and bumped Gemini/Anthropic model references used in tests.

linear · 2026-05-21T22:33:41Z

OSS-47

coderabbitai · 2026-05-21T22:33:59Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Executor code now disables structured response_model when tools are involved; core and lite agents attempt direct JSON validation via response_format.model_validate_json(...) (catching ValidationError) before falling back to Converter-based parsing. Test cassettes and a unit test were updated for Gemini 2.5 and tool-driven flows.

Changes

Response Model Handling and Structured Output Parsing

Layer / File(s)	Summary
Executor response_model conditioning for tool-aware paths `lib/crewai/src/crewai/agents/crew_agent_executor.py`, `lib/crewai/src/crewai/experimental/agent_executor.py`	Both executor implementations compute an `effective_response_model` that becomes `None` when `original_tools` are present in ReAct loops; native tool loops always pass `response_model=None` to LLM calls to disable structured output mode during tool-calling.
Structured output parsing with ValidationError handling `lib/crewai/src/crewai/agent/core.py`, `lib/crewai/src/crewai/lite_agent.py`	Both agent implementations import `ValidationError` and refactor response-format parsing to first attempt `response_format.model_validate_json(raw_output)`, catching and ignoring validation errors; only when that fails do they fall back to `Converter(...).to_pydantic()` with existing converter error handling.
Test coverage for response_model in tool scenarios `lib/crewai/tests/agents/test_agent_executor.py`	Adds `StructuredResult` BaseModel and two new unit tests asserting that `response_model` is not forwarded to the underlying LLM response call during tool-driven ReAct and native tool-calling flows.
Cassette updates for Gemini structured output with tools `lib/crewai/tests/cassettes/llms/google/test_gemini_agent_kickoff_structured_output_with_tools.yaml`, `lib/crewai/tests/cassettes/llms/google/test_gemini_crew_structured_output_with_tools.yaml`, `lib/crewai/tests/llms/google/test_google.py`	Test cassettes and model config updated to `gemini-2.5-flash`: generation configs now include `thinkingConfig` with thought generation, model responses include expanded thought text, and final structured content is recorded inline as text rather than via structured function calls; headers and client metadata refreshed.
Cassette updates for Anthropic flows `lib/crewai/tests/cassettes/llms/anthropic/test_anthropic_agent_kickoff_structured_output_with_tools.yaml`, `lib/crewai/tests/llms/anthropic/test_anthropic.py`	Anthropic cassette updated to reflect `claude-sonnet-4-6` tool-calling traces and refreshed response/headers for the same calculation flow; the test instantiation model name updated accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement

Suggested reviewers

greysonlalonde

Poem

🐰 I hopped through agents, quiet and spry,

Switched off models when tools passed by,
ValidationError guarded the gate,
Converter waited to validate,
Gemini hummed a thoughtful reply.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Fix structured output leaks in tool-calling loops' clearly and concisely summarizes the main objective of the PR, which is to prevent structured output/response models from being incorrectly sent during tool-calling loops.
Docstring Coverage	✅ Passed	Docstring coverage is 93.33% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/fix-oss-47-structured-output-tools

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

lib/crewai/tests/agents/test_agent_executor.py (2)

263-263: ⚡ Quick win

Use .get() for more robust assertion.

Same concern as line 241: direct dictionary access with call_args.kwargs["response_model"] will raise KeyError if the key doesn't exist. Using .get() makes the test more robust.

🔧 Proposed refactor for robustness

-        assert get_llm_response_mock.call_args.kwargs["response_model"] is None
+        assert get_llm_response_mock.call_args.kwargs.get("response_model") is None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai/tests/agents/test_agent_executor.py` at line 263, The test is
asserting get_llm_response_mock.call_args.kwargs["response_model"] which will
raise KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (i.e., replace
direct indexing with call_args.kwargs.get("response_model")) so the test asserts
that the value is None and won't error if the key is absent.

241-241: ⚡ Quick win

Use .get() for more robust assertion.

Direct dictionary access with call_args.kwargs["response_model"] will raise KeyError if the key doesn't exist. Using .get() makes the test more robust and provides clearer failure messages if the implementation changes to omit the parameter entirely.

🔧 Proposed refactor for robustness

-        assert get_llm_response_mock.call_args.kwargs["response_model"] is None
+        assert get_llm_response_mock.call_args.kwargs.get("response_model") is None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/crewai/tests/agents/test_agent_executor.py` at line 241, The test uses
direct dict access get_llm_response_mock.call_args.kwargs["response_model"],
which will raise KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (e.g., assert
get_llm_response_mock.call_args.kwargs.get("response_model") is None) so the
test is robust to the key being absent and yields clearer failure messages while
still verifying the intended None value for response_model.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@lib/crewai/tests/agents/test_agent_executor.py`:
- Line 263: The test is asserting
get_llm_response_mock.call_args.kwargs["response_model"] which will raise
KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (i.e., replace
direct indexing with call_args.kwargs.get("response_model")) so the test asserts
that the value is None and won't error if the key is absent.
- Line 241: The test uses direct dict access
get_llm_response_mock.call_args.kwargs["response_model"], which will raise
KeyError if the key is missing; change the assertion to use
.get("response_model") on get_llm_response_mock.call_args.kwargs (e.g., assert
get_llm_response_mock.call_args.kwargs.get("response_model") is None) so the
test is robust to the key being absent and yields clearer failure messages while
still verifying the intended None value for response_model.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4139e894-c081-432d-a14d-3b0314c8f948

📥 Commits

Reviewing files that changed from the base of the PR and between c3ef622 and 4b08a83.

📒 Files selected for processing (5)

lib/crewai/src/crewai/agent/core.py
lib/crewai/src/crewai/agents/crew_agent_executor.py
lib/crewai/src/crewai/experimental/agent_executor.py
lib/crewai/src/crewai/lite_agent.py
lib/crewai/tests/agents/test_agent_executor.py

…-47-structured-output-tools

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/crewai/scripts/structured_output_with_tools_runner.py`:
- Around line 44-46: Replace the direct print call with the project-approved
logger: after calling StructuredOutputFlow().kickoff() and obtaining result (and
its result.pydantic), log the output via the configured logging facility instead
of print; locate the main block where StructuredOutputFlow.kickoff() is invoked
and swap the print(result.pydantic) for a logger call (e.g., use the module or
package logger) so the script satisfies the Ruff T201 lint rule.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a8037508-a1b0-47e9-9ba3-41a6c9aa0258

📥 Commits

Reviewing files that changed from the base of the PR and between 4b08a83 and 1700a38.

📒 Files selected for processing (3)

lib/crewai/scripts/structured_output_with_tools_runner.py
lib/crewai/src/crewai/agent/core.py
lib/crewai/src/crewai/lite_agent.py

…and bump model version to 2.5-flash

- Changed the model from "claude-3-5-haiku-20241022" to "claude-sonnet-4-6" in the test setup. - Updated the request and response formats in the YAML test cassette to reflect the new tool structure and improved content formatting. - Adjusted the expected response body to match the new output format from the assistant, including changes in tool usage and response details. - Increased rate limit values in the response headers for better testing scenarios.

- Introduced a custom matcher for Bedrock requests to normalize regional endpoints, ensuring consistent behavior across AWS regions. - Updated the VCR configuration to utilize the new matcher. - Adjusted test cassette to replace the original Bedrock endpoint with a placeholder for improved testing consistency. - Modified response body and headers in the test cassette to reflect updated expected values.

cursor · 2026-05-26T17:48:09Z

+                            content=f"Failed to parse output into response format after retries: {e.message}",
+                            color="yellow",
+                        )
+


LiteAgent still leaks structured output

Medium Severity

This PR disables passing response_model during tool loops in CrewAgentExecutor and the experimental AgentExecutor, but LiteAgent._invoke_loop still passes the full response_model on every get_llm_response call even when tools are configured. Structured-output enforcement can still run alongside ReAct tool steps in LiteAgent, which is the leak the rest of the change set is meant to prevent.

^{Reviewed by Cursor Bugbot for commit ab93679. Configure here.}

greysonlalonde · 2026-05-26T18:34:25Z

-        "match_on": ["method", "scheme", "host", "port", "path"],
+        "match_on": ["method", "scheme", "bedrock_host", "port", "path"],


This is going to make the cassettes more flaky

…-47-structured-output-tools

…request matching

cursor · 2026-05-26T20:25:40Z

+    """Register custom VCR matchers for each test cassette session."""
+    vcr.register_matcher("bedrock_host", bedrock_host_matcher)
+
+


Bedrock VCR matcher unused

Medium Severity

A custom bedrock_host VCR matcher is registered in pytest_recording_configure, but vcr_config still uses default match_on with host. Replay therefore compares raw regional Bedrock hostnames instead of the normalized placeholder, so cassettes recorded in one AWS region fail in CI or locally when another region is used.

^{Reviewed by Cursor Bugbot for commit efed5fc. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 88bd2b9. Configure here.}

cursor · 2026-05-27T16:14:07Z

+    status_code = response.get("status", {}).get("code")
+    if isinstance(status_code, int) and status_code >= 400:
+        # Avoid persisting auth/model errors when re-recording without valid AWS creds.
+        return None


Filtering all 400+ responses breaks error-path cassettes

Medium Severity

The new _filter_response_headers logic unconditionally drops all HTTP responses with status code >= 400 from VCR cassette recordings, not just Bedrock auth errors. Any test that exercises error-handling paths (rate limiting, validation errors, 404s, etc.) across any provider will silently lose those recorded responses. When tests replay, the missing interaction will cause CannotSendRequest or cassette-mismatch failures. The comment says this is to "avoid persisting auth/model errors when re-recording without valid AWS creds," but the filter applies globally to all providers and all error types.

^{Reviewed by Cursor Bugbot for commit 88bd2b9. Configure here.}

Fix structured output leaks in tool-calling loops

4b08a83

github-actions Bot added the size/M label May 21, 2026

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

coderabbitai Bot approved these changes May 21, 2026

View reviewed changes

github-code-quality Bot found potential problems May 21, 2026

View reviewed changes

Comment thread lib/crewai/src/crewai/agent/core.py Fixed

Comment thread lib/crewai/src/crewai/agent/core.py Fixed

Comment thread lib/crewai/src/crewai/lite_agent.py Fixed

lorenzejay added 2 commits May 21, 2026 16:11

addressing comments

3349cfc

Merge branch 'main' of github.com:crewAIInc/crewAI into codex/fix-oss…

1700a38

…-47-structured-output-tools

coderabbitai Bot requested changes May 21, 2026

View reviewed changes

Comment thread lib/crewai/scripts/structured_output_with_tools_runner.py Outdated

cursor Bot reviewed May 21, 2026

View reviewed changes

Comment thread lib/crewai/scripts/structured_output_with_tools_runner.py Outdated

drop scripts

8b2e7e6

coderabbitai Bot approved these changes May 21, 2026

View reviewed changes

lorenzejay added 3 commits May 21, 2026 17:56

Update Gemini agent tests to include structured output with thoughts …

584baae

…and bump model version to 2.5-flash

merge

e9b1dd9

merge

ab82fda

greysonlalonde approved these changes May 26, 2026

View reviewed changes

lorenzejay added 5 commits May 26, 2026 09:30

adjusted bedrock cassettes

d883e6e

adjusting cassettes for bedrock

4e93aa6

fix test

ab93679

cursor Bot reviewed May 26, 2026

View reviewed changes

greysonlalonde requested changes May 26, 2026

View reviewed changes

lorenzejay added 2 commits May 26, 2026 13:22

Merge branch 'main' of github.com:crewAIInc/crewAI into codex/fix-oss…

8be5c7a

…-47-structured-output-tools

Update VCR configuration to use 'host' instead of 'bedrock_host' for …

efed5fc

…request matching

cursor Bot reviewed May 26, 2026

View reviewed changes

Merge branch 'main' into codex/fix-oss-47-structured-output-tools

88bd2b9

cursor Bot reviewed May 27, 2026

View reviewed changes

lorenzejay requested a review from greysonlalonde May 27, 2026 16:15

greysonlalonde approved these changes May 27, 2026

View reviewed changes

Merge branch 'main' into codex/fix-oss-47-structured-output-tools

c853196

lorenzejay merged commit a1033e4 into main May 27, 2026
56 checks passed

lorenzejay deleted the codex/fix-oss-47-structured-output-tools branch May 27, 2026 20:20

		"match_on": ["method", "scheme", "host", "port", "path"],
		"match_on": ["method", "scheme", "bedrock_host", "port", "path"],

		"""Register custom VCR matchers for each test cassette session."""
		vcr.register_matcher("bedrock_host", bedrock_host_matcher)

Conversation

lorenzejay commented May 21, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

linear Bot commented May 21, 2026

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cursor Bot May 26, 2026

Choose a reason for hiding this comment

LiteAgent still leaks structured output

Uh oh!

greysonlalonde May 26, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 26, 2026

Choose a reason for hiding this comment

Bedrock VCR matcher unused

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 27, 2026

Choose a reason for hiding this comment

Filtering all 400+ responses breaks error-path cassettes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lorenzejay commented May 21, 2026 •

edited by cursor Bot

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading