Revert "fix(llm): strip temperature/top_p for claude-opus-4-8 (#3427)"#3441
Conversation
This reverts commit 45697f5. #3427 routed claude-opus-4-8 through the EXTENDED_THINKING_MODELS path to strip temperature/top_p, but that path also emits the legacy `thinking: {"type": "enabled", "budget_tokens": N}` block plus the `anthropic-beta: interleaved-thinking-2025-05-14` header. Anthropic now rejects that schema for opus-4-8: invalid_request_error: "thinking.type.enabled" is not supported for this model. Use "thinking.type.adaptive" and "output_config.effort" to control thinking behavior. The stripping was never necessary in the first place. litellm (>= 1.84.1, the SDK pin) already reports opus-4-8 as supporting `reasoning_effort`, so `get_features("claude-opus-4-8").supports_reasoning_effort` is True, and the reasoning-model branch in `chat_options.py` already strips `temperature`/`top_p` and forwards `reasoning_effort` -- which is exactly how opus-4-5/4-6/4-7 are handled. Reverting #3427 makes opus-4-8 take the same path. Verified end-to-end: select_chat_options(opus-4-7) and select_chat_options(opus-4-8) both produce: temperature absent, top_p absent, reasoning_effort forwarded, no `thinking` block, no `anthropic-beta` header. Companion benchmarks revert: OpenHands/benchmarks#729 (so the proxy forwards `reasoning_effort` to Anthropic unmodified). Failing eval evidence: SWE-bench run 26641824100 (request_id req_011CbX18CQeQxYJ99w7nUorj) -- every instance errored with the schema mismatch above, leaving output.jsonl empty and 380+ instances in output_errors.jsonl. Co-authored-by: openhands <openhands@all-hands.dev>
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Summary
This is a clean, well-reasoned revert of #3427 that correctly fixes a critical regression. The PR description's root-cause analysis is excellent: adding claude-opus-4-8 to EXTENDED_THINKING_MODELS was meant to inherit the temperature/top_p stripping logic, but that same code path also injects the legacy thinking block (thinking.type.enabled + anthropic-beta header) that Anthropic now rejects for this model. The fix is correct — claude-opus-4-8 already goes through the reasoning-effort path via LiteLLM capability detection, which strips temp/top_p without injecting the thinking block. Two follow-up suggestions inline — neither blocks merge.
Coordination note: The companion benchmarks PR (OpenHands/benchmarks#729) should land alongside or before this one to avoid a window where the litellm proxy rejects reasoning_effort upfront.
Draft status: Given the urgency (100% SWE-bench evaluation error rate on every instance), consider un-drafting once CI goes green.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
This is a correct, minimal revert that directly addresses a critical regression introduced by #3427. The root-cause analysis in the PR description is thorough and accurate.
What Changed and Why It's Correct
model_features.py — Removes claude-opus-4-8 from EXTENDED_THINKING_MODELS. The original #3427 added it there solely to inherit the temperature/top_p stripping logic, but the EXTENDED_THINKING_MODELS branch in chat_options.py also unconditionally injects the legacy thinking.type.enabled block and anthropic-beta: interleaved-thinking-2025-05-14 header. Anthropic now rejects this schema for claude-opus-4-8 with a 400 error ("thinking.type.enabled" is not supported for this model), causing 100% eval failures.
With this revert, claude-opus-4-8 routes through the supports_reasoning_effort branch instead (via _normalized_supported_openai_params → LiteLLM capability detection), which strips temperature/top_p and sets reasoning_effort without injecting the legacy thinking block. This is the same path used for claude-opus-4-5/4-6/4-7, which is exactly the correct behavior.
Test changes — The removed tests (test_claude_opus_4_8_strips_temp_and_top_p and the EXTENDED_THINKING_MODELS parametrize entries) were asserting wrong behavior from #3427. Their removal is correct.
CI Status
All relevant checks pass: sdk-tests, coverage-report, pre-commit, tools-tests, agent-server-tests, cross-tests, workspace-tests, and the build matrix (Linux/macOS/Windows). The qa-changes failure is the OpenHands review automation, not a test regression.
Minor Observation
The diff also removes the ("claude-opus-4-7", False) parametrize entry from test_extended_thinking_support. This entry was added by #3427 as a new test and its removal reduces a small amount of explicit coverage (though the behavior is still correct since claude-opus-4-7 is not in EXTENDED_THINKING_MODELS). This is a cosmetic concern, not a correctness issue.
Coordination Note
The companion benchmarks PR (OpenHands/benchmarks#729) should land alongside or before this to avoid a window where the LiteLLM proxy rejects reasoning_effort before Anthropic sees the request.
Summary
The change is correct, well-scoped, and well-documented. The CI suite passes. Ready to merge once the companion benchmarks PR is coordinated.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — correct, minimal revert that fixes a critical regression. This is a clean rollback of #3427 with a solid root-cause analysis in the PR description.
Analysis
Root cause confirmed. #3427 added claude-opus-4-8 to EXTENDED_THINKING_MODELS to inherit the temperature/top_p stripping logic. Unfortunately, that branch in chat_options.py unconditionally injects the legacy thinking.type.enabled block + anthropic-beta: interleaved-thinking-2025-05-14 header — which Anthropic now rejects for this model with a 400 error, causing 100% eval failure rate.
The fix is correct. claude-opus-4-8 is already handled by _supports_reasoning_effort via LiteLLM capability detection (get_supported_openai_params returns reasoning_effort for it). That path strips temperature/top_p without injecting the legacy thinking block — identical to how claude-opus-4-5/4-6/4-7 are handled.
Test deletions are correct. The removed tests (test_claude_opus_4_8_strips_temp_and_top_p and the extended-thinking parametrize entries for opus-4-8) were asserting the now-reverted wrong behavior. The diff also removes the ("claude-opus-4-7", False) entry from test_extended_thinking_support — that entry was added by #3427 as extra coverage; its removal with the revert is reasonable since claude-opus-4-7 is still not in EXTENDED_THINKING_MODELS and the existing test matrix implicitly covers this.
Risk Assessment: Low (with one coordination dependency)
The change is confined to removing a single list entry and its associated tests. No new logic is introduced. The CI suite (sdk-tests, tools-tests, agent-server-tests, cross-tests, workspace-tests) passes.
Coordination Dependency
The companion PR OpenHands/benchmarks#729 (reverting the litellm 1.84.1 bump in the benchmarks runtime) must land alongside or before this PR. Without it, the proxy's litellm rejects reasoning_effort with UnsupportedParamsError before Anthropic ever sees the request — different failure mode, same broken evals.
Verdict
Ready to merge once the benchmarks companion PR is coordinated. No blocking issues.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
Ready to merge now, current main blocks evals, companion PR can wait. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — correct, minimal fix for a production-breaking regression. Previous reviews have covered the root cause thoroughly; this review focuses on aspects not yet addressed.
What the change does
Removes claude-opus-4-8 from EXTENDED_THINKING_MODELS. That list membership caused chat_options.py to inject the legacy thinking.type.enabled block and anthropic-beta: interleaved-thinking-2025-05-14 header — both of which Anthropic now rejects for this model with a 400 error. After the revert, claude-opus-4-8 routes through the _supports_reasoning_effort → LiteLLM branch, which strips temperature/top_p without injecting the thinking block (identical to the claude-opus-4-5/4-6/4-7 path).
One latent fragility worth tracking
_supports_reasoning_effort now depends entirely on litellm.get_supported_openai_params at runtime — no hardcoded fallback. If a future LiteLLM release reclassifies claude-opus-4-8 (e.g., drops reasoning_effort from its capability set), the model will silently stop stripping temperature/top_p without any test catching it. This isn't a blocker for this PR (the companion benchmarks PR pins a known-good LiteLLM version), but it's worth keeping in mind when the LiteLLM dependency is next bumped: re-run the eval smoke test or add a lightweight integration check.
PROMPT_CACHE_MODELS is unaffected
claude-opus-4-8 correctly remains in PROMPT_CACHE_MODELS — prompt caching and extended thinking are orthogonal concerns, and this list should stay unchanged.
CI
All required checks pass (sdk-tests, coverage-report, pre-commit, tools-tests, agent-server-tests, cross-tests, workspace-tests). The revert is safe to merge once the companion benchmarks PR (OpenHands/benchmarks#729) is coordinated.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
Good let's ship it now so we can continue with evals! |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — correct, minimal revert that resolves a production-breaking regression.
Previous automated reviews have covered the root-cause analysis thoroughly. This review focuses on the current state of the PR and confirms readiness for merge.
Change Summary
The diff is a clean revert of #3427 across three files:
| File | Change |
|---|---|
openhands-sdk/openhands/sdk/llm/utils/model_features.py |
Removes "claude-opus-4-8" from EXTENDED_THINKING_MODELS (-5 lines) |
tests/sdk/llm/test_chat_options.py |
Removes test_claude_opus_4_8_strips_temp_and_top_p (-19 lines) |
tests/sdk/llm/test_model_features.py |
Removes the three claude-opus-4-8 and one claude-opus-4-7 parametrize entries (-4 lines) |
No new code is introduced — this is a pure deletion, which is the safest possible change profile.
Correctness
The revert is correct. Placing claude-opus-4-8 in EXTENDED_THINKING_MODELS caused chat_options.py to unconditionally emit thinking.type.enabled + the anthropic-beta: interleaved-thinking-2025-05-14 header — both of which Anthropic now rejects for this model (HTTP 400, "thinking.type.enabled" is not supported for this model). The model is already handled by the supports_reasoning_effort branch via LiteLLM capability detection, which strips temperature/top_p without injecting the legacy thinking block. The PR description's end-to-end verification and 161 passed test run confirm bit-for-bit identical output between claude-opus-4-7 and claude-opus-4-8 after the revert.
Inline Review Thread Status
The two previous inline suggestions (adding an explicit False entry for claude-opus-4-8 in extended-thinking tests, and a replacement test for the reasoning-effort path) have been explicitly declined by @juanmichelini. Those threads have no outstanding blockers.
CI Status
All required checks pass: sdk-tests, coverage-report, pre-commit, tools-tests, agent-server-tests, cross-tests, workspace-tests, build-binary-and-test (Linux/macOS/Windows). The two qa-changes failures are OpenHands review automation, not test regressions. Docker image builds are in progress (non-blocking for merge).
Remaining Coordination Dependency
The companion PR OpenHands/benchmarks#729 must land alongside or before this one. Without it the LiteLLM proxy in the benchmarks runtime rejects reasoning_effort upfront with UnsupportedParamsError, which is a different failure mode but the same broken eval outcome.
Risk: Low
Scoped to removing one list entry and its associated tests. No new logic, no behavioural changes for any other model, and a comprehensive test suite confirms the desired routing.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — correct, minimal revert that resolves a production-breaking regression.
Previous automated reviews have covered the root-cause analysis thoroughly. This review focuses on the current state of the PR and confirms readiness for merge.
Change Summary
The diff is a clean revert of #3427 across three files:
| File | Change |
|---|---|
openhands-sdk/openhands/sdk/llm/utils/model_features.py |
Removes "claude-opus-4-8" from EXTENDED_THINKING_MODELS (-5 lines) |
tests/sdk/llm/test_chat_options.py |
Removes test_claude_opus_4_8_strips_temp_and_top_p (-19 lines) |
tests/sdk/llm/test_model_features.py |
Removes the three claude-opus-4-8 and one claude-opus-4-7 parametrize entries (-4 lines) |
No new code is introduced — this is a pure deletion, which is the safest possible change profile.
Correctness
The revert is correct. Placing claude-opus-4-8 in EXTENDED_THINKING_MODELS caused chat_options.py to unconditionally emit thinking.type.enabled + the anthropic-beta: interleaved-thinking-2025-05-14 header — both of which Anthropic now rejects for this model (HTTP 400, "thinking.type.enabled" is not supported for this model). The model is already handled by the supports_reasoning_effort branch via LiteLLM capability detection, which strips temperature/top_p without injecting the legacy thinking block. The PR description's end-to-end verification and 161 passed test run confirm bit-for-bit identical output between claude-opus-4-7 and claude-opus-4-8 after the revert.
Inline Review Thread Status
The two previous inline suggestions (adding an explicit False entry for claude-opus-4-8 in extended-thinking tests, and a replacement test for the reasoning-effort path) have been explicitly declined by @juanmichelini. Those threads have no outstanding blockers.
CI Status
All required checks pass: sdk-tests, coverage-report, pre-commit, tools-tests, agent-server-tests, cross-tests, workspace-tests, build-binary-and-test (Linux/macOS/Windows). The two qa-changes failures are OpenHands review automation, not test regressions. Docker image builds are in progress (non-blocking for merge).
Remaining Coordination Dependency
The companion PR OpenHands/benchmarks#729 must land alongside or before this one. Without it the LiteLLM proxy in the benchmarks runtime rejects reasoning_effort upfront with UnsupportedParamsError, which is a different failure mode but the same broken eval outcome.
Risk: Low
Scoped to removing one list entry and its associated tests. No new logic, no behavioural changes for any other model, and a comprehensive test suite confirms the desired routing.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — ready to merge pending companion PR coordination.
Previous automated reviews on this PR have thoroughly analyzed the root cause, correctness, and test coverage. This review confirms no new issues were found on the current state of the diff.
Summary of Change
Pure deletion across three files (-28 lines total, no additions):
model_features.py— removes"claude-opus-4-8"fromEXTENDED_THINKING_MODELS. This was the root cause of the regression: membership in that list causedchat_options.pyto unconditionally inject the legacythinking.type.enabledblock andanthropic-beta: interleaved-thinking-2025-05-14header, both of which Anthropic rejects with a 400 error for this model.test_chat_options.py— removes the now-incorrecttest_claude_opus_4_8_strips_temp_and_top_ptest that validated the broken behavior.test_model_features.py— removes the threeclaude-opus-4-8parametrize entries and oneclaude-opus-4-7entry from reasoning-effort support tests.
Verdict
The revert is correct and minimal. claude-opus-4-8 is already handled by the supports_reasoning_effort branch (via LiteLLM's get_supported_openai_params), which strips temperature/top_p without injecting the legacy thinking schema — identical to the claude-opus-4-5/4-6/4-7 path. The PR's end-to-end verification confirms bit-for-bit identical output.
Merge Dependency
Coordinate with OpenHands/benchmarks#729 — without that companion revert, the benchmarks runtime's LiteLLM proxy rejects reasoning_effort upfront before Anthropic ever sees the request.
Risk: Low
No new logic, no behavior changes for any other model. All CI checks pass.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
Need approval first to coordinate merge. Approve |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — correct, focused revert. Safe to merge pending companion PR.
Prior automated reviews have thoroughly covered the root cause and correctness. This review confirms the current state of the diff and summarises what is relevant for the merge decision.
What the diff does
| File | Change |
|---|---|
openhands-sdk/openhands/sdk/llm/utils/model_features.py |
Removes "claude-opus-4-8" from EXTENDED_THINKING_MODELS (the root cause) |
tests/sdk/llm/test_chat_options.py |
Removes test_claude_opus_4_8_strips_temp_and_top_p, which validated the now-reverted broken behaviour |
tests/sdk/llm/test_model_features.py |
Removes claude-opus-4-8 entries and the stale ("claude-opus-4-7", False) parametrize entry |
Why the removal is correct
EXTENDED_THINKING_MODELS membership causes chat_options.py to unconditionally inject thinking.type.enabled + anthropic-beta: interleaved-thinking-2025-05-14. Anthropic now rejects that schema for claude-opus-4-8 with a 400. claude-opus-4-8 is already handled by the supports_reasoning_effort branch (same as claude-opus-4-5/4-6/4-7), which strips temperature/top_p without injecting the legacy thinking block — the PR's end-to-end verification confirms bit-for-bit identical output vs. claude-opus-4-7.
Test coverage note (non-blocking)
Prior inline suggestions to add new test cases for claude-opus-4-8 have been reviewed and explicitly declined by the author. The 161 passing CI tests confirm the post-revert state is correct. Any future regressions here would be caught by the SWE-bench eval pipeline.
Risk
Low. Pure deletion — no new logic, no behaviour changes to any other model. All CI checks pass.
Merge dependency
Coordinate with OpenHands/benchmarks#729 before running evals: without that companion revert, the benchmarks runtime's LiteLLM proxy rejects reasoning_effort upfront (via UnsupportedParamsError) before the request reaches Anthropic.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — correct, focused revert. Safe to merge pending companion PR.
Prior automated reviews have thoroughly covered the root cause and correctness. This review confirms the current state of the diff and summarises what is relevant for the merge decision.
What the diff does
| File | Change |
|---|---|
openhands-sdk/openhands/sdk/llm/utils/model_features.py |
Removes "claude-opus-4-8" from EXTENDED_THINKING_MODELS (the root cause) |
tests/sdk/llm/test_chat_options.py |
Removes test_claude_opus_4_8_strips_temp_and_top_p, which validated the now-reverted broken behaviour |
tests/sdk/llm/test_model_features.py |
Removes claude-opus-4-8 entries and the stale ("claude-opus-4-7", False) parametrize entry |
Why the removal is correct
EXTENDED_THINKING_MODELS membership causes chat_options.py to unconditionally inject thinking.type.enabled + anthropic-beta: interleaved-thinking-2025-05-14. Anthropic now rejects that schema for claude-opus-4-8 with a 400. claude-opus-4-8 is already handled by the supports_reasoning_effort branch (same as claude-opus-4-5/4-6/4-7), which strips temperature/top_p without injecting the legacy thinking block — the PR's end-to-end verification confirms bit-for-bit identical output vs. claude-opus-4-7.
Test coverage note (non-blocking)
Prior inline suggestions to add new test cases for claude-opus-4-8 have been reviewed and explicitly declined by the author. The 161 passing CI tests confirm the post-revert state is correct. Any future regressions here would be caught by the SWE-bench eval pipeline.
Risk
Low. Pure deletion — no new logic, no behaviour changes to any other model. All CI checks pass.
Merge dependency
Coordinate with OpenHands/benchmarks#729 before running evals: without that companion revert, the benchmarks runtime's LiteLLM proxy rejects reasoning_effort upfront (via UnsupportedParamsError) before the request reaches Anthropic.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
@all-hands-bot There is no Merge dependency. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — this is a correct, minimal revert. Prior reviews have been comprehensive; this review focuses on confirming the current state and any open items.
What the PR Does
Reverts #3427, which incorrectly added claude-opus-4-8 to EXTENDED_THINKING_MODELS. The intent of #3427 (strip temperature/top_p) was sound, but the mechanism was wrong: that code path also injects the legacy thinking.type.enabled block, which Anthropic now rejects for opus-4-8 with HTTP 400. The correct path for opus-4-8 is the reasoning-effort branch, which already strips those params.
Change Quality
| File | Change | Assessment |
|---|---|---|
model_features.py |
Remove claude-opus-4-8 from EXTENDED_THINKING_MODELS |
✅ Correct — opus-4-8 belongs to the reasoning-effort path |
test_chat_options.py |
Remove test_claude_opus_4_8_strips_temp_and_top_p |
✅ Correct — test was validating behaviour through the wrong code path |
test_model_features.py |
Remove claude-opus-4-8 from @pytest.mark.parametrize and drop the stale ("claude-opus-4-7", False) entry |
✅ Correct cleanup |
Open Inline Suggestions
Two previous inline suggestions were made (add an explicit False entry for claude-opus-4-8 in the extended-thinking parametrize list, and add a replacement reasoning-effort test for opus-4-8). The PR author explicitly declined both. I agree with the author's decision:
claude-opus-4-8is already implicitly covered by the reasoning-effort parametrize set (the test iterates over_supports_reasoning_effort).- Adding a redundant entry or a bespoke stripping test for a single model variant adds noise without meaningfully increasing coverage.
Companion PR Dependency
The PR description calls out OpenHands/benchmarks#729 as a required companion (revert the litellm 1.84.1 bump in the benchmarks runtime). This SDK revert is safe to merge on its own, but benchmark evaluation will remain broken until both are merged and the eval runtime is rebuilt. Coordinate the merge order accordingly.
Verdict
Approve. The revert is minimal, well-motivated, and backed by end-to-end verification. No blocking concerns.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — correct, minimal revert. Prior automated reviews have been comprehensive; this review confirms the current state and addresses open threads.
What the PR Does
Reverts #3427, which incorrectly added claude-opus-4-8 to EXTENDED_THINKING_MODELS. The intent of #3427 (strip temperature/top_p) was sound, but the mechanism was wrong: that code path also injects the legacy thinking.type.enabled block, which Anthropic now rejects for claude-opus-4-8 with HTTP 400. The correct path for claude-opus-4-8 is the reasoning-effort branch, which already handles parameter stripping.
Change Quality
| File | Change | Assessment |
|---|---|---|
model_features.py |
Remove claude-opus-4-8 from EXTENDED_THINKING_MODELS |
✅ Correct — opus-4-8 routes through the reasoning-effort path |
test_chat_options.py |
Remove test_claude_opus_4_8_strips_temp_and_top_p |
✅ Correct — the removed test validated behaviour through the wrong code path |
test_model_features.py |
Remove opus-4-8 entries from @pytest.mark.parametrize and drop the stale ("claude-opus-4-7", False) entry |
✅ Correct cleanup |
Open Inline Suggestions
Two prior inline suggestions proposed (1) adding an explicit False entry for claude-opus-4-8 in the extended-thinking parametrize list and (2) adding a reasoning-effort replacement test for claude-opus-4-8. The PR author explicitly declined both — that decision is reasonable:
claude-opus-4-8is already covered by the existing reasoning-effort parametrize entries via_supports_reasoning_effort.- A bespoke stripping test for a single model variant would add noise without meaningfully increasing coverage beyond what the existing sonnet-4-5/4-6 tests already provide.
No further action required on these threads.
Companion PR Dependency
OpenHands/benchmarks#729 must also be merged (revert the litellm 1.84.1 bump in the benchmarks runtime) before SWE-bench evaluations recover. This SDK revert is independently safe to merge, but benchmark runs will remain broken until both are in. Coordinate merge order accordingly.
Verdict
Approve. Minimal, correct revert backed by solid root-cause analysis and end-to-end verification. 161 tests pass. No blocking concerns.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — correct, minimal revert. Prior automated reviews have been comprehensive; this review confirms the current state and addresses open threads.
What the PR Does
Reverts #3427, which incorrectly added claude-opus-4-8 to EXTENDED_THINKING_MODELS. The intent of #3427 (strip temperature/top_p) was sound, but the mechanism was wrong: that code path also injects the legacy thinking.type.enabled block, which Anthropic now rejects for claude-opus-4-8 with HTTP 400. The correct path for claude-opus-4-8 is the reasoning-effort branch, which already handles parameter stripping without the side-effect of injecting the legacy thinking schema.
Change Quality
| File | Change | Assessment |
|---|---|---|
model_features.py |
Remove claude-opus-4-8 from EXTENDED_THINKING_MODELS |
✅ Correct — opus-4-8 routes through the reasoning-effort path |
test_chat_options.py |
Remove test_claude_opus_4_8_strips_temp_and_top_p |
✅ Correct — the removed test validated behaviour through the wrong code path |
test_model_features.py |
Remove opus-4-8 entries from @pytest.mark.parametrize and drop the stale ("claude-opus-4-7", False) entry |
✅ Correct cleanup |
Open Inline Suggestions
Two prior inline suggestions proposed (1) adding an explicit False entry for claude-opus-4-8 in the extended-thinking parametrize list and (2) adding a reasoning-effort replacement test for claude-opus-4-8. The PR author explicitly declined both — that decision is reasonable:
claude-opus-4-8is already implicitly covered by the reasoning-effort parametrize entries via_supports_reasoning_effort.- A bespoke stripping test for a single model variant would add noise without meaningfully increasing coverage beyond what the existing sonnet-4-5/4-6 tests already provide for the reasoning-effort path.
No further action required on these threads.
Companion PR Dependency
OpenHands/benchmarks#729 must also be merged (revert the litellm 1.84.1 bump in the benchmarks runtime) before SWE-bench evaluations fully recover. This SDK revert is independently safe to merge, but benchmark runs will remain broken until both are in. Coordinate merge order accordingly.
Verdict
Approve. Minimal, correct revert backed by solid root-cause analysis and end-to-end verification (161 passed). No blocking concerns.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
thank you can you press approve now then? |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review
LGTM — correct, minimal revert of a production-breaking regression.
Prior automated reviews have comprehensively covered this change. This review confirms the current state and adds no new blocking concerns.
What the diff does
| File | Change |
|---|---|
openhands-sdk/openhands/sdk/llm/utils/model_features.py |
Removes "claude-opus-4-8" from EXTENDED_THINKING_MODELS — the root cause of the regression |
tests/sdk/llm/test_chat_options.py |
Removes test_claude_opus_4_8_strips_temp_and_top_p, which validated the now-reverted broken behavior |
tests/sdk/llm/test_model_features.py |
Removes claude-opus-4-8 parametrize entries and the stale ("claude-opus-4-7", False) entry |
Why this is correct
EXTENDED_THINKING_MODELS membership causes chat_options.py to unconditionally inject the legacy thinking.type.enabled block and anthropic-beta: interleaved-thinking-2025-05-14 header — both of which Anthropic now rejects for claude-opus-4-8 with HTTP 400. After this revert, claude-opus-4-8 routes through the supports_reasoning_effort branch (via LiteLLM capability detection), which strips temperature/top_p without injecting the legacy thinking schema. This is identical to how claude-opus-4-5/4-6/4-7 are handled. The PR's end-to-end verification (161 passed) confirms bit-for-bit identical output.
Inline suggestions
The two previously suggested additions (explicit False entry for claude-opus-4-8 in extended-thinking tests; replacement reasoning-effort test for opus-4-8) have been explicitly declined by the author. That is reasonable — claude-opus-4-8 is already covered by the existing reasoning-effort parametrize entries and test suite.
Merge dependency
Coordinate with OpenHands/benchmarks#729 before running evals. Without that companion revert, the benchmarks runtime's LiteLLM proxy rejects reasoning_effort upfront (UnsupportedParamsError) before the request reaches Anthropic.
Risk: Low
Pure deletion (-28 lines, no additions). No new logic, no behavior changes for any other model. All CI checks pass.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
enyst
left a comment
There was a problem hiding this comment.
Thank you for the investigation!
Reverts #3427.
Supersedes (and closes) #3440, which tried to surgically skip the legacy thinking block for opus-4-8 — turns out that's unnecessary; the right fix is just to undo #3427.
Why #3427 caused the regression
#3427 added
claude-opus-4-8toEXTENDED_THINKING_MODELSpurely to inherit that path'stemperature/top_pstripping. But the same branch inchat_options.pyalso emits the legacy thinking block:Anthropic now rejects that schema for
claude-opus-4-8:Result: SWE-bench eval run
26641824100errored on every single instance,output.jsonlstayed empty, and the dashboard read "0% Critic 1 acceptance" (which was misleading — the critic never ran; the LLM call itself 400'd).Why the stripping was never necessary
opus-4-8 is already covered by the existing reasoning-model branch in
chat_options.py, the same one that handles opus-4-5/4-6/4-7. With the SDK's pinned litellm (>= 1.84.1):So
get_features("claude-opus-4-8").supports_reasoning_effortisTrue, and the reasoning branch:does exactly what #3427 was trying to achieve, without the side effect of injecting the legacy thinking block.
End-to-end verification (post-revert)
Calling
select_chat_optionswithreasoning_effort="high",extended_thinking_budget=200000,temperature=0.0,top_p=1.0for both models:Bit-for-bit identical, which is the desired behavior — opus-4-8 now goes through the same path as opus-4-7.
pytest tests/sdk/llm/test_chat_options.py tests/sdk/llm/test_model_features.py:161 passed.Companion PR
OpenHands/benchmarks#729 — revert the litellm 1.84.1 bump in the benchmarks runtime so the proxy forwards
reasoning_effortto Anthropic unmodified. Without it the proxy's own litellm rejects the param upfront (UnsupportedParamsError) before Anthropic ever sees the request.Follow-ups (out of scope here)
output.jsonl == 0andoutput_errors.jsonl > 0, surface "all completed instances errored" rather than "0% critic acceptance" so this failure mode isn't mistaken for a critic problem in the future.This PR was opened by an AI agent (OpenHands) on behalf of @juanmichelini.
@juanmichelini can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:432e431-pythonRun
All tags pushed for this build
About Multi-Architecture Support
432e431-python) is a multi-arch manifest supporting both amd64 and arm64432e431-python-amd64) are also available if needed