Revert "fix(llm): strip temperature/top_p for claude-opus-4-8 (#3427)" by juanmichelini · Pull Request #3441 · OpenHands/software-agent-sdk

juanmichelini · 2026-05-29T16:51:50Z

Reverts #3427.

Supersedes (and closes) #3440, which tried to surgically skip the legacy thinking block for opus-4-8 — turns out that's unnecessary; the right fix is just to undo #3427.

Why #3427 caused the regression

#3427 added claude-opus-4-8 to EXTENDED_THINKING_MODELS purely to inherit that path's temperature/top_p stripping. But the same branch in chat_options.py also emits the legacy thinking block:

out["thinking"] = {"type": "enabled", "budget_tokens": N}
out["extra_headers"]["anthropic-beta"] = "interleaved-thinking-2025-05-14"

Anthropic now rejects that schema for claude-opus-4-8:

invalid_request_error: "thinking.type.enabled" is not supported for this model.
Use "thinking.type.adaptive" and "output_config.effort" to control thinking behavior.
request_id: req_011CbX18CQeQxYJ99w7nUorj

Result: SWE-bench eval run 26641824100 errored on every single instance, output.jsonl stayed empty, and the dashboard read "0% Critic 1 acceptance" (which was misleading — the critic never ran; the LLM call itself 400'd).

Why the stripping was never necessary

opus-4-8 is already covered by the existing reasoning-model branch in chat_options.py, the same one that handles opus-4-5/4-6/4-7. With the SDK's pinned litellm (>= 1.84.1):

>>> from litellm import get_supported_openai_params
>>> "reasoning_effort" in get_supported_openai_params(model="anthropic/claude-opus-4-8")
True
>>> "reasoning_effort" in get_supported_openai_params(model="anthropic/claude-opus-4-7")
True

So get_features("claude-opus-4-8").supports_reasoning_effort is True, and the reasoning branch:

if supports_reasoning_effort:
    if llm.reasoning_effort is not None:
        out["reasoning_effort"] = llm.reasoning_effort
    if "gemini" not in llm.model.lower():
        out.pop("temperature", None)
        out.pop("top_p", None)

does exactly what #3427 was trying to achieve, without the side effect of injecting the legacy thinking block.

End-to-end verification (post-revert)

Calling select_chat_options with reasoning_effort="high", extended_thinking_budget=200000, temperature=0.0, top_p=1.0 for both models:

--- litellm_proxy/anthropic/claude-opus-4-7 ---
  temperature: <absent>
  top_p: <absent>
  reasoning_effort: high
  thinking: <absent>
  extra_headers: <absent>
--- litellm_proxy/anthropic/claude-opus-4-8 ---
  temperature: <absent>
  top_p: <absent>
  reasoning_effort: high
  thinking: <absent>
  extra_headers: <absent>

Bit-for-bit identical, which is the desired behavior — opus-4-8 now goes through the same path as opus-4-7. pytest tests/sdk/llm/test_chat_options.py tests/sdk/llm/test_model_features.py: 161 passed.

Companion PR

OpenHands/benchmarks#729 — revert the litellm 1.84.1 bump in the benchmarks runtime so the proxy forwards reasoning_effort to Anthropic unmodified. Without it the proxy's own litellm rejects the param upfront (UnsupportedParamsError) before Anthropic ever sees the request.

Follow-ups (out of scope here)

Eval-monitor UX: when output.jsonl == 0 and output_errors.jsonl > 0, surface "all completed instances errored" rather than "0% critic acceptance" so this failure mode isn't mistaken for a critic problem in the future.

This PR was opened by an AI agent (OpenHands) on behalf of @juanmichelini.

@juanmichelini can click here to continue refining the PR

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:432e431-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-432e431-python \
  ghcr.io/openhands/agent-server:432e431-python

All tags pushed for this build

ghcr.io/openhands/agent-server:432e431-golang-amd64
ghcr.io/openhands/agent-server:432e431733f78577fb35d506c3010f5385e01390-golang-amd64
ghcr.io/openhands/agent-server:revert-3427-opus-4-8-extended-thinking-golang-amd64
ghcr.io/openhands/agent-server:432e431-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:432e431-golang-arm64
ghcr.io/openhands/agent-server:432e431733f78577fb35d506c3010f5385e01390-golang-arm64
ghcr.io/openhands/agent-server:revert-3427-opus-4-8-extended-thinking-golang-arm64
ghcr.io/openhands/agent-server:432e431-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:432e431-java-amd64
ghcr.io/openhands/agent-server:432e431733f78577fb35d506c3010f5385e01390-java-amd64
ghcr.io/openhands/agent-server:revert-3427-opus-4-8-extended-thinking-java-amd64
ghcr.io/openhands/agent-server:432e431-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:432e431-java-arm64
ghcr.io/openhands/agent-server:432e431733f78577fb35d506c3010f5385e01390-java-arm64
ghcr.io/openhands/agent-server:revert-3427-opus-4-8-extended-thinking-java-arm64
ghcr.io/openhands/agent-server:432e431-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:432e431-python-amd64
ghcr.io/openhands/agent-server:432e431733f78577fb35d506c3010f5385e01390-python-amd64
ghcr.io/openhands/agent-server:revert-3427-opus-4-8-extended-thinking-python-amd64
ghcr.io/openhands/agent-server:432e431-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:432e431-python-arm64
ghcr.io/openhands/agent-server:432e431733f78577fb35d506c3010f5385e01390-python-arm64
ghcr.io/openhands/agent-server:revert-3427-opus-4-8-extended-thinking-python-arm64
ghcr.io/openhands/agent-server:432e431-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:432e431-golang
ghcr.io/openhands/agent-server:432e431733f78577fb35d506c3010f5385e01390-golang
ghcr.io/openhands/agent-server:revert-3427-opus-4-8-extended-thinking-golang
ghcr.io/openhands/agent-server:432e431-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:432e431-java
ghcr.io/openhands/agent-server:432e431733f78577fb35d506c3010f5385e01390-java
ghcr.io/openhands/agent-server:revert-3427-opus-4-8-extended-thinking-java
ghcr.io/openhands/agent-server:432e431-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:432e431-python
ghcr.io/openhands/agent-server:432e431733f78577fb35d506c3010f5385e01390-python
ghcr.io/openhands/agent-server:revert-3427-opus-4-8-extended-thinking-python
ghcr.io/openhands/agent-server:432e431-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., 432e431-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 432e431-python-amd64) are also available if needed

This reverts commit 45697f5. #3427 routed claude-opus-4-8 through the EXTENDED_THINKING_MODELS path to strip temperature/top_p, but that path also emits the legacy `thinking: {"type": "enabled", "budget_tokens": N}` block plus the `anthropic-beta: interleaved-thinking-2025-05-14` header. Anthropic now rejects that schema for opus-4-8: invalid_request_error: "thinking.type.enabled" is not supported for this model. Use "thinking.type.adaptive" and "output_config.effort" to control thinking behavior. The stripping was never necessary in the first place. litellm (>= 1.84.1, the SDK pin) already reports opus-4-8 as supporting `reasoning_effort`, so `get_features("claude-opus-4-8").supports_reasoning_effort` is True, and the reasoning-model branch in `chat_options.py` already strips `temperature`/`top_p` and forwards `reasoning_effort` -- which is exactly how opus-4-5/4-6/4-7 are handled. Reverting #3427 makes opus-4-8 take the same path. Verified end-to-end: select_chat_options(opus-4-7) and select_chat_options(opus-4-8) both produce: temperature absent, top_p absent, reasoning_effort forwarded, no `thinking` block, no `anthropic-beta` header. Companion benchmarks revert: OpenHands/benchmarks#729 (so the proxy forwards `reasoning_effort` to Anthropic unmodified). Failing eval evidence: SWE-bench run 26641824100 (request_id req_011CbX18CQeQxYJ99w7nUorj) -- every instance errored with the schema mismatch above, leaving output.jsonl empty and 380+ instances in output_errors.jsonl. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-05-29T16:52:23Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-29T16:52:31Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-29T16:54:46Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/llm/utils
model_features.py	58	1	98%	35
TOTAL	28666	8308	71%

all-hands-bot · 2026-05-29T17:32:31Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Summary

This is a clean, well-reasoned revert of #3427 that correctly fixes a critical regression. The PR description's root-cause analysis is excellent: adding claude-opus-4-8 to EXTENDED_THINKING_MODELS was meant to inherit the temperature/top_p stripping logic, but that same code path also injects the legacy thinking block (thinking.type.enabled + anthropic-beta header) that Anthropic now rejects for this model. The fix is correct — claude-opus-4-8 already goes through the reasoning-effort path via LiteLLM capability detection, which strips temp/top_p without injecting the thinking block. Two follow-up suggestions inline — neither blocks merge.

Coordination note: The companion benchmarks PR (OpenHands/benchmarks#729) should land alongside or before this one to avoid a window where the litellm proxy rejects reasoning_effort upfront.

Draft status: Given the urgency (100% SWE-bench evaluation error rate on every instance), consider un-drafting once CI goes green.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot · 2026-05-29T17:43:27Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review

This is a correct, minimal revert that directly addresses a critical regression introduced by #3427. The root-cause analysis in the PR description is thorough and accurate.

What Changed and Why It's Correct

model_features.py — Removes claude-opus-4-8 from EXTENDED_THINKING_MODELS. The original #3427 added it there solely to inherit the temperature/top_p stripping logic, but the EXTENDED_THINKING_MODELS branch in chat_options.py also unconditionally injects the legacy thinking.type.enabled block and anthropic-beta: interleaved-thinking-2025-05-14 header. Anthropic now rejects this schema for claude-opus-4-8 with a 400 error ("thinking.type.enabled" is not supported for this model), causing 100% eval failures.

With this revert, claude-opus-4-8 routes through the supports_reasoning_effort branch instead (via _normalized_supported_openai_params → LiteLLM capability detection), which strips temperature/top_p and sets reasoning_effort without injecting the legacy thinking block. This is the same path used for claude-opus-4-5/4-6/4-7, which is exactly the correct behavior.

Test changes — The removed tests (test_claude_opus_4_8_strips_temp_and_top_p and the EXTENDED_THINKING_MODELS parametrize entries) were asserting wrong behavior from #3427. Their removal is correct.

CI Status

All relevant checks pass: sdk-tests, coverage-report, pre-commit, tools-tests, agent-server-tests, cross-tests, workspace-tests, and the build matrix (Linux/macOS/Windows). The qa-changes failure is the OpenHands review automation, not a test regression.

Minor Observation

The diff also removes the ("claude-opus-4-7", False) parametrize entry from test_extended_thinking_support. This entry was added by #3427 as a new test and its removal reduces a small amount of explicit coverage (though the behavior is still correct since claude-opus-4-7 is not in EXTENDED_THINKING_MODELS). This is a cosmetic concern, not a correctness issue.

Coordination Note

The companion benchmarks PR (OpenHands/benchmarks#729) should land alongside or before this to avoid a window where the LiteLLM proxy rejects reasoning_effort before Anthropic sees the request.

Summary

The change is correct, well-scoped, and well-documented. The CI suite passes. Ready to merge once the companion benchmarks PR is coordinated.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot · 2026-05-29T18:26:56Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review

LGTM — correct, minimal revert that fixes a critical regression. This is a clean rollback of #3427 with a solid root-cause analysis in the PR description.

Analysis

Root cause confirmed. #3427 added claude-opus-4-8 to EXTENDED_THINKING_MODELS to inherit the temperature/top_p stripping logic. Unfortunately, that branch in chat_options.py unconditionally injects the legacy thinking.type.enabled block + anthropic-beta: interleaved-thinking-2025-05-14 header — which Anthropic now rejects for this model with a 400 error, causing 100% eval failure rate.

The fix is correct. claude-opus-4-8 is already handled by _supports_reasoning_effort via LiteLLM capability detection (get_supported_openai_params returns reasoning_effort for it). That path strips temperature/top_p without injecting the legacy thinking block — identical to how claude-opus-4-5/4-6/4-7 are handled.

Test deletions are correct. The removed tests (test_claude_opus_4_8_strips_temp_and_top_p and the extended-thinking parametrize entries for opus-4-8) were asserting the now-reverted wrong behavior. The diff also removes the ("claude-opus-4-7", False) entry from test_extended_thinking_support — that entry was added by #3427 as extra coverage; its removal with the revert is reasonable since claude-opus-4-7 is still not in EXTENDED_THINKING_MODELS and the existing test matrix implicitly covers this.

Risk Assessment: Low (with one coordination dependency)

The change is confined to removing a single list entry and its associated tests. No new logic is introduced. The CI suite (sdk-tests, tools-tests, agent-server-tests, cross-tests, workspace-tests) passes.

Coordination Dependency

The companion PR OpenHands/benchmarks#729 (reverting the litellm 1.84.1 bump in the benchmarks runtime) must land alongside or before this PR. Without it, the proxy's litellm rejects reasoning_effort with UnsupportedParamsError before Anthropic ever sees the request — different failure mode, same broken evals.

Verdict

Ready to merge once the benchmarks companion PR is coordinated. No blocking issues.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

juanmichelini · 2026-05-29T18:30:58Z

Ready to merge now, current main blocks evals, companion PR can wait.

all-hands-bot · 2026-05-29T18:31:59Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review

LGTM — correct, minimal fix for a production-breaking regression. Previous reviews have covered the root cause thoroughly; this review focuses on aspects not yet addressed.

What the change does

Removes claude-opus-4-8 from EXTENDED_THINKING_MODELS. That list membership caused chat_options.py to inject the legacy thinking.type.enabled block and anthropic-beta: interleaved-thinking-2025-05-14 header — both of which Anthropic now rejects for this model with a 400 error. After the revert, claude-opus-4-8 routes through the _supports_reasoning_effort → LiteLLM branch, which strips temperature/top_p without injecting the thinking block (identical to the claude-opus-4-5/4-6/4-7 path).

One latent fragility worth tracking

_supports_reasoning_effort now depends entirely on litellm.get_supported_openai_params at runtime — no hardcoded fallback. If a future LiteLLM release reclassifies claude-opus-4-8 (e.g., drops reasoning_effort from its capability set), the model will silently stop stripping temperature/top_p without any test catching it. This isn't a blocker for this PR (the companion benchmarks PR pins a known-good LiteLLM version), but it's worth keeping in mind when the LiteLLM dependency is next bumped: re-run the eval smoke test or add a lightweight integration check.

`PROMPT_CACHE_MODELS` is unaffected

claude-opus-4-8 correctly remains in PROMPT_CACHE_MODELS — prompt caching and extended thinking are orthogonal concerns, and this list should stay unchanged.

CI

All required checks pass (sdk-tests, coverage-report, pre-commit, tools-tests, agent-server-tests, cross-tests, workspace-tests). The revert is safe to merge once the companion benchmarks PR (OpenHands/benchmarks#729) is coordinated.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

juanmichelini · 2026-05-29T18:37:25Z

Good let's ship it now so we can continue with evals!

all-hands-bot · 2026-05-29T18:38:30Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review

LGTM — correct, minimal revert that resolves a production-breaking regression.

Previous automated reviews have covered the root-cause analysis thoroughly. This review focuses on the current state of the PR and confirms readiness for merge.

Change Summary

The diff is a clean revert of #3427 across three files:

File	Change
`openhands-sdk/openhands/sdk/llm/utils/model_features.py`	Removes `"claude-opus-4-8"` from `EXTENDED_THINKING_MODELS` (-5 lines)
`tests/sdk/llm/test_chat_options.py`	Removes `test_claude_opus_4_8_strips_temp_and_top_p` (-19 lines)
`tests/sdk/llm/test_model_features.py`	Removes the three `claude-opus-4-8` and one `claude-opus-4-7` parametrize entries (-4 lines)

No new code is introduced — this is a pure deletion, which is the safest possible change profile.

Correctness

The revert is correct. Placing claude-opus-4-8 in EXTENDED_THINKING_MODELS caused chat_options.py to unconditionally emit thinking.type.enabled + the anthropic-beta: interleaved-thinking-2025-05-14 header — both of which Anthropic now rejects for this model (HTTP 400, "thinking.type.enabled" is not supported for this model). The model is already handled by the supports_reasoning_effort branch via LiteLLM capability detection, which strips temperature/top_p without injecting the legacy thinking block. The PR description's end-to-end verification and 161 passed test run confirm bit-for-bit identical output between claude-opus-4-7 and claude-opus-4-8 after the revert.

Inline Review Thread Status

The two previous inline suggestions (adding an explicit False entry for claude-opus-4-8 in extended-thinking tests, and a replacement test for the reasoning-effort path) have been explicitly declined by @juanmichelini. Those threads have no outstanding blockers.

CI Status

All required checks pass: sdk-tests, coverage-report, pre-commit, tools-tests, agent-server-tests, cross-tests, workspace-tests, build-binary-and-test (Linux/macOS/Windows). The two qa-changes failures are OpenHands review automation, not test regressions. Docker image builds are in progress (non-blocking for merge).

Remaining Coordination Dependency

The companion PR OpenHands/benchmarks#729 must land alongside or before this one. Without it the LiteLLM proxy in the benchmarks runtime rejects reasoning_effort upfront with UnsupportedParamsError, which is a different failure mode but the same broken eval outcome.

Risk: Low

Scoped to removing one list entry and its associated tests. No new logic, no behavioural changes for any other model, and a comprehensive test suite confirms the desired routing.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

Code Review

LGTM — correct, minimal revert that resolves a production-breaking regression.

Previous automated reviews have covered the root-cause analysis thoroughly. This review focuses on the current state of the PR and confirms readiness for merge.

Change Summary

The diff is a clean revert of #3427 across three files:

File	Change
`openhands-sdk/openhands/sdk/llm/utils/model_features.py`	Removes `"claude-opus-4-8"` from `EXTENDED_THINKING_MODELS` (-5 lines)
`tests/sdk/llm/test_chat_options.py`	Removes `test_claude_opus_4_8_strips_temp_and_top_p` (-19 lines)
`tests/sdk/llm/test_model_features.py`	Removes the three `claude-opus-4-8` and one `claude-opus-4-7` parametrize entries (-4 lines)

No new code is introduced — this is a pure deletion, which is the safest possible change profile.

Correctness

The revert is correct. Placing claude-opus-4-8 in EXTENDED_THINKING_MODELS caused chat_options.py to unconditionally emit thinking.type.enabled + the anthropic-beta: interleaved-thinking-2025-05-14 header — both of which Anthropic now rejects for this model (HTTP 400, "thinking.type.enabled" is not supported for this model). The model is already handled by the supports_reasoning_effort branch via LiteLLM capability detection, which strips temperature/top_p without injecting the legacy thinking block. The PR description's end-to-end verification and 161 passed test run confirm bit-for-bit identical output between claude-opus-4-7 and claude-opus-4-8 after the revert.

Inline Review Thread Status

The two previous inline suggestions (adding an explicit False entry for claude-opus-4-8 in extended-thinking tests, and a replacement test for the reasoning-effort path) have been explicitly declined by @juanmichelini. Those threads have no outstanding blockers.

CI Status

All required checks pass: sdk-tests, coverage-report, pre-commit, tools-tests, agent-server-tests, cross-tests, workspace-tests, build-binary-and-test (Linux/macOS/Windows). The two qa-changes failures are OpenHands review automation, not test regressions. Docker image builds are in progress (non-blocking for merge).

Remaining Coordination Dependency

The companion PR OpenHands/benchmarks#729 must land alongside or before this one. Without it the LiteLLM proxy in the benchmarks runtime rejects reasoning_effort upfront with UnsupportedParamsError, which is a different failure mode but the same broken eval outcome.

Risk: Low

Scoped to removing one list entry and its associated tests. No new logic, no behavioural changes for any other model, and a comprehensive test suite confirms the desired routing.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot · 2026-05-29T18:42:01Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review

LGTM — ready to merge pending companion PR coordination.

Previous automated reviews on this PR have thoroughly analyzed the root cause, correctness, and test coverage. This review confirms no new issues were found on the current state of the diff.

Summary of Change

Pure deletion across three files (-28 lines total, no additions):

model_features.py — removes "claude-opus-4-8" from EXTENDED_THINKING_MODELS. This was the root cause of the regression: membership in that list caused chat_options.py to unconditionally inject the legacy thinking.type.enabled block and anthropic-beta: interleaved-thinking-2025-05-14 header, both of which Anthropic rejects with a 400 error for this model.
test_chat_options.py — removes the now-incorrect test_claude_opus_4_8_strips_temp_and_top_p test that validated the broken behavior.
test_model_features.py — removes the three claude-opus-4-8 parametrize entries and one claude-opus-4-7 entry from reasoning-effort support tests.

Verdict

The revert is correct and minimal. claude-opus-4-8 is already handled by the supports_reasoning_effort branch (via LiteLLM's get_supported_openai_params), which strips temperature/top_p without injecting the legacy thinking schema — identical to the claude-opus-4-5/4-6/4-7 path. The PR's end-to-end verification confirms bit-for-bit identical output.

Merge Dependency

Coordinate with OpenHands/benchmarks#729 — without that companion revert, the benchmarks runtime's LiteLLM proxy rejects reasoning_effort upfront before Anthropic ever sees the request.

Risk: Low

No new logic, no behavior changes for any other model. All CI checks pass.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

juanmichelini · 2026-05-29T18:47:13Z

Need approval first to coordinate merge. Approve

all-hands-bot · 2026-05-29T18:48:12Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review

LGTM — correct, focused revert. Safe to merge pending companion PR.

Prior automated reviews have thoroughly covered the root cause and correctness. This review confirms the current state of the diff and summarises what is relevant for the merge decision.

What the diff does

File	Change
`openhands-sdk/openhands/sdk/llm/utils/model_features.py`	Removes `"claude-opus-4-8"` from `EXTENDED_THINKING_MODELS` (the root cause)
`tests/sdk/llm/test_chat_options.py`	Removes `test_claude_opus_4_8_strips_temp_and_top_p`, which validated the now-reverted broken behaviour
`tests/sdk/llm/test_model_features.py`	Removes `claude-opus-4-8` entries and the stale `("claude-opus-4-7", False)` parametrize entry

Why the removal is correct

EXTENDED_THINKING_MODELS membership causes chat_options.py to unconditionally inject thinking.type.enabled + anthropic-beta: interleaved-thinking-2025-05-14. Anthropic now rejects that schema for claude-opus-4-8 with a 400. claude-opus-4-8 is already handled by the supports_reasoning_effort branch (same as claude-opus-4-5/4-6/4-7), which strips temperature/top_p without injecting the legacy thinking block — the PR's end-to-end verification confirms bit-for-bit identical output vs. claude-opus-4-7.

Test coverage note (non-blocking)

Prior inline suggestions to add new test cases for claude-opus-4-8 have been reviewed and explicitly declined by the author. The 161 passing CI tests confirm the post-revert state is correct. Any future regressions here would be caught by the SWE-bench eval pipeline.

Risk

Low. Pure deletion — no new logic, no behaviour changes to any other model. All CI checks pass.

Merge dependency

Coordinate with OpenHands/benchmarks#729 before running evals: without that companion revert, the benchmarks runtime's LiteLLM proxy rejects reasoning_effort upfront (via UnsupportedParamsError) before the request reaches Anthropic.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

Code Review

LGTM — correct, focused revert. Safe to merge pending companion PR.

Prior automated reviews have thoroughly covered the root cause and correctness. This review confirms the current state of the diff and summarises what is relevant for the merge decision.

What the diff does

File	Change
`openhands-sdk/openhands/sdk/llm/utils/model_features.py`	Removes `"claude-opus-4-8"` from `EXTENDED_THINKING_MODELS` (the root cause)
`tests/sdk/llm/test_chat_options.py`	Removes `test_claude_opus_4_8_strips_temp_and_top_p`, which validated the now-reverted broken behaviour
`tests/sdk/llm/test_model_features.py`	Removes `claude-opus-4-8` entries and the stale `("claude-opus-4-7", False)` parametrize entry

Why the removal is correct

EXTENDED_THINKING_MODELS membership causes chat_options.py to unconditionally inject thinking.type.enabled + anthropic-beta: interleaved-thinking-2025-05-14. Anthropic now rejects that schema for claude-opus-4-8 with a 400. claude-opus-4-8 is already handled by the supports_reasoning_effort branch (same as claude-opus-4-5/4-6/4-7), which strips temperature/top_p without injecting the legacy thinking block — the PR's end-to-end verification confirms bit-for-bit identical output vs. claude-opus-4-7.

Test coverage note (non-blocking)

Prior inline suggestions to add new test cases for claude-opus-4-8 have been reviewed and explicitly declined by the author. The 161 passing CI tests confirm the post-revert state is correct. Any future regressions here would be caught by the SWE-bench eval pipeline.

Risk

Low. Pure deletion — no new logic, no behaviour changes to any other model. All CI checks pass.

Merge dependency

Coordinate with OpenHands/benchmarks#729 before running evals: without that companion revert, the benchmarks runtime's LiteLLM proxy rejects reasoning_effort upfront (via UnsupportedParamsError) before the request reaches Anthropic.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

juanmichelini · 2026-05-29T18:53:01Z

@all-hands-bot There is no Merge dependency.
Both PRs need to be merged, that's it. So we need to merge this one. Do you realize that?

all-hands-bot · 2026-05-29T18:54:05Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review

LGTM — this is a correct, minimal revert. Prior reviews have been comprehensive; this review focuses on confirming the current state and any open items.

What the PR Does

Reverts #3427, which incorrectly added claude-opus-4-8 to EXTENDED_THINKING_MODELS. The intent of #3427 (strip temperature/top_p) was sound, but the mechanism was wrong: that code path also injects the legacy thinking.type.enabled block, which Anthropic now rejects for opus-4-8 with HTTP 400. The correct path for opus-4-8 is the reasoning-effort branch, which already strips those params.

Change Quality

File	Change	Assessment
`model_features.py`	Remove `claude-opus-4-8` from `EXTENDED_THINKING_MODELS`	✅ Correct — opus-4-8 belongs to the reasoning-effort path
`test_chat_options.py`	Remove `test_claude_opus_4_8_strips_temp_and_top_p`	✅ Correct — test was validating behaviour through the wrong code path
`test_model_features.py`	Remove `claude-opus-4-8` from `@pytest.mark.parametrize` and drop the stale `("claude-opus-4-7", False)` entry	✅ Correct cleanup

Open Inline Suggestions

Two previous inline suggestions were made (add an explicit False entry for claude-opus-4-8 in the extended-thinking parametrize list, and add a replacement reasoning-effort test for opus-4-8). The PR author explicitly declined both. I agree with the author's decision:

claude-opus-4-8 is already implicitly covered by the reasoning-effort parametrize set (the test iterates over _supports_reasoning_effort).
Adding a redundant entry or a bespoke stripping test for a single model variant adds noise without meaningfully increasing coverage.

Companion PR Dependency

The PR description calls out OpenHands/benchmarks#729 as a required companion (revert the litellm 1.84.1 bump in the benchmarks runtime). This SDK revert is safe to merge on its own, but benchmark evaluation will remain broken until both are merged and the eval runtime is rebuilt. Coordinate the merge order accordingly.

Verdict

Approve. The revert is minimal, well-motivated, and backed by end-to-end verification. No blocking concerns.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

Code Review

LGTM — correct, minimal revert. Prior automated reviews have been comprehensive; this review confirms the current state and addresses open threads.

What the PR Does

Reverts #3427, which incorrectly added claude-opus-4-8 to EXTENDED_THINKING_MODELS. The intent of #3427 (strip temperature/top_p) was sound, but the mechanism was wrong: that code path also injects the legacy thinking.type.enabled block, which Anthropic now rejects for claude-opus-4-8 with HTTP 400. The correct path for claude-opus-4-8 is the reasoning-effort branch, which already handles parameter stripping.

Change Quality

File	Change	Assessment
`model_features.py`	Remove `claude-opus-4-8` from `EXTENDED_THINKING_MODELS`	✅ Correct — opus-4-8 routes through the reasoning-effort path
`test_chat_options.py`	Remove `test_claude_opus_4_8_strips_temp_and_top_p`	✅ Correct — the removed test validated behaviour through the wrong code path
`test_model_features.py`	Remove opus-4-8 entries from `@pytest.mark.parametrize` and drop the stale `("claude-opus-4-7", False)` entry	✅ Correct cleanup

Open Inline Suggestions

Two prior inline suggestions proposed (1) adding an explicit False entry for claude-opus-4-8 in the extended-thinking parametrize list and (2) adding a reasoning-effort replacement test for claude-opus-4-8. The PR author explicitly declined both — that decision is reasonable:

claude-opus-4-8 is already covered by the existing reasoning-effort parametrize entries via _supports_reasoning_effort.
A bespoke stripping test for a single model variant would add noise without meaningfully increasing coverage beyond what the existing sonnet-4-5/4-6 tests already provide.

No further action required on these threads.

Companion PR Dependency

OpenHands/benchmarks#729 must also be merged (revert the litellm 1.84.1 bump in the benchmarks runtime) before SWE-bench evaluations recover. This SDK revert is independently safe to merge, but benchmark runs will remain broken until both are in. Coordinate merge order accordingly.

Verdict

Approve. Minimal, correct revert backed by solid root-cause analysis and end-to-end verification. 161 tests pass. No blocking concerns.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

Code Review

LGTM — correct, minimal revert. Prior automated reviews have been comprehensive; this review confirms the current state and addresses open threads.

What the PR Does

Reverts #3427, which incorrectly added claude-opus-4-8 to EXTENDED_THINKING_MODELS. The intent of #3427 (strip temperature/top_p) was sound, but the mechanism was wrong: that code path also injects the legacy thinking.type.enabled block, which Anthropic now rejects for claude-opus-4-8 with HTTP 400. The correct path for claude-opus-4-8 is the reasoning-effort branch, which already handles parameter stripping without the side-effect of injecting the legacy thinking schema.

Change Quality

File	Change	Assessment
`model_features.py`	Remove `claude-opus-4-8` from `EXTENDED_THINKING_MODELS`	✅ Correct — opus-4-8 routes through the reasoning-effort path
`test_chat_options.py`	Remove `test_claude_opus_4_8_strips_temp_and_top_p`	✅ Correct — the removed test validated behaviour through the wrong code path
`test_model_features.py`	Remove opus-4-8 entries from `@pytest.mark.parametrize` and drop the stale `("claude-opus-4-7", False)` entry	✅ Correct cleanup

Open Inline Suggestions

Two prior inline suggestions proposed (1) adding an explicit False entry for claude-opus-4-8 in the extended-thinking parametrize list and (2) adding a reasoning-effort replacement test for claude-opus-4-8. The PR author explicitly declined both — that decision is reasonable:

claude-opus-4-8 is already implicitly covered by the reasoning-effort parametrize entries via _supports_reasoning_effort.
A bespoke stripping test for a single model variant would add noise without meaningfully increasing coverage beyond what the existing sonnet-4-5/4-6 tests already provide for the reasoning-effort path.

No further action required on these threads.

Companion PR Dependency

OpenHands/benchmarks#729 must also be merged (revert the litellm 1.84.1 bump in the benchmarks runtime) before SWE-bench evaluations fully recover. This SDK revert is independently safe to merge, but benchmark runs will remain broken until both are in. Coordinate merge order accordingly.

Verdict

Approve. Minimal, correct revert backed by solid root-cause analysis and end-to-end verification (161 passed). No blocking concerns.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

juanmichelini · 2026-05-29T19:00:21Z

Verdict
Approve. Minimal, correct revert backed by solid root-cause analysis and end-to-end verification (161 passed). No blocking concerns.

thank you can you press approve now then?

all-hands-bot · 2026-05-29T19:01:13Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review

LGTM — correct, minimal revert of a production-breaking regression.

Prior automated reviews have comprehensively covered this change. This review confirms the current state and adds no new blocking concerns.

What the diff does

File	Change
`openhands-sdk/openhands/sdk/llm/utils/model_features.py`	Removes `"claude-opus-4-8"` from `EXTENDED_THINKING_MODELS` — the root cause of the regression
`tests/sdk/llm/test_chat_options.py`	Removes `test_claude_opus_4_8_strips_temp_and_top_p`, which validated the now-reverted broken behavior
`tests/sdk/llm/test_model_features.py`	Removes `claude-opus-4-8` parametrize entries and the stale `("claude-opus-4-7", False)` entry

Why this is correct

EXTENDED_THINKING_MODELS membership causes chat_options.py to unconditionally inject the legacy thinking.type.enabled block and anthropic-beta: interleaved-thinking-2025-05-14 header — both of which Anthropic now rejects for claude-opus-4-8 with HTTP 400. After this revert, claude-opus-4-8 routes through the supports_reasoning_effort branch (via LiteLLM capability detection), which strips temperature/top_p without injecting the legacy thinking schema. This is identical to how claude-opus-4-5/4-6/4-7 are handled. The PR's end-to-end verification (161 passed) confirms bit-for-bit identical output.

Inline suggestions

The two previously suggested additions (explicit False entry for claude-opus-4-8 in extended-thinking tests; replacement reasoning-effort test for opus-4-8) have been explicitly declined by the author. That is reasonable — claude-opus-4-8 is already covered by the existing reasoning-effort parametrize entries and test suite.

Merge dependency

Coordinate with OpenHands/benchmarks#729 before running evals. Without that companion revert, the benchmarks runtime's LiteLLM proxy rejects reasoning_effort upfront (UnsupportedParamsError) before the request reaches Anthropic.

Risk: Low

Pure deletion (-28 lines, no additions). No new logic, no behavior changes for any other model. All CI checks pass.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

enyst

Thank you for the investigation!

juanmichelini added bug Something isn't working lm-support labels May 29, 2026 — with OpenHands AI

juanmichelini requested a review from all-hands-bot May 29, 2026 17:31

juanmichelini marked this pull request as ready for review May 29, 2026 17:35

all-hands-bot reviewed May 29, 2026

View reviewed changes

Comment thread tests/sdk/llm/test_model_features.py

Comment thread tests/sdk/llm/test_chat_options.py

juanmichelini requested a review from all-hands-bot May 29, 2026 17:42

all-hands-bot reviewed May 29, 2026

View reviewed changes

juanmichelini requested a review from all-hands-bot May 29, 2026 18:26

all-hands-bot reviewed May 29, 2026

View reviewed changes

Merge branch 'main' into revert-3427-opus-4-8-extended-thinking

432e431

juanmichelini requested a review from all-hands-bot May 29, 2026 18:31

all-hands-bot reviewed May 29, 2026

View reviewed changes

juanmichelini requested a review from all-hands-bot May 29, 2026 18:37

all-hands-bot reviewed May 29, 2026

View reviewed changes

juanmichelini requested a review from all-hands-bot May 29, 2026 18:41

all-hands-bot reviewed May 29, 2026

View reviewed changes

juanmichelini requested a review from all-hands-bot May 29, 2026 18:47

all-hands-bot reviewed May 29, 2026

View reviewed changes

juanmichelini requested a review from all-hands-bot May 29, 2026 18:53

all-hands-bot reviewed May 29, 2026

View reviewed changes

juanmichelini requested a review from all-hands-bot May 29, 2026 19:00

all-hands-bot reviewed May 29, 2026

View reviewed changes

enyst approved these changes May 29, 2026

View reviewed changes

juanmichelini merged commit 629e09a into main May 29, 2026
55 of 61 checks passed

juanmichelini deleted the revert-3427-opus-4-8-extended-thinking branch May 29, 2026 21:47

Conversation

juanmichelini commented May 29, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why #3427 caused the regression

Why the stripping was never necessary

End-to-end verification (post-revert)

Companion PR

Follow-ups (out of scope here)

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Summary

Uh oh!

Uh oh!

Uh oh!

all-hands-bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review

What Changed and Why It's Correct

CI Status

Minor Observation

Coordination Note

Summary

Uh oh!

all-hands-bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review

Analysis

Risk Assessment: Low (with one coordination dependency)

Coordination Dependency

Verdict

Uh oh!

juanmichelini commented May 29, 2026

Uh oh!

all-hands-bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review

What the change does

One latent fragility worth tracking

PROMPT_CACHE_MODELS is unaffected

CI

Uh oh!

juanmichelini commented May 29, 2026

Uh oh!

all-hands-bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review

Change Summary

Correctness

Inline Review Thread Status

CI Status

Remaining Coordination Dependency

Risk: Low

Uh oh!

juanmichelini commented May 29, 2026 •

edited by github-actions Bot

Loading

github-actions Bot commented May 29, 2026 •

edited

Loading

github-actions Bot commented May 29, 2026 •

edited

Loading

github-actions Bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading

`PROMPT_CACHE_MODELS` is unaffected

all-hands-bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading

all-hands-bot commented May 29, 2026 •

edited

Loading