fix(sdk): keep cancelled token observable after arun() interrupt by VascoSch92 · Pull Request #3395 · OpenHands/software-agent-sdk

VascoSch92 · 2026-05-27T07:26:51Z

A human has tested these changes.

Why

Addresses B3 of #3341.

arun()'s finally cleared self._cancel_token unconditionally. On interrupt(), the in-flight asyncio.Task is cancelled (breaking the await in astep), but tool calls dispatched via run_in_executor run in worker threads that can't be force-cancelled and may outlive arun(). A tool polling conversation.cancel_token — the documented cooperative-cancellation pattern — after the finally ran saw None and lost the cancellation signal, the opposite of what interrupt() intended.

What

Clear _cancel_token in arun()'s finally only when it was not cancelled. A cancelled token stays observable; the next run()/arun() creates a fresh one, so leaving it is safe.
Updated test_interrupt_sets_cancel_token (it encoded the old behaviour) and added test_cancel_token_stays_observable_after_interrupt.

The sync run() path is not affected: interrupt() falls back to pause() there and step() runs inline, so the finally only fires after tools return.

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:e28addf-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-e28addf-python \
  ghcr.io/openhands/agent-server:e28addf-python

All tags pushed for this build

ghcr.io/openhands/agent-server:e28addf-golang-amd64
ghcr.io/openhands/agent-server:e28addf637f7562fbec28c3d7359709db2960ad7-golang-amd64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-golang-amd64
ghcr.io/openhands/agent-server:e28addf-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:e28addf-golang-arm64
ghcr.io/openhands/agent-server:e28addf637f7562fbec28c3d7359709db2960ad7-golang-arm64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-golang-arm64
ghcr.io/openhands/agent-server:e28addf-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:e28addf-java-amd64
ghcr.io/openhands/agent-server:e28addf637f7562fbec28c3d7359709db2960ad7-java-amd64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-java-amd64
ghcr.io/openhands/agent-server:e28addf-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:e28addf-java-arm64
ghcr.io/openhands/agent-server:e28addf637f7562fbec28c3d7359709db2960ad7-java-arm64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-java-arm64
ghcr.io/openhands/agent-server:e28addf-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:e28addf-python-amd64
ghcr.io/openhands/agent-server:e28addf637f7562fbec28c3d7359709db2960ad7-python-amd64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-python-amd64
ghcr.io/openhands/agent-server:e28addf-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:e28addf-python-arm64
ghcr.io/openhands/agent-server:e28addf637f7562fbec28c3d7359709db2960ad7-python-arm64
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-python-arm64
ghcr.io/openhands/agent-server:e28addf-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:e28addf-golang
ghcr.io/openhands/agent-server:e28addf637f7562fbec28c3d7359709db2960ad7-golang
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-golang
ghcr.io/openhands/agent-server:e28addf-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:e28addf-java
ghcr.io/openhands/agent-server:e28addf637f7562fbec28c3d7359709db2960ad7-java
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-java
ghcr.io/openhands/agent-server:e28addf-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:e28addf-python
ghcr.io/openhands/agent-server:e28addf637f7562fbec28c3d7359709db2960ad7-python
ghcr.io/openhands/agent-server:fix-cancel-token-after-interrupt-python
ghcr.io/openhands/agent-server:e28addf-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., e28addf-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., e28addf-python-amd64) are also available if needed

github-actions · 2026-05-27T07:27:21Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-27T07:27:34Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

all-hands-bot

Taste Rating: 🟢 Good taste — the fix is small and preserves the intended cooperative-cancellation contract. I didn’t find any blocking code issues in the diff; focused interrupt tests pass locally (uv run pytest tests/sdk/conversation/test_interrupt.py -q).

Because this touches agent interruption/tool execution behavior, I’m leaving a COMMENT rather than approval per repo eval-risk guidance; a human maintainer should decide whether lightweight eval coverage is needed.

Automated review generated by OpenHands on behalf of the requester.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM — small localized cancellation lifecycle change, but it affects agent run/interruption semantics and tool execution observability.

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26497199123

github-actions · 2026-05-27T07:29:59Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/conversation/impl
local_conversation.py	556	44	92%	309, 314, 458, 504, 573, 589, 665, 955–956, 959, 1072, 1083–1086, 1093–1094, 1097, 1103–1104, 1107, 1113, 1128, 1131, 1135–1136, 1140–1142, 1149, 1239, 1244, 1354, 1356, 1360–1361, 1372–1373, 1398, 1593, 1597, 1667, 1674–1675
TOTAL	28280	8214	70%

arun()'s finally cleared self._cancel_token unconditionally. Tool calls dispatched via run_in_executor run in worker threads that can't be force-cancelled and may outlive arun(), so a tool polling conversation.cancel_token after an interrupt saw None and lost the cancellation signal. Retain a cancelled token; the next run() replaces it.

all-hands-bot

✅ QA Report: PASS

Verified the SDK async interrupt flow with a real Conversation.arun() call: the PR keeps the cancelled token observable after interrupt and replaces it with a fresh token on the next run.

Does this PR achieve its stated goal?

Yes. The PR's goal was to prevent arun() cleanup from clearing a cancelled token before late-running worker/thread code can observe cooperative cancellation. On origin/main, the same SDK script showed after_interrupt_token_present=False and late_worker_token_present=False; on the PR commit, both became True with is_cancelled=True, and the next run received a fresh uncancelled token.

Phase	Result
Environment Setup	✅ `make build` completed successfully; dependencies installed and build marked ready.
CI Status	🟡 `gh pr checks` reported 0 failing, 19 successful, 7 pending, 6 skipped at review time.
Functional Verification	✅ Reproduced the base bug and verified the PR fixes token observability plus fresh-token behavior.

Functional Verification

Test 1: Interrupted async run keeps cancellation observable for late worker-style polling

Step 1 — Reproduce / establish baseline (without the fix):
Checked out base origin/main (58771eeb) and ran a standalone SDK script that creates an Agent with a hanging async LLM, starts Conversation.arun(), calls conversation.interrupt(), then has a separate worker-style thread poll conversation.cancel_token after arun() has unwound.

Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_cancel_token.py 2>&1 | grep -E '^(during|after|late|next)':

58771eeb (HEAD, origin/main) fix(sdk): dedupe LLMs by usage_id in _ensure_agent_ready registration loop (#3392)
during_run_token_present=True
during_run_token_cancelled=False
after_interrupt_token_present=False
after_interrupt_token_cancelled=None
late_worker_token_present=False
late_worker_token_cancelled=None
next_run_token_present=True
next_run_token_cancelled=False
next_run_token_is_same_as_retained=False

This confirms the bug exists on base: while the token exists during the run, after interrupt() and arun() cleanup a late poll sees None, so cooperative cancellation is no longer observable.

Step 2 — Apply the PR's changes:
Checked out fix/cancel-token-after-interrupt at d495eeeaab6631dbe182af8e8fbfa1e96f7965f7.

Step 3 — Re-run with the fix in place:
Ran the same command on the PR branch:

d495eeea (HEAD -> fix/cancel-token-after-interrupt, origin/fix/cancel-token-after-interrupt) fix(sdk): keep cancelled token observable after arun() interrupt
during_run_token_present=True
during_run_token_cancelled=False
after_interrupt_token_present=True
after_interrupt_token_cancelled=True
late_worker_token_present=True
late_worker_token_cancelled=True
next_run_token_present=True
next_run_token_cancelled=False
next_run_token_is_same_as_retained=False

This confirms the fix works: after interrupt, both the direct public conversation.cancel_token poll and the late worker-style poll observe the retained cancelled token. The follow-up run also creates a fresh uncancelled token, matching the PR's safety claim.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

malhotra5 · 2026-05-27T14:57:54Z

+            # A cancelled token must stay observable: interrupted tool calls run
+            # in worker threads that can outlive arun() and still poll it. A
+            # fresh token is created on the next run().
+            if self._cancel_token is not None and not self._cancel_token.is_cancelled:
+                self._cancel_token = None


I think we should pass a cancel token to each tool call. If arun can be called without a tool call finishing, interrupts should stop everything that's currently in flight

wdyt?

The executor already receives the token and skips not-yet-started calls ( see parallel_executor.py line 222).

The gap you're pointing at is threading it into the tool body so a long-running tool can check mid-execution. Happy to do that, but it touches every ToolDefinition and seems like a separate PR.

The bug here (worker threads polling conversation.cancel_token after finally zeroed it) exists regardless of which pattern tools we use, so I'd land this minimal fix and follow up with explicit threading if you want it.

all-hands-bot reviewed May 27, 2026

View reviewed changes

VascoSch92 requested a review from malhotra5 May 27, 2026 07:30

VascoSch92 mentioned this pull request May 27, 2026

[Tracker] Async LLM support (#3284): correctness bugs + maintainability issues #3341

Open

10 tasks

VascoSch92 force-pushed the fix/cancel-token-after-interrupt branch from d495eee to 0998a62 Compare May 27, 2026 07:31

all-hands-bot reviewed May 27, 2026

View reviewed changes

malhotra5 reviewed May 27, 2026

View reviewed changes

Merge branch 'main' into fix/cancel-token-after-interrupt

e28addf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sdk): keep cancelled token observable after arun() interrupt#3395

fix(sdk): keep cancelled token observable after arun() interrupt#3395
VascoSch92 wants to merge 2 commits into
mainfrom
fix/cancel-token-after-interrupt

VascoSch92 commented May 27, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

github-actions Bot commented May 27, 2026 •

edited

Loading

Uh oh!

all-hands-bot left a comment

Uh oh!

malhotra5 May 27, 2026 •

edited

Loading

Uh oh!

VascoSch92 May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

VascoSch92 commented May 27, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

✅ QA Report: PASS

Does this PR achieve its stated goal?

Test 1: Interrupted async run keeps cancellation observable for late worker-style polling

Issues Found

Uh oh!

malhotra5 May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

VascoSch92 May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

VascoSch92 commented May 27, 2026 •

edited by github-actions Bot

Loading

github-actions Bot commented May 27, 2026 •

edited

Loading

github-actions Bot commented May 27, 2026 •

edited

Loading

github-actions Bot commented May 27, 2026 •

edited

Loading

malhotra5 May 27, 2026 •

edited

Loading