fix(sdk): keep cancelled token observable after arun() interrupt#3395
fix(sdk): keep cancelled token observable after arun() interrupt#3395VascoSch92 wants to merge 2 commits into
Conversation
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
all-hands-bot
left a comment
There was a problem hiding this comment.
Taste Rating: 🟢 Good taste — the fix is small and preserves the intended cooperative-cancellation contract. I didn’t find any blocking code issues in the diff; focused interrupt tests pass locally (uv run pytest tests/sdk/conversation/test_interrupt.py -q).
Because this touches agent interruption/tool execution behavior, I’m leaving a COMMENT rather than approval per repo eval-risk guidance; a human maintainer should decide whether lightweight eval coverage is needed.
Automated review generated by OpenHands on behalf of the requester.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM — small localized cancellation lifecycle change, but it affects agent run/interruption semantics and tool execution observability.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26497199123
Coverage Report •
|
||||||||||||||||||||
arun()'s finally cleared self._cancel_token unconditionally. Tool calls dispatched via run_in_executor run in worker threads that can't be force-cancelled and may outlive arun(), so a tool polling conversation.cancel_token after an interrupt saw None and lost the cancellation signal. Retain a cancelled token; the next run() replaces it.
d495eee to
0998a62
Compare
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Verified the SDK async interrupt flow with a real Conversation.arun() call: the PR keeps the cancelled token observable after interrupt and replaces it with a fresh token on the next run.
Does this PR achieve its stated goal?
Yes. The PR's goal was to prevent arun() cleanup from clearing a cancelled token before late-running worker/thread code can observe cooperative cancellation. On origin/main, the same SDK script showed after_interrupt_token_present=False and late_worker_token_present=False; on the PR commit, both became True with is_cancelled=True, and the next run received a fresh uncancelled token.
| Phase | Result |
|---|---|
| Environment Setup | ✅ make build completed successfully; dependencies installed and build marked ready. |
| CI Status | 🟡 gh pr checks reported 0 failing, 19 successful, 7 pending, 6 skipped at review time. |
| Functional Verification | ✅ Reproduced the base bug and verified the PR fixes token observability plus fresh-token behavior. |
Functional Verification
Test 1: Interrupted async run keeps cancellation observable for late worker-style polling
Step 1 — Reproduce / establish baseline (without the fix):
Checked out base origin/main (58771eeb) and ran a standalone SDK script that creates an Agent with a hanging async LLM, starts Conversation.arun(), calls conversation.interrupt(), then has a separate worker-style thread poll conversation.cancel_token after arun() has unwound.
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_cancel_token.py 2>&1 | grep -E '^(during|after|late|next)':
58771eeb (HEAD, origin/main) fix(sdk): dedupe LLMs by usage_id in _ensure_agent_ready registration loop (#3392)
during_run_token_present=True
during_run_token_cancelled=False
after_interrupt_token_present=False
after_interrupt_token_cancelled=None
late_worker_token_present=False
late_worker_token_cancelled=None
next_run_token_present=True
next_run_token_cancelled=False
next_run_token_is_same_as_retained=False
This confirms the bug exists on base: while the token exists during the run, after interrupt() and arun() cleanup a late poll sees None, so cooperative cancellation is no longer observable.
Step 2 — Apply the PR's changes:
Checked out fix/cancel-token-after-interrupt at d495eeeaab6631dbe182af8e8fbfa1e96f7965f7.
Step 3 — Re-run with the fix in place:
Ran the same command on the PR branch:
d495eeea (HEAD -> fix/cancel-token-after-interrupt, origin/fix/cancel-token-after-interrupt) fix(sdk): keep cancelled token observable after arun() interrupt
during_run_token_present=True
during_run_token_cancelled=False
after_interrupt_token_present=True
after_interrupt_token_cancelled=True
late_worker_token_present=True
late_worker_token_cancelled=True
next_run_token_present=True
next_run_token_cancelled=False
next_run_token_is_same_as_retained=False
This confirms the fix works: after interrupt, both the direct public conversation.cancel_token poll and the late worker-style poll observe the retained cancelled token. The follow-up run also creates a fresh uncancelled token, matching the PR's safety claim.
Issues Found
None.
This review was created by an AI agent (OpenHands) on behalf of the user.
| # A cancelled token must stay observable: interrupted tool calls run | ||
| # in worker threads that can outlive arun() and still poll it. A | ||
| # fresh token is created on the next run(). | ||
| if self._cancel_token is not None and not self._cancel_token.is_cancelled: | ||
| self._cancel_token = None |
There was a problem hiding this comment.
I think we should pass a cancel token to each tool call. If arun can be called without a tool call finishing, interrupts should stop everything that's currently in flight
wdyt?
There was a problem hiding this comment.
The executor already receives the token and skips not-yet-started calls ( see parallel_executor.py line 222).
The gap you're pointing at is threading it into the tool body so a long-running tool can check mid-execution. Happy to do that, but it touches every ToolDefinition and seems like a separate PR.
The bug here (worker threads polling conversation.cancel_token after finally zeroed it) exists regardless of which pattern tools we use, so I'd land this minimal fix and follow up with explicit threading if you want it.
Why
Addresses B3 of #3341.
arun()'sfinallyclearedself._cancel_tokenunconditionally. Oninterrupt(), the in-flightasyncio.Taskis cancelled (breaking theawaitinastep), but tool calls dispatched viarun_in_executorrun in worker threads that can't be force-cancelled and may outlivearun(). A tool pollingconversation.cancel_token— the documented cooperative-cancellation pattern — after thefinallyran sawNoneand lost the cancellation signal, the opposite of whatinterrupt()intended.What
_cancel_tokeninarun()'sfinallyonly when it was not cancelled. A cancelled token stays observable; the nextrun()/arun()creates a fresh one, so leaving it is safe.test_interrupt_sets_cancel_token(it encoded the old behaviour) and addedtest_cancel_token_stays_observable_after_interrupt.The sync
run()path is not affected:interrupt()falls back topause()there andstep()runs inline, so thefinallyonly fires after tools return.Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.13-nodejs22-slimgolang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:e28addf-pythonRun
All tags pushed for this build
About Multi-Architecture Support
e28addf-python) is a multi-arch manifest supporting both amd64 and arm64e28addf-python-amd64) are also available if needed