fix: make V1Channel re-subscribable after a failed subscribe#845
Open
pike00 wants to merge 1 commit into
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes a regression where transient failures during device/channel setup could leave subscriptions partially active, preventing subsequent reconnects (e.g., “Only one subscription allowed at a time”).
Changes:
- Make
V1Channel.subscribe()teardown atomic and reusable via a shared_teardown()path. - Ensure
RoborockDevice.connect()always unsubscribes ifstart()fails (including non-RoborockExceptionerrors). - Add regression tests covering resubscribe/retry and atomic failure cleanup.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
roborock/devices/rpc/v1_channel.py |
Makes subscription setup atomic and introduces _teardown() to reliably reset internal state on error/unsubscribe. |
roborock/devices/device.py |
Broadens cleanup on start() failure to always unsubscribe before re-raising. |
tests/devices/test_v1_device.py |
Adds regression tests ensuring connect() unsubscribes on any start() failure and can retry cleanly. |
tests/devices/rpc/test_v1_channel.py |
Adds tests for subscribe→unsub→subscribe and for atomic cleanup after subscribe failures. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+205
to
210
| except BaseException: | ||
| # Any failure in start() must unsubscribe before propagating, so a | ||
| # retry by connect_loop() gets a clean channel. Broader than | ||
| # RoborockException so non-Roborock errors also release the channel. | ||
| unsub() | ||
| raise |
Comment on lines
+328
to
+332
| except BaseException: | ||
| # Any failure during setup must leave the channel re-subscribable: | ||
| # cancel the reconnect task, drop subscriptions, and clear _callback. | ||
| self._teardown() | ||
| raise |
Comment on lines
+344
to
+357
| if self._reconnect_task: | ||
| self._reconnect_task.cancel() | ||
| self._reconnect_task = None | ||
| if self._mqtt_unsub: | ||
| self._mqtt_unsub() | ||
| self._mqtt_unsub = None | ||
| if self._local_unsub: | ||
| self._local_unsub() | ||
| self._local_unsub = None | ||
| if self._local_channel: | ||
| self._local_channel.close() | ||
| self._local_channel = None | ||
| self._callback = None | ||
| self._logger.debug("Unsubscribed from device") |
|
|
||
| v1_channel = V1Channel( | ||
| device_uid=duid, | ||
| security_data=SecurityData(endpoint="test_endpoint", nonce=b"test_nonce_16byte"), |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A device whose first connect attempt hits a transient error is bricked for the rest of the session: it never gets entities and the log shows the misleading
ValueError: Only one subscription allowed at a time. This was observed with two V1 vacuums on one account — the second vacuum (sharing the MQTT session) reliably failed while the first connected fine.Root cause
The connect path is
RoborockDevice.start_connect()→connect_loop()→connect()→V1Channel.subscribe().V1Channel.subscribe()setself._callback = callbacklast, and the returnedunsub()closure tore down the reconnect task and MQTT/local subscriptions but never clearedself._callback.connect(), ifv1_properties.start()raised a transientRoborockException(e.g. the initial status fetch timing out for a second device contending for the MQTT session), theexcept RoborockExceptionblock calledunsub()and re-raised — leavingself._callbackstill set.connect_loop()caught theRoborockExceptionand retriedconnect(). The retry'ssubscribe()sawself._callback is not Noneand raisedValueError("Only one subscription allowed at a time"). ThatValueErroris not aRoborockException, so it escaped toconnect_loop()'s outerexcept Exception, which logsUncaught error during connectand returns — permanently giving up. No ready callbacks fire, so no entities are created.A secondary leak: if
subscribe()raised partway through (after starting_reconnect_task/ setting_mqtt_unsub), it returned nounsub, so the reconnect task and subscriptions leaked on every retry.Fix
V1Channel.subscribe()now claimsself._callbackup front and wraps the local-connect / reconnect-task / MQTT-subscribe setup intry/except BaseException: self._teardown(); raise._teardown()(also returned as the unsub function) that cancels the reconnect task, drops the MQTT/local subscriptions, closes the local channel, and clearsself._callback— leaving the channel cleanly re-subscribable.RoborockDevice.connect()'s cleanup is broadened fromexcept RoborockExceptiontoexcept BaseExceptionso anystart()failure (not just Roborock errors) releases the channel before propagating.Net effect: a retry after a transient first-attempt failure now re-subscribes cleanly instead of tripping the stale-callback guard. A genuinely persistent failure becomes a normal backoff-retry loop surfacing the real error, rather than a hard brick behind a misleading
ValueError.Tests
Added regression tests (all fail against the pre-fix code):
tests/devices/rpc/test_v1_channel.pysubscribe → unsub → subscribeno longer raises._reconnect_task is Noneand the channel re-callable (no leaked task/subscription).tests/devices/test_v1_device.pyconnect()unsubscribes whenstart()fails (parametrized:RoborockExceptionand a non-Roborock error).start()raises a transientRoborockExceptiononce then succeeds →connect_loop()reconnects and the device becomes ready, with noValueError.uv run pytest— full suite passes;ruff checkandmypyclean.