drivers/nutdrv_qx.c, drivers/blazer_usb.c: cypress_command()/phoenix_command(): clear endpoint halt and retry on LIBUSB_ERROR_OVERFLOW [#598]#3448
Draft
Pyker wants to merge 1 commit into
Conversation
…command(): clear endpoint halt and retry on LIBUSB_ERROR_OVERFLOW [networkupstools#598] Some firmware in the 0665:5161 Cypress USB-Serial bridge family (Salicru SPS ONE, Ippon, Belkin F6C1200-UNV, ViewPower, various Voltronic Power UPSes) occasionally emits an interrupt-IN frame larger than the declared wMaxPacketSize=8, mid-reply. The kernel xHCI driver flags this as LIBUSB_ERROR_OVERFLOW and discards the data; from that point on, every subsequent read on the endpoint also returns OVERFLOW until a USB-level reset is performed, which results in "Data stale" from upsd until the driver is restarted. Issue an immediate CLEAR_FEATURE(HALT) on EP 0x81 and retry the read once. CLEAR_FEATURE(HALT) unsticks the kernel-side endpoint state, and in the observed cases the firmware's TX FIFO state as well, so the next read returns a clean frame. If the retry also overflows, the existing error path is taken unchanged. Captured at byte level on a Salicru SPS 1500 ONE BL (Voltronic-QS H-protocol over 0665:5161, host xHCI on Linux 6.14): firmware emits chunks 3..6 of the QS reply collapsed into a single frame that exceeds wMaxPacketSize, kernel reports LIBUSB_ERROR_OVERFLOW, and all subsequent reads on EP 0x81 return OVERFLOW until a port reset. The same fix is applied to cypress_command() and phoenix_command() in both drivers because the read loops are byte-identical and the bug is at the USB transport layer, below either subdriver's protocol logic. Signed-off-by: Pedro Cunha <pedroagracio+nut@gmail.com>
|
A ZIP file with standard source tarball and another tarball with pre-built docs for commit 9eefcfd is temporarily available: NUT-tarballs-PR-3448.zip. |
|
✅ Build nut 2.8.5.4740-master completed (commit ca38a9e7d6 by @Pyker)
|
|
✅ Build nut 2.8.5.4740-master completed (commit ca38a9e7d6 by @Pyker) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a
usb_clear_halt()+ retry onLIBUSB_ERROR_OVERFLOWfor the EP 0x81 interrupt read in bothcypress_command()andphoenix_command()innutdrv_qx.candblazer_usb.c, addressing the long-running #598 hang on the 0665:5161 Cypress USB-Serial bridge family (Salicru SPS ONE, Ippon, Belkin F6C1200-UNV, ViewPower, various Voltronic Power UPSes).The byte-level root cause analysis, captured via
usbmon+tsharkon a Salicru SPS 1500 ONE BL, is in this comment on #598. Short version: firmware in this bridge family occasionally emits an interrupt-IN frame larger than the declaredwMaxPacketSize=8, the kernel xHCI driver returnsLIBUSB_ERROR_OVERFLOWand discards the data, and every subsequent read on the endpoint also returnsOVERFLOWuntil a USB-level reset. Locally observed rate on a steady-state Salicru: 1-3 hangs/day withpollinterval=2; once stuck, only a port reset (orusb_resetter --reset-device) recovers it.The patch issues a single
usb_clear_halt(udev, 0x81)onOVERFLOWand re-reads the same 8-byte slice once.CLEAR_FEATURE(HALT)reliably unsticks both the kernel-side endpoint state and (empirically) the firmware's TX FIFO, so the next read returns the next clean frame and the protocol layer resumes without entering the existingMAXTRIESfailure cluster. If the retry also overflows, the existing error path is taken unchanged.The same fix is applied to
cypress_command()andphoenix_command()in both drivers because the read loops are byte-identical and the bug is at the USB transport layer, below either subdriver's protocol logic.Why draft
Marking as draft until I have a few days of post-patch uptime data from the affected hardware (currently mitigated by a local watchdog that triggers
usb_resetter --reset-deviceon stale status). I will convert to ready-for-review once I have soak data showing theOVERFLOWretries are being hit and absorbed without escalating into the existingMAXTRIESfailure cluster.Test plan
nutdrv_qxandblazer_usbwith the patch (verified locally:make -C drivers nutdrv_qx blazer_usbsucceeds on Linux x86_64, libusb-1.0)nut-driver@salicrufor at least 48h withdebug_min = 2and confirm:LIBUSB_ERROR_OVERFLOW on EP 0x81, clearing halt and retryingdebug log line appearsMAXTRIESclustersnut-salicru-watchdog.timerdoes not fire over the soak windowRelated issues