[PATCH v2 0/8] USB fixes: xHCI error handling
Hector Martin
marcan at marcan.st
Sun Oct 29 07:37:37 CET 2023
This series is the first of a few bundles of USB fixes we have been
carrying downstream on the Asahi U-Boot branch for a few months.
Patches #1-#6 fix a series of related robustness issues. Certain
conditions related to endpoint stalls revealed a chain of bugs
throughout the stack that caused U-Boot to completely fail when
encountering some errors (and errors are a fact of life with USB).
Example scenario:
- The device stalls a bulk endpoint due to an error
- The upper layer driver tries to use the endpoint again
- There is no endpoint stall clear wired up in the API, so for starters
this is doomed to fail (fix: #4)
- xHCI knows the endpoint is halted, but tries to queue the TRB anyway,
which can't work (fix: #5)
- Since the endpoint is halted nothing happens, so the transfer times
out.
- The timeout handling tries to abort the transfer
- abort_td() tries to stop the endpoint, but since it is already halted,
this results in a context state error. As the transfer never started,
there is no completion event, so xhci_wait_for_event() is waiting for
the wrong event type, and logs an error and returns NULL. (fix: #2)
- abort_td() crashes due to failing to guard against the NULL event
(fix: #1)
- Even after fixing all that, abort_td() would not handle the context
state error properly and BUG() (fix: #3). This also fixes a race
condition documented in the xHCI spec that could occur even in the
absence of all the other bugs.
Patch #6 addresses a related robustness issue where
xhci_wait_for_event() panics on event timeouts other than for transfers.
While this is clearly an unexpected condition and indicates a bug
elsewhere, it's no reason to outright crash. USB is notoriously
impossible to get 100% right, and we can't afford to be breaking users'
systems at any sign of trouble. Error conditions should always be dealt
with as gracefully as possible (even if that results in a completely
broken USB controller, that is much preferable to aborting the boot
process entirely, especially on devices with non-USB storage and
keyboards where USB support is effectively optional for most users).
Since after patch #1 we now guard all callers to xhci_wait_for_event()
with at least trivial NULL checks, it's okay to return and continue in
case of timeouts.
Patch #7 addresses an unrelated bug I ran into while debugging all this,
and patch #8 adds extra debug logs to make finding future issues less
painful.
I believe this should fix this Fedora bug too, which is either an
instance of the above sequence of events, or (more likely, given the
difficulty reproducing) the race condition documented in xHCI 4.6.9:
https://bugzilla.redhat.com/show_bug.cgi?id=2244305
Signed-off-by: Hector Martin <marcan at marcan.st>
---
Changes in v2:
- Squashed in a trivial fix for patch #1
- Removed spurious blank line
- Added a longer description to the cover letter
- Link to v1: https://lore.kernel.org/r/20231027-usb-fixes-1-v1-0-1c879bbcd928@marcan.st
---
Hector Martin (8):
usb: xhci: Guard all calls to xhci_wait_for_event
usb: xhci: Better error handling in abort_td()
usb: xhci: Allow context state errors when halting an endpoint
usb: xhci: Recover from halted bulk endpoints
usb: xhci: Fail on attempt to queue TRBs to a halted endpoint
usb: xhci: Do not panic on event timeouts
usb: xhci: Fix DMA address calculation in queue_trb
usb: xhci: Add more debugging
drivers/usb/host/xhci-ring.c | 99 ++++++++++++++++++++++++++++++++++++--------
drivers/usb/host/xhci.c | 9 ++++
include/usb/xhci.h | 2 +
3 files changed, 92 insertions(+), 18 deletions(-)
---
base-commit: 7c0d668103abae3ec14cd96d882ba20134bb4529
change-id: 20231027-usb-fixes-1-83bfc7013012
Best regards,
--
Hector Martin <marcan at marcan.st>
More information about the U-Boot
mailing list