[PATCH] usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd
marex at denx.de
Mon Mar 2 20:08:37 CET 2020
On 3/2/20 2:01 PM, Lukasz Majewski wrote:
> Hi Marek,
>> On 3/1/20 6:59 PM, Lukasz Majewski wrote:
>>> Hi Marek,
>>>>> The description of the fix written by the original author:
>>>>> states that the reduction of the transfer is done just for "spin
>>>>> up"/"detection" of the pen drive. The issue is that not all pen
>>>>> drive disks are able to be properly detected in the first place.
>>>> But the commit message claims this patch reduces the transfer rate
>>>> by 20% always. So that's a NAK right away, sorry,
>>> Marek, now we do have from time to time NOT functional USB (once per
>>> ten attempts on iMX6Q - tpc70 board). If somebody relies on 'usb
>>> start' for recovery/normal operation - it is bad.
>>> In such a use case - the 20% performance drop is acceptable.
>> I'm sure we can do better. 2020.04 is still quite far out.
>> Can you take a look ?
> There are several issues which are fixed by this patch.
Then it should be multiple patches ?
> I will reply to the actual patch set to point them out in the code, as
> this may be more readable.
>>>> you need to find a
>>>> better solution which doesn't have such an enormous impact.
>>>> I'm fine with the spin-up handling, but this auto-detection of
>>>> transfer size should be pulled into separate patch and is likely
>>>> wrong anyway.
>>> I cannot agree that is it "likely wrong anyway" - please correct me
>>> if I'm wrong, but up till now we were fiddling with transfer size
>>> up till having "good enough" results.
>> Correction, read the patches I linked above
> The above sha1 uses the observation (and heuristics) from other
> operating systems to deal with usb pendrives, which controllers cannot
> handle more than 256 sectors (as they have 8 bit counter).
>> and esp.
> This patch introduces the optimization - instead of setup/disable async
> transmissions - it enables it once and then just adds new tokens to be
> Is the transmission error (USB_ST_XACTERR 0x80) handled at this point?
I don't think so.
>> Those solve the "we were
>> fiddling around with transfer size" attempts, because I actually
>> invested the time to understand the issue with those devices that
>> failed and I was right about never letting any of those "fiddling
>> around with transfer size" patches into mainline.
>> The problem with those devices which needed "fiddling with transfer
>> size" was that their internal 16bit counters (or 8bit counters)
>> overflew when a very long transfer was started and thus those devices
>> failed. The fix there was to transfer less than that amount of data
>> which triggered the overflow, which on a wast majority of devices is
>> 240 blocks. However , this led to a massive slowdown, which is
>> unacceptable. A fix for that slowdown was to avoid turning async
>> schedule off-on between each transfer, but rather keep it running,
>> that saves A LOT of wasted transfer time. If you build a long
>> transfer up front, you would never do that many transfers to notice
>> the delays incurred by turning the async schedule off/on, but if you
>> do multiple shorter transfers, these delays add up.
> Thanks for the explanation.
>> Yes, these issues are convoluted and take time to solve properly, yet
>> the end result might not have such a performance hit.
>>> The above link describes what needs to be done in case of the "EHCI
>>> timed out on TD - token=0x1f8c80" error.
>>> Do we handle it now? In my opinion we don't.
>>>> So try to separate this into two patches, one for handling the spin
>>>> up, another one for this auto-detection of the transfer size. And I
>>>> think you will end up needed only the first one.
>>> This is an open question if only the spin up fixes the error. It
>>> would be great if you could look for the changes introduced in this
>>> patch and compare it with your experience of ehci driver
>> Surely you can separate those two patches and perform such a test ?
> I would prefer to first go through the actual patch code, before I
> spent time on things which may be not necessary.
How do you suggest we go through a patch which does multiple things,
which have effects on one another, and analyze it properly ?
>>>>> Last but not least - the mainline is still broken. The
>>>>> "token=0x1f8c80" errors pops up from time to time. However, after
>>>>> applying the approach from this fix - the error is gone (and pass
>>>>> some acceptance tests).
>>>> It works for me.
>>> Also works for Tom.
>>>> Do you have a specific device which fails ?
>>> Yes. On my desk - tpc70 with i.MX6Q. And hdc with i.MX53. I can
>>> confirm that from U-Boot 2018.04 up till now the rate of:
>>> "EHCI timed out on TD - token=0x1f8c80" error has been reduced
>>> significantly for their use case (reading ~16MiB of binary from
>>> several pen drives).
>> Can you run lsusb (to get IDs) on those pendrives which fail ?
>> I would like to add them to a list.
> This was working correctly on the beginning, but after some time it
> started to show errors in question:
> Bus 001 Device 040: ID 0930:6544 Toshiba Corp. TransMemory-Mini /
> Kingston DataTraveler 2.0 Stick
This one I believe has 16bit counter indeed.
> This one was causing issues from the very beginning:
> Bus 001 Device 041: ID 058f:6387 Alcor Micro Corp. Flash Drive
Silver INTENSO USB stick ? I have that one too (or at least one with
such ID), it was breaking DWC2 controller.
> This one - Verbatim
> Bus 001 Device 042: ID 21c4:8005
This one is new.
> Never caused issues - worked reliably even with old U-Boot (on which
> errors were observed very often).
Maybe the problems you see have something to do with the async schedule?
More information about the U-Boot