[PATCH] usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd

Mon Mar 2 20:08:37 CET 2020

On 3/2/20 2:01 PM, Lukasz Majewski wrote:
> Hi Marek,

Hi,

>> On 3/1/20 6:59 PM, Lukasz Majewski wrote:
>>> Hi Marek,  
>>
>> Hi,
>>
>> [...]
>>
>>>>> The description of the fix written by the original author:
>>>>> https://forum.doozan.com/read.php?3,35295,35295#msg-35295
>>>>>
>>>>> states that the reduction of the transfer is done just for "spin
>>>>> up"/"detection" of the pen drive. The issue is that not all pen
>>>>> drive disks are able to be properly detected in the first place.
>>>>>   
>>>>
>>>> But the commit message claims this patch reduces the transfer rate
>>>> by 20% always. So that's a NAK right away, sorry,   
>>>
>>> Marek, now we do have from time to time NOT functional USB (once per
>>> ten attempts on iMX6Q - tpc70 board). If somebody relies on 'usb
>>> start' for recovery/normal operation - it is bad.
>>>
>>> In such a use case - the 20% performance drop is acceptable.  
>>
>> I'm sure we can do better. 2020.04 is still quite far out.
>>
>> Can you take a look ?
> 
> There are several issues which are fixed by this patch. 

Then it should be multiple patches ?

> I will reply to the actual patch set to point them out in the code, as
> this may be more readable.
> 
>>
>>>> you need to find a
>>>> better solution which doesn't have such an enormous impact.
>>>>
>>>> I'm fine with the spin-up handling, but this auto-detection of
>>>> transfer size should be pulled into separate patch and is likely
>>>> wrong anyway.  
>>>
>>> I cannot agree that is it "likely wrong anyway" - please correct me
>>> if I'm wrong, but up till now we were fiddling with transfer size
>>> up till having "good enough" results.  
>>
>> Correction, read the patches I linked above
>> (7d6fd7f0ba71cd93d94079132f958d9630f27a89 
> 
> The above sha1 uses the observation (and heuristics) from other
> operating systems to deal with usb pendrives, which controllers cannot
> handle more than 256 sectors (as they have 8 bit counter).

Yes

>> and esp.
>> 02b0e1a36c5bc20174299312556ec4e266872bd6).
> 
> This patch introduces the optimization - instead of setup/disable async
> transmissions - it enables it once and then just adds new tokens to be
> transmitted.

Yes

> Is the transmission error (USB_ST_XACTERR 0x80) handled at this point?

I don't think so.

>> Those solve the "we were
>> fiddling around with transfer size" attempts, because I actually
>> invested the time to understand the issue with those devices that
>> failed and I was right about never letting any of those "fiddling
>> around with transfer size" patches into mainline.
>>
>> The problem with those devices which needed "fiddling with transfer
>> size" was that their internal 16bit counters (or 8bit counters)
>> overflew when a very long transfer was started and thus those devices
>> failed. The fix there was to transfer less than that amount of data
>> which triggered the overflow, which on a wast majority of devices is
>> 240 blocks. However , this led to a massive slowdown, which is
>> unacceptable. A fix for that slowdown was to avoid turning async
>> schedule off-on between each transfer, but rather keep it running,
>> that saves A LOT of wasted transfer time. If you build a long
>> transfer up front, you would never do that many transfers to notice
>> the delays incurred by turning the async schedule off/on, but if you
>> do multiple shorter transfers, these delays add up.
>>
> 
> Thanks for the explanation.
> 
>> Yes, these issues are convoluted and take time to solve properly, yet
>> the end result might not have such a performance hit.
>>
>>> The above link describes what needs to be done in case of the "EHCI
>>> timed out on TD - token=0x1f8c80" error.
>>>
>>> Do we handle it now? In my opinion we don't.
>>>   
>>>>
>>>> So try to separate this into two patches, one for handling the spin
>>>> up, another one for this auto-detection of the transfer size. And I
>>>> think you will end up needed only the first one.  
>>>
>>> This is an open question if only the spin up fixes the error. It
>>> would be great if you could look for the changes introduced in this
>>> patch and compare it with your experience of ehci driver
>>> development.  
>>
>> Surely you can separate those two patches and perform such a test ?
> 
> I would prefer to first go through the actual patch code, before I
> spent time on things which may be not necessary.

How do you suggest we go through a patch which does multiple things,
which have effects on one another, and analyze it properly ?

>>>>> Last but not least - the mainline is still broken. The
>>>>> "token=0x1f8c80" errors pops up from time to time. However, after
>>>>> applying the approach from this fix - the error is gone (and pass
>>>>> some acceptance tests).    
>>>>
>>>> It works for me.  
>>>
>>> Also works for Tom.
>>>   
>>>> Do you have a specific device which fails ?  
>>>
>>> Yes. On my desk - tpc70 with i.MX6Q. And hdc with i.MX53. I can
>>> confirm that from U-Boot 2018.04 up till now the rate of:
>>>
>>> "EHCI timed out on TD - token=0x1f8c80" error has been reduced
>>> significantly for their use case (reading ~16MiB of binary from
>>> several pen drives).   
>>
>> Can you run lsusb (to get IDs) on those pendrives which fail ?
>> I would like to add them to a list.
> 
> This was working correctly on the beginning, but after some time it
> started to show errors in question:
> 
> Bus 001 Device 040: ID 0930:6544 Toshiba Corp. TransMemory-Mini /
> Kingston DataTraveler 2.0 Stick

This one I believe has 16bit counter indeed.

> This one was causing issues from the very beginning:
> Bus 001 Device 041: ID 058f:6387 Alcor Micro Corp. Flash Drive

Silver INTENSO USB stick ? I have that one too (or at least one with
such ID), it was breaking DWC2 controller.

> This one - Verbatim 
> Bus 001 Device 042: ID 21c4:8005

This one is new.

> Never caused issues - worked reliably even with old U-Boot (on which
> errors were observed very often).

Maybe the problems you see have something to do with the async schedule?

[...]