[PATCH] usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd

Sun Mar 1 19:39:21 CET 2020

On 3/1/20 6:59 PM, Lukasz Majewski wrote:
> Hi Marek,

Hi,

[...]

>>> The description of the fix written by the original author:
>>> https://forum.doozan.com/read.php?3,35295,35295#msg-35295
>>>
>>> states that the reduction of the transfer is done just for "spin
>>> up"/"detection" of the pen drive. The issue is that not all pen
>>> drive disks are able to be properly detected in the first place.  
>>
>> But the commit message claims this patch reduces the transfer rate by
>> 20% always. So that's a NAK right away, sorry, 
> 
> Marek, now we do have from time to time NOT functional USB (once per
> ten attempts on iMX6Q - tpc70 board). If somebody relies on 'usb start'
> for recovery/normal operation - it is bad.
> 
> In such a use case - the 20% performance drop is acceptable.

I'm sure we can do better. 2020.04 is still quite far out.

Can you take a look ?

>> you need to find a
>> better solution which doesn't have such an enormous impact.
>>
>> I'm fine with the spin-up handling, but this auto-detection of
>> transfer size should be pulled into separate patch and is likely
>> wrong anyway.
> 
> I cannot agree that is it "likely wrong anyway" - please correct me if
> I'm wrong, but up till now we were fiddling with transfer size up till
> having "good enough" results.

Correction, read the patches I linked above
(7d6fd7f0ba71cd93d94079132f958d9630f27a89 and esp.
02b0e1a36c5bc20174299312556ec4e266872bd6). Those solve the "we were
fiddling around with transfer size" attempts, because I actually
invested the time to understand the issue with those devices that failed
and I was right about never letting any of those "fiddling around with
transfer size" patches into mainline.

The problem with those devices which needed "fiddling with transfer
size" was that their internal 16bit counters (or 8bit counters) overflew
when a very long transfer was started and thus those devices failed. The
fix there was to transfer less than that amount of data which triggered
the overflow, which on a wast majority of devices is 240 blocks. However
, this led to a massive slowdown, which is unacceptable. A fix for that
slowdown was to avoid turning async schedule off-on between each
transfer, but rather keep it running, that saves A LOT of wasted
transfer time. If you build a long transfer up front, you would never do
that many transfers to notice the delays incurred by turning the async
schedule off/on, but if you do multiple shorter transfers, these delays
add up.

Yes, these issues are convoluted and take time to solve properly, yet
the end result might not have such a performance hit.

> The above link describes what needs to be done in case of the "EHCI
> timed out on TD - token=0x1f8c80" error.
> 
> Do we handle it now? In my opinion we don't.
> 
>>
>> So try to separate this into two patches, one for handling the spin
>> up, another one for this auto-detection of the transfer size. And I
>> think you will end up needed only the first one.
> 
> This is an open question if only the spin up fixes the error. It would
> be great if you could look for the changes introduced in this patch and
> compare it with your experience of ehci driver development.

Surely you can separate those two patches and perform such a test ?

>>> Last but not least - the mainline is still broken. The
>>> "token=0x1f8c80" errors pops up from time to time. However, after
>>> applying the approach from this fix - the error is gone (and pass
>>> some acceptance tests).  
>>
>> It works for me.
> 
> Also works for Tom.
> 
>> Do you have a specific device which fails ?
> 
> Yes. On my desk - tpc70 with i.MX6Q. And hdc with i.MX53. I can confirm
> that from U-Boot 2018.04 up till now the rate of:
> 
> "EHCI timed out on TD - token=0x1f8c80" error has been reduced
> significantly for their use case (reading ~16MiB of binary from several
> pen drives). 

Can you run lsusb (to get IDs) on those pendrives which fail ?
I would like to add them to a list.

> Unfortunately the errors are still present from time to time.
> 
>> Which
>> device and on which hardware ?
>>
>> Also, what is "some acceptance test" ?
>>
> 
> The set of automated tests to assess if "usb start" don't break.

The usb start only reads out the partition table and does not do any
long transfers (above 240 blocks, which usually triggers the failure on
cheap sticks). So if long transfers fail for you, then this is a
separate issue from the 'usb start' spin-up problem. And so, if this
patch makes both work, then this indeed should be two patches.