[PATCH] usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd

Tue Mar 3 13:29:11 CET 2020

On 3/3/20 12:25 AM, Lukasz Majewski wrote:
> Hi Marek,

Hi,

[...]

>>>> Every read operation starts at the maximum block size. When the USB
>>>> pendrive is not able to correctly serve this data read request, the
>>>> dynamic reduction of IO size is performed. Up to six tries (with
>>>> smaller IO block each time) are attempted.
>>>>
>>>> A related problem is that some drives are slow to come up. Linux
>>>> handles this by issuing a spinup command and allowing more time for
>>>> the drive to respond. The same idea is applied in this fix.
>>>>
>>>> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
>>>>
>>>> Bus usb at 2184200: USB EHCI 1.00
>>>> scanning bus usb at 2184200 for devices... 2 USB Device(s) found
>>>>        scanning usb for storage devices... 1 Storage Device(s)
>>>> found EHCI timed out on TD - token=0x1f8c80  
>>>
>>> This is how the error gets evident. The detailed explanation is in
>>> link [1].  
>>
>> Is there one specific post in that forum, or do I need to read through
>> the whole multi-page thread ?
> 
> There is a thread pointed out by the below link which explains the
> issue thoroughly:
> https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> 
> (The first 2 posts from 'rayvt' explains the problem).

So basically a drive spin-up problem and the problem with stick counter
overflow.

btw QH/qTD are not microcode instructions, they are just plain DMA
descriptors. And the USB controller in Marvell Kirkwood/Orion SoCs is
on-SoC, it's not a discrete chip.

>> I would expect that if you run -- I assume 'usb reset' (the command is
>> missing above) -- then the bus gets power-cycled, hence the USB device
>> also gets power-cycled.
> 
> Unfortunately, the 'usb reset' is not fixing the issue. The controller
> is unresponsive and the only way for recovering is a power cycle.

My guess would be it might be related to the async schedule, so that's
where I would start debugging.

[...]

>>>> +	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
>>>>  	if (!(QT_TOKEN_GET_STATUS(qhtoken) &
>>>> QT_TOKEN_STATUS_ACTIVE)) { debug("TOKEN=%#x\n", qhtoken);
>>>> +		if (qhtoken & QT_TOKEN_STATUS_XACTERR) {  
>>> 			      ^^^^^^^^^^^^^^^^^^^^^^^^ - this flag
>>> 			      before this patch was also not
>>> checked. 
>>>> +			if (--trynum >= 0) {
>>>> +				/*
>>>> +				 * It is necessary to do this,
>>>> otherwise the
>>>> +				 * disk is clagged.
>>>> +				 */
>>>> +				debug("reset the TD and redo,
>>>> because of XACTERR\n");
>>>> +				qhtoken &=
>>>> ~QT_TOKEN_STATUS_HALTED;
>>>> +				qhtoken |= QT_TOKEN_STATUS_ACTIVE
>>>> |
>>>> +					QT_TOKEN_CERR(2);
>>>> +				vtd->qt_token =
>>>> cpu_to_hc32(qhtoken);
>>>> +				qh->qh_overlay.qt_token =
>>>> cpu_to_hc32(qhtoken);
>>>> +				goto retry_xacterr;
>>>> +			}
>>>> +			dev->status = USB_ST_XACTERR;
>>>> +			dev->act_len = length -
>>>> +				QT_TOKEN_GET_TOTALBYTES(qhtoken);
>>>>  
>>>
>>> This may solve the issue - as it resets the controller in the case
>>> of transfer error.  
>>
>> Do you need to reset the controller ? Or is there some graceful way
>> out, i.e. a way to recover from the error ?
> 
> The original patch was developed when the async transmission was start
> and stopped for each transfer (before your optimization), hence the code
> for disabling it.
> 
> I need to check if it would be enough to just clear the
> QT_TOKEN_STATUS_HALTED and set QT_TOKEN_STATUS_ACTIVE again.

OK