[PATCH] usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd

Mon Mar 2 20:19:22 CET 2020

On 3/2/20 2:21 PM, Lukasz Majewski wrote:
> Hi,

Hi,

>> This patch aims to improve robustness of 'usb' command operation on
>> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
>> described and provided in [1] (originally applicable to U-Boot
>> 2016.05).
>>
>> According to the fix author - "rayvt" (from [1]):
>>
>> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a
>> micro-code instruction for the EHCI USB chip.
>> The "token" field is detailed information, control, and status of the
>> TD's data transfer. In particular, the rightmost byte is the status
>> field. 0x80 bit means the TD is active and the data transfer has not
>> yet completed. 0x08 bit means there was some sort of data transfer
>> error (XactErr).
> 	 ^^^^^^^ - this error is NOT handled now in mainline u-boot. Do
> 	 you have any experience with it?

No, but then I also didn't run into those yet.

>> If the drive gets a transfer error, it refuses to do any other I/O
>> transfer until the error is properly cleared and reset. U-boot did
>> not do this, so every subsequent disk read would get a timeout error
>> because the drive would not send any data. The fix is to properly
>> clear and reset the USB disk when it gets a transfer error.

So basically, what this says is, "we crashed the drive, because the
drive suffered counter overflow, thus we fix it by resetting the drive".
No, the fix is to not crash the drive in the first place.

>> Every read operation starts at the maximum block size. When the USB
>> pendrive is not able to correctly serve this data read request, the
>> dynamic reduction of IO size is performed. Up to six tries (with
>> smaller IO block each time) are attempted.
>>
>> A related problem is that some drives are slow to come up. Linux
>> handles this by issuing a spinup command and allowing more time for
>> the drive to respond. The same idea is applied in this fix.
>>
>> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
>>
>> Bus usb at 2184200: USB EHCI 1.00
>> scanning bus usb at 2184200 for devices... 2 USB Device(s) found
>>        scanning usb for storage devices... 1 Storage Device(s) found
>> EHCI timed out on TD - token=0x1f8c80
> 
> This is how the error gets evident. The detailed explanation is in link
> [1].

Is there one specific post in that forum, or do I need to read through
the whole multi-page thread ?

I would expect that if you run -- I assume 'usb reset' (the command is
missing above) -- then the bus gets power-cycled, hence the USB device
also gets power-cycled.

>> Performance impact:
>>
>> With U-Boot -master in mainline:
>> 16869440 bytes read in 979 ms (16.4 MiB/s)
>>
>> With this patch:
>> 16869440 bytes read in 1219 ms (13.2 MiB/s)
>>
>> Links:
>> [1] - https://forum.doozan.com/read.php?3,35295,35295#msg-35295
>> [2] -
>> https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0
>>
>> Signed-off-by: Lukasz Majewski <lukma at denx.de>
>> [Unfortunately, the original patch [2] did not contain S-o-B from the
>> original author - "rayvt"]
>> ---
>>
>>  common/usb.c                |  10 +++-
>>  common/usb_storage.c        | 106
>> +++++++++++++++++++++++++++++++++--- drivers/usb/host/ehci-hcd.c |
>> 55 +++++++++++++++++-- include/usb.h               |   1 +
>>  include/usb_defs.h          |   1 +
>>  5 files changed, 160 insertions(+), 13 deletions(-)
>>
>> diff --git a/common/usb.c b/common/usb.c
>> index 349e838f1d..305482b5bb 100644
>> --- a/common/usb.c
>> +++ b/common/usb.c
>> @@ -925,14 +925,20 @@ static int get_descriptor_len(struct usb_device
>> *dev, int len, int expect_len) __maybe_unused struct
>> usb_device_descriptor *desc; ALLOC_CACHE_ALIGN_BUFFER(unsigned char,
>> tmpbuf, USB_BUFSIZ); int err;
>> +	int retry = 5;
>>  
>>  	desc = (struct usb_device_descriptor *)tmpbuf;
>>  
>> +again:
>>  	err = usb_get_descriptor(dev, USB_DT_DEVICE, 0, desc, len);
>>  	if (err < expect_len) {
>>  		if (err < 0) {
>> -			printf("unable to get device descriptor
>> (error=%d)\n",
>> -				err);
>> +			printf("unable to get device descriptor
>> (error=%d) retry: %d\n",
>> +			       err, retry);
>> +			mdelay(50);
>> +			if (--retry >= 0)
>> +				/* Some drives are just slow to wake
>> up. */
>> +				goto again;
> 
> This I think is just to "wait" for the inserted device if it is rather
> slow (or probably its memory got degraded over time).
> 
> This shouldn't bring any transfer slow down.

Then this could be a separate patch ?

>>  			return err;
>>  		} else {
>>  			printf("USB device descriptor short read
>> (expected %i, got %i)\n", diff --git a/common/usb_storage.c
>> b/common/usb_storage.c index 097b6729c1..48c8c2ae64 100644
>> --- a/common/usb_storage.c
>> +++ b/common/usb_storage.c
>> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev,
>> struct us_data *us, struct blk_desc *dev_desc);
>>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
>>  		      struct us_data *ss);
>> +
>> +#ifdef CONFIG_USB_EHCI_HCD
>> +	/*
>> +	 * The U-Boot EHCI driver can handle any transfer length as
>> long as
>> +	 * there is enough free heap space left, but the SCSI
>> READ(10) and
>> +	 * WRITE(10) commands are limited to 65535 blocks.
>> +	 */
>> +int usb_max_xfer_blk = 4096;
> 
> Setting it to magic value - 4096 may be indeed problematic. However, I
> don't know what is the 'usual' transfer size.

240 blocks , see 7d6fd7f0ba71cd93d94079132f958d9630f27a89 .

>> +#else
>> +int usb_max_xfer_blk = 20;
> 
> This is also a magic value - probably took from some old U-Boot (when
> default transfer was set to this value).
> 
> I do guess that this snippet shall be removed from this patch set.

Probably.

>> +#endif
>> +
>>  #if CONFIG_IS_ENABLED(BLK)
>>  static unsigned long usb_stor_read(struct udevice *dev, lbaint_t
>> blknr, lbaint_t blkcnt, void *buffer);
>> @@ -729,6 +741,7 @@ static int usb_stor_BBB_transport(struct scsi_cmd
>> *srb, struct us_data *us) pipeout = usb_sndbulkpipe(us->pusb_dev,
>> us->ep_out); /* DATA phase + error handling */
>>  	data_actlen = 0;
>> +	mdelay(10);		/* Like linux does. */
>>  	/* no data, go immediately to the STATUS phase */
>>  	if (srb->datalen == 0)
>>  		goto st;
>> @@ -740,6 +753,16 @@ static int usb_stor_BBB_transport(struct
>> scsi_cmd *srb, struct us_data *us) 
>>  	result = usb_bulk_msg(us->pusb_dev, pipe, srb->pdata,
>> srb->datalen, &data_actlen, USB_CNTL_TIMEOUT * 5);
>> +
>> +	/* special handling of XACTERR in DATA phase */
>> +	if (result < 0 && (us->pusb_dev->status & USB_ST_XACTERR)) {
>> +		debug("XACTERR in data phase - clr, reset, and
>> return fail.\n");
>> +		usb_stor_BBB_clear_endpt_stall(us, dir_in ?
>> +					       us->ep_in :
>> us->ep_out);
>> +		usb_stor_BBB_reset(us);
>> +		return USB_STOR_TRANSPORT_FAILED;
>> +	}
>> +
> 
> Here the case of XACTERR is handled. This is missing now in -master.

Surely this could be added as a separate patch ?

>>  	/* special handling of STALL in DATA phase */
>>  	if ((result < 0) && (us->pusb_dev->status & USB_ST_STALLED))
>> { debug("DATA:stall\n");
>> @@ -1013,9 +1036,32 @@ static int usb_request_sense(struct scsi_cmd
>> *srb, struct us_data *ss) return 0;
>>  }
>>  
>> +/*
>> + * This spins up the disk and also consumes the time that the
>> + * disk takes to become active and ready to read data.
>> + * Some drives (like Western Digital) can take more than 5 seconds.
>> + * The delay occurs on the 1st data read from the disk.
>> + * Extending the timeout here works better than handling the timeout
>> + * as an error on a "real" read.
>> + */
>> +static int usb_spinup(struct scsi_cmd *srb, struct us_data *ss)
>> +{
>> +	memset(&srb->cmd[0], 0, 12);
>> +	srb->cmd[0] = SCSI_START_STP;
>> +	srb->cmd[1] = srb->lun << 5;
>> +	srb->cmd[4] = 1;	/* Start spinup. */
>> +	srb->datalen = 0;
>> +	srb->cmdlen = 6;
>> +	ss->pusb_dev->extra_timeout = 9876;
>> +	ss->transport(srb, ss);
>> +	ss->pusb_dev->extra_timeout = 0;
>> +	return 0;
>> +}
>> +
>>  static int usb_test_unit_ready(struct scsi_cmd *srb, struct us_data
>> *ss) {
>>  	int retries = 10;
>> +	int gave_extra_time = 0;
>>  
>>  	do {
>>  		memset(&srb->cmd[0], 0, 12);
>> @@ -1038,6 +1084,17 @@ static int usb_test_unit_ready(struct scsi_cmd
>> *srb, struct us_data *ss) if ((srb->sense_buf[2] == 0x02) &&
>>  		    (srb->sense_buf[12] == 0x3a))
>>  			return -1;
>> +		/*
>> +		 * If the status is "Not Ready - becoming ready",
>> give it
>> +		 * more time.  Linux issues a spinup command (once)
>> and gives
>> +		 * it 100 seconds.
>> +		 */
>> +		if (srb->sense_buf[2] == 0x02 &&
>> +		    srb->sense_buf[12] == 0x04 &&
>> +		    gave_extra_time == 0) {
>> +			retries = 100; /* Allow 10 seconds. */
>> +			gave_extra_time = retries;
>> +		}
> 
> The above code looks like targeting the HDDs, but some degraded pen
> drives may need this time as well for proper operation.

Hard drive spin up handling. This could be a separate patch too.

>>  		mdelay(100);
>>  	} while (retries--);
>>  
>> @@ -1166,22 +1223,57 @@ static unsigned long usb_stor_read(struct
>> blk_desc *block_dev, lbaint_t blknr, block_dev->devnum, start, blks,
>> buf_addr); 
>>  	do {
>> -		/* XXX need some comment here */
>> +		/*
>> +		 * Probably most errors are USB errors, not hard
>> disk error.
>> +		 * Many disks use a USB chip that is flaky when
>> doing large
>> +		 * transfers. The workaround is to dynamically
>> reduce the
>> +		 * transfer size and allow an additional try.
>> +		 * This should pick up flaky disks. Linux uses a
>> quirks table.
>> +		 * We'll use observation:
>> +		 * Give it 1 try very large, 1 try large, 2 tries
>> medium and 2
>> +		 * tries small(ish).
>> +		 * On a solid fail (actual disk error - which should
>> be rare),
>> +		 * this will give us 6 tries max, and only that many
>> if the read
>> +		 * is quite large.
>> +		 * A fail on a very short read obviously doesn't
>> have a
>> +		 * too-large max_blks.  Timeout due to spinup being
>> a case in
>> +		 * this point.
>> +		 */
>>  		retry = 2;
>>  		srb->pdata = (unsigned char *)buf_addr;
>> -		if (blks > ss->max_xfer_blk)
>> -			smallblks = ss->max_xfer_blk;
>> +retry_it:
>> +		if (blks > usb_max_xfer_blk)
>> +			smallblks = usb_max_xfer_blk;
>>  		else
>>  			smallblks = (unsigned short) blks;
>> -retry_it:
>> -		if (smallblks == ss->max_xfer_blk)
>> +
>> +		if (smallblks == usb_max_xfer_blk)
>>  			usb_show_progress();
>> +
>>  		srb->datalen = block_dev->blksz * smallblks;
>>  		srb->pdata = (unsigned char *)buf_addr;
>>  		if (usb_read_10(srb, ss, start, smallblks)) {
>>  			debug("Read ERROR\n");
>>  			ss->flags &= ~USB_READY;
>>  			usb_request_sense(srb, ss);
>> +
>> +			/* Dynamically reduce the I/O size. */
>> +			if (smallblks > 2047) {
>> +				usb_max_xfer_blk = 2047;
>> +				++retry;
>> +			} else if (smallblks > 512) {
>> +				usb_max_xfer_blk = 512;
>> +				++retry;
>> +			} else if (smallblks > 511) {
>> +				usb_max_xfer_blk = 511;
>> +				++retry;
>> +			} else if (smallblks > 63) {
>> +				usb_max_xfer_blk = 63;
>> +				retry += 2;
>> +			}
>> +			debug("step down usb_max_xfer_blk to %d\n",
>> +			      usb_max_xfer_blk);
>> +
> 
> This shall not be needed, as the ehci controller's driver will scatter
> the transfers up till 240 blocks max. Am I right?
> 
> The above code was supposed to work before the 240 block limitation was
> in place?

Correct, this code is wrong.

[...]

>> +	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
>>  	if (!(QT_TOKEN_GET_STATUS(qhtoken) &
>> QT_TOKEN_STATUS_ACTIVE)) { debug("TOKEN=%#x\n", qhtoken);
>> +		if (qhtoken & QT_TOKEN_STATUS_XACTERR) {
> 			      ^^^^^^^^^^^^^^^^^^^^^^^^ - this flag
> 			      before this patch was also not checked. 
> 
>> +			if (--trynum >= 0) {
>> +				/*
>> +				 * It is necessary to do this,
>> otherwise the
>> +				 * disk is clagged.
>> +				 */
>> +				debug("reset the TD and redo,
>> because of XACTERR\n");
>> +				qhtoken &= ~QT_TOKEN_STATUS_HALTED;
>> +				qhtoken |= QT_TOKEN_STATUS_ACTIVE |
>> +					QT_TOKEN_CERR(2);
>> +				vtd->qt_token = cpu_to_hc32(qhtoken);
>> +				qh->qh_overlay.qt_token =
>> cpu_to_hc32(qhtoken);
>> +				goto retry_xacterr;
>> +			}
>> +			dev->status = USB_ST_XACTERR;
>> +			dev->act_len = length -
>> +				QT_TOKEN_GET_TOTALBYTES(qhtoken);
> 
> This may solve the issue - as it resets the controller in the case of
> transfer error.

Do you need to reset the controller ? Or is there some graceful way out,
i.e. a way to recover from the error ?
[...]