[PATCH] usb: ehci: Fix "EHCI timed out on TD - token=XXXX" error on ehci-hcd

Lukasz Majewski lukma at denx.de
Tue Mar 3 00:25:26 CET 2020


Hi Marek,

> On 3/2/20 2:21 PM, Lukasz Majewski wrote:
> > Hi,  
> 
> Hi,
> 
> >> This patch aims to improve robustness of 'usb' command operation on
> >> the ehci-hcd IP block as it ports to contemporary U-Boot the patch
> >> described and provided in [1] (originally applicable to U-Boot
> >> 2016.05).
> >>
> >> According to the fix author - "rayvt" (from [1]):
> >>
> >> TD stands for "Queue Element Transfer Descriptor (qTD)", which is a
> >> micro-code instruction for the EHCI USB chip.
> >> The "token" field is detailed information, control, and status of
> >> the TD's data transfer. In particular, the rightmost byte is the
> >> status field. 0x80 bit means the TD is active and the data
> >> transfer has not yet completed. 0x08 bit means there was some sort
> >> of data transfer error (XactErr).  
> > 	 ^^^^^^^ - this error is NOT handled now in mainline
> > u-boot. Do you have any experience with it?  
> 
> No, but then I also didn't run into those yet.

Ok.

> 
> >> If the drive gets a transfer error, it refuses to do any other I/O
> >> transfer until the error is properly cleared and reset. U-boot did
> >> not do this, so every subsequent disk read would get a timeout
> >> error because the drive would not send any data. The fix is to
> >> properly clear and reset the USB disk when it gets a transfer
> >> error.  
> 
> So basically, what this says is, "we crashed the drive, because the
> drive suffered counter overflow, thus we fix it by resetting the
> drive". 

The transfer error may be also caused by some timeout from internal
controller when the USB disk is worn out. Counter overflow may not be
the only reason.

> No, the fix is to not crash the drive in the first place.

From 2019.0{14} a lot was done to make "usb" command more robust, but
still some mishaps happen.

> 
> >> Every read operation starts at the maximum block size. When the USB
> >> pendrive is not able to correctly serve this data read request, the
> >> dynamic reduction of IO size is performed. Up to six tries (with
> >> smaller IO block each time) are attempted.
> >>
> >> A related problem is that some drives are slow to come up. Linux
> >> handles this by issuing a spinup command and allowing more time for
> >> the drive to respond. The same idea is applied in this fix.
> >>
> >> On TPC70 (i.MX6Q) once per ~10 times (without this patch):
> >>
> >> Bus usb at 2184200: USB EHCI 1.00
> >> scanning bus usb at 2184200 for devices... 2 USB Device(s) found
> >>        scanning usb for storage devices... 1 Storage Device(s)
> >> found EHCI timed out on TD - token=0x1f8c80  
> > 
> > This is how the error gets evident. The detailed explanation is in
> > link [1].  
> 
> Is there one specific post in that forum, or do I need to read through
> the whole multi-page thread ?

There is a thread pointed out by the below link which explains the
issue thoroughly:
https://forum.doozan.com/read.php?3,35295,35295#msg-35295

(The first 2 posts from 'rayvt' explains the problem).

> 
> I would expect that if you run -- I assume 'usb reset' (the command is
> missing above) -- then the bus gets power-cycled, hence the USB device
> also gets power-cycled.

Unfortunately, the 'usb reset' is not fixing the issue. The controller
is unresponsive and the only way for recovering is a power cycle.

> 
> >> Performance impact:
> >>
> >> With U-Boot -master in mainline:
> >> 16869440 bytes read in 979 ms (16.4 MiB/s)
> >>
> >> With this patch:
> >> 16869440 bytes read in 1219 ms (13.2 MiB/s)
> >>
> >> Links:
> >> [1] - https://forum.doozan.com/read.php?3,35295,35295#msg-35295
> >> [2] -
> >> https://www.dropbox.com/s/nrkrd1no63viuu8/uboot-bodhi-2016.05-timeoutTD.patch?dl=0
> >>
> >> Signed-off-by: Lukasz Majewski <lukma at denx.de>
> >> [Unfortunately, the original patch [2] did not contain S-o-B from
> >> the original author - "rayvt"]
> >> ---
> >>
> >>  common/usb.c                |  10 +++-
> >>  common/usb_storage.c        | 106
> >> +++++++++++++++++++++++++++++++++--- drivers/usb/host/ehci-hcd.c |
> >> 55 +++++++++++++++++-- include/usb.h               |   1 +
> >>  include/usb_defs.h          |   1 +
> >>  5 files changed, 160 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/common/usb.c b/common/usb.c
> >> index 349e838f1d..305482b5bb 100644
> >> --- a/common/usb.c
> >> +++ b/common/usb.c
> >> @@ -925,14 +925,20 @@ static int get_descriptor_len(struct
> >> usb_device *dev, int len, int expect_len) __maybe_unused struct
> >> usb_device_descriptor *desc; ALLOC_CACHE_ALIGN_BUFFER(unsigned
> >> char, tmpbuf, USB_BUFSIZ); int err;
> >> +	int retry = 5;
> >>  
> >>  	desc = (struct usb_device_descriptor *)tmpbuf;
> >>  
> >> +again:
> >>  	err = usb_get_descriptor(dev, USB_DT_DEVICE, 0, desc,
> >> len); if (err < expect_len) {
> >>  		if (err < 0) {
> >> -			printf("unable to get device descriptor
> >> (error=%d)\n",
> >> -				err);
> >> +			printf("unable to get device descriptor
> >> (error=%d) retry: %d\n",
> >> +			       err, retry);
> >> +			mdelay(50);
> >> +			if (--retry >= 0)
> >> +				/* Some drives are just slow to
> >> wake up. */
> >> +				goto again;  
> > 
> > This I think is just to "wait" for the inserted device if it is
> > rather slow (or probably its memory got degraded over time).
> > 
> > This shouldn't bring any transfer slow down.  
> 
> Then this could be a separate patch ?

Yes. I think so.

> 
> >>  			return err;
> >>  		} else {
> >>  			printf("USB device descriptor short read
> >> (expected %i, got %i)\n", diff --git a/common/usb_storage.c
> >> b/common/usb_storage.c index 097b6729c1..48c8c2ae64 100644
> >> --- a/common/usb_storage.c
> >> +++ b/common/usb_storage.c
> >> @@ -111,6 +111,18 @@ int usb_stor_get_info(struct usb_device *dev,
> >> struct us_data *us, struct blk_desc *dev_desc);
> >>  int usb_storage_probe(struct usb_device *dev, unsigned int ifnum,
> >>  		      struct us_data *ss);
> >> +
> >> +#ifdef CONFIG_USB_EHCI_HCD
> >> +	/*
> >> +	 * The U-Boot EHCI driver can handle any transfer length
> >> as long as
> >> +	 * there is enough free heap space left, but the SCSI
> >> READ(10) and
> >> +	 * WRITE(10) commands are limited to 65535 blocks.
> >> +	 */
> >> +int usb_max_xfer_blk = 4096;  
> > 
> > Setting it to magic value - 4096 may be indeed problematic.
> > However, I don't know what is the 'usual' transfer size.  
> 
> 240 blocks , see 7d6fd7f0ba71cd93d94079132f958d9630f27a89 .

Ok.

> 
> >> +#else
> >> +int usb_max_xfer_blk = 20;  
> > 
> > This is also a magic value - probably took from some old U-Boot
> > (when default transfer was set to this value).
> > 
> > I do guess that this snippet shall be removed from this patch set.  
> 
> Probably.

Ok.

> 
> >> +#endif
> >> +
> >>  #if CONFIG_IS_ENABLED(BLK)
> >>  static unsigned long usb_stor_read(struct udevice *dev, lbaint_t
> >> blknr, lbaint_t blkcnt, void *buffer);
> >> @@ -729,6 +741,7 @@ static int usb_stor_BBB_transport(struct
> >> scsi_cmd *srb, struct us_data *us) pipeout =
> >> usb_sndbulkpipe(us->pusb_dev, us->ep_out); /* DATA phase + error
> >> handling */ data_actlen = 0;
> >> +	mdelay(10);		/* Like linux does. */
> >>  	/* no data, go immediately to the STATUS phase */
> >>  	if (srb->datalen == 0)
> >>  		goto st;
> >> @@ -740,6 +753,16 @@ static int usb_stor_BBB_transport(struct
> >> scsi_cmd *srb, struct us_data *us) 
> >>  	result = usb_bulk_msg(us->pusb_dev, pipe, srb->pdata,
> >> srb->datalen, &data_actlen, USB_CNTL_TIMEOUT * 5);
> >> +
> >> +	/* special handling of XACTERR in DATA phase */
> >> +	if (result < 0 && (us->pusb_dev->status &
> >> USB_ST_XACTERR)) {
> >> +		debug("XACTERR in data phase - clr, reset, and
> >> return fail.\n");
> >> +		usb_stor_BBB_clear_endpt_stall(us, dir_in ?
> >> +					       us->ep_in :
> >> us->ep_out);
> >> +		usb_stor_BBB_reset(us);
> >> +		return USB_STOR_TRANSPORT_FAILED;
> >> +	}
> >> +  
> > 
> > Here the case of XACTERR is handled. This is missing now in
> > -master.  
> 
> Surely this could be added as a separate patch ?

Ok.

> 
> >>  	/* special handling of STALL in DATA phase */
> >>  	if ((result < 0) && (us->pusb_dev->status &
> >> USB_ST_STALLED)) { debug("DATA:stall\n");
> >> @@ -1013,9 +1036,32 @@ static int usb_request_sense(struct scsi_cmd
> >> *srb, struct us_data *ss) return 0;
> >>  }
> >>  
> >> +/*
> >> + * This spins up the disk and also consumes the time that the
> >> + * disk takes to become active and ready to read data.
> >> + * Some drives (like Western Digital) can take more than 5
> >> seconds.
> >> + * The delay occurs on the 1st data read from the disk.
> >> + * Extending the timeout here works better than handling the
> >> timeout
> >> + * as an error on a "real" read.
> >> + */
> >> +static int usb_spinup(struct scsi_cmd *srb, struct us_data *ss)
> >> +{
> >> +	memset(&srb->cmd[0], 0, 12);
> >> +	srb->cmd[0] = SCSI_START_STP;
> >> +	srb->cmd[1] = srb->lun << 5;
> >> +	srb->cmd[4] = 1;	/* Start spinup. */
> >> +	srb->datalen = 0;
> >> +	srb->cmdlen = 6;
> >> +	ss->pusb_dev->extra_timeout = 9876;
> >> +	ss->transport(srb, ss);
> >> +	ss->pusb_dev->extra_timeout = 0;
> >> +	return 0;
> >> +}
> >> +
> >>  static int usb_test_unit_ready(struct scsi_cmd *srb, struct
> >> us_data *ss) {
> >>  	int retries = 10;
> >> +	int gave_extra_time = 0;
> >>  
> >>  	do {
> >>  		memset(&srb->cmd[0], 0, 12);
> >> @@ -1038,6 +1084,17 @@ static int usb_test_unit_ready(struct
> >> scsi_cmd *srb, struct us_data *ss) if ((srb->sense_buf[2] == 0x02)
> >> && (srb->sense_buf[12] == 0x3a))
> >>  			return -1;
> >> +		/*
> >> +		 * If the status is "Not Ready - becoming ready",
> >> give it
> >> +		 * more time.  Linux issues a spinup command
> >> (once) and gives
> >> +		 * it 100 seconds.
> >> +		 */
> >> +		if (srb->sense_buf[2] == 0x02 &&
> >> +		    srb->sense_buf[12] == 0x04 &&
> >> +		    gave_extra_time == 0) {
> >> +			retries = 100; /* Allow 10 seconds. */
> >> +			gave_extra_time = retries;
> >> +		}  
> > 
> > The above code looks like targeting the HDDs, but some degraded pen
> > drives may need this time as well for proper operation.  
> 
> Hard drive spin up handling. This could be a separate patch too.
> 

Ok.

> >>  		mdelay(100);
> >>  	} while (retries--);
> >>  
> >> @@ -1166,22 +1223,57 @@ static unsigned long usb_stor_read(struct
> >> blk_desc *block_dev, lbaint_t blknr, block_dev->devnum, start,
> >> blks, buf_addr); 
> >>  	do {
> >> -		/* XXX need some comment here */
> >> +		/*
> >> +		 * Probably most errors are USB errors, not hard
> >> disk error.
> >> +		 * Many disks use a USB chip that is flaky when
> >> doing large
> >> +		 * transfers. The workaround is to dynamically
> >> reduce the
> >> +		 * transfer size and allow an additional try.
> >> +		 * This should pick up flaky disks. Linux uses a
> >> quirks table.
> >> +		 * We'll use observation:
> >> +		 * Give it 1 try very large, 1 try large, 2 tries
> >> medium and 2
> >> +		 * tries small(ish).
> >> +		 * On a solid fail (actual disk error - which
> >> should be rare),
> >> +		 * this will give us 6 tries max, and only that
> >> many if the read
> >> +		 * is quite large.
> >> +		 * A fail on a very short read obviously doesn't
> >> have a
> >> +		 * too-large max_blks.  Timeout due to spinup
> >> being a case in
> >> +		 * this point.
> >> +		 */
> >>  		retry = 2;
> >>  		srb->pdata = (unsigned char *)buf_addr;
> >> -		if (blks > ss->max_xfer_blk)
> >> -			smallblks = ss->max_xfer_blk;
> >> +retry_it:
> >> +		if (blks > usb_max_xfer_blk)
> >> +			smallblks = usb_max_xfer_blk;
> >>  		else
> >>  			smallblks = (unsigned short) blks;
> >> -retry_it:
> >> -		if (smallblks == ss->max_xfer_blk)
> >> +
> >> +		if (smallblks == usb_max_xfer_blk)
> >>  			usb_show_progress();
> >> +
> >>  		srb->datalen = block_dev->blksz * smallblks;
> >>  		srb->pdata = (unsigned char *)buf_addr;
> >>  		if (usb_read_10(srb, ss, start, smallblks)) {
> >>  			debug("Read ERROR\n");
> >>  			ss->flags &= ~USB_READY;
> >>  			usb_request_sense(srb, ss);
> >> +
> >> +			/* Dynamically reduce the I/O size. */
> >> +			if (smallblks > 2047) {
> >> +				usb_max_xfer_blk = 2047;
> >> +				++retry;
> >> +			} else if (smallblks > 512) {
> >> +				usb_max_xfer_blk = 512;
> >> +				++retry;
> >> +			} else if (smallblks > 511) {
> >> +				usb_max_xfer_blk = 511;
> >> +				++retry;
> >> +			} else if (smallblks > 63) {
> >> +				usb_max_xfer_blk = 63;
> >> +				retry += 2;
> >> +			}
> >> +			debug("step down usb_max_xfer_blk to
> >> %d\n",
> >> +			      usb_max_xfer_blk);
> >> +  
> > 
> > This shall not be needed, as the ehci controller's driver will
> > scatter the transfers up till 240 blocks max. Am I right?
> > 
> > The above code was supposed to work before the 240 block limitation
> > was in place?  
> 
> Correct, this code is wrong.

Ok.

> 
> [...]
> 
> >> +	qhtoken = hc32_to_cpu(qh->qh_overlay.qt_token);
> >>  	if (!(QT_TOKEN_GET_STATUS(qhtoken) &
> >> QT_TOKEN_STATUS_ACTIVE)) { debug("TOKEN=%#x\n", qhtoken);
> >> +		if (qhtoken & QT_TOKEN_STATUS_XACTERR) {  
> > 			      ^^^^^^^^^^^^^^^^^^^^^^^^ - this flag
> > 			      before this patch was also not
> > checked. 
> >> +			if (--trynum >= 0) {
> >> +				/*
> >> +				 * It is necessary to do this,
> >> otherwise the
> >> +				 * disk is clagged.
> >> +				 */
> >> +				debug("reset the TD and redo,
> >> because of XACTERR\n");
> >> +				qhtoken &=
> >> ~QT_TOKEN_STATUS_HALTED;
> >> +				qhtoken |= QT_TOKEN_STATUS_ACTIVE
> >> |
> >> +					QT_TOKEN_CERR(2);
> >> +				vtd->qt_token =
> >> cpu_to_hc32(qhtoken);
> >> +				qh->qh_overlay.qt_token =
> >> cpu_to_hc32(qhtoken);
> >> +				goto retry_xacterr;
> >> +			}
> >> +			dev->status = USB_ST_XACTERR;
> >> +			dev->act_len = length -
> >> +				QT_TOKEN_GET_TOTALBYTES(qhtoken);
> >>  
> > 
> > This may solve the issue - as it resets the controller in the case
> > of transfer error.  
> 
> Do you need to reset the controller ? Or is there some graceful way
> out, i.e. a way to recover from the error ?

The original patch was developed when the async transmission was start
and stopped for each transfer (before your optimization), hence the code
for disabling it.

I need to check if it would be enough to just clear the
QT_TOKEN_STATUS_HALTED and set QT_TOKEN_STATUS_ACTIVE again.

> [...]




Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma at denx.de
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <https://lists.denx.de/pipermail/u-boot/attachments/20200303/0308563c/attachment.sig>


More information about the U-Boot mailing list