[U-Boot] [PATCH v2 2/5] ehci-hcd: Boost transfer speed

Tue Jul 31 00:38:54 CEST 2012

Dear Benoît Thébaudeau,

[...]

> > > Can you explain where this gain would come from? In both cases, the
> > > data in USB
> > > transfers would be organized in the same way, and it would be
> > > accessed in memory
> > > also in the same way (regarding bursts). The only difference would
> > > be the fetch
> > > time of a little bit more qTDs, which is extremely fast and
> > > insignificant
> > > compared to the transfer time of the payload, which remains
> > > unchanged.
> > 
> > You are right, the speed different will be minimal, only the memory
> > usage will be lower.
> 
> If your point is only the memory gain, I agree. With your suggestion, there
> are roughly 25% less qTDs used in the "(max wMaxPacketSize)-aligned but
> not page-aligned" case since the number of qTDs is about (total transfer
> size) / 5 instead of (total transfer size) / 4. But this is still small
> compared to usual heap sizes (at least on the kind of hardware I use).

Ok, I see the point. I understand it's not really a bug, just an improvement. 
Maybe we can do a subsequent patch on top of these from Benoit and see how it 
fares?

> > > Moreover, in your use case, if you are e.g. using FAT, on the one
> > > hand, the
> > > buffers in fat.c are never aligned to more than the DMA min
> > > alignment, and on
> > > the other hand, if you can align your user buffers to 512 bytes,
> > > you can also
> > > align them directly to 4 kB.
> > 
> > The user buffer is aligned to 4kB, but this doesn't matter as a file
> > load from a storage device (ex. fatload) can be segmented in partial
> > USB
> > transfers. This can lead to any block aligned buffer for a partial
> > transfer.
> 
> What do you mean by "partial USB transfers"? As seen from EHCI users like
> the MSC driver (usb_storage.c), USB transfers either succeed or fail, but
> they cannot be "segmented".

Segmented -- like multiple transfers being issues with small payload? You can 
not put these together at the USB-level, since it's the issuing code that has to 
be fixed.

> On its side, the MSC driver will only segment the FAT layer requests if
> they are larger than 65535 blocks, so still not what you describe.
> 
> As to the FAT stack, it will only read whole clusters while accessing file
> payload, and the most usual cluster sizes are by default a multiple of 4
> kiB (see http://support.microsoft.com/kb/140365).

512b is minimum and it's quite often used.

> So I don't see "segmentation" anywhere, and for usual cluster sizes, the
> EHCI buffer alignment is fully determined by the applicative buffer
> alignment and the file position corresponding to the beginning of the
> applicative buffer. But there are indeed some unusual use cases (e.g.
> smaller clusters) for which only a block-aligned buffer will reach EHCI
> despite a page-aligned applicative buffer.

I don't quite get this one.

> > >> My suggestion would be to truncate the xfr_bytes with the max
> > >> wMaxPacketSize (1024) and for the qtd_count use:
> > >> 
> > >> if ((uint32_t)buffer & 1023)    /* wMaxPacketSize unaligned */
> > >> 
> > >>       qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
> > >>       
> > >>               length, (QT_BUFFER_CNT - 1) * 4096);
> > >> 
> > >> else                /* wMaxPacketSize aligned */
> > >> 
> > >>       qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
> > >>       
> > >>               length, QT_BUFFER_CNT * 4096);
> > >> 
> > >> This allows 50% of unaligned block data (512) to be transferred
> > >> with
> > >> min
> > >> qTDs.
> > > 
> > > That would also require a realignment-to-page stage. This is
> > > specific code for
> > > specific buffer alignment from the upper layers. We could also skip
> > > the
> > > realignment to page and always keep the same qTD transfer size
> > > except for the
> > > last one, by adding as many packets as possible for the buffer
> > > alignment.
> > 
> > What you mean by realignment-to-page stage?
> 
> I mean that the alignment of the transfer to 1024 instead of 4096 can make
> the first qTD transfer larger than the following ones, which guarantees
> that the following qTD transfers are page-aligned, even if the first one
> was only aligned to 1024. For the 1024-aligned case, this results in the
> change that you suggest, but it also changes things for the unaligned
> case, which makes this part of the code inaccurate. See below.
> 
> > > But I still don't see a significant reason to complicate code to do
> > > that.
> > 
> > I don't understand where you expect to complicate the code.
> > 
> > You limit the size of one transfer (xfr_bytes) to (QT_BUFFER_CNT - 1)
> > *
> > 4kB for unaligned buffers. But you only need to limit it to a
> > multiple
> > of the maximal possible wMaxPacketSize (1kB) to make sure that you
> > always send full packages.
> > 
> > I only suggest to replace the causeless 4kB alignment with a
> > reasonable
> > 1kB alignment and adapte the qtd_count caluclation.
> > 
> >                          int xfr_bytes = min(left_length,
> >                          
> >                                              (QT_BUFFER_CNT * 4096 -
> >                                              
> >                                               ((uint32_t)buf_ptr &
> >                                               4095)) &
> > 
> > -                                           ~4095);
> > +                                           ~1023);
> 
> I agree for this part of the code. But for the allocation part, your
> suggestion is already a little bit more complicated than my original code,
> while still incomplete. Besides that, the "((uint32_t)buffer & 4095) +"
> for the page-unaligned case in my code was actually useless, which
> emphasizes the difference, even if it's a light complication.
> 
> For the allocation part, the appropriate code for your suggestion would be:
> 
> 		if ((uint32_t)buffer & 1023) /* max-wMaxPacketSize-unaligned */
> 			qtd_count +=

DIV_ROUND_UP(
    max(
        length > 0,
        length - (4096 - ((uint32_t)buf_ptr & 4095) & ~1023)
    ),
    (QT_BUFFER_CNT - 1) * 4096
);

Ok, I now think I understand what's going on here. I still have to wonder how 
much would the compiler optimize of this if you "decompressed" your code -- to 
make it more easy to understand.

> 		else			     /* max-wMaxPacketSize-aligned */
> 			qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
> 					length, QT_BUFFER_CNT * 4096);
> 
> This code allocates exactly the required number of qTDs, no less, no more.
> It's clearly more complicated than the 4096 version.
> 
> A test should also be added to make sure that qtd_count is never 0.
> Otherwise, ZLPs are broken (this applies to my original code too).
> 
> If we want to compromise accuracy for simplicity, we can change that to:
> 
> 		qtd_count += 2 + length /
> 			((QT_BUFFER_CNT - !!((uint32_t)buffer & 1023)) * 4096);
> 
> This code allocates enough qTDs for all cases, with at worst 2 extra qTDs
> (i.e. a total of 128 bytes) that will be left unused. It also handles
> intrinsically ZLPs.
> 
> Now, let's consider the possible values of wMaxPacketSize:
>  - control endpoints:
>     * LS: 8,
>     * FS: 8, 16, 32 or 64,
>     * HS: 64,
>  - isochronous endpoints: not supported by ehci-hcd.c,
>  - interrupt endpoints:
>     * LS: <= 8,
>     * FS: <= 64,
>     * HS: <= 1024 (1x to 3x for high-bandwidth),
>  - bulk endpoints:
>     * LS: N/A,
>     * FS: 8, 16, 32 or 64,
>     * HS: 512.
> 
> My code assumes that wMaxPacketSize is a power of 2. This is not always
> true for interrupt endpoints. Let's talk about these. Their handling is
> currently broken in U-Boot since their transfers are made asynchronous
> instead of periodic. Devices shouldn't care too much about that, as long
> as transfers do not exceed wMaxPacketSize, in which case my code still
> works because wMaxPacketSize always fits in a single qTD. Interrupt
> transfers larger than wMaxPacketSize do not seem to be used by U-Boot. If
> they were used, the current code in U-Boot would have a timing issue
> because the asynchronous scheme would break the interval requested by
> devices, which could at worst make them fail in some way. So the only
> solution would be that such transfers be split by the caller of
> submit_int_msg, in which case my code still works. What would you think
> about failing with an error message in submit_int_msg if length is larger
> than wMaxPacketSize? Marek, what do you think?

Let's do that ... I think the interrupt endpoint is only used for keyboard and 
if someone needs it for something else, the code will be there, just needing 
improvement. Comment and error message are OK.

> For all other cases, wMaxPacketSize is a power of 2, so everything is fine,
> except that in those cases wMaxPacketSize is at most 512, which means that
> with the suggested limitation applied to submit_int_msg, your suggested
> 1024 could be replaced with 512, which is good news since this is also the
> most common storage sector size.
>
> We could even use usb_maxpacket(dev, pipe) instead of 512, with some
> restrictions. If we don't want to alter the misalignment check in
> ehci_td_buffer, max(usb_maxpacket(dev, pipe), ARCH_DMA_MINALIGN) would
> actually have to be used. This misalignment check could be limited to the
> first qTD transfer of a USB transfer, but that would complicate things,
> all the more the corresponding call to flush_dcache_range would have to be
> modified to fix alignments.
> 
> So we have to make two choices:
>  - between 4096, 1024, 512 and max(usb_maxpacket(dev, pipe),
> ARCH_DMA_MINALIGN), - between the accurate and simple allocations.
> That makes a total of 8 working possibilities. What do you guys think we
> should choose? On my side, I like max(usb_maxpacket(dev, pipe),
> ARCH_DMA_MINALIGN)

Won't maxpacket fall below 512 on occasions, which might cause trouble?

> with the simple allocation. It's efficient as to code
> speed, size and readability, as well as to RAM usage.

For now, I'd go for the safest, easiest and dumbest solution and see how it 
fares. Subsequent patch can be submitted to improve that and measurements made.

"We should forget about small efficiencies, say about 97% of the time; premature 
optimization is the root of all evil"
            -- Donald E. Knuth, Structured Programming with go to Statements
[...]

> So we could perhaps issue a #error in ehci-hcd or in usb_storage if
> CONFIG_SYS_MALLOC_LEN is not large enough, but I don't think it's a good
> idea because:
>  - the threshold value would have to depend on runtime block sizes or
> something, which could lead to a big worst worst case that would almost
> never happen in real life, so giving such an unrealistic heap size
> constraint would be cumbersome,

#warning then?

>  - reaching the top sizes would mean reading a huge file or something to a
> large buffer (much larger than the qTDs this transfer requires), which
> would very likely be heap-allocated (except for commands like fatload), so
> CONFIG_SYS_MALLOC_LEN would already have to be large for the application,
> - for command line operations like fatload, transfers of uncontrolled
> lengths could simply crash U-Boot if they go too far in memory

Why, because they overwrite it?

> , which
> means that users of such commands need to know what they are doing anyway,
> so they have to control transfer sizes,
>  - there is already a runtime error displayed in case of allocation
> failure.

Ok

> Marek, what do you think?

Had a good evening with the EHCI r10 spec, hope I answered most of your 
questions.