[U-Boot] [PATCH v2 2/5] ehci-hcd: Boost transfer speed

Tue Jul 24 15:02:00 CEST 2012

Am 23.07.2012 19:15, schrieb Benoît Thébaudeau:
> On Monday 23 July 2012 15:35:25 Stefan Herbrechtsmeier wrote:
>> Am 20.07.2012 17:35, schrieb Benoît Thébaudeau:
>>> On Friday 20 July 2012 17:15:13 Stefan Herbrechtsmeier wrote:
>>>> Am 20.07.2012 17:03, schrieb Benoît Thébaudeau:
>>>>> On Friday 20 July 2012 16:51:33 Stefan Herbrechtsmeier wrote:
>>>>>> Am 20.07.2012 15:56, schrieb Benoît Thébaudeau:
>>>>>>> Dear Marek Vasut,
>>>>>>>
>>>>>>> On Friday 20 July 2012 15:44:01 Marek Vasut wrote:
>>>>>>>>> On Friday 20 July 2012 13:37:37 Stefan Herbrechtsmeier wrote:
>>>>>>>>>> Am 20.07.2012 13:26, schrieb Benoît Thébaudeau:
>>>>>>>>>>> +			int xfr_bytes = min(left_length,
>>>>>>>>>>> +					    (QT_BUFFER_CNT * 4096 -
>>>>>>>>>>> +					     ((uint32_t)buf_ptr & 4095)) &
>>>>>>>>>>> +					    ~4095);
>>>>>>>>>> Why you align the length to 4096?
>>>>>>>>> It's to guarantee that each transfer length is a multiple of
>>>>>>>>> the
>>>>>>>>> max packet
>>>>>>>>> length. Otherwise, early short packets are issued, which
>>>>>>>>> breaks
>>>>>>>>> the
>>>>>>>>> transfer and results in time-out error messages.
>>>>>>>> Early short packets ? What do you mean?
>>>>>>> During a USB transfer, all packets must have a length of max
>>>>>>> packet
>>>>>>> length for
>>>>>>> the pipe/endpoint, except the final one that can be a short
>>>>>>> packet.
>>>>>>> Without the
>>>>>>> alignment I make for xfr_bytes, short packets can occur within
>>>>>>> a
>>>>>>> transfer,
>>>>>>> because the hardware starts a new packet for each new queued
>>>>>>> qTD
>>>>>>> it
>>>>>>> handles.
>>>>>> But if I am right, the max packet length is 512 for bulk and
>>>>>> 1024
>>>>>> for
>>>>>> Interrupt transfer.
>>>>> There are indeed different max packet lengths for different
>>>>> transfer types, but
>>>>> it does not matter since the chosen alignment guarantees a
>>>>> multiple
>>>>> of all these
>>>>> possible max packet lengths.
>>>> But thereby you limit the transfer to 4 qT buffers for unaligned
>>>> transfers.
>>> Not exactly. The 5 qt_buffers are used for page-unaligned buffers,
>>> but that
>>> results in only 4 full pages of unaligned data, requiring 5 aligned
>>> pages.
>> Sorry I mean 4 full pages of unaligned data.
>>> For page-aligned buffers, the 5 qt_buffers result in 5 full pages
>>> of aligned
>>> data.
>> Sure.
>>> The unaligned case could be a little bit improved to always use as
>>> many packets
>>> as possible per qTD, but that would over-complicate things for a
>>> very negligible
>>> speed and memory gain.
>> In my use case (fragmented file on usb storage)  the gain would be
>> nearly 20%. The reason is that the data are block aligned (512) and
>> could be aligned to 4096 with the first transfer (5 qt_buffers).
> Can you explain where this gain would come from? In both cases, the data in USB
> transfers would be organized in the same way, and it would be accessed in memory
> also in the same way (regarding bursts). The only difference would be the fetch
> time of a little bit more qTDs, which is extremely fast and insignificant
> compared to the transfer time of the payload, which remains unchanged.
You are right, the speed different will be minimal, only the memory 
usage will be lower.

> Moreover, in your use case, if you are e.g. using FAT, on the one hand, the
> buffers in fat.c are never aligned to more than the DMA min alignment, and on
> the other hand, if you can align your user buffers to 512 bytes, you can also
> align them directly to 4 kB.
The user buffer is aligned to 4kB, but this doesn't matter as a file 
load from a storage device (ex. fatload) can be segmented in partial USB 
transfers. This can lead to any block aligned buffer for a partial transfer.

>
>> My suggestion would be to truncate the xfr_bytes with the max
>> wMaxPacketSize (1024) and for the qtd_count use:
>>
>> if ((uint32_t)buffer & 1023)    /* wMaxPacketSize unaligned */
>>       qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
>>               length, (QT_BUFFER_CNT - 1) * 4096);
>> else                /* wMaxPacketSize aligned */
>>       qtd_count += DIV_ROUND_UP(((uint32_t)buffer & 4095) +
>>               length, QT_BUFFER_CNT * 4096);
>>
>> This allows 50% of unaligned block data (512) to be transferred with
>> min
>> qTDs.
> That would also require a realignment-to-page stage. This is specific code for
> specific buffer alignment from the upper layers. We could also skip the
> realignment to page and always keep the same qTD transfer size except for the
> last one, by adding as many packets as possible for the buffer alignment.
What you mean by realignment-to-page stage?

> But I still don't see a significant reason to complicate code to do that.
I don't understand where you expect to complicate the code.

You limit the size of one transfer (xfr_bytes) to (QT_BUFFER_CNT - 1) * 
4kB for unaligned buffers. But you only need to limit it to a multiple 
of the maximal possible wMaxPacketSize (1kB) to make sure that you 
always send full packages.

I only suggest to replace the causeless 4kB alignment with a reasonable 
1kB alignment and adapte the qtd_count caluclation.

                         int xfr_bytes = min(left_length,
                                             (QT_BUFFER_CNT * 4096 -
                                              ((uint32_t)buf_ptr & 4095)) &
-                                           ~4095);
+                                           ~1023);

> BTW, the 15x speed gain that I gave in my patch description was compared to an
> older version of the original code that used 20 blocks per transfer in
> usb_storage.c. This is now 40 blocks per transfer with a page-aligned buffer, so
> the speed gain compared to the current code should be rather about 7x. I should
> update that.
I'm sure that there is a significant speed gain but you shouldn't miss 
the heap usage as the main CONFIG_SYS_MALLOC_LEN is 128kB.

Maybe you should also add a worst case heap usage and I'm not sure, if 
your calculation are right, as the size of struct qTD is allays 32B and 
thereby I get 50kB or 64kB.

Best regards,
     Stefan