[U-Boot] [PATCH 6/9] CACHE: nand read/write: Test if start address is aligned

Tue Jun 26 21:25:04 CEST 2012

On 06/25/2012 08:33 PM, Marek Vasut wrote:
> Dear Scott Wood,
> 
>> On 06/25/2012 06:37 PM, Marek Vasut wrote:
>>> Dear Scott Wood,
>>>
>>>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
>>>>> but that involves a lot of copying and therefore degrades performance
>>>>> rapidly. Therefore disallow this possibility of unaligned load
>>>>> address altogether if data cache is on.
>>>>
>>>> How about using the bounce buffer only if the address is misaligned?
>>>
>>> Not happening, bounce buffer is bullshit,
>>
>> Hacking up the common frontend with a new limitation because you can't
>> be bothered to fix your drivers is bullshit.
> 
> The drivers are not broken, they have hardware limitations. 

They're broken because they ignore those limitations.

> And checking for 
> those has to be done as early as possible.

Why?

> And it's not a new common frontend!

No, it's a compatibility-breaking change to the existing common frontend.

>>> It's like driving a car in the wrong lane. Sure, you can do it, but it'll
>>> eventually have some consequences. And using a bounce buffer is like
>>> driving a tank in the wrong lane ...
>>
>> Using a bounce buffer is like parking your car before going into the
>> building, rather than insisting the building's hallways be paved.
> 
> The other is obviously faster, more comfortable and lets you carry more stuff at 
> once.

Then you end up needing buildings to be many times as large to give
every cubicle an adjacent parking spot, maneuvering room, etc.  You'll
be breathing fumes all day, and it'll be a lot less comfortable to get
even across the hallway without using a car, etc.  Communication between
coworkers would be limited to horns and obscene gestures. :-)

> And if you drive a truck, you can dump a lot of payload instead of 
> carrying it back and forth from the building. That's why there's a special 
> garage for trucks possibly with cargo elevators etc.

Yes, it's called targeted optimization rather than premature optimization.

>>>> The
>>>> corrective action a user has to take is the same as with this patch,
>>>> except for an additional option of living with the slight performance
>>>> penalty.
>>>
>>> Slight is a very weak word here.
>>
>> Prove me wrong with benchmarks.
> 
> Well, copying data back and forth is tremendous overhead. You don't need a 
> benchmark to calculate something like this:
> 
> 133MHz SDRAM (pumped) gives you what ... 133 Mb/s throughput

You're saying you get only a little more bandwidth from memory than
you'd get from a 100 Mb/s ethernet port?  Come on.  Data buses are not
one bit wide.
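
To make the bus-width point concrete, here is a rough sketch of theoretical peak bandwidth as clock rate times bus width times transfers per clock. The 32-bit and 16-bit widths are assumptions for illustration only, not figures for any particular board, and the result ignores refresh, command overhead and everything else that reduces real throughput.

```c
#include <stdint.h>

/*
 * Theoretical peak memory bandwidth in bytes per second.
 * bus_bits and transfers_per_clock are illustrative parameters;
 * real sustained throughput is lower (refresh, row activation, etc.).
 */
static uint64_t peak_bytes_per_sec(uint64_t clock_hz, unsigned bus_bits,
                                   unsigned transfers_per_clock)
{
    return clock_hz * bus_bits * transfers_per_clock / 8;
}
```

With a 133 MHz clock, a one-bit bus gives the quoted 133 Mb/s (about 16.6 MB/s), while an assumed 32-bit single-data-rate bus gives 532 MB/s peak.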

And how fast can you pull data out of a NAND chip, even with DMA?

> (now if it's DDR, dual/quad pumped, that doesn't give you any more advantage 

So such things were implemented for fun?

> since you have to: send address, read the data, send address, write the data ... 

What about bursts?  I'm pretty sure you don't have to send the address
separately for every single byte.

> this is expensive ... without data cache on, even more so)

Why do we care about "without data cache"?  You don't need the bounce
buffer in that case.

> Now consider you do it via really dumb memcpy, what happens:

It looks like ARM U-Boot has an optimized memcpy.

> 1) You need to read the data into register
> 1a) Send address
> 1b) Read the data into register
> 2) You need to write the data to a new location
> 2a) Send address
> 2b) Write the data into the memory
> 
> In the meantime, you get some refresh cycles etc. Now, if you take read and 
> write in 1 time unit and "send address" in 0.5 time unit (this gives total 3 
> time units per one loop) and consider you're not doing sustained read/write, you 
> should see you'll be able to copy at speed of about 133/3 ~= 40Mb/s
> 
> If you want to load 3MB kernel at 40Mb/s onto an unaligned address via DMA, the 
> DMA will deploy it via sustained write, that'll be at 10MB/s, therefore in 
> 300ms. But the subsequent copy will take another 600ms.

On a p5020ds, using NAND hardware that doesn't do DMA at all, I'm able
to load a 3MiB image from NAND in around 300-400 ms.  This is using
memcpy_fromio() on an uncached hardware buffer.

Again, I'm not saying that bounce buffers are always negligible overhead
-- just that I doubt NAND is fast enough that it makes a huge difference
in this specific case.
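
As a back-of-the-envelope sketch of that comparison: the 3 MiB size and the roughly 300 ms NAND time come from the measurement above (about 10 MiB/s), while the 200 MiB/s cached memcpy rate is purely an assumed placeholder, not a measured value. Substitute real numbers for the platform in question.

```c
#include <stdint.h>

/* Time in milliseconds to move size_bytes at rate_bytes_per_sec. */
static uint64_t transfer_ms(uint64_t size_bytes, uint64_t rate_bytes_per_sec)
{
    return size_bytes * 1000 / rate_bytes_per_sec;
}
```

Under those assumptions, transfer_ms() gives 300 ms for the NAND read and 15 ms for the extra copy, so the bounce buffer would add about 5% to the load time.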

>>>> How often does this actually happen?  How much does it
>>>> actually slow things down compared to the speed of the NAND chip?
>>>
>>> If the user is dumb, always. But if you tell the user how to milk the
>>> most of the hardware, he'll be happier.
>>
>> So, if you use bounce buffers conditionally (based on whether the
>> address is misaligned), there's no impact except to "dumb" users, and
>> for those users they would merely get a performance degradation rather
>> than breakage.  How is this "bullshit"?
> 
> Correct, but users will complain if they get subpar performance.

If you expend the minimal effort required to make the bounce buffer
usage conditional on the address actually being misaligned, the only
users that will see subpar performance are those who would see breakage
with your approach.  Users will complain if they see breakage even more
than when they see subpar performance.
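
A minimal user-space sketch of that conditional path follows. Every name in it (CACHE_LINE, hw_read_fn, fake_hw_read, read_maybe_bounce) is hypothetical and is not taken from the U-Boot NAND code; the 64-byte line size is an assumption, and aligned_alloc() stands in for whatever aligned allocator the real environment provides. Only the start address is checked here, matching the patch subject; a real driver would also have to deal with the end of the buffer.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64u  /* assumption: the real value is platform-specific */

/* Hypothetical stand-in for a DMA read that needs an aligned destination. */
typedef int (*hw_read_fn)(void *dst, size_t len);

static int is_aligned(const void *p)
{
    return ((uintptr_t)p & (CACHE_LINE - 1)) == 0;
}

/* Stub "hardware": rejects a misaligned dst, otherwise fills a pattern. */
static int fake_hw_read(void *dst, size_t len)
{
    if (!is_aligned(dst))
        return -1;
    memset(dst, 0xA5, len);
    return 0;
}

/* Returns 0 on success; *bounced is set to 1 only if the slow path ran. */
static int read_maybe_bounce(hw_read_fn hw_read, void *dst, size_t len,
                             int *bounced)
{
    size_t padded;
    void *tmp;
    int ret;

    *bounced = 0;
    if (len == 0)
        return 0;
    if (is_aligned(dst))
        return hw_read(dst, len);       /* fast path: no extra copy */

    /* Slow path: read into an aligned temporary, then copy out. */
    padded = (len + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    tmp = aligned_alloc(CACHE_LINE, padded);
    if (!tmp)
        return -1;
    *bounced = 1;
    ret = hw_read(tmp, len);
    if (ret == 0)
        memcpy(dst, tmp, len);
    free(tmp);
    return ret;
}
```

Callers that pass an aligned address never touch the temporary buffer, so they pay nothing for the fallback.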

If the issue is educating the user to avoid the performance hit
(regardless of magnitude), and you care enough, have the driver print a
warning (not error) message the first time it needs to use a bounce buffer.
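
A one-shot warning along those lines could look roughly like this. The function name and message text are made up for illustration, and fprintf(stderr, ...) stands in for whatever console output the driver would actually use.

```c
#include <stdio.h>

static int bounce_warned;  /* set once the warning has been printed */

/* Prints the warning on the first call only; returns 1 if it printed. */
static int warn_bounce_once(const void *addr)
{
    if (bounce_warned)
        return 0;
    bounce_warned = 1;
    fprintf(stderr,
            "Warning: buffer %p is not cache-line aligned; "
            "using a bounce buffer (slower)\n", addr);
    return 1;
}
```

The transfer still goes ahead after the message; the user just gets a hint about how to avoid the extra copy.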

-Scott