[U-Boot] [PATCH 6/9] CACHE: nand read/write: Test if start address is aligned

Tue Jun 26 03:33:30 CEST 2012

Dear Scott Wood,

> On 06/25/2012 06:37 PM, Marek Vasut wrote:
> > Dear Scott Wood,
> > 
> >> On 06/24/2012 07:17 PM, Marek Vasut wrote:
> >>> This prevents the scenario where data cache is on and the
> >>> device uses DMA to deploy data. In that case, it might not
> >>> be possible to flush/invalidate data to RAM properly. The
> >>> other option is to use bounce buffer,
> >> 
> >> Or get cache coherent hardware. :-)
> > 
> > Oh ... you mean powerpc? Or rather something like this
> > http://cache.freescale.com/files/32bit/doc/fact_sheet/QORIQLS2FAMILYFS.pdf
> > ? :-D
> 
> The word "coherent/coherency" appears 5 times in that document by my
> count. :-)
> 
> I hope that applies to DMA, not just core-to-core.
> 
> >>> but that involves a lot of copying and therefore degrades performance
> >>> rapidly. Therefore disallow this possibility of unaligned load
> >>> address altogether if data cache is on.
> >> 
> >> How about use the bounce buffer only if the address is misaligned?
> > 
> > Not happening, bounce buffer is bullshit,
> 
> Hacking up the common frontend with a new limitation because you can't
> be bothered to fix your drivers is bullshit.

The drivers are not broken; they have hardware limitations. And checking for
those has to be done as early as possible. And it's not a new common frontend!

> > It's like driving a car in the wrong lane. Sure, you can do it, but it'll
> > eventually have some consequences. And using a bounce buffer is like
> > driving a tank in the wrong lane ...
> 
> Using a bounce buffer is like parking your car before going into the
> building, rather than insisting the building's hallways be paved.

Driving in is obviously faster, more comfortable and lets you carry more stuff
at once. And if you drive a truck, you can dump a lot of payload instead of
carrying it back and forth from the building. That's why there's a special
garage for trucks, possibly with cargo elevators etc.

> >> The
> >> corrective action a user has to take is the same as with this patch,
> >> except for an additional option of living with the slight performance
> >> penalty.
> > 
> > Slight is very weak word here.
> 
> Prove me wrong with benchmarks.

Well, copying data back and forth is a tremendous overhead. You don't need a
benchmark to calculate something like this:

133MHz SDRAM (pumped) gives you what ... 133 Mb/s throughput
(now if it's DDR, dual/quad pumped, that doesn't give you any more advantage 
since you have to: send address, read the data, send address, write the data ... 
this is expensive ... without data cache on, even more so)

Now consider you do it via a really dumb memcpy. What happens:
1) You need to read the data into register
1a) Send address
1b) Read the data into register
2) You need to write the data to a new location
2a) Send address
2b) Write the data into the memory
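
The loop described in steps 1) and 2) can be sketched in C roughly as below.
This is only an illustration of the access pattern, not code from U-Boot; the
name dumb_memcpy is made up for this example, and a real compiler or a real
memcpy may well do something smarter than one byte per iteration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Deliberately naive copy (hypothetical helper, for illustration only).
 * Each iteration performs one load and one store, matching steps 1) and 2).
 */
static void *dumb_memcpy(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(n == 0 || (dst != NULL && src != NULL));

	while (n--) {
		unsigned char tmp = *s++;	/* 1a) send address, 1b) read into register */
		*d++ = tmp;			/* 2a) send address, 2b) write to memory */
	}

	return dst;
}
```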

In the meantime, you get some refresh cycles etc. Now, if you count each read
and each write as 1 time unit and each "send address" as 0.5 time unit (this
gives a total of 3 time units per loop) and consider you're not doing sustained
read/write, you should see you'll be able to copy at a speed of about
133/3 ~= 40Mb/s

If you want to load a 3MB kernel onto an unaligned address via DMA, the DMA
will deploy it via sustained write; that'll be at 10MB/s, therefore in 300ms.
But the subsequent copy at 40Mb/s will take another 600ms.
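
To make the arithmetic above easy to re-check, here is a small sketch in C. It
assumes that "Mb" means megabits and "MB" means megabytes (that reading is an
assumption, but it is the one under which the 300ms and 600ms figures follow),
and it takes the 133, 10 and 40 figures from this mail at face value. The
function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective copy rate: the raw rate divided by the time units per loop.
 * The loop cost is passed in half units so that 0.5 stays an integer
 * (2 x 0.5 for the addresses + 1 read + 1 write = 3 units = 6 half units).
 */
static uint32_t copy_rate_mbit_per_s(uint32_t raw_mbit_per_s,
				     uint32_t half_units_per_loop)
{
	assert(half_units_per_loop != 0);
	return raw_mbit_per_s * 2u / half_units_per_loop;
}

/* Time for a sustained DMA write; size in megabytes, rate in megabytes/s. */
static uint32_t dma_time_ms(uint32_t size_mbyte, uint32_t rate_mbyte_per_s)
{
	assert(rate_mbyte_per_s != 0);
	return size_mbyte * 1000u / rate_mbyte_per_s;
}

/* Time for the extra copy; size in megabytes, rate in megabits/s. */
static uint32_t copy_time_ms(uint32_t size_mbyte, uint32_t rate_mbit_per_s)
{
	assert(rate_mbit_per_s != 0);
	return size_mbyte * 8u * 1000u / rate_mbit_per_s;
}
```

With these, copy_rate_mbit_per_s(133, 6) gives 44 (rounded down to 40 in the
text), dma_time_ms(3, 10) gives 300 and copy_time_ms(3, 40) gives 600.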

And now, I need someone to recalculate it. Also, read up here, it is _VERY_ good 
and it is certainly more accurate than my previous delirious attempt:

http://www.akkadia.org/drepper/cpumemory.pdf

> >> How often does this actually happen?  How much does it
> >> actually slow things down compared to the speed of the NAND chip?
> > 
> > If the user is dumb, always. But if you tell the user how to milk the
> > most of the hardware, he'll be happier.
> 
> So, if you use bounce buffers conditionally (based on whether the
> address is misaligned), there's no impact except to "dumb" users, and
> for those users they would merely get a performance degradation rather
> than breakage.  How is this "bullshit"?

Correct, but users will complain if they get subpar performance.
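
For reference, the conditional scheme being discussed (bounce only when the
address is misaligned) could look roughly like the sketch below. It is not
taken from the patch or from any U-Boot driver: CACHE_LINE, fake_dma_read and
read_with_optional_bounce are hypothetical names, the 32-byte line size is an
assumption, and the "DMA" is simulated with memcpy so the sketch runs on a
host.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 32u	/* assumed cache line size, hypothetical value */

static_assert((CACHE_LINE & (CACHE_LINE - 1)) == 0,
	      "cache line size must be a power of two");

/* True if the pointer starts on a cache line boundary. */
static int is_cache_aligned(const void *p)
{
	return ((uintptr_t)p & (CACHE_LINE - 1)) == 0;
}

/* Hypothetical stand-in for a DMA transfer into memory. */
static void fake_dma_read(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/*
 * Returns 0 if the transfer went directly to dst, 1 if a bounce buffer
 * was used, -1 if the bounce buffer could not be allocated.
 */
static int read_with_optional_bounce(void *dst, const void *src, size_t len)
{
	size_t padded;
	void *bounce;

	if (len == 0 || is_cache_aligned(dst)) {
		fake_dma_read(dst, src, len);	/* fast path, no extra copy */
		return 0;
	}

	/* aligned_alloc() needs a size that is a multiple of the alignment. */
	padded = (len + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
	bounce = aligned_alloc(CACHE_LINE, padded);
	if (!bounce)
		return -1;

	fake_dma_read(bounce, src, len);	/* DMA into the aligned buffer */
	memcpy(dst, bounce, len);		/* the extra copy being argued about */
	free(bounce);
	return 1;
}
```

Only the misaligned case pays for the allocation and the second copy; an
aligned destination takes the same path as it would without any bounce buffer.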

> >> I'm hesitant to break something -- even if it's odd (literally in this
> >> case) -- that currently works on most hardware, just because one or two
> >> drivers can't handle it.  It feels kind of like changing the read() and
> >> write() system calls to require cacheline alignment. :-P
> > 
> > That's actually almost right, we're doing a bootloader here, it might
> > have limitations. We're not writing yet another operating system with no
> > bounds on possibilities!
> 
> We also don't need to bend over backwards to squeeze every last cycle
> out of the boot process, at the expense of a stable user interface (not
> to mention requiring the user to know the system's cache line size).

But that's reported in my patch ;-)

And yes, we want to make the boot process as blazing fast as possible. Imagine
you fire a rocket into deep space and it gets broken and needs a reboot, will
you enjoy the waiting? ;-)

> -Scott

Best regards,
Marek Vasut