[U-Boot] [PATCH 6/9] CACHE: nand read/write: Test if start address is aligned
Marek Vasut
marex at denx.de
Tue Jun 26 22:39:08 CEST 2012
Dear Scott Wood,
> On 06/25/2012 08:33 PM, Marek Vasut wrote:
> > Dear Scott Wood,
> >
> >> On 06/25/2012 06:37 PM, Marek Vasut wrote:
> >>> Dear Scott Wood,
> >>>
> >>>> On 06/24/2012 07:17 PM, Marek Vasut wrote:
> >>>>> but that involves a lot of copying and therefore degrades performance
> >>>>> rapidly. Therefore disallow this possibility of unaligned load
> >>>>> address altogether if data cache is on.
> >>>>
> >>>> How about use the bounce buffer only if the address is misaligned?
> >>>
> >>> Not happening, bounce buffer is bullshit,
> >>
> >> Hacking up the common frontend with a new limitation because you can't
> >> be bothered to fix your drivers is bullshit.
> >
> > The drivers are not broken, they have hardware limitations.
>
> They're broken because they ignore those limitations.
>
> > And checking for
> > those has to be done as early as possible.
>
> Why?
Why keep the overhead?
> > And it's not a new common frontend!
>
> No, it's a compatibility-breaking change to the existing common frontend.
Well, those are corner cases; it's not like people will start hitting it en masse.
I agree it should somehow be platform- or even CPU-specific.
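For illustration only (this is not code from the patch; the function name and the cache-line constant are made up, and real U-Boot code would use ARCH_DMA_MINALIGN for the alignment requirement), such an early alignment check could look like:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size for the sketch; U-Boot drivers would use the
 * platform's ARCH_DMA_MINALIGN instead of a hard-coded value. */
#define CACHE_LINE_SIZE 64

/* Return nonzero if [addr, addr + len) is cache-line aligned, i.e. a
 * DMA-capable driver can use the buffer directly with the data cache on
 * without risking cache-line sharing with adjacent data. */
static int dma_buf_aligned(uintptr_t addr, size_t len)
{
	return (addr % CACHE_LINE_SIZE == 0) && (len % CACHE_LINE_SIZE == 0);
}
```

Whether this check lives in the common frontend or in each driver is exactly what is being debated here.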
> >>> It's like driving a car in the wrong lane. Sure, you can do it, but
> >>> it'll eventually have some consequences. And using a bounce buffer is
> >>> like driving a tank in the wrong lane ...
> >>
> >> Using a bounce buffer is like parking your car before going into the
> >> building, rather than insisting the building's hallways be paved.
> >
> > The other is obviously faster, more comfortable and lets you carry more
> > stuff at once.
>
> Then you end up needing buildings to be many times as large to give
> every cubicle an adjacent parking spot, maneuvering room, etc. You'll
> be breathing fumes all day, and it'll be a lot less comfortable to get
> even across the hallway without using a car, etc. Communication between
> coworkers would be limited to horns and obscene gestures. :-)
Ok, this has gone waaay out of hand here :-)
> > And if you drive a truck, you can dump a lot of payload instead of
> > carrying it back and forth from the building. That's why there's a
> > special garage for trucks possibly with cargo elevators etc.
>
> Yes, it's called targeted optimization rather than premature optimization.
>
> >>>> The
> >>>> corrective action a user has to take is the same as with this patch,
> >>>> except for an additional option of living with the slight performance
> >>>> penalty.
> >>>
> >>> Slight is very weak word here.
> >>
> >> Prove me wrong with benchmarks.
> >
> > Well, copying data back and forth is tremendous overhead. You don't need
> > a benchmark to calculate something like this:
> >
> > 133MHz SDRAM (pumped) gives you what ... 133 Mb/s throughput
>
> You're saying you get only a little more bandwidth from memory than
> you'd get from a 100 Mb/s ethernet port? Come on. Data buses are not
> one bit wide.
Good point; it was late and I forgot to factor that in.
> And how fast can you pull data out of a NAND chip, even with DMA?
>
> > (now if it's DDR, dual/quad pumped, that doesn't give you any more
> > advantage
>
> So such things were implemented for fun?
>
> > since you have to: send address, read the data, send address, write the
> > data ...
>
> What about bursts? I'm pretty sure you don't have to send the address
> separately for every single byte.
If you do memcpy? You only have registers; sure, you can optimize it, but on
Intel, for example, you don't have many of those.
> > this is expensive ... without data cache on, even more so)
>
> Why do we care about "without data cache"? You don't need the bounce
> buffer in that case.
Correct, it's expensive in both cases though.
> > Now consider you do it via really dumb memcpy, what happens:
> It looks like ARM U-Boot has an optimized memcpy.
>
> > 1) You need to read the data into register
> > 1a) Send address
> > 1b) Read the data into register
> > 2) You need to write the data to a new location
> > 2a) Send address
> > 2b) Write the data into the memory
> >
> > In the meantime, you get some refresh cycles etc. Now, if you take read
> > and write in 1 time unit and "send address" in 0.5 time unit (this gives
> > total 3 time units per one loop) and consider you're not doing sustained
> > read/write, you should see you'll be able to copy at speed of about
> > 133/3 ~= 40Mb/s
> >
> > If you want to load 3MB kernel at 40Mb/s onto an unaligned address via
> > DMA, the DMA will deploy it via sustained write, that'll be at 10MB/s,
> > therefore in 300ms. But the subsequent copy will take another 600ms.
>
> On a p5020ds, using NAND hardware that doesn't do DMA at all, I'm able
> to load a 3MiB image from NAND in around 300-400 ms. This is with using
> memcpy_fromio() on an uncached hardware buffer.
Blazing ... almost half a second. But everyone these days wants a faster boot;
without the memcpy, it can go down to 100 ms or even less! And this kind of
limitation is not something that'd inconvenience anyone.
> Again, I'm not saying that bounce buffers are always negligible overhead
> -- just that I doubt NAND is fast enough that it makes a huge difference
> in this specific case.
It does make a difference. But correct, thinking about it -- implementing a
generic bounce buffer for cache-incoherent hardware might be a better way to go.
> >>>> How often does this actually happen? How much does it
> >>>> actually slow things down compared to the speed of the NAND chip?
> >>>
> >>> If the user is dumb, always. But if you tell the user how to milk the
> >>> most of the hardware, he'll be happier.
> >>
> >> So, if you use bounce buffers conditionally (based on whether the
> >> address is misaligned), there's no impact except to "dumb" users, and
> >> for those users they would merely get a performance degradation rather
> >> than breakage. How is this "bullshit"?
> >
> > Correct, but users will complain if they get a subpar performance.
>
> If you expend the minimal effort required to make the bounce buffer
> usage conditional on the address actually being misaligned, the only
> users that will see subpar performance are those who would see breakage
> with your approach. Users will complain if they see breakage even more
> than when the see subpar performance.
>
> If the issue is educating the user to avoid the performance hit
> (regardless of magnitude), and you care enough, have the driver print a
> warning (not error) message the first time it needs to use a bounce buffer.
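A sketch of what Scott is suggesting (all names here are hypothetical, the DMA transfer is stubbed out, and the cache-line constant stands in for ARCH_DMA_MINALIGN): bounce only when the destination is actually misaligned, and warn once rather than fail:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_LINE_SIZE 64	/* assumed; U-Boot would use ARCH_DMA_MINALIGN */

/* Statically aligned bounce buffer and a one-shot warning flag. */
static char bounce[4096] __attribute__((aligned(CACHE_LINE_SIZE)));
static int warned;

/* Stand-in for the driver's real DMA transfer routine; here it just
 * fills the buffer with a recognizable pattern. */
static void dma_read(void *dst, size_t len)
{
	memset(dst, 0xab, len);
}

/* Read 'len' bytes into 'buf': use the buffer directly when aligned,
 * otherwise go through the bounce buffer and warn the user (once)
 * about the performance hit instead of refusing the transfer. */
static void nand_read_buf(void *buf, size_t len)
{
	if ((uintptr_t)buf % CACHE_LINE_SIZE) {
		if (!warned) {
			printf("WARNING: unaligned buffer %p, "
			       "using bounce buffer (slower)\n", buf);
			warned = 1;
		}
		dma_read(bounce, len);
		memcpy(buf, bounce, len);
	} else {
		dma_read(buf, len);
	}
}
```

Aligned callers pay nothing; misaligned callers get a warning and the copy penalty instead of breakage.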
OK, on second thought, you're right. Let's do it the same way we did it with
MMC.
> -Scott
Best regards,
Marek Vasut