[U-Boot] [PATCH v2 3/4] mtd: nand: add Freescale NFC driver

Tue Aug 19 19:00:56 CEST 2014

Am 2014-08-18 18:38, schrieb Bill Pringlemeir:
> On 18 Aug 2014, stefan at agner.ch wrote:
> 
>> Am 2014-08-14 23:12, schrieb Bill Pringlemeir:
>>>> On 14 Aug 2014, stefan at agner.ch wrote:
>>>>
>>>> This adds initial support for Freescale NFC (NAND Flash Controller)
>>>> found in ARM Vybrid SoC's, Power Architecture MPC5125 and others.
>>>> However, this driver is only tested on Vybrid.
>>>
>>> This is only to expand on the nand controller register and SRAM use.
>>>
>>> [snip]
>>>
>>>> diff --git a/drivers/mtd/nand/vf610_nfc.c
>>>> b/drivers/mtd/nand/vf610_nfc.c new file mode 100644 index
>>>> 0000000..3150ac1 --- /dev/null +++ b/drivers/mtd/nand/vf610_nfc.c
>>>
>>> [snip]
>>>
>>>> +static inline u32 vf610_nfc_read(struct mtd_info *mtd, uint reg) +{
> 
>>> Ok, we always use readl/writel.  This is fine, but a little slower
>>> and bigger.  I may try a register cache if I resubmit to the Linux
>>> MTD as per Scott's suggestion.  Especially, this version is good for
>>> an incremental patch.
> 
>> I measured the difference and get 1MB/s
>> Full pages, readl/writel:
>> NAND read: device 0 offset 0x200000, size 0x800000
>> 8388608 bytes read in 772 ms (10.4 MiB/s): OK
> 
>> Full pages, __raw_readl/__raw_writel
>> NAND read: device 0 offset 0x200000, size 0x800000
>> 8388608 bytes read in 696 ms (11.5 MiB/s): OK
> 
>> Ok, this is actually quite a lot. Especially since I already optimized
>> the C code (by not using the helper functions like nfc_set/nfc_clear
>> in vf610_nfc_send_command), one would think there is now almost no
>> optimization potential. I looked into the disassembled code and could
>> narrow down the issue. Due to the memory barriers, all offsets were
>> calculated on each register access (nfc base to reg base, and add reg
>> offset), multiple instances of:
> 
>>  20:	e59cc120 	ldr	ip, [ip, #288]	; 0x120
>>  24:	e59cc134 	ldr	ip, [ip, #308]	; 0x134
> 
>> I optimized the code again and calculate the offsets manually and
>> access __raw_readl/__raw_writel rather then vf610_nfc_read/write in
>> the vf610_nfc_send_command(s) function, I get the full speed again:
> 
>> NAND read: device 0 offset 0x200000, size 0x800000
>> 8388608 bytes read in 687 ms (11.6 MiB/s): OK
> 
> I think what you have is fine.  The 10BM/s versus 11MB/s is not
> insignificant, but it is not a huge difference.  I expect that the
> driver is already better than the others; especially the Imx-25 was only
> 7.7MB/s read and 4.6Mb/s write from Linux mtd tests.  Although the end
> user might like to wait 10% less for an image to load.
> 
> Also, probably a lot of people care about code size.  This is not so
> much a case for U-Boot as it is usually machine specific and doesn't
> support several SOCs afaik.
> 
> However, the more specific you make the optimization to the platform,
> the less likely it is to extend well.  We also wish to have this work
> well with different gcc versions and CPUs (PowerPC, etc).  The
> 'readl/writel' handicap the compiler.  Although they are more likely to
> work with a wide variety of buses.
> 
> [snip]
> 
>> On Wed, 2014-08-13 at 22:32, Scott Wood wrote:
>>> raw_writel() is itself something that should only be used for
>>> hand-optimized sections.  For non-performance-critical code you
>>> should use normal writel() so that you don't need to worry about
>>> manually adding I/O barriers.
>>
>> The reason I choosed readl/writel instead of the raw variants is to
>> preserve align with other drivers...
> 
>>> care.  Maybe a comment is fine?  It seems the Vybrid is safe for
>>> different access sizes, but it is possible that some other CPU might
>>> not be able to access this memory via 32/16/8 bit accesses and
>>> 'memcpy()' may not be appropriate.  It seems that 'natural' size of
>>> the NFC controller itself is 32bits and the CPU interface does lane
>>> masking.  Ie, boot mode documentation talks about remapping
>>> 'sram_physical_addr[13:3] = {cpu_addr[11:3],cpu_addr[13:12]}' saying
>>> that bits 2,1 are not used (hopefully one based numbers).  This is
>>> just my guess...
> 
>> What assumptions do you make how memcpy accesses memory? This latest
>> patch now uses the optimized versions from the kernel... Maybe they
>> even try to access 64-bit width (the NIC interconnect supports 64-bit
>> access)
> 
> The memcpy() itself could use anything. 64bits is possible on AXI/NIC.
> The 'PBRIDGE' is 64bit, but I think the AIPS/IPS (apparently AIPS means
> 'AHB-lite to IPS) are 32bit.  At least that is the case on the Imx25
> which has a different AIPS version.  I assumed the 'memcpy()' was using
> 32bits but this certainly isn't explicit in the code.
> 
> The majority of the register banks are non-volatile with this
> controller.  Instead of running multiple NAND programming sequences, the
> controller runs them all for us.  Most registers are are mainly like
> SRAM.
> 
> My only point is that the SRAM buffers use the same interface as the
> main Nand controller register banks.  So using 'readl/writel' for the
> register, but not the SRAM buffers seems inconsistent.
> 
> So to address this inconsistency, I was thinking that we should at least
> have a comment?
> 
>  /* BUS: This assumes the BUS is 32 bit accessible.  If you are porting
>     to other systems, this may not be the case.
>   */
>  memcpy(nfc->regs + NFC_MAIN_AREA(0) + c, buf, l);
> 

IMHO, we just treat this as if its memory and I guess this is fine for a
buffer. memcpy knows how to copy data, and takes care if the
architecture needs aligned access when reading 32-bit width, or similar
requirements. We do not know whether memcpy really uses 32-bit accesses,
hence this comment might even be wrong. In a short test, I could also
access the buffer in byte/word length (tested using md.b/md.w).

Also, I assume this just works for a different architecture too. If not,
the one using this driver the first time on a different architecture
would see this pretty quickly I guess :-)

> Or we could implement our own version of memcpy that did 32bit aligned
> transfers with a similar comment.  In theory, we need a barrier after
> the memcpy(), in case anyone modified the code to touch the
> NFC_FLASH_CMD2's START_BIT directly after the memcpy() or custom
> function.  But all the paranoia adds some code and has potential to slow
> things down.

Well that would be a reordering accross multiple function reads, and
many instructions right now. I don't think that this is a valid case to
introduce a barrier.

> Doing a barrier after every single byte read as the ARM Linux's
> memcpy_fromio() does will surely make significant performance
> differences.  Instead of being double the Imx25, it was half.  A user
> waiting 400% longer won't be too happy.

There are also patches floating around which just use memcpy:
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-June/173195.html

In our case, a barrier just after the memcpy would be sufficient.

Somehow your comment and my latest patch revision crossed each other.
Could you post your comment on the latest revision in case you are fine
with it?

--
Stefan