[U-Boot] [PATCH 3/4] mtd: nand: add Freescale NFC driver

Stefan Agner stefan at agner.ch
Wed Aug 13 18:27:14 CEST 2014


Am 2014-08-13 17:14, schrieb Bill Pringlemeir:
> On 13 Aug 2014, stefan at agner.ch wrote:
> 
>> Am 2014-08-13 00:58, schrieb Bill Pringlemeir:
>> [snip]
>>>>>>> +static u32 nfc_read(struct mtd_info *mtd, uint reg)
>>>>>>> +{
>>>>>>> +	struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>>>>>>> +
>>>>>>> +	if (reg == NFC_FLASH_STATUS1 ||
>>>>>>> +	    reg == NFC_FLASH_STATUS2 ||
>>>>>>> +	    reg == NFC_IRQ_STATUS)
>>>>>>> +		return __raw_readl(nfc->regs + reg);
>>>>>>> +	/* Gang read/writes together for most registers. */
>>>>>>> +	else
>>>>>>> +		return *(u32 *)(nfc->regs + reg);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void nfc_write(struct mtd_info *mtd, uint reg, u32 val)
>>>>>>> +{
>>>>>>> +	struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>>>>>>> +
>>>>>>> +	if (reg == NFC_FLASH_STATUS1 ||
>>>>>>> +	    reg == NFC_FLASH_STATUS2 ||
>>>>>>> +	    reg == NFC_IRQ_STATUS)
>>>>>>> +		__raw_writel(val, nfc->regs + reg);
>>>>>>> +	/* Gang read/writes together for most registers. */
>>>>>>> +	else
>>>>>>> +		*(u32 *)(nfc->regs + reg) = val;
>>>>>>> +}
>>>>>>
>>>>>> You should always be using raw I/O accessors.  If the intent is to
>>>>>> bypass I/O ordering for certain registers, the raw accessors
>>>>>> should already be skipping that.
>>>
>>>>> Ok, will do.
>>>
>>>> Sorry, the above should say "always be using I/O accesors", not
>>>> always raw ones.
>>>
>>> This is probably because of me.  There are lines like this for
>>> configuration,
>>>
>>> /* Set configuration register. */
>>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_ADDR_AUTO_INCR_BIT);
>>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_BUFNO_AUTO_INCR_BIT);
>>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_BOOT_MODE_BIT);
>>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_DMA_REQ_BIT);
>>> nfc_set(mtd, NFC_FLASH_CONFIG, CONFIG_FAST_FLASH_BIT);
>>>
>>> If you use some sort of 'volatile', you are saying to the compiler
>>> that these lines are *not*,
>>>
>>> ldr r0, [r1, #CONFIG]    # read previous value
>>> ldr r2, =~MASK
>>> orr r0, #FAST_FLASH_BIT  # set one bit.
>>> and r0, r2               # clear all bits at once.
>>> str r0, [r1, #CONFIG]    # write it back.
>>>
>>> Instead it will change into five different load/modify/stores.  The
>>> memory itself is just like SRAM except a few registers that are
>>> actually volatile; ie changed via the hardware.
>>>
>>> Particularly bad are the memcpy_io() on the ARM.  Using these results
>>> in about 1/2 to 1/4 of the performance of the drivers through-put on
>>> the Vybrid.  I can see that using accessors is good, but just
>>> replacing this with 'writel' will result in significantly more code
>>> and slower throughput.
>>>
>>> If people insist on this, then something like,
>>>
>>> val = nfc_read(mtd, NFC_REG);
>>> val = nfc_clear(val, CONFIG_ADDR_AUTO_INCR_BIT);
>>> val = nfc_clear(val, CONFIG_BUFNO_AUTO_INCR_BIT);
>>> val = nfc_clear(val, CONFIG_BOOT_MODE_BIT);
>>> val = nfc_clear(val, CONFIG_DMA_REQ_BIT);
>>> val = nfc_set(val, CONFIG_FAST_FLASH_BIT);
>>> nfc_write(mtd, NFC_REG, val);
>>>
>>> would result in the same code with 'nfc_read()' and 'nfc_write()'
>>> using the I/O accessors.  I found this more difficult to read for the
>>> higher level functions.  Instead some some standard macros for
>>> setting and clearing bits could be used.  The original driver was
>>> using 'nfc_set()' and 'nfc_clear()'.  Maybe they should just go.
>>>
>>
>> I just applied this change:
>>
>> diff --git a/drivers/mtd/nand/fsl_nfc.c b/drivers/mtd/nand/fsl_nfc.c
>> index df2c8be..01c010f 100644
>> --- a/drivers/mtd/nand/fsl_nfc.c
>> +++ b/drivers/mtd/nand/fsl_nfc.c
>> @@ -191,26 +191,14 @@ static u32 nfc_read(struct mtd_info *mtd, uint
>> reg)
>> {
>> 	struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>>
>> -	if (reg == NFC_FLASH_STATUS1 ||
>> -	    reg == NFC_FLASH_STATUS2 ||
>> -	    reg == NFC_IRQ_STATUS)
>> -		return __raw_readl(nfc->regs + reg);
>> -	/* Gang read/writes together for most registers. */
>> -	else
>> -		return *(u32 *)(nfc->regs + reg);
>> +	return __raw_readl(nfc->regs + reg);
>> }
>>
>> static void nfc_write(struct mtd_info *mtd, uint reg, u32 val)
>> {
>> 	struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>>
>> -	if (reg == NFC_FLASH_STATUS1 ||
>> -	    reg == NFC_FLASH_STATUS2 ||
>> -	    reg == NFC_IRQ_STATUS)
>> -		__raw_writel(val, nfc->regs + reg);
>> -	/* Gang read/writes together for most registers. */
>> -	else
>> -		*(u32 *)(nfc->regs + reg) = val;
>> +	__raw_writel(val, nfc->regs + reg);
>> }
>>
>> static void nfc_set(struct mtd_info *mtd, uint reg, u32 bits)
>>
>> And did some performance measurement:
>>
>> => nand read ${loadaddr} 0x0 0x2000000
>>
>> ===> With if/else
>> NAND read: device 0 offset 0x0, size 0x2000000
>> 33554432 bytes read in 8494 ms: OK (3.8 MiB/s)
>>
>> ===> With __raw_readl only...
>> NAND read: device 0 offset 0x0, size 0x2000000
>> 33554432 bytes read in 8184 ms: OK (3.9 MiB/s)
>>
>> => nand write ${loadaddr} rootfs-ubi 0x2000000
>>
>> ===> With if/else
>> NAND write: device 0 offset 0xa00000, size 0x2000000
>> 33554432 bytes written in 18200 ms: OK (1.8 MiB/s)
>>
>> ===> With __raw_readl only...
>> NAND write: device 0 offset 0xa00000, size 0x2000000
>> 33554432 bytes written in 15095 ms: OK (2.1 MiB/s)
>>
>> These values are reproducible. I guess by removing the conditional
>> branch in the nfc_read/nfc_write functions we even gain performance.
>>
>> IMHO we should use the raw_writel only and "hand optimize" for
>> functions which are used often. For the initialization/configuration
>> functions, there is little value to save some register access.
> 
> This will indeed depend on the compiler.  I guess that the U-Boot
> 'readl' and 'writel' are the same as Linux?  If the compiler doesn't do
> the analysis to see from the parameters that the 'if/else' is not
> needed, then it will not reduce the code and this can be larger.
> 
> I was using gcc 4.8 when I looked at this.  The NFC is an AHB peripheral
> and takes several clocks to get access to.  Did the object size
> increase/decrease?  For Linux and this change with gcc 4.8,
> 
> With if/else
> 
> $ size drivers/mtd/nand/fsl_nfc.o
>    text    data     bss     dec     hex filename
>    3300    3652       0    6952    1b28 drivers/mtd/nand/fsl_nfc.o
> 
> Only readl/writel,
> 
> $ size drivers/mtd/nand/fsl_nfc.o
>    text    data     bss     dec     hex filename
>    3532    3652       0    7184    1c10 drivers/mtd/nand/fsl_nfc.o
> 
> Do you get bigger sizes for the 'readl/writel' only cases?  This would
> indicate the compiler didn't inline/reduce the function calls.  For
> certain, the 'read/set+clear/write' will be smaller and marginally
> faster with all compilers.  If you have a smaller binary with the
> 'if/else' and the code still performs slower, then something I don't
> understand is happening.
> 
> It is sensible that this will not effect performance that much.  The
> register access is not the main bottle neck.  The 'memcpy()' in
> nfc_read_buf(), nfc_read_spare() and nfc_write_buf() are more critical
> to performance.  Ie, getting the nand data on/off of the NFC controllers
> internal buffers to main memory is the main bottle neck.  Ironically,
> on the Vybrid, the Cortex-A5 is able to do this faster than the DMA
> engine as it has much better though-put to the SDRAM.

As far as I understood the RM, we would have multiple buffers? So we
could get the next page while moving the last to main memory (by DMA or
CPU). However, I'm not sure whether such an operation is possible when
using the MTD interface.. This probably makes more sense for the Linux
driver anyway.

> The buffer transfers were originally 'memcpy_io()' or something which
> did barriers after each and every memory fetch.  By replacing the
> register access alone, you shouldn't get that much of a performance
> difference (1/2 to 1/4 through-put).

The inlining is indeed the most prominent factor for this performance
difference. I use a GCC 4.7.4. I also think that U-Boot uses different
optimization flags then the kernel. However, in my setup the compiler
did not inline nfc_read/write default. When removing the if/else, the
inlining was done by GCC 4.7.4

Funny is, the size is bigger in the first uninlined case... Maybe GCC
inlined the function only for some calls, I did not checked that... 

With if/else
   text	   data	    bss	    dec	    hex	filename
   2395	   2904	      0	   5299	   14b3	drivers/mtd/nand/fsl_nfc.o

${CROSS_COMPILE}objdump -d drivers/mtd/nand/fsl_nfc.o | grep nfc_write
Disassembly of section .text.nfc_write:
00000000 <nfc_write>:
....

Without if/else, in-lined by the compiler...
   text	   data	    bss	    dec	    hex	filename
   2263	   2904	      0	   5167	   142f	drivers/mtd/nand/fsl_nfc.o

I then tried explicitly inlining with if/else, I got more or less the
same performance.

My next version will not contain the if/else and access through
writel/readl exclusively. I also explicitly inlined those functions
since this made the measurable difference in performance. Some hand
optimization for the register access of the most important functions
(nfc_send_command(s)) showed a slight improvement. This way we have
easily readable code in initialization functions and get good
performance.

=> nand read ${loadaddr} 0x0 0x2000000

NAND read: device 0 offset 0x0, size 0x2000000
 33554432 bytes read in 8032 ms: OK (4 MiB/s)

   text	   data	    bss	    dec	    hex	filename
   2000	   2904	      0	   4904	   1328	drivers/mtd/nand/fsl_nfc.o

--
Stefan


More information about the U-Boot mailing list