[U-Boot] [PATCH 3/4] mtd: nand: add Freescale NFC driver

Bill Pringlemeir bpringlemeir at nbsps.com
Wed Aug 13 17:14:24 CEST 2014


On 13 Aug 2014, stefan at agner.ch wrote:

> Am 2014-08-13 00:58, schrieb Bill Pringlemeir:
> [snip]
>>>>>> +static u32 nfc_read(struct mtd_info *mtd, uint reg)
>>>>>> +{
>>>>>> +	struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>>>>>> +
>>>>>> +	if (reg == NFC_FLASH_STATUS1 ||
>>>>>> +	    reg == NFC_FLASH_STATUS2 ||
>>>>>> +	    reg == NFC_IRQ_STATUS)
>>>>>> +		return __raw_readl(nfc->regs + reg);
>>>>>> +	/* Gang read/writes together for most registers. */
>>>>>> +	else
>>>>>> +		return *(u32 *)(nfc->regs + reg);
>>>>>> +}
>>>>>> +
>>>>>> +static void nfc_write(struct mtd_info *mtd, uint reg, u32 val)
>>>>>> +{
>>>>>> +	struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>>>>>> +
>>>>>> +	if (reg == NFC_FLASH_STATUS1 ||
>>>>>> +	    reg == NFC_FLASH_STATUS2 ||
>>>>>> +	    reg == NFC_IRQ_STATUS)
>>>>>> +		__raw_writel(val, nfc->regs + reg);
>>>>>> +	/* Gang read/writes together for most registers. */
>>>>>> +	else
>>>>>> +		*(u32 *)(nfc->regs + reg) = val;
>>>>>> +}
>>>>>
>>>>> You should always be using raw I/O accessors.  If the intent is to
>>>>> bypass I/O ordering for certain registers, the raw accessors
>>>>> should already be skipping that.
>>
>>>> Ok, will do.
>>
>>> Sorry, the above should say "always be using I/O accesors", not
>>> always raw ones.
>>
>> This is probably because of me.  There are lines like this for
>> configuration,
>>
>> /* Set configuration register. */
>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_ADDR_AUTO_INCR_BIT);
>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_BUFNO_AUTO_INCR_BIT);
>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_BOOT_MODE_BIT);
>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_DMA_REQ_BIT);
>> nfc_set(mtd, NFC_FLASH_CONFIG, CONFIG_FAST_FLASH_BIT);
>>
>> If you use some sort of 'volatile', you are saying to the compiler
>> that these lines are *not*,
>>
>> ldr r0, [r1, #CONFIG]    # read previous value
>> ldr r2, =~MASK
>> orr r0, #FAST_FLASH_BIT  # set one bit.
>> and r0, r2               # clear all bits at once.
>> str r0, [r1, #CONFIG]    # write it back.
>>
>> Instead it will change into five different load/modify/stores.  The
>> memory itself is just like SRAM except a few registers that are
>> actually volatile; ie changed via the hardware.
>>
>> Particularly bad are the memcpy_io() on the ARM.  Using these results
>> in about 1/2 to 1/4 of the performance of the drivers through-put on
>> the Vybrid.  I can see that using accessors is good, but just
>> replacing this with 'writel' will result in significantly more code
>> and slower throughput.
>>
>> If people insist on this, then something like,
>>
>> val = nfc_read(mtd, NFC_REG);
>> val = nfc_clear(val, CONFIG_ADDR_AUTO_INCR_BIT);
>> val = nfc_clear(val, CONFIG_BUFNO_AUTO_INCR_BIT);
>> val = nfc_clear(val, CONFIG_BOOT_MODE_BIT);
>> val = nfc_clear(val, CONFIG_DMA_REQ_BIT);
>> val = nfc_set(val, CONFIG_FAST_FLASH_BIT);
>> nfc_write(mtd, NFC_REG, val);
>>
>> would result in the same code with 'nfc_read()' and 'nfc_write()'
>> using the I/O accessors.  I found this more difficult to read for the
>> higher level functions.  Instead some some standard macros for
>> setting and clearing bits could be used.  The original driver was
>> using 'nfc_set()' and 'nfc_clear()'.  Maybe they should just go.
>>
>
> I just applied this change:
>
> diff --git a/drivers/mtd/nand/fsl_nfc.c b/drivers/mtd/nand/fsl_nfc.c
> index df2c8be..01c010f 100644
> --- a/drivers/mtd/nand/fsl_nfc.c
> +++ b/drivers/mtd/nand/fsl_nfc.c
> @@ -191,26 +191,14 @@ static u32 nfc_read(struct mtd_info *mtd, uint
> reg)
> {
> 	struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>
> -	if (reg == NFC_FLASH_STATUS1 ||
> -	    reg == NFC_FLASH_STATUS2 ||
> -	    reg == NFC_IRQ_STATUS)
> -		return __raw_readl(nfc->regs + reg);
> -	/* Gang read/writes together for most registers. */
> -	else
> -		return *(u32 *)(nfc->regs + reg);
> +	return __raw_readl(nfc->regs + reg);
> }
>
> static void nfc_write(struct mtd_info *mtd, uint reg, u32 val)
> {
> 	struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>
> -	if (reg == NFC_FLASH_STATUS1 ||
> -	    reg == NFC_FLASH_STATUS2 ||
> -	    reg == NFC_IRQ_STATUS)
> -		__raw_writel(val, nfc->regs + reg);
> -	/* Gang read/writes together for most registers. */
> -	else
> -		*(u32 *)(nfc->regs + reg) = val;
> +	__raw_writel(val, nfc->regs + reg);
> }
>
> static void nfc_set(struct mtd_info *mtd, uint reg, u32 bits)
>
> And did some performance measurement:
>
> => nand read ${loadaddr} 0x0 0x2000000
>
> ===> With if/else
> NAND read: device 0 offset 0x0, size 0x2000000
> 33554432 bytes read in 8494 ms: OK (3.8 MiB/s)
>
> ===> With __raw_readl only...
> NAND read: device 0 offset 0x0, size 0x2000000
> 33554432 bytes read in 8184 ms: OK (3.9 MiB/s)
>
> => nand write ${loadaddr} rootfs-ubi 0x2000000
>
> ===> With if/else
> NAND write: device 0 offset 0xa00000, size 0x2000000
> 33554432 bytes written in 18200 ms: OK (1.8 MiB/s)
>
> ===> With __raw_readl only...
> NAND write: device 0 offset 0xa00000, size 0x2000000
> 33554432 bytes written in 15095 ms: OK (2.1 MiB/s)
>
> These values are reproducible. I guess by removing the conditional
> branch in the nfc_read/nfc_write functions we even gain performance.
>
> IMHO we should use the raw_writel only and "hand optimize" for
> functions which are used often. For the initialization/configuration
> functions, there is little value to save some register access.

This will indeed depend on the compiler.  I guess that the U-Boot
'readl' and 'writel' are the same as Linux?  If the compiler doesn't do
the analysis to see from the parameters that the 'if/else' is not
needed, then it will not reduce the code and this can be larger.

I was using gcc 4.8 when I looked at this.  The NFC is an AHB peripheral
and takes several clocks to get access to.  Did the object size
increase/decrease?  For Linux and this change with gcc 4.8,

With if/else

$ size drivers/mtd/nand/fsl_nfc.o
   text    data     bss     dec     hex filename
   3300    3652       0    6952    1b28 drivers/mtd/nand/fsl_nfc.o

Only readl/writel,

$ size drivers/mtd/nand/fsl_nfc.o
   text    data     bss     dec     hex filename
   3532    3652       0    7184    1c10 drivers/mtd/nand/fsl_nfc.o

Do you get bigger sizes for the 'readl/writel' only cases?  This would
indicate the compiler didn't inline/reduce the function calls.  For
certain, the 'read/set+clear/write' will be smaller and marginally
faster with all compilers.  If you have a smaller binary with the
'if/else' and the code still performs slower, then something I don't
understand is happening.

It is sensible that this will not effect performance that much.  The
register access is not the main bottle neck.  The 'memcpy()' in
nfc_read_buf(), nfc_read_spare() and nfc_write_buf() are more critical
to performance.  Ie, getting the nand data on/off of the NFC controllers
internal buffers to main memory is the main bottle neck.  Ironically,
on the Vybrid, the Cortex-A5 is able to do this faster than the DMA
engine as it has much better though-put to the SDRAM.

The buffer transfers were originally 'memcpy_io()' or something which
did barriers after each and every memory fetch.  By replacing the
register access alone, you shouldn't get that much of a performance
difference (1/2 to 1/4 through-put).

Regards,
Bill Pringlemeir.


More information about the U-Boot mailing list