[U-Boot] [PATCH 3/4] mtd: nand: add Freescale NFC driver
Bill Pringlemeir
bpringlemeir at nbsps.com
Wed Aug 13 17:14:24 CEST 2014
On 13 Aug 2014, stefan at agner.ch wrote:
> Am 2014-08-13 00:58, schrieb Bill Pringlemeir:
> [snip]
>>>>>> +static u32 nfc_read(struct mtd_info *mtd, uint reg)
>>>>>> +{
>>>>>> + struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>>>>>> +
>>>>>> + if (reg == NFC_FLASH_STATUS1 ||
>>>>>> + reg == NFC_FLASH_STATUS2 ||
>>>>>> + reg == NFC_IRQ_STATUS)
>>>>>> + return __raw_readl(nfc->regs + reg);
>>>>>> + /* Gang read/writes together for most registers. */
>>>>>> + else
>>>>>> + return *(u32 *)(nfc->regs + reg);
>>>>>> +}
>>>>>> +
>>>>>> +static void nfc_write(struct mtd_info *mtd, uint reg, u32 val)
>>>>>> +{
>>>>>> + struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>>>>>> +
>>>>>> + if (reg == NFC_FLASH_STATUS1 ||
>>>>>> + reg == NFC_FLASH_STATUS2 ||
>>>>>> + reg == NFC_IRQ_STATUS)
>>>>>> + __raw_writel(val, nfc->regs + reg);
>>>>>> + /* Gang read/writes together for most registers. */
>>>>>> + else
>>>>>> + *(u32 *)(nfc->regs + reg) = val;
>>>>>> +}
>>>>>
>>>>> You should always be using raw I/O accessors. If the intent is to
>>>>> bypass I/O ordering for certain registers, the raw accessors
>>>>> should already be skipping that.
>>
>>>> Ok, will do.
>>
>>> Sorry, the above should say "always be using I/O accesors", not
>>> always raw ones.
>>
>> This is probably because of me. There are lines like this for
>> configuration,
>>
>> /* Set configuration register. */
>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_ADDR_AUTO_INCR_BIT);
>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_BUFNO_AUTO_INCR_BIT);
>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_BOOT_MODE_BIT);
>> nfc_clear(mtd, NFC_FLASH_CONFIG, CONFIG_DMA_REQ_BIT);
>> nfc_set(mtd, NFC_FLASH_CONFIG, CONFIG_FAST_FLASH_BIT);
>>
>> If you use some sort of 'volatile', you are saying to the compiler
>> that these lines are *not*,
>>
>> ldr r0, [r1, #CONFIG] # read previous value
>> ldr r2, =~MASK
>> orr r0, #FAST_FLASH_BIT # set one bit.
>> and r0, r2 # clear all bits at once.
>> str r0, [r1, #CONFIG] # write it back.
>>
>> Instead it will change into five different load/modify/stores. The
>> memory itself is just like SRAM except a few registers that are
>> actually volatile; ie changed via the hardware.
>>
>> Particularly bad are the memcpy_io() on the ARM. Using these results
>> in about 1/2 to 1/4 of the performance of the drivers through-put on
>> the Vybrid. I can see that using accessors is good, but just
>> replacing this with 'writel' will result in significantly more code
>> and slower throughput.
>>
>> If people insist on this, then something like,
>>
>> val = nfc_read(mtd, NFC_REG);
>> val = nfc_clear(val, CONFIG_ADDR_AUTO_INCR_BIT);
>> val = nfc_clear(val, CONFIG_BUFNO_AUTO_INCR_BIT);
>> val = nfc_clear(val, CONFIG_BOOT_MODE_BIT);
>> val = nfc_clear(val, CONFIG_DMA_REQ_BIT);
>> val = nfc_set(val, CONFIG_FAST_FLASH_BIT);
>> nfc_write(mtd, NFC_REG, val);
>>
>> would result in the same code with 'nfc_read()' and 'nfc_write()'
>> using the I/O accessors. I found this more difficult to read for the
>> higher level functions. Instead some some standard macros for
>> setting and clearing bits could be used. The original driver was
>> using 'nfc_set()' and 'nfc_clear()'. Maybe they should just go.
>>
>
> I just applied this change:
>
> diff --git a/drivers/mtd/nand/fsl_nfc.c b/drivers/mtd/nand/fsl_nfc.c
> index df2c8be..01c010f 100644
> --- a/drivers/mtd/nand/fsl_nfc.c
> +++ b/drivers/mtd/nand/fsl_nfc.c
> @@ -191,26 +191,14 @@ static u32 nfc_read(struct mtd_info *mtd, uint
> reg)
> {
> struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>
> - if (reg == NFC_FLASH_STATUS1 ||
> - reg == NFC_FLASH_STATUS2 ||
> - reg == NFC_IRQ_STATUS)
> - return __raw_readl(nfc->regs + reg);
> - /* Gang read/writes together for most registers. */
> - else
> - return *(u32 *)(nfc->regs + reg);
> + return __raw_readl(nfc->regs + reg);
> }
>
> static void nfc_write(struct mtd_info *mtd, uint reg, u32 val)
> {
> struct fsl_nfc *nfc = mtd_to_nfc(mtd);
>
> - if (reg == NFC_FLASH_STATUS1 ||
> - reg == NFC_FLASH_STATUS2 ||
> - reg == NFC_IRQ_STATUS)
> - __raw_writel(val, nfc->regs + reg);
> - /* Gang read/writes together for most registers. */
> - else
> - *(u32 *)(nfc->regs + reg) = val;
> + __raw_writel(val, nfc->regs + reg);
> }
>
> static void nfc_set(struct mtd_info *mtd, uint reg, u32 bits)
>
> And did some performance measurement:
>
> => nand read ${loadaddr} 0x0 0x2000000
>
> ===> With if/else
> NAND read: device 0 offset 0x0, size 0x2000000
> 33554432 bytes read in 8494 ms: OK (3.8 MiB/s)
>
> ===> With __raw_readl only...
> NAND read: device 0 offset 0x0, size 0x2000000
> 33554432 bytes read in 8184 ms: OK (3.9 MiB/s)
>
> => nand write ${loadaddr} rootfs-ubi 0x2000000
>
> ===> With if/else
> NAND write: device 0 offset 0xa00000, size 0x2000000
> 33554432 bytes written in 18200 ms: OK (1.8 MiB/s)
>
> ===> With __raw_readl only...
> NAND write: device 0 offset 0xa00000, size 0x2000000
> 33554432 bytes written in 15095 ms: OK (2.1 MiB/s)
>
> These values are reproducible. I guess by removing the conditional
> branch in the nfc_read/nfc_write functions we even gain performance.
>
> IMHO we should use the raw_writel only and "hand optimize" for
> functions which are used often. For the initialization/configuration
> functions, there is little value to save some register access.
This will indeed depend on the compiler. I guess that the U-Boot
'readl' and 'writel' are the same as Linux? If the compiler doesn't do
the analysis to see from the parameters that the 'if/else' is not
needed, then it will not reduce the code and this can be larger.
I was using gcc 4.8 when I looked at this. The NFC is an AHB peripheral
and takes several clocks to get access to. Did the object size
increase/decrease? For Linux and this change with gcc 4.8,
With if/else
$ size drivers/mtd/nand/fsl_nfc.o
text data bss dec hex filename
3300 3652 0 6952 1b28 drivers/mtd/nand/fsl_nfc.o
Only readl/writel,
$ size drivers/mtd/nand/fsl_nfc.o
text data bss dec hex filename
3532 3652 0 7184 1c10 drivers/mtd/nand/fsl_nfc.o
Do you get bigger sizes for the 'readl/writel' only cases? This would
indicate the compiler didn't inline/reduce the function calls. For
certain, the 'read/set+clear/write' will be smaller and marginally
faster with all compilers. If you have a smaller binary with the
'if/else' and the code still performs slower, then something I don't
understand is happening.
It is sensible that this will not effect performance that much. The
register access is not the main bottle neck. The 'memcpy()' in
nfc_read_buf(), nfc_read_spare() and nfc_write_buf() are more critical
to performance. Ie, getting the nand data on/off of the NFC controllers
internal buffers to main memory is the main bottle neck. Ironically,
on the Vybrid, the Cortex-A5 is able to do this faster than the DMA
engine as it has much better though-put to the SDRAM.
The buffer transfers were originally 'memcpy_io()' or something which
did barriers after each and every memory fetch. By replacing the
register access alone, you shouldn't get that much of a performance
difference (1/2 to 1/4 through-put).
Regards,
Bill Pringlemeir.
More information about the U-Boot
mailing list