[PATCH v5 00/11] spl: Use common function for loading/parsing images
Sean Anderson
seanga2 at gmail.com
Wed Oct 18 19:33:20 CEST 2023
On 8/3/23 00:41, Heinrich Schuchardt wrote:
> On 8/3/23 03:31, Tom Rini wrote:
>> On Wed, Aug 02, 2023 at 09:13:21PM -0400, Sean Anderson wrote:
>>> On 8/2/23 16:11, Tom Rini wrote:
>>>> On Mon, Jul 31, 2023 at 06:42:52PM -0400, Sean Anderson wrote:
>>>>
>>>>> This series adds support for loading all image types (Legacy, FIT (with
>>>>> and without LOAD_FIT_FULL), and i.MX) to the MMC, SPI, NOR, NET, FAT,
>>>>> and EXT load methods. It does this by introducing a helper function
>>>>> which handles the minutiae of invoking the proper parsing function, and
>>>>> reading the rest of the image.
>>>>>
>>>>> Hopefully, this will make it easier for load methods to support all
>>>>> image types that U-Boot supports, without having undocumented
>>>>> unsupported image types. I applied this to several loaders which were
>>>>> invoking spl_load_simple_fit and/or spl_parse_image_header, but I did
>>>>> not use it with others (e.g. DFU/RAM) which had complications in the
>>>>> mix.
>>>>>
>>>>> In order to meet size requirements for some boards, the first two
>>>>> patches of this series are general size-reduction changes. The ffs/fls
>>>>> change in particular should reduce code size across the board. The
>>>>> malloc change also has the potential to reduce code size. I've left it
>>>>> disabled by default, but maybe we can reverse that in the future.
>>>>>
>>>>> Here's some bloat-o-meter for j7200_evm_a72_defconfig with ext4 support enabled:
>>>>>
>>>>> add/remove: 2/1 grow/shrink: 1/5 up/down: 444/-584 (-140)
>>>>> Function old new delta
>>>>> spl_load - 216 +216
>>>>> spl_simple_read - 184 +184
>>>>> spl_fit_read 60 104 +44
>>>>> file_fat_read 40 - -40
>>>>> spl_nor_load_image 120 76 -44
>>>>> spl_load_image_ext 364 296 -68
>>>>> spl_load_image_fat 320 200 -120
>>>>> spl_spi_load_image 304 176 -128
>>>>> spl_mmc_load 716 532 -184
>>>>> Total: Before=233618, After=233478, chg -0.06%
>>>>>
>>>>> For most boards with a few load methods, this series should break even.
>>>>> However, in the worst case this series will add around 100 bytes.
>>>>>
>>>>> I have only tested a few loaders. Please try booting your favorite board
>>>>> with NOR/SPI flash or SPI falcon mode.
>>>>
>>>>> On sama5d27_wlsom1_ek_mmc_defconfig, 100 bytes is too much, so this
>>>>> series depends on [1] to fit everything in. CI run at [2].
>>>>>
>>>>> [1] https://lore.kernel.org/u-boot/20230731223327.109865-1-sean.anderson@seco.com/
>>>>> [2] https://source.denx.de/u-boot/custodians/u-boot-clk/-/pipelines/17116
>>>>
>>>> I've boot-tested this on my lab and a handful of SD+FAT or SD+raw boot
>>>> devices, and it's all worked. I also did my usual world build
>>>> before/after to check out the size changes and well, I don't know. The
>>>> size wins come on platforms where there's both FAT+EXT support. And
>>>> then on mips with say mt7620_rfb I wonder if we're losing
>>>> functionality.
>>>
>>> Yes we are. I'll address this in v6.
>>>
>>>> But most platforms grow a bit, especially if they're just
>>>> a single load type (which is the most common case). Maybe this problem
>>>> needs to be approached another way? I've put the whole log over at
>>>> https://gist.github.com/trini/e6772b2134e0eb44393364ea98729e06 and maybe
>>>> something will jump out to you.
>>>
>>> The fls/ffs patch [1] reduces the growth by around 80 bytes on ARM
>>> platforms. I sent it separately for ease of review, but it really should
>>> be applied before this series, since a good portion of the growth is due
>>> to that one function call. It seems like MIPS also has this instruction
>>> (and Linux uses it), so I can try putting together a patch for it as
>>> well.
>>>
>>> In the future, I would like to convert bl_len to bl_shift or something,
>>> which would remove the need for fls (and going from bl_shift to bl_len
>>> is just a shift). spl_load_info is used in a lot of places, so I wanted
>>> to do that in a follow-up. On arches with efficient fls implementations
>>> the difference is only around 3 instructions or so.
>>>
>>> But fundamentally some of the problem is that the logic is being moved
>>> into multiple translation units, so the compiler can't see enough to
>>> make optimizations. For example, all filesystems use a bl_len of 1, so
>>> all the multiplications/roundings/etc. can be eliminated if the compiler
>>> can see that. For example, on arm64 copying spl_load into spl_fat.c (as
>>> fat_load) and making it static saves us 100 bytes:
>>>
>>> add/remove: 3/1 grow/shrink: 0/1 up/down: 324/-92 (232)
>>> Function old new delta
>>> spl_load - 160 +160
>>> spl_simple_read - 104 +104
>>> spl_fit_read - 60 +60
>>> file_fat_read 40 - -40
>>> spl_load_image_fat 252 200 -52
>>> Total: Before=66833, After=67065, chg +0.35%
>>>
>>> add/remove: 2/1 grow/shrink: 1/0 up/down: 172/-40 (132)
>>> Function old new delta
>>> spl_simple_read - 104 +104
>>> spl_fit_read - 60 +60
>>> spl_load_image_fat 252 260 +8
>>> file_fat_read 40 - -40
>>> Total: Before=66833, After=66965, chg +0.20%
>>>
>>> and even then, by inspection spl_simple_read isn't really taking
>>> advantage of bl_len=1. Surprisingly, after enabling LTO this board is
>>> +400 bytes (compared to without this series). It's just hard to beat
>>> handwritten routines with a generic solution.
>>>
>>> I would really like to consolidate this stuff, since every load method
>>> ends up only supporting a few image types and having almost identical
>>> implementations for what they do support.
>>>
>>> --Sean
>>>
>>> [1] https://lore.kernel.org/u-boot/20230731212734.7256-1-sean.anderson@seco.com/
>>
>> Thanks for digging more, let's keep going in this direction then.
>>
>
> Do we need to pass sectors to the info->load() functions?
>
> Only nand and mmc use bl_len != 1. Can we move the conversion to sectors
> and the alignment checks to functions called by info->load()?
>
> I would assume that this can avoid part of the code size increase.

Unfortunately, while the conversion to sectors and the alignment checks can be moved into
->load, MMC (and ROMAPI) can only read in block-sized chunks. This means that the buffer's
size must be a multiple of the block size (otherwise we will write past the end), and there
must be a block's worth of space in front of the buffer as well (which we use when we have
a non-block-aligned offset).
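
To make the constraint concrete, here is a minimal sketch of what a block-based read boils
down to; read_blocks() and the parameter names are placeholders, not actual U-Boot code:

/* hypothetical: read 'count' whole blocks starting at 'blk' into 'dst' */
int read_blocks(void *dst, unsigned long blk, unsigned long count);

static int block_read(void *buf, unsigned long offset, unsigned long size,
		      unsigned long bl_len)
{
	/* first and last device block touched by the requested byte range */
	unsigned long first = offset / bl_len;
	unsigned long last = (offset + size - 1) / bl_len;

	/*
	 * The device transfers whole blocks, so the write really starts at
	 * buf - (offset % bl_len) and covers (last - first + 1) * bl_len
	 * bytes: up to bl_len - 1 bytes before buf and up to bl_len - 1
	 * bytes past buf + size can be clobbered.
	 */
	return read_blocks((char *)buf - offset % bl_len, first,
			   last - first + 1);
}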

These constraints generally only matter when reading into a malloc'd buffer, since there is
data that can be clobbered on either end of the buffer. With things like SYS_TEXT_BASE, there
is nothing to overwrite on either end. From examining the various loaders, malloc'd buffers
are only used when loading the image header (which may include the full image for FITs). In
that situation, writing before the beginning of the buffer is not an issue, since the load
method should have supplied an offset which is a multiple of the block size. Thus, we only
need to solve the problem of writing past the end of the buffer.
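
For the header read, that looks roughly like this (a sketch only; spl_read() stands in for
the load method's read callback and is not a real function, while ALIGN() and struct
legacy_img_hdr are the usual U-Boot helpers):

	/* round the allocation up so whole-block reads stay inside it */
	size_t hdr_size = ALIGN(sizeof(struct legacy_img_hdr), bl_len);
	struct legacy_img_hdr *header = malloc(hdr_size);

	if (!header)
		return -ENOMEM;

	/* offset % bl_len == 0, so nothing before the buffer is touched */
	if (spl_read(header, offset, hdr_size)) {
		free(header);
		return -EIO;
	}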

From my perspective, there are three approaches we could take:

- Keep bl_len so that callers of ->read can allocate a sufficiently large buffer.
- Use a bounce buffer. This uses around double the memory and will also increase code size.
- Complete most of the read as normal, but read the final block into a separate buffer which
  is then copied into the passed-in buffer (sketched below). This requires two reads, which
  is likely slower than one. It will also increase code size.
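
A rough sketch of that third approach, assuming a hypothetical whole-block read helper
read_blocks() and an upper bound MAX_BL_LEN on the block size:

static int read_with_safe_tail(void *buf, unsigned long first_blk,
			       unsigned long size, unsigned long bl_len)
{
	unsigned long whole = size / bl_len;	/* full blocks that fit in buf */
	unsigned long rem = size % bl_len;	/* bytes of the final partial block */
	char tmp[MAX_BL_LEN];
	int ret;

	ret = read_blocks(buf, first_blk, whole);
	if (ret)
		return ret;

	if (rem) {
		/* read the last block somewhere safe, then copy only the tail */
		ret = read_blocks(tmp, first_blk + whole, 1);
		if (ret)
			return ret;
		memcpy((char *)buf + whole * bl_len, tmp, rem);
	}

	return 0;
}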
I think the first approach could be improved by optionally removing bl_len when there are
no load methods which need it. This will let the compiler optimize out any alignment if it
is unnecessary, which it cannot currently do (unless LTO is enabled). I am going to try using
this approach for v6.
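
Something along these lines; CONFIG_SPL_LOAD_BLOCK and spl_get_bl_len() are made-up names,
just to illustrate the idea:

static inline int spl_get_bl_len(struct spl_load_info *info)
{
	if (IS_ENABLED(CONFIG_SPL_LOAD_BLOCK))
		return info->bl_len;

	/* constant 1: the compiler can drop all the rounding/multiplies */
	return 1;
}

With only filesystem loaders enabled, every use of bl_len then folds down to the
byte-addressed case.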
--Sean