[U-Boot] [PATCH 2/3] string: Provide a slimmed-down memset()

Tue Mar 28 12:34:46 UTC 2017

On Monday, 27 March 2017, 23:16:45 CEST, Alexander Graf wrote:
> 
> On 27/03/2017 17:17, Heiko Stuebner wrote:
> > On Monday, 27 March 2017, 09:14:47 CEST, Alexander Graf wrote:
> >>
> >> On 27/03/2017 01:38, Simon Glass wrote:
> >>> Most of the time the optimised memset() is what we want. For extreme
> >>> situations such as TPL it may be too large. For example on the 'rock'
> >>> board, using a simple loop saves a useful 48 bytes. With gcc 4.9 and
> >>> the rodata bug, this patch is enough to reduce the TPL image below the
> >>> limit.
> >>>
> >>> Signed-off-by: Simon Glass <sjg at chromium.org>
> >>> ---
> >>>
> >>>  lib/Kconfig  | 9 +++++++++
> >>>  lib/string.c | 6 ++++--
> >>>  2 files changed, 13 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/lib/Kconfig b/lib/Kconfig
> >>> index 65c01573e1..5bf512d8c0 100644
> >>> --- a/lib/Kconfig
> >>> +++ b/lib/Kconfig
> >>> @@ -52,6 +52,15 @@ config LIB_RAND
> >>>  	help
> >>>  	  This library provides pseudo-random number generator functions.
> >>>
> >>> +config FAST_MEMSET
> >>> +	bool "Use an optimised memset()"
> >>> +	default y
> >>> +	help
> >>> +	  The faster memset() is the arch-specific one (if available) enabled
> >>> +	  by CONFIG_USE_ARCH_MEMSET. If that is not enabled, we can still get
> >>> +	  better performance by write a word at a time. Disable this option
> >>> +	  to reduce code size slightly at the cost of some speed.
> >>
> >> The comment sounds slightly confused - it took me a few times of reading
> >> it until I grasped what it was trying to tell me :).
> >>
> >>> +
> >>>  source lib/dhry/Kconfig
> >>>
> >>>  source lib/rsa/Kconfig
> >>> diff --git a/lib/string.c b/lib/string.c
> >>> index 67d5f6a421..159493ed17 100644
> >>> --- a/lib/string.c
> >>> +++ b/lib/string.c
> >>> @@ -437,8 +437,10 @@ char *strswab(const char *s)
> >>>  void * memset(void * s,int c,size_t count)
> >>>  {
> >>>  	unsigned long *sl = (unsigned long *) s;
> >>> -	unsigned long cl = 0;
> >>>  	char *s8;
> >>> +
> >>> +#ifdef CONFIG_FAST_MEMSET
> >>> +	unsigned long cl = 0;
> >>>  	int i;
> >>>
> >>>  	/* do it one word at a time (32 bits or 64 bits) while possible */
> >>> @@ -452,7 +454,7 @@ void * memset(void * s,int c,size_t count)
> >>>  			count -= sizeof(*sl);
> >>>  		}
> >>>  	}
> >>> -	/* fill 8 bits at a time */
> >>> +#endif	/* fill 8 bits at a time */
> >>
> >> So while this is all neat, a few ideas:
> >>
> >> 1) Would having memset in a header improve things even more? After all,
> >> each external function call clobbers registers that you need to
> >> save/restore...
> >
> > I'd guess it really depends on the size constraints. The regular
> > libgeneric memset compiles on my rk3188 tpl to a total of
> > 64bytes on both gcc-4.9 and gcc-6.3 while Simon's fast-memset
> > comes down to 14bytes on my rk3188.
> >
> > On the rk3188 the only memset user is board_init_f, so here memset
> > is called only once without needing to save registers and I'd guess if an
> > implementation really is that size-constrained to worry about 50bytes
> > this one caller will probably always be the only one?
> 
> I'm not sure I follow. If you put it into a header, the compiler has a 
> better chance of evicting untaken code paths and optimize register usage 
> over object linked variants (unless you use GOLD). I was mostly 
> wondering whether that would already give you the savings without 
> introducing a complicated #ifdef that is going to bitrot over time :).

On the rk3188 TPL, the small non-fast memset gets compiled to the
following (bfd linker):
100809aa <board_init_f_init_reserve>:
100809aa:       b510            push    {r4, lr}
100809ac:       22c0            movs    r2, #192        ; 0xc0
100809ae:       2100            movs    r1, #0
100809b0:       4604            mov     r4, r0
100809b2:       f000 f804       bl      100809be <memset>
100809b6:       34c0            adds    r4, #192        ; 0xc0
100809b8:       f8c9 4090       str.w   r4, [r9, #144]  ; 0x90
100809bc:       bd10            pop     {r4, pc}

100809be <memset>:
100809be:       4402            add     r2, r0
100809c0:       4603            mov     r3, r0
100809c2:       4293            cmp     r3, r2
100809c4:       d100            bne.n   100809c8 <memset+0xa>
100809c6:       4770            bx      lr
100809c8:       f803 1b01       strb.w  r1, [r3], #1
100809cc:       e7f9            b.n     100809c2 <memset+0x4>

It does not save any outside registers, as it is only used once. What
I was trying to say is that in cases where we worry about having the
tiniest memset possible, that will most likely stay the only call.
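
For reference, the memset disassembly above corresponds roughly to a
byte-wise loop like the following. This is a minimal sketch, not the
exact code from the patch; tiny_memset() and tiny_memset_selfcheck() are
hypothetical names, chosen so the example builds on a host without
clashing with the C library's memset().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the size-optimised memset(): fills one
 * byte at a time, with no word-sized fast path.
 */
void *tiny_memset(void *s, int c, size_t count)
{
	unsigned char *p = s;			/* mov r3, r0 */
	unsigned char *end = p + count;		/* add r2, r0 */

	while (p != end)			/* cmp r3, r2 ; bne.n */
		*p++ = (unsigned char)c;	/* strb.w r1, [r3], #1 */

	return s;				/* r0 is never modified */
}

/* Small self-check that can be run on a host build. */
void tiny_memset_selfcheck(void)
{
	unsigned char buf[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

	/* Clear the four bytes in the middle and leave the rest alone. */
	assert(tiny_memset(buf + 2, 0, 4) == buf + 2);
	assert(buf[1] == 2 && buf[2] == 0 && buf[5] == 0 && buf[6] == 7);
}
```

The loop compares against an end pointer instead of counting down,
which is what lets the compiler get away with a single post-indexed
store in the loop body.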

But I may have been dug into the rk3188 TPL specifics too long to see
other possible cases right now :-) .

> I'm just slightly worried about the massive number of preprocessor 
> excludes that happen in U-Boot in general. It seems like something 
> that's really hard to ever have full testing coverage on.

That's essentially what I was worried about as well, since it seems
memset can be provided by different sources.
There is the libgeneric memset we have here, and also the arch-specific
memset (way faster, but again way bigger). Without using either, one
could at the moment also provide a completely separate implementation.

So having one version in a header would probably also incur some sort of
ifdef voodoo?
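
To illustrate the concern, a header variant might end up looking
something like this. It is only a sketch under assumptions:
CONFIG_TINY_MEMSET and hdr_memset() are hypothetical names that do not
exist in U-Boot, and the #define is there only so the fragment compiles
on its own.

```c
#include <stddef.h>

/* Hypothetical option, defined here only to keep the sketch self-contained. */
#define CONFIG_TINY_MEMSET

#ifdef CONFIG_TINY_MEMSET
/* Inline byte-wise variant: the compiler sees the body at every call site. */
static inline void *hdr_memset(void *s, int c, size_t count)
{
	unsigned char *p = s;

	while (count--)
		*p++ = (unsigned char)c;

	return s;
}
#else
/* Otherwise fall back to an out-of-line implementation built elsewhere. */
void *hdr_memset(void *s, int c, size_t count);
#endif
```

Even this small version needs one #ifdef to choose between the inline
and the out-of-line variant, and the arch-specific memset selected by
CONFIG_USE_ARCH_MEMSET would add another branch on top.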

> >> 2) How much would GOLD save you? Have you tried? U-Boot is small enough
> >> of a code base that global optimizations should be able to give
> >> significant size savings.
> >
> > I think the issue that this is trying to solve is to allow more
> > toolchains to be used and thus make rebuilds on changes work on a lot
> > of boards at the same time with random toolchains.
> >
> > gcc-6.3 already produces way smaller results (well within the size
> > constraints the rk3188 has) than for example the gcc-4.9 used by
> > buildman as baseline toolchain.
> 
> Ah, I see. So 4.9 does not have -lto? There's a good chance my gut 
> feeling that GOLD actually saves anything is wrong - I don't know. Has 
> anyone done the numbers? Then we would have something to actually base 
> gut feeling on.

It looks like the U-Boot Makefile explicitly makes sure to use GNU ld,
so I didn't try to dig deeper into this :-) .

> Size is always a serious constraint in U-Boot, especially in SPL 
> environments. If we can include one more tool in our portfolio to 
> optimize size across the board, I'm all for it. This patch just feels 
> slightly short-term - but I'm definitely not nack'ing it :).

Heiko