STM32MP1 boot slow
patrick.delaunay at st.com
Fri Mar 27 16:27:35 CET 2020
> From: Simon Glass <sjg at chromium.org>
> Sent: jeudi 26 mars 2020 17:20
> Hi Patrick,
> On Wed, 25 Mar 2020 at 09:57, Patrick DELAUNAY <patrick.delaunay at st.com>
> > Hi,
> > > From: Marek Vasut <marex at denx.de>
> > > Sent: mercredi 25 mars 2020 00:39
> > >
> > > Hi,
> > >
> > > I was looking at the STM32MP1 boot time and I noticed it takes about
> > > 2 seconds to get to U-Boot.
> > Thanks for the feedback.
> > To be clear, the SPL is not the ST priority, as we have many limitations
> > (mainly on power management) for the SPL boot chain
> > Rom code => SPL => U-Boot
> > The recommended boot chain for STM32MP1 is Rom code => TF-A => U-Boot
> > (stm32mp15_trusted_defconfig).
> > > One problem is the insane I2C timing calculation in stm32f7 i2c
> > > driver, which is almost a mallocator and CPU stress test and takes
> > > about 1 second to complete in SPL -- we need some simpler
> > > replacement for that, possibly the one in DWC I2C driver might do?
> > Our first idea to manage these I2C settings (prescaler/timing settings)
> > was to set these values in the device tree, but this binding was refused, so
> > this function stm32_i2c_choose_solution()
> Was the binding refused in Linux? Could we add something U-Boot-specific then? I
> think having 'early' timings, etc. is very handy. We are doing this on x86.
st,i2c-timing : A 32-bit I2C timing register value
> Of course it has traditionally been impossible to convince Linux people to add this
> sort of thing. Still, I think we should do it. Our U-Boot-specific files allow this.
> > provides the best settings for any input clock and I2C frequency (called for
> > each probe).
Yes, it is one possible solution.
I have already proposed it internally.
> > But it is a brute-force and not an optimal solution: it tries all the
> > solutions to find the best one.
> > And the performance problem of this loop (shared code between Linux /
> > U-Boot/TF-A drivers) had already been seen/checked on the ST side in the TF-A context.
> We should be able to calculate it, like with dw-i2c.
I checked drivers/i2c/designware_i2c.c...
Nothing obviously applicable on ST IP.
In fact, today I also challenged the I2C owner on the need for
this loop to find the optimal parameters in a bootloader.
I think that a 'good enough' register value could be found with a few operations
(as in designware_i2c.c).
Moreover, it seems tuning isn't really needed if we limit the I2C speed to 400kHz.
I am still waiting for internal feedback, but with COVID-19 it is more difficult here.
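
To illustrate what a 'few operations' calculation could look like, here is a
minimal sketch. It is not the stm32f7 driver code and not a validated
replacement for stm32_i2c_choose_solution(). The helper name
i2c_timing_simple(), the 50/50 and 2/3-1/3 low/high split, and the choice to
ignore analog/digital filter delays and rise/fall times are all assumptions
made for illustration; only the TIMINGR field layout (PRESC, SCLDEL, SDADEL,
SCLH, SCLL) is taken from the I2C IP register description.

```c
#include <stdint.h>

/* Field layout of the I2C_TIMINGR register on the STM32F7-style I2C IP. */
#define TIMINGR_PRESC(n)  (((uint32_t)(n) & 0xfu) << 28)
#define TIMINGR_SCLDEL(n) (((uint32_t)(n) & 0xfu) << 20)
#define TIMINGR_SDADEL(n) (((uint32_t)(n) & 0xfu) << 16)
#define TIMINGR_SCLH(n)   (((uint32_t)(n) & 0xffu) << 8)
#define TIMINGR_SCLL(n)   ((uint32_t)(n) & 0xffu)

/*
 * Hypothetical helper (illustration only): pick the smallest prescaler for
 * which the SCL low/high counts fit in 8 bits and the data setup delay fits
 * in 4 bits. Filter delays and rise/fall times are ignored (assumption), so
 * the real bus would run slightly slower than requested.
 * Returns 0 on success, -1 if no setting fits.
 */
static int i2c_timing_simple(uint32_t clk_hz, uint32_t bus_hz,
			     uint32_t *timingr)
{
	/* Minimum data setup time from the I2C spec: 250 ns standard, 100 ns fast. */
	uint64_t tsu_ns = (bus_hz <= 100000u) ? 250u : 100u;
	uint32_t presc;

	if (!clk_hz || !bus_hz || !timingr)
		return -1;

	for (presc = 0; presc < 16; presc++) {
		/* Prescaled ticks per SCL period, rounded up (never too fast). */
		uint64_t div = (uint64_t)bus_hz * (presc + 1);
		uint64_t ticks = (clk_hz + div - 1) / div;
		/* Assumed split: 50/50 up to 100 kHz, about 2/3 low above that. */
		uint64_t low = (bus_hz <= 100000u) ? (ticks + 1) / 2
						   : (2 * ticks + 2) / 3;
		uint64_t high = ticks - low;
		/* Prescaled ticks needed to cover the setup time, rounded up. */
		uint64_t den = 1000000000ull * (presc + 1);
		uint64_t su = (tsu_ns * clk_hz + den - 1) / den;

		if (high == 0)
			return -1;	/* clock too slow for this bus speed */
		if (low > 256 || high > 256 || su > 16)
			continue;	/* does not fit, try a larger prescaler */
		if (su == 0)
			su = 1;

		*timingr = TIMINGR_PRESC(presc) | TIMINGR_SCLDEL(su - 1) |
			   TIMINGR_SDADEL(0) | TIMINGR_SCLH(high - 1) |
			   TIMINGR_SCLL(low - 1);
		return 0;
	}

	return -1;
}
```

The loop runs at most 16 iterations with a handful of integer divisions each,
so its cost should stay small even with the data cache off. Whether the
resulting values are good enough on real boards still needs to be confirmed.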
> > We tried to improve the solution, without success, but finally the
> > performance issue was solved by dcache activation in TF-A before executing
> > this loop.
> I would like to see patches to enable the cache. We did this some years ago in a
> Chromebook and it made a big difference. It is not that hard.
Yes, I am working on this patch, and today it is already functional.
Updated bootstage reports are available in the commit message.
I just need to cross-check that the TLB and the cache are correctly managed
if I only activate/deactivate the cache with the CP15 functions.
I don't want to miss something on this sensitive point.
> > But as the data cache is not activated in SPL, this loop has terrible performance.
> > We need to dig into this topic again from the U-Boot point of view (SPL &
> > also in U-Boot, before relocation and after relocation).
> > And I have shared this issue with the ST owner of this code.
> > For information, I added some traces and I get, for the same code execution on DK2:
> > - 440ms in SPL (dcache OFF)
> > - 36ms in U-Boot (dcache ON)
> > > Another item I found is that, in U-Boot, initf_dm() takes about half
> > > a second and so does serial_init(). I didn't dig into it to find out
> > > why, but I suspect it has to do with the massive amount of UCLASSes
> > > the DM has to traverse OR with the CPU being slow at that point, as the clock
> > > driver didn't get probed just yet.
> > >
> > > Thoughts ?
> > Yes, it is the first parsing of device tree, and it is really slow...
> > directly linked to device tree size and libfdt.
> I wonder if we can improve this. There was a change to how the drivers were
> bound (changing the ordering). We could perhaps revert that for SPL.
I see no issue in SPL, as the reduced device tree is short.
The issue is in the U-Boot pre-relocation stage (with the full device tree but without cache).
> > And because it is done before relocation (before dcache enable).
> > Measurement on DK2 = 649ms
> > It is another topic in my TODO list.
> > I want to explore livetree activation to reduce the DT parsing time.
> Not in SPL though I suspect.
In U-Boot proper (it is in my TODO list)