[PATCH 0/8] efi_loader: Complete the bootflow_efi() test

Mon Jan 13 20:01:36 CET 2025

Hi Tom,

On Fri, 10 Jan 2025 at 09:48, Tom Rini <trini at konsulko.com> wrote:
>
> On Fri, Jan 10, 2025 at 06:40:37AM -0700, Simon Glass wrote:
> > Hi Tom,
> >
> > On Thu, 9 Jan 2025 at 09:51, Tom Rini <trini at konsulko.com> wrote:
> > >
> > > On Thu, Jan 09, 2025 at 08:02:01AM -0700, Simon Glass wrote:
> > > > Hi Tom,
> > > >
> > > > On Wed, 8 Jan 2025 at 12:15, Tom Rini <trini at konsulko.com> wrote:
> > > > >
> > > > > On Wed, Jan 08, 2025 at 10:02:52AM -0700, Simon Glass wrote:
> > > > > > Hi Heinrich, Tom,
> > > > > >
> > > > > > On Tue, 7 Jan 2025 at 08:47, Heinrich Schuchardt <xypron.glpk at gmx.de> wrote:
> > > > > > >
> > > > > > > On 07.01.25 16:11, Tom Rini wrote:
> > > > > > > > On Tue, Jan 07, 2025 at 06:57:50AM -0700, Simon Glass wrote:
> > > > > > > >> Hi Heinrich,
> > > > > > > >>
> > > > > > > >> On Tue, 7 Jan 2025 at 06:11, Heinrich Schuchardt <xypron.glpk at gmx.de> wrote:
> > > > > > > >>>
> > > > > > > >>> On 07.01.25 13:15, Simon Glass wrote:
> > > > > > > >>>> Hi Heinrich,
> > > > > > > >>>>
> > > > > > > >>>> On Mon, 6 Jan 2025 at 10:00, Heinrich Schuchardt <xypron.glpk at gmx.de> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>> On 06.01.25 15:47, Simon Glass wrote:
> > > > > > > >>>>>> This test was hamstrung in code review so this series is an attempt to
> > > > > > > >>>>>> complete the intended functionality:
> > > > > > > >>>>>>
> > > > > > > >>>>>> - Check memory allocations look correct
> > > > > > > >>>>>> - Check that exit-boot-services removes active-DMA devices
> > > > > > > >>>>>> - Check that the bootflow is still present after testapp finishes
> > > > > > > >>>>>>
> > > > > > > >>>>>> The EFI functionality duplicates bootm_announce_and_cleanup() and still
> > > > > > > >>>>>> uses the defunct board_quiesce_devices() so a nice cleanup would be to
> > > > > > > >>>>>> call the bootm function instead, with suitable modifications. That would
> > > > > > > >>>>>> allow bootstage to work too.
> > > > > > > >>>>>>
> > > > > > > >>>>>> This series is based on sjg/master since the EFI logging was rejected so
> > > > > > > >>>>>> far.
> > > > > > > >>>>>
> > > > > > > >>>>> Yes, it was rejected because a solution at the lib/log.c level would be
> > > > > > > >>>>> more generic.
> > > > > > > >>>>
> > > > > > > >>>> As I mentioned, that idea isn't suitable for programmatic use.
> > > > > > > >>>
> > > > > > > >>> What can be done with show_addr("mem", rec->memory); that log_debug()
> > > > > > > >>> does not offer or which you could not do with a new log function in
> > > > > > > >>> lib/log.c that takes variadic arguments?
> > > > > > > >>
> > > > > > > >> There are asserts in [1], for example. How do you propose to handle
> > > > > > > >> that? See [2] for my previous explanation, quoted here:
> > > > > > > >>
> > > > > > > >>> CONFIG_LOG with a bloblist option would be a great idea, but it's hard
> > > > > > > >>> to programmatically scan text...plus only the external call sites are
> > > > > > > >>> actually logged.
> > > > > > > >>
> > > > > > > >> Also see the discussion on the original patch [3]. There was also your
> > > > > > > >> reply at [4], but I think you missed that this is intended for use in
> > > > > > > >> unit tests (i.e. with ut_assert()).
> > > > > > > >>
> > > > > > > >> You also requested that this be generalised, rather than being
> > > > > > > >> EFI-loader-specific. I have no objection to that, but don't have a use
> > > > > > > >> case for it yet, so have deferred that to later. It's a fairly simple
> > > > > > > >> change, if/when needed. If the series was not NAKed, I'd be happy to
> > > > > > > >> do it now.
> > > > > > > >>
> > > > > > > >>>>
> > > > > > > >>>>>
> > > > > > > >>>>> Tom suggested not to send patches that are for private enjoyment to the
> > > > > > > >>>>> mailing list.
> > > > > > > >>>>
> > > > > > > >>>> My contributions to U-Boot are only ever about private enjoyment :-)
> > > > > > > >>>>
> > > > > > > >>>> Do you have any comments on the patches?
> > > > > > > >>
> > > > > > > >> Regards,
> > > > > > > >> Simon
> > > > > > > >>
> > > > > > > >> [1] https://patchwork.ozlabs.org/project/uboot/patch/20250106144755.3054780-6-sjg@chromium.org/
> > > > > > > >> [2] https://lore.kernel.org/u-boot/CAFLszTjxOE_037+kR0jgdax80sBombYo_k0YgiuVnP=KZCOvuA@mail.gmail.com/
> > > > > > > >> [3] https://lore.kernel.org/u-boot/CAC_iWjKtaN54B98OKbkoXkC_GmKJ=x+M4=UY_O6roSOpZaDxag@mail.gmail.com/
> > > > > > > >> [4] https://lore.kernel.org/u-boot/D513D326-41A6-425E-B11F-85958065BCD2@gmx.de/
> > > > > > > >
> > > > > > > > Looking at the logging portions of the original series again, especially
> > > > > > > > if this was made generic, we probably don't want to print to actual
> > > > > > > > console every time we're making a note of some memory allocation for
> > > > > > > > example, that would be unreadable outside of a debug context. The point
> > > > > > > > of this really seems to be "log things for verifying in tests later".
> > > > > > > > Does that end up being useful? I don't know. Heinrich or Ilias, do the
> > > > > > > > tests in [1] look generally useful?
> > > > > > > >
> > > > > > >
> > > > > > > The tests in [1] are not documented, not even in the commit message. So
> > > > > > > the reasoning behind the tests remains Simon's secret.
> > > > > >
> > > > > > Are you asking for code comments in the test? If so, I can add some.
> > > > > >
> > > > > > >
> > > > > > > At first sight the tests in [1] don't make much sense. E.g. that only a
> > > > > > > subset of memory types have been used does not tell that the right
> > > > > > > memory type has been used for the right object.
> > > > > >
> > > > > > It is a pretty good start, though. It makes sure that the memory types
> > > > > > are sane, checks addresses are within DRAM, etc. With [5] it makes
> > > > > > sure that devices are removed.
> > > > > >
> > > > > > >
> > > > > > > Implementing a specific tracing functionality for EFI is definitely
> > > > > > > the wrong way forward as it will lead to code duplication.
> > > > > >
> > > > > > We can cross that bridge when we come to it.
> > > > >
> > > > > Well, no. It's backwards to make a bridge in one place when everyone
> > > > > agrees it needs to be moved somewhere else. I mean [5] is a generic
> > > > > issue and test/py/tests/test_net_boot.py or some other test we already
> > > > > have which tests booting an OS should confirm that we've quiesced
> > > > > devices before moving on. And as a bonus it's in python where dealing
> > > > > with strings doesn't suck.
> > > >
> > > > I really don't want to write C tests in Python. CI is slow enough as
> > > > it is, something I really want to fix. I'm also not sure how you can tell
> > > > if a device has been removed. Run 'dm tree' and look for the missing
> > > > 'star' in the resulting 300 lines of text?
> > >
> > > As I'm in a bisect-hell in our C tests you'll have to forgive me for not
> > > thinking the C tests are noticeably faster than python tests. Or that
> > > they aren't their own potential source of corner-case bugs. But I
> > > digress..
> >
> > Welcome to my world. I bisected my lab devices so many times to try to
> > isolate all the breakages that have crept in. What is the problem,
> > maybe I can help?
>
> Sure. test/cmd/hash.c::dm_test_cmd_hash_md5 fails randomly, in maybe 1
> out of 100 runs, via pytest, in sandbox. Not via "./u-boot -T -c 'ut dm
> dm_test_cmd_hash_md5'" however (I stopped checking after 1000
> iterations). I was iterating over "and built with clang" but I think it
> happens with gcc too, from the actual failures in CI. And you can use
> "-k ut" to limit to just what's matched there, so it's a quicker
> iteration.

Hmmm, do you have a link? It's hard to imagine what it is, but perhaps
it is a dependency on a previous test.

At present 'ut all' fails so I am going to take a look at that. Quite
a bit of clean-up is needed in the test system, though. Ideally we
could run the tests in random order so we can find and fix the
dependencies. For driver model we reinit as needed, but that's not the
case for EFI, for example.
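
A minimal sketch of the random-order idea, assuming a flat table of
tests. None of this is existing U-Boot code: struct demo_test,
demo_rand() and demo_shuffle() are hypothetical names. The order is
driven by a seed, so an order that exposes a dependency can be
replayed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical test descriptor; not U-Boot's own test structure */
struct demo_test {
	const char *name;
	int (*func)(void);
};

/* Small xorshift32 PRNG so that a given seed always gives the same order */
static uint32_t demo_rand(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;

	return x;
}

/* Fisher-Yates shuffle of the run order, driven by the given seed */
void demo_shuffle(struct demo_test *tests, size_t count, uint32_t seed)
{
	uint32_t state = seed ? seed : 1;	/* xorshift must not start at 0 */
	size_t i;

	assert(tests || !count);
	for (i = count; i > 1; i--) {
		size_t j = demo_rand(&state) % i;
		struct demo_test tmp = tests[i - 1];

		tests[i - 1] = tests[j];
		tests[j] = tmp;
	}
}
```

Reporting the seed with any failure would let the same order be run
again while looking for the test that leaves state behind.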

>
> > > And yes, taking a bunch of text and parsing it, is what python is fast
> > > at. And easier to write.
> > >
> > > > But actually [5] is not generic, since EFI uses its own code to remove
> > > > devices. This test is solely focussed on EFI.
> > >
> > > Yes, you're testing the EFI version of the code in
> > > arch/$(ARCH)/lib/bootm.c. The remove devices functions being called in
> > > both cases are generic.
> >
> > The code in EFI is:
> >
> > if (!efi_st_keep_devices) {
> >         bootm_disable_interrupts();
> >         if (IS_ENABLED(CONFIG_USB_DEVICE))
> >                 udc_disconnect();
> >         board_quiesce_devices();
> >         dm_remove_devices_active();
> > }
> >
> > It does call somewhat the same functions, but is doing its own thing,
> > not even using the arch-specific code. As I mentioned, a nice clean-up
> > would be to make bootm_announce_and_cleanup() common.
>
> Yes, we almost agree? Both the EFI code, and arch/$(ARCH)/lib/bootm.c
> have functions that make the above calls. A nice clean-up would be to
> have something common.

Yes indeed. It still does not provide a test for the EFI bootmeth,
though, where I found half a dozen bugs.

>
> >
> > Actually, now that I see efi_st_keep_devices, I wonder why Heinrich
> > didn't want my ANSI patch[6] which serves a similar function.
>
> No? Your patch disables ANSI output in those tests, that variable is for
> making sure those tests can accomplish (if I skim things right) similar
> kinds of tests you've asked for before, but with an EFI app instead? But
> perhaps better to not start yet another tangent here...

I wouldn't know where to start, anyway...

> > > > If you want the logging to be renamed and placed centrally I don't
> > > > mind doing it now. But note that only EFI will use it for now.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > We already have function _log() which is variadic.
> > > > > > >
> > > > > > > Simon could write a new log driver that parses the `format` parameter
> > > > > > > and saves the binary data in an appropriate format for analysis by the
> > > > > > > unit tests:
> > > > > > >
> > > > > > > * For %s the driver should save the string and not the address of the
> > > > > > > string.
> > > > > > > * For %pD the driver should save the device path instead of the pointer.
> > > > > > > * ...
> > > > > > >
> > > > > > > Some changes to the log driver interface will be needed to pass the
> > > > > > > variadic arguments instead of the formatted message.
> > > > > >
> > > > > > Perhaps the word 'log' is confusing people. But the above suggestion
> > > > > > is quite a complicated way of handling things. We have no way to
> > > > > > decode printf() strings in this way. See log_dispatch() for how this
> > > > > > is handled today. It uses sprintf(). Trying to test based on text
> > > > > > output would be very clumsy (lots of regexes and sscanf() calls?) and
> > > > > > result in a huge amount of parsing code, highly dependent on the
> > > > > > printf() format, etc.
> > > > > >
> > > > > > I very-much doubt that would produce a useful implementation, but if
> > > > > > you would like to try it out then I would be happy to look at it.
> > > > > >
> > > > > > I mentioned this several times, but even if we did go that way, we
> > > > > > only have logging on the external calls, so much of the EFI-memory
> > > > > > allocation in U-Boot would not be logged.
> > > > > >
> > > > > > Regards,
> > > > > > Simon
> > > > > >
> > > > > > [5] https://patchwork.ozlabs.org/project/uboot/patch/20250106144755.3054780-9-sjg@chromium.org/
> > > > >
> > > > > Yes, calling this a "log" when it's intended for capturing information
> > > > > for tests got some of this off on the wrong track. But that also helps
> > > > > explain now that this is still on the wrong track and should instead be
> > > > > following normal design practices for testing and expanding existing
> > > > > infrastructure and not inventing a new everything. So if you don't like
> > > > > Heinrich's suggestion, take a look at Caleb's suggestion.
> > > >
> > > > I don't have the energy to port the tracing framework from Linux to
> > > > U-Boot, although I agree it would be useful. Still, function tracing
> > > > is quite fragile and confusing to work with when refactoring code. I
> > > > don't like that idea much for this use case, although if function
> > > > tracing did exist in U-Boot I would likely have used it.
> > >
> > > I mean yes, it would be good if you went back and expanded on the trace
> > > functionality you did before.
> >
> > I still don't believe it is the best solution and it seems like yet
> > another ocean I should avoid sticking my heater into.
>
> I strongly disagree. If you go back to the trace code you brought in to
> start with and make it more useful / include newer features existing
> elsewhere you're not going to end up in conflict with everyone asking
> why you're doing something subsystem specific.

Perhaps someone else could do this? It would be a substantial amount
of work to bring runtime tooling into U-Boot, bpf and the like. It
would be quite a pain to use, I suspect, and certainly not possible to
write a simple C test as I have done here.
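
For comparison, the sort of check a C test can make directly on
recorded data might look like the sketch below. It is standalone and
is not the code from the series: struct demo_alloc_rec,
demo_check_allocs() and all of the parameters are hypothetical. It
only illustrates checking that memory types are sane and that
addresses fall within DRAM, without parsing any text:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one allocation; not the struct from the series */
struct demo_alloc_rec {
	int mem_type;		/* memory type requested by the caller */
	uint64_t addr;		/* start address that was returned */
	uint64_t size;		/* size in bytes */
};

/* Return true if every record uses an allowed type and lies within DRAM */
bool demo_check_allocs(const struct demo_alloc_rec *recs, size_t count,
		       const int *allowed, size_t allowed_count,
		       uint64_t dram_base, uint64_t dram_size)
{
	size_t i, j;

	assert(recs || !count);
	for (i = 0; i < count; i++) {
		const struct demo_alloc_rec *rec = &recs[i];
		bool type_ok = false;

		for (j = 0; j < allowed_count; j++) {
			if (rec->mem_type == allowed[j])
				type_ok = true;
		}
		if (!type_ok)
			return false;

		/* Written this way to avoid overflow in addr + size */
		if (rec->addr < dram_base || rec->size > dram_size ||
		    rec->addr - dram_base > dram_size - rec->size)
			return false;
	}

	return true;
}
```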

>
> > > > > And if you
> > > > > don't like Caleb's suggestion, go put this in a topic branch you can
> > > > > merge when you need to debug some problem that seemingly nothing else
> > > > > will catch.
> > > >
> > > > Here we are over a year after I reported the bug and we still don't
> > > > have a test to cover it. This series is better than the available
> > > > alternatives, IMO.
> > >
> > > Well, no. With commit dabaa4ae3206 ("dm: Add
> > > dm_remove_devices_active() for ordered device removal") we have a test
> > > for the underlying problem. We need more functional boot tests, but we
> > > need those to be in python too, and not more C code.
> >
> > That is a nice improvement, but did not fix the underlying problem.
> > The underlying problem was that EFI was calling exit-boot-services,
> > causing U-Boot to free up data structures which were needed to boot.
> > This was on x86_64. I never quite figured out which one (very hard
> > when you cannot get back to U-Boot to check).
> >
> > There were quite a lot of problems, actually. The v2 series is at [7]
> >
> > Only a C test can check what actually happens inside U-Boot.
>
> Yes, I think now we get back to disagreeing on which symptoms lead to
> which code problems and then what to do about them.

OK

>
> > > And you're not just coming up with a test, you're refactoring a bunch of
> > > code and introducing new subsystems in order to do that. When as I keep
> > > pointing out, we don't need that. We could easily extend the existing OS
> > > boot tests we have to script booting an ISO. And we only run those when
> > > say "ENABLE_SLOW_TESTS" is set, and only do that on tagged releases.
> >
> > Yes of course we need to refactor to make tests work. This is not
> > necessarily a bad thing, as it helps us break code down into testable
> > chunks. We cannot rely only on large functional-tests, not that you
> > are suggesting that. See [8], but they are too slow, too hard to debug
> > when they fail. They also tend to devolve into chaos as people get
> > lazy and stop writing unit/smaller tests.
>
> I'll just note that I don't ever even think to use "make tests" or
> "qcheck" or any of the others since they never work for me.

Would you mind filing an issue on that? I use 'make pcheck' all the time.

> With only a
> little bit of wrappering I can however run pytest like in CI.

Yes, I use that too.

Regards,
Simon