[PATCH] CI: Add automatic retry for test.py jobs

Simon Glass sjg at google.com
Thu Jul 27 21:18:12 CEST 2023


Hi Tom,

On Sun, 16 Jul 2023 at 12:18, Tom Rini <trini at konsulko.com> wrote:
>
> On Sat, Jul 15, 2023 at 05:40:25PM -0600, Simon Glass wrote:
> > Hi Tom,
> >
> > On Thu, 13 Jul 2023 at 15:57, Tom Rini <trini at konsulko.com> wrote:
> > >
> > > On Thu, Jul 13, 2023 at 03:03:57PM -0600, Simon Glass wrote:
> > > > Hi Tom,
> > > >
> > > > On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini at konsulko.com> wrote:
> > > > >
> > > > > On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > > > > > Hi Tom,
> > > > > >
> > > > > > On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini at konsulko.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > > > > > Hi Tom,
> > > > > > > >
> > > > > > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini at konsulko.com> wrote:
> > > > > > > > >
> > > > > > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > > > > > GitLab provide so that we will automatically re-run these when they fail
> > > > > > > > > 2 times. If they fail that often it is likely we have found a real issue
> > > > > > > > > to investigate.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Tom Rini <trini at konsulko.com>
> > > > > > > > > ---
> > > > > > > > >  .azure-pipelines.yml | 1 +
> > > > > > > > >  .gitlab-ci.yml       | 1 +
> > > > > > > > >  2 files changed, 2 insertions(+)
> > > > > > > >
> > > > > > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > > > > > if we should disable the tests / builders instead, until it can be
> > > > > > > > corrected?
> > > > > > >
> > > > > > > It happens in Azure, so it's not just the broken runner problem we have
> > > > > > > in GitLab. And the problem is timing, as I said in the commit.
> > > > > > > Sometimes we still get the RTC test failing. Other times we don't get
> > > > > > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> > > > > >
> > > > > > How do we keep this list from growing?
> > > > >
> > > > > Do we need to? The problem, in essence, is that since we rely on free
> > > > > resources, some heavy lifts sometimes take longer.  That's what this
> > > > > flag is for.
> > > >
> > > > I'm fairly sure the RTC thing could be made deterministic.
> > >
> > > We've already tried that once, and it happens a lot less often. If we
> > > make it even looser we risk making the test itself useless.
> >
> > For sleep, yes, but for rtc it should be deterministic now...next time
> > you get a failure could you send me the trace?
>
> Found one:
> https://dev.azure.com/u-boot/u-boot/_build/results?buildId=6592&view=logs&j=b6c47816-145c-5bfe-20a7-c6a2572e6c41&t=0929c28c-6e32-5635-9624-54eaa917d713&l=599

I don't seem to have access to that...but is it rtc or sleep?
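
(For reference, what I mean by deterministic for rtc is the pattern of
setting a known time and reading it straight back, checking only what was
set rather than racing the host clock. A rough sketch in test/py style;
this is not an existing test and the exact 'date' output format may differ:)

    def test_rtc_deterministic(u_boot_console):
        """Sketch only: set a fixed RTC time and read it straight back.

        Host scheduling delays then only show up as a little drift in
        the seconds field, which we deliberately do not check.
        """
        # U-Boot 'date' set syntax is MMDDhhmm[[CC]YY][.ss]
        u_boot_console.run_command('date 010112002000.00')
        response = u_boot_console.run_command('date')
        assert '2000-01-01' in response
        assert '12:0' in response   # 12:00..12:09, tolerant of drift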

>
> And note that we have a different set of timeout problems that may or may not
> be configurable, namely in the upload of the pytest results. I haven't yet seen
> whether there's a knob for this one within Azure (or the python project we're
> adding for it).

Oh dear.

>
> > > > The spawning thing...is there a timeout for that? What actually fails?
> > >
> > > It doesn't spawn in time for the framework to get to the prompt.  We
> > > could maybe increase the timeout value.  It's always the version test
> > > that fails.
> >
> > Ah OK, yes increasing the timeout makes sense.
> >
> > >
> > > > > > > > I'll note that we don't have this problem with sandbox tests.
> > > > > > >
> > > > > > > OK, but that's not relevant?
> > > > > >
> > > > > > It is relevant to the discussion about using QEMU instead of sandbox,
> > > > > > e.g. with the TPM. I recall a discussion with Ilias a while back.
> > > > >
> > > > > I'm sure we could make sandbox take too long to start as well, if enough
> > > > > other things are going on with the system.  And sandbox has its own set
> > > > > of super frustrating issues instead, so I don't think this is a great
> > > > > argument to have right here (I have to run it in docker to get around
> > > > > some application version requirements, and exclude the event_dump, bootmgr,
> > > > > abootimg and gpt tests, which could otherwise run but fail for me).
> > > >
> > > > I haven't heard about this before. Is there anything that could be done?
> > >
> > > I have no idea what could be done about it since I believe all of them
> > > run fine in CI, including on this very host, when gitlab invokes it
> > > rather than when I invoke it. My point here is that sandbox tests are
> > > just a different kind of picky about things and need their own kind of
> > > "just hit retry".
> >
> > Perhaps this is Python dependencies? I'm not sure, but if you see it
> > again, please let me know in case we can actually fix this.
>
> Alright. So on the first pass I took at running sandbox pytest with as
> little hand-holding as possible, I hit the known issue of /boot/vmlinu*
> being 0400 in Ubuntu. I fixed that, re-ran, and got:
> test/py/tests/test_cleanup_build.py F
>
> ========================================== FAILURES ===========================================
> _________________________________________ test_clean __________________________________________
> test/py/tests/test_cleanup_build.py:94: in test_clean
>     assert not leftovers, f"leftovers: {', '.join(map(str, leftovers))}"
> E   AssertionError: leftovers: fdt-out.dtb, sha1-pad/sandbox-u-boot.dtb, sha1-pad/sandbox-kernel.dtb, sha1-basic/sandbox-u-boot.dtb, sha1-basic/sandbox-kernel.dtb, sha384-basic/sandbox-u-boot.dtb, sha384-basic/sandbox-kernel.dtb, algo-arg/sandbox-u-boot.dtb, algo-arg/sandbox-kernel.dtb, sha1-pss/sandbox-u-boot.dtb, sha1-pss/sandbox-kernel.dtb, sha256-pad/sandbox-u-boot.dtb, sha256-pad/sandbox-kernel.dtb, sha256-global-sign/sandbox-binman.dtb, sha256-global-sign/sandbox-u-boot.dtb, sha256-global-sign/sandbox-u-boot-global.dtb, sha256-global-sign/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-binman-pss.dtb, sha256-global-sign-pss/sandbox-u-boot.dtb, sha256-global-sign-pss/sandbox-kernel.dtb, sha256-global-sign-pss/sandbox-u-boot-global-pss.dtb, auto_fit/dt-1.dtb, auto_fit/dt-2.dtb, sha256-pss/sandbox-u-boot.dtb, sha256-pss/sandbox-kernel.dtb, sha256-pss-pad/sandbox-u-boot.dtb, sha256-pss-pad/sandbox-kernel.dtb, hashes/sandbox-kernel.dtb, sha256-basic/sandbox-u-boot.dtb, sha256-basic/sandbox-kernel.dtb, sha1-pss-pad/sandbox-u-boot.dtb, sha1-pss-pad/sandbox-kernel.dtb, sha384-pad/sandbox-u-boot.dtb, sha384-pad/sandbox-kernel.dtb, sha256-pss-pad-required/sandbox-u-boot.dtb, sha256-pss-pad-required/sandbox-kernel.dtb, ecdsa/sandbox-kernel.dtb, sha256-pss-required/sandbox-u-boot.dtb, sha256-pss-required/sandbox-kernel.dtb
> E   assert not [PosixPath('fdt-out.dtb'), PosixPath('sha1-pad/sandbox-u-boot.dtb'), PosixPath('sha1-pad/sandbox-kernel.dtb'), PosixPa...ic/sandbox-u-boot.dtb'), PosixPath('sha1-basic/sandbox-kernel.dtb'), PosixPath('sha384-basic/sandbox-u-boot.dtb'), ...]
> ------------------------------------ Captured stdout call -------------------------------------
> +make O=/tmp/pytest-of-trini/pytest-231/test_clean0 clean
> make[1]: Entering directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
>   CLEAN   cmd
>   CLEAN   dts/../arch/sandbox/dts
>   CLEAN   dts
>   CLEAN   lib
>   CLEAN   tools
>   CLEAN   tools/generated
>   CLEAN   include/bmp_logo.h include/bmp_logo_data.h include/generated/env.in include/generated/env.txt drivers/video/u_boot_logo.S u-boot u-boot-dtb.bin u-boot-initial-env u-boot-nodtb.bin u-boot.bin u-boot.cfg u-boot.dtb u-boot.dtb.gz u-boot.dtb.out u-boot.dts u-boot.lds u-boot.map u-boot.srec u-boot.sym System.map image.map keep-syms-lto.c lib/efi_loader/helloworld_efi.S
> make[1]: Leaving directory '/tmp/pytest-of-trini/pytest-231/test_clean0'
> =================================== short test summary info ===================================
> FAILED test/py/tests/test_cleanup_build.py::test_clean - AssertionError: leftovers: fdt-out....
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> ================================= 1 failed, 6 passed in 6.42s =================================

That test never passes for me locally, because as you say we add a lot
of files to the build directory and there is no tracking of them such
that 'make clean' could remove them. We could fix that, e.g.:

1. Have binman record all its output filenames in a binman.clean file
2. Have tests always use a 'testfiles' subdir for files they create (rough
   sketch below)
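
For the second idea, a minimal sketch of what the convention could look like
(the 'testfiles' fixture below is hypothetical, not existing code; it assumes
the usual u_boot_config fixture with its build_dir attribute):

    import pathlib
    import shutil

    import pytest

    @pytest.fixture
    def testfiles(u_boot_config):
        """Hypothetical: give each test one scratch directory to write to.

        Everything lands under build_dir/testfiles and is removed again
        afterwards, so 'make clean' never needs to know about
        test-generated artifacts.
        """
        path = pathlib.Path(u_boot_config.build_dir) / 'testfiles'
        path.mkdir(parents=True, exist_ok=True)
        yield path
        shutil.rmtree(path, ignore_errors=True)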

>
> I fixed that manually with an rm -rf of /tmp/pytest-of-trini and now it's
> stuck.  I've rm -rf'd that and run git clean -dfx, and I just repeat that
> failure.  I'm hopeful that when I reboot, whatever magic is broken will
> be cleaned out.  Moving things into a docker container again, I get:
> =========================================== ERRORS ============================================
> _______________________________ ERROR at setup of test_gpt_read _______________________________
> /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:74: in state_disk_image
>     ???
> /home/trini/work/u-boot/u-boot/test/py/tests/test_gpt.py:37: in __init__
>     ???
> test/py/u_boot_utils.py:279: in __enter__
>     self.module_filename = module.__file__
> E   AttributeError: 'NoneType' object has no attribute '__file__'
> =================================== short test summary info ===================================
> ERROR test/py/tests/test_gpt.py::test_gpt_read - AttributeError: 'NoneType' object has no at...
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> ========================== 41 passed, 45 skipped, 1 error in 19.29s ===========================
>
> And then ignoring that one with "-k not gpt":
> test/py/tests/test_android/test_ab.py E
>
> =========================================== ERRORS ============================================
> __________________________________ ERROR at setup of test_ab __________________________________
> /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:54: in ab_disk_image
>     ???
> /home/trini/work/u-boot/u-boot/test/py/tests/test_android/test_ab.py:28: in __init__
>     ???
> test/py/u_boot_utils.py:279: in __enter__
>     self.module_filename = module.__file__
> E   AttributeError: 'NoneType' object has no attribute '__file__'
> =================================== short test summary info ===================================
> ERROR test/py/tests/test_android/test_ab.py::test_ab - AttributeError: 'NoneType' object has...
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> ============= 908 passed, 75 skipped, 10 deselected, 1 error in 159.17s (0:02:39) =============

These two are the same error. It looks like inspect is somehow unable to
obtain the module with:

        frame = inspect.stack()[1]
        module = inspect.getmodule(frame[0])

i.e. module is None
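
If it turns out getmodule() simply cannot resolve a module object in that
environment, one possible hardening in u_boot_utils.py would be to fall back
on the frame's own filename (an untested sketch, not a committed fix):

        frame = inspect.stack()[1]
        module = inspect.getmodule(frame[0])
        if module is not None:
            self.module_filename = module.__file__
        else:
            # getmodule() found no importable module for the frame; the
            # frame itself still records which source file it came from
            self.module_filename = frame.filename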

+Stephen Warren, who may know.

What Python version is this?

>
> Now, the funny thing: if I git clean -dfx, I can then get that test to
> pass.  So I guess something else isn't cleaning up / is writing to a
> common area? I intentionally build within the source tree, but in a
> subdirectory of it, and indeed a lot of tests write to the source
> directory itself.

Wow, that really is strange. The logic in that class is pretty clever.
Do you see a message like 'Waiting for generated file timestamp to
increase' at any point?
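
(For context, the idea in that helper is roughly: only regenerate a file when
the test module is newer than the existing artifact, and make sure the fresh
artifact's timestamp really does move past the module's so the next
comparison works. An illustration of the idea, not the actual code:)

    import os
    import time

    def maybe_regenerate(artifact, module_file, generate):
        """Illustration only: mtime-based caching of a generated file."""
        module_mtime = os.path.getmtime(module_file)
        if os.path.exists(artifact) and \
                os.path.getmtime(artifact) > module_mtime:
            return  # cached artifact is still current
        generate(artifact)
        while os.path.getmtime(artifact) <= module_mtime:
            # coarse clock granularity: bump the mtime until it is newer
            # than the module's timestamp
            print('Waiting for generated file timestamp to increase')
            time.sleep(0.1)
            os.utime(artifact)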

BTW these problems don't have anything to do with sandbox, which I
think was your original complaint. The more stuff we bring into tests
(Python included), the harder it gets.

Regards,
Simon

