[PATCH] CI: Add automatic retry for test.py jobs

Thu Jul 13 23:03:57 CEST 2023

Hi Tom,

On Wed, 12 Jul 2023 at 14:38, Tom Rini <trini at konsulko.com> wrote:
>
> On Wed, Jul 12, 2023 at 02:32:18PM -0600, Simon Glass wrote:
> > Hi Tom,
> >
> > On Wed, 12 Jul 2023 at 11:09, Tom Rini <trini at konsulko.com> wrote:
> > >
> > > On Wed, Jul 12, 2023 at 08:00:23AM -0600, Simon Glass wrote:
> > > > Hi Tom,
> > > >
> > > > On Tue, 11 Jul 2023 at 20:33, Tom Rini <trini at konsulko.com> wrote:
> > > > >
> > > > > It is not uncommon for some of the QEMU-based jobs to fail not because
> > > > > of a code issue but rather because of a timing issue or similar problem
> > > > > that is out of our control. Make use of the keywords that Azure and
> > > > > GitLab provide so that we will automatically re-run these up to 2 times
> > > > > when they fail. If they still fail after that, it is likely we have found
> > > > > a real issue to investigate.
> > > > >
> > > > > Signed-off-by: Tom Rini <trini at konsulko.com>
> > > > > ---
> > > > >  .azure-pipelines.yml | 1 +
> > > > >  .gitlab-ci.yml       | 1 +
> > > > >  2 files changed, 2 insertions(+)
> > > >
> > > > This seems like a slippery slope. Do we know why things fail? I wonder
> > > > if we should disable the tests / builders instead, until it can be
> > > > corrected?
> > >
> > > It happens in Azure, so it's not just the broken runner problem we have
> > > in GitLab. And the problem is timing, as I said in the commit.
> > > Sometimes we still get the RTC test failing. Other times we don't get
> > > QEMU + U-Boot spawned in time (most often m68k, but sometimes x86).
> >
> > How do we keep this list from growing?
>
> Do we need to? The problem, in essence, is that we rely on free
> resources, so sometimes some heavy lifts take longer.  That's what this
> flag is for.

I'm fairly sure the RTC thing could be made deterministic.

The spawning thing...is there a timeout for that? What actually fails?
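
To make the question concrete, here is a rough sketch of what I mean by a
spawn timeout plus a bounded retry that records what failed each time. It
is only an illustration: SpawnTimeout, wait_until_ready() and
run_with_retries() are hypothetical names, not the actual test.py code, and
the default of 2 retries is just taken from the commit message.

```python
import time


class SpawnTimeout(Exception):
    """Raised when the target does not become ready before the deadline."""


def wait_until_ready(is_ready, timeout_s, poll_s=0.1,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    clock and sleep are injectable so the wait can be tested without
    real delays (hypothetical helper, not part of test.py).
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return
        if clock() >= deadline:
            raise SpawnTimeout(f"not ready after {timeout_s}s")
        sleep(poll_s)


def run_with_retries(spawn, retries=2):
    """Call spawn() once, then up to `retries` more times on SpawnTimeout.

    Returns (result, failures) so the caller can log what failed on the
    attempts that did not succeed. Other exceptions are not retried.
    """
    failures = []
    for attempt in range(retries + 1):
        try:
            return spawn(), failures
        except SpawnTimeout as exc:
            failures.append(f"attempt {attempt + 1}: {exc}")
    # Every attempt timed out: report all of them together.
    raise SpawnTimeout("; ".join(failures))
```

Something like this would at least keep a record of which step timed out
on each attempt, instead of the retry silently hiding it.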

>
> > > > I'll note that we don't have this problem with sandbox tests.
> > >
> > > OK, but that's not relevant?
> >
> > It is relevant to the discussion about using QEMU instead of sandbox,
> > e.g. with the TPM. I recall a discussion with Ilias a while back.
>
> I'm sure we could make sandbox take too long to start as well, if enough
> other things are going on with the system.  And sandbox has its own set
> of super frustrating issues instead, so I don't think this is a great
> argument to have right here (I have to run it in docker to get around
> some application version requirements, and I have to exclude the
> event_dump, bootmgr, abootimg and gpt tests, which could otherwise run
> but fail for me).

I haven't heard about this before. Is there anything that could be done?

Regards,

Simon