[PATCH 1/2] gitlab: Move the n900 test into its own section

Sun Jan 31 18:31:04 CET 2021

On Sunday 31 January 2021 10:10:56 Simon Glass wrote:
> Hi Pali,
> 
> On Sun, 31 Jan 2021 at 10:05, Pali Rohár <pali at kernel.org> wrote:
> >
> > On Sunday 31 January 2021 09:51:44 Simon Glass wrote:
> > > Hi Pali,
> > >
> > > On Sun, 31 Jan 2021 at 08:52, Pali Rohár <pali at kernel.org> wrote:
> > > >
> > > > On Sunday 31 January 2021 08:43:19 Simon Glass wrote:
> > > > > Hi Pali,
> > > > >
> > > > > On Sun, 31 Jan 2021 at 08:04, Pali Rohár <pali at kernel.org> wrote:
> > > > > >
> > > > > > On Sunday 31 January 2021 08:49:20 Tom Rini wrote:
> > > > > > > On Sun, Jan 31, 2021 at 01:15:20PM +0100, Pali Rohár wrote:
> > > > > > > > On Saturday 30 January 2021 22:17:45 Simon Glass wrote:
> > > > > > > > > This test is not reliable. Quite often (20%?) it makes the build fail and
> > > > > > > > > a retry succeeds.
> > > > > > > >
> > > > > > > > This test should work. Are there any logs with issues?
> > > > > > >
> > > > > > > I don't see it failing any more often than other tests do, due to
> > > > > > > network connectivity issues.  That may be helped by, now that we've
> > > > > > > dropped Travis, having the container be pre-populated with more of the
> > > > > > > downloaded files and pre-building the special QEMU.
> > > > > >
> > > > > > If these are just network problems, then pre-downloading the
> > > > > > required files into the cache / container should resolve them.
> > > > >
> > > > > The flake issues I see are like this:
> > > > >
> > > > > https://gitlab.denx.de/u-boot/custodians/u-boot-dm/-/jobs/202441
> > > > >
> > > > > I am not sure of the cause, but it would be good to fix it!
> > > >
> > > > Hello Simon! This is not a network problem but rather some U-Boot
> > > > regression in the mmc code. The second test failed with the error:
> > > >
> > > >     "Failed to boot kernel from eMMC"
> > > >
> > > > Other tests succeed:
> > > >
> > > >     "Kernel was successfully booted from RAM"
> > > >     "Kernel was successfully booted from OneNAND"
> > > >
> > > > So the problem is really with the second boot attempt, from eMMC.
> > > > The U-Boot log is also available in the output (as the second run):
> > > >
> > > >     Check if pads/pull-ups of bus are properly configured
> > > >     Timed out in wait_for_event: status=0000
> > > >     ...
> > > >     Timed out in wait_for_event: status=0000
> > > >     Check if pads/pull-ups of bus are properly configured
> > > >     Timed out in wait_for_event: status=0000
> > > >     Check if pads/pull-ups of bus are properly configured
> > > >     Timed out in wait_for_event: status=0000
> > > >     Check if pads/pull-ups of bus are properly configured
> > > >     test/nokia_rx51_test.sh: line 233:  5946 Killed                  ./qemu-system-arm -M n900 -mtdblock mtd_emmc.img -sd emmc_emmc.img -serial /dev/stdout -display none > qemu_emmc.log
> > > >
> > > > After 300s qemu was killed and the test was marked as a failure.
> > > >
> > > > So this is a valid failure and a regression in the u-boot emmc code.
> > > > It would be necessary to identify which commit caused it and revert
> > > > it...
> > >
> > > The problem is that it is intermittent. Can you repeat it?
> >
> > So when you run this test several times from the same sources / git
> > commit, this error appears only sometimes?
> 
> Perhaps 1 time in 5 or 10? Every time I click 'retry' in gitlab it
> tries again and passes.

It would be interesting to know whether the problem is with the compiled
binary (and rebuilding fixes it) or in the qemu runtime part (the same
compiled binary sometimes passes and sometimes fails).

But as I have not seen this issue, I do not know what is happening here.

> >
> > This particular issue I have not seen in qemu yet when I run tests on my
> > local machine. So I cannot reproduce it.
> >
> > I saw similar errors, but only on a real device (not in qemu) and
> > they were always visible (not just sometimes). And for all the problems
> > I know of I have sent patches to the mailing list, including i2c, mmc
> > and usb. Some of them are still waiting for review & merge...
> 
> So perhaps it has been fixed, but not yet merged?

Yea, this is possible.

> >
> > ===
> >
> > I know of only one error which is not fixed yet and happens "only
> > sometimes", and which I have not been able to debug yet. Probably if
> > the u-boot binary has a particular size then it crashes completely
> > (and with the same binary it can be reproduced on every run). But
> > recompiling the u-boot binary resolves the issue, sometimes even
> > without modifying the source code. So I suspect that the time&date
> > string (which changes with every recompilation) must have some effect
> > (maybe some +-1 padding?). Adding 100 new random characters to the
> > env variables seems to fix it.
> 
> That's not good.
> 
> Re the analysis, that seems a bit of a stretch. While the time/date
> changes, its length doesn't normally change.
> 
> Uninited values can have any behaviour. I assume this is in U-Boot
> proper, not SPL? You could check that BSS variables are not used
> before relocation, perhaps?

This is the U-Boot binary. N900 does not use SPL at all. The U-Boot
binary is loaded directly into RAM and executed by the (proprietary)
Nokia loader, and it does almost all HW initialization.

And it is even more strange. If a build produces a binary which does not
work on a real device, it always crashes on real devices. But the same
binary works fine in qemu (so there is no way to debug it). And if I
start qemu in debug mode, ready for attaching gdb to look at this issue,
it somehow disappears... Total heisenbug. I have no idea whether the bug
is in the u-boot code or in gcc (because recompiling with a different gcc
version and different flags also hides it)...

I caught this issue in qemu with gdb attached only once. This is what was
on my terminal screen; I do not have anything more. U-Boot crashed with a
division by zero error because htab->size was zero:

(gdb) bt
#0  __aeabi_uidivmod () at arch/arm/lib/lib1funcs.S:325
#1  0x8002f054 in hsearch_r (item=..., action=ENV_FIND, retval=0x8fd12cec, htab=0x80041348, flag=0) at lib/hashtable.c:313
#2  0x80011f68 in env_get (name=0x8fd19830 "switchmmc") at cmd/nvedit.c:677
#3  env_get (name=0x8fd19830 "switchmmc") at cmd/nvedit.c:668
#4  0x800187e0 in do_run (cmdtp=<optimized out>, flag=<optimized out>, argc=2, argv=<optimized out>) at common/cli.c:142
#5  0x8fe042cc in ?? ()
#6  0x8fe042cc in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) up
#1  0x8002f054 in hsearch_r (item=..., action=ENV_FIND, retval=0x8fd12cec, htab=0x80041348, flag=0) at lib/hashtable.c:313
313             hval %= htab->size;
(gdb) print htab
$6 = (struct hsearch_data *) 0x80041348
(gdb) print *htab
$7 = {table = 0x0, size = 0, filled = 0, change_ok = 0x8002ab20 <env_flags_validate>}
(gdb) info registers
r0             0x43eab0e3       1139454179
r1             0x0      0
r2             0x73     115
r3             0x8fd19830       -1882089424
r4             0x80041348       -2147216568
r5             0x8fd19830       -1882089424
r6             0x2      2
r7             0x0      0
r8             0x8fd19830       -1882089424
r9             0x8fd12ee0       -1882116384
r10            0x0      0
r11            0x8fe0419c       -1881128548
r12            0x8fd12cb8       -1882116936
sp             0x8fd12ca0       0x8fd12ca0
lr             0x8002f054       -2147291052
pc             0x8002f054       0x8002f054 <hsearch_r+56>
cpsr           0x600001d3       1610613203

	/*
	 * First hash function:
	 * simply take the modul but prevent zero.
	 */
	hval %= htab->size;
	if (hval == 0)
		++hval;
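
For illustration only, here is a minimal sketch of how such a state
(table == NULL, size == 0) could be caught before the division. The
struct and function names below are hypothetical and this is not the code
in lib/hashtable.c; it only shows the guard. Such a check would turn the
crash into an error return, but it would not explain why the table is
empty at that point.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for struct hsearch_data, reduced to the
 * fields visible in the gdb dump above (change_ok omitted).
 */
struct htab_sketch {
	void *table;
	unsigned int size;
	unsigned int filled;
};

/*
 * Compute the first hash index the same way as the snippet above,
 * but return -EINVAL instead of dividing when the table was never
 * created or has been zeroed (table == NULL or size == 0).
 * On success, store the index in *idx and return 0.
 */
static int first_hash_checked(const struct htab_sketch *htab,
			      unsigned int hval, unsigned int *idx)
{
	if (!htab || !htab->table || htab->size == 0)
		return -EINVAL;

	hval %= htab->size;
	if (hval == 0)
		++hval;

	*idx = hval;
	return 0;
}
```

With the values from the dump above (table = 0x0, size = 0) this sketch
returns -EINVAL and leaves *idx untouched instead of ending up in
__aeabi_uidivmod with a zero divisor.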

I spent more time on it and was not able to debug it further. And now I
do not have time to look at it again. For me this issue does not make
sense at all. And because a workaround exists (recompile the binary,
possibly padding it with a dummy env variable) I stopped the
investigation.

But I think you must be seeing something different, as in my case this
issue causes a U-Boot crash prior to starting bootmenu and the boot
procedure...

> >
> > > >
> > > > > Re the network issues, I have a persistent DNS problem with my
> > > > > network. I am really not sure of the root cause but sometimes it will
> > > > > fail to find a host, then succeed 5 seconds later. I spent some time
> > > > > on it a few weeks ago but will try again.
> Regards,
> Simon