Broken watchdog in u-boot master branch

Pali Rohár pali at kernel.org
Mon Oct 10 21:33:36 CEST 2022


On Monday 10 October 2022 15:24:13 Tom Rini wrote:
> On Mon, Oct 10, 2022 at 02:14:25PM -0400, Tom Rini wrote:
> > On Mon, Oct 10, 2022 at 08:01:23PM +0200, Pali Rohár wrote:
> > > On Monday 10 October 2022 13:56:10 Tom Rini wrote:
> > > > On Mon, Oct 10, 2022 at 07:44:05PM +0200, Pali Rohár wrote:
> > > > > On Monday 10 October 2022 13:40:38 Tom Rini wrote:
> > > > > > On Mon, Oct 10, 2022 at 07:22:56PM +0200, Pali Rohár wrote:
> > > > > > > On Monday 10 October 2022 12:28:18 Tom Rini wrote:
> > > > > > > > On Sun, Oct 09, 2022 at 09:12:25PM +0200, Pali Rohár wrote:
> > > > > > > > > > Hello! The watchdog code seems to be broken in the u-boot master
> > > > > > > > > > branch. On Nokia N900 I'm getting the following message in qemu:
> > > > > > > > > 
> > > > > > > > > cyclic function rx51_watchdog took too long: 10000us vs 1000us max, disabling
> > > > > > > > > 
> > > > > > > > > > It seems that the watchdog core code is not prepared for "slower"
> > > > > > > > > > watchdogs which communicate over a slower i2c bus, as is the case
> > > > > > > > > > for N900.
> > > > > > > > > > 
> > > > > > > > > > Disabling a slower watchdog is a bad idea, as it would result in a
> > > > > > > > > > reboot loop instead of slower, but working, code.
> > > > > > > > 
> > > > > > > > So, looking at this in more detail, we have
> > > > > > > > CONFIG_CYCLIC_MAX_CPU_TIME_US as a configuration option (which is where
> > > > > > > > the too long comes from). And picking a random CI run:
> > > > > > > > https://source.denx.de/u-boot/u-boot/-/jobs/511177
> > > > > > > > > I do see we hit this in CI once, but not every time QEMU runs here.
> > > > > > > > > Is it enough that the max time is configurable to satisfy your
> > > > > > > > > concerns here?
> > > > > > > 
> > > > > > > > We need to investigate how to _properly_ fix this issue, not just
> > > > > > > > work around it. Other boards may be affected too.
> > > > > > 
> > > > > > So it's the cyclic watchdog code, which we merged as early as possible
> > > > > > that's the reason here. And it was merged as early as we could to see if
> > > > > > there's problems. Are there problems? We're seeing "system too slow,
> > > > > > disabling" on QEMU, sometimes, and the value of too slow is
> > > > > > configurable. I know you reported other problems with n900 HW, so we
> > > > > > can't see if it's failing there
> > > > > 
> > > > > > I tested it with the older asm code (as described in that other
> > > > > > email, via git checkout commit -- file) on n900 HW and the watchdog
> > > > > > problem is there too. The phone reboots in about 20 seconds. But as I
> > > > > > do not have a serial console, I do not know if that "disabling"
> > > > > > message is printed there too (but I guess it is).
> > > > 
> > > > I think I'm a bit baffled at this point, honestly. The watchdog timeout
> > > > is 60 seconds. If you're confident in it being about 20 seconds,
> > > > consistently, changing WATCHDOG_TIMEOUT_MSECS to say 10000 (so, 10
> > > > seconds) should let you see if U-Boot has configured the watchdog and
> > > > it's being tripped, or if it's still at the prior stage value.
> > > 
> > > $ git grep CONFIG_WATCHDOG_TIMEOUT_MSECS configs/nokia_rx51_defconfig
> > > configs/nokia_rx51_defconfig:CONFIG_WATCHDOG_TIMEOUT_MSECS=31000
> > > 
> > > > Also, the watchdog is started by NOLO (which loads and executes
> > > > U-Boot), so there can be some smaller timeout.
> > > > 
> > > > So I have a feeling that the real HW has the same issue: the cyclic
> > > > code disabled watchdog kicking and then the watchdog restarted the phone.
> > > 
> > > I do not remember exact time (if it is 20s or 25s; I have not measured
> > > it precisely), but it sounds plausible.
> > 
> > OK, so what happens if you increase CONFIG_CYCLIC_MAX_CPU_TIME_US to
> > something very high (so we should still enable the watchdog and
> > configure the timeout) along with CONFIG_WATCHDOG_TIMEOUT_MSECS being
> > high too (so if we can't service it in time really it's so long as to be
> > noticeable) ? Or CONFIG_WATCHDOG_TIMEOUT_MSECS to something much lower
> > (so that if the device is resetting quicker we're crashing elsewhere) ?
> 
> OK, on my beagleboard xM with a small change:
> diff --git a/drivers/watchdog/omap_wdt.c b/drivers/watchdog/omap_wdt.c
> index ca2bc7cfb59e..f0e57b4f7286 100644
> --- a/drivers/watchdog/omap_wdt.c
> +++ b/drivers/watchdog/omap_wdt.c
> @@ -39,7 +39,7 @@
>  #include <common.h>
>  #include <log.h>
>  #include <watchdog.h>
> -#include <asm/arch/hardware.h>
> +#include <asm/ti-common/omap_wdt.h>
>  #include <asm/io.h>
>  #include <asm/processor.h>
>  #include <asm/arch/cpu.h>
> 
> On my beagleboard xM I now see:
> U-Boot SPL 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400)
> Trying to boot from MMC1
> 
> 
> U-Boot 2022.10-00459-g73e741b8ee46-dirty (Oct 10 2022 - 15:18:38 -0400)
> 
> OMAP3630/3730-GP ES1.1, CPU-OPP2, L3-200MHz, Max CPU Clock 800 MHz
> Model: TI OMAP3 BeagleBoard
> OMAP3 Beagle board + LPDDR/NAND
> I2C:   ready
> DRAM:  256 MiB
> Core:  45 devices, 19 uclasses, devicetree: separate
> WDT:   Started wdt at 48314000 without servicing  (60s timeout)
> NAND:  0 MiB
> MMC:   OMAP SD/MMC: 0
> Loading Environment from NAND... *** Warning - readenv() failed, using default environment
> 
> Beagle xM Rev A/B
> No EEPROM on expansion board
> OMAP die ID: 6e5e00211ff00000015739eb08031024
> Net:   No ethernet found.
> Hit any key to stop autoboot:  0
> 
> So, this is as close as I can get to testing on n900 HW, and it's fine
> here.
> 
> -- 
> Tom

This is the OMAP watchdog, which has its registers directly on the
processor.

The issue with N900 is that it uses a watchdog connected via i2c (not
that OMAP watchdog), and there is some issue in new u-boot where i2c does
not work at high speeds. So i2c is now configured at a lower speed (at
which it works correctly). The u-boot i2c driver has more udelay() calls
during an i2c transfer, and I think this is the reason why the n900
watchdog kick takes much more time... (just guessing). If the watchdog
registers are directly mapped to MMIO (and are even part of the
processor), then any ldr or str instruction is executed immediately and
not with ms delay timeouts.
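
To illustrate the suspected mechanism, here is a minimal sketch in C. It
is a simplified model, not U-Boot's actual cyclic code: struct
cyclic_model, cyclic_model_run() and MAX_CPU_TIME_US are hypothetical
names, and the per-call cost is simulated rather than measured. Only the
1000us limit, the 10000us duration and the message format are taken from
the log line quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for CONFIG_CYCLIC_MAX_CPU_TIME_US (value from the log). */
#define MAX_CPU_TIME_US 1000u

/* Simplified model of one registered cyclic function (not U-Boot's struct). */
struct cyclic_model {
	const char *name;
	uint64_t cost_us;     /* simulated duration of one call */
	bool enabled;         /* cleared once the budget is exceeded */
	unsigned int run_cnt; /* number of times the function actually ran */
};

/*
 * Run the function once and compare its (simulated) duration with the
 * budget. Returns true if the function is still enabled afterwards.
 */
bool cyclic_model_run(struct cyclic_model *c, uint64_t max_us)
{
	assert(c != NULL);

	if (!c->enabled)
		return false; /* a disabled function is never called again */

	c->run_cnt++;

	if (c->cost_us > max_us) {
		printf("cyclic function %s took too long: %lluus vs %lluus max, disabling\n",
		       c->name,
		       (unsigned long long)c->cost_us,
		       (unsigned long long)max_us);
		c->enabled = false;
	}

	return c->enabled;
}
```

In this model a watchdog kick that costs a few microseconds (MMIO) stays
enabled, while one that costs 10000us (slow i2c) runs exactly once and is
then never called again, so the hardware watchdog would eventually expire.
Calling cyclic_model_run() with a larger max_us keeps the slow one enabled.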

It was easier to lower the i2c speed than to debug it, as nobody knows
why i2c transfers started failing at higher speeds since some u-boot
version.

I will try to play with those config options later to see what happens.

