[PATCH 12/20] cyclic: Raise default CYCLIC_MAX_CPU_TIME_US to 5000

Tom Rini trini at konsulko.com
Wed Jun 19 17:20:42 CEST 2024


On Wed, Jun 19, 2024 at 10:21:51AM +0200, Rasmus Villemoes wrote:
> On 18/06/2024 23.03, Tim Harvey wrote:
> > On Tue, Jun 18, 2024 at 7:32 AM Tom Rini <trini at konsulko.com> wrote:
> >>
> 
> > Stefan and Tom,
> > 
> > I'm seeing CI issues here even with 5000us [1]:
> > host bind 0 /tmp/sandbox/persistent-cyclic function wdt-gpio-level
> > took too long: 5368us vs 5000us max
> 
> Yes, 5ms is way too little when you're not the only thing running on the
> CPU, which is why I went with 100ms.
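
For context, that warning comes from the cyclic framework timing each
callback invocation with the wall-clock microsecond timer, so host load
counts against the limit. Roughly like this (an illustrative sketch, not
verbatim from common/cyclic.c; names are approximate):

    #include <stdint.h>
    #include <stdio.h>

    #define CONFIG_CYCLIC_MAX_CPU_TIME_US 5000

    uint64_t timer_get_us(void);    /* stand-in for U-Boot's timer */

    static void run_one_cyclic(const char *name, void (*func)(void))
    {
            uint64_t start = timer_get_us();
            uint64_t cpu_time;

            func();
            cpu_time = timer_get_us() - start;

            /* The check that produces the CI failures quoted above;
             * despite the name, this measures elapsed wall-clock time. */
            if (cpu_time > CONFIG_CYCLIC_MAX_CPU_TIME_US)
                    printf("%s took too long: %lluus vs %dus max\n",
                           name, (unsigned long long)cpu_time,
                           CONFIG_CYCLIC_MAX_CPU_TIME_US);
    }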
> 
> Random thoughts and questions:
> 
> (1) Do we have any way to programmatically grab all the logs from Azure
> CI, so we can get some kind of objective statistics on the numbers after
> "took too long:"? Clicking through the web interface and randomly
> searching is too painful.
>
> It would also be helpful to know what percentage of CI runs have failed
> due to that, versus due to some genuine error.

I don't think we can easily grab logs via the API. And at least for
https://dev.azure.com/u-boot/u-boot/_build?definitionId=2&_a=summary a
lot of the failures have already rolled off, because we only get to keep
so many logs and I try to mark full release logs as keep-forever. But
anecdotally I can say 75%+ of the Azure runs fail at least once due to
this, and most fail often enough to fail the pipeline (we get 2 tries
per job).

> (2) I considered a patch that just added a
> 
>   default $something big if SANDBOX
> 
> to config CYCLIC_MAX_CPU_TIME_US, but since the problem also hit QEMU, I
> dropped that. But if my patch is too ugly (and I might tend to think
> that myself...), perhaps at least this would be an improvement over the
> generic bump to 5000us.

I was fine with your approach, really; maybe with a slightly bigger
comment, and a note under doc/ as well explaining why we do it?
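
E.g. something like this (a sketch; the prompt text and non-sandbox
default are approximations, and the 100000 for sandbox is just a
placeholder value, not something settled in this thread):

    config CYCLIC_MAX_CPU_TIME_US
            int "Max allowed time for a cyclic function in us"
            depends on CYCLIC
            default 100000 if SANDBOX
            default 5000
            help
              Sandbox and QEMU runs share the host CPU with the rest of
              the CI job, so the wall-clock time spent in a single
              callback can far exceed what the same code would take on
              real hardware. Hence the much larger sandbox default.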

> (3) I also thought that perhaps for sandbox, we should simply measure
> the time using clock_gettime(CLOCK_PROCESS_CPUTIME_ID) instead of
> wall-clock time. But it's a little ugly to implement, since the "now"
> variable is used both to decide if it's time to run the callback and as
> a starting point for measuring CPU time, and we probably still want the
> "is it time" check to be measured on wall-clock time and not on however
> much CPU time the U-Boot process has been given. Or maybe we don't, and
> CLOCK_PROCESS_CPUTIME_ID would simply be a better backend for
> os_get_nsec(). Sure, time in the sandbox would progress more slowly than
> on the host, but does that actually matter?
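
For reference, that swap would be small; sandbox's os_get_nsec() in
arch/sandbox/cpu/os.c is a thin wrapper around clock_gettime(), so the
CPU-time variant would look something like this (untested sketch):

    #include <stdint.h>
    #include <time.h>

    /* Sketch: report process CPU time instead of wall-clock time.
     * The current implementation passes CLOCK_MONOTONIC here. */
    uint64_t os_get_nsec(void)
    {
            struct timespec tp;

            clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp);
            return tp.tv_sec * 1000000000ULL + tp.tv_nsec;
    }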
> 
> (4) Btw., what kind of clock tick do we even get when running under
> QEMU? I don't have much experience with QEMU, but from quick googling it
> seems that -icount would be interesting. Also see
> https://github.com/zephyrproject-rtos/zephyr/issues/14173 . From a quick
> read it seems there were some issues back in 2019, but that today it
> mostly works for them, except for some SMP issues (which are certainly
> not relevant to U-Boot).

That could be interesting too, yeah. I do worry it might open its own
set of problems to figure out, though.
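
If someone wants to experiment, an illustrative invocation (the machine
and image arguments are placeholders, not a tested configuration) would
be:

    qemu-system-arm -machine virt -nographic \
            -icount shift=auto \
            -bios u-boot.bin

With -icount the guest clock advances per executed instruction rather
than with host wall-clock time, which would make the timing check
independent of CI host load.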

-- 
Tom