an off-by-one error in dm_test_rtc_set_get()?

Thu Oct 28 09:22:30 CEST 2021

On 27/10/2021 15.22, Tom Rini wrote:
> On Wed, Oct 27, 2021 at 12:43:38PM +0800, Bin Meng wrote:
>> Hi Simon,
>>
>> gitlab reported the following test error below:
>>
>> =================================== FAILURES ===================================
>> __________________________ test_ut[ut_dm_rtc_set_get] __________________________
>> test/py/tests/test_ut.py:43: in test_ut
>> assert output.endswith('Failures: 0')
>> E AssertionError: assert False
>> E + where False = <built-in method endswith of str object at
>> 0x7f3bb792dcb0>('Failures: 0')
>> E + where <built-in method endswith of str object at 0x7f3bb792dcb0> =
>> 'Test: dm_test_rtc_set_get: rtc.c\r\r\nexpected: 27/10/2021
>> 03:38:15\r\r\nactual: 27/10/2021 03:38:14\r\r\ntest/dm/rtc...w, &cmp,
>> 1): Expected 0x0 (0), got 0xffffffea (-22)\r\r\nTest:
>> dm_test_rtc_set_get: rtc.c (flat tree)\r\r\nFailures: 1'.endswith
>> ----------------------------- Captured stdout call -----------------------------
>> =>
>>
>> See https://source.denx.de/u-boot/custodians/u-boot-x86/-/jobs/341905
>>
>> But the same branch same commit, azure test results passed:
>> https://dev.azure.com/bmeng/GitHub/_build/results?buildId=460&view=results
>>
>> It looks like the error is an off-by-one where actual time is 1 second
>> behind the expected time?
>>
>> expected: 27/10/2021 03:38:15
>> actual: 27/10/2021 03:38:14
>>
>> Is this a known issue?
> 
> Yes, which is why the test checks for a certain amount of "fuzz" around
> the return value. 

You said the same thing about dm_test_rtc_reset() in
https://lore.kernel.org/u-boot/20210831124441.GC858@bill-the-cat/ , but
I can't find anything about any fuzz in the code. Could you point out
where you think that's implemented? In both cases, the expected and
actual values were just 1 from each other, and I fail to see how any
fuzz value could be smaller than that.

 I've wondered about if we need to increase that value
> slightly sometimes, or just live with hitting the re-run failed jobs
> button on whatever CI system was a bit too slow sometimes.

It has nothing to do with a CI being slow, it's plain and simple buggy
test code AFAICT. It's essentially "assert(time(NULL) == time(NULL))".
If a call to time() takes 1us, do this a million times and it will on
average fail once. Obviously, a loaded system increases the chance of
being preempted between the two calls and hence effectively increases
the delta and proportionally the probability of hitting this.

Rasmus