[U-Boot] [RFC PATCH] lib/timer: initialize timebase_l/timebase_h

Andre Przywara andre.przywara at arm.com
Fri Nov 4 11:33:40 CET 2016


Hi,

On 26/10/16 08:14, Alexander Graf wrote:
> On 10/26/2016 02:07 AM, André Przywara wrote:
>> On 25/10/16 08:52, Alexander Graf wrote:
>>
>> Hi Alex,
>>
>> thanks for looking at this!
>>
>>> On 25/10/2016 02:51, Andre Przywara wrote:
>>>> On systems using the generic timer routines defined in lib/time.c we
>>>> use timebase_l and timebase_h fields from the gd to detect wraparounds
>>>> in our tick counter. The tick calculcation algorithm silently assumes
>>>> that a long is only 32 bits, which leads to wrong results when
>>>> timebase_h
>>>> is not 0 on 64-bit systems.
>>>> As one possible fix lets initialize timebase_h (and timebase_l) to
>>>> 0, so
>>>> on 64-bit systems timebase_h will never(TM) be bigger than 0 and thus
>>>> cannot spoil timebase_l by being ORed into it.
>>>>
>>>> This fixes occasional timeout issues on the Pine64 (and possibly other
>>>> ARM64 systems).
>> Well, not really (this fix isn't sufficient), but read on ...
>>
>>>> Signed-off-by: Andre Przywara <andre.przywara at arm.com>
>>>> ---
>>>> Hi,
>>>>
>>>> I am bit puzzled what the proper fix is, this one is the easiest I
>>>> could
>>>> come up with. Not sure if the gd should be zeroed normally (and it's
>>>> just
>>>> broken on sunxi/arm64 because of some linker issues) or whether we
>>>> really
>>>> forgot to initialize those fields and just got away with it.
>>> The gd clearing happens via crt0_64.S -> board_init_f_init_reserve().
>>> There we should have fully cleared all memory associated with global
>>> data.
>> Ah,thanks for pointing to that. I was a bit clueless where to start
>> looking for it - "git grep gd" is obviously not a good idea ;-)
>>
>>> I can't see anything obviously wrong in that code. Could you try to dump
>>> gd if the timer offsets are != 0 on init? Maybe we can conclude
>>> something from the data in it.
>> So I agree that this code looks sane and indeed in all my dumps it looks
>> like the initialization works fine.
>> I did some more debugging and learned that the increased timebase_h
>> comes from detected overflows: in fact some readings are really lower
>> than previous ones:
>> ...
>> MMC:   SUNXI SD/MMC: 0
>> get_ticks() overflow: now: 118046720, tbl: 118063103, tbh: 0
>> get_ticks() overflow: now: 118439936, tbl: 118456318, tbh: 1
>> get_ticks() overflow: now: 118571008, tbl: 118587391, tbh: 2
>> get_ticks() overflow: now: 118734848, tbl: 118751230, tbh: 3
>> get_ticks() overflow: now: 119422976, tbl: 119439358, tbh: 4
>> get_ticks() overflow: now: 119783424, tbl: 119799807, tbh: 5
>> get_ticks() overflow: now: 120045568, tbl: 120061950, tbh: 6
>> get_ticks() overflow: now: 120406016, tbl: 120422398, tbh: 7
>> *** Warning - bad CRC, using default environment
>> ......
>> Not sure how this actually happens - I am not aware of any such severe
>> hardware errata in the A53 r0p4 or the timer, also I think that would
>> have bitten us elsewhere already.
>> As ATF keeps the secondaries in WFI, U-Boot only runs on CPU0 (confirmed
>> by MPIDR reads).
>> Also U-Boot reads the physical counter, so a bogus CNTVOFF can also not
>> be the culprit.
>>
>> So I can fix this annoying issue by using one of the other proposed
>> fixes (handling timebase_h only if BITS_PER_LONG < 64 or defining
>> get_ticks in armv8/generic_timer.c), but it would still be interesting
>> to find the real root cause.
> 
> Can you try and ask around? If this bites us in U-Boot, there's a good
> chance Linux timers should be broken too, no?
> 
> I remember that NXP had similar problems with the timer:
> 
>   https://patchwork.kernel.org/patch/9344727/

So I did some more experiments, and indeed it looks very much like a
silicon bug in the A64. Philipp found this in the BSP kernel:
https://github.com/longsleep/linux-pine64/blob/pine64-hacks-1.2/drivers/clocksource/arm_arch_timer.c#L231

So I have a direct test that reads both the CNTVCNT register and
CLOCK_MONOTONIC_RAW back to back for a few million times and looks for
decreasing values. On the Freescale box it triggers on every boot, on
A64 only on certain boots.

But the error pattern looks similar (though not identical), on the A64
it is:
CNTVCNT before: xxxx7fff, CNTVCNT after: xxxx4000
CNTVCNT before: xxxxffff, CNTVCNT after: xxxxc000
On the Freescale box I see different lengths of bogus lower bits, on the
A64 it's always the lower 15 bits.

These are at least the cases that I could spot (because the later read
returns a smaller value), there may be more corruptions that are harder
to find.

Setting the new Freescale erratum DT property in the arch timer node
fixes at least the Linux interface on the A64, as expected.
But as stated above this doesn't happen on every boot, so there might be
the chance that it's a missing initialization somewhere.

Since clock_gettime() gets disabled in the VDSO with the workaround, I'd
like to do more research to get an idea why it sometimes works and
sometimes not.

But unless we find another fix, I'll send a patch to enable the FSL
workaround (both in U-Boot and Linux).

Cheers,
Andre.


More information about the U-Boot mailing list