[U-Boot] [PATCH 4/4] ARM: bcm283x: Switch to generic timer

Fri May 8 18:06:41 CEST 2015

On 05/06/2015 01:51 PM, Tyler Baker wrote:
> On 6 May 2015 at 11:13, Marek Vasut <marex at denx.de> wrote:
>> On Wednesday, May 06, 2015 at 05:52:37 PM, Stephen Warren wrote:
>> [...]
>>>>>> So, if now is close to 0x7fffffff (which it can), then if endtime is
>>>>>> big-ish, diff will become negative and this udelay() will not perform
>>>>>> the correct delay, right ?
>>>>>
>>>>> I don't believe so, no.
>>>>>
>>>>> endtime and now are both unsigned. My (admittedly intuitive rather than
>>>>> well-researched) understanding of C math promotion rules means that
>>>>> "endtime - now" will be calculated as an unsigned value, then converted
>>>>> into a signed value to be stored in the signed diff. As such, I would
>>>>> expect the value of diff to be a small value in this case. I wrote a
>>>>> test program to validate this; endtime = 0x80000002, now = 0x7ffffffe,
>>>>> yields diff=4 as expected.
>>>>>
>>>>> Perhaps you meant a much larger endtime value than 0x80000002; perhaps
>>>>> 0xffffffff? This doesn't cause issues either. All that's relevant is the
>>>>> difference between endtime and now, not their absolute values, and not
>>>>> whether endtime has wrapped but now has or hasn't. For example, endtime
>>>>> = 0x00000002, now = 0xfffffff0 yields diff=18 as expected.
>>>>
>>>> So what if the difference is bigger than 1 << 31 ?
>>>
>>> As I said, I don't believe that case is relevant; it can only happen if
>>> passing ridiculously large delay values into __udelay() (i.e. greater
>>> than the 1<<31value you mention), and I don't believe there's any need
>>> to support that.
>>
>> So what you say is that it's OK to have a function which is buggy in
>> corner cases ?
>>
>>> The implementation in lib/time.c probably has exactly the same problem,
>>> except that since it uses 64-bit math rather than 32-bit math, so the
>>> issue happens at 1<<63 rather than 1<<31. It's probably equally
>>> problematic for delay values as large as 1<<63:-) In practice, given
>>> 1<<31 us is so large, I don't think there's any practical difference.
>>
>> The implementation in lib/time.c uses 32bit usec argument though, so
>> it's not prone to this overflow. Please correct me if I'm wrong.
>>
>>>>>>> Besides, what's passing a value >~36 minutes to udelay()?
>>>>>>
>>>>>> Nothing, but that doesn't mean we can have a possibly broken
>>>>>> implementation, right ?
>>>>>
>>>>> True. However, I'd expect that any specification for udelay would
>>>>> disallow such large parameter values, and hence its behaviour wouldn't
>>>>> be relevant if such values were passed.
>>>>
>>>> Do you think you can pick this patch and drop the "fixes overflow" part
>>>> or do you need resubmission ?
>>>
>>> Tom Rini (or in the past Albert Aribaud) actually apply the patches.
>>>
>>> Re: the patch description: I'd certainly be happy if it was re-written
>>> to say something more like "replace bcm2835-specific timer logic with
>>> common code to reduce the number of different implementations for the
>>> same thing".
>>
>> Tom, do you want a repost ?
>>
>>> I think you'd mentioned on IRC that this change fixed something
>>> USB-related for you, and I still don't understand how that could be
>>> possible. Perhaps there's some intermittent problem, and it just
>>> happened not to show up when you tested after this patch?
>>
>> I think Tyler can elaborate on that, but in his test case, he still
>> triggers the USB issue.
>
> I'll provide some context on the issue I'm fighting...
>
> I recently bought a RPi B+ Model, flashed the latest raspbian image[0]
> to an sd card, built the master branch (v2015.05+) u-boot and
> overwrote kernel.img with u-boot.bin. U-Boot came right up but I was
> unable to obtain a DHCP lease after using 'usb start; dhcp'. I ran
> tcpdump and saw the DHCP requests being made, but shortly after the
> board was seemly ignoring responses from the DHCP server. Now
> sometimes if the response from the DHCP server was quick enough, it
> could get a lease, but the tftp transfer would stall. I decided it
> would be best to set 'ipaddr, gateway, netmask' to see if this was a
> DHCP issue. It turned out that I had no network connectivity even when
> I configured the ip address statically.
>
> Could not obtain a lease:
>
>   starting USB...
> USB0: Core Release: 2.80a
> scanning bus 0 for devices... 3 USB Device(s) found
> scanning usb for storage devices... 0 Storage Device(s) found
> scanning usb for ethernet devices... 1 Ethernet Device(s) found
> Waiting for Ethernet connection... done.
> BOOTP broadcast 1
> BOOTP broadcast 2
> BOOTP broadcast 3
> BOOTP broadcast 4
> BOOTP broadcast 5
> BOOTP broadcast 6
> BOOTP broadcast 7
> BOOTP broadcast 8
> Retry time exceeded; starting again
> Bad Linux ARM zImage magic!
>
> Obtains a lease, but stalls on transfer:
>
>   starting USB...
> USB0: Core Release: 2.80a
> scanning bus 0 for devices... 3 USB Device(s) found
> scanning usb for storage devices... 0 Storage Device(s) found
> scanning usb for ethernet devices... 1 Ethernet Device(s) found
> Waiting for Ethernet connection... done.
> BOOTP broadcast 1
> DHCP client bound to address 192.168.2.55 (1328 ms)
> Waiting for Ethernet connection... done.
> Using sms0 device
> TFTP from server 192.168.2.2; our IP address is 192.168.2.55
> Filename 'tmp2rkX_N/.zImage'.
> Load address: 0x1000000
> Loading: * T T T T T T T T T T
> Retry count exceeded; starting again
> Bad Linux ARM zImage magic!
>
> At this point I reached out for help on IRC and that is when Marek and
> I starting chatting about this. Hacking around, I found that using
> v2015.01 I was able to obtain a lease and transfer files 100% of the
> time (although it seems very slow). Thinking that we have a good and
> bad commit, this would be easy to bisect. Wrong, we both tried and got
> different results. It seems that as you move from v2015.01 -> HEAD
> master this issue becomes very intermittent and thus hard to pin down.
>
> So my test case for this issue has become...
>
> * Obtain lease
> * Transfer kernel, dtb, ramdisk without stalling/timing out
> * Do this 10 times in a row with a power cycle in between
>
> Hope this help clarify the situation in some way,

OK, but if you apply Marek's change to replace the timer implementation 
with the one in lib/time.c, does that reliably fix the issue when tested 
over a large number of runs with/without that change? With the current 
explanation, I can't see how it possibly could. Equally, I can't see why 
the move from v2015.01 to HEAD would affect the issue if it was caused 
by the timer implementation.

My suspicion is that there's something else entirely going on.