[U-Boot] driver model is not smp safe

Sat Aug 8 02:49:50 CEST 2015

Hi Simon,

On Sat, Aug 8, 2015 at 3:09 AM, Simon Glass <sjg at chromium.org> wrote:
> Hi Bin,
>
> On 5 August 2015 at 02:43, Bin Meng <bmeng.cn at gmail.com> wrote:
>> Hi Simon, Tom,
>>
>> On Tue, Aug 4, 2015 at 3:27 AM, Simon Glass <sjg at chromium.org> wrote:
>>> Hi Tom,
>>>
>>> On 3 August 2015 at 13:06, Tom Rini <trini at konsulko.com> wrote:
>>>> On Mon, Aug 03, 2015 at 12:52:19PM -0600, Simon Glass wrote:
>>>>> Hi Tom,
>>>>>
>>>>> On 31 July 2015 at 08:31, Tom Rini <trini at konsulko.com> wrote:
>>>>> > On Thu, Jul 30, 2015 at 12:12:03PM +0800, Bin Meng wrote:
>>>>> >
>>>>> >> Hi Simon,
>>>>> >>
>>>>> >> When adding x86 multi-cpu initialization on a board with 4 cores, I found:
>>>>> >>
>>>>> >> => cpu list
>>>>> >>   0: cpu@0                  Genuine Intel(R) CPU         @ 1.58GHz
>>>>> >>   1: cpu@1                  Genuine Intel(R) CPU         @ 1.58GHz
>>>>> >>   2: cpu@2                  Genuine Intel(R) CPU         @ 1.58GHz
>>>>> >>   2: cpu@3                  Genuine Intel(R) CPU         @ 1.58GHz
>>>>> >>
>>>>> >> cpu@2 and cpu@3 have the same sequence number, which indicates they
>>>>> >> are running in parallel and racing for the same sequence number. The
>>>>> >> call chain on an AP is: mp_init_cpu() -> device_probe() ->
>>>>> >> uclass_resolve_seq(). Apparently ap2 and ap3 run at the same time
>>>>> >> and end up with the same number.
>>>>> >>
>>>>> >> Note that so far all the x86 boards on which we have enabled
>>>>> >> multi-CPU initialization have only 2 cores, which does not expose
>>>>> >> this issue because there is no parallel execution among APs.
>>>>> >
>>>>> > So what exactly are we doing with these additional cores?  My
>>>>> > recollection of what we do on other arches when we even deal with other
>>>>> > cores is that we bring them "up" and then usually put them in a holding
>>>>> > pattern for the real OS to deal with _or_ it's one of those cases where
>>>>> > we have multiple OSes running and we do what we need to load and release
>>>>> > those other OSes.
>>>>>
>>>>> In this case they end up at stop_this_cpu() which is just a hlt
>>>>> instruction in each case.
>>>>
>>>> So do we really have to be doing anything here?  Or is this just
>>>> pre-emptive work for an async MP type setup down the road?  We could
>>>> probably live with this with a big comment noting why we know it's
>>>> misbehaving.
>>>
>>> I think we should fix it - I suggested some options above and Bin may
>>> have ideas also. Bin may be able to send a patch since he can repeat
>>> the problem.
>>>
>>
>> Yes, we should fix it. But IMHO, fixing only the seq number resolves
>> just the surface problem. What concerns me is that multiple CPUs are
>> running the same code (in this case, the DM core code) without any
>> protection. I have no idea whether the core structures (like the
>> device list) are still consistent from the DM core's perspective.
>> Although right now only the seq number issue is exposed, we don't
>> know whether there are other potential DM issues. So I was thinking
>> that fundamentally we are using the DM CPU uclass in the wrong way.
>
> We don't add devices when running on the AP CPUs - we only scan lists.
> So long as the boot CPU creates all the devices and then waits for
> them to populate, we are OK. I don't see any fundamental problem.
>

OK, that makes me feel better, if we only need to resolve the seq
number issue. I will submit a patch for that.
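
For illustration only, serialising the sequence allocation could look
roughly like the sketch below. This is not the patch itself and not
existing U-Boot code: seq_in_use, seq_lock and claim_next_seq() are
hypothetical stand-ins, and a C11 atomic_flag spinlock is assumed here
just to show the idea of making "find a free number and claim it" one
indivisible step.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_SEQ 8

/* Hypothetical stand-in for the set of sequence numbers already claimed. */
static bool seq_in_use[MAX_SEQ];

/* Hypothetical spinlock guarding seq_in_use (not an existing U-Boot symbol). */
static atomic_flag seq_lock = ATOMIC_FLAG_INIT;

static void seq_lock_acquire(void)
{
	/* Spin until the flag was previously clear, i.e. we now own the lock. */
	while (atomic_flag_test_and_set_explicit(&seq_lock,
						 memory_order_acquire))
		;
}

static void seq_lock_release(void)
{
	atomic_flag_clear_explicit(&seq_lock, memory_order_release);
}

/*
 * Find the lowest free sequence number and claim it while holding the
 * lock, so two CPUs can never both see the same number as free.
 * Returns the claimed number, or -1 if none is left.
 */
static int claim_next_seq(void)
{
	int seq = -1;

	seq_lock_acquire();
	for (int i = 0; i < MAX_SEQ; i++) {
		if (!seq_in_use[i]) {
			seq_in_use[i] = true;
			seq = i;
			break;
		}
	}
	seq_lock_release();

	return seq;
}
```

Without the lock, two APs can both scan the table, both find entry 2
free, and both take it, which matches the duplicate "2:" in the cpu
list output. With the search and the claim done under one lock, each
caller gets a distinct number.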

Regards,
Bin