RISCV: the mechanism of available_harts may cause other harts boot failure

Heinrich Schuchardt xypron.glpk at gmx.de
Mon Sep 5 18:30:00 CEST 2022


On 9/5/22 18:14, Sean Anderson wrote:
> On 9/5/22 12:00 PM, Heinrich Schuchardt wrote:
>> On 9/5/22 17:45, Sean Anderson wrote:
>>> On 9/5/22 11:41 AM, Heinrich Schuchardt wrote:
>>>> On 9/5/22 17:30, Sean Anderson wrote:
>>>>> On 9/5/22 3:47 AM, Nikita Shubin wrote:
>>>>>> Hi Rick!
>>>>>>
>>>>>> On Mon, 5 Sep 2022 14:22:41 +0800
>>>>>> Rick Chen <rickchen36 at gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> When I free-run an SMP system, I once hit a failure case where some
>>>>>>> harts didn't boot to the kernel shell successfully.
>>>>>>> However, I have not been able to reproduce it since, even after many
>>>>>>> tries.
>>>>>>>
>>>>>>> But when I set a breakpoint while debugging with GDB, the failure
>>>>>>> triggers every time.
>>>>>>
>>>>>> If a hart fails to register itself in available_harts before
>>>>>> send_ipi_many is hit by the main hart:
>>>>>> https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/lib/smp.c#L50
>>>>>>
>>>>>> it won't exit the secondary_hart_loop:
>>>>>> https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/cpu/start.S#L433
>>>>>> as no IPI will be sent to it.
>>>>
>>>> Can we call send_ipi_many() again when booting?
>>>
>>> AFAIK we do; see arch/riscv/lib/bootm.c and arch/riscv/lib/spl.c
>>
>> arch/riscv/lib/bootm.c:99:
>> ret = smp_call_function(images->ep,
>>
>> This has no effect when booting via UEFI.
>
> How do you figure?

U-Boot never calls the legacy entry point when booting via UEFI.

>
>> Should efi_exit_boot_services() call the function?
>
> Generally, this needs to be called when secondary_hart_loop is going to be
> overwritten. This can either be because U-Boot is relocating (and so
> something else may occupy the space where it used to be), or we are
> executing the next stage of boot (which may then reuse the memory occupied
> by secondary_hart_loop for something else).
>
> AIUI the EFI client?/payload? gets started by U-Boot, which sticks around
> providing services. I would expect the initial jump to the EFI payload to
> cause the secondary harts to jump there as well.

Secondary harts never enter UEFI payloads.

Best regards

Heinrich

>
> --Sean
>
>> Best regards
>>
>> Heinrich
>>
>>>
>>>> Do we need to call it before booting?
>>>
>>> Yes. We also call it when relocating (in SPL and U-Boot proper).
>>>
>>>>>>
>>>>>> This might be exactly your case.
>>>>>
>>>>> When working on the IPI mechanism, I considered this possibility.
>>>>> However, there's really no way to know how long to wait. On normal
>>>>> systems, the boot hart is going to do a lot of work before calling
>>>>> send_ipi_many, and the other harts just have to make it through ~100
>>>>> instructions. So I figured we would never run into this issue.
>>>>>
>>>>> We might not even need the mask... the only direct reason we might is
>>>>> for OpenSBI, as spl_invoke_opensbi is the only function which uses the
>>>>> wait parameter.
>>>>>
>>>>>>> I think the available_harts mechanism does not guarantee that an
>>>>>>> SMP system boots successfully.
>>>>>>> Maybe we should think of a better way to boot SMP systems, or just
>>>>>>> remove it?
>>>>>>
>>>>>> I haven't experienced any unexplained problem with hart_lottery or
>>>>>> available_harts_lock unless:
>>>>>>
>>>>>> 1) harts are started non-simultaneously
>>>>>> 2) SPL/U-Boot is in some kind of TCM, OCRAM, etc... which is not
>>>>>> cleared on reset which leaves available_harts dirty
>>>>>
>>>>> XIP, of course, has this problem every time and just doesn't use the
>>>>> mask. I remember thinking a lot about how to deal with this, but I
>>>>> never ended up sending a patch because I didn't have a XIP system.
>>>>>
>>>>> --Sean
>>>>>
>>>>>> 3) something is wrong with atomics
>>>>>>
>>>>>> Also there might be something wrong with IPI send/receive.
>>>>>>
>>>>>>>
>>>>>>> Thread 8 hit Breakpoint 1, harts_early_init ()
>>>>>>>
>>>>>>> (gdb) c
>>>>>>> Continuing.
>>>>>>> [Switching to Thread 7]
>>>>>>>
>>>>>>> Thread 7 hit Breakpoint 1, harts_early_init ()
>>>>>>>
>>>>>>> (gdb)
>>>>>>> Continuing.
>>>>>>> [Switching to Thread 6]
>>>>>>>
>>>>>>> Thread 6 hit Breakpoint 1, harts_early_init ()
>>>>>>>
>>>>>>> (gdb)
>>>>>>> Continuing.
>>>>>>> [Switching to Thread 5]
>>>>>>>
>>>>>>> Thread 5 hit Breakpoint 1, harts_early_init ()
>>>>>>>
>>>>>>> (gdb)
>>>>>>> Continuing.
>>>>>>> [Switching to Thread 4]
>>>>>>>
>>>>>>> Thread 4 hit Breakpoint 1, harts_early_init ()
>>>>>>>
>>>>>>> (gdb)
>>>>>>> Continuing.
>>>>>>> [Switching to Thread 3]
>>>>>>>
>>>>>>> Thread 3 hit Breakpoint 1, harts_early_init ()
>>>>>>> (gdb)
>>>>>>> Continuing.
>>>>>>> [Switching to Thread 2]
>>>>>>>
>>>>>>> Thread 2 hit Breakpoint 1, harts_early_init ()
>>>>>>> (gdb)
>>>>>>> Continuing.
>>>>>>> [Switching to Thread 1]
>>>>>>>
>>>>>>> Thread 1 hit Breakpoint 1, harts_early_init ()
>>>>>>> (gdb)
>>>>>>> Continuing.
>>>>>>> [Switching to Thread 5]
>>>>>>>
>>>>>>>
>>>>>>> Thread 5 hit Breakpoint 3, 0x0000000001200000 in ?? ()
>>>>>>> (gdb) info threads
>>>>>>>    Id   Target Id         Frame
>>>>>>>    1    Thread 1 (hart 1) secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>>>>>    2    Thread 2 (hart 2) secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>>>>>    3    Thread 3 (hart 3) secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>>>>>    4    Thread 4 (hart 4) secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>>>>>> * 5    Thread 5 (hart 5) 0x0000000001200000 in ?? ()
>>>>>>>    6    Thread 6 (hart 6) 0x000000000000b650 in ?? ()
>>>>>>>    7    Thread 7 (hart 7) 0x000000000000b650 in ?? ()
>>>>>>>    8    Thread 8 (hart 8) 0x0000000000005fa0 in ?? ()
>>>>>>> (gdb) c
>>>>>>> Continuing.
>>>>>>
>>>>>> Do all the "offline" harts remain in the SPL/U-Boot
>>>>>> secondary_hart_loop?
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [    0.175619] smp: Bringing up secondary CPUs ...
>>>>>>> [    1.230474] CPU1: failed to come online
>>>>>>> [    2.282349] CPU2: failed to come online
>>>>>>> [    3.334394] CPU3: failed to come online
>>>>>>> [    4.386783] CPU4: failed to come online
>>>>>>> [    4.427829] smp: Brought up 1 node, 4 CPUs
>>>>>>>
>>>>>>>
>>>>>>> /root # cat /proc/cpuinfo
>>>>>>> processor       : 0
>>>>>>> hart            : 4
>>>>>>> isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>>>>>> mmu             : sv39
>>>>>>>
>>>>>>> processor       : 5
>>>>>>> hart            : 5
>>>>>>> isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>>>>>> mmu             : sv39
>>>>>>>
>>>>>>> processor       : 6
>>>>>>> hart            : 6
>>>>>>> isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>>>>>> mmu             : sv39
>>>>>>>
>>>>>>> processor       : 7
>>>>>>> hart            : 7
>>>>>>> isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
>>>>>>> mmu             : sv39
>>>>>>>
>>>>>>> /root #
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rick
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>
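
The race discussed in this thread (a hart registering in available_harts
only after the boot hart has already run send_ipi_many) can be illustrated
with a small, deterministic model. This is only a sketch under stated
assumptions, not U-Boot code: every sim_* function, the ipi_pending array,
NR_HARTS and the hart numbers used are hypothetical and exist only for
illustration. Only the names available_harts and send_ipi_many come from
the thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_HARTS 8  /* assumption: eight harts, ids 0..7 */

/* Hypothetical model state; not U-Boot's real data structures. */
static uint32_t available_harts;    /* bit i set => hart i has registered */
static bool ipi_pending[NR_HARTS];  /* true => hart i would leave its wait loop */

static void sim_reset(void)
{
    available_harts = 0;
    for (unsigned int i = 0; i < NR_HARTS; i++)
        ipi_pending[i] = false;
}

/* A secondary hart announces itself in the mask. */
static void sim_register_hart(unsigned int hartid)
{
    available_harts |= UINT32_C(1) << hartid;
}

/*
 * The boot hart sends an IPI to every hart that is registered at this
 * moment, except itself. Returns the number of IPIs sent.
 */
static unsigned int sim_send_ipi_many(unsigned int boot_hart)
{
    unsigned int sent = 0;

    for (unsigned int i = 0; i < NR_HARTS; i++) {
        if (i == boot_hart)
            continue;
        if (!(available_harts & (UINT32_C(1) << i)))
            continue;  /* not registered yet: silently skipped */
        ipi_pending[i] = true;
        sent++;
    }
    return sent;
}

/* Count secondary harts that never received an IPI and so keep waiting. */
static unsigned int sim_count_stuck(unsigned int boot_hart)
{
    unsigned int stuck = 0;

    for (unsigned int i = 0; i < NR_HARTS; i++)
        if (i != boot_hart && !ipi_pending[i])
            stuck++;
    return stuck;
}

/* All seven secondary harts register before the IPIs go out. */
static unsigned int sim_all_registered(void)
{
    sim_reset();
    for (unsigned int i = 1; i < NR_HARTS; i++)
        sim_register_hart(i);
    (void)sim_send_ipi_many(0);
    return sim_count_stuck(0);
}

/* Harts 1-4 register only after the IPIs went out (e.g. held at a breakpoint). */
static unsigned int sim_late_registration(void)
{
    sim_reset();
    for (unsigned int i = 5; i < NR_HARTS; i++)
        sim_register_hart(i);
    (void)sim_send_ipi_many(0);
    for (unsigned int i = 1; i <= 4; i++)
        sim_register_hart(i);  /* too late: nobody re-sends the IPI */
    return sim_count_stuck(0);
}
```

In this model sim_all_registered() leaves no hart waiting, while
sim_late_registration() leaves four harts waiting even though the mask is
eventually complete. The mask records who arrived, but nothing in the model
makes the sender look at it again.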


