RISCV: the mechanism of available_harts may cause other harts boot failure
Sean Anderson
seanga2 at gmail.com
Mon Sep 5 17:30:38 CEST 2022
On 9/5/22 3:47 AM, Nikita Shubin wrote:
> Hi Rick!
>
> On Mon, 5 Sep 2022 14:22:41 +0800
> Rick Chen <rickchen36 at gmail.com> wrote:
>
>> Hi,
>>
>> When I free-run an SMP system, I once hit a failure case where some
>> harts didn't boot to the kernel shell successfully.
>> However, I can't reproduce it anymore, even after many attempts.
>>
>> But when I set a breakpoint while debugging with GDB, it triggers the
>> failure case every time.
>
> If a hart fails to register itself in available_harts before
> send_ipi_many is hit by the main hart:
> https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/lib/smp.c#L50
>
> it won't exit the secondary_hart_loop:
> https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/cpu/start.S#L433
> as no IPI will be sent to it.
>
> This might be exactly your case.
When working on the IPI mechanism, I considered this possibility. However,
there's really no way to know how long to wait. On normal systems, the boot
hart is going to do a lot of work before calling send_ipi_many, and the
other harts just have to make it through ~100 instructions. So I figured we
would never run into this issue.
We might not even need the mask... the only direct reason we might is for
OpenSBI, as spl_invoke_opensbi is the only function which uses the wait
parameter.
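
To make the race concrete, here is a minimal host-side sketch of the
behaviour described above. It is not the code in arch/riscv/lib/smp.c;
struct hart_sim, sim_register_hart() and sim_send_ipi_many() are
hypothetical names, and the mask is assumed to fit in 64 bits. It only
shows that a hart which registers after the boot hart has walked the mask
never gets an IPI.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the available_harts bookkeeping.
 * This is an illustration only, not U-Boot's real data structure.
 */
struct hart_sim {
	uint64_t available;	/* bit N set: hart N has registered itself */
	uint64_t ipi_pending;	/* bit N set: an IPI was sent to hart N */
};

/* Secondary hart side: announce presence early in boot. */
static void sim_register_hart(struct hart_sim *s, unsigned int hart)
{
	s->available |= UINT64_C(1) << hart;
}

/*
 * Boot hart side: send an IPI to every other hart that is registered
 * at the moment of the call. Harts that have not registered yet are
 * skipped and are never revisited. Returns the number of skipped harts.
 * Assumes nr_harts <= 64 so that every shift stays in range.
 */
static unsigned int sim_send_ipi_many(struct hart_sim *s,
				      unsigned int boot_hart,
				      unsigned int nr_harts)
{
	unsigned int skipped = 0;
	unsigned int hart;

	for (hart = 0; hart < nr_harts && hart < 64; hart++) {
		if (hart == boot_hart)
			continue;
		if (!(s->available & (UINT64_C(1) << hart))) {
			skipped++;	/* late hart: no IPI, stays in its loop */
			continue;
		}
		s->ipi_pending |= UINT64_C(1) << hart;
	}

	return skipped;
}
```

Stopping the secondaries at a breakpoint before they register, as in the
GDB session below, corresponds to calling sim_send_ipi_many() before
sim_register_hart() has run for those harts.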
>> I think the mechanism of available_harts does not provide a method
>> that guarantees the success of the SMP system.
>> Maybe we should think of a better way to do SMP booting, or just
>> remove it?
>
> I haven't experienced any unexplained problem with hart_lottery or
> available_harts_lock unless:
>
> 1) harts are started non-simultaneously
> 2) SPL/U-Boot is in some kind of TCM, OCRAM, etc... which is not cleared
> on reset which leaves available_harts dirty
XIP, of course, has this problem every time and just doesn't use the mask.
I remember thinking a lot about how to deal with this, but I never ended
up sending a patch because I didn't have an XIP system.
--Sean
> 3) something is wrong with atomics
>
> Also, there might be something wrong with IPI send/receive.
>
>>
>> Thread 8 hit Breakpoint 1, harts_early_init ()
>>
>> (gdb) c
>> Continuing.
>> [Switching to Thread 7]
>>
>> Thread 7 hit Breakpoint 1, harts_early_init ()
>>
>> (gdb)
>> Continuing.
>> [Switching to Thread 6]
>>
>> Thread 6 hit Breakpoint 1, harts_early_init ()
>>
>> (gdb)
>> Continuing.
>> [Switching to Thread 5]
>>
>> Thread 5 hit Breakpoint 1, harts_early_init ()
>>
>> (gdb)
>> Continuing.
>> [Switching to Thread 4]
>>
>> Thread 4 hit Breakpoint 1, harts_early_init ()
>>
>> (gdb)
>> Continuing.
>> [Switching to Thread 3]
>>
>> Thread 3 hit Breakpoint 1, harts_early_init ()
>> (gdb)
>> Continuing.
>> [Switching to Thread 2]
>>
>> Thread 2 hit Breakpoint 1, harts_early_init ()
>> (gdb)
>> Continuing.
>> [Switching to Thread 1]
>>
>> Thread 1 hit Breakpoint 1, harts_early_init ()
>> (gdb)
>> Continuing.
>> [Switching to Thread 5]
>>
>>
>> Thread 5 hit Breakpoint 3, 0x0000000001200000 in ?? ()
>> (gdb) info threads
>>   Id   Target Id          Frame
>>   1    Thread 1 (hart 1)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>   2    Thread 2 (hart 2)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>   3    Thread 3 (hart 3)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
>>   4    Thread 4 (hart 4)  secondary_hart_loop () at arch/riscv/cpu/start.S:436
>> * 5    Thread 5 (hart 5)  0x0000000001200000 in ?? ()
>>   6    Thread 6 (hart 6)  0x000000000000b650 in ?? ()
>>   7    Thread 7 (hart 7)  0x000000000000b650 in ?? ()
>>   8    Thread 8 (hart 8)  0x0000000000005fa0 in ?? ()
>> (gdb) c
>> Continuing.
>
> Do all the "offline" harts remain in the SPL/U-Boot secondary_hart_loop?
>
>>
>>
>>
>> [ 0.175619] smp: Bringing up secondary CPUs ...
>> [ 1.230474] CPU1: failed to come online
>> [ 2.282349] CPU2: failed to come online
>> [ 3.334394] CPU3: failed to come online
>> [ 4.386783] CPU4: failed to come online
>> [ 4.427829] smp: Brought up 1 node, 4 CPUs
>>
>>
>> /root # cat /proc/cpuinfo
>> processor : 0
>> hart : 4
>> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
>> mmu : sv39
>>
>> processor : 5
>> hart : 5
>> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
>> mmu : sv39
>>
>> processor : 6
>> hart : 6
>> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
>> mmu : sv39
>>
>> processor : 7
>> hart : 7
>> isa : rv64i2p0m2p0a2p0c2p0xv5-1p1
>> mmu : sv39
>>
>> /root #
>>
>> Thanks,
>> Rick
>