[U-Boot] [PATCH 01/21] Define new system_restart() and emergency_restart()

Mon Mar 14 20:52:37 CET 2011

On Mar 14, 2011, at 14:59, Wolfgang Denk wrote:
> In message <A60AEA13-1206-4699-9302-0DF9C0F9DE28 at boeing.com> you wrote:
>> My own board needs both processor modules to synchronize resets to allow
>> them to come back up at all, which means that a "reset" may block for an
>> arbitrary amount of time waiting for the other module to cleanly shut down
>> and restart (or waiting for somebody to type "reset" on the other U-Boot).
>> If someone just types "reset" on the console, I want to allow them to hit
>> Ctrl-C to interrupt the process.
> 
> This is not what the "reset" command is supposed to do.  The reset
> command is supposed to be the software equivalent of someone pressing
> the reset button on your board - to the extend possible to be
> implemented in software.

On our boards, when the "reset" button is pressed in hardware, both processor modules on the board and all the attached hardware reset at the same time.

If just *one* of the 2 CPUs triggers the reset then only *some* of the attached hardware will be properly reset due to a hardware errata, and as a result the board will sometimes hang or corrupt DMA transfers to the SSDs shortly after reset.

The only way to reset either chip safely is by resetting both at the same time, which requires them to communicate before the reset and wait (possibly a long time) for the other board to agree to reset.  Yes, it's a royal pain, but we're stuck with this hardware for the time being, and if the board can't communicate then it might as well hang() anyways.

This same logic is also implemented in my Linux board-support code, so when one CPU requests a reset the other treats it as a Ctrl-Alt-Del.

>>> What is the difference between these two - and why do we need
>>> different functions at all?
>>> 
>>> A reset is a reset is a reset, isn't it?
>> 
>> That might be true *IF* all boards could actually perform a real hardware reset.
>> 
>> Some can't, and instead they just jump to their reset vector (Nios-II) or to flash (some ppc 74xx/7xx systems).
> 
> So this is the "reset" on these boards, then.
> 
>> If the board just panic()ed or got an unhandled trap or exception, then you
>> don't want to do a soft-reset that assumes everything is OK.  A startup in
>> a bad environment like that could corrupt FLASH or worse.  Right now there
>> is no way to tell the difference, but the lower-level arch-specific code
>> really should care.
> 
> I don't understand your chain of arguments.
> 
> If there really is no better way to implement the reset on such
> boards, then what else can we do?
> 
> And if there are more things that could be done to provide a "better"
> reset, then why should we not always do these?

If the board is in a panic() state it may well have still-running DMA transfers (such as USB URBs), or be in the middle of writing to FLASH.

Performing a jump to early-boot code which is only ever tested when everything is OK and devices are properly initialized is a great way to cause data corruption.

I know for a fact that our boards would rather hang forever than try to reset without cooperation from the other CPU.

>>> My initial feeling is a plain NAK, for this and the rest of the patch
>>> series.  Why would we want all this?
>> 
>> While I was going through the hooks I noticed that several of them were
>> explicitly NOT safe if the board was in the middle of a panic() for whatever
> 
> Can you please peovide some specific examples?  I don't understand what
> you are talking about.

Ok, using the ppmc7xx board as an example:

        /* Disable and invalidate cache */
        icache_disable();
        dcache_disable();

        /* Jump to cold reset point (in RAM) */
        _start();

        /* Should never get here */
        while(1)
                ;

This board uses the EEPRO100 driver, which appears to set up statically allocated TX and RX rings which the device performs DMA to/from.

If this board starts receiving packets and then panic()s, it will disable address translation and immediately re-relocate U-Boot into RAM, then zero the BSS.  If the network card tries to receive a packet after BSS is zeroed, it will read a packet buffer address of (probably) 0x0 from the RX ring and promptly overwrite part of U-Boot's memory at that address.

Depending on the initialization process and memory layout, other similar boards could start writing garbage values to FLASH control registers and potentially corrupt data.

Since the panic() path is so infrequently used and tested, it's better to be safe and hang() on the boards which do not have a reliable hardware-level reset than it is to cause undefined behavior or potentially corrupt data.

Cheers,
Kyle Moffett