[U-Boot] [PATCH 01/21] Define new system_restart() and emergency_restart()

Mon Mar 14 22:20:23 CET 2011

On Mar 14, 2011, at 16:38, Wolfgang Denk wrote:
> In message <613C8F89-3CE5-4C28-A48E-D5C3E8143A4C at boeing.com> you wrote:
>> 
>> If just *one* of the 2 CPUs triggers the reset then only *some* of
>> the attached hardware will be properly reset due to a hardware
>> errata, and as a result the board will sometimes hang or corrupt DMA
>> transfers to the SSDs shortly after reset.
> ...
>> Yes, it's a royal pain, but we're stuck with this hardware for the
>> time being, and if the board can't communicate then it might as well
>> hang() anyways.
> 
> Do you agree that this is a highly board-specific problem (I would
> call it a hardware bug, but I don't insist you agree on that term),
> and while there is a the need for you to work around such behaviour
> there is little or no reason to do this, or anything like that, in
> common code ?

Oh, absolutely.  I do think there still needs to be a separation between a "normal user-initiated restart" and an "panic-time emergency restart" though, see further on in this email.

The comment about Ctrl-C was simply because our board restart can be aborted by someone at the console because it may take a while for the other CPU to cleanly shut down and acknowledge.  Our __board_restart() watches for Ctrl-C to handle this nicely, and then returns an error, which makes do_reset() return to the user or script with an error.  Obviously in the middle of a panic() there is nothing useful to return to, so the code keeps polling regardless.

Such decisions on what is and is not "acceptable" to run on a panic() are better left to the individual boards and architectures.  Specifically, the separate board and arch hooks for regular and "emergency" restarts that I included in the patch:

  __arch_restart()
  __board_restart()
  __arch_emergency_restart()
  __board_emergency_restart()

>>> And if there are more things that could be done to provide a "better"
>>> reset, then why should we not always do these?
>> 
>> If the board is in a panic() state it may well have still-running DMA
>> transfers (such as USB URBs), or be in the middle of writing to
>> FLASH.
> 
> The same (at least having USB or other drivers still being enabled,
> and USB writing it's SOF counters to RAM) can happen for any call to
> the reset() function.  I see no reason for assuming there would be
> better or worse conditions to perform a reset.

I would argue that is a bug to be fixed.  Regardless of how various boards and architectures implement "reset", U-Boot should provide generic functionality to drivers and library code to allow them to indicate what they want:

  (1) A safe normal operational restart, with all hardware shut down (as much as is considered necessary on the platform).  Depending on the platform this may fail or take some time.

  (2) A critical error restart, where system state may be undefined and the calling code does not expect the function to ever return.

Linux has *both* of those cases in the kernel:  sys_reboot() and emergency_restart().

>> If this board starts receiving packets and then panic()s, it will
>> disable address translation and immediately re-relocate U-Boot into
>> RAM, then zero the BSS. If the network card tries to receive a packet
>> after BSS is zeroed, it will read a packet buffer address of
>> (probably) 0x0 from the RX ring and promptly overwrite part of
>> U-Boot's memory at that address.
> 
> Agreed.  So this should be fixed.  One clean way to fix it would be to
> help improving the driver model for U-Boot (read: create one) and
> making sure drivers get deinitialized in such a case.

This another excellent reason to have separate system_restart() and emergency_restart().  The "system_restart()" function would hook into the driver model and perform device shutdown (just like Linux), while the emergency_restart() function (and therefore panic()) would not.

The "jump to _start" case is very similar to Linux kexec().  There are two specific use-cases:
  (1) Safe reliable run-time handoff from one kernel to another
  (2) Emergency panic() call into another kernel to record the error and reboot safely

In the former case, the kernel runs all of its normal shutdown handlers and quiesces all DMA before passing control to the new kernel.  Under U-Boot this is analogous to the "system_restart()" function I added.  The "system_restart()" function would be the ideal central place to hook driver-model device shut-down.

In the latter case (similar to "emergency_restart()"), the new kernel runs in a sub-section of memory *completely-preallocated* at boot time, and explicitly ignores all other memory because there might be ongoing DMA transactions.  The "emergency kexec" case intentionally avoids running device shutdown handlers because the system state is not well defined.  Under U-Boot there is no way to run completely from a preallocated chunk of memory, which means that a panic()-time jump to "_start" is not safe.

Cheers,
Kyle Moffett