[U-Boot] [PATCH 01/21] Define new system_restart() and emergency_restart()

Mon Mar 14 21:38:08 CET 2011

Dear "Moffett, Kyle D",

In message <613C8F89-3CE5-4C28-A48E-D5C3E8143A4C at boeing.com> you wrote:
>
> On our boards, when the "reset" button is pressed in hardware, both
> processor modules on the board and all the attached hardware reset at
> the same time.

OK.  So a sane design would provide a way for both of the processors
to do the same, for example by toggling some GPIO or similar.

> If just *one* of the 2 CPUs triggers the reset then only *some* of
> the attached hardware will be properly reset due to a hardware
> erratum, and as a result the board will sometimes hang or corrupt DMA
> transfers to the SSDs shortly after reset.
...
> Yes, it's a royal pain, but we're stuck with this hardware for the
> time being, and if the board can't communicate then it might as well
> hang() anyways.

Do you agree that this is a highly board-specific problem (I would
call it a hardware bug, but I don't insist you agree on that term),
and while there is a need for you to work around such behaviour
there is little or no reason to do this, or anything like that, in
common code?
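
To illustrate the split being argued for here, the following is a minimal
sketch only, assuming a generic hook that common code consults before a
reset while the two-CPU workaround itself lives entirely in board code.
All names (set_board_reset_prepare, common_reset_allowed,
dual_cpu_reset_prepare, board_set_peer_ready) are hypothetical and are
not existing U-Boot interfaces.

```c
#include <stddef.h>

/* Hypothetical hook type: returns 0 if the board is ready to reset. */
typedef int (*board_reset_prepare_fn)(void);

/* NULL means the board needs no special handling before reset. */
static board_reset_prepare_fn board_reset_prepare;

/* Board code installs its hook here (hypothetical API). */
void set_board_reset_prepare(board_reset_prepare_fn fn)
{
	board_reset_prepare = fn;
}

/*
 * Common code: knows nothing about any particular board.
 * Returns 1 if the reset may proceed, 0 if the board vetoed it.
 */
int common_reset_allowed(void)
{
	if (board_reset_prepare == NULL)
		return 1;
	return board_reset_prepare() == 0;
}

/* ---- Board code for an assumed two-CPU board ---- */

static int peer_cpu_ready;

/* Records whether the other CPU has agreed to reset (assumed mechanism). */
void board_set_peer_ready(int ready)
{
	peer_cpu_ready = ready;
}

/* Board-specific check: refuse to reset without the peer CPU. */
int dual_cpu_reset_prepare(void)
{
	return peer_cpu_ready ? 0 : -1;
}
```

With this shape, boards that install no hook keep the plain reset path,
and only the affected board carries the extra check.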

> > And if there are more things that could be done to provide a "better"
> > reset, then why should we not always do these?
> 
> If the board is in a panic() state it may well have still-running DMA
> transfers (such as USB URBs), or be in the middle of writing to
> FLASH.

The same (at least having USB or other drivers still being enabled,
and USB writing its SOF counters to RAM) can happen for any call to
the reset() function.  I see no reason for assuming there would be
better or worse conditions to perform a reset.

> Performing a jump to early-boot code which is only ever tested when
> everything is OK and devices are properly initialized is a great way
> to cause data corruption.

If there is a software way to prevent such issues, then these steps
should always be performed.

> I know for a fact that our boards would rather hang forever than try
> to reset without cooperation from the other CPU.

As mentioned above, this is a board-specific issue that should not
influence common code design.

> >> While I was going through the hooks I noticed that several of them were
> >> explicitly NOT safe if the board was in the middle of a panic() for whatever
> > 
> > Can you please provide some specific examples?  I don't understand what
> > you are talking about.
> 
> Ok, using the ppmc7xx board as an example:
> 
>         /* Disable and invalidate cache */
>         icache_disable();
>         dcache_disable();
> 
>         /* Jump to cold reset point (in RAM) */
>         _start();
> 
>         /* Should never get here */
>         while(1)
>                 ;
> 
> This board uses the EEPRO100 driver, which appears to set up
> statically allocated TX and RX rings which the device performs DMA
> to/from.
> 
> If this board starts receiving packets and then panic()s, it will
> disable address translation and immediately re-relocate U-Boot into
> RAM, then zero the BSS. If the network card tries to receive a packet
> after BSS is zeroed, it will read a packet buffer address of
> (probably) 0x0 from the RX ring and promptly overwrite part of
> U-Boot's memory at that address.

Agreed.  So this should be fixed.  One clean way to fix it would be to
help improve the driver model for U-Boot (read: create one) and
making sure drivers get deinitialized in such a case.
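
As a rough illustration of what such deinitialization could look like,
here is a minimal sketch, assuming a small registry of "quiesce"
callbacks that drivers register at init time and that the reset path
runs before jumping anywhere. No such driver model exists in U-Boot at
this point; every name below (quiesce_register, quiesce_all, fake_nic)
is hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch: none of these names exist in U-Boot. */
#define MAX_QUIESCE_HOOKS 8

typedef void (*quiesce_fn)(void *ctx);

struct quiesce_hook {
	quiesce_fn fn;
	void *ctx;
};

static struct quiesce_hook hooks[MAX_QUIESCE_HOOKS];
static size_t hook_count;

/* Called by a driver at init time; returns 0 on success, -1 on failure. */
int quiesce_register(quiesce_fn fn, void *ctx)
{
	if (fn == NULL || hook_count >= MAX_QUIESCE_HOOKS)
		return -1;
	hooks[hook_count].fn = fn;
	hooks[hook_count].ctx = ctx;
	hook_count++;
	return 0;
}

/*
 * Called on the reset/panic path before re-entering early boot code.
 * Runs hooks in reverse registration order and returns how many ran.
 */
size_t quiesce_all(void)
{
	size_t ran = 0;

	while (hook_count > 0) {
		hook_count--;
		hooks[hook_count].fn(hooks[hook_count].ctx);
		ran++;
	}
	return ran;
}

/* Stand-in for a network device that DMAs into RX rings. */
struct fake_nic {
	int rx_dma_enabled;
};

/* Stops the (simulated) receiver so it can no longer write to RAM. */
void fake_nic_stop(void *ctx)
{
	struct fake_nic *nic = ctx;

	nic->rx_dma_enabled = 0;
}
```

A driver would call quiesce_register(fake_nic_stop, &nic) once its rings
are set up, and the reset path would call quiesce_all() before the BSS
is touched again.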

> Since the panic() path is so infrequently used and tested, it's
> better to be safe and hang() on the boards which do not have a
> reliable hardware-level reset than it is to cause undefined behavior
> or potentially corrupt data.

I disagree.  Instead of adding somewhat obscure alternate code paths
(which get tested even less frequently) we should focus on fixing
such problems where we run into them.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
Microsoft Multitasking:
                     several applications can crash at the same time.