[U-Boot-Users] [PATCH] ppc4xx: Refactor ECC POST for AMCC Denali core

Mon Jan 14 18:44:58 CET 2008

Jerry Van Baren wrote:
> Larry Johnson wrote:
>> The ECC POST reported intermittent failures running after power-up on
>> the Korat PPC440EPx board.  Even when the test passed, the debugging
>> output occasionally reported additional unexpected ECC errors.
>>
>> This refactoring had two main objectives: (1) minimize the code executed
>> with ECC enabled during the tests, and (2) add more checking of the
>> results so any unexpected ECC errors would cause the test to fail.
>>
>> So far, the refactored test has not reported any intermittent failures.
>> Further, synchronization instructions appear no longer to be require, so
>> have been removed.  If intermittent failures do occur in the future, the
>> refactoring should make the causes easier to identify.
> 
> WHOOP, WHOOP, WHOOP, red alert!  "[S]ynchronization instructions appear 
> no longer to be require[d], so have been removed".
> 
> Synchronization instructions either *ARE* required or *ARE NOT* 
> required, there is no "appear".  When sync instructions appear to not be 
> required, but actually are required, that is when really obscure bugs 
> start happening in the dead of winter off the coast of Alaska / Siberia 
> and your boss asks you if you have warm clothes.
> 
> I am not familiar with the 4xx family or the PowerPC core that is used 
> in it but...
> 
> [snip]
> 
>> -static int test_ecc(unsigned long ecc_addr)
>> +static int test_ecc(uint32_t ecc_addr)
>>  {
>> -    unsigned long value;
>> -    volatile unsigned *const ecc_mem = (volatile unsigned *) ecc_addr;
>> -    int pret;
>> +    uint32_t value;
>> +    volatile uint32_t *const ecc_mem = (volatile uint32_t *)ecc_addr;
>>      int ret = 0;
>>  
>> -    sync();
>> -    eieio();
>>      WATCHDOG_RESET();
> 
> The combination of "sync" and "eieio" is a strong indication of someone 
> sprinkling pixie dust rather than understanding the problem.
> 
> "Sync" forces all pending I/O (read/write) operations to be completed 
> all the way to memory/hardware register before the instruction 
> continues.  Sync guarantees *WHEN* the I/O will complete: NOW.  This is 
> a big hammer: it can cause a significant performance hit because it 
> stalls the processor BUT it is guaranteed effective (except for the 
> places that need both an isync and a sync combination - thankfully, I 
> believe that is only needed in special cases when playing with the 
> processor's control registers).
> 
> "Eieio" (enforce in-order execution of I/O) is a barrier that says all 
> I/O that goes before it must be completed before any I/O that goes after 
> it is started.  It *DOES NOT* guarantee *WHEN* the preceding 
> reads/writes will be completed.  Theoretically, the bus interface unit 
> (BIU) could hold the whole shootin' match for 10 minutes before it does 
> the preceding I/O followed by the succeeding I/O.  Eieio is much less 
> draconian to the processor than sync (which is why eieios are preferred) 
> but an eieio may or may not cause the intended synchronizing result if 
> you are relying on a write or read causing the proper effect *NOW*. Note 
> that eieios are NOPs to processor cores that don't reorder I/O.
> 
> Some PowerPC cores (e.g. the 74xx family) can reorder reads and writes 
> in the bus interface unit (some cores, such as the 603e, do *not* 
> reorder reads and writes).  This is a performance enhancement... writes 
> (generally) are non-blocking to the processor core where a read causes 
> the processor to have to wait for the data (which cascades into pipeline 
> stalls and performance hits).  The bus is a highly oversubscribed 
> resource (core speed / bus speed can be 8x or more).  As a result, you 
> want to get reads done ASAP (if possible) and thus it is beneficial to 
> move a read ahead of a write.
> 
> As you should have picked up by now, a sync (forcing all I/O to 
> complete) followed by eieio is silly - the eieio is superfluous.  Seeing 
> syncs/isyncs/eieios sprinkled in code is an indication that the author 
> didn't understand what was going on and, as a result, kept hitting the 
> problem with a bigger and bigger hammer until it appeared to have gone 
> away.
> 
> Besides read/write reordering problems, the bus interface unit (BIU) can 
> "short circuit" a read that follows a write to the same address.  This 
> is very likely to be implemented in a given core - it offers a very good 
> speed up traded off against a modest increase in complexity to the BIU. 
>  The problem is (for instance), if you configure your EDC to store an 
> invalid EDC flag, do a write to a test location (which gets held in the 
> BIU because the bus is busy), followed by a read of the test location 
> (expecting to see an EDC failure), the BIU could return the queued *but 
> unwritten* write value.
> 
> OK, enough lecturing...
> 
> Repeated disclaimer: What I write here is applicable for more complex 
> PowerPC implementations.  It may not be applicable for the particular 
> 4xx core you are running on.  I am not familiar with the 4xx core.
> 
> The reason sync/eieio is very likely VITAL in an EDC test is that...
> 
> 1) The EDC is being reconfigured in a way that can cause latent EDC 
> faults.  If a write - for instance a save of a register on the stack - 
> gets deferred inadvertently until _after_ the EDC hardware is configured 
> to test an error, you could end up with the register save performed with 
> inadvertently screwed up EDC.  As a result, when your code executes the 
> return epilogue (popping the register off the stack) you will get a 
> totally unexpected EDC error.
> 
> 2) Even if an eieio is used properly, the (EDC reconfiguration, write, 
> read, EDC fixup) sequence may occur in the right order, but it may occur 
> *way later* than you expected which could cause an EDC exception way 
> later in the code than you expected.  This would lead to very flaky 
> results, unexpected EDC failures, etc.  Hmmmmm.
> 
> While the previous scenarios are worst cases, improper sync discipline 
> can cause test failures as well.  In fact, it is actually more likely to 
> cause problems with the test than the worst case scenario.  For 
> instance, if the BIU holds a write and short-circuits subsequent reads, 
> you may *think* you are testing EDC but, if the BIU has the write queued 
> and the read comes from the BIU rather than actual memory, the BIU will 
> inadvertently short circuit your test as well.
> 
> By the way, if interrupts are enabled during this time.......... 
> (shudders) oooh, good choice to run polled, Dan/Wolfgang!
> 
> [major snip]
> 
> HTH,
> gvb

Yes, it does help.  Thanks, Jerry.

When I first modified the (then) LWMON5 ECC POST to run on Korat and
other 440EPx boards, Stefan urged me to replace the memory accesses
using volatile pointers with accesses via "in_be32()" et al.  As I
understood it, this should have eliminated the need for any external
synchronization.  However, after checking the PPC440 documentation, I
now believe that there should be a "sync" (actually, an "msync", which
is the same opcode) between the memory access and access to the SDRAM-
controller registers.

(From what I can tell, "in_be32()" et al. do not force completion
of the storage access before returning.  Is this correct?)
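
For illustration only, the ordering proposed above (a full barrier
between the memory access and the controller-register access) might
look roughly like the sketch below.  This is not code from the patch.
The names full_barrier(), ctrl_write(), fake_ctrl_reg and
poke_then_reconfigure() are hypothetical, and the controller register
is simulated by a plain variable so the sketch builds on any host; on
real hardware ctrl_write() would be the board's own register accessor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Full barrier.  On PowerPC this emits "sync" (the same opcode as
 * "msync" on the PPC440).  On other hosts a compiler builtin stands
 * in, so the sketch still builds and runs.
 */
static inline void full_barrier(void)
{
#if defined(__powerpc__) || defined(__PPC__)
	__asm__ __volatile__("sync" : : : "memory");
#else
	__sync_synchronize();
#endif
}

/* Hypothetical stand-in for an SDRAM-controller register. */
static uint32_t fake_ctrl_reg;

/* Hypothetical accessor; real code would write the actual register. */
static void ctrl_write(uint32_t val)
{
	fake_ctrl_reg = val;
}

/*
 * Write a test word, force the store to complete, and only then touch
 * the controller.  A second barrier keeps the read-back from being
 * issued before the controller write has completed.
 */
static uint32_t poke_then_reconfigure(volatile uint32_t *ecc_mem,
				      uint32_t pattern, uint32_t ctrl_val)
{
	assert(ecc_mem != NULL);

	*ecc_mem = pattern;	/* storage access under test */
	full_barrier();		/* complete it before reconfiguring */
	ctrl_write(ctrl_val);	/* controller-register access */
	full_barrier();		/* complete that before the read-back */

	return *ecc_mem;
}
```

Whether the second barrier is needed on the 440 depends on how the
core orders register accesses against storage accesses; it is included
here as the conservative choice, not as a statement about the hardware.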

Stefan, please hold off on this patch, as I expect to be resubmitting it
soon. :-)

Best regards,
Larry