[U-Boot-Users] [PATCH] ppc4xx: Refactor ECC POST for AMCC Denali core
Larry Johnson
lrj at acm.org
Mon Jan 14 18:44:58 CET 2008
Jerry Van Baren wrote:
> Larry Johnson wrote:
>> The ECC POST reported intermittent failures running after power-up on
>> the Korat PPC440EPx board. Even when the test passed, the debugging
>> output occasionally reported additional unexpected ECC errors.
>>
>> This refactoring had two main objectives: (1) minimize the code executed
>> with ECC enabled during the tests, and (2) add more checking of the
>> results so any unexpected ECC errors would cause the test to fail.
>>
>> So far, the refactored test has not reported any intermittent failures.
>> Further, synchronization instructions appear no longer to be require, so
>> have been removed. If intermittent failures do occur in the future, the
>> refactoring should make the causes easier to identify.
>
> WHOOP, WHOOP, WHOOP, red alert! "[S]ynchronization instructions appear
> no longer to be require[d], so have been removed".
>
> Synchronization instructions either *ARE* required or *ARE NOT*
> required, there is no "appear". When sync instructions appear to not be
> required, but actually are required, that is when really obscure bugs
> start happening in the dead of winter off the coast of Alaska / Siberia
> and your boss asks you if you have warm clothes.
>
> I am not familiar with the 4xx family or the PowerPC core that is used
> in it but...
>
> [snip]
>
>> -static int test_ecc(unsigned long ecc_addr)
>> +static int test_ecc(uint32_t ecc_addr)
>> {
>> - unsigned long value;
>> - volatile unsigned *const ecc_mem = (volatile unsigned *) ecc_addr;
>> - int pret;
>> + uint32_t value;
>> + volatile uint32_t *const ecc_mem = (volatile uint32_t *)ecc_addr;
>> int ret = 0;
>>
>> - sync();
>> - eieio();
>> WATCHDOG_RESET();
>
> The combination of "sync" and "eieio" is a strong indication of someone
> sprinkling pixie dust rather than understanding the problem.
>
> "Sync" forces all pending I/O (read/write) operations to be completed
> all the way to memory/hardware register before the instruction
> continues. Sync guarantees *WHEN* the I/O will complete: NOW. This is
> a big hammer: it can cause a significant performance hit because it
> stalls the processor BUT it is guaranteed effective (except for the
> places that need an both an isync and a sync combination - thankfully, I
> believe that is only needed in special cases when playing with the
> processor's control registers).
>
> "Eieio" (enforce in-order execution of I/O) is a barrier that says all
> I/O that goes before it must be completed before any I/O that goes after
> it is started. It *DOES NOT* guarantee *WHEN* the preceding
> reads/writes will be completed. Theoretically, the bus interface unit
> (BIU) could hold the whole shootin' match for 10 minutes before it does
> the preceding I/O followed by the succeeding I/O. Eieio is much less
> draconian to the processor than sync (which is why eieios are preferred)
> but an eieio may or may not cause the intended synchronizing result if
> you are relying on a write or read causing the proper effect *NOW*. Note
> that eieios are NOPs to processor cores that don't reorder I/O.
>
> Some PowerPC cores (e.g. the 74xx family) can reorder reads and writes
> in the bus interface unit (some cores, such as the 603e, do *not*
> reorder reads and writes). This is a performance enhancement... writes
> (generally) are non-blocking to the processor core where a read causes
> the processor to have to wait for the data (which cascades into pipeline
> stalls and performance hits). The bus is a highly oversubscribed
> resource (core speed / bus speed can be 8x or more). As a result, you
> want to get reads done ASAP (if possible) and thus it is beneficial to
> move a read ahead of a write.
>
> As you should have picked up by now, a sync (forcing all I/O to
> complete) followed by eieio is silly - the eieio is superfluous. Seeing
> syncs/isyncs/eieios sprinkled in code is an indication that the author
> didn't understand what was going on and, as a result, kept hitting the
> problem with a bigger and bigger hammer until it appeared to have gone
> away.
>
> Besides read/write reordering problems, the bus interface unit (BIU) can
> "short circuit" a read that follows a write to the same address. This
> is very likely to be implemented in a given core - it offers a very good
> speed up traded off against a modest increase in complexity to the BIU.
> The problem is (for instance), if you configure your EDC to store an
> invalid EDC flag, do a write to a test location (which gets held in the
> BIU because the bus is busy), followed by a read of the test location
> (expecting to see an EDC failure), the BIU could return the queued *but
> unwritten* write value.
>
> OK, enough lecturing...
>
> Repeated disclaimer: What I write here is applicable for more complex
> PowerPC implementations. It may not be applicable for the particular
> 4xx core you are running on. I am not familiar with the 4xx core.
>
> The reason sync/eieio is very likely VITAL in a EDC test is that...
>
> 1) The EDC is being reconfigured in a way that can cause latent EDC
> faults. If a write - for instance a save of a register on the stack -
> gets deferred inadvertently until _after_ the EDC hardware is configured
> to test an error, you could end up with the register save performed with
> inadvertently screwed up EDC. As a result, when your code executes the
> return postlog (popping the register off the stack) you will get a
> totally unexpected EDC error.
>
> 2) Even if an eieio is used properly, the (EDC reconfiguration, write,
> read, EDC fixup) sequence may occur in the right order, but it may occur
> *way later* than you expected which could cause an EDC exception way
> later in the code than you expected. This would lead to very flaky
> results, unexpected EDC failures, etc. Hmmmmm.
>
> While the previous scenarios are worst cases, improper sync discipline
> can cause test failures as well. In fact, it is actually more likely to
> cause problems with the test that the worst case scenario. For
> instance, if the BIU holds a write and short-circuits subsequent reads,
> you may *think* you are testing EDC but, if the BIU has the write queued
> and the read comes from the BIU rather than actual memory, the BIU will
> inadvertently short circuit your test as well.
>
> By the way, if interrupts are enabled during this time..........
> (shudders) oooh, good choice to run polled, Dan/Wolfgang!
>
> [major snip]
>
> HTH,
> gvb
Yes, it does help. Thanks, Jerry.
When I first modified the (then) LWMON5 ECC POST to run on Korat and
other 440EPx boards, Stefan urged me to replace the memory accesses
using volatile pointers with accesses via "in_be32()" et al. As I
understood it, this should have eliminated the need for any external
synchronization. However, after checking the PPC440 documetation, I
now believe that there should be a "sync" (actually, an "msync", which
is the same opcode) between the memory access and access to the SDRAM-
controller registers.
(From what I can tell, "in_be32()" et al. do not not force completion
of the storage access before returning. Is this correct?)
Stefan, please hold of on this patch, as I expect to be resubmitting it
soon. :-)
Best regards,
Larry
More information about the U-Boot
mailing list