[ELDK] hard freeze / spontaneous reboot in native compiles (ELDK 4.2, PPC6xx, NFS root)

Steven A. Falco sfalco at harris.com
Fri Apr 23 14:38:47 CEST 2010


Wolfgang Denk wrote:
> Dear Anthony,
> 
> In message <g39ymlsrs.fsf at dworkin.scrye.com> you wrote:
>>> Are you absolutely sure that your system is really running stable?
>>> Symptoms like this are often caused by memory errors, which get
>>> triggered under high load (typically when the RAM gets accessed in
>>> burst mode more frequently).
>> I'm not entirely sure, but this isn't a pre-production board or
>> prototype -- it's a shipping product (Actis XCOM-9347).  And it will
>> happily run for months on the vendor's kernel and busybox-based
>> ramdisk.
> 
> That does not necessarily mean much. Possibly this configuration has
> never stressed the memory interface as much. Try to add some
> serious network load on it, and run some memory intensive applications
> so you can be sure that all of the available RAM actually gets used -
> possibly the problems happen only in some parts of the RAM, like
> the upper bank or similar.
> 
> For example, you can try to NFS mount the ELDK root directory in this
> setup and then "chroot" into it, and then re-run your compile jobs
> there.
> 
> I bet a beer that it will crash then, too.
> 

You might also try running memtester:

http://pyropus.ca/software/memtester/

Give it as much RAM to test as you can, and let it run for a while.
It has caught obscure memory problems on hardware for me.
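
memtester takes the amount of memory and an optional loop count on the
command line, e.g. "memtester 200M 5" (the numbers are only an example).
For a rough idea of what a tester of this kind does, here is a minimal
sketch in C, assuming a plain user-space pattern test. It is not
memtester's code, and the names check_pattern and run_memory_passes are
made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: fill the buffer with an index-dependent pattern,
 * then read it back. Returns the number of mismatching 32-bit words. */
static size_t check_pattern(volatile uint32_t *buf, size_t words,
                            uint32_t seed)
{
    size_t i, errors = 0;

    for (i = 0; i < words; i++)
        buf[i] = seed ^ (uint32_t)i;

    for (i = 0; i < words; i++) {
        uint32_t expected = seed ^ (uint32_t)i;
        uint32_t got = buf[i];

        if (got != expected) {
            fprintf(stderr, "mismatch at word %lu: want 0x%08lx, got 0x%08lx\n",
                    (unsigned long)i, (unsigned long)expected,
                    (unsigned long)got);
            errors++;
        }
    }
    return errors;
}

/* Hypothetical entry point: run `passes` passes over `bytes` bytes of heap,
 * cycling through a few fixed seeds. Returns the total number of
 * mismatches, or (size_t)-1 if the buffer could not be allocated. */
size_t run_memory_passes(size_t bytes, unsigned passes)
{
    static const uint32_t seeds[] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };
    size_t words = bytes / sizeof(uint32_t);
    uint32_t *buf = malloc(words * sizeof(uint32_t));
    size_t errors = 0;
    unsigned p;

    if (buf == NULL)
        return (size_t)-1;

    for (p = 0; p < passes; p++)
        errors += check_pattern(buf, words, seeds[p % 4]);

    free(buf);
    return errors;
}
```

One way to use it would be to call run_memory_passes(64 * 1024 * 1024, 4)
from a small main() and treat any nonzero return as a failure. A test
like this only exercises CPU accesses through the cache and generates no
DMA, so a clean pass does not prove the RAM is good.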

	Steve

>> (Granted, I wasn't stressing it in that mode, but the two boxes I
>> tried it on have both run the vendor kernel+ramdisk for literally
>> months of uptime.)
> 
> We have seen such behaviour quite often before. Memory issues are
> sometimes difficult to detect - they can go unnoticed for years.
> Compiling code with root file system mounted over NFS is our favorite
> stress test for this very purpose - you will get lots of DMA (from the
> network controller) and other burst mode accesses (from cache flushing
> / fetching) which are not easy to achieve with any memory test
> program.
> 
>> In the unstable configuration, I found that I could do small compile
>> with both gcc and g++ (e.g., "hello world", and a few-hundred-lines
>> graph path finder).  But when I tried to build boost, it would die
>> randomly.
> 
> Again this smells as if only parts of the RAM show the problem - when
> you have only a lightly loaded system these might not be used, or at
> least not stressed enough.
> 
> Of course this is all just speculation - but I think I recognize this
> smell.
> 
>> The errors weren't floating-point exceptions, at least not on the
>> first box.  I'll have to experiment with the second box to see if
>> that was the pattern there.
> 
> A typical pattern is that there is no clear pattern ;-)
> 
> You might try and hook up a logic analyzer to the data bus and check
> the signals, but this is always a LOT of work.
> 
>> The SDRAM seems to be set up correctly by u-boot (again, supplied by
>> the vendor.)  They ship a debian root; I'll have to try again to see
>> if I can't figure out the differences, or if I can do stress tests on
>> the box running on the debian root.
> 
> Try to run the same compile under Debian, mounting the root file
> system over NFS as well. I'm confident that it will crash the same
> way.
> 
> If you want to throw man-power at it, then start with a review of the
> memory initialization. Compare against the RAM chip manuals - make
> sure that all delays, dummy reads or writes and such that are
> mentioned in the manual are actually implemented in the code. Each
> tiny detail might be the critical one.
> 
>> Thank you very much for your assistance, and for providing such an
>> excellent set of tools for us to use.
> 
> You are welcome - and good luck.
> 
> Best regards,
> 
> Wolfgang Denk
> 


