[ELDK] hard freeze / spontaneous reboot in native compiles (ELDK 4.2, PPC6xx, NFS root)

Fri Apr 23 09:11:32 CEST 2010

Dear Anthony,

In message <g39ymlsrs.fsf at dworkin.scrye.com> you wrote:
> 
> > Are you absolutely sure that your system is really running stable?
> > Symptoms like this are often caused by memory errors, which get
> > triggered under high load (typically when the RAM gets accessed in
> > burst mode more frequently).
> 
> I'm not entirely sure, but this isn't a pre-production board or
> prototype -- it's a shipping product (Actis XCOM-9347).  And it will
> happily run for months on the vendor's kernel and busybox-based
> ramdisk.

That does not necessarily mean much. Possibly this configuration has
never stressed the memory interface as much. Try putting some
serious network load on it, and run some memory-intensive applications
so you can be sure that all of the available RAM actually gets used -
it is possible that the problems happen only with some parts of the
RAM, like the upper bank or similar.
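
As a minimal sketch of what "making sure all of the RAM gets used"
could look like, the following fills a large buffer with an
index-dependent pattern and reads it back. This is only an
illustration, not a tool shipped with ELDK; the function names and the
buffer size are assumptions. A plain test like this does not generate
DMA or heavy burst traffic, so a clean pass proves very little.

```python
from array import array

MULT = 2654435761  # arbitrary odd multiplier so neighbouring words differ

def fill_pattern(words, seed=0):
    """Return an array of `words` items, each derived from its own index."""
    buf = array("L", [0]) * words  # unsigned long items, at least 4 bytes each
    for i in range(words):
        buf[i] = (i * MULT + seed) & 0xFFFFFFFF
    return buf

def verify_pattern(buf, seed=0):
    """Return a list of (index, expected, actual) for every mismatching item."""
    bad = []
    for i, actual in enumerate(buf):
        expected = (i * MULT + seed) & 0xFFFFFFFF
        if actual != expected:
            bad.append((i, expected, actual))
    return bad

if __name__ == "__main__":
    WORDS = 4 * 1024 * 1024  # hypothetical size; choose it close to the free RAM
    for seed in (0, 0x55555555, 0xAAAAAAAA):
        bad = verify_pattern(fill_pattern(WORDS, seed), seed)
        print("seed %#010x: %d mismatches" % (seed, len(bad)))
```

Mismatches that cluster at high indices would fit the "only the upper
bank" theory, but the mapping from buffer index to physical address is
up to the kernel, so treat that as a hint at best.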

For example, you can try to NFS mount the ELDK root directory in this
setup and then "chroot" into it, and then re-run your compile jobs
there.

I bet a beer that it will crash then, too.

> (Granted, I wasn't stressing it in that mode, but the two boxes I
> tried it on have both run the vendor kernel+ramdisk for literally
> months of uptime.)

We have seen such behaviour quite often before. Memory issues are
sometimes difficult to detect - they can go unnoticed for years.
Compiling code with root file system mounted over NFS is our favorite
stress test for this very purpose - you will get lots of DMA (from the
network controller) and other burst mode accesses (from cache flushing
/ fetching) which are not easy to achieve with memory test
programs.
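
Because failures of this kind tend to show up at random, it can help
to repeat the same compile job and record how each run ends. The
sketch below does that; the helper name, the default command and the
run count are assumptions, so substitute the job that actually fails
for you. Each result is flushed immediately so that the last line is
still visible on the console if the board freezes.

```python
import subprocess
import sys

def run_repeatedly(cmd, runs):
    """Run `cmd` (a list of arguments) `runs` times; return the exit codes."""
    codes = []
    for n in range(runs):
        rc = subprocess.call(cmd)
        codes.append(rc)
        # Report at once: after a hard freeze nothing buffered would get out.
        print("run %d: exit code %d" % (n + 1, rc))
        sys.stdout.flush()
    return codes

if __name__ == "__main__":
    # Hypothetical default job; pass the real command on the command line.
    cmd = sys.argv[1:] or ["make"]
    codes = run_repeatedly(cmd, 10)
    failed = [c for c in codes if c != 0]
    print("%d of %d runs failed" % (len(failed), len(codes)))
```

Run it once with the root file system on NFS and once from the
ramdisk, and compare the failure counts.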

> In the unstable configuration, I found that I could do small compile
> with both gcc and g++ (e.g., "hello world", and a few-hundred-lines
> graph path finder).  But when I tried to build boost, it would die
> randomly.

Again this smells as if only parts of the RAM show the problem - when
you have only a lightly loaded system these might not be used, or at
least not stressed enough.

Of course this is all just speculation - but I think I recognize this
smell.

> The errors weren't floating-point exceptions, at least not on the
> first box.  I'll have to experiment with the second box to see if
> that was the pattern there.

A typical pattern is that there is no clear pattern ;-)

You might try and hook up a logic analyzer to the data bus and check
the signals, but this is always a LOT of work.

> The SDRAM seems to be set up correctly by u-boot (again, supplied by
> the vendor.)  They ship a debian root; I'll have to try again to see
> if I can't figure out the differences, or if I can do stress tests on
> the box running on the debian root.

Try to run the same compile under Debian, mounting the root file
system over NFS as well. I'm confident that it will crash the same
way.

If you want to throw manpower at it, then start with a review of the
memory initialization. Compare against the RAM chip manuals - make
sure that all delays, dummy reads or writes and such that are
mentioned in the manual are actually implemented in the code. Each
tiny detail might be the critical one.
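
One part of such a review is plain arithmetic: convert each minimum
time from the data sheet into bus clock cycles, rounding up, and
compare it with the cycle count the initialization code programs. A
rough sketch follows. The helper names are made up, and the numbers in
the example are placeholders, NOT values from any data sheet or from
the XCOM-9347.

```python
import math

def ns_to_cycles(t_ns, bus_mhz):
    """Smallest whole number of bus clocks that covers t_ns nanoseconds."""
    return math.ceil(t_ns * bus_mhz / 1000.0)

def check_timings(required_ns, programmed_cycles, bus_mhz):
    """Return the names of parameters whose programmed value is too short."""
    too_short = []
    for name, t_ns in sorted(required_ns.items()):
        if programmed_cycles[name] < ns_to_cycles(t_ns, bus_mhz):
            too_short.append(name)
    return too_short

if __name__ == "__main__":
    # Placeholder numbers for illustration only; take the real minimums
    # from the RAM chip manual and the real cycle counts from the
    # memory controller setup in the boot loader.
    required_ns = {"tRCD": 20, "tRP": 20, "tRC": 65}
    programmed_cycles = {"tRCD": 2, "tRP": 2, "tRC": 6}
    print(check_timings(required_ns, programmed_cycles, bus_mhz=100))
```

This covers only the timing fields. The power-up sequence (delays,
precharge, refresh cycles, mode register write) still has to be
checked step by step against the manual.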

> Thank you very much for your assistance, and for providing such an
> excellent set of tools for us to use.

You are welcome - and good luck.

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
My challenge to the goto-less programmer  is  to  recode  tcp_input()
without any gotos ... without any loss of efficiency (there has to be
a catch).                                             - W. R. Stevens