[U-Boot-Users] Re: Redundant environment

Sat May 6 01:28:47 CEST 2006

Dear Tolunay,

in message <445B8086.9000404 at orkun.us> you wrote:
> 
> This patch would solve the issue that exists today that when the 
> "active" environment is lost/corrupted for some reason the "redundant" 
> environment would contain an exact copy of the primary to have the board 
> come up without requiring the need to redo the changes that was lost on 

Actually I think that you will not achieve this with your patch.
This is why I'm concerned. You see, if you feel better having this
patch I would not complain, but I am afraid that a lot of people
might just activate it because they think it would do them some good
when it doesn't (and actually it just hurts).

There is only one occasion when we have any significant likelihood of
losing the environment data: this is when a call to "saveenv" fails
because either a) we have a power loss, b) we have an otherwise
induced reset of the CPU, or c) the flash sector that is to be
erased/written is failing.

So where exactly does your modification improve  anything?  Let's  go
through this step by step.

Case 1: power loss/reset happens during the first "saveenv", i.e.
        when writing the first copy of the new environment data.

        In this case this first copy  contains  no  valid  data;  the
        second copy of the environment contains valid, but old data.

        This is exactly the same as we have with the current
        implementation. I don't see any improvement.

Case 2: power loss/reset happens during the second "saveenv", i.e.
        when writing the second copy of the new environment data.

        In this case the first copy contains valid new data, while
        the second copy of the environment does not contain valid
        data.

        In the current implementation, the first (and only) saveenv
        would have completed, too, and the reset would hit after
        leaving this part of the code, so we would have valid new
        data in the first copy, and valid (but old) data in the
        second one.

        Again, this is not an improvement. Actually I think the
        current implementation is even more useful.

Case 3: A flash sector in the first copy of the  environment  becomes
        defective  while  we  erase or write it. In this case we will
        see appropriate error conditions, and the  "saveenv"  command
        will abort.

        This is the same as case 1: no valid data in copy  1,  valid,
        but  old  data  in copy 2; no difference between the existing
        and your new implementation.

Case 4: A flash sector in the second copy of the environment  becomes
        defective  while  we  erase or write it. In this case we will
        see appropriate error conditions, and the  "saveenv"  command
        will abort.

        This is the same as case 2: valid new  data  in  copy  1,  no
        valid  data  in copy 2 with your implementation, but probably
        valid old data with the existing code.
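
The four cases above can be replayed with a small toy model. This is
only a sketch in Python, not U-Boot code: the function names, the
two-element list standing in for the two flash copies, and the
"interrupt_at" parameter are all assumptions made for illustration.
A power loss, a reset and a failing sector are treated alike, since
each leaves the copy being written without valid data.

```python
OLD, NEW, BAD = "old", "new", None  # BAD: copy fails its checksum


def proposed_save(interrupt_at=None):
    """Patch as described above: write copy 1, then copy 2.

    interrupt_at=1 or 2 aborts during that write (cases 1/3 and 2/4).
    """
    copies = [OLD, OLD]
    for i in (0, 1):
        if interrupt_at == i + 1:
            copies[i] = BAD      # half-written copy is unusable
            break
        copies[i] = NEW
    return copies


def current_save(interrupt_at=None):
    """Existing scheme: only one copy is rewritten per saveenv.

    copies[0] is the copy being rewritten; copies[1] is left alone.
    Simplification: both copies start out as "old".
    """
    copies = [OLD, OLD]
    if interrupt_at == 1:
        copies[0] = BAD
    else:
        # There is no second write, so an interruption "at step 2"
        # arrives after the save has already completed.
        copies[0] = NEW
    return copies


if __name__ == "__main__":
    for step in (1, 2):
        print("interrupted at write", step,
              "| proposed:", proposed_save(step),
              "| current:", current_save(step))
```

In this model the two schemes end up identical when the first write
is interrupted, and the existing scheme keeps one more valid copy
when the interruption comes during the second write.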

I guess I must have missed some cases because there was none yet
where the new implementation would improve the reliability. Please
fill in these missing cases.

But, and I think this is an undisputed fact, the current
implementation needs only half the number of erase/write cycles, so
it causes much less flash wear than your code. [Actually your code
will see the same level of flash wear as you have now without the
redundant environment enabled; it's that enabling the current
implementation of redundancy *improves* flash lifetime by halving the
number of erase/write cycles to the environment.]
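
The wear argument is plain arithmetic. As a rough sketch in Python
(the scheme names "single", "alternating" and "mirrored" are labels
invented here, and the model assumes each saveenv erases and rewrites
a whole environment sector exactly once per copy it touches):

```python
def erase_cycles(scheme, saves):
    """Erase/write cycles seen by each environment sector."""
    if scheme == "single":
        # No redundancy: the one sector is rewritten on every save.
        return [saves]
    if scheme == "alternating":
        # Existing redundancy: saves alternate between two sectors.
        return [(saves + 1) // 2, saves // 2]
    if scheme == "mirrored":
        # Proposed patch: both sectors are rewritten on every save.
        return [saves, saves]
    raise ValueError("unknown scheme: %r" % (scheme,))


if __name__ == "__main__":
    for scheme in ("single", "alternating", "mirrored"):
        print(scheme, erase_cycles(scheme, 1000))
```

Under these assumptions, after 1000 saves each sector has seen 1000
cycles in the "single" and "mirrored" schemes, but only 500 in the
"alternating" one.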

> Among the things that can cause one environment to go corrupt would be 
> charge decays in memory cells in aging flash, supply variations/noise 

I think that the likelihood of such a thing happening during read
accesses only is infinitesimal.

> during erase/write and random memory corruption when power is 

I  agree  that  erase/write  cycles  are  the  critical  phase  where
corruption  may  happen, and which we want to try to protect with our
implementation. See above.

> interrupted while another section of flash memory is being written/erased.

I don't see how this could happen to flash. [Well,  I've  seen  flash
corruption  before; this was on Intel flash where you could write the
flash control commands to arbitrary  addresses,  so  just  copying  a
binary  image  to  a  flash  device  could cause random write / erase
actions. But then, such devices should have hardware flash protection
(which you should enable, or you deserve what you get), or if you are
concerned about reliability you would avoid such devices like hell.]

> Sure these could cause other problems as well like if this issue happens 
> for U-Boot code the system might become un-bootable. But at least we 
> have full recovery for the case when it happens within U-Boot environment.

I'm not sure I can follow that logic. If you have some undetected and
unexpected memory corruption in your flash, and  if  you  care  about
reliability,  then you must try to recognize such situations and halt
the system. Trying to continue in such  an  undefined  state  is  too
hazardous.
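
For the environment itself, recognizing corruption comes down to a
checksum test on each copy. The following is a minimal sketch in
Python, not the U-Boot implementation: the layout (a 4-byte
little-endian CRC32 followed by the data) is a simplification, and
the function names are made up for this example.

```python
import struct
import zlib


def pack_env(data):
    """Prefix the environment data with its CRC32 (assumed layout)."""
    return struct.pack("<I", zlib.crc32(data)) + data


def env_is_valid(blob):
    """True if the stored CRC32 matches the data that follows it."""
    if len(blob) < 4:
        return False
    (stored,) = struct.unpack("<I", blob[:4])
    return stored == zlib.crc32(blob[4:])


def load_env(copies):
    """Return the data of the first valid copy, or refuse to go on."""
    for blob in copies:
        if env_is_valid(blob):
            return blob[4:]
    raise RuntimeError("no valid environment copy; halting")


if __name__ == "__main__":
    good = pack_env(b"bootdelay=3\0")
    damaged = good[:-1] + b"\x01"   # corrupt the last byte
    print(env_is_valid(good), env_is_valid(damaged))
    print(load_env([damaged, good]))
```

A check like this only covers the environment sectors; corruption in
other parts of the flash needs its own detection.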

So, can you please fill in the scenario where your modification would
really help to make the system more reliable?

Best regards,

Wolfgang Denk

-- 
Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de
Our business is run on trust.  We trust you will pay in advance.