[U-Boot] UBI fixable bit-flip issue

Richard Weinberger richard at nod.at
Thu Jul 12 08:46:11 UTC 2018


Mark,

On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
> Hello Mark,
> 
> added Richard Weinberger to cc...
> 
> On 12.07.2018 at 02:28, Mark Spieth wrote:
> > Hi
> > 
> > In the process of investigating a boot failure on one of our devices, the
> > 
> > UBI: fixable bit-flip detected at PEB
> > 
> > message was seen with the following behaviour during kernel load in u-boot.
> > 
> > Read [2285568] bytes
> > UBI: fixable bit-flip detected at PEB 415
> > UBI: schedule PEB 415 for scrubbing
> > UBI: fixable bit-flip detected at PEB 415
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: schedule PEB 419 for scrubbing
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: schedule PEB 420 for scrubbing
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > 
> > This repeats until reset.

Do you see the same symptom also on Linux?
We need to be very sure that it is actually a UBI problem.

> > This fix is not a root cause fix though. Investigating further led to the following root cause 
> > solution. The following is AFAICT.
> > 
> > When the scrubber chooses a PEB to move the data to, it picks one from the free balanced tree. 
> > This tree is sorted by EC (erase count) and then by PEB number.
> > 
> > The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which is 8192 in this config. So the 
> > find_wl_entry function will find a PEB that is better in error count than the current PEB EC. This

error count? You mean erase count?
 
> > can easily cause it to find the PEB that was just moved from if it is the lowest numbered PEB in the 
> > free tree. Waiting for EC to go above 8192 would take a long time and cause premature aging of the 
> > flash PEBs in question.
> > 
> > The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a smaller 
> > EC than the one being replaced. This means it won't use the previously discarded PEB as its first 
> > choice.

For scrubbing this might be a good idea, but not for regular wear-leveling.

See comment in UBI:
/*
 * When a physical eraseblock is moved, the WL sub-system has to pick the target
 * physical eraseblock to move to. The simplest way would be just to pick the
 * one with the highest erase counter. But in certain workloads this could lead
 * to an unlimited wear of one or few physical eraseblock. Indeed, imagine a
 * situation when the picked physical eraseblock is constantly erased after the
 * data is written to it. So, we have a constant which limits the highest erase
 * counter of the free physical eraseblock to pick. Namely, the WL sub-system
 * does not pick eraseblocks with erase counter greater than the lowest erase
 * counter plus %WL_FREE_MAX_DIFF.
 */
#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)

So we could change the logic such that for regular wear-leveling we keep using WL_FREE_MAX_DIFF,
but for scrubbing (which is 1:1 wear-leveling but the source PEB is showing bit-flips) we use
a lower value. IMHO WL_FREE_MAX_DIFF/2 would be a good choice.
I'm not sure whether 0 is too extreme and might cause other distortions.
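
A rough sketch of that idea follows. It is a simplified model on a sorted
array, not the actual UBI code: struct free_peb, enum move_reason,
max_diff_for() and pick_target() are hypothetical names made up for
illustration, and the threshold of 4096 is only inferred from the 8192
reported above for this config.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a free-PEB entry (not struct ubi_wl_entry). */
struct free_peb {
    int ec;    /* erase counter */
    int pnum;  /* physical eraseblock number */
};

/* Hypothetical: why the data is being moved. */
enum move_reason { MOVE_WEAR_LEVELING, MOVE_SCRUBBING };

/* Assumed values: 2 * 4096 matches the 8192 reported for this config. */
#define SKETCH_WL_THRESHOLD  4096
#define SKETCH_FREE_MAX_DIFF (2 * SKETCH_WL_THRESHOLD)

/* Regular wear-leveling keeps the full limit, scrubbing uses half of it. */
static int max_diff_for(enum move_reason reason)
{
    return reason == MOVE_SCRUBBING ? SKETCH_FREE_MAX_DIFF / 2
                                    : SKETCH_FREE_MAX_DIFF;
}

/*
 * free_pebs must be sorted ascending by (ec, pnum).
 * Returns the last entry whose erase counter is below
 * "lowest erase counter + limit", or the first entry if no other
 * entry qualifies. Returns NULL if the pool is empty.
 */
static const struct free_peb *pick_target(const struct free_peb *free_pebs,
                                          size_t n, enum move_reason reason)
{
    if (n == 0)
        return NULL;

    int limit = free_pebs[0].ec + max_diff_for(reason);
    const struct free_peb *best = &free_pebs[0];

    for (size_t i = 1; i < n; i++) {
        if (free_pebs[i].ec >= limit)
            break;          /* sorted input: nothing further can qualify */
        best = &free_pebs[i];
    }
    return best;
}
```

With a lower limit for scrubbing, the choice is confined to free PEBs whose
erase counter is closer to the lowest one. Whether that is enough to avoid
the ping-pong described above is exactly what needs to be discussed.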

Mark, can you please file a patch and send it to the linux-mtd mailing list?
Such a change needs to go through Linux and then to u-boot.
But first we need to think about and discuss it in detail.
 
>   I am not sure if it is so easy ...
>
> > This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI 
> > erase and reformat was used between re-tests to get consistent results.
> > 
> > Adding the above 75% correctable bitflip threshold is also a good thing as less movement will ensue 
> > when the FLASH is new, but as the flash ages, the root cause will once again be invoked causing 
> > un-recoverable boot failures.
> > 
> > Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear 
> > leveling implementations. The kernel driver issue may be at fault for android devices locking 
> > up/freezing sporadically during FLASH read when scrubbing due to a relatively full flash and 
> > correctable errors causing ping pong PEB moves.
> > 
> > The question is, is my root cause solution sound or have I missed something?
> 
> I have to think about it before I write nonsense, but maybe Richard has
> a deeper insight here.

Please see my comments. :)

Thanks,
//richard
