[U-Boot] UBI fixable bit-flip issue
Richard Weinberger
richard at nod.at
Thu Jul 12 08:46:11 UTC 2018
Mark,
On Thursday, 12 July 2018, 07:22:13 CEST, Heiko Schocher wrote:
> Hello Mark,
>
> added Richard Weinberger to cc...
>
> On 12.07.2018 at 02:28, Mark Spieth wrote:
> > Hi
> >
> > In the process of investigating a boot failure on one of our devices, the
> >
> > UBI: fixable bit-flip detected at PEB
> >
> > message was seen with the following behaviour during kernel load in u-boot.
> >
> > Read [2285568] bytes
> > UBI: fixable bit-flip detected at PEB 415
> > UBI: schedule PEB 415 for scrubbing
> > UBI: fixable bit-flip detected at PEB 415
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: schedule PEB 419 for scrubbing
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: schedule PEB 420 for scrubbing
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> > UBI: fixable bit-flip detected at PEB 420
> > UBI: fixable bit-flip detected at PEB 419
> >
> > This repeats until reset.
Do you see the same symptom also on Linux?
We need to be very sure that it is actually a UBI problem.
> > This fix is not a root cause fix though. Investigating further led to the following root cause
> > solution. The following is AFAICT.
> >
> > When the scrubber chooses a PEB to move to, it picks it from the free balanced tree. This tree is
> > sorted by EC (erase count) and then by PEB number.
> >
> > The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF which is 8192 in this config. So the
> > find_wl_entry function will find a PEB that is better in error count than the current PEB EC. This
error count? You mean erase count?
> > can easily cause it to find the PEB that was just moved from if it is the lowest numbered PEB in the
> > free tree. Waiting for EC to go above 8192 would take a long time and cause premature aging of the
> > flash PEBs in question.
> >
> > The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a smaller
> > EC than the one being replaced. This means it won't use the previously discarded PEB as its first
> > choice.
For scrubbing this might be a good idea, but not for regular wear-leveling.
See comment in UBI:
/*
 * When a physical eraseblock is moved, the WL sub-system has to pick the target
 * physical eraseblock to move to. The simplest way would be just to pick the
 * one with the highest erase counter. But in certain workloads this could lead
 * to an unlimited wear of one or few physical eraseblock. Indeed, imagine a
 * situation when the picked physical eraseblock is constantly erased after the
 * data is written to it. So, we have a constant which limits the highest erase
 * counter of the free physical eraseblock to pick. Namely, the WL sub-system
 * does not pick eraseblocks with erase counter greater than the lowest erase
 * counter plus %WL_FREE_MAX_DIFF.
 */
#define WL_FREE_MAX_DIFF (2*UBI_WL_THRESHOLD)
So we could change the logic such that for regular wear-leveling we keep using WL_FREE_MAX_DIFF,
but for scrubbing (which is 1:1 wear-leveling but the source PEB is showing bit-flips) we use
a lower value. IMHO WL_FREE_MAX_DIFF/2 would be a good choice.
I'm not sure whether 0 is too extreme and might cause other distortions.
Mark, can you please file a patch and send it to linux-mtd mailing list?
Such a change needs to go through Linux and then to u-boot.
But first we need to think about and discuss it in detail.
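To make the idea concrete, here is a rough sketch of the selection logic as a simplified model, not the actual UBI code. It uses a sorted array instead of the RB-tree, and struct free_peb, pick_free_peb() and pick_target() are hypothetical names. The UBI_WL_THRESHOLD value of 4096 is an assumption, chosen only so that WL_FREE_MAX_DIFF matches the 8192 mentioned above.

```c
#include <stddef.h>

/* Assumed value, so that WL_FREE_MAX_DIFF is 8192 as in the reported config. */
#define UBI_WL_THRESHOLD 4096
#define WL_FREE_MAX_DIFF (2 * UBI_WL_THRESHOLD)

/* Hypothetical, simplified stand-in for an entry of the free tree. */
struct free_peb {
	int ec;   /* erase counter */
	int pnum; /* physical eraseblock number */
};

/*
 * Simplified model of picking a free PEB. 'free' must hold at least one
 * entry and be sorted ascending by (ec, pnum), like the free tree.
 * Returns the index of the last entry whose erase counter is below
 * "lowest erase counter + max_diff", or 0 if no entry qualifies.
 */
static size_t pick_free_peb(const struct free_peb *free, size_t n, int max_diff)
{
	size_t i, best = 0;
	int limit = free[0].ec + max_diff;

	for (i = 0; i < n; i++)
		if (free[i].ec < limit)
			best = i;

	return best;
}

/* Hypothetical wrapper: tighter limit for scrubbing than for regular WL. */
static size_t pick_target(const struct free_peb *free, size_t n, int scrubbing)
{
	int max_diff = scrubbing ? WL_FREE_MAX_DIFF / 2 : WL_FREE_MAX_DIFF;

	return pick_free_peb(free, n, max_diff);
}
```

In this model a max_diff of 0 always returns the free entry with the lowest erase counter, which corresponds to Mark's proposal. Whether WL_FREE_MAX_DIFF/2 is tight enough to avoid the ping-pong on a flash where all erase counters are still low is exactly what needs to be discussed.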
> I am not sure if it is so easy ...
>
> > This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI
> > erase and reformat was used between re-tests to get consistent results.
> >
> > Adding the above 75% correctable bitflip threshold is also a good thing as less movement will ensue
> > when the FLASH is new, but as the flash ages, the root cause will once again be invoked causing
> > un-recoverable boot failures.
> >
> > Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear
> > leveling implementations. The kernel driver issue may be at fault for android devices locking
> > up/freezing sporadically during FLASH read when scrubbing due to a relatively full flash and
> > correctable errors causing ping pong PEB moves.
> >
> > The question is, is my root cause solution sound or have I missed something?
>
> I have to think about it before I write nonsense, but maybe Richard has
> deeper insight here.
Please see my comments. :)
Thanks,
//richard