[U-Boot] UBI fixable bit-flip issue

Heiko Schocher hs at denx.de
Thu Jul 12 05:22:13 UTC 2018


Hello Mark,

I have added Richard Weinberger to Cc.

On 12.07.2018 at 02:28, Mark Spieth wrote:
> Hi
> 
> In the process of investigating a boot failure on one of our devices, the
> 
> UBI: fixable bit-flip detected at PEB
> 
> message was seen with the following behaviour during kernel load in u-boot.
> 
> Read [2285568] bytes
> UBI: fixable bit-flip detected at PEB 415
> UBI: schedule PEB 415 for scrubbing
> UBI: fixable bit-flip detected at PEB 415
> UBI: fixable bit-flip detected at PEB 419
> UBI: schedule PEB 419 for scrubbing
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: schedule PEB 420 for scrubbing
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> UBI: fixable bit-flip detected at PEB 420
> UBI: fixable bit-flip detected at PEB 419
> 
> This repeats until reset.
> 
> U-Boot is a patched version of 2010.06 supplied by the chip vendor. No newer version is available
> from the vendor to try.

:-(

Can you try current mainline? It's hard to say anything
about an 8-year-old vendor U-Boot version ...

> The patches include the init eba/wl swap.

What do you mean here?

> A more detailed log with debugging available follows:
> 
> UBI: fixable bit-flip detected at PEB 419
> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
> UBI DBG: erase_worker: erase PEB 419 EC 19
> UBI DBG: sync_erase: erase PEB 419, old EC 19
> UBI DBG: do_sync_erase: erase PEB 419
> UBI DBG: sync_erase: erased PEB 419, new EC 20
> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
> UBI DBG: ensure_wear_leveling: schedule scrubbing
> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
> UBI: fixable bit-flip detected at PEB 420
> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
> UBI: fixable bit-flip detected at PEB 419
> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
> UBI DBG: erase_worker: erase PEB 419 EC 20
> UBI DBG: sync_erase: erase PEB 419, old EC 20
> UBI DBG: do_sync_erase: erase PEB 419
> UBI DBG: sync_erase: erased PEB 419, new EC 21
> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
> UBI DBG: ensure_wear_leveling: schedule scrubbing
> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
> UBI: fixable bit-flip detected at PEB 420
> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
> UBI: fixable bit-flip detected at PEB 419
> 
> Investigation showed that a read with correctable bit errors returned -EUCLEAN to the UBI
> read function.
> 
> I read https://lists.denx.de/pipermail/u-boot/2013-September/161961.html, which details a
> workaround: do not return EUCLEAN from the NAND reader unless the number of bits fixed during the
> read exceeds 75% of the total number of correctable bits. This was implemented in
> this version of UBI in U-Boot 2010.06, and it does hide the infinite bit-flip loop, since this is new
> NAND flash. The original 2010.06 implementation returns EUCLEAN for any number of fixable bit flips
> and thus causes the PEB to be moved to the best free one (scrub mode in wear_leveling_worker).
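
The 75% rule described above can be sketched as follows. This is only a
minimal illustration: bitflip_read_status() and its two parameters are
hypothetical names, not taken from the vendor tree or from mainline, and the
fallback value for EUCLEAN is only there so the snippet builds on hosts whose
errno.h lacks it.

```c
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, fallback for hosts without it */
#endif

/*
 * Hypothetical helper: decide what a NAND read reports upwards.
 * corrected_bits - bit flips the ECC fixed during this read
 * ecc_strength   - maximum number of bit flips the ECC can fix
 */
static int bitflip_read_status(unsigned int corrected_bits,
			       unsigned int ecc_strength)
{
	/* Ask for scrubbing only when more than 75% of the ECC budget is used. */
	if (4 * corrected_bits > 3 * ecc_strength)
		return -EUCLEAN;

	/* Fewer corrections: the data is good, report a clean read. */
	return 0;
}
```

With a 4-bit ECC this reports a clean read for up to 3 corrected bits and
-EUCLEAN only when all 4 were needed, whereas the original behaviour returns
-EUCLEAN for a single corrected bit.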
> 
> This fix is not a root cause fix though. Investigating further led to the following root cause 
> solution. The following is AFAICT.
> 
> The scrubber chooses the PEB to move the data to from the free balanced tree. This tree is sorted by EC
> (erase count) and then by PEB number.
> 
> The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which is 8192 in this config. So the
> find_wl_entry function will find a PEB that is better in erase count than the current PEB EC. This
> can easily cause it to find the PEB that was just moved from if it is the lowest numbered PEB in the
> free tree. Waiting for EC to go above 8192 would take a long time and cause premature aging of the
> flash PEBs in question.
> 
> The easy solution is to change the max parameter of this call to 0 so it finds a PEB with a smaller
> EC than the one being replaced. This means it won't use the previously discarded PEB as its first
> choice.
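
To make the effect of the max parameter concrete, here is a simplified model
of the selection. It is a sketch under assumptions, not the real code: the
rb-tree is replaced by an array sorted by (EC, PEB number), struct wl_entry
and pick_free_entry() are made-up names, and it assumes that find_wl_entry
returns the free entry with the highest EC that is still below
(lowest EC + max). Please check that assumption against your tree.

```c
#include <stddef.h>

/* Simplified stand-in for a free-tree entry. */
struct wl_entry {
	int ec;		/* erase counter */
	int pnum;	/* physical eraseblock number */
};

/*
 * free_list must hold n >= 1 entries sorted by (ec, pnum).
 * Returns the last entry whose EC is below (lowest EC + max), or the
 * first entry if none qualifies, which is always the case for max == 0.
 */
static const struct wl_entry *pick_free_entry(const struct wl_entry *free_list,
					      size_t n, int max)
{
	const struct wl_entry *best = &free_list[0];
	int limit = free_list[0].ec + max;
	size_t i;

	for (i = 0; i < n; i++) {
		if (free_list[i].ec >= limit)
			break;	/* sorted: all later entries are over the limit too */
		best = &free_list[i];
	}

	return best;
}
```

In this model, a free list of { EC 3, PEB 100 }, { EC 5, PEB 101 } and the
just erased { EC 21, PEB 419 } yields PEB 419 again for max = 8192, but
PEB 100 for max = 0.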

I am not sure it is that easy ...

> This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI 
> erase and reformat was used between re-tests to get consistent results.
> 
> Adding the above 75% correctable bitflip threshold is also a good thing as less movement will ensue 
> when the FLASH is new, but as the flash ages, the root cause will once again be invoked causing 
> un-recoverable boot failures.
> 
> Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear
> leveling implementations. The kernel driver issue may be at fault for Android devices locking
> up/freezing sporadically during flash reads when scrubbing, due to a relatively full flash and
> correctable errors causing ping-pong PEB moves.
> 
> The question is, is my root cause solution sound or have I missed something?

I have to think about it before I write nonsense, but maybe Richard has
deeper insight here.

> I know an algo change would probably be better or a way to detect move loops to prevent this from 
> occurring, but this solution does work on all the devices that were failing manufacture tests 
> previously.
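
For the "detect move loops" idea, one possible shape is a small history of
recently scrubbed source PEBs that the target selection could consult. This
is purely a hypothetical sketch: none of these names exist in UBI, and the
history depth of 4 is an arbitrary assumption.

```c
#include <stdbool.h>
#include <stddef.h>

#define RECENT_MOVES 4	/* hypothetical history depth */

/* Ring buffer of PEBs that data was recently scrubbed away from. */
struct move_history {
	int pnum[RECENT_MOVES];
	size_t next;
};

static void history_init(struct move_history *h)
{
	size_t i;

	for (i = 0; i < RECENT_MOVES; i++)
		h->pnum[i] = -1;	/* -1 marks an unused slot */
	h->next = 0;
}

/* Remember pnum, overwriting the oldest entry once the buffer is full. */
static void history_record(struct move_history *h, int pnum)
{
	h->pnum[h->next] = pnum;
	h->next = (h->next + 1) % RECENT_MOVES;
}

/* True if pnum was a scrub source within the last RECENT_MOVES moves. */
static bool history_contains(const struct move_history *h, int pnum)
{
	size_t i;

	for (i = 0; i < RECENT_MOVES; i++)
		if (h->pnum[i] == pnum)
			return true;

	return false;
}
```

A scrub target for which history_contains() is true could then be skipped in
favour of the next free PEB, which would break a 419 <-> 420 ping-pong
regardless of the erase counters.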

bye,
Heiko
-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs at denx.de

