[U-Boot] UBI fixable bit-flip issue

Mark Spieth mspieth at digivation.com.au
Thu Jul 12 00:28:08 UTC 2018


Hi

In the process of investigating a boot failure on one of our devices, the

UBI: fixable bit-flip detected at PEB

message was seen with the following behaviour during kernel load in u-boot.

Read [2285568] bytes
UBI: fixable bit-flip detected at PEB 415
UBI: schedule PEB 415 for scrubbing
UBI: fixable bit-flip detected at PEB 415
UBI: fixable bit-flip detected at PEB 419
UBI: schedule PEB 419 for scrubbing
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: schedule PEB 420 for scrubbing
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419
UBI: fixable bit-flip detected at PEB 420
UBI: fixable bit-flip detected at PEB 419

This repeats until reset.

The U-Boot in use is a patched version of 2010.06 supplied by the chip
vendor. No newer version is available from the vendor to try.

The patches include the init eba/wl swap.

A more detailed log with debugging enabled follows:

UBI: fixable bit-flip detected at PEB 419
UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
UBI DBG: erase_worker: erase PEB 419 EC 19
UBI DBG: sync_erase: erase PEB 419, old EC 19
UBI DBG: do_sync_erase: erase PEB 419
UBI DBG: sync_erase: erased PEB 419, new EC 20
UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
UBI DBG: ensure_wear_leveling: schedule scrubbing
UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
UBI: fixable bit-flip detected at PEB 420
UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
UBI: fixable bit-flip detected at PEB 419
UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
UBI DBG: erase_worker: erase PEB 419 EC 20
UBI DBG: sync_erase: erase PEB 419, old EC 20
UBI DBG: do_sync_erase: erase PEB 419
UBI DBG: sync_erase: erased PEB 419, new EC 21
UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
UBI DBG: ensure_wear_leveling: schedule scrubbing
UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
UBI: fixable bit-flip detected at PEB 420
UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
UBI: fixable bit-flip detected at PEB 419

Investigation showed that a read with correctable bit errors was
returning -EUCLEAN to the UBI read function.

I then read
https://lists.denx.de/pipermail/u-boot/2013-September/161961.html, which
details a workaround: do not return EUCLEAN from the NAND reader unless
the number of bits fixed during the read exceeds 75% of the total number
of correctable bits. This was implemented in this version of UBI in
U-Boot 2010.06, and it does hide the infinite bit-flip loop, since this
is new NAND flash. The original 2010.06 implementation returns EUCLEAN
for any number of fixable bit flips and thus causes the PEB to be moved
to the best free one (scrub mode in wear_leveling_worker).
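
A minimal sketch of that kind of threshold check follows. It is not the
code from the linked post or from the vendor tree: the function name
read_status and its two parameters are hypothetical, and treating exactly
75% as "not exceeded" is my assumption.

```c
#include <assert.h>
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value, assumed only so the sketch builds elsewhere */
#endif

/*
 * Hypothetical helper: decide what a NAND read reports to its caller.
 * corrected_bits is the number of bits the ECC fixed during the read,
 * ecc_strength is the number of bits the ECC is able to fix.
 */
static int read_status(unsigned int corrected_bits, unsigned int ecc_strength)
{
    assert(ecc_strength > 0);

    /* Report -EUCLEAN only once corrections exceed 75% of the ECC strength. */
    if (corrected_bits * 4 > ecc_strength * 3)
        return -EUCLEAN;

    /* Fewer corrections: report a clean read, so no scrub is scheduled. */
    return 0;
}
```

With a 4-bit ECC this reports a clean read for up to 3 corrected bits and
-EUCLEAN only for 4, so a new part with an occasional single bit flip
would no longer trigger a PEB move.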

This is not a root-cause fix, though. Investigating further led to the
following root-cause solution. What follows is correct AFAICT.

When the scrubber chooses a PEB to move the data to, it takes it from
the free balanced tree. This tree is sorted by EC (erase count) and then
by PEB number.
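
To illustrate that ordering, here is a small comparator sketch. The
struct wl_key and both function names are hypothetical stand-ins, not
the UBI data structures, and a sorted array stands in for the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sort key for a free PEB; not the real UBI entry type. */
struct wl_key {
    int ec;   /* erase count */
    int pnum; /* physical eraseblock number */
};

/* Order by erase count first, then by PEB number to break ties. */
static int wl_key_cmp(const void *a, const void *b)
{
    const struct wl_key *x = a;
    const struct wl_key *y = b;

    if (x->ec != y->ec)
        return (x->ec > y->ec) - (x->ec < y->ec);
    return (x->pnum > y->pnum) - (x->pnum < y->pnum);
}

/* Sort an array of keys into the same order the free tree would hold. */
static void sort_wl_keys(struct wl_key *keys, size_t n)
{
    assert(keys != NULL || n == 0);
    qsort(keys, n, sizeof(*keys), wl_key_cmp);
}
```

A freshly erased PEB has just had its EC incremented, so on new flash
where all other counts are low it tends to sort towards the end.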

The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which
is 8192 in this config. So the find_wl_entry function will find a PEB
that is better in erase count than the current PEB EC. This can easily
cause it to find the PEB that was just moved from if it is the lowest
numbered PEB in the free tree. Waiting for EC to go above 8192 would
take a long time and cause premature aging of the flash PEBs in
question.

The easy solution is to change the max parameter of this call to 0 so it
finds a PEB with a smaller EC than the one being replaced. This means it
won't use the previously discarded PEB as its first choice.
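
The effect of the max parameter can be shown with a simplified model.
This is a sketch under my own assumptions, not the vendor's code: the
free pool is an array already sorted by EC and then PEB number, and the
selector returns the last entry whose EC is below the lowest EC plus
max_diff. The struct free_peb and the function pick_free are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a free-pool entry; not the UBI struct. */
struct free_peb {
    int pnum; /* physical eraseblock number */
    int ec;   /* erase count */
};

/*
 * Pick from a pool sorted ascending by (ec, pnum): return the last entry
 * whose EC is below lowest_ec + max_diff. If no entry qualifies, fall
 * back to the first entry, which has the lowest EC.
 */
static const struct free_peb *pick_free(const struct free_peb *pool,
                                        size_t n, int max_diff)
{
    const struct free_peb *best;
    int limit;
    size_t i;

    assert(pool != NULL && n > 0);

    best = &pool[0];
    limit = pool[0].ec + max_diff;
    for (i = 0; i < n; i++) {
        if (pool[i].ec < limit)
            best = &pool[i];
    }
    return best;
}
```

Take an illustrative pool of PEB 100 (EC 1), PEB 102 (EC 1) and PEB 419
(EC 21, as in the debug log after its second erase). In this model,
pick_free(pool, 3, 8192) returns PEB 419, the block that was just
erased, while pick_free(pool, 3, 0) returns PEB 100.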

This fix was implemented and fixable bit-flip errors no longer 
hang/freeze the boot process! UBI erase and reformat was used between 
re-tests to get consistent results.

Adding the 75% correctable bit-flip threshold described above is also a
good thing, as less movement will ensue while the flash is new. On its
own, though, the root cause will be triggered again as the flash ages,
causing unrecoverable boot failures.

Note this fault is also in the latest kernel drivers for UBI and may
also exist in other wear-leveling implementations. The kernel driver
issue may be at fault for Android devices locking up/freezing
sporadically during flash reads when scrubbing, due to a relatively full
flash and correctable errors causing ping-pong PEB moves.

The question is, is my root cause solution sound or have I missed something?

I know an algo change would probably be better or a way to detect move 
loops to prevent this from occurring, but this solution does work on all 
the devices that were failing manufacture tests previously.

Regards

Mark


