[U-Boot] UBI fixable bit-flip issue
Mark Spieth
mspieth at digivation.com.au
Thu Jul 12 05:38:44 UTC 2018
On 12/07/18 15:22, Heiko Schocher wrote:
> Hello Mark,
>
> added Richard Weinberger to cc...
>
> Am 12.07.2018 um 02:28 schrieb Mark Spieth:
>> Hi
>>
>> In the process of investigating a boot failure on one of our devices,
>> the
>>
>> UBI: fixable bit-flip detected at PEB
>>
>> message was seen with the following behaviour during kernel load in
>> u-boot.
>>
>> Read [2285568] bytes
>> UBI: fixable bit-flip detected at PEB 415
>> UBI: schedule PEB 415 for scrubbing
>> UBI: fixable bit-flip detected at PEB 415
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: schedule PEB 419 for scrubbing
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: schedule PEB 420 for scrubbing
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>> UBI: fixable bit-flip detected at PEB 420
>> UBI: fixable bit-flip detected at PEB 419
>>
>> This repeats until reset.
>>
>> U-Boot is a patched version of 2010.06 supplied by the chip vendor.
>> No newer version is available from the vendor to try.
>
> :-(
>
> Can you use current mainline? It's hard to say anything
> about an 8-year-old vendor U-Boot version ...
I know. I did look at the current 2018.07 and at 2014.10 for comparison.
There are many patches applied by the vendor so porting them with the
large changes to driver structure would be difficult and time consuming.
The vendor is Lantiq and the SDK is current (this year).
>
>> The patches include the init eba/wl swap.
>
> What do you mean here?
https://lists.denx.de/pipermail/u-boot/2013-January/143199.html
This patch was already applied by the vendor.
ubi_eba_init_scan() must be called before ubi_wl_init_scan(), and in
that baseline they were the wrong way around.
There is only one other thread about fixable bit-flips (from 2011) and
it was not useful for this problem.
>
>> A more detailed log with debugging available follows:
>>
>> UBI: fixable bit-flip detected at PEB 419
>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
>> UBI DBG: erase_worker: erase PEB 419 EC 19
>> UBI DBG: sync_erase: erase PEB 419, old EC 19
>> UBI DBG: do_sync_erase: erase PEB 419
>> UBI DBG: sync_erase: erased PEB 419, new EC 20
>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>> UBI: fixable bit-flip detected at PEB 420
>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>> UBI: fixable bit-flip detected at PEB 419
>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
>> UBI DBG: erase_worker: erase PEB 419 EC 20
>> UBI DBG: sync_erase: erase PEB 419, old EC 20
>> UBI DBG: do_sync_erase: erase PEB 419
>> UBI DBG: sync_erase: erased PEB 419, new EC 21
>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>> UBI: fixable bit-flip detected at PEB 420
>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>> UBI: fixable bit-flip detected at PEB 419
>>
>> Investigation showed that a NAND read with correctable bit errors
>> returned -EUCLEAN to the UBI read function.
>>
>> I then read
>> https://lists.denx.de/pipermail/u-boot/2013-September/161961.html
>> which details a workaround: do not return EUCLEAN from the NAND
>> reader unless the number of bits fixed during the read exceeds 75%
>> of the total number of correctable bits. This was implemented in
>> this version of UBI in U-Boot 2010.06 and it does hide the infinite
>> bit-flip loop, since this is new NAND flash. The original 2010.06
>> implementation returns EUCLEAN for any number of fixable bit-flips
>> and thus causes the PEB to be moved to the best free one (scrub mode
>> in wear_leveling_worker).
>>
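To make the 75% rule concrete, here is a minimal sketch of the check as
I understand it. The helper name bitflip_read_status and its parameters
are made up for illustration and are not taken from the vendor tree or
the linked patch. I also assume the threshold is inclusive (reaching
75% is enough to report EUCLEAN).

```c
#include <assert.h>
#include <errno.h>

/* EUCLEAN is Linux-specific; fall back to the Linux value elsewhere. */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif

/*
 * Hypothetical helper, for illustration only.
 * corrected:    bit-flips the ECC fixed during this read
 * ecc_strength: maximum number of bit-flips the ECC can fix
 * Returns -EUCLEAN only when corrected reaches 75% of ecc_strength,
 * otherwise 0 so that UBI does not schedule the PEB for scrubbing.
 */
int bitflip_read_status(unsigned int corrected, unsigned int ecc_strength)
{
	/* A correctable read can never fix more bits than the ECC allows. */
	assert(corrected <= ecc_strength);

	/* Integer form of: corrected >= 0.75 * ecc_strength */
	if (corrected > 0 && 4 * corrected >= 3 * ecc_strength)
		return -EUCLEAN;

	return 0;
}
```

With a 4-bit ECC this reports EUCLEAN only from 3 corrected bits
upwards, so the 1 or 2 bit-flips typical of new flash no longer trigger
a scrub.
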
>> This workaround does not fix the root cause, though. Further
>> investigation led to the following root-cause solution, as far as I
>> can tell.
>>
>> The scrubber chooses the PEB to move the data to from the free
>> balanced tree. This tree is sorted by EC (erase count) and then by
>> PEB number.
>>
>> The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which
>> is 8192 in this config. So the find_wl_entry function will find a PEB
>> whose erase count is better than the current PEB's EC. This can
>> easily cause it to find the PEB that was just moved from if it is the
>> lowest numbered PEB in the free tree. Waiting for the EC to go above
>> 8192 would take a long time and cause premature aging of the flash
>> PEBs in question.
>>
>> The easy solution is to change the max parameter of this call to 0 so
>> it finds a PEB with a smaller EC than the one being replaced. This
>> means it won't use the previously discarded PEB as its first choice.
>
> I am not sure if it is so easy ...
This is why I'm asking :-)
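
To show what I mean, here is a small model of the selection rule as I
read it in find_wl_entry: take the lowest EC in the free tree, add max,
and return the entry with the highest EC still below that limit. This
is only a sketch. It uses a sorted array in place of the RB-tree, and
the names wl_entry and pick_free_peb are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the UBI wear-leveling entry. */
struct wl_entry {
	int ec;   /* erase counter */
	int pnum; /* physical eraseblock number */
};

/*
 * Array model of the free-PEB selection. free_pebs[] stands in for the
 * free RB-tree and must be sorted by (ec, pnum) in ascending order.
 * Returns the last entry with ec < lowest_ec + max_diff, or the first
 * (lowest EC) entry if no entry qualifies.
 */
const struct wl_entry *pick_free_peb(const struct wl_entry *free_pebs,
				     size_t n, int max_diff)
{
	const struct wl_entry *best;
	int limit;
	size_t i;

	assert(n > 0);

	best = &free_pebs[0];
	limit = free_pebs[0].ec + max_diff;

	for (i = 1; i < n; i++) {
		if (free_pebs[i].ec < limit)
			best = &free_pebs[i];
	}

	return best;
}
```

Take made-up free entries (EC 3, PEB 100), (EC 5, PEB 200) and the
just-erased (EC 20, PEB 419). With max = 8192 the model picks PEB 419
again, which is the ping-pong seen in the log. With max = 0 it picks
PEB 100, the free PEB with the lowest EC.
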
>
>> This fix was implemented and fixable bit-flip errors no longer
>> hang/freeze the boot process! A UBI erase and reformat was used
>> between re-tests to get consistent results.
>>
>> Adding the above 75% correctable bit-flip threshold is also a good
>> thing, as less movement will ensue while the flash is new. But as the
>> flash ages, the root cause will once again be triggered, causing
>> unrecoverable boot failures.
>>
>> Note this fault is also present in the latest kernel drivers for UBI
>> and may also exist in other wear-leveling implementations. The kernel
>> driver issue may be to blame for Android devices locking up/freezing
>> sporadically during flash reads when scrubbing, due to a relatively
>> full flash and correctable errors causing ping-pong PEB moves.
>>
>> The question is, is my root cause solution sound or have I missed
>> something?
>
> I have to think about it before I write nonsense, but maybe Richard
> has deeper insight here.
>
>> I know an algo change would probably be better or a way to detect
>> move loops to prevent this from occurring, but this solution does
>> work on all the devices that were failing manufacture tests previously.
>
Is there another message board that deals with the MTD UBI driver
specifically?
Thanks
Mark