[U-Boot] UBI fixable bit-flip issue

Heiko Schocher hs at denx.de
Thu Jul 12 08:08:16 UTC 2018


Hello Mark,

On 12.07.2018 at 07:38, Mark Spieth wrote:
> 
> On 12/07/18 15:22, Heiko Schocher wrote:
>> Hello Mark,
>>
>> added Richard Weinberger to cc...
>>
>> On 12.07.2018 at 02:28, Mark Spieth wrote:
>>> Hi
>>>
>>> In the process of investigating a boot failure on one of our devices, the
>>>
>>> UBI: fixable bit-flip detected at PEB
>>>
>>> message was seen with the following behaviour during kernel load in u-boot.
>>>
>>> Read [2285568] bytes
>>> UBI: fixable bit-flip detected at PEB 415
>>> UBI: schedule PEB 415 for scrubbing
>>> UBI: fixable bit-flip detected at PEB 415
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: schedule PEB 419 for scrubbing
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: schedule PEB 420 for scrubbing
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI: fixable bit-flip detected at PEB 419
>>>
>>> This repeats until reset.
>>>
>>> U-Boot is a patched version of 2010.06 supplied by the chip vendor. No newer version is available 
>>> from the vendor to try.
>>
>> :-(
>>
>> Can you use current mainline? It's hard to say anything
>> about an 8-year-old vendor U-Boot version ...
> I know. I did look at the current 2018.07 and 2014.10 for comparison.
> 
> There are many patches applied by the vendor so porting them with the large changes to driver 
> structure would be difficult and time consuming.
> The vendor is Lantiq and the SDK is current (this year).
>>
>>> The patches include the init eba/wl swap.
>>
>> What do you mean here?
> https://lists.denx.de/pipermail/u-boot/2013-January/143199.html
> This patch was already applied by the vendor.
> 
> ubi_eba_init_scan() must be called before ubi_wl_init_scan(), and in that baseline they were the 
> wrong way around.
> 
> There is only one other thread about fixable bit flips (2011), and it was not useful for this 
> problem.
>>
>>> A more detailed log with debugging available follows:
>>>
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 19, torture 0
>>> UBI DBG: erase_worker: erase PEB 419 EC 19
>>> UBI DBG: sync_erase: erase PEB 419, old EC 19
>>> UBI DBG: do_sync_erase: erase PEB 419
>>> UBI DBG: sync_erase: erased PEB 419, new EC 20
>>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>>> UBI: fixable bit-flip detected at PEB 419
>>> UBI DBG: schedule_erase: schedule erasure of PEB 419, EC 20, torture 0
>>> UBI DBG: erase_worker: erase PEB 419 EC 20
>>> UBI DBG: sync_erase: erase PEB 419, old EC 20
>>> UBI DBG: do_sync_erase: erase PEB 419
>>> UBI DBG: sync_erase: erased PEB 419, new EC 21
>>> UBI DBG: ubi_io_write_ec_hdr: write EC header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:0
>>> UBI DBG: ensure_wear_leveling: schedule scrubbing
>>> UBI DBG: wear_leveling_worker: scrub PEB 420 to PEB 419
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 420
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 420:2048
>>> UBI DBG: ubi_eba_copy_leb: copy LEB 6:11, PEB 420 to PEB 419
>>> UBI DBG: ubi_eba_copy_leb: read 126976 bytes of data
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 420:4096
>>> UBI: fixable bit-flip detected at PEB 420
>>> UBI DBG: ubi_io_write_vid_hdr: write VID header to PEB 419
>>> UBI DBG: ubi_io_write: write 2048 bytes to PEB 419:2048
>>> UBI DBG: ubi_io_read_vid_hdr: read VID header from PEB 419
>>> UBI DBG: ubi_io_read: read 2048 bytes from PEB 419:2048
>>> UBI DBG: ubi_io_write: write 126976 bytes to PEB 419:4096
>>> UBI DBG: ubi_io_read: read 126976 bytes from PEB 419:4096
>>> UBI: fixable bit-flip detected at PEB 419
>>>
>>> Investigation showed that a read with correctable bit errors was done returning -EUCLEAN to the 
>>> ubi read function.
>>>
>>> I then read https://lists.denx.de/pipermail/u-boot/2013-September/161961.html, which details a 
>>> workaround: the NAND reader does not return EUCLEAN unless the number of corrected bits exceeds 
>>> 75% of the total number of correctable bits for the read. This was implemented 
>>> in this version of UBI in U-Boot 2010.06, and it does hide the infinite bit-flip loop, since this 
>>> is new NAND flash. The original 2010.06 implementation returns EUCLEAN for any number of fixable 
>>> bit flips and thus causes the PEB to be moved to the best free one (scrub mode in wear_leveling_worker).
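
The threshold idea described above can be sketched in a few lines. This is only an illustrative model, not the vendor or mainline code: the function name needs_scrub, its parameters, the rounding, and the ">=" comparison are all assumptions made for the example.

```python
def needs_scrub(corrected_bits, ecc_strength, percent=75):
    """Hypothetical helper: decide whether a read should be reported as EUCLEAN.

    corrected_bits: number of bit flips the ECC fixed during this read
    ecc_strength:   maximum number of bit flips the ECC can fix
    percent:        threshold as a percentage of ecc_strength (75 in the post)
    """
    # Round the threshold up so that a tiny ecc_strength never yields 0.
    threshold = (ecc_strength * percent + 99) // 100
    # Assumption: reaching the threshold is enough to request scrubbing.
    return corrected_bits >= threshold


# With 4-bit ECC the threshold is 3 corrected bits.
print(needs_scrub(1, 4))  # a single fixed bit is ignored
print(needs_scrub(3, 4))  # three fixed bits trigger scrubbing
```

Without such a threshold every corrected read, even one flipped bit, requests a scrub, which is the behaviour described for the original 2010.06 code.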
>>>
>>> This fix is not a root cause fix though. Investigating further led to the following root cause 
>>> solution. The following is AFAICT.
>>>
>>> The scrubber chooses a PEB to move the data to from the free balanced tree. This tree is sorted by 
>>> EC (erase count) and then by PEB number.
>>>
>>> The find_wl_entry call uses a max parameter of WL_FREE_MAX_DIFF, which is 8192 in this config. So 
>>> the find_wl_entry function will find a PEB that is better in erase count than the current PEB's EC. 
>>> This can easily cause it to find the PEB that was just moved from if it is the lowest numbered 
>>> PEB in the free tree. Waiting for EC to go above 8192 would take a long time and cause premature 
>>> aging of the flash PEBs in question.
>>>
>>> The easy solution is to change the max parameter to this call to 0 so it finds a PEB with a 
>>> smaller EC than the one being replaced. This means it won't use the previously discarded PEB as 
>>> its first choice.
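
To make the effect of the max parameter concrete, here is a small model of the selection step. It is a sketch under assumptions: a sorted list stands in for the red-black tree, the selection rule (take the last entry whose EC is below "lowest EC + max") is a simplified reading of find_wl_entry rather than a copy of it, and the PEB numbers and erase counts other than PEB 419 are invented.

```python
WL_FREE_MAX_DIFF = 8192  # value quoted in the post for this configuration


def find_wl_entry(free, max_diff):
    """Simplified model of the free-PEB selection.

    free:     list of (ec, pnum) tuples for the free PEBs
    max_diff: the 'max' parameter discussed in the post
    Returns the chosen (ec, pnum) tuple.
    """
    ordered = sorted(free)               # ordered by EC, then by PEB number
    limit = ordered[0][0] + max_diff     # lowest EC in the tree plus max
    chosen = ordered[0]                  # fall back to the lowest-EC entry
    for entry in ordered:
        if entry[0] < limit:             # keep the last entry below the limit
            chosen = entry
    return chosen


# PEB 419 was just erased (EC 21); the other two entries are made up.
free = [(3, 100), (5, 101), (21, 419)]

print(find_wl_entry(free, WL_FREE_MAX_DIFF))  # picks the just-erased PEB 419
print(find_wl_entry(free, 0))                 # picks the lowest-EC PEB instead
```

In this model a large max lets the freshly erased PEB win as long as its EC stays below the limit, so the same pair of PEBs can be chosen over and over, while max = 0 always falls back to the lowest-EC entry.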
>>
>>  I am not sure if it is so easy ...
> This is why I'm asking :-)
>>
>>> This fix was implemented and fixable bit-flip errors no longer hang/freeze the boot process! UBI 
>>> erase and reformat was used between re-tests to get consistent results.
>>>
>>> Adding the above 75% correctable bit-flip threshold is also a good thing, as less movement will 
>>> ensue when the flash is new, but as the flash ages, the root cause will once again be triggered, 
>>> causing unrecoverable boot failures.
>>>
>>> Note this fault is also in the latest kernel drivers for UBI and may also exist in other wear 
>>> leveling implementations. The kernel driver issue may be the reason Android devices lock 
>>> up or freeze sporadically during flash reads when scrubbing, due to a relatively full flash and 
>>> correctable errors causing ping-pong PEB moves.
>>>
>>> The question is, is my root cause solution sound or have I missed something?
>>
>> I have to think about it before I write nonsense, but maybe Richard has
>> deeper insight here.
>>
>>> I know an algo change would probably be better or a way to detect move loops to prevent this from 
>>> occurring, but this solution does work on all the devices that were failing manufacture tests 
>>> previously.
>>
> Is there another message board that deals with the mtd ubi driver specifically?

Yes of course ...

bye,
Heiko
-- 
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: +49-8142-66989-52   Fax: +49-8142-66989-80   Email: hs at denx.de
