[U-Boot] Bricked when trying to attach UBI

Wed Jan 2 15:37:21 CET 2013

Luca Ceresoli wrote:
> Hi,
>
> I'm Cc'ing the linux-mtd list as well as the authors of the Linux
> commits cited below.
>
> For these new readers: I reported a problem with U-Boot 2012.04.01 not
> being able to attach an UBI partition in NAND, while Linux (2.6.37) can
> attach and repair it.
>
> It looks like an U-Boot bug, but I discovered strange things around the
> chip->badblockbits variable (in the NAND code) by comparing the
> relevant code in U-Boot and Linux.
>
> Sorry for Cc'ing so many people, but following this issue I was lead
> from one subsystem to another (and from U-Boot to Linux).
>
> Previous discussion is here:
> http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/149624
>
> Luca Ceresoli wrote:
>> Hi Andreas,
>>
>> Andreas Bießmann wrote:
>>> Hi Luca,
>>>
>>> On 19.12.2012 16:56, Luca Ceresoli wrote:
>>>> Hi Andreas,
>>>>
>>>> Andreas Bießmann wrote:
>>>> ...
>>>>>> Creating 1 MTD partitions on "nand0":
>>>>>> 0x000000100000-0x000010000000 : "mtd=3"
>>>>>> UBI: attaching mtd1 to ubi0
>>>>>> UBI: physical eraseblock size:   131072 bytes (128 KiB)
>>>>>> UBI: logical eraseblock size:    129024 bytes
>>>>>> UBI: smallest flash I/O unit:    2048
>>>>>> UBI: sub-page size:              512
>>>>>> UBI: VID header offset:          512 (aligned 512)
>>>>>> UBI: data offset:                2048
>>>>>> UBI error: ubi_wl_init_scan: no enough physical eraseblocks (0,
>>>>>> need 1)
>>>>>>
>>>>>> Now the device is totally blocked, and power cycling does not change
>>>>>> the result.
>>>>>
>>>>> have you tried to increase the malloc arena in u-boot
>>>>> (CONIG_SYS_MALLOC_LEN)?
>>>>> We had errors like this before [1],[2] and [3], maybe others -
>>>>> apparently with another error message, but please give it a try. We
>>>>> know
>>>>> ubi recovery needs some ram and 1MiB may be not enough.
>>>>
>>>> Thanks for your suggestion.
>>>>
>>>> Unfortunately this does not seem to be the cause of my problem: I 
>>>> tried
>>>> increasing my CONFIG_SYS_MALLOC_LEN in include/configs/dig297.h from
>>>> (1024 << 10) to both (1024 << 12) and (1024 << 14), but without any
>>>> difference.
>>>
>>> Well, ok ... Malloc arena is always my first thought if I read about
>>> problems with ubi in u-boot.
>>> Have you looked up the differences in drivers/mtd/ubi/ in your u-boot
>>> and linux tree? Maybe you can see something obviously different in the
>>> ubi_wl_init_scan()?
>>
>> I had some days ago, but I double-checked now as you suggested. Indeed
>> there is an important difference: attach_by_scanning() (build.c) calls
>> ubi_wl_init_scan() and ubi_eba_init_scan() just like Linux does, but in
>> a swapped order!
>>
>> This swap dates back to:
>>
>> commit d63894654df72b010de2abb4b3f07d0d755f65b6
>> Author: Holger Brunck <holger.brunck at keymile.com>
>> Date:   Mon Oct 10 13:08:19 2011 +0200
>>
>>      UBI: init eba tables before wl when attaching a device
>>
>>      This fixes that u-boot gets stuck when a bitflip was detected
>>      during "ubi part <ubi_device>". If a bitflip was detected UBI tries
>>      to copy the PEB to a different place. This needs that the eba table
>>      are initialized, but this was done after the wear levelling worker
>>      detects the bitflip. So changes the initialisation of these two
>>      tasks in u-boot.
>>
>>      This is a u-boot specific patch and not needed in the linux layer,
>>      because due to commit 1b1f9a9d00447d
>>      UBI: Ensure that "background thread" operations are really executed
>>      we schedule these tasks in place and not as in linux after the 
>> inital
>>      task which schedule this new task is finished.
>>
>>      Signed-off-by: Holger Brunck <holger.brunck at keymile.com>
>>      cc: Stefan Roese <sr at denx.de>
>>      Signed-off-by: Stefan Roese <sr at denx.de>
>>
>> I tried reverting that commit and... surprise! U-Boot can now attach UBI
>> and boot properly!
>>
>> But the cited commit actually fixed a bug that bite our board a few
>> months back, so it should not be reverted without thinking twice. Now
>> it apparently introduced another bug. :-(
>>
>> I'm Cc:ing the commit author for comments.
>>
>> Nonetheless, I have evidence of a different behaviour between U-Boot
>> and Linux even before the two swapped functions are called.
>>
>> What attach_by_scanning() does in Linux is (abbreviated):
>>
>> static int attach_by_scanning(struct ubi_device *ubi)
>> {
>>          si = ubi_scan(ubi);
>>      ...fill ubi->some_fields...;
>>          err = ubi_read_volume_table(ubi, si);
>>      /* MARK */
>>          err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */
>>          err = ubi_wl_init_scan(ubi, si);  /* swapped in U-Boot */
>>          ubi_scan_destroy_si(si);
>>          return 0;
>> }
>>
>> See the two swapped calls.
>>
>> At MARK, I printed some of the peb counters in *ubi, and I got
>> different results for ubi->avail_pebs between U-Boot and Linux:
>> U-Boot: UBI: POST_TBL: rsvd=2018, avail=21, beb_rsvd_{pebs,level}=0,0
>> Linux:  UBI: POST_TBL: rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0
>>
>> The printed values were equal before calling ubi_read_volume_table().
>> I have no idea about where this difference comes from, nor if this
>> difference can cause my troubles.
>> I will better investigate tomorrow looking into ubi_read_volume_table().
>
> After half a day of debugging and an insane amount of printk()s added to
> both U-Boot and Linux, I have some more hints to understand the problem.
>
> The two different results quoted above show that U-Boot counted 21
> available eraseblocks, while Linux counts 22. I am not sure if this can
> cause my problem, but it's the first visible difference between U-Boot
> and Linux.
>
> This originates from ubi_scan() (scan.c): in U-Boot, it sets
> si->bad_peb_count to 1, in Linux to 0. U-Boot's ubi_scan() is very
> similar to Linux's, and the differences do not seem to relevant in my 
> case. So let's dig down...
>
> - ubi_scan() (scan.c) calls process_eb() (scan.c) for each EB
> - process_eb() calls ubi_io_is_bad() (io.c), and if it returns >0 it
>   increments si->bad_peb_count, which is what is happening to my board
>   when executing U-Boot
> - ubi_io_is_bad() calls mtd->block_isbad(), which points to
>   nand_block_isbad() (nand_base.c)
> - nand_block_isbad() is a wrapper to nand_block_checkbad() (nand_base.c)
> - nand_block_checkbad() differs from the Linux code in something
>   related to lazy bad block scanning (commit fb49454b1b6c7c6, Feb 2012),
>   but this does not seem to change the behaviour I observe;
> - nand_block_checkbad() calls either chip->block_bad() or
>   nand_isbad_bbt(); I tracked only into the former, but I suspect the
>   latter produces the same effects with regard to the problem I'm facing
> - chip->block_bad() points to nand_block_bad() (nand_base.c)
>
> nand_block_bad() (nand_base.c) does the following:
> static int nand_block_bad(struct mtd_info *mtd, loff_t ofs, int getchip)
> {
>     ...
>
>         if (likely(chip->badblockbits == 8))
>                 res = bad != 0xFF;
>         else
>                 res = hweight8(bad) < chip->badblockbits;
>
>         if (getchip)
>                 nand_release_device(mtd);
>
>         return res;
> }
>
> I don't understand the algorithm, but the relevant variables have these
> values:
> U-Boot: nand_block_bad: chip->badblockbits=8, bad=0000, hweight8(bad)=0
> Linux:  nand_block_bad: chip->badblockbits=0, bad=0000, hweight8(bad)=0
>                                            ^
>
> Obviously the U-Boot and Linux produce a different return value.
> This propagates up to ubi->bad_peb_count in ubi_scan(), and from there
> it changes the behaviour of the following code, leading to a block in
> U-Boot and a successful attach in Linux.
>
> chip->badblockbits in current Linux master is described as
> "minimum number of set bits in a good block's bad block marker
> position; i.e., BBM == 11110111b is not bad when badblockbits == 7".
>
> Still a bit obscure to me because I don't have a general picture.
> Anyway, here's how its value comes to be different between U-Boot
> (2012.04.01) and Linux (2.6.37).
>
> Linux:
> a) commit e0b58d0a7005, Feb 2010:
>    mtd: nand: add ->badblockbits for minimum number of set bits in bad
>    block byte
>    declared the new variable and introduced in nand_get_flash_type()
>    (nand_base.c) the following line:
>      chip->badblockbits = 8;
> b) commit c7b28e25cb9, Jul 2010:
>    mtd: nand: refactor BB marker detection
>    removed from nand_get_flash_type() (nand_base.c) the same line:
>      chip->badblockbits = 8;
> c) commit 26d9be11485e, Apr 2011:
>    mtd: return badblockbits back
>    restored in nand_get_flash_type() (nand_base.c) the following line:
>      chip->badblockbits = 8;
>   claiming it had been accidentally removed in commit b).
>
> The version of Linux I'm using (2.6.37), contains commits a) and b), so
> it has chip->badblockbits equal to 0. According to the log message of
> commit c), this should be wrong, but the resulting kernel works!
>
> The version of U-Boot (2012.04.01) contains the result of all 3 commits,
> since
>
>   commit 2a8e0fc8b3dc31a3c571e439fbf04b882c8986be
>   Author: Christian Hitz <christian.hitz at aizo.com>
>   Date:   Wed Oct 12 09:32:02 2011 +0200
>
>       nand: Merge changes from Linux nand driver
>
>       [backport from linux commit
>           02f8c6aee8df3cdc935e9bdd4f2d020306035dbe]
>
>       This patch synchronizes the nand driver with the Linux 3.0 state.
>
> This looks like an improvement, but it bricks my board!
>
> I could not resist, and without even trying to understand what I was
> doing, I did in U-Boot's nand_get_flash_type() (nand_base.c):
>
> -       chip->badblockbits = 8;
> +       chip->badblockbits = 0;
>
> and guess what? U-Boot attached UBI, loaded Linux from it and booted
> successfully!
>
> No, I don't think changing lines here and there without any real
> understanding is a way to produce reliable software. But I'm unable
> to understand why the software that should work better actually bricks
> the board and the other one runs fine? And how do I know what the
> correct value for chip->badblockbits should be?
>
> And last but most important: how can I properly fix U-Boot?

I had another look at the commit that swapped the calls to
ubi_eba_init_scan() and ubi_wl_init_scan(), and I noticed that it 
changed the
computationof the available PEB count.

In the original (pre-swap) code, running on a working board:

static int attach_by_scanning(struct ubi_device *ubi)
{
          si = ubi_scan(ubi);
      ...fill ubi->some_fields...;
          err = ubi_read_volume_table(ubi, si);

            /* here rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0 */

          err = ubi_wl_init_scan(ubi, si);  /* swapped in U-Boot */

            /* herersvd=2019, avail=21, beb_rsvd_{pebs,level}=0,0 ***** */

          err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */
          ubi_scan_destroy_si(si);
          return 0;
}

In the current (post-swap) code, running on the same board:

static int attach_by_scanning(struct ubi_device *ubi)
{
          si = ubi_scan(ubi);
      ...fill ubi->some_fields...;
          err = ubi_read_volume_table(ubi, si);

            /* here rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0 */

          err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */

            /* here rsvd=2039, avail=1, beb_rsvd_{pebs,level}=20,20***** */

          err = ubi_wl_init_scan(ubi, si);  /* swapped in U-Boot */
          ubi_scan_destroy_si(si);
          return 0;
}

Notice the difference on the line marked with "*****": after the swap, the
number of available PEBs changed from 21to 1.

According to the docs, UBI reserves some PEBs for bad PEB handling. By
default, in my 2048-PEBs NAND, it reserved 20 PEBs, wihch are far enough to
recover from a few bad PEBs. These should be computed as part of the 
"available"
PEBs. But current U-Boot (incorrectly?) thinks thereis only 1 available PEB.
On a bricked board, it thinks there are 0, so it cannotattach UBI.

I have no fix for this, but I tried a simple workaround: instead of 
using all
the available space for my logical volumes, I created them with a smaller
size, leaving 32 unused PEBs. Now, in attach_by_scanning(), I got:

pre-swap:  rsvd=1987, avail=53, beb_rsvd_{pebs,level}=0,0
post-swap: rsvd=2007, avail=33, beb_rsvd_{pebs,level}=20,20

The computed number of available PEB is exactly 32 units bigger than it used
to be. This means, also after the swap, U-Boot thinks there are plenty of
available PEBs.

To try to simulate a board that has bad blocks, I then marked some blocks as
bad using 'nand markbad' in U-Boot. The number of available PEBs decreases
accordingly, but is still >0 and U-Boot can attach UBI and boot.

So, it seems that leaving some unused PEBs is a workaround to this problem!
I'm not 100% sure this is ok and will go on to better understand the 
problem.
Any comments are welcome.

Luca