[U-Boot] Bricked when trying to attach UBI
Luca Ceresoli
luca.ceresoli at comelit.it
Thu Dec 20 17:02:00 CET 2012
Hi,
I'm Cc'ing the linux-mtd list as well as the authors of the Linux
commits cited below.
For these new readers: I reported a problem with U-Boot 2012.04.01 not
being able to attach an UBI partition in NAND, while Linux (2.6.37) can
attach and repair it.
It looks like an U-Boot bug, but I discovered strange things around the
chip->badblockbits variable (in the NAND code) by comparing the
relevant code in U-Boot and Linux.
Sorry for Cc'ing so many people, but following this issue I was lead
from one subsystem to another (and from U-Boot to Linux).
Previous discussion is here:
http://thread.gmane.org/gmane.comp.boot-loaders.u-boot/149624
Luca Ceresoli wrote:
> Hi Andreas,
>
> Andreas Bießmann wrote:
>> Hi Luca,
>>
>> On 19.12.2012 16:56, Luca Ceresoli wrote:
>>> Hi Andreas,
>>>
>>> Andreas Bießmann wrote:
>>> ...
>>>>> Creating 1 MTD partitions on "nand0":
>>>>> 0x000000100000-0x000010000000 : "mtd=3"
>>>>> UBI: attaching mtd1 to ubi0
>>>>> UBI: physical eraseblock size: 131072 bytes (128 KiB)
>>>>> UBI: logical eraseblock size: 129024 bytes
>>>>> UBI: smallest flash I/O unit: 2048
>>>>> UBI: sub-page size: 512
>>>>> UBI: VID header offset: 512 (aligned 512)
>>>>> UBI: data offset: 2048
>>>>> UBI error: ubi_wl_init_scan: no enough physical eraseblocks (0,
>>>>> need 1)
>>>>>
>>>>> Now the device is totally blocked, and power cycling does not change
>>>>> the result.
>>>>
>>>> have you tried to increase the malloc arena in u-boot
>>>> (CONIG_SYS_MALLOC_LEN)?
>>>> We had errors like this before [1],[2] and [3], maybe others -
>>>> apparently with another error message, but please give it a try. We
>>>> know
>>>> ubi recovery needs some ram and 1MiB may be not enough.
>>>
>>> Thanks for your suggestion.
>>>
>>> Unfortunately this does not seem to be the cause of my problem: I tried
>>> increasing my CONFIG_SYS_MALLOC_LEN in include/configs/dig297.h from
>>> (1024 << 10) to both (1024 << 12) and (1024 << 14), but without any
>>> difference.
>>
>> Well, ok ... Malloc arena is always my first thought if I read about
>> problems with ubi in u-boot.
>> Have you looked up the differences in drivers/mtd/ubi/ in your u-boot
>> and linux tree? Maybe you can see something obviously different in the
>> ubi_wl_init_scan()?
>
> I had some days ago, but I double-checked now as you suggested. Indeed
> there is an important difference: attach_by_scanning() (build.c) calls
> ubi_wl_init_scan() and ubi_eba_init_scan() just like Linux does, but in
> a swapped order!
>
> This swap dates back to:
>
> commit d63894654df72b010de2abb4b3f07d0d755f65b6
> Author: Holger Brunck <holger.brunck at keymile.com>
> Date: Mon Oct 10 13:08:19 2011 +0200
>
> UBI: init eba tables before wl when attaching a device
>
> This fixes that u-boot gets stuck when a bitflip was detected
> during "ubi part <ubi_device>". If a bitflip was detected UBI tries
> to copy the PEB to a different place. This needs that the eba table
> are initialized, but this was done after the wear levelling worker
> detects the bitflip. So changes the initialisation of these two
> tasks in u-boot.
>
> This is a u-boot specific patch and not needed in the linux layer,
> because due to commit 1b1f9a9d00447d
> UBI: Ensure that "background thread" operations are really executed
> we schedule these tasks in place and not as in linux after the inital
> task which schedule this new task is finished.
>
> Signed-off-by: Holger Brunck <holger.brunck at keymile.com>
> cc: Stefan Roese <sr at denx.de>
> Signed-off-by: Stefan Roese <sr at denx.de>
>
> I tried reverting that commit and... surprise! U-Boot can now attach UBI
> and boot properly!
>
> But the cited commit actually fixed a bug that bite our board a few
> months back, so it should not be reverted without thinking twice. Now
> it apparently introduced another bug. :-(
>
> I'm Cc:ing the commit author for comments.
>
> Nonetheless, I have evidence of a different behaviour between U-Boot
> and Linux even before the two swapped functions are called.
>
> What attach_by_scanning() does in Linux is (abbreviated):
>
> static int attach_by_scanning(struct ubi_device *ubi)
> {
> si = ubi_scan(ubi);
> ...fill ubi->some_fields...;
> err = ubi_read_volume_table(ubi, si);
> /* MARK */
> err = ubi_eba_init_scan(ubi, si); /* swapped in U-Boot */
> err = ubi_wl_init_scan(ubi, si); /* swapped in U-Boot */
> ubi_scan_destroy_si(si);
> return 0;
> }
>
> See the two swapped calls.
>
> At MARK, I printed some of the peb counters in *ubi, and I got
> different results for ubi->avail_pebs between U-Boot and Linux:
> U-Boot: UBI: POST_TBL: rsvd=2018, avail=21, beb_rsvd_{pebs,level}=0,0
> Linux: UBI: POST_TBL: rsvd=2018, avail=22, beb_rsvd_{pebs,level}=0,0
>
> The printed values were equal before calling ubi_read_volume_table().
> I have no idea about where this difference comes from, nor if this
> difference can cause my troubles.
> I will better investigate tomorrow looking into ubi_read_volume_table().
After half a day of debugging and an insane amount of printk()s added to
both U-Boot and Linux, I have some more hints to understand the problem.
The two different results quoted above show that U-Boot counted 21
available eraseblocks, while Linux counts 22. I am not sure if this can
cause my problem, but it's the first visible difference between U-Boot
and Linux.
This originates from ubi_scan() (scan.c): in U-Boot, it sets
si->bad_peb_count to 1, in Linux to 0. U-Boot's ubi_scan() is very
similar to Linux's, and the differences do not seem to relevant in my
case. So let's dig down...
- ubi_scan() (scan.c) calls process_eb() (scan.c) for each EB
- process_eb() calls ubi_io_is_bad() (io.c), and if it returns >0 it
increments si->bad_peb_count, which is what is happening to my board
when executing U-Boot
- ubi_io_is_bad() calls mtd->block_isbad(), which points to
nand_block_isbad() (nand_base.c)
- nand_block_isbad() is a wrapper to nand_block_checkbad() (nand_base.c)
- nand_block_checkbad() differs from the Linux code in something
related to lazy bad block scanning (commit fb49454b1b6c7c6, Feb 2012),
but this does not seem to change the behaviour I observe;
- nand_block_checkbad() calls either chip->block_bad() or
nand_isbad_bbt(); I tracked only into the former, but I suspect the
latter produces the same effects with regard to the problem I'm facing
- chip->block_bad() points to nand_block_bad() (nand_base.c)
nand_block_bad() (nand_base.c) does the following:
static int nand_block_bad(struct mtd_info *mtd, loff_t ofs, int getchip)
{
...
if (likely(chip->badblockbits == 8))
res = bad != 0xFF;
else
res = hweight8(bad) < chip->badblockbits;
if (getchip)
nand_release_device(mtd);
return res;
}
I don't understand the algorithm, but the relevant variables have these
values:
U-Boot: nand_block_bad: chip->badblockbits=8, bad=0000, hweight8(bad)=0
Linux: nand_block_bad: chip->badblockbits=0, bad=0000, hweight8(bad)=0
^
Obviously the U-Boot and Linux produce a different return value.
This propagates up to ubi->bad_peb_count in ubi_scan(), and from there
it changes the behaviour of the following code, leading to a block in
U-Boot and a successful attach in Linux.
chip->badblockbits in current Linux master is described as
"minimum number of set bits in a good block's bad block marker
position; i.e., BBM == 11110111b is not bad when badblockbits == 7".
Still a bit obscure to me because I don't have a general picture.
Anyway, here's how its value comes to be different between U-Boot
(2012.04.01) and Linux (2.6.37).
Linux:
a) commit e0b58d0a7005, Feb 2010:
mtd: nand: add ->badblockbits for minimum number of set bits in bad
block byte
declared the new variable and introduced in nand_get_flash_type()
(nand_base.c) the following line:
chip->badblockbits = 8;
b) commit c7b28e25cb9, Jul 2010:
mtd: nand: refactor BB marker detection
removed from nand_get_flash_type() (nand_base.c) the same line:
chip->badblockbits = 8;
c) commit 26d9be11485e, Apr 2011:
mtd: return badblockbits back
restored in nand_get_flash_type() (nand_base.c) the following line:
chip->badblockbits = 8;
claiming it had been accidentally removed in commit b).
The version of Linux I'm using (2.6.37), contains commits a) and b), so
it has chip->badblockbits equal to 0. According to the log message of
commit c), this should be wrong, but the resulting kernel works!
The version of U-Boot (2012.04.01) contains the result of all 3 commits,
since
commit 2a8e0fc8b3dc31a3c571e439fbf04b882c8986be
Author: Christian Hitz <christian.hitz at aizo.com>
Date: Wed Oct 12 09:32:02 2011 +0200
nand: Merge changes from Linux nand driver
[backport from linux commit
02f8c6aee8df3cdc935e9bdd4f2d020306035dbe]
This patch synchronizes the nand driver with the Linux 3.0 state.
This looks like an improvement, but it bricks my board!
I could not resist, and without even trying to understand what I was
doing, I did in U-Boot's nand_get_flash_type() (nand_base.c):
- chip->badblockbits = 8;
+ chip->badblockbits = 0;
and guess what? U-Boot attached UBI, loaded Linux from it and booted
successfully!
No, I don't think changing lines here and there without any real
understanding is a way to produce reliable software. But I'm unable
to understand why the software that should work better actually bricks
the board and the other one runs fine? And how do I know what the
correct value for chip->badblockbits should be?
And last but most important: how can I properly fix U-Boot?
Thanks,
Luca
More information about the U-Boot
mailing list