data abort when run 'dhcp'
Miquel Raynal
miquel.raynal at bootlin.com
Thu Jul 20 18:39:17 CEST 2023
Hello,
qianfanguijin at 163.com wrote on Fri, 25 Mar 2022 18:04:46 +0800:
> It's very strange. And I can't detect it's a bug of usb or dlmalloc.
>
> 1. Starting u-boot and dhcp via am335x's ethernet(cpsw driver), it's ok.
>
> 2. Starting u-boot and dhcp via am335x's usb net, data abort.
>
> 3. start fastboot, and CTRL C right now, dhcp via am335x's usb net, it's ok.
I am sorry to re-open a thread that is one year old but this is
still an open bug. The BBB is affected. In particular the BBBW
because there is no Ethernet connector, which makes the Eth-over-USB
emulation even more important. All U-Boots since 2021 are affected:
spurious data aborts, usually at the end of network interactions (tftp,
ping). I could not bisect it because the boot was deeply broken as
well on a significant range of commits :-/.
On my side I narrowed it down to an env update which fails in malloc as
well. If I comment the env update, it fails a bit later. It really
looks like a stack corruption which is either related to the Ethernet
USB gadget or the USB controller driver itself. Network transfers on
the BBBW using regular Ethernet does not trigger any error.
I also observe the very strange "fix" mentioned above: starting and
killing fastboot makes all tftp pass... If anyone has more details to
share, or perhaps a subsequent thread giving more details, I would
really like to see this fixed upstream, I suppose I am not the only one
:-)
Thanks,
Miquèl
>
> 在 2022/3/24 17:33, qianfan 写道:
> >
> > 在 2022/3/23 18:12, Heinrich Schuchardt 写道:
> >> On 3/23/22 11:07, qianfan wrote:
> >>>
> >>> 在 2022/3/23 17:51, Heinrich Schuchardt 写道:
> >>>> On 3/23/22 10:13, qianfan wrote:
> >>>>>
> >>>>> 在 2022/3/23 16:02, qianfan 写道:
> >>>>>>
> >>>>>>
> >>>>>> 在 2022/3/23 15:45, qianfan 写道:
> >>>>>>>
> >>>>>>>
> >>>>>>> 在 2022/3/23 10:28, qianfan 写道:
> >>>>>>>>
> >>>>>>>> Hi:
> >>>>>>>>
> >>>>>>>> I had a custom AM335X board connected my computer by usbnet. It
> >>>>>>>> always report data abort when 'dhcp':
> >>>>>>>>
> >>>>>>>> Next it the log:
> >>>>>>>>
> >>>>>>>> U-Boot 2022.01-rc1-00183-gfa5b4e2d19-dirty (Feb 25 2022 - 15:45:02
> >>>>>>>> +0800)
> >>>>>>>>
> >>>>>>>> CPU : AM335X-GP rev 2.1
> >>>>>>>> Model: WISDOM AM335X CCT
> >>>>>>>> DRAM: 512 MiB
> >>>>>>>> NAND: 256 MiB
> >>>>>>>> MMC: OMAP SD/MMC: 0
> >>>>>>>> Loading Environment from NAND... *** Warning - bad CRC, using
> >>>>>>>> default environment
> >>>>>>>>
> >>>>>>>> Net: Could not get PHY for ethernet at 4a100000: addr 0
> >>>>>>>> eth2: ethernet at 4a100000, eth3: usb_ether
> >>>>>>>> Hit any key to stop autoboot: 0
> >>>>>>>> => setenv autoload no
> >>>>>>>> => dhcp
> >>>>>>>> using musb-hdrc, OUT ep1out IN ep1in STATUS ep2in
> >>>>>>>> MAC de:ad:be:ef:00:01
> >>>>>>>> HOST MAC de:ad:be:ef:00:00
> >>>>>>>> RNDIS ready
> >>>>>>>> musb-hdrc: peripheral reset irq lost!
> >>>>>>>> high speed config #2: 2 mA, Ethernet Gadget, using RNDIS
> >>>>>>>> USB RNDIS network up!
> >>>>>>>> BOOTP broadcast 1
> >>>>>>>> BOOTP broadcast 2
> >>>>>>>> BOOTP broadcast 3
> >>>>>>>> DHCP client bound to address 192.168.200.4 (757 ms)
> >>>>>>>> data abort
> >>>>>>>> pc : [<9fe9b0a2>] lr : [<9febbc3f>]
> >>>>>>>> reloc pc : [<808130a2>] lr : [<80833c3f>]
> >>>>>>>> sp : 9de53410 ip : 9de53578 fp : 00000001
> >>>>>>>> r10: 9de5345c r9 : 9de67e80 r8 : 9febbae5
> >>>>>>>> r7 : 9de72c30 r6 : 9feec710 r5 : 0000000d r4 : 00000018
> >>>>>>>> r3 : 3fdd8e04 r2 : 00000002 r1 : 9feec728 r0 : 9feec700
> >>>>>>>> Flags: Nzcv IRQs off FIQs on Mode SVC_32 (T)
> >>>>>>>> Code: f023 0303 60ca 4403 (6091) 685a
> >>>>>>>> Resetting CPU ...
> >>>>>>>>
> >>>>>>>> resetting ...
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It's there has any doc about how to debug data abort? Or is the bug
> >>>>>>>> is already fixed?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>>
> >>>>>>> This bug doesn't fixed on master code. I found v2021.01 is good and
> >>>>>>> v2021.04-rc2 is bad.
> >>>>>>>
> >>>>>>> Also I had tested this on beaglebone black with am335x_evm_defconfig,
> >>>>>>> has the simliar problem.
> >>>>>>>
> >>>>>>> find the first bug commit via 'git bisect': it told me that commit
> >>>>>>> e97eb638de0dc8f6e989e20eaeb0342f103cb917 broke it. But it is very
> >>>>>>> strange due to this commit doesn't touch any dhcp or network code.
> >>>>>>>
> >>>>>>> ➜ u-boot-main git:(e97eb638de) ✗ git bisect bug
> >>>>>>> e97eb638de0dc8f6e989e20eaeb0342f103cb917 is the first bug commit
> >>>>>>> commit e97eb638de0dc8f6e989e20eaeb0342f103cb917
> >>>>>>> Author: Heinrich Schuchardt <xypron.glpk at gmx.de>
> >>>>>>> Date: Wed Jan 20 22:21:53 2021 +0100
> >>>>>>>
> >>>>>>> fs: fat: consistent error handling for flush_dir()
> >>>>>>>
> >>>>>>> Provide function description for flush_dir().
> >>>>>>> Move all error messages for flush_dir() from the callers to the
> >>>>>>> function.
> >>>>>>> Move mapping of errors to -EIO to the function.
> >>>>>>> Always check return value of flush_dir() (Coverity CID 316362).
> >>>>>>>
> >>>>>>> In fat_unlink() return -EIO if flush_dirty_fat_buffer() fails.
> >>>>>>>
> >>>>>>> Signed-off-by: Heinrich Schuchardt <xypron.glpk at gmx.de>
> >>>>>>>
> >>>>>>> :040000 040000 2281a449f2d134078d7faa1ee735a367b55aad7e
> >>>>>>> 77d188b1c99181fd71f2167fdeee3434a09db209 M fs
> >>>>>>>
> >>>>>>>
> >>>>>>> 184aa6504143b452132e28cd3ebecc7b941cdfa1 is the first commit before
> >>>>>>> e97eb638de0dc8f6e989e20eaeb0342f103cb917:
> >>>>>>>
> >>>>>>> * e97eb638de0dc8f6e989e20eaeb0342f103cb917 fs: fat: consistent error
> >>>>>>> handling for flush_dir()
> >>>>>>> * 184aa6504143b452132e28cd3ebecc7b941cdfa1 Merge tag
> >>>>>>> 'u-boot-rockchip-20210121' of
> >>>>>>> https://gitlab.denx.de/u-boot/custodians/u-boot-rockchip
> >>>>>>> |\
> >>>>>>> | * 9ddc0787bd660214366e386ce689dd78299ac9d0 pci: Add Rockchip dwc
> >>>>>>> based PCIe controller driver
> >>>>>>>
> >>>>>>> I checked 184aa6504143b452132e28cd3ebecc7b941cdfa1 can work fine.
> >>>>>>>
> >>>>>>> U-Boot 2021.01-00688-g184aa65041-dirty (Mar 23 2022 - 15:07:56 +0800)
> >>>>>>>
> >>>>>>> CPU : AM335X-GP rev 2.1
> >>>>>>> Model: TI AM335x BeagleBone Black
> >>>>>>> DRAM: 512 MiB
> >>>>>>> WDT: Started with servicing (60s timeout)
> >>>>>>> NAND: 0 MiB
> >>>>>>> MMC: OMAP SD/MMC: 0, OMAP SD/MMC: 1
> >>>>>>> Loading Environment from FAT... <ethaddr> not set. Validating first
> >>>>>>> E-fuse MAC
> >>>>>>> Net: eth2: ethernet at 4a100000, eth3: usb_ether
> >>>>>>> Hit any key to stop autoboot: 0
> >>>>>>> => dhcp
> >>>>>>> ethernet at 4a100000 Waiting for PHY auto negotiation to
> >>>>>>> complete......... TIMEOUT !
> >>>>>>> using musb-hdrc, OUT ep1out IN ep1in STATUS ep2in
> >>>>>>> MAC de:ad:be:ef:00:01
> >>>>>>> HOST MAC de:ad:be:ef:00:00
> >>>>>>> RNDIS ready
> >>>>>>> musb-hdrc: peripheral reset irq lost!
> >>>>>>> high speed config #2: 2 mA, Ethernet Gadget, using RNDIS
> >>>>>>> USB RNDIS network up!
> >>>>>>> BOOTP broadcast 1
> >>>>>>> BOOTP broadcast 2
> >>>>>>> BOOTP broadcast 3
> >>>>>>> DHCP client bound to address 192.168.200.157 (757 ms)
> >>>>>>> Using usb_ether device
> >>>>>>> TFTP from server 192.168.200.1; our IP address is 192.168.200.157
> >>>>>>> Filename 'u-boot.img'.
> >>>>>>> Load address: 0x82000000
> >>>>>>> Loading:
> >>>>>>> #################################################################
> >>>>>>> #################################################################
> >>>>>>> #################################################################
> >>>>>>> #########################
> >>>>>>> 2.5 MiB/s
> >>>>>>> done
> >>>>>>> Bytes transferred = 1123888 (112630 hex)
> >>>>>>> =>
> >>>>>>>
> >>>>> "data abort" messages:
> >>>>>
> >>>>> data abort
> >>>>> pc : [<9ff8196c>] lr : [<9ffa1cd7>]
> >>>>> reloc pc : [<8081496c>] lr : [<80834cd7>]
> >>>>> sp : 9df38e60 ip : 9df38fc8 fp : 00000001
> >>>>> r10: 9df38eac r9 : 9df4ceb0 r8 : 9ffa1b7d
> >>>>> r7 : 9df52fd0 r6 : 9ffdbba8 r5 : 0000000d r4 : 00000018
> >>>>> r3 : 3ff589e0 r2 : 9ffafa11 r1 : 9ffdbbc0 r0 : 9ffdbb00
> >>>>> Flags: Nzcv IRQs off FIQs on Mode SVC_32 (T)
> >>>>> Code: 0303 60ca 4403 6091 (685a) f042
> >>>>> Resetting CPU ...
> >>>>>
> >>>>> objdump u-boot:pc is in malloc and lr is in env_attr_walk
> >>>>>
> >>>>> unlink(victim, bck, fwd);
> >>>>> 80814966: 60ca str r2, [r1, #12]
> >>>>> set_inuse_bit_at_offset(victim, victim_size);
> >>>>> 80814968: 4403 add r3, r0
> >>>>> unlink(victim, bck, fwd);
> >>>>> 8081496a: 6091 str r1, [r2, #8]
> >>>>> set_inuse_bit_at_offset(victim, victim_size);
> >>>>> 8081496c: 685a ldr r2, [r3, #4]
> >>>>> 8081496e: f042 0201 orr.w r2, r2, #1
> >>>>> 80814972: 605a str r2, [r3, #4]
> >>>>>
> >>>>> r3 is 3ff589e0 and it's not a valid ram address on am335x.
> >>>>>
> >>>>>
> >>>>
> >>>> I have seen crashes in common/dlmalloc.c before after double free() or
> >>>> free() with an incorrect pointer.
> >>>>
> >>>> The assert() statements in do_check_inuse_chunk() are meant to catch
> >>>> this but assert() as defined in include/log.h does not stop the code and
> >>>> even does not print without _DEBUG=1.
> >>>>
> >>>> You should be able to get the assert output with
> >>>>
> >>>> #include <common.h>
> >>>> #define _DEBUG 1
> >>>> #include <log.h>
> >>>>
> >>>> at the top of common/dlmalloc.c.
> >>>>
> >>>> You should get full malloc debug output with
> >>>
> >>> Hi: I had try add DEBUG marco before <log.h> and no other malloc message
> >>
> >> assert() checks for _DEBUG. Defining DEBUG after common.h will not
> >> define _DEBUG.
> >
> > Finally I got a malloc error message on console:
> >
> > TFTP from server 192.168.200.1; our IP address is 192.168.200.39
> > Filename 'u-boot.img'.
> > Load address: 0x82000000
> > Loading: #################################################################
> > #################################################################
> > #################################################################
> > ###################################################### 0 Bytes
> > 1.9 MiB/s
> > done
> > Bytes transferred = 1274816 (1373c0 hex)
> > common/dlmalloc.c:819: do_check_chunk: Assertion `(char*)p + sz <= (char*)top' > failed.
> >
> > I had tried many times, do_check_chunk not always failed, and sometimes it > report common/dlmalloc.c:802: do_check_chunk: Assertion `!chunk_is_mmapped(p)' > failed. The situation is not the same.
> >
> > I got a bt stack when malloc failed:
> >
> > (gdb) bt
> > #0 0x9ffb5684 in panic_finish () at lib/panic.c:23
> > #1 panic (fmt=0x9ffbd96b "%s:%u: %s: Assertion `%s' failed.") at lib/panic.c:49
> > #2 0x9ffb5696 in __assert_fail (assertion=<optimized out>, file=<optimized > out>, line=<optimized out>, function=<optimized out>) at lib/panic.c:56
> > #3 0x9ff76910 in do_check_inuse_chunk (p=p at entry=0x9ffd7200) at > common/dlmalloc.c:866
> > #4 0x9ff769d6 in do_check_malloced_chunk (p=p at entry=0x9ffd7200, s=s at entry=24) > at common/dlmalloc.c:900
> > #5 0x9ff76da6 in malloc (bytes=<optimized out>) at common/dlmalloc.c:1552
> > #6 0x9ff96b72 in env_attr_walk (attr_list=<optimized out>, > callback=0x9ff969f9 <regex_callback>, priv=0x9df28dc8) at env/attr.c:70
> > #7 0x9ff96bc2 in env_attr_lookup (attr_list=<optimized out>, name=<optimized > out>, attributes=0x9df28dec "") at env/attr.c:184
> > #8 0x9ff97146 in env_callback_init (var_entry=0x9df46f60) at env/callback.c:67
> > #9 0x9ffb36fc in hsearch_r (item=..., action=ENV_ENTER, retval=0x9df28f60, > htab=0x9ffdbce8, flag=512) at lib/hashtable.c:403
> > #10 0x9ff7090e in _do_env_set (argc=<optimized out>, argv=<optimized out>, > env_flag=512, flag=0) at cmd/nvedit.c:296
> > #11 0x9ff70b64 in env_set (varname=<optimized out>, varvalue=<optimized out>) > at cmd/nvedit.c:318
> > #12 0x9ff6d522 in netboot_update_env () at cmd/net.c:133
> > #13 netboot_common (proto=DHCP, cmdtp=0x9ffdd0e8, argc=<optimized out>, > argv=0x9df442c8) at cmd/net.c:268
> > #14 0x9ff783a4 in cmd_call (repeatable=0x9df29008, argv=0x9df442c8, argc=1, > flag=0, cmdtp=0x9ffdd0e8) at common/command.c:580
> > #15 cmd_process (flag=<optimized out>, argc=1, argv=0x9df442c8, > repeatable=0x9ffdf6a0, ticks=0x0) at common/command.c:635
> > #16 0x9ff71d16 in run_pipe_real (pi=0x9df44220) at common/cli_hush.c:1676
> > #17 run_list_real (pi=<optimized out>) at common/cli_hush.c:1873
> > #18 0x9ff71e28 in run_list (pi=0x9df44220) at common/cli_hush.c:2022
> > #19 parse_stream_outer (inp=inp at entry=0x9df290e8, flag=flag at entry=2) at > common/cli_hush.c:3206
> > #20 0x9ff721ba in parse_file_outer () at common/cli_hush.c:3289
> > #21 0x9ff77c1a in cli_loop () at common/cli.c:229
> > #22 0x9ff70d3e in main_loop () at common/main.c:66
> > #23 0x9ff72672 in run_main_loop () at common/board_r.c:584
> > #24 0x9ff72830 in initcall_run_list (init_sequence=0x9ffd7224) at > include/initcall.h:46
> > #25 board_init_r (new_gd=<optimized out>, dest_addr=<optimized out>) at > common/board_r.c:822
> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> >
> >>
> >> Best regards
> >>
> >> Heinrich
> >>
> >>> printed.
> >>>
> >>>>
> >>>> #define DEBUG 1
> >>>> #include <common.h>
> >>>> #include <log.h>
> >>>>
> >>>> Best regards
> >>>>
> >>>> Heinrich
> >>>
More information about the U-Boot
mailing list