data abort when run 'dhcp'

Miquel Raynal miquel.raynal at bootlin.com
Fri Jul 21 13:54:23 CEST 2023


Hi Heinrich,

xypron.glpk at gmx.de wrote on Thu, 20 Jul 2023 19:55:39 +0200:

> On 20 July 2023 18:39:17 CEST, Miquel Raynal <miquel.raynal at bootlin.com> wrote:
> >Hello,
> >
> >qianfanguijin at 163.com wrote on Fri, 25 Mar 2022 18:04:46 +0800:
> >  
> >> It's very strange, and I can't tell whether it's a bug in USB or in dlmalloc.
> >> 
> >> 1. Starting U-Boot and running dhcp via the am335x's Ethernet (cpsw driver): it's OK.
> >> 
> >> 2. Starting U-Boot and running dhcp via the am335x's USB net: data abort.
> >> 
> >> 3. Starting fastboot, pressing CTRL-C right away, then running dhcp via the am335x's USB net: it's OK.  
> >
> >I am sorry to re-open a thread that is one year old but this is
> >still an open bug. The BBB is affected. In particular the BBBW
> >because there is no Ethernet connector, which makes the Eth-over-USB
> >emulation even more important. All U-Boots since 2021 are affected:
> >spurious data aborts, usually at the end of network interactions (tftp,
> >ping). I could not bisect it because the boot was deeply broken as
> >well on a significant range of commits :-/.
> >
> >On my side I narrowed it down to an env update which fails in malloc as
> >well. If I comment out the env update, it fails a bit later. It really
> >looks like a stack corruption which is either related to the Ethernet
> >USB gadget or the USB controller driver itself. Network transfers on
> >the BBBW using regular Ethernet do not trigger any error.
> >
> >I also observe the very strange "fix" mentioned above: starting and
> >killing fastboot makes all tftp pass... If anyone has more details to
> >share, or perhaps a subsequent thread giving more details, I would
> >really like to see this fixed upstream, I suppose I am not the only one
> >:-)
> >
> >Thanks,
> >Miquèl  
> 
> 
> Can this problem be reproduced on QEMU?

I haven't tried on QEMU, what do you have in mind? What should we try
to do?

Thanks,
Miquèl

> 
> Best regards
> 
> Heinrich 
> 
> >  
> >> 
> >> On 2022/3/24 17:33, qianfan wrote:  
> >> >
> >> > On 2022/3/23 18:12, Heinrich Schuchardt wrote:    
> >> >> On 3/23/22 11:07, qianfan wrote:    
> >> >>>
> >> >>> On 2022/3/23 17:51, Heinrich Schuchardt wrote:    
> >> >>>> On 3/23/22 10:13, qianfan wrote:    
> >> >>>>>
> >> >>>>> On 2022/3/23 16:02, qianfan wrote:    
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On 2022/3/23 15:45, qianfan wrote:    
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On 2022/3/23 10:28, qianfan wrote:    
> >> >>>>>>>>
> >> >>>>>>>> Hi:
> >> >>>>>>>>
> >> >>>>>>>> I have a custom AM335X board connected to my computer via usbnet. It
> >> >>>>>>>> always reports a data abort when running 'dhcp'.
> >> >>>>>>>>
> >> >>>>>>>> Here is the log:
> >> >>>>>>>>
> >> >>>>>>>> U-Boot 2022.01-rc1-00183-gfa5b4e2d19-dirty (Feb 25 2022 - 15:45:02
> >> >>>>>>>> +0800)
> >> >>>>>>>>
> >> >>>>>>>> CPU  : AM335X-GP rev 2.1
> >> >>>>>>>> Model: WISDOM AM335X CCT
> >> >>>>>>>> DRAM:  512 MiB
> >> >>>>>>>> NAND:  256 MiB
> >> >>>>>>>> MMC:   OMAP SD/MMC: 0
> >> >>>>>>>> Loading Environment from NAND... *** Warning - bad CRC, using
> >> >>>>>>>> default environment
> >> >>>>>>>>
> >> >>>>>>>> Net:   Could not get PHY for ethernet@4a100000: addr 0
> >> >>>>>>>> eth2: ethernet@4a100000, eth3: usb_ether
> >> >>>>>>>> Hit any key to stop autoboot:  0    
> >> >>>>>>>> => setenv autoload no
> >> >>>>>>>> => dhcp    
> >> >>>>>>>> using musb-hdrc, OUT ep1out IN ep1in STATUS ep2in
> >> >>>>>>>> MAC de:ad:be:ef:00:01
> >> >>>>>>>> HOST MAC de:ad:be:ef:00:00
> >> >>>>>>>> RNDIS ready
> >> >>>>>>>> musb-hdrc: peripheral reset irq lost!
> >> >>>>>>>> high speed config #2: 2 mA, Ethernet Gadget, using RNDIS
> >> >>>>>>>> USB RNDIS network up!
> >> >>>>>>>> BOOTP broadcast 1
> >> >>>>>>>> BOOTP broadcast 2
> >> >>>>>>>> BOOTP broadcast 3
> >> >>>>>>>> DHCP client bound to address 192.168.200.4 (757 ms)
> >> >>>>>>>> data abort
> >> >>>>>>>> pc : [<9fe9b0a2>]          lr : [<9febbc3f>]
> >> >>>>>>>> reloc pc : [<808130a2>]    lr : [<80833c3f>]
> >> >>>>>>>> sp : 9de53410  ip : 9de53578     fp : 00000001
> >> >>>>>>>> r10: 9de5345c  r9 : 9de67e80     r8 : 9febbae5
> >> >>>>>>>> r7 : 9de72c30  r6 : 9feec710     r5 : 0000000d  r4 : 00000018
> >> >>>>>>>> r3 : 3fdd8e04  r2 : 00000002     r1 : 9feec728  r0 : 9feec700
> >> >>>>>>>> Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32 (T)
> >> >>>>>>>> Code: f023 0303 60ca 4403 (6091) 685a
> >> >>>>>>>> Resetting CPU ...
> >> >>>>>>>>
> >> >>>>>>>> resetting ...
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Is there any documentation on how to debug a data abort? Or has this
> >> >>>>>>>> bug already been fixed?
> >> >>>>>>>>
> >> >>>>>>>> Thanks
> >> >>>>>>>>    
> >> >>>>>>> This bug is not fixed on master. I found that v2021.01 is good and
> >> >>>>>>> v2021.04-rc2 is bad.
> >> >>>>>>>
> >> >>>>>>> I also tested this on a BeagleBone Black with am335x_evm_defconfig; it
> >> >>>>>>> has a similar problem.
> >> >>>>>>>
> >> >>>>>>> I searched for the first broken commit with 'git bisect': it told me
> >> >>>>>>> that commit e97eb638de0dc8f6e989e20eaeb0342f103cb917 broke it. But that
> >> >>>>>>> is very strange, because this commit doesn't touch any dhcp or network code.
> >> >>>>>>>
> >> >>>>>>> ➜  u-boot-main git:(e97eb638de) ✗ git bisect bug
> >> >>>>>>> e97eb638de0dc8f6e989e20eaeb0342f103cb917 is the first bug commit
> >> >>>>>>> commit e97eb638de0dc8f6e989e20eaeb0342f103cb917
> >> >>>>>>> Author: Heinrich Schuchardt <xypron.glpk at gmx.de>
> >> >>>>>>> Date:   Wed Jan 20 22:21:53 2021 +0100
> >> >>>>>>>
> >> >>>>>>>     fs: fat: consistent error handling for flush_dir()
> >> >>>>>>>
> >> >>>>>>>     Provide function description for flush_dir().
> >> >>>>>>>     Move all error messages for flush_dir() from the callers to the
> >> >>>>>>> function.
> >> >>>>>>>     Move mapping of errors to -EIO to the function.
> >> >>>>>>>     Always check return value of flush_dir() (Coverity CID 316362).
> >> >>>>>>>
> >> >>>>>>>     In fat_unlink() return -EIO if flush_dirty_fat_buffer() fails.
> >> >>>>>>>
> >> >>>>>>>     Signed-off-by: Heinrich Schuchardt <xypron.glpk at gmx.de>
> >> >>>>>>>
> >> >>>>>>> :040000 040000 2281a449f2d134078d7faa1ee735a367b55aad7e
> >> >>>>>>> 77d188b1c99181fd71f2167fdeee3434a09db209 M      fs
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> 184aa6504143b452132e28cd3ebecc7b941cdfa1 is the commit immediately before
> >> >>>>>>> e97eb638de0dc8f6e989e20eaeb0342f103cb917:
> >> >>>>>>>
> >> >>>>>>> * e97eb638de0dc8f6e989e20eaeb0342f103cb917 fs: fat: consistent error
> >> >>>>>>> handling for flush_dir()
> >> >>>>>>> *   184aa6504143b452132e28cd3ebecc7b941cdfa1 Merge tag
> >> >>>>>>> 'u-boot-rockchip-20210121' of
> >> >>>>>>> https://gitlab.denx.de/u-boot/custodians/u-boot-rockchip
> >> >>>>>>> |\
> >> >>>>>>> | * 9ddc0787bd660214366e386ce689dd78299ac9d0 pci: Add Rockchip dwc
> >> >>>>>>> based PCIe controller driver
> >> >>>>>>>
> >> >>>>>>> I checked that 184aa6504143b452132e28cd3ebecc7b941cdfa1 works fine.
> >> >>>>>>>
> >> >>>>>>> U-Boot 2021.01-00688-g184aa65041-dirty (Mar 23 2022 - 15:07:56 +0800)
> >> >>>>>>>
> >> >>>>>>> CPU  : AM335X-GP rev 2.1
> >> >>>>>>> Model: TI AM335x BeagleBone Black
> >> >>>>>>> DRAM:  512 MiB
> >> >>>>>>> WDT:   Started with servicing (60s timeout)
> >> >>>>>>> NAND:  0 MiB
> >> >>>>>>> MMC:   OMAP SD/MMC: 0, OMAP SD/MMC: 1
> >> >>>>>>> Loading Environment from FAT... <ethaddr> not set. Validating first
> >> >>>>>>> E-fuse MAC
> >> >>>>>>> Net:   eth2: ethernet@4a100000, eth3: usb_ether
> >> >>>>>>> Hit any key to stop autoboot:  0    
> >> >>>>>>> => dhcp    
> >> >>>>>>> ethernet@4a100000 Waiting for PHY auto negotiation to
> >> >>>>>>> complete......... TIMEOUT !
> >> >>>>>>> using musb-hdrc, OUT ep1out IN ep1in STATUS ep2in
> >> >>>>>>> MAC de:ad:be:ef:00:01
> >> >>>>>>> HOST MAC de:ad:be:ef:00:00
> >> >>>>>>> RNDIS ready
> >> >>>>>>> musb-hdrc: peripheral reset irq lost!
> >> >>>>>>> high speed config #2: 2 mA, Ethernet Gadget, using RNDIS
> >> >>>>>>> USB RNDIS network up!
> >> >>>>>>> BOOTP broadcast 1
> >> >>>>>>> BOOTP broadcast 2
> >> >>>>>>> BOOTP broadcast 3
> >> >>>>>>> DHCP client bound to address 192.168.200.157 (757 ms)
> >> >>>>>>> Using usb_ether device
> >> >>>>>>> TFTP from server 192.168.200.1; our IP address is 192.168.200.157
> >> >>>>>>> Filename 'u-boot.img'.
> >> >>>>>>> Load address: 0x82000000
> >> >>>>>>> Loading:
> >> >>>>>>> #################################################################
> >> >>>>>>> #################################################################
> >> >>>>>>> #################################################################
> >> >>>>>>>          #########################
> >> >>>>>>>          2.5 MiB/s
> >> >>>>>>> done
> >> >>>>>>> Bytes transferred = 1123888 (112630 hex)    
> >> >>>>>>> =>    
> >> >>>>>>>    
> >> >>>>> "data abort" messages:
> >> >>>>>
> >> >>>>> data abort
> >> >>>>> pc : [<9ff8196c>]          lr : [<9ffa1cd7>]
> >> >>>>> reloc pc : [<8081496c>]    lr : [<80834cd7>]
> >> >>>>> sp : 9df38e60  ip : 9df38fc8     fp : 00000001
> >> >>>>> r10: 9df38eac  r9 : 9df4ceb0     r8 : 9ffa1b7d
> >> >>>>> r7 : 9df52fd0  r6 : 9ffdbba8     r5 : 0000000d  r4 : 00000018
> >> >>>>> r3 : 3ff589e0  r2 : 9ffafa11     r1 : 9ffdbbc0  r0 : 9ffdbb00
> >> >>>>> Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32 (T)
> >> >>>>> Code: 0303 60ca 4403 6091 (685a) f042
> >> >>>>> Resetting CPU ...
> >> >>>>>
> >> >>>>> objdump of u-boot: pc is in malloc and lr is in env_attr_walk
> >> >>>>>
> >> >>>>>        unlink(victim, bck, fwd);
> >> >>>>> 80814966:    60ca          str    r2, [r1, #12]
> >> >>>>>        set_inuse_bit_at_offset(victim, victim_size);
> >> >>>>> 80814968:    4403          add    r3, r0
> >> >>>>>        unlink(victim, bck, fwd);
> >> >>>>> 8081496a:    6091          str    r1, [r2, #8]
> >> >>>>>      set_inuse_bit_at_offset(victim, victim_size);
> >> >>>>> 8081496c:    685a          ldr    r2, [r3, #4]
> >> >>>>> 8081496e:    f042 0201     orr.w    r2, r2, #1
> >> >>>>> 80814972:    605a          str    r2, [r3, #4]
> >> >>>>>
> >> >>>>> r3 is 3ff589e0, which is not a valid RAM address on the am335x.
> >> >>>>>
> >> >>>>>    
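The observation about r3 can be checked with a little arithmetic. The following is only a sketch under stated assumptions: DRAM is taken to start at 0x80000000 with the 512 MiB size printed in the boot log, and the helper names (model_in_dram, model_r3_before_add) are hypothetical, not U-Boot functions. Since the disassembly shows "add r3, r0" right before the faulting load, subtracting r0 from r3 recovers what r3 held before the add.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions: DRAM base 0x80000000, size 512 MiB as shown in the boot log. */
#define MODEL_DRAM_BASE 0x80000000u
#define MODEL_DRAM_SIZE (512u * 1024u * 1024u)

/* True if addr lies inside the assumed DRAM window. */
bool model_in_dram(uint32_t addr)
{
	return addr >= MODEL_DRAM_BASE &&
	       addr - MODEL_DRAM_BASE < MODEL_DRAM_SIZE;
}

/* Undo a 32-bit "add r3, r0": returns the value r3 held before the add. */
uint32_t model_r3_before_add(uint32_t r3_after, uint32_t r0)
{
	return r3_after - r0; /* unsigned arithmetic wraps modulo 2^32 */
}
```

With the values from the second dump, model_in_dram(0x3ff589e0) is false and model_r3_before_add(0x3ff589e0, 0x9ffdbb00) gives 0x9ff7cee0. If r0 really is the victim pointer (an assumption based on the source annotation in the objdump), the size operand itself looked like an address rather than a small chunk size.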
> >> >>>>
> >> >>>> I have seen crashes in common/dlmalloc.c before after double free() or
> >> >>>> free() with an incorrect pointer.
> >> >>>>
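To illustrate why damaged free-list links end in a data abort, here is a minimal model of the unlink step seen in the objdump (two stores through the fd and bk pointers). It is a sketch, not the dlmalloc code: struct model_node and the model_ functions are hypothetical, and the consistency check shown is an extra hardening step that is assumed not to be present in common/dlmalloc.c. In the first register dump, r2 is 00000002 at the faulting "str r1, [r2, #8]", which is what a store through a bogus link looks like.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fd/bk links of a free chunk. */
struct model_node {
	struct model_node *fd;
	struct model_node *bk;
};

/*
 * Unlink p only if its neighbours still point back at it. This catches
 * inconsistent links, but a wild pointer would still fault on the read.
 */
bool model_unlink_checked(struct model_node *p)
{
	struct model_node *fd = p->fd;
	struct model_node *bk = p->bk;

	if (!fd || !bk || fd->bk != p || bk->fd != p)
		return false;
	fd->bk = bk; /* the classic unlink performs these two stores */
	bk->fd = fd; /* unconditionally */
	return true;
}

/* Build a three-node ring, optionally corrupt one link, try to unlink. */
bool model_unlink_demo(bool corrupt)
{
	struct model_node a, b, c, stray;

	a.fd = &b; a.bk = &c;
	b.fd = &c; b.bk = &a;
	c.fd = &a; c.bk = &b;
	stray.fd = &stray; stray.bk = &stray;
	if (corrupt)
		b.fd = &stray; /* simulate an overwritten forward pointer */
	return model_unlink_checked(&b);
}
```

model_unlink_demo(false) succeeds, while model_unlink_demo(true) is rejected before any store happens.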
> >> >>>> The assert() statements in do_check_inuse_chunk() are meant to catch
> >> >>>> this but assert() as defined in include/log.h does not stop the code and
> >> >>>> does not even print without _DEBUG=1.
> >> >>>>
> >> >>>> You should be able to get the assert output with
> >> >>>>
> >> >>>> #include <common.h>
> >> >>>> #define _DEBUG 1
> >> >>>> #include <log.h>
> >> >>>>
> >> >>>> at the top of common/dlmalloc.c.
> >> >>>>
> >> >>>> You should get full malloc debug output with    
> >> >>>
> >> >>> Hi: I tried adding the DEBUG macro before <log.h>, and no other malloc message was    
> >> >>
> >> >> assert() checks for _DEBUG. Defining DEBUG after common.h will not
> >> >> define _DEBUG.    
> >> >
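The point about definition order can be reproduced in a few lines. This is a simplified model and not a copy of include/log.h: the MODEL_ names are hypothetical, and the guarded block stands in for a header that was already pulled in by an earlier include. Once the guard has been passed, defining DEBUG no longer changes the derived flag, so a failing assertion stays silent.

```c
/*
 * Simplified model of a header that derives a debug flag from DEBUG.
 * The MODEL_ names are hypothetical; this is not include/log.h itself.
 */
#ifndef MODEL_LOG_H /* include guard: this block is evaluated only once */
#define MODEL_LOG_H
#ifdef DEBUG
#define MODEL_DEBUG 1
#else
#define MODEL_DEBUG 0
#endif
#endif

/* Defined only after the "header" was processed, so it comes too late. */
#define DEBUG 1

static int model_reported;

/* Report a failed condition only when the derived flag is set. */
#define model_assert(x) \
	do { if (!(x) && MODEL_DEBUG) model_reported++; } while (0)

int model_count_reported_failures(void)
{
	model_reported = 0;
	model_assert(1 + 1 == 3); /* false, but silent while MODEL_DEBUG is 0 */
	return model_reported;
}
```

Built without -DDEBUG on the command line, model_count_reported_failures() returns 0 even though the condition is false. Moving the "#define DEBUG 1" line above the guarded block makes it return 1.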
> >> > Finally I got a malloc error message on console:
> >> >
> >> > TFTP from server 192.168.200.1; our IP address is 192.168.200.39
> >> > Filename 'u-boot.img'.
> >> > Load address: 0x82000000
> >> > Loading: #################################################################
> >> > #################################################################
> >> > #################################################################
> >> >          ######################################################  0 Bytes
> >> >          1.9 MiB/s
> >> > done
> >> > Bytes transferred = 1274816 (1373c0 hex)
> >> > common/dlmalloc.c:819: do_check_chunk: Assertion `(char*)p + sz <= (char*)top' failed.
> >> >
> >> > I tried many times. do_check_chunk does not always fail, and sometimes it reports common/dlmalloc.c:802: do_check_chunk: Assertion `!chunk_is_mmapped(p)' failed instead. The failure is not the same every time.
> >> >
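For readers unfamiliar with these two assertions, the sketch below models what they test: the low bits of the size field are flags, and the chunk must end at or before the top chunk. It is a simplified illustration with hypothetical names (struct model_chunk, MODEL_PREV_INUSE, MODEL_IS_MMAPPED, model_check_size_field), assuming the usual dlmalloc flag values 0x1 and 0x2. It leaves out the separate case where the chunk is the top chunk itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag bits in the low bits of the size field. */
#define MODEL_PREV_INUSE 0x1u
#define MODEL_IS_MMAPPED 0x2u

/* Hypothetical, reduced chunk header. */
struct model_chunk {
	size_t prev_size;
	size_t size; /* chunk size in bytes, low bits carry the flags */
};

/* Model of the two checks: not mmapped, and ends at or before top. */
bool model_chunk_is_sane(const struct model_chunk *p,
			 const struct model_chunk *top)
{
	size_t sz = p->size & ~(size_t)MODEL_PREV_INUSE;
	uintptr_t start = (uintptr_t)p;
	uintptr_t limit = (uintptr_t)top;

	if (p->size & MODEL_IS_MMAPPED)
		return false; /* mirrors !chunk_is_mmapped(p) */
	if (start > limit)
		return false;
	return sz <= limit - start; /* mirrors p + sz <= top, without overflow */
}

/* Put size_field into the first chunk of a small arena and check it. */
bool model_check_size_field(size_t size_field)
{
	static struct model_chunk arena[8];

	arena[0].size = size_field;
	return model_chunk_is_sane(&arena[0], &arena[4]);
}
```

A plausible size such as 16 with the previous-in-use bit set passes. A size field overwritten with a pointer-like value fails the extent check, and any value with bit 1 set fails the mmapped check, which would be consistent with seeing either assertion depending on what was written over the header.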
> >> > I got a backtrace when malloc failed:
> >> >
> >> > (gdb) bt
> >> > #0  0x9ffb5684 in panic_finish () at lib/panic.c:23
> >> > #1  panic (fmt=0x9ffbd96b "%s:%u: %s: Assertion `%s' failed.") at lib/panic.c:49
> >> > #2  0x9ffb5696 in __assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at lib/panic.c:56
> >> > #3  0x9ff76910 in do_check_inuse_chunk (p=p@entry=0x9ffd7200) at common/dlmalloc.c:866
> >> > #4  0x9ff769d6 in do_check_malloced_chunk (p=p@entry=0x9ffd7200, s=s@entry=24) at common/dlmalloc.c:900
> >> > #5  0x9ff76da6 in malloc (bytes=<optimized out>) at common/dlmalloc.c:1552
> >> > #6  0x9ff96b72 in env_attr_walk (attr_list=<optimized out>, callback=0x9ff969f9 <regex_callback>, priv=0x9df28dc8) at env/attr.c:70
> >> > #7  0x9ff96bc2 in env_attr_lookup (attr_list=<optimized out>, name=<optimized out>, attributes=0x9df28dec "") at env/attr.c:184
> >> > #8  0x9ff97146 in env_callback_init (var_entry=0x9df46f60) at env/callback.c:67
> >> > #9  0x9ffb36fc in hsearch_r (item=..., action=ENV_ENTER, retval=0x9df28f60, htab=0x9ffdbce8, flag=512) at lib/hashtable.c:403
> >> > #10 0x9ff7090e in _do_env_set (argc=<optimized out>, argv=<optimized out>, env_flag=512, flag=0) at cmd/nvedit.c:296
> >> > #11 0x9ff70b64 in env_set (varname=<optimized out>, varvalue=<optimized out>) at cmd/nvedit.c:318
> >> > #12 0x9ff6d522 in netboot_update_env () at cmd/net.c:133
> >> > #13 netboot_common (proto=DHCP, cmdtp=0x9ffdd0e8, argc=<optimized out>, argv=0x9df442c8) at cmd/net.c:268
> >> > #14 0x9ff783a4 in cmd_call (repeatable=0x9df29008, argv=0x9df442c8, argc=1, flag=0, cmdtp=0x9ffdd0e8) at common/command.c:580
> >> > #15 cmd_process (flag=<optimized out>, argc=1, argv=0x9df442c8, repeatable=0x9ffdf6a0, ticks=0x0) at common/command.c:635
> >> > #16 0x9ff71d16 in run_pipe_real (pi=0x9df44220) at common/cli_hush.c:1676
> >> > #17 run_list_real (pi=<optimized out>) at common/cli_hush.c:1873
> >> > #18 0x9ff71e28 in run_list (pi=0x9df44220) at common/cli_hush.c:2022
> >> > #19 parse_stream_outer (inp=inp@entry=0x9df290e8, flag=flag@entry=2) at common/cli_hush.c:3206
> >> > #20 0x9ff721ba in parse_file_outer () at common/cli_hush.c:3289
> >> > #21 0x9ff77c1a in cli_loop () at common/cli.c:229
> >> > #22 0x9ff70d3e in main_loop () at common/main.c:66
> >> > #23 0x9ff72672 in run_main_loop () at common/board_r.c:584
> >> > #24 0x9ff72830 in initcall_run_list (init_sequence=0x9ffd7224) at include/initcall.h:46
> >> > #25 board_init_r (new_gd=<optimized out>, dest_addr=<optimized out>) at common/board_r.c:822
> >> > Backtrace stopped: previous frame identical to this frame (corrupt stack?)
> >> >    
> >> >>
> >> >> Best regards
> >> >>
> >> >> Heinrich
> >> >>    
> >> >>> printed.
> >> >>>    
> >> >>>>
> >> >>>> #define DEBUG 1
> >> >>>> #include <common.h>
> >> >>>> #include <log.h>
> >> >>>>
> >> >>>> Best regards
> >> >>>>
> >> >>>> Heinrich    
> >> >>>    


