Odd error with cn9130, asix88179 and xhci
Chris Packham
judge.packham at gmail.com
Tue Nov 26 22:46:14 CET 2024
On Mon, Nov 25, 2024 at 1:09 PM Chris Packham <judge.packham at gmail.com> wrote:
>
> On Sat, Nov 23, 2024 at 3:40 PM Tom Rini <trini at konsulko.com> wrote:
> >
> > On Wed, Nov 20, 2024 at 11:29:43AM +1300, Chris Packham wrote:
> > > Hi U-Boot,
> > >
> > > We've hit a weird problem at $dayjob with a board using the Marvell
> > > CN9130 SoC and using the asix88179 USB-Eth adapter.
> > >
> > > The problem is after enabling an unrelated feature in u-boot the
> > > asix88179 fails to receive data (I can confirm that the link partner
> > > does see packets in the transmit direction)
> > >
> > > => version
> > > U-Boot 2022.01 (Nov 08 2024 - 09:45:44 +0000)
> > > => usb start
> > > starting USB...
> > > Bus usb3 at 500000: Register 2000120 NbrPorts 2
> > > Starting the controller
> > > USB XHCI 1.00
> > > scanning bus usb3 at 500000 for devices... 2 USB Device(s) found
> > > scanning usb for storage devices... 0 Storage Device(s) found
> > > => ping ${serverip}
> > > Waiting for Ethernet connection... unable to connect.
> > > Reset Ethernet Device
> > > Waiting for Ethernet connection... done.
> > > Using ax88179_eth device
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > > Rx: failed to receive: -5
> > >
> > > Abort
> > > ping failed; host 10.37.233.65 is not alive
> > > => <INTERRUPT>
> > >
> > > Debugging a little we can see that the -EIO is actually because
> > > xhci_bulk_tx() hits a timeout from xhci_wait_for_event().
> > >
> > > We think this is triggered by the u-boot image size crossing some
> > > boundary (the problem seems to start when .bss_end crosses
> > > 0x00000000000f0000) although I've so far been unable to find
> > > specifically why that might be. As far as I can tell u-boot is being
> > > built relocatably and nothing is overlapping. I also considered that
> > > ATF might be preventing access to something but so far I see no
> > > evidence of this.
> > >
> > > If I turn off some features to reduce the build size the problem goes
> > > away. That is actually how we've avoided the immediate issue, although
> > > that means the problem will likely come back at an inopportune time.
> > >
> > > Does anyone have any ideas as to what the true root cause might be?
> > > I'm a bit stumped.
> >
> > Hummmm. Since you note it seems to be when a threshold is crossed in BSS
> > size, add something to the BSS of a variable size that you control, and
> > after confirming that you can replicate the problem this way, grow it
> > just past the limit and compare u-boot.map files in the works/fails
> > cases to see just what's being moved around?
>
> So I tried a little experiment:
>
> diff --git a/net/net.c b/net/net.c
> index b003b84b3537..a6def9785133 100644
> --- a/net/net.c
> +++ b/net/net.c
> @@ -180,6 +180,10 @@ u32 net_boot_file_size;
> /* Boot file size in blocks as reported by the DHCP server */
> u32 net_boot_file_expected_size_in_blocks;
>
> +#define DUMMY_SIZE (1 << 11)
> +
> +int dummy[DUMMY_SIZE] = {0};
> +
> static uchar net_pkt_buf[(PKTBUFSRX+1) * PKTSIZE_ALIGN + PKTALIGN];
> /* Receive packets */
> uchar *net_rx_packets[PKTBUFSRX];
> @@ -211,6 +215,7 @@ int __maybe_unused net_busy_flag;
> static int on_ipaddr(const char *name, const char *value, enum env_op op,
> int flags)
> {
> + dummy[DUMMY_SIZE - 1] = -1;
> if (flags & H_PROGRAMMATIC)
> return 0;
>
>
> If I make DUMMY_SIZE (1 << 10) I don't see the problem. With
> DUMMY_SIZE (1 << 11) I can see the problem. If I make it DUMMY_SIZE (1
> << 14) then the problem goes away again.
>
> The obvious things that are moving are net_rx_packet,
> net_rx_packet_len and net_rx_packets. I'll see if I can narrow things
> down to specifically which of these is being problematic.
>
The plot thickens on this one. First, I found that the failure
remained even when I moved my dummy block after the symbols I
suspected. I kept narrowing things down and found that my dummy array
needs a length between 0x800 and 0x8e0 elements to cause the issue.
While trying to debug why, I found that I could fix a failing system
with a `usb reset`. I now suspect that something in the mix is relying
on uninitialised memory (or perhaps the calculation for clearing out
.bss is slightly off).
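
One way to probe the uninitialised-memory theory is to scan a region
that is supposed to be zero and report the first byte that is not. The
following is only a minimal sketch: first_nonzero() is a hypothetical
helper, not an existing U-Boot function, and it is written as plain C
so it can be tried on a host before being dropped into a debug build.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical debug helper (not part of U-Boot).
 *
 * Return the offset of the first non-zero byte in [start, start + len),
 * or len if every byte in the range is zero.
 */
size_t first_nonzero(const unsigned char *start, size_t len)
{
	size_t i;

	/* A NULL pointer is only acceptable for an empty range. */
	assert(start != NULL || len == 0);

	for (i = 0; i < len; i++) {
		if (start[i] != 0)
			return i;
	}

	return len;
}
```

On the target, the range to scan would be the relocated .bss, as early
as possible after it has been cleared. Which linker symbols delimit it
(and whether they are usable at that point) is an assumption to check
against the board's linker script. A non-zero hit is only a hint, since
anything that has legitimately written to .bss before the scan will
also show up.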