i.MX8MP SPL failures due to memory corruption/overflow?

Rasmus Villemoes rasmus.villemoes at prevas.dk
Wed Mar 15 16:59:05 CET 2023


On 15/03/2023 16.24, Frieder Schrempf wrote:
> On 15.03.23 15:42, Frieder Schrempf wrote:
>> On 15.03.23 15:17, Michael Nazzareno Trimarchi wrote:
>>> Hi
>>>
>>> On Wed, Mar 15, 2023 at 3:13 PM Frieder Schrempf
>>> <frieder.schrempf at kontron.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to bring up a new board based on the i.MX8MP and I have an
>>>> issue I'm hoping someone can help solving.
>>>>
>>>> I'm seeing failures in the early SPL code, usually in the DDR
>>>> initialization. Often they look like:
>>>>
>>>>   U-Boot SPL 2023.04-rc3 (Mar 07 2023 - 14:32:34 +0000)
>>>>   Training FAILED
>>>>   Failed to initialize DDR RAM!
>>>>   ### ERROR ### Please RESET the board ###
>>>>
>>>> But sometimes ddr_init() doesn't even return an error and only the
>>>> get_ram_size() afterwards which tries to allocate the memory fails.
>>>>
>>>
>>> In my experience you don't have space inside the cpu internal memory. It means
>>> that you overlap some stack with the code. Change the printf means
>>> move a bit. So you have
>>> problem but depends what you are going to destroy
>>
>> Thanks for your reply. That's exactly what I'm thinking, too.
>>
>>>
>>>> The strange thing is that the issues appear or disappear
>>>> deterministically on the binary level. This means I sometimes get a
>>>> U-Boot binary which runs just fine in 100% of cases. Then I change for
>>>> example one of the following:
>>>>
>>>> * Adding a single printf() somewhere in the boards spl.c
>>>> * Using the same binary but booting from SD card instead of USB loader
>>>> * Using the same source but switching from the OS cross compiler to the
>>>> one from Yocto/OE
>>>>
>>>> And afterwards I get 100% failure rate with an error as described above.
>>>>
>>>> My suspicion is that there is some memory corruption/conflict. My SPL is
>>>> quite large and I wonder if it exceeds some limit.
>>>>
>>>> SPL is loaded to 0x920000 and CONFIG_SPL_STACK is set to 0x960000, which
>>>> leaves 256 KiB in between for the SPL. But all i.MX8MP boards seem to
>>>> set CONFIG_SPL_MAX_SIZE=0x26000 (152 KiB) for some reason. My
>>>> u-boot-spl-ddr.bin currently has around 193 KiB but I don't get any
>>>> warning about exceeding the SPL_MAX_SIZE.

I also ran into this problem a while back, but that was back when the
ddr firmware files were padded to 16K and 32K each to make the magic
offset computations work; now that binman symbols are used, they only
take up as much space as they actually use (give or take some 4-byte
padding perhaps), and I no longer need the debug code I put in place in
our 2022.07 branch.

Remember that from the stack, the initial (and in SPL only) malloc arena
is carved out, and if you haven't adjusted SPL_SYS_MALLOC_F_LEN, you
probably have that set to the default SYS_MALLOC_F_LEN, which in turn
(on imx8m) defaults to 0x10000 aka 64KiB. So that could easily explain
why you collide with the firmware.

Maybe you can use the debug code I added to our copy of spl.c; I also
include most of my commit-message-for-future-me. But just something as
simple as

  int dummy;
  printf("stack is around %p\n", &dummy);

can be quite valuable.

===

add sanity check of SPL stack versus DDR firmware

The DDR firmware blobs are added via binman to the end of the SPL, and
they are found by the DDR init code via somewhat magic offset
tricks. To make those offset tricks work, binman has to pad the
firmware files to known sizes of 32KiB and 16KiB, so all four files
end up occupying 2*(32+16) = 96KiB in the image.

Since the SPL image is loaded at 0x920000, and we use a stack growing
down from 0x960000 (recently changed to 0x968000), from which we also
carve out space for the malloc arena and global data, there's a risk
that normal use of the stack accidentally overwrites parts of the
firmware data. That's exactly what happened when I added another 14K
table of DDR training data.

We can compute exactly where the firmware lives, and also where our
stack starts (after the mentioned reservations), so to help future me,
do a sanity check as soon as we have the console. We might already
have scribbled over the DDR firmware (the main stack use is actually
in the DM init code which runs before we do), but at least we then now
to stop and tell the developer what's wrong, instead of just silently
failing when loading corrupt firmware.

 #include <asm/mach-imx/iomux-v3.h>
 #include <asm/mach-imx/mxc_i2c.h>
 #include <asm/arch/ddr.h>
+#include <asm/sections.h>
 #include <power/pmic.h>
 #include <power/pca9450.h>

@@ -102,6 +103,35 @@ int board_fit_config_name_match(const char *name)
 }
 #endif

+#define TOP_OF_STACK \
+	((CONFIG_SPL_STACK - CONFIG_VAL(SYS_MALLOC_F_LEN) - sizeof(struct
global_data)) & ~15);
+
+static void check_stack_versus_ddr_firmware(void)
+{
+	unsigned long fdt_end = roundup((unsigned long)&_end +
fdt_totalsize(gd->fdt_blob), 4);
+	unsigned long ddr_firmware_end = fdt_end + 2*(16 + 32)*1024;
+	unsigned long top_of_stack = TOP_OF_STACK;
+	unsigned long available	= top_of_stack - ddr_firmware_end;
+
+	if (0) {
+		printf("fdt_end         = 0x%08lx\n", fdt_end);
+		printf("ddr_end         = 0x%08lx\n", ddr_firmware_end);
+		printf("stack           = 0x%08lx\n", top_of_stack);
+		printf("available stack =   %8ld\n", available);
+	}
+
+	if (top_of_stack <= ddr_firmware_end)
+		panic("Top of stack 0x%08lx already below DDR firmware end 0x%08lx!\n",
+		      top_of_stack, ddr_firmware_end);
+	/*
+	 * Some ad hoc instrumentation shows that our worst case stack
+	 * usage is around 3.5K. Requiring at least 16K should be
+	 * future-proof for a while.
+	 */
+	if (available <= 16*1024)
+		panic("Available stack space %ld too small!\n", available);
+}
+
 /* Do not use BSS area in this phase */
 void board_init_f(ulong dummy)
 {
@@ -118,6 +148,8 @@ void board_init_f(ulong dummy)

 	preloader_console_init();

+	check_stack_versus_ddr_firmware();
+
 	enable_tzc380();

 	power_init_board();




More information about the U-Boot mailing list