32-bit DMA limit for devices (and drivers)

Fri Apr 30 15:34:28 CEST 2021

On Fri, 30 Apr 2021 14:02:52 +0200 (CEST)
Mark Kettenis <mark.kettenis at xs4all.nl> wrote:

Hi Mark,

thanks for the reply!

(CC:ing Alex and Heinrich for the UEFI questions below)

> > Date: Fri, 30 Apr 2021 12:21:21 +0100
> > From: Andre Przywara <andre.przywara at arm.com>
> > 
> > Hi,
> > 
> > We now see the first Allwinner devices [1] having DRAM located above
> > 4GB in address space (4GB DRAM starting at 1GB). After one fix[2]
> > this works somewhat fine, but the sun8i-emac network device is still
> > limited to 32-bit DMA addresses. With U-Boot relocating itself (plus
> > stack and heap) to the end of DRAM, it now runs completely beyond 4GB
> > on those machines, so not giving pure 32-bit addresses for buffers
> > anymore.
> > In Linux we handle this easily by just keeping the default DMA
> > mask at 32 bits, and letting the DMA framework deal with the nasty
> > details.
> > 
> > I was wondering how this should be handled in U-Boot? The
> > straightforward solution would be:
> > - Let the driver allocate the RX and TX buffers separately, placing them
> >   below 4GB in the address space (using lmb_reserve(), I guess?)
> > - Use those RX buffers and hand the addresses back to the upper layers.
> > - We already copy TX packets, so this would also be covered, in this
> >   situation. Other drivers might need to introduce copying.  
> 
> What you describe here is called a bounce buffer approach.  I believe
> Linux developers also refer to this as swiotlb.

Yes, but it's not entirely the same as bounce buffering in Linux,
since we allocate the buffers ourselves, in the driver, so we have full
control over it. The problem I face is that malloc() works on the heap
(which is high), or we use the automatic priv_auto mechanism, which
uses the heap as well, IIUC.
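
To make the problem concrete, here is a minimal sketch of the check that
a 32-bit DMA master implicitly imposes on a buffer. The helper name
dma32_addressable() and the DMA32_LIMIT macro are made up for
illustration; they are not existing U-Boot functions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First address a 32-bit DMA master can no longer reach (4GB). */
#define DMA32_LIMIT (UINT64_C(1) << 32)

/*
 * Hypothetical helper: return true if the whole buffer
 * [addr, addr + len) lies below the 4GB boundary.
 * Written so that addr + len cannot overflow.
 */
bool dma32_addressable(uint64_t addr, size_t len)
{
	return addr < DMA32_LIMIT && len <= DMA32_LIMIT - addr;
}
```

Assuming the board described above (DRAM from 1GB to 5GB), a heap buffer
near the top of DRAM, say at 0x13ff00000, fails this check, while a
buffer at 0x40000000 passes.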

> > This sounds like a common problem, so I was wondering if there is a
> > more generic solution to this? Maybe there are already platforms or
> > devices affected? Or should the whole heap and stack be moved below 4GB
> > (if this is easily possible)?
> > In our case we make the buffers part of our priv struct, so should
> > there be an option to let the priv_auto allocation come from below 4GB?
> > 
> > Grateful for any input on this!  
> 
> I looked into this a bit when I was trying to figure out what to do on
> Apple M1 systems where I have a somewhat related issue.  These systems
> have an IOMMU that can't be bypassed.  Since I don't want to add IOMMU
> infrastructure to U-Boot, I set up the IOMMU to map a fixed block of
> physical memory and make sure that all allocations of memory come from
> that block of memory.  In this case this is fairly easy to achieve.
> U-Boot allocates memory from the top of usable memory, so as long as I
> let the IOMMU map that high memory, things work.  U-Boot doesn't need
> a lot of memory, so a block of 512MB is more than sufficient.

I'd rather not play around with the visible memory size (see below).
And while technically there is a (scatter/gather) IOMMU in the SoC, it
would be overkill for such a small problem.

> In your case this means that as long as you set the top of usable
> memory to an address < 4G, U-Boot itself should be fine and no bounce
> buffers are needed.  You have to make sure the addresses in the U-Boot
> environment for loading things like the kernel and the FDT are set to
> an address < 4G as well.
> 
> For EFI things are different though.  You want to expose all physical
> memory in the EFI memory map.

Not only for UEFI, since U-Boot populates the DT memory node even for
booti/bootm, in arch/arm/lib/bootm-fdt.c:arch_fixup_fdt().
So limiting the memory is not an option, since this would be passed on
to the OS.

> This means that an EFI application
> (such as an OS loader) may pick memory > 4G and use it to do I/O.

I think we should be safe here, as the driver has full control over the
buffers: for TX we already copy the packet, so that we can "fire and
forget", i.e. just start the DMA and return. And for RX, U-Boot network
drivers return the buffer address, so it's our own buffer again. So
wherever higher layers put the packets, we should be good (provided our
own buffers are in low memory).
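
As a rough sketch of that argument: the struct and function names below
are hypothetical and only loosely modelled on the shape of a U-Boot
network driver's send/recv callbacks. The point is merely that the
device only ever sees the driver's own two buffers, so the location of
the caller's packet does not matter.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PKT_BUF_SIZE 2048	/* hypothetical per-packet buffer size */

/* Hypothetical driver state; a real driver would place this below 4GB. */
struct emac_bufs {
	uint8_t tx_buf[PKT_BUF_SIZE];
	uint8_t rx_buf[PKT_BUF_SIZE];
	size_t rx_len;	/* length of the frame last written to rx_buf */
};

/* TX: copy the caller's packet, which may live anywhere, into tx_buf. */
int emac_send(struct emac_bufs *b, const void *pkt, size_t len)
{
	if (len > PKT_BUF_SIZE)
		return -1;
	memcpy(b->tx_buf, pkt, len);
	/* A real driver would now start the DMA on tx_buf and return. */
	return 0;
}

/* RX: hand back a pointer into our own buffer, never a caller buffer. */
int emac_recv(struct emac_bufs *b, uint8_t **pktp)
{
	if (b->rx_len == 0)
		return 0;	/* nothing received */
	*pktp = b->rx_buf;
	return (int)b->rx_len;
}
```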

So I guess my question boils down to: How can I best allocate buffers
from "low" memory? And do those buffer carveouts make it into the UEFI
memory map, as reserved regions? Or can UEFI differentiate between
boot services and runtime services allocations? The buffers would be
needed during boot services, for the UEFI network protocol. But later
on they can be abandoned.
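
What I have in mind for the "low" allocation is roughly the following
top-down search with an upper address bound. This is only a sketch:
struct mem_region and alloc_below() are invented names, it handles a
single DRAM bank, and it ignores existing reservations, which a real
allocator would have to honour.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one DRAM bank. */
struct mem_region {
	uint64_t base;
	uint64_t size;
};

/*
 * Allocate `size` bytes top-down from region `r`, aligned to `align`
 * (a power of two), such that the block ends at or below `max_addr`.
 * Returns the block's start address, or 0 if it does not fit.
 */
uint64_t alloc_below(const struct mem_region *r, uint64_t size,
		     uint64_t align, uint64_t max_addr)
{
	uint64_t end = r->base + r->size;
	uint64_t addr;

	assert(align != 0 && (align & (align - 1)) == 0);

	if (end > max_addr)
		end = max_addr;		/* clamp to the DMA limit */
	if (size == 0 || end < r->base || end - r->base < size)
		return 0;

	addr = (end - size) & ~(align - 1);	/* align downwards */
	return addr >= r->base ? addr : 0;
}
```

For a bank of 4GB starting at 1GB, asking for 64KB with a 4GB bound
yields a block directly below the 4GB boundary, whereas without the
bound the block would come from the very top of DRAM, above 4GB.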

> For this purpose U-Boot already implements bounce buffers.  See the
> CONFIG_EFI_LOADER_BOUNCE_BUFFER option.

Interesting, thanks, I will have a look at that. Maybe it contains
some useful pointers to other code.

Cheers,
Andre