[PATCH] nvme: Fix cache alignment

Andre Przywara andre.przywara at arm.com
Thu Feb 4 11:26:36 CET 2021


On Wed, 3 Feb 2021 14:08:39 +0100
Marek Vasut <marek.vasut at gmail.com> wrote:

Hi Marek,

I think our opinions actually don't differ that much, but we might be
either misunderstanding each other or talking about different things.
See below.

> On 2/3/21 11:42 AM, Andre Przywara wrote:
> 
> [...]
> 
> >>>>>    drivers/nvme/nvme.c | 50 +++++++++++++++++++++++++++++----------------
> >>>>>    1 file changed, 32 insertions(+), 18 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/nvme/nvme.c b/drivers/nvme/nvme.c
> >>>>> index 5d6331ad34..758415a53b 100644
> >>>>> --- a/drivers/nvme/nvme.c
> >>>>> +++ b/drivers/nvme/nvme.c
> >>>>> @@ -53,6 +53,27 @@ struct nvme_queue {
> >>>>>           unsigned long cmdid_data[];
> >>>>>    };
> >>>>>
> >>>>> +static void nvme_align_dcache_range(void *start, unsigned long size,
> >>>>> +                                   unsigned long *s, unsigned long *e)
> >>>>> +{
> >>>>> +       *s = rounddown((uintptr_t)start, ARCH_DMA_MINALIGN);
> >>>>> +       *e = roundup((uintptr_t)start + size, ARCH_DMA_MINALIGN);
> >>>>> +}  
> >>>
> >>> As mentioned in the other email, just rounding the value that we are
> >>> going to pass to the cache maintenance operation does not make sense,
> >>> so this function should go.  
> >>
> >> You keep saying that this patch makes no sense, but I don't see any
> >> explanation _why_ there is a problem.  
> > 
> > My main point is that you merely align the values that you pass to the
> > cache alignment function, but not the actual buffer addresses.  
> 
> Yes, which is explained in the commit message, in this driver it is 
> sufficient to round the start address down and the end address up.

I haven't checked the whole driver in detail, but this is totally
non-obvious, and also fragile. For instance individual cqes entries are
not padded, and the padding of the whole array only works by chance, due
to the following memalign and an implementation detail of memalign_simple.

So this deserves a comment in the code, for the contentious calls.
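For instance something like this, next to the cqes invalidate (just a
sketch of what such a comment could say):

	/*
	 * A single cqe is only 16 bytes, so the rounded invalidate
	 * also covers neighbouring cqes and whatever else shares the
	 * cache line. This is only safe because the CPU never writes
	 * to the cqes after init, and because the following
	 * memalign(4096, ...) leaves the rest of the last cache line
	 * unused.
	 */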
 
> > Not sure if this is clear, but the VAs used in cache maintenance
> > instructions do NOT need to be cache line aligned. The silicon
> > implementation will take care of this:  
> 
> No, this is incorrect.
> The flush/invalidate_dcache_range() functions 

We are talking about different things here. The CPU instructions do not
require alignment, so this is a pure U-Boot property.
I see that it's not the worst idea to demand alignment, to make driver
authors aware of the side effects of cache alignment.
But the ARM CPU instructions have the rounding built in.

> only support cacheline aligned addresses, if the address is not aligned, 
> then these functions do nothing. If arm64 has some special behavior, 
> then that's irrelevant here, since this is a generic driver and must 
> work on all architectures.

But your bugfix here was particularly for arm64, so what does it do
then? arm64 would work with or without the rounding; in fact the
implementation in cache.S rounds before the CPU instructions (which
would round again themselves).

> > "When a DC instruction requires an address argument this takes the
> > form of a 64-bit register that holds the VA argument. No alignment
> > restrictions apply for this address."
> > (ARMv8 ARM DDI 0487F.c D4.4.8 A64 Cache maintenance instructions:
> > The data cache maintenance instruction (DC), pg. D4-2505)
> > 
> > "When the data is stated to be an MVA, it does not have to be cache
> > line aligned." (ARMv7 ARM DDI 0406C.d B4.2.1 Cache and branch
> > predictor maintenance operations, VMSA; section "MVA", pg. B4-1736)
> > 
> > I think I once chased this back to even ARMv6.
> > 
> > What requires alignment are the buffers used with cache
> > maintenance, to avoid a cache maintenance operation inadvertently
> > affecting other data.
> > 
> > One dangerous operation is a "pure" invalidate on a *dirty* cache
> > line, as this may potentially lose data.
> > The other harm I see is when cleaning a line, when the hardware
> > (some DMA) has updated the backing memory meanwhile (for instance
> > to update a status bit or to fill a buffer with data from the
> > hardware).
> > 
> > Avoiding those two situations requires careful alignment and
> > padding of the *buffers*, blanketly rounding any addresses handed
> > to the cache maintenance operation will not help.  
> 
> If you read the patch, you will notice that these conditions either 
> happened before too, because the cache operations were in place
> already OR they do not apply at all, because the buffers in question
> are already well aligned.

Great! But why do you need to round then? If the buffers are already all
aligned, we just hand their addresses to *_cache_range(), and things
would be fine.
Looking only at the invalidates, we need to round just in
nvme_read_completion_status, because a single entry is *not* aligned.
And in *this particular case* this is fine, because the entries are only
read by the CPU and the cache line is shared only with other CQ entries
(see above).
Actually I would suggest in this case:
- blow up the cqes allocation size to cover whole cache lines:
  memalign(4096, ALIGN(NVME_CQ_SIZE(depth), ARCH_DMA_MINALIGN))
- have a wrapper that always invalidates the whole region, to not give
  the illusion that invalidating just a single entry is a thing:
  invalidate_all_cqes()
  {
	ulong start = (ulong)&nvmeq->cqes[0];
	ulong end = start + ALIGN(NVME_CQ_SIZE(NVME_Q_DEPTH),
				  ARCH_DMA_MINALIGN);

	invalidate_dcache_range(start, end);
  }

All other invalidates don't need rounding, and my argument is that this
blanket rounding is jeopardizing the very reason you ask for the
alignment in the first place.
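For the buffers that are already aligned and padded, the call can then
simply be direct, for instance (just a sketch; buffer and size are
placeholders, assumed page aligned and a multiple of the line size):

	/* buffer is page aligned, size is a multiple of the line size */
	invalidate_dcache_range((ulong)buffer, (ulong)buffer + size);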

> >>>>> +
> >>>>> +static void nvme_flush_dcache_range(void *start, unsigned long size)
> >>>>> +{
> >>>>> +       unsigned long s, e;
> >>>>> +       nvme_align_dcache_range(start, size, &s, &e);
> >>>>> +       flush_dcache_range(s, e);  
> >>>
> >>> There is no good reason for alignment restrictions when it comes to
> >>> clean (& invalidate), so there is no need for this wrapper.  
> >>
> >> Is that on ARM64-specific or is that applicable in general ? The driver
> >> is expected to work on any CPU.  
> > 
> > Cache clean (actually: cache clean&invalidate) is what happens on evictions
> > all of the time, at the cache controller's discretion. So there is no
> > real harm in that operation per se. When an eviction happens on a
> > *clean* cache line, this is basically just an invalidate, which is also not
> > harmful.
> > 
> > There are harmful cases when buffers sharing a cache line are both "just invalidated"
> > and "cleaned" at different points in time.  
> 
> Is that on ARM64-specific or is that applicable in general ? (the above 
> does not answer that question)

I would say that's a property of *every* write-back cache
implementation:
https://en.wikipedia.org/wiki/Cache_(computing)#/media/File:Write-back_with_write-allocation.svg
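To illustrate the invalidate hazard (a made-up example, not driver
code; assume a 64-byte cache line):

	struct {
		char dma_buf[32];	/* filled by the device via DMA */
		int counter;		/* written by the CPU */
	} s __aligned(64);		/* both fields share one cache line */

	s.counter++;			/* the line becomes dirty */
	/* invalidating "just" dma_buf drops the whole dirty line ... */
	invalidate_dcache_range((ulong)&s, (ulong)&s + sizeof(s.dma_buf));
	/* ... so the counter update is lost */

No amount of rounding the *arguments* helps here; only moving counter
out of dma_buf's cache line does.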

> 
> >> The flushed area start/end address must
> >> be cache line aligned, hence the wrapper is correct and needed.  
> > 
> > Why? (See below)
> >   
> >>>>> +}
> >>>>> +
> >>>>> +static void nvme_invalidate_dcache_range(void *start, unsigned long size)
> >>>>> +{
> >>>>> +       unsigned long s, e;
> >>>>> +       nvme_align_dcache_range(start, size, &s, &e);
> >>>>> +       invalidate_dcache_range(s, e);  
> >>>
> >>> invalidate is tricky, we could use a wrapper similar to
> >>> invalidate_dcache_check() in bcmgenet.c. However I am not sure this is
> >>> very useful here, because most requests are already for a whole buffer
> >>> size (page or block).
> >>>
> >>> Alignment of the buffer address will be checked by the implementation
> >>> of invalidate_dcache_range().  
> >>
> >> Both invalidate_dcache_range() and flush_dcache_range() will perform no
> >> operation if either start or end address is not cache-line aligned.  
> > 
> > Where does this come from? I don't see this. ARMv8 goes directly to
> > assembly, where there is no check or bail out.  
> 
> So ARMv8 has custom behavior, but that is not how these functions are 
> supposed to work, look at check_cache_range() usage elsewhere. If the 
> documentation is lacking, see 397b5697ad ("arm: Move check_cache_range() 
> into a common place") and co.

Well, check_cache_range() is a separate discussion, I would say. I would
argue that it's less useful for flush (because clean&invalidates
happen all the time). Also, in reality, it just pushes drivers to
round the arguments, to pass this check.

And I wonder if this: "if (!check_cache_range(start, stop)) return;"
is the right answer to that idea then. I am not sure that skipping the
cache maintenance is really better than allowing an unaligned address.
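For reference, the check is roughly this (quoting arch/arm/lib/cache.c
from memory, so double-check against the tree):

	int check_cache_range(unsigned long start, unsigned long stop)
	{
		int ok = 1;

		if (start & (CONFIG_SYS_CACHELINE_SIZE - 1))
			ok = 0;

		if (stop & (CONFIG_SYS_CACHELINE_SIZE - 1))
			ok = 0;

		if (!ok)
			warn_non_spl("CACHE: Misaligned operation at range [%08lx, %08lx]\n",
				     start, stop);

		return ok;
	}

So rounding the arguments is enough to silence it, which proves nothing
about the underlying buffer.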

> 
> > It actually aligns the
> > values passed in:
> >          /* x2 <- minimal cache line size in cache system */
> >          sub     x3, x2, #1
> >          bic     x0, x0, x3
> > 1:      dc      ivac, x0        /* invalidate data or unified cache */
> > (arch/arm/cpu/armv8/cache.S:__asm_invalidate_dcache_range)
> > I think it does this to simplify the algorithm to walk over the range.
> > 
> > On v7 there is indeed a check, but just for inval, not for flush.  
> 
> Again, arch specific behavior on which you cannot depend in generic driver.

Your commit message says: " ... which permits the nvme to work on
arm64."

> > Can you say what the reason for this v7 check was? I would say it's a
> > heads up to the programmer, to make sure we invalidate only a certain
> > buffer, but nothing else.  
> 
> The reason for those checks is because not all architectures can 
> flush/invalidate at sub-cache-line granularity, which results in obscure 
> problems.

When you say "can flush/invalidate at sub-cache-line granularity", do
you mean "will fault"? Because by design I think you can only ever
flush/invalidate whole cache lines, everywhere. The question is whether
all architectures permit unaligned addresses (ARM does). This could be
handled by the implementation (I think it actually is).

> 
> > BUT: just aligning the address for the *cache operation* is missing
> > the point, as the instruction works on the whole cache line anyway. We
> > need to make sure that no other data (stack, heap, buffer) is sharing
> > the same cache line *and* is dirty, at the time when we invalidate.
> > Just rounding the addresses lets you merely pass the check_cache_range()
> > function, but will not solve the actual issue.
> > 
> > Take an example:
> >     nvmeq->cqes = (void *)memalign(4096, NVME_CQ_SIZE(depth))
> >       will result in this allocation (0x4000 is just some address):
> >   0x4000       0x4010       0x4020
> >      |  cqes[0]   |   cqes[1]  |
> > 
> > So struct nvme_completion seems to be exactly 16 bytes long, so both
> > entries will occupy the same cache line (cache line size >= 32 bytes).
> > With the common 64 byte size any data after that (at 0x4020) would be
> > affected as well.
> > So if you invalidate one entry, the other will also be invalidated. Any
> > address between 0x4000 and 0x403f will do that, whether this is the
> > start address of cqes[1], cqes[0] or any of those, rounded down.  
> 
> And is either of the neighboring cqes[] used, so does this cause any 
> issues? I believe the answer is no.

I agree, and in *this* case we checked carefully, and can indeed round.
But this should rather be done like I sketched above, by explicitly
invalidating the whole array.

> > We need to be aware of that. Looking again more closely at this
> > case, it seems fine here, as we only ever seem to *read* from those
> > (after initialisation), so they would never become dirty. Invalidating
> > neighboring entries should do no harm then.
> > As for other data after the buffer sharing the cache line: This is fine
> > *here* because there is a memalign(4096, ...) after this memalign above,
> > so the (4096 - 32) bytes after cqes[1] are not used. But honestly this
> > should either be documented or forced, by blowing up the size
> > of the allocation to a multiple of the cache line size.
> > 
> > So to summarise: The *whole* of the CQ entries are correctly aligned and
> > padded, but only because of the *next* memalign has an alignment
> > bigger than the cache line size.
> > A *single* CQ entry is not fine, as it's never starting *and* ending at
> > a cache line boundary. In this case this is fine, as all CQ entries are
> > only ever read, so invalidating those always clean cache lines does no
> > harm.  
> 
> Good, I am happy to hear we finally reached the same conclusion.
> 
> So what do you propose is adjusted in this bugfix ?

That we just do this in nvme_read_completion_status(), ideally as in the
snippet above. The other invalidates don't need rounding, and doing the
blanket rounding papers over potential bugs, namely unaligned buffer
addresses. They should already be page aligned, AFAICT.
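Concretely, the read side could then look something like this (a
sketch, assuming the invalidate_all_cqes() wrapper from above gets the
queue passed in):

	static u16 nvme_read_completion_status(struct nvme_queue *nvmeq,
					       u16 index)
	{
		/* invalidate the whole padded cqes array, not one entry */
		invalidate_all_cqes(nvmeq);

		return le16_to_cpu(readw(&nvmeq->cqes[index].status));
	}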

Need to have a closer look at the flushes.
 
> >> If a
> >> driver passes such an unaligned address to either function, then the driver
> >> must be fixed, because it will silently fail to perform the cache
> >> flush/invalidate ops and then fail later, e.g. like this nvme driver does.
> > 
> > If a driver *uses* such unaligned address for a *pure invalidate*, it
> > should be carefully investigated if this is safe, or if anything else
> > is potentially sharing the same cache line. A blanket rounding will not
> > help with this, and will just paper over the issue.  
> 
> See check_cache_range(), the sole reason for its existence was to notify 
> driver developers about various cache alignment problems. I believe it 
> should be added on armv8 too.

That could be a discussion to have, yes, at least for invalidate. But
from what I saw in drivers, they just work around this check, by forcing
alignment on the *arguments*, without carefully checking that the
actual buffers are aligned and padded.

Cheers,
Andre

