[PATCH] nvme: Fix cache alignment

Andre Przywara andre.przywara at arm.com
Wed Feb 3 11:42:49 CET 2021


On Tue, 2 Feb 2021 22:18:47 +0100
Marek Vasut <marek.vasut at gmail.com> wrote:

Hi,

> On 2/2/21 5:23 PM, Andre Przywara wrote:
> [...]
> 
> >>>   drivers/nvme/nvme.c | 50 +++++++++++++++++++++++++++++----------------
> >>>   1 file changed, 32 insertions(+), 18 deletions(-)
> >>>
> >>> diff --git a/drivers/nvme/nvme.c b/drivers/nvme/nvme.c
> >>> index 5d6331ad34..758415a53b 100644
> >>> --- a/drivers/nvme/nvme.c
> >>> +++ b/drivers/nvme/nvme.c
> >>> @@ -53,6 +53,27 @@ struct nvme_queue {
> >>>          unsigned long cmdid_data[];
> >>>   };
> >>>
> >>> +static void nvme_align_dcache_range(void *start, unsigned long size,
> >>> +                                   unsigned long *s, unsigned long *e)
> >>> +{
> >>> +       *s = rounddown((uintptr_t)start, ARCH_DMA_MINALIGN);
> >>> +       *e = roundup((uintptr_t)start + size, ARCH_DMA_MINALIGN);
> >>> +}  
> > 
> > As mentioned in the other email, just rounding the value that we are
> > going to pass to the cache maintenance operation does not make sense,
> > so this function should go.  
> 
> You keep saying that this patch makes no sense, but I don't see any 
> explanation _why_ there is a problem.

My main point is that you merely align the values that you pass to the
cache maintenance functions, but not the actual buffers themselves.

Not sure if this is clear, but the VAs used in cache maintenance
instructions do NOT need to be cache line aligned. The silicon
implementation will take care of this:
"When a DC instruction requires an address argument this takes the form
of a 64-bit register that holds the VA argument. No alignment
restrictions apply for this address."
(ARMv8 ARM DDI 0487F.c D4.4.8 A64 Cache maintenance instructions: The
data cache maintenance instruction (DC), pg. D4-2505)

"When the data is stated to be an MVA, it does not have to be cache line aligned."
(ARMv7 ARM DDI 0406C.d B4.2.1 Cache and branch predictor maintenance
operations, VMSA; section "MVA", pg. B4-1736)

I think I once chased this back to even ARMv6.

What requires alignment are the buffers used with cache maintenance, to
avoid a cache maintenance operation inadvertently affecting other data.

One dangerous operation is a "pure" invalidate on a *dirty* cache line,
as this can lose data.
The other harm I see is cleaning a line after the hardware (some DMA)
has meanwhile updated the backing memory (for instance to update a
status bit or to fill a buffer with data from the device), as the clean
then overwrites the device's update with stale data.

Avoiding those two situations requires careful alignment and padding of
the *buffers*; blanket rounding of any addresses handed to the cache
maintenance operation will not help.
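To make that concrete, the pattern that actually avoids both cases
looks roughly like this (just a sketch, reusing the existing
memalign()/ALIGN()/ARCH_DMA_MINALIGN helpers, not meant as a drop-in
change to this driver):

	#include <common.h>	/* ALIGN() */
	#include <malloc.h>	/* memalign() */
	#include <asm/cache.h>	/* ARCH_DMA_MINALIGN */

	/*
	 * Give a DMA buffer a whole number of cache lines of its own:
	 * it starts on a cache line boundary *and* its size is rounded
	 * up, so no unrelated data (stack, heap, other buffers) can
	 * share a line with it. Only then is invalidating it always
	 * safe.
	 */
	static void *dma_buf_alloc(size_t size)
	{
		return memalign(ARCH_DMA_MINALIGN,
				ALIGN(size, ARCH_DMA_MINALIGN));
	}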

> >>> +
> >>> +static void nvme_flush_dcache_range(void *start, unsigned long size)
> >>> +{
> >>> +       unsigned long s, e;
> >>> +       nvme_align_dcache_range(start, size, &s, &e);
> >>> +       flush_dcache_range(s, e);  
> > 
> > There is no good reason for alignment restrictions when it comes to
> > clean (& invalidate), so there is no need for this wrapper.  
> 
> Is that on ARM64-specific or is that applicable in general ? The driver 
> is expected to work on any CPU.

Cache clean (actually: cache clean&invalidate) is what happens on evictions
all of the time, at the cache controller's discretion. So there is no
real harm in that operation per se. When an eviction happens on a
*clean* cache line, this is basically just an invalidate, which is also not
harmful.

The harmful cases arise when buffers sharing a cache line get "just
invalidated" at one point in time and "cleaned" at another.

> The flushed area start/end address must 
> be cache line aligned, hence the wrapper is correct and needed.

Why? (See below)

> >>> +}
> >>> +
> >>> +static void nvme_invalidate_dcache_range(void *start, unsigned long size)
> >>> +{
> >>> +       unsigned long s, e;
> >>> +       nvme_align_dcache_range(start, size, &s, &e);
> >>> +       invalidate_dcache_range(s, e);  
> > 
> > invalidate is tricky, we could use a wrapper similar to
> > invalidate_dcache_check() in bcmgenet.c. However I am not sure this is
> > very useful here, because most requests are already for a whole buffer
> > size (page or block).
> > 
> > Alignment of the buffer address will be checked by the implementation
> > of invalidate_dcache_range().  
> 
> Both invalidate_dcache_range() and flush_dcache_range() will perform no 
> operation if either start or end address is not cache-line aligned.

Where does this come from? I don't see this. ARMv8 goes directly to
assembly, where there is no check or bail out. It actually aligns the
values passed in:
        /* x2 <- minimal cache line size in cache system */
        sub     x3, x2, #1
        bic     x0, x0, x3
1:      dc      ivac, x0        /* invalidate data or unified cache */
(arch/arm/cpu/armv8/cache.S:__asm_invalidate_dcache_range)
I think it does this to simplify the loop that walks over the range.

On v7 there is indeed a check, but just for inval, not for flush.

Can you say what the reason for this v7 check was? I would say it's a
heads-up to the programmer, to make sure we invalidate only a certain
buffer and nothing else.

BUT: just aligning the address for the *cache operation* is missing
the point, as the instruction works on the whole cache line anyway. We
need to make sure that no other data (stack, heap, buffer) is sharing
the same cache line *and* is dirty, at the time when we invalidate.
Just rounding the addresses merely lets you pass the check_cache_range()
check, but it will not solve the actual issue.
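For reference, that v7 check is just an alignment test plus a warning,
so it cannot know what else lives in the affected lines (paraphrased
from memory, details may differ):

	static int check_cache_range(unsigned long start, unsigned long stop)
	{
		int ok = 1;

		if (start & (CONFIG_SYS_CACHELINE_SIZE - 1))
			ok = 0;

		if (stop & (CONFIG_SYS_CACHELINE_SIZE - 1))
			ok = 0;

		if (!ok)
			warn_non_spl("CACHE: Misaligned operation at range [%08lx, %08lx]\n",
				     start, stop);

		return ok;
	}
(arch/arm/lib/cache.c:check_cache_range, quoted from memory)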

Take an example:
   nvmeq->cqes = (void *)memalign(4096, NVME_CQ_SIZE(depth))
     will result in this allocation (0x4000 is just some address):
 0x4000       0x4010       0x4020
    |  cqes[0]   |   cqes[1]  |

So struct nvme_completion seems to be exactly 16 bytes long, which means
both entries occupy the same cache line (cache line size >= 32 bytes).
With the common 64-byte line size, any data after that (at 0x4020) would
be affected as well.
So if you invalidate one entry, the other will also be invalidated. Any
address between 0x4000 and 0x403f will do that, whether it is the start
address of cqes[0], of cqes[1], or either of them rounded down.

We need to be aware of that. Looking again more closely at this
case, it seems fine here, as we only ever seem to *read* from those
(after initialisation), so they would never become dirty. Invalidating
neighboring entries should do no harm then.
As for other data after the buffer sharing its cache line: this is fine
*here*, because there is another memalign(4096, ...) right after the
memalign above, so the (4096 - 32) bytes after cqes[1] are not used. But
honestly this should either be documented or enforced, by blowing up the
size of the allocation to a multiple of the cache line size.
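Enforcing it would be a one-liner, something like (just a sketch,
reusing ALIGN() and ARCH_DMA_MINALIGN):

	/* Make the CQ allocation itself own whole cache lines, instead
	 * of relying on the *next* memalign() to provide the padding. */
	nvmeq->cqes = (void *)memalign(4096,
			ALIGN(NVME_CQ_SIZE(depth), ARCH_DMA_MINALIGN));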

So to summarise: the *whole* of the CQ entries is correctly aligned and
padded, but only because the *next* memalign has an alignment bigger
than the cache line size.
A *single* CQ entry is not, as it never both starts *and* ends on a
cache line boundary. In this case that is fine, as all CQ entries are
only ever read, so invalidating their always-clean cache lines does no
harm.

> If a 
> driver passes such unaligned address to either function, then the driver 
> must be fixed, because it will fail to perform cache flush/invalidate 
> ops silently and then fail silently, e.g. like this nvme driver does.

If a driver *uses* such an unaligned address for a *pure invalidate*, it
should be carefully investigated whether this is safe, or whether
anything else potentially shares the same cache line. A blanket rounding
will not help with this, and will just paper over the issue.

Cheers,
Andre


> >>> +}
> >>> +
> >>>   static int nvme_wait_ready(struct nvme_dev *dev, bool enabled)
> >>>   {
> >>>          u32 bit = enabled ? NVME_CSTS_RDY : 0;
> >>> @@ -129,8 +150,7 @@ static int nvme_setup_prps(struct nvme_dev *dev, u64 *prp2,
> >>>          }
> >>>          *prp2 = (ulong)dev->prp_pool;
> >>>
> >>> -       flush_dcache_range((ulong)dev->prp_pool, (ulong)dev->prp_pool +
> >>> -                          dev->prp_entry_num * sizeof(u64));
> >>> +       nvme_flush_dcache_range(dev->prp_pool, dev->prp_entry_num * sizeof(u64));  
> > 
> > As mentioned above, there should be no reason to check or round addresses for flush().  
> 
> Again, if the start/end address of the flush/invalidate_dcache_range 
> operation is unaligned, the operation is not performed, hence the 
> alignment is mandatory.
> 
> > It gets a bit more tricky if buffers overlap with buffers that are also
> > invalidated at some point, but that does not seem to be the case here.  
> 
> The buffers are correctly padded already, hence there is no problem with 
> overlap, and hence the rounding is exactly what has to be done.
> 
> >>>
> >>>          return 0;
> >>>   }
> >>> @@ -144,10 +164,8 @@ static __le16 nvme_get_cmd_id(void)
> >>>
> >>>   static u16 nvme_read_completion_status(struct nvme_queue *nvmeq, u16 index)
> >>>   {
> >>> -       u64 start = (ulong)&nvmeq->cqes[index];
> >>> -       u64 stop = start + sizeof(struct nvme_completion);
> >>> -
> >>> -       invalidate_dcache_range(start, stop);  
> > 
> > IIUC, this wants to invalidate a single CQ entry, which is 16 bytes
> > long. So this will also invalidate the neighboring entry, at least
> > (with 64 byte cache line size even three others).
> > Does the hardware require the queue entries to be consecutive?
> > If not, we could blow up struct nvme_completion to have the size of a
> > cache line.
> > If yes, this gets a bit more tricky.  
> 
> The CQ entry is correctly padded, so this concern does not apply.
> Note that this is also mentioned in the commit message.
> 
> > One generic solution would be to use "coherent" memory, which is
> > basically mapped as uncached. And then put the CQ array in there. I
> > think we already have some support in U-Boot for that?  
> 
> Yes, some drivers seems to be misusing coherent memory where it makes 
> zero sense to use it. Here it makes zero sense, since the CQ entries are 
> cacheline aligned and all that needs to be done is correctly apply the 
> cache operations, which is done by this patch.
> 
> >>> +       nvme_invalidate_dcache_range(&nvmeq->cqes[index],
> >>> +                                    sizeof(struct nvme_completion));
> >>>
> >>>          return le16_to_cpu(readw(&(nvmeq->cqes[index].status)));
> >>>   }
> >>> @@ -163,8 +181,7 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
> >>>          u16 tail = nvmeq->sq_tail;
> >>>
> >>>          memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> >>> -       flush_dcache_range((ulong)&nvmeq->sq_cmds[tail],
> >>> -                          (ulong)&nvmeq->sq_cmds[tail] + sizeof(*cmd));
> >>> +       nvme_flush_dcache_range(&nvmeq->sq_cmds[tail], sizeof(*cmd));
> >>>
> >>>          if (++tail == nvmeq->q_depth)
> >>>                  tail = 0;
> >>> @@ -338,8 +355,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
> >>>          nvmeq->cq_phase = 1;
> >>>          nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
> >>>          memset((void *)nvmeq->cqes, 0, NVME_CQ_SIZE(nvmeq->q_depth));
> >>> -       flush_dcache_range((ulong)nvmeq->cqes,
> >>> -                          (ulong)nvmeq->cqes + NVME_CQ_SIZE(nvmeq->q_depth));
> >>> +       nvme_flush_dcache_range(nvmeq->cqes, NVME_CQ_SIZE(nvmeq->q_depth));  
> > 
> > This will flush multiple CQ entries, same as above. If a neighboring
> > entry is waiting for the status bit to flip (as above), this might be
> > harmful, as it might be overwritten.
> > If we can't blow up a single CQ entry, uncached memory might be better?  
> 
> No, see above.
> 
> [...]


