[PATCH] nvme: Fix cache alignment

Marek Vasut marek.vasut at gmail.com
Tue Feb 2 22:18:47 CET 2021


On 2/2/21 5:23 PM, Andre Przywara wrote:
[...]

>>>   drivers/nvme/nvme.c | 50 +++++++++++++++++++++++++++++----------------
>>>   1 file changed, 32 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/nvme/nvme.c b/drivers/nvme/nvme.c
>>> index 5d6331ad34..758415a53b 100644
>>> --- a/drivers/nvme/nvme.c
>>> +++ b/drivers/nvme/nvme.c
>>> @@ -53,6 +53,27 @@ struct nvme_queue {
>>>          unsigned long cmdid_data[];
>>>   };
>>>
>>> +static void nvme_align_dcache_range(void *start, unsigned long size,
>>> +                                   unsigned long *s, unsigned long *e)
>>> +{
>>> +       *s = rounddown((uintptr_t)start, ARCH_DMA_MINALIGN);
>>> +       *e = roundup((uintptr_t)start + size, ARCH_DMA_MINALIGN);
>>> +}
> 
> As mentioned in the other email, just rounding the value that we are
> going to pass to the cache maintenance operation does not make sense,
> so this function should go.

You keep saying that this patch makes no sense, but I don't see any 
explanation of _why_ it is a problem.

>>> +
>>> +static void nvme_flush_dcache_range(void *start, unsigned long size)
>>> +{
>>> +       unsigned long s, e;
>>> +       nvme_align_dcache_range(start, size, &s, &e);
>>> +       flush_dcache_range(s, e);
> 
> There is no good reason for alignment restrictions when it comes to
> clean (& invalidate), so there is no need for this wrapper.

Is that ARM64-specific, or does it apply in general ? The driver is 
expected to work on any CPU. The start/end address of the flushed area 
must be cache-line aligned, hence the wrapper is correct and needed.
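
To illustrate with made-up numbers, assuming ARCH_DMA_MINALIGN is 64:

	/* Hypothetical 16-byte entry at address 0x1010, not line-aligned. */
	unsigned long s, e;

	nvme_align_dcache_range((void *)0x1010, 16, &s, &e);
	/*
	 * s == 0x1000, e == 0x1040 -- both cache-line aligned, so the
	 * subsequent flush/invalidate actually covers the entry instead
	 * of being refused by the alignment check.
	 */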

>>> +}
>>> +
>>> +static void nvme_invalidate_dcache_range(void *start, unsigned long size)
>>> +{
>>> +       unsigned long s, e;
>>> +       nvme_align_dcache_range(start, size, &s, &e);
>>> +       invalidate_dcache_range(s, e);
> 
> invalidate is tricky, we could use a wrapper similar to
> invalidate_dcache_check() in bcmgenet.c. However I am not sure this is
> very useful here, because most requests are already for a whole buffer
> size (page or block).
> 
> Alignment of the buffer address will be checked by the implementation
> of invalidate_dcache_range().

Both invalidate_dcache_range() and flush_dcache_range() perform no 
operation at all if either the start or the end address is not 
cache-line aligned. If a driver passes such an unaligned address to 
either function, then the driver must be fixed, because the cache 
flush/invalidate op is silently skipped and the driver then fails 
equally silently, e.g. like this nvme driver does.
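
On ARM, that check looks roughly like this (paraphrased from 
arch/arm/lib/cache.c; the flush/invalidate implementations bail out 
when it returns 0):

	int check_cache_range(unsigned long start, unsigned long stop)
	{
		int ok = 1;

		/* Both ends of the range must sit on a cache-line boundary. */
		if (start & (CONFIG_SYS_CACHELINE_SIZE - 1))
			ok = 0;

		if (stop & (CONFIG_SYS_CACHELINE_SIZE - 1))
			ok = 0;

		/* A misaligned range only triggers a warning, the op is skipped. */
		if (!ok)
			warn_non_spl("CACHE: Misaligned operation at range [%08lx, %08lx]\n",
				     start, stop);

		return ok;
	}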

>>> +}
>>> +
>>>   static int nvme_wait_ready(struct nvme_dev *dev, bool enabled)
>>>   {
>>>          u32 bit = enabled ? NVME_CSTS_RDY : 0;
>>> @@ -129,8 +150,7 @@ static int nvme_setup_prps(struct nvme_dev *dev, u64 *prp2,
>>>          }
>>>          *prp2 = (ulong)dev->prp_pool;
>>>
>>> -       flush_dcache_range((ulong)dev->prp_pool, (ulong)dev->prp_pool +
>>> -                          dev->prp_entry_num * sizeof(u64));
>>> +       nvme_flush_dcache_range(dev->prp_pool, dev->prp_entry_num * sizeof(u64));
> 
> As mentioned above, there should be no reason to check or round addresses for flush().

Again, if the start/end address passed to flush_dcache_range() or 
invalidate_dcache_range() is unaligned, the operation is not performed 
at all, hence the alignment is mandatory.

> It gets a bit more tricky if buffers overlap with buffers that are also
> invalidated at some point, but that does not seem to be the case here.

The buffers are already correctly padded, so there is no problem with 
overlap, and the rounding is exactly what has to be done.
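
As long as every buffer involved is allocated along these lines, the 
rounded range stays inside the allocation by construction:

	/*
	 * Sketch: start aligned to a cache line, length padded to whole
	 * lines, so rounding the cache op out to ARCH_DMA_MINALIGN can
	 * never touch an unrelated allocation.
	 */
	buf = memalign(ARCH_DMA_MINALIGN, ALIGN(len, ARCH_DMA_MINALIGN));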

>>>
>>>          return 0;
>>>   }
>>> @@ -144,10 +164,8 @@ static __le16 nvme_get_cmd_id(void)
>>>
>>>   static u16 nvme_read_completion_status(struct nvme_queue *nvmeq, u16 index)
>>>   {
>>> -       u64 start = (ulong)&nvmeq->cqes[index];
>>> -       u64 stop = start + sizeof(struct nvme_completion);
>>> -
>>> -       invalidate_dcache_range(start, stop);
> 
> IIUC, this wants to invalidate a single CQ entry, which is 16 bytes
> long. So this will also invalidate the neighboring entry, at least
> (with 64 byte cache line size even three others).
> Does the hardware require the queue entries to be consecutive?
> If not, we could blow up struct nvme_completion to have the size of a
> cache line.
> If yes, this gets a bit more tricky.

The CQ entry is correctly padded, so this concern does not apply.
Note that this is also mentioned in the commit message.
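
And should the padding ever become insufficient, your idea of blowing 
the entry up to a full cache line would amount to something like this 
(hypothetical, not part of this patch):

	/*
	 * Hypothetical wrapper: the alignment attribute rounds sizeof()
	 * up to a multiple of ARCH_DMA_MINALIGN, so each entry owns its
	 * cache line and invalidating one never clobbers a neighbour.
	 */
	struct nvme_completion_padded {
		struct nvme_completion cqe;
	} __aligned(ARCH_DMA_MINALIGN);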

> One generic solution would be to use "coherent" memory, which is
> basically mapped as uncached. And then put the CQ array in there. I
> think we already have some support in U-Boot for that?

Yes, some drivers seem to be misusing coherent memory where it makes 
zero sense to use it. It makes zero sense here too, since the CQ entries 
are cacheline-aligned and all that needs to be done is to apply the 
cache operations correctly, which is what this patch does.
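
(For reference, the support you mean is CONFIG_SYS_NONCACHED_MEMORY / 
noncached_alloc(). Placing the CQ array there would look roughly like 
the sketch below and would remove the need for any cache maintenance on 
it; it is just not needed here.)

	/*
	 * Sketch, assuming CONFIG_SYS_NONCACHED_MEMORY is enabled: the
	 * returned region is mapped uncached, so no flush/invalidate
	 * calls are needed for it at all.
	 */
	nvmeq->cqes = (void *)noncached_alloc(NVME_CQ_SIZE(nvmeq->q_depth),
					      ARCH_DMA_MINALIGN);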

>>> +       nvme_invalidate_dcache_range(&nvmeq->cqes[index],
>>> +                                    sizeof(struct nvme_completion));
>>>
>>>          return le16_to_cpu(readw(&(nvmeq->cqes[index].status)));
>>>   }
>>> @@ -163,8 +181,7 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
>>>          u16 tail = nvmeq->sq_tail;
>>>
>>>          memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>> -       flush_dcache_range((ulong)&nvmeq->sq_cmds[tail],
>>> -                          (ulong)&nvmeq->sq_cmds[tail] + sizeof(*cmd));
>>> +       nvme_flush_dcache_range(&nvmeq->sq_cmds[tail], sizeof(*cmd));
>>>
>>>          if (++tail == nvmeq->q_depth)
>>>                  tail = 0;
>>> @@ -338,8 +355,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
>>>          nvmeq->cq_phase = 1;
>>>          nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
>>>          memset((void *)nvmeq->cqes, 0, NVME_CQ_SIZE(nvmeq->q_depth));
>>> -       flush_dcache_range((ulong)nvmeq->cqes,
>>> -                          (ulong)nvmeq->cqes + NVME_CQ_SIZE(nvmeq->q_depth));
>>> +       nvme_flush_dcache_range(nvmeq->cqes, NVME_CQ_SIZE(nvmeq->q_depth));
> 
> This will flush multiple CQ entries, same as above. If a neighboring
> entry is waiting for the status bit to flip (as above), this might be
> harmful, as it might be overwritten.
> If we can't blow up a single CQ entry, uncached memory might be better?

No, see above.

[...]

