[PATCH] nvme: Fix cache alignment

Andre Przywara andre.przywara at arm.com
Tue Feb 2 17:23:31 CET 2021


On Tue, 2 Feb 2021 11:55:50 +0800
Bin Meng <bmeng.cn at gmail.com> wrote:

Hi,

I had a look at the code; these are my findings:

> On Sun, Jan 31, 2021 at 1:53 AM Marek Vasut <marek.vasut at gmail.com> wrote:
> >
> > The various structures in the driver are already correcty padded and  
> 
> typo: correctly
> 
> > cache aligned in memory, however the cache operations are called on
> > the structure sizes, which themselves might not be cache aligned. Add
> > the necessary rounding to fix this, which permits the nvme to work on
> > arm64.  
> 
> +ARM guys
> 
> Which ARM64 SoC did you test this with?
> 
> The round down in this patch should be unnecessary. But it's better
> to figure out which call to dcache_xxx() is made with an unaligned
> end address.
> 
> >
> > Signed-off-by: Marek Vasut <marek.vasut+renesas at gmail.com>
> > Cc: Bin Meng <bmeng.cn at gmail.com>
> > ---
> >  drivers/nvme/nvme.c | 50 +++++++++++++++++++++++++++++----------------
> >  1 file changed, 32 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/nvme/nvme.c b/drivers/nvme/nvme.c
> > index 5d6331ad34..758415a53b 100644
> > --- a/drivers/nvme/nvme.c
> > +++ b/drivers/nvme/nvme.c
> > @@ -53,6 +53,27 @@ struct nvme_queue {
> >         unsigned long cmdid_data[];
> >  };
> >
> > +static void nvme_align_dcache_range(void *start, unsigned long size,
> > +                                   unsigned long *s, unsigned long *e)
> > +{
> > +       *s = rounddown((uintptr_t)start, ARCH_DMA_MINALIGN);
> > +       *e = roundup((uintptr_t)start + size, ARCH_DMA_MINALIGN);
> > +}

As mentioned in the other email, just rounding the value that we are
going to pass to the cache maintenance operation does not make sense,
so this function should go.
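
If a caller really does pass an unaligned buffer, the fix belongs at
the allocation site instead. Untested sketch of the idea (the entry
count name is made up), using U-Boot's memalign() and ALIGN():

  /* Make both the start address and the length of the PRP pool
   * multiples of the cache line size, so that every flush/invalidate
   * on it is naturally aligned. */
  dev->prp_pool = memalign(ARCH_DMA_MINALIGN,
                           ALIGN(num_prp_entries * sizeof(u64),
                                 ARCH_DMA_MINALIGN));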

> > +
> > +static void nvme_flush_dcache_range(void *start, unsigned long size)
> > +{
> > +       unsigned long s, e;
> > +       nvme_align_dcache_range(start, size, &s, &e);
> > +       flush_dcache_range(s, e);

There is no good reason for alignment restrictions when it comes to
clean (& invalidate), so there is no need for this wrapper.

> > +}
> > +
> > +static void nvme_invalidate_dcache_range(void *start, unsigned long size)
> > +{
> > +       unsigned long s, e;
> > +       nvme_align_dcache_range(start, size, &s, &e);
> > +       invalidate_dcache_range(s, e);

Invalidate is trickier; we could use a wrapper similar to
invalidate_dcache_check() in bcmgenet.c. However, I am not sure that is
very useful here, because most requests already cover a whole buffer
(a page or a block).

Alignment of the buffer address will be checked by the implementation
of invalidate_dcache_range().
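
For reference, such a checking wrapper could look roughly like this
(untested; it warns instead of silently rounding):

  static void nvme_invalidate_dcache_check(uintptr_t start,
                                           unsigned long size)
  {
          /* Complain loudly about unaligned ranges instead of papering
           * over them, so the offending caller can be fixed. */
          if (!IS_ALIGNED(start, ARCH_DMA_MINALIGN) ||
              !IS_ALIGNED(size, ARCH_DMA_MINALIGN))
                  printf("nvme: unaligned invalidate: %lx+%lx\n",
                         (unsigned long)start, size);

          invalidate_dcache_range(start, start + size);
  }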

> > +}
> > +
> >  static int nvme_wait_ready(struct nvme_dev *dev, bool enabled)
> >  {
> >         u32 bit = enabled ? NVME_CSTS_RDY : 0;
> > @@ -129,8 +150,7 @@ static int nvme_setup_prps(struct nvme_dev *dev, u64 *prp2,
> >         }
> >         *prp2 = (ulong)dev->prp_pool;
> >
> > -       flush_dcache_range((ulong)dev->prp_pool, (ulong)dev->prp_pool +
> > -                          dev->prp_entry_num * sizeof(u64));
> > +       nvme_flush_dcache_range(dev->prp_pool, dev->prp_entry_num * sizeof(u64));

As mentioned above, there should be no reason to check or round
addresses for flush(). It gets trickier if flushed buffers share cache
lines with buffers that are also invalidated at some point, but that
does not seem to be the case here.

> >
> >         return 0;
> >  }
> > @@ -144,10 +164,8 @@ static __le16 nvme_get_cmd_id(void)
> >
> >  static u16 nvme_read_completion_status(struct nvme_queue *nvmeq, u16 index)
> >  {
> > -       u64 start = (ulong)&nvmeq->cqes[index];
> > -       u64 stop = start + sizeof(struct nvme_completion);
> > -
> > -       invalidate_dcache_range(start, stop);

IIUC, this wants to invalidate a single CQ entry, which is 16 bytes
long. So this will also invalidate the neighboring entry, at least
(with a 64-byte cache line size, even three others).
Does the hardware require the queue entries to be consecutive?
If not, we could blow up struct nvme_completion to the size of a
cache line.
If yes, this gets a bit more tricky.
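
For the "if not" case, blowing up the entry could be as simple as this
(untested):

  /* Pad each completion entry out to a full cache line, so that
   * invalidating one entry can never touch its neighbors. */
  struct nvme_completion_padded {
          struct nvme_completion cqe;
  } __aligned(ARCH_DMA_MINALIGN);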

One generic solution would be to use "coherent" memory, which is
basically mapped as uncached, and put the CQ array in there. I think we
already have some support for that in U-Boot?
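
Untested sketch of that, assuming CONFIG_SYS_NONCACHED_MEMORY is
enabled and using noncached_alloc() (depth being the queue depth):

  /* Place the CQ array in uncached memory: the device's writes become
   * visible immediately and no cache maintenance is needed at all. */
  nvmeq->cqes = (struct nvme_completion *)(uintptr_t)
          noncached_alloc(NVME_CQ_SIZE(depth), ARCH_DMA_MINALIGN);
  memset((void *)nvmeq->cqes, 0, NVME_CQ_SIZE(depth));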

> > +       nvme_invalidate_dcache_range(&nvmeq->cqes[index],
> > +                                    sizeof(struct nvme_completion));
> >
> >         return le16_to_cpu(readw(&(nvmeq->cqes[index].status)));
> >  }
> > @@ -163,8 +181,7 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd)
> >         u16 tail = nvmeq->sq_tail;
> >
> >         memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
> > -       flush_dcache_range((ulong)&nvmeq->sq_cmds[tail],
> > -                          (ulong)&nvmeq->sq_cmds[tail] + sizeof(*cmd));
> > +       nvme_flush_dcache_range(&nvmeq->sq_cmds[tail], sizeof(*cmd));
> >
> >         if (++tail == nvmeq->q_depth)
> >                 tail = 0;
> > @@ -338,8 +355,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
> >         nvmeq->cq_phase = 1;
> >         nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
> >         memset((void *)nvmeq->cqes, 0, NVME_CQ_SIZE(nvmeq->q_depth));
> > -       flush_dcache_range((ulong)nvmeq->cqes,
> > -                          (ulong)nvmeq->cqes + NVME_CQ_SIZE(nvmeq->q_depth));
> > +       nvme_flush_dcache_range(nvmeq->cqes, NVME_CQ_SIZE(nvmeq->q_depth));

This will flush multiple CQ entries, same as above. If a neighboring
entry is waiting for the status bit to flip (as above), this might be
harmful, as it might be overwritten.
If we can't blow up a single CQ entry, uncached memory might be better?

> >         dev->online_queues++;
> >  }
> >
> > @@ -466,13 +482,13 @@ int nvme_identify(struct nvme_dev *dev, unsigned nsid,
> >
> >         c.identify.cns = cpu_to_le32(cns);
> >
> > -       invalidate_dcache_range(dma_addr,
> > -                               dma_addr + sizeof(struct nvme_id_ctrl));

If I counted correctly, then nvme_id_ctrl is 4K in size, so that should
be fine. And dma_addr seems to come from a "page_size" aligned
allocation, so that should be fine as well.
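
If we want to codify that assumption, a compile-time check could
document it (untested sketch, using U-Boot's BUILD_BUG_ON(), placed
anywhere inside the function):

  /* nvme_id_ctrl is 4096 bytes per the NVMe spec; make sure its size
   * stays a whole multiple of the cache line size. */
  BUILD_BUG_ON(sizeof(struct nvme_id_ctrl) % ARCH_DMA_MINALIGN);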

> > +       nvme_invalidate_dcache_range((void *)dma_addr,
> > +                                    sizeof(struct nvme_id_ctrl));
> >
> >         ret = nvme_submit_admin_cmd(dev, &c, NULL);
> >         if (!ret)
> > -               invalidate_dcache_range(dma_addr,
> > -                                       dma_addr + sizeof(struct nvme_id_ctrl));

same as above.

> > +               nvme_invalidate_dcache_range((void *)dma_addr,
> > +                                            sizeof(struct nvme_id_ctrl));
> >
> >         return ret;
> >  }
> > @@ -729,8 +745,7 @@ static ulong nvme_blk_rw(struct udevice *udev, lbaint_t blknr,
> >         u16 lbas = 1 << (dev->max_transfer_shift - ns->lba_shift);
> >         u64 total_lbas = blkcnt;
> >
> > -       flush_dcache_range((unsigned long)buffer,
> > -                          (unsigned long)buffer + total_len);
> > +       nvme_flush_dcache_range(buffer, total_len);
> >
> >         c.rw.opcode = read ? nvme_cmd_read : nvme_cmd_write;
> >         c.rw.flags = 0;
> > @@ -767,8 +782,7 @@ static ulong nvme_blk_rw(struct udevice *udev, lbaint_t blknr,
> >         }
> >
> >         if (read)
> > -               invalidate_dcache_range((unsigned long)buffer,
> > -                                       (unsigned long)buffer + total_len);

total_len is probably fine; it seems to be at least as big as the
block size, which is hopefully not smaller than a cache line.
"buffer" comes directly from the UCLASS_BLK layer; does anyone know
where those buffers are allocated?

Cheers,
Andre


> > +               nvme_invalidate_dcache_range(buffer, total_len);
> >
> >         return (total_len - temp_len) >> desc->log2blksz;
> >  }  
> 
> Regards,
> Bin


