[PATCH] pci: Do not enable PCIe ASMedia ASM2824 workaround by default

Thu May 5 13:46:07 CEST 2022

On Sun, 1 May 2022, Pali Rohár wrote:

> >  By now I have become somewhat tired arguing and explaining matters over 
> > and over again as things have been moving as slow as molasses in this 
> > area, but one point I want to raise here is while it is indeed the ASM2824 
> > device that seems problematic, it may actually be downstream, so you won't 
> > know it's there until you go through the workaround, as observed with the 
> > root port of the SiFive FU740-C000 SOC (which has a separate workaround in 
> > U-boot, clearly for the same issue; cf. `pcie_sifive_force_gen1').  So it 
> > looks like the erratum is going to show up with some device combinations 
> > in which the device enumerator may not have a way to know an ASM2824 is 
> > there until the workaround applied to an upstream device has let the link 
> > work.
> 
> I see that Linux patch was not not merged yet and there are already
> comments that this issue is probably board or arch specific and maybe
> should be in arch/riscv linux dir:
> https://lore.kernel.org/linux-pci/20220421202711.GA1415244@bhelgaas/

 Maybe, maybe not, and given the lack of cooperation shown by ASMedia I am 
not going to spend any money on their hardware to get this figured out, 
such as <https://www.amazon.com/dp/B07PRN2QCV> previously mentioned (plus 
the necessary M.2 to PCIe slot adapters).

 Main concerns with my change so far appear to be some coding style issues 
which are due to my lack of following of the development of newer internal 
interfaces.  That'll be fixed with the next iteration when I get to it 
(hopefully this coming weekend).  I have a lot of stuff in progress (and a 
job and personal life too) and need to prioritise.

> >From that comment I have feeling that the issue could be really specific
> to board or combination of connected devices (ASM2824+PI7C9X2G304) as
> there is really no other report about this issue.
> 
> In any case it is weird.

 As I say, from the findings so far this seems to me like a protocol 
rather than signalling (electrical) issue, i.e. a design problem with 
either or both chips.  A board design fault (such as improper decoupling, 
crosstalk, or whatever) could only result in a signalling issue.  And if 
it was a signalling issue, then the 5GT/s rate couldn't be negotiated, 
whether automatically or by hand as with this workaround, and then work 
reliably over days to weeks.

> >  And as I previously already mentioned the Linux version of the workaround 
> > is only activated by the vendor:device ID because you cannot busy-loop 
> > polling on the Link Training bit in Linux (while you can do it in U-Boot, 
> > because U-Boot is not an OS).
> 
> Is is not _only_ because of this. For a longer time there is a direction

 It is exactly and only for this.  I think I know why I wrote it this way.

> to specify exact list of _affected_ hw which needs workaround. And not
> usage of wildcard which match all hardware, even unaffected. See for
> example comments which are adding workaround for broken GIC HW:
> https://lore.kernel.org/lkml/87ilsutb6w.wl-maz@kernel.org/
> 
> And this makes sense, workarounds should be targeted.

 Your observation only makes sense for a problem contained within a device 
itself and not for a problem with establishing a link between two devices.

 If the problem is caused by a downstream device, then you cannot match on 
it to activate a workaround because you won't know it's there until you 
have activated the workaround.  It hasn't been determined here whether the 
problem is the upstream device, the downstream device or a combination of 
the two.

 And I don't think it is appropriate to put the burden onto the users to 
determine what the case is, by deliberately releasing a limited fix in a 
hope that something will break for someone and they will report it.  It is 
an engineer's responsibility to ensure the quality of the solution made is 
of the highest quality feasible.  If we knew what the exact actual cause 
is, then we could keep the workaround limited to the cases identified by 
the cause.

 Also FWIW the next iteration of my Linux change will no longer match the 
quirk on a device ID and will instead use the PCI bridge class ID with a 
further qualification within.  I had a concern about the safety of the 
change, but I have got it mentally resolved now and only the potentially 
unsafe part will be further qualified by a device ID.

 Sadly I still need to limit the workaround to data link layer link active 
reporting capable devices, which will undoubtedly exclude some up to 5GT/s 
only hardware, due to the need to avoid busy-looping on the Link Training 
bit, which is a no-no in an OS.  So the U-Boot code will remain the only 
workaround for some configurations.

  Maciej