[U-Boot] [EXT] Re: Cavium/Marvell Octeon Support

Sun Oct 27 02:34:06 UTC 2019

Hi Daniel,

On Friday, October 25, 2019 8:13:57 AM PDT Daniel Schwierzeck wrote:
> Hi Aaron,
> 
> Am 23.10.19 um 05:50 schrieb Aaron Williams:
> > Hi all,
> > 
> > I have been tasked with porting our Octeon U-Boot to the latest U-Boot
> > and merging it upstream. This will involve a very significant amount of
> > code that generally will not be compatible with other MIPS processors
> > due to our needs and requirements. For example, the start.S will need to
> > be completely different than what is present. For example, our existing
> > start.S is 3577 lines of code in order to deal with things like RAS,
> > exceptions, virtual memory and more. We need to use virtual memory since
> > U-Boot can be loaded at any 4MB boundary in memory, not just 0xbfc00000.
> > A number of drivers will need to be updated in order to properly map
> > pointers to physical addresses. This is needed anyway, since I see
> > numerous drivers that assume that a pointer is a DMA address. For MIPS
> > this is never the case (I'm looking at XHCI).
> 
> Good to see some progress in mainline Octeon support. Could you briefly
> describe the differences and commonalities in booting an Octeon CPU
> compared to other "generic" MIPS cores? Or could you point me to a
> public Git tree? It can't be that different because Linux kernel is also
> able to share most of the code ;)
> 

Actually the low level code is significantly different. First of all, we need 
the U-Boot bootloader to be able to boot from different memory locations. 
Because of this, we use mapped memory for U-Boot. A side effect of this is 
that it eliminates the need for relocation when it is shifted to the top of 
memory. All we need to do is just set a couple of TLB entries.

The assembly code is significantly different and is far more extensive.

Additionally, the way Octeon Linux is booted is different.

The generic start.S is not usable in our case.

We have a significant amount of code for dealing with the cache and for things 
like copying U-Boot from flash into the L2 cache. We also have to deal with 
taking other cores out of reset in our start.S. Our exception handler has also 
been extended to handle multiple cores.

Some other things we have included are a native API that allows Simple 
Executive applications to make calls into U-Boot for such things as 
environment variable access as well as access to block devices and 
filesystems.

We used to have our Octeon SDK available for download but it seems this has
been taken down :( I'm trying to find out how I can make it available, but I'm
getting pushback on sharing our U-Boot even though it is GPL.

> In principle you could compile an own start.S in your mach-octeon
> directory, but you should try to use the generic start.S which is
> already customisable and extensible. If needed, we could add more
> extension points to it. Booting from any custom memory address is
> already supported and very common for other MIPS based SoC's. Exception
> support is also already there.
> 

The bootloader needs to be able to start from multiple memory locations 
without recompiling. Our existing bootloader can run from any 4MB boundary 
without recompiling or relocation. It can start out of flash (from any sector 
boundary, not just 0) or L2 cache. Starting from L2 cache is supported by eMMC,
SPI and PCI target bootloaders. Additionally the same bootloader can be 
started from RAM such as when the failsafe bootloader starts the main 
bootloader. In most cases, the failsafe is the same full-featured bootloader 
since it fits entirely within the L2 cache. Our only bootloader requirements
are that it fits in the L2 cache (not strictly needed when booting from flash,
though still preferred for speed) and that it remains under 4 MiB in size.

I believe our exception handling is more extensive than the standard U-Boot 
exception handler. It includes the stack output as well as numerous COP0 
registers and decoding the cause of the exception. The exception handler is 
also independent of a working C environment. We also need to handle exceptions 
occurring on multiple cores as they're brought out of reset and not all cases 
are exceptions. Cores are first powered on and kept in a halted state. Later,
when we start the Linux kernel or simple executive applications, the exception
handler is updated (via a boot bus movable memory region) and an NMI is
generated for those cores. They then begin executing code out of start.S
before moving to the code that sets up the environment for booting Linux
and/or simple executive applications. In the latter case, TLB entries are
programmed in for each core.
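
To make "decoding the cause of the exception" concrete, here is a minimal
sketch in C. It is not our handler (which is assembly and does not depend on
a C environment); it only shows the standard MIPS step of pulling the ExcCode
field out of the CP0 Cause register. The function names are made up for the
example, and the name table stops at code 13 for brevity.

```c
#include <stdint.h>

/* ExcCode lives in bits 6..2 of the MIPS CP0 Cause register. */
static inline unsigned int cause_exccode(uint32_t cause)
{
	return (cause >> 2) & 0x1f;
}

/* Short names for exception codes 0..13; anything else is "unknown" here. */
static const char *exccode_name(unsigned int code)
{
	static const char *const names[] = {
		"Int", "Mod", "TLBL", "TLBS", "AdEL", "AdES", "IBE",
		"DBE", "Sys", "Bp", "RI", "CpU", "Ov", "Tr",
	};

	if (code < sizeof(names) / sizeof(names[0]))
		return names[code];
	return "unknown";
}
```

A handler would print exccode_name(cause_exccode(cause)) next to the raw
register dump so the failure type is readable without a manual lookup.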

> > The new Octeon U-Boot will be native 64-bit instead of how the earlier
> > one was 32-bit using the N32 ABI (so 64-bit addresses could be
> > accessed). We had to jump through some hoops to make a 32-bit U-Boot
> > fully support 64-bit hardware.
> 
> We have 64 bit support for MIPS. I even sync'ed the asm/io stuff from
> Linux in the past (which includes support for Octeon) so that you would
> be able to use the standard IO primitives and ioremap stuff and hook in
> your platform-specific memory mappings.
> 
That is good to know. What I have run into is the fact that many drivers do 
not support I/O remapping. E.g. XHCI assumes that a pointer is a DMA address.
Also, does the 64-bit support handle multiple cores in U-Boot?

I agree about using the standard ioremap stuff. I'm only pointing out that 
there are places where it is missing in the common U-Boot code. Where it is 
present, there won't be any issues since traditionally I used those methods to 
call our platform specific remapping. I will look to see what is present and 
if it will work or not.
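
A small sketch of why a pointer is not a DMA address on MIPS, using the
classic 32-bit unmapped segments as the simplest case. Assumptions: the
device sees CPU physical addresses directly (no bus offset or IOMMU), and
buf_to_dma() is a hypothetical helper name, not an existing U-Boot function.
On Octeon with mapped memory the translation would be platform specific, but
the principle is the same.

```c
#include <stdint.h>

#define KSEG0_BASE 0x80000000u	/* cached, unmapped */
#define KSEG1_BASE 0xa0000000u	/* uncached, unmapped */

/* Physical address behind a 32-bit KSEG0/KSEG1 pointer. */
static inline uint32_t kseg_to_phys(uint32_t vaddr)
{
	return vaddr & 0x1fffffffu;
}

/* Hypothetical helper: the value a driver should program into a DMA engine. */
static inline uint32_t buf_to_dma(uint32_t cpu_vaddr)
{
	return kseg_to_phys(cpu_vaddr);
}
```

A driver that writes the raw pointer value 0x80100000 into a descriptor
instead of buf_to_dma(0x80100000) == 0x00100000 hands the device the wrong
address.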

> > I think we can shrink the code by removing support for starting "simple
> > executive" tasks. Simple executive tasks are bare metal applications
> > that can run on dedicated cores beside Linux (or without Linux). I will
> > also not be porting any support for anything older than Octeon3.
> > 
> > We also make heavy use of our SDK in order to perform hardware
> > initialization and networking. In our old U-Boot, we have almost 900K
> > lines of code. I can cut out much of this but much will remain.
> > 
> > We also have added extensive infrastructure for handling SFP and QSFP
> > cables as well as very extensive phy support for phys from
> > Aquantia/Marvell, Vitesse/Microsemi, Inphi/Cortina and an Avago gearbox.
> > Our customer wants us to port all of this to the new U-Boot and upstream
> > it. I'm worried about the sheer amount of code since it is absolutely
> > massive.
> 
> Maybe you should cut down your customer's expectations a bit. According
> to sloccount we currently have 1.6M SLOC for the whole U-Boot. I guess
> Tom or Wolfgang wouldn't agree with adding another 900k only for one
> CPU. Actually what should be upstream is the basic CPU, driver and board
> support to be able to boot a mainline kernel. Everything else like
> custom bare metal applications or the SFP/PHY handling stuff mentioned
> below could also be maintained in a downstream tree. Maybe Wolfgang is
> willing to host one on gitlab.denx.de.
> 

I will try and cut it down. Much of the code is register definitions. The 
register definition files are auto-generated and tend to be huge. They're 
fully commented and include both big and little endian bitfields. In this case
I can do what I did for OcteonTX and modify the scripts that generate these
headers to strip out the little-endian bitfields and the comments. There is a
huge amount of
code for configuring our QLM hardware interfaces. We also have a lot of code 
for SFP/QSFP ports. 

There are some other huge files that can also be eliminated by dropping 
support for Octeon II and earlier. The error handling files are massive for 
those chips.

Much of the rest can be shrunk somewhat, but a lot of that code is still 
required.

There is a huge amount of code for dealing with our quad-lane modules (QLMs). 
The QLMs can be configured to run in a variety of modes, from PCIe, SGMII, 
SATA, XLAUI, XFI, Interlaken, SRIO, QSGMII, XAUI, RXAUI and more. There is a
lot of tuning and configuration code needed in order to handle different 
clocks, equalization, gain, AGC and a whole host of other serdes issues.

The MAC code is also quite large and complex since there are many coprocessors 
that must be configured. These chips are designed as network processors. While 
it makes their networking quite powerful and fast, it also means that a lot of 
programming is needed before they will work. There are input parser engines, 
buffer management engines, queueing engines, output engines and more that must 
be fully configured before any packets can be sent or received.

There is a fair bit of code used to bring additional cores out of reset. In 
our biggest configuration, there can be two Octeon CN78XX chips connected in 
tandem where each chip has 48 cores. In this case there is a lot of tuning 
that needs to happen with the lanes connecting the two chips before this 
configuration works reliably. There is a tuning process that is required to 
run on both sides (and the second chip runs a small binary image as well to 
perform its half of the tuning).

I do not know if this will change or not but the way the Linux kernel is 
booted on Octeon is not compatible with the standard boot commands. Part of 
this is due to the fact that Linux can be run in parallel with Simple 
Executive applications. It's even possible to run two copies of Linux 
simultaneously on different cores. To go along with this, there is also a 
mechanism with named memory blocks that is used. When bringing cores out of reset
for SE applications, the TLB entries need to be configured. There also is a 
fair bit of code dealing with core masks when choosing which cores are used 
for what.
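
To illustrate the kind of core mask bookkeeping involved, here is a minimal
sketch. The struct and function names are hypothetical and this is not the
SDK's coremask code; it only shows a mask wide enough for 96 cores and the
check that two users (say Linux and an SE application) do not claim the same
core.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_CORES  96	/* two 48-core chips */
#define MASK_WORDS ((MAX_CORES + 63) / 64)

struct coremask {
	uint64_t w[MASK_WORDS];
};

/* Mark a core as used; out-of-range core numbers are ignored. */
static void coremask_set(struct coremask *m, unsigned int core)
{
	if (core < MAX_CORES)
		m->w[core / 64] |= 1ULL << (core % 64);
}

static bool coremask_test(const struct coremask *m, unsigned int core)
{
	if (core >= MAX_CORES)
		return false;
	return (m->w[core / 64] >> (core % 64)) & 1;
}

/* True if no core is claimed by both masks. */
static bool coremask_disjoint(const struct coremask *a,
			      const struct coremask *b)
{
	for (unsigned int i = 0; i < MASK_WORDS; i++)
		if (a->w[i] & b->w[i])
			return false;
	return true;
}
```

Before releasing cores from reset for a second image, the boot command would
check coremask_disjoint() against the cores already handed out.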

We also have a named memory block feature which is used by Linux and simple 
executive applications where blocks of memory can be carved up. U-Boot needs 
to tie into this.
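
The idea of named memory blocks can be sketched roughly as below. This is a
toy table with hypothetical names and sizes, not the SDK's implementation or
its data layout; it only shows the two operations that matter to U-Boot:
recording a region under a unique name and looking it up again later.

```c
#include <stdint.h>
#include <string.h>

#define NAMED_BLOCK_MAX      8
#define NAMED_BLOCK_NAME_LEN 32

struct named_block {
	char name[NAMED_BLOCK_NAME_LEN];	/* empty string = free slot */
	uint64_t base;
	uint64_t size;
};

static struct named_block blocks[NAMED_BLOCK_MAX];

static struct named_block *named_block_find(const char *name)
{
	for (int i = 0; i < NAMED_BLOCK_MAX; i++)
		if (blocks[i].name[0] && !strcmp(blocks[i].name, name))
			return &blocks[i];
	return NULL;
}

/* Record a region; NULL on duplicate name, bad name or full table. */
static struct named_block *named_block_add(const char *name, uint64_t base,
					   uint64_t size)
{
	if (!name[0] || strlen(name) >= NAMED_BLOCK_NAME_LEN ||
	    named_block_find(name))
		return NULL;

	for (int i = 0; i < NAMED_BLOCK_MAX; i++) {
		if (!blocks[i].name[0]) {
			strcpy(blocks[i].name, name);
			blocks[i].base = base;
			blocks[i].size = size;
			return &blocks[i];
		}
	}
	return NULL;
}
```

In a real system the table would live in memory shared with Linux and the SE
applications, which is why U-Boot has to agree on its format.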

There are also numerous other I/O interfaces that we need to initialize.
Unfortunately we also have some errata we need to work around, and a few are
non-trivial.

The DRAM initialization code is also massive. It handles DDR3 and DDR4 for
both registered and unregistered memory with ECC.

In many cases, the reason for the size of the code is due to the complexity of 
the SoC and the platforms built around it. You can think of CN78XX as being 
more like an enterprise-class server than a simple embedded device. The CN73XX 
is not too far behind the CN78XX. The only reason our Octeon TX2 U-Boot is so
much smaller is that most of the early initialization takes place before
U-Boot is started and that a lot of the networking support (such as SFP
management and PHY support) is handled by ATF as well as on-chip management
cores. This is necessary because Linux does not have any SFP management
support nor can it handle the complex topologies we're frequently running into
today. The requirements of Red Hat also preclude any additional software being
installed in order for the networking support to run.

One thing I may need to re-introduce to U-Boot is the temperature sensor 
support for devices like this, since thermal monitoring is important.

Some boards require a background task to perform periodic monitoring for 
certain events, including the board that needs to be upstreamed. I haven't 
checked if anything is available now, but what I did in the past was hook into 
the input function and while waiting for input it calls a user-defined polling 
function.
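
The polling hook described above can be sketched like this. All names are
hypothetical, and the demo_ functions are stand-ins for a real console check
and a real board task such as an SFP or link monitor.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*poll_fn_t)(void);

static poll_fn_t board_poll;	/* hypothetical hook, NULL if unused */

static void set_board_poll(poll_fn_t fn)
{
	board_poll = fn;
}

/*
 * Spin until input_ready() reports true, running the board hook on every
 * pass. Returns the number of passes spent waiting.
 */
static unsigned int wait_for_input(bool (*input_ready)(void))
{
	unsigned int polls = 0;

	while (!input_ready()) {
		if (board_poll)
			board_poll();
		polls++;
	}
	return polls;
}

/* Demo stand-ins: input "arrives" on the fourth check. */
static int demo_ticks;
static int demo_poll_count;

static bool demo_input_ready(void)
{
	return ++demo_ticks > 3;
}

static void demo_poll(void)
{
	demo_poll_count++;
}
```

The hook has to return quickly, otherwise the prompt feels sluggish; anything
slow needs to be split into small steps spread across calls.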

If interrupts are supported it makes the polling job easier.

> > Some of these phy drivers are extremely complex and need to tie
> > into the SFP management. We also need to use a background polling thread
> > while at the command prompt. A fair bit of our phy code is not in the
> > normal phy drivers because it did not fit the model. Some of these phy
> > drivers need to interact with the SFP support code in order to handle
> > hot plug events in order to reconfigure themselves based on the cable
> > type. The existing SFP code handles everything from SFP to SFP28 as well
> > as QSFP and 100G QSFP (never tested).
> > 
> > In the old U-Boot the PHY support had to be significantly enhanced due
> > to requirements for hot-plugging and how some of the PHYs are
> > configured. It gets quite complicated with phys like the Inphi where one
> > phy can handle either four ports (XFI/SGMII) or a single 4-lane port
> > (XLAUI). It gets even worse since in some boards we use reclocking chips
> > and there is one chip that handles the receive path of a QSFP and
> > another that handles the transmit path. Further complicating things,
> > with a QSFP it can be treated either as XLAUI or as four XFI ports, so
> > you can have four ports spread across two chips, with each port using
> > different slices of each chip. In the case of the Inphi/Cortina chip, a
> > single device can handle one or four ports based on the configuration
> > and it is configured by "slice" which is basically an offset into the
> > MDIO register space. We had to jump through hoops in order to have this
> > stuff work in a sane way in the device tree. We added entries for SFP
> > and QSFP slots in the device tree which point to the MACs, GPIOs and I2C
> > bus because pointing them to the phys just got too insane. This will
> > need to be ported to the new U-Boot. It should not break the existing
> > support since most of it was implemented outside of the core PHY
> > handling code. In the port, it would be far better if this could be
> > integrated in. The SFP management code is architecture agnostic as is
> > all of the PHY support. The callbacks for the SFP support are used by
> > the MAC which then notifies the PHY since the MAC often needs to
> > reconfigure itself. It can handle some crazy configurations.
> > 
> > While I see some phy drivers that we also support, i.e. Cortina, our
> > drivers tend to have a lot more functionality. For example, all of our
> > phy drivers that support firmware support commands for upgrading the
> > firmware as well as things like cable testing and other features.
> 
> PHY drivers and ethernet drivers should be really reduced to the
> required functionality to enable basic networking like Ping, DHCP, TFTP.
> U-Boot is still "just" a bootloader and not a system management tool ;)
> You should do that stuff either in Linux or in a downstream fork.
> 

This is the case for the most part. Unfortunately, many of these drivers 
require a lot of code and some require frequent monitoring to make 
adjustments. The SFP support is required to monitor what cable type is plugged 
in and to reprogram the phy as needed based on the type of cable. The 10G and 
25G phys need different settings for optical/active vs passive copper vs SFP 
connectors. In addition, some require different settings based on the cable 
length and in some cases exceptions are needed for certain modules (there are 
a series of Avago SFP to Gigabit modules that require autonegotiation to be 
disabled in 1000Base-X mode). In at least one case there needs to be frequent 
polling to make adjustments (25G) as the equalization settings can change 
based on temperature. The SFP management code identifies the type of cable 
connected and its parameters so that the phy driver can adjust the appropriate 
settings. The SFP management code is generic and not tied to any one type of 
phy or MAC or brand of module. It also monitors all of the GPIO pins and will 
make callbacks when needed. Many phys lack the support for doing this 
themselves. Phys I have worked with that need this support include
Cortina/Inphi and several Microsemi/Vitesse devices.
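
As a rough sketch of the "identify the cable type" step (not our SFP code):
the function below classifies a module from its base ID EEPROM. The byte
offsets are as I understand the SFF-8472 layout (byte 8 bit 2 = passive
cable, bit 3 = active cable, byte 18 = copper length in metres) and should be
checked against the spec. The enum, struct and function names are made up for
the example.

```c
#include <stdint.h>

enum sfp_media {
	SFP_MEDIA_OPTICAL_OR_OTHER,
	SFP_MEDIA_PASSIVE_COPPER,
	SFP_MEDIA_ACTIVE_CABLE,
};

struct sfp_info {
	enum sfp_media media;
	unsigned int cable_len_m;	/* only meaningful for cables */
};

/* eeprom must hold at least the first 64 bytes of the base ID page. */
static struct sfp_info sfp_classify(const uint8_t *eeprom)
{
	struct sfp_info info = { SFP_MEDIA_OPTICAL_OR_OTHER, 0 };

	if (eeprom[8] & 0x04)
		info.media = SFP_MEDIA_PASSIVE_COPPER;
	else if (eeprom[8] & 0x08)
		info.media = SFP_MEDIA_ACTIVE_CABLE;

	if (info.media != SFP_MEDIA_OPTICAL_OR_OTHER)
		info.cable_len_m = eeprom[18];

	return info;
}
```

A phy driver would then pick its equalization and gain settings from a table
keyed on the media type and, for passive copper, the cable length.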

The Inphi devices will typically handle four XFI lanes with four
bidirectional slices, with each slice given a different register range. Further
complicating matters is that a QSFP port can either be four XFI interfaces or 
a single XLAUI interface. We have code to update the firmware for the Inphi 
chips, but this is small compared to the rest of the initialization code. 
These chips require that equalization and gain be configured on each slice 
based on the board and cable characteristics as well as LED configuration.
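
The per-slice register range can be pictured as a simple offset calculation.
This is only a sketch: the stride between slices is device specific and is a
parameter here, not a value taken from any Inphi datasheet, and the function
name is hypothetical.

```c
#include <stdint.h>

#define SLICES_PER_CHIP 4

/*
 * Compute the register address for 'reg' within 'slice'.
 * 'stride' is the (assumed, device-specific) distance between slice banks.
 * Returns 0 on success, -1 if the inputs do not fit.
 */
static int slice_reg_addr(unsigned int slice, uint16_t reg, uint16_t stride,
			  uint16_t *addr)
{
	uint32_t a;

	if (slice >= SLICES_PER_CHIP || reg >= stride)
		return -1;

	a = (uint32_t)slice * stride + reg;
	if (a > 0xffff)
		return -1;

	*addr = (uint16_t)a;
	return 0;
}
```

With the slice number kept in the device tree, the same driver code can then
serve one 4-lane XLAUI port or four independent XFI ports.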

With the Microsemi reclocking chips, each chip has four unidirectional lanes. 
For a QSFP port, two chips are required with one chip configured for ingress 
and the other for egress. This can support either XLAUI or four XFI 
interfaces. When it is configured for XFI there are four XFI interfaces, and
the four MACs share the two chips, with each MAC going to one lane on each
chip.

Also making things fun is that Inphi and the reclocking chips do not conform 
to the Clause 45 standard at all. In the case of Inphi, the ID registers are
0.0 and 0.1 instead of 1.2 and 1.3 as they are in Clause 45.
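
One way to cope with the non-standard ID location is to make it data rather
than code. The sketch below is an illustration only: the struct, the table
entries and the read callback are hypothetical, and it assumes the first
register holds the upper 16 bits of the ID. demo_read() fakes an MDIO bus
with a small array so the example is self-contained.

```c
#include <stdint.h>

struct phy_id_loc {
	uint8_t devad;		/* MDIO device address (MMD) */
	uint16_t reg_hi;	/* register with ID bits 31..16 */
	uint16_t reg_lo;	/* register with ID bits 15..0 */
};

/* Clause 45 location (1.2 / 1.3) and the 0.0 / 0.1 variant. */
static const struct phy_id_loc id_loc_c45 = { 1, 2, 3 };
static const struct phy_id_loc id_loc_dev0 = { 0, 0, 1 };

typedef uint16_t (*mdio_read_fn)(void *bus, uint8_t devad, uint16_t reg);

static uint32_t phy_read_id(mdio_read_fn rd, void *bus,
			    const struct phy_id_loc *loc)
{
	uint32_t hi = rd(bus, loc->devad, loc->reg_hi);
	uint32_t lo = rd(bus, loc->devad, loc->reg_lo);

	return (hi << 16) | lo;
}

/* Fake bus for the demo: 'bus' points at a [devad][reg] table. */
static uint16_t demo_read(void *bus, uint8_t devad, uint16_t reg)
{
	const uint16_t (*regs)[4] = bus;

	return regs[devad][reg];
}
```

The probe code then tries each known location in turn and matches the result
against the driver's ID table.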

The MAC drivers are also non-trivial. The Octeon chips are designed as network 
processors with a lot of hardware offloading and coprocessors. Bringing up a 
"simple Ethernet" interface is anything but simple. There are numerous offload 
engines that must be configured before it will work. While we do have one
"simple" interface that can be configured, it often isn't used because it's
usually only good for a management port, many boards do not have it, and
customers want to be able to use any port.

Just configuring the interface between the MAC and PHY is also non-trivial. 
The Octeon (and later CPUs) have what are called "QLMs" or quad lane modules. 
These QLMs contain programmable serdes which can be configured for PCIe, SATA, 
XFI, XAUI, RXAUI, SGMII, 1000Base-X, XLAUI and a whole host of other interface 
types with a lot of tuning for things like equalization and clocks. The amount 
of QLM initialization code is quite large but necessary. There are a lot of 
clock and analog tuning parameters and sequences that must be run.

Sadly all of this is needed just for basic ping and DHCP. This isn't like a 
simple e1000 NIC or the NICs common with most SoCs.

Think of scaling from a Raspberry Pi to a dual-CPU Xeon enterprise-class
server with 96 cores and 256GiB of RAM with 10, 25 and 40GbE ports but without
a BMC or MCU to handle low-level board changes while also having many
enterprise-class requirements for RAS, etc. That is why our code is so large 
and complex. There are a lot of hardware engines for offloading a lot of tasks 
since the chips are often used in security appliances. There are engines for 
ZIP compression, hardware regex engines, packet ordering engines, packet 
parsing engines, buffer management engines, RAID engines and a whole host of 
others. Many are not used in U-Boot, but a fair number are required for basic 
packet I/O.

For example, one of the boxes contains a CN78XX with 8 10G ports (which can
also be configured as XLAUI, 4 to 1, using a QSFP to SFP+ splitter cable). It
has 128GiB of registered DDR4 DIMMs, 4 SATA drives, redundant power
supplies and a whole host of other things including multiple temperature 
monitors. This uses an Inphi/Cortina phy chip that requires full SFP 
management support. With Inphi phys, the phy cannot drive LEDs based on 
traffic since it has no concept of packets, especially in XLAUI mode since 
each lane is independent of the others.

Another board, one I specifically have been told to upstream, is a NIC that
contains a CN73XX and two 10G/25G ports that go through a complex gearbox 
chip. Since there is no hardware support for LEDs in the Octeon SoC to 
indicate link and packet I/O this must be done in software (including U-Boot, 
customer requirement) and SFP port management is also a must. The phy is not 
at all a traditional phy. It uses i2c instead of MDIO and requires frequent 
monitoring of the link parameters (it's an older custom gearbox chip, there 
are newer and better chips that don't require this now). I have a hook that
runs while U-Boot is sitting at the prompt, which allows background tasks to
operate while it waits for input.

I have several other NICs to support that use a Microsemi reclocking chip that 
has four unidirectional lanes per chip. The chip has zero intelligence and is 
shared between ports (and on some devices, multiple chips are shared between 
ports). Everything must be tuned based on the SFP/QSFP module type and cable 
length. LEDs also must be software driven (the software driving of LEDs is
eliminated in OcteonTX2). These chips have no way to drive the LEDs themselves
to indicate packet I/O or link status.

There are also other boards that use the Microsemi reclocking chips. They were 
chosen in part due to the power budget and these chips are very low power (and 
inexpensive).

In all of these phy cases, all of the parameters are maintained in the device 
tree so the drivers are generic. Unfortunately these drivers also require SFP 
and QSFP management support.

I figure if there are several boards I need to upstream, it's not much more 
effort to port all of the boards to the new U-Boot. I've worked hard to 
minimize the board-specific code and make as much of it generic and based on 
the device tree as possible.

Someday I would love for SFP/QSFP infrastructure to get into Linux. Some NIC 
cards do it in their drivers, but I'd like to see generic infrastructure (like 
my U-Boot support). This might make it harder for some drivers to only support 
certain brands of modules too :) The generic code I wrote works with most 
modules except Intel (because they have bad checksums, but counterfeit Intel 
modules work fine!). It still can be expanded at some point since there is no 
support for module diagnostics other than identifying if it is present. Pretty 
much all it does is monitor the GPIO pins and parse and decode the EEPROM. The 
SFP code is generic enough such that any phy driver that needs it can easily 
hook into it.
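
For reference, the checksum test mentioned above amounts to very little code.
This sketch assumes the usual base ID layout where byte 63 (CC_BASE) holds the
low 8 bits of the sum of bytes 0 to 62; the function names are made up for
the example and this is not the code from our tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low 8 bits of the sum of buf[first..last], inclusive. */
static uint8_t sfp_sum(const uint8_t *buf, unsigned int first,
		       unsigned int last)
{
	uint8_t sum = 0;

	for (unsigned int i = first; i <= last; i++)
		sum += buf[i];
	return sum;
}

/* eeprom must hold at least the first 64 bytes of the base ID page. */
static bool sfp_cc_base_ok(const uint8_t *eeprom)
{
	return sfp_sum(eeprom, 0, 62) == eeprom[63];
}
```

Whether to reject a module on a bad checksum or just warn is a policy
decision; a warning is the friendlier choice when the rest of the EEPROM
still parses.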

> > Our bootloader needs to be able to be booted from a variety of sources,
> > including SPI, eMMC, NOR flash and booting over the PCI bus from a host
> > system. This is one reason we use virtual memory. The other reason is
> > that it eliminates the need to perform relocation. Our start.S code
> > handles all of these different cases as well as exception handling.
> 
> This is already supported for MIPS. You should try to use the generic
> SPL framework for that. Whether you like the relocation or not, it's one
> of the basic design principles of U-Boot. I guess it likely won't be
> accepted if you circumvent this. In fact by now we're sharing the same
> technology as Linux to have relocatable binaries without using gcc's
> -fPIC or -mabicalls to reduce the binary footprint. You can configure
> gd->ram_top to any address of your liking as reference address for the
> relocation.
> 

I will look into this. One other complication is the fact that we require both 
a failsafe as well as a default bootloader. With the older U-Boot we got 
around all of this by just using TLB entries to map U-Boot to always run in 
the same virtual address regardless of the physical address. It eliminated any 
need for -fPIC and helped keep the binary small. For our older bootloader, it 
always executes at 0xC0000000 regardless of where it sits in physical memory. 
Using virtual memory also helps keep U-Boot simple and small.
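
The address arithmetic behind the fixed mapping is simple; a sketch follows.
The 0xC0000000 base and the 4 MiB window come from the description above, but
the macro and function names are hypothetical and the real work (programming
the TLB entries) is done in start.S, not in C.

```c
#include <stdbool.h>
#include <stdint.h>

#define UBOOT_VIRT_BASE 0xC0000000u
#define UBOOT_WINDOW    (4u << 20)	/* 4 MiB */

/* The image may sit at any 4 MiB-aligned physical address. */
static bool load_base_ok(uint64_t phys_base)
{
	return (phys_base & (UBOOT_WINDOW - 1)) == 0;
}

/*
 * Translate an address inside the U-Boot window to its physical address.
 * Returns 0 on success, -1 for a misaligned base or an address outside
 * the window.
 */
static int uboot_virt_to_phys(uint32_t vaddr, uint64_t phys_base,
			      uint64_t *paddr)
{
	if (!load_base_ok(phys_base) || vaddr < UBOOT_VIRT_BASE ||
	    vaddr - UBOOT_VIRT_BASE >= UBOOT_WINDOW)
		return -1;

	*paddr = phys_base + (vaddr - UBOOT_VIRT_BASE);
	return 0;
}
```

Because the code always sees the same virtual addresses, only phys_base
changes between flash, L2 cache and DRAM, and nothing needs to be relocated.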

> > I will also say up front that the memory initialization code is a mess
> > and quite large (it was written by a hardware engineer who never heard
> > of functions).
> > 
> > One thing is that this will break mips unless it is refactored like ARM
> > is, for example, separating armv7 and armv8. This way we could have
> > arch/mips/cpu/octeon. I did this with the old bootloader to separate our
> > stuff. I'm open to suggestions as for the naming. I don't see how we can
> > share much of the code with the other MIPS CPUs.
> 
> We have the same mach directory handling as in Linux MIPS. So you could
> easily add all your platform specific code (except drivers) to
> arch/mips/mach-octeon or (-cavium). Inside that directory you can have
> an include directory for your custom header files, you can even override
> the generic files from arch/mips/include like in Linux. arch/mips/cpu
> and arch/mips/lib should only contain generic code. As already mentioned
> you could provide an own start.S inside arch/mips/mach-octeon but if
> possible you should try to reuse or extend the generic variant.
> 

We can't use the existing start.S. We have a lot of requirements that are not 
supported there as well as a fair bit of code dedicated to dealing with the 
cache and TLBs and bringing additional cores out of reset. We make use of a 
boot bus movable region in order to do this and handle other cases like NMIs 
and the watchdog. Our start.S currently sits at around 3800 lines of code. 
Some is common but most is not.

Our start.S is designed to be able to boot both a failsafe and non-failsafe 
image and supports adjusting the flash mapping in order to start from an 
offset other than zero in the flash. There is also a fair bit of code for 
copying the image out of flash into the L2 cache for a significant speedup for 
DRAM initialization. I'm trying to get permission to share our existing code 
but I'm getting push-back (even though it's GPL!?!). How they want me to 
upstream it without sharing the code is beyond me.

While U-Boot has an exception handler, I believe ours is more comprehensive. 
It is written entirely in assembler and is not dependent on a working C 
runtime environment. It also dumps more information than just the registers 
such as the stack and a number of other exception registers and does some 
exception decoding. It's quite a bit better than the ARMv8 exception handler 
IMHO.

Putting this under mach-octeon will make it much easier. I'll try and re-use 
where I can.

> > All in all, I think the final port will add between 500K-1M lines of
> > code for the Octeon CPU. It is much more extensive than what is required
> > for OcteonTX since in the latter case most of the hardware
> > initialization is done by earlier stage bootloaders and the ATF handles
> > things like SFP port management and many of the networking operations.
> > 
> > I'm not sure how well I'll be able to upstream all of this code at this
> > point since I was just handed this task. We already have at least 1M
> > lines of code added to the old U-Boot which is based off of 2013.08 with
> > a lot of backports.

I'm trying to get our existing code made available someplace online. I'm
getting pushback even though U-Boot is GPL and the license on our SDK is
BSD-like (i.e. do whatever you want but don't hold us responsible). It looks
like it used to be available but was taken down. I don't understand lawyers. All
of the code I wrote is GPL. There is some U-Boot specific code in our SDK, but 
none was copied from U-Boot. There also is some duplication of functionality 
between U-Boot and our SDK that I'll try and eliminate.

I have implemented just about every feature in U-Boot I could with our Octeon 
SoC. That's another reason it's so large. Some customer always comes back and 
says they want feature X to work. Fortunately, the changes to the U-Boot 
supplied code are generally minimal, despite it being so large.

I likely will need to add some more hooks to board_f.c and board_r.c. I have 
run into many cases where we need a specific order of initialization that does 
not match the normal U-Boot order. Perhaps make init_sequence_f and 
init_sequence_r weak so that they can be overridden if needed by a specific 
board or architecture. While much of the current init order works, we need
some things initialized as quickly as possible and others initialized later. 
For example, the first thing we call is an early_errate_workaround function in 
the init sequence before anything else is called. 
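
A sketch of what an overridable init sequence could look like. This is a
proposal-style illustration, not existing U-Boot code: board_init_sequence()
and the step_ functions are hypothetical, and the weak attribute is the
GCC/Clang extension. The list walker is similar in spirit to how U-Boot runs
its init call lists (in order, stopping at the first failure).

```c
#include <stddef.h>

typedef int (*init_fnc_t)(void);

/* Run a NULL-terminated list in order; stop at the first failure. */
static int run_init_list(const init_fnc_t *seq)
{
	for (; *seq; seq++) {
		int ret = (*seq)();

		if (ret)
			return ret;
	}
	return 0;
}

/* Demo steps that just record the order in which they ran. */
static int trace[8], trace_len;

static int step_errata(void) { trace[trace_len++] = 1; return 0; }
static int step_serial(void) { trace[trace_len++] = 2; return 0; }
static int step_dram(void)   { trace[trace_len++] = 3; return 0; }

static const init_fnc_t default_seq[] = { step_serial, step_dram, NULL };

/* What a platform needing an early errata workaround might supply. */
static const init_fnc_t errata_first_seq[] = {
	step_errata, step_serial, step_dram, NULL
};

/*
 * Weak default: a board or architecture file can provide a strong
 * definition that returns its own ordering (e.g. errata_first_seq).
 */
__attribute__((weak)) const init_fnc_t *board_init_sequence(void)
{
	return default_seq;
}
```

Generic code would call run_init_list(board_init_sequence()), so platforms
that are happy with the default order need no changes at all.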

Regards,

-Aaron