[U-Boot] [EXT] Re: Cavium/Marvell Octeon Support

Aaron Williams awilliams at marvell.com
Wed Oct 30 21:23:23 UTC 2019

Hi Daniel,

On Wednesday, October 30, 2019 9:20:31 AM PDT Daniel Schwierzeck wrote:
> Hi Aaron,
> Am 27.10.19 um 03:34 schrieb Aaron Williams:
> > Hi Daniel,
> > 
> > On Friday, October 25, 2019 8:13:57 AM PDT Daniel Schwierzeck wrote:
> >> Hi Aaron,
> >> 
> >> Am 23.10.19 um 05:50 schrieb Aaron Williams:
> >>> Hi all,
> >>> 
> >>> I have been tasked with porting our Octeon U-Boot to the latest U-Boot
> >>> and merging it upstream. This will involve a very significant amount of
> >>> code that generally will not be compatible with other MIPS processors
> >>> due to our needs and requirements. For example, the start.S will need to
> >>> be completely different than what is present. Our existing
> >>> start.S is 3577 lines of code in order to deal with things like RAS,
> >>> exceptions, virtual memory and more. We need to use virtual memory since
> >>> U-Boot can be loaded at any 4MB boundary in memory, not just 0xbfc00000.
> >>> A number of drivers will need to be updated in order to properly map
> >>> pointers to physical addresses. This is needed anyway, since I see
> >>> numerous drivers that assume that a pointer is a DMA address. For MIPS
> >>> this is never the case (I'm looking at XHCI).
> >> 
> >> Good to see some progress in mainline Octeon support. Could you briefly
> >> describe the differences and commonalities in booting an Octeon CPU
> >> compared to other "generic" MIPS cores? Or could you point me to a
> >> public Git tree? It can't be that different because Linux kernel is also
> >> able to share most of the code ;)
> > 
> > Actually the low level code is significantly different. First of all, we
> > need the U-Boot bootloader to be able to boot from different memory
> > locations. Because of this, we use mapped memory for U-Boot. A side
> > effect of this is that it eliminates the need for relocation when it is
> > shifted to the top of memory. All we need to do is just set a couple of
> > TLB entries.
> Understood, but still U-Boot relocates itself from its initial entry
> memory address to its destination memory address based on gd->ram_top.
> Maybe this is ineffective nowadays with various SPL/TPL boot methods
> because U-Boot proper is already loaded to an executable memory location
> by SPL, but you have to initially deal with that design. Feel free to
> suggest/submit a patch for the generic board init code to make the
> relocation configurable.
We do this relocation as well, however the way we do it is by changing a 
couple of TLB entries. This lets U-Boot begin execution from any memory 
location, be it flash, L2 cache or RAM. It also lets us statically link U-Boot 
to run at a fixed address, in our case 0xC0000000. The relocation happens 
transparently in the start.S code. This also makes our bootloader smaller. 
None of the U-Boot code is affected since on MIPS pointers cannot be used for 
DMA anyway. The functions that map pointers to DMA addresses work as they 
should. The only issues I have found are drivers that don't use this and would 
break on MIPS anyway. We have a SPL loader for our CN7XXX series since the L2 
cache is too small to otherwise fit the entire bootloader. Even this is a 
challenge to make fit since the code to initialize DDR4 memory is very large 
so every bit of space savings helps.

As far as U-Boot is concerned, we just treat it as if relocation is disabled 
since with virtual memory it isn't needed.  I even got it working with the API 
for running standalone apps without requiring any changes to the existing code 
other than to add the MIPS specific changes for our environment.

This might be something to consider in the future on some platforms where 
"relocation" could be performed by just adjusting the TLB or page tables. MIPS 
makes this particularly easy.
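
To illustrate the idea in C (a simplified model only, not the actual start.S
logic): the image stays linked at 0xC0000000 and "relocation" just changes
which physical window that address resolves to. The struct and function names
below, and the use of a single 4 MiB window, are assumptions made for the
sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define UBOOT_LINK_VADDR  0xC0000000u          /* fixed link address from the text */
#define UBOOT_WINDOW_SIZE (4u * 1024u * 1024u) /* assumed: one 4 MiB window */

/* Hypothetical model of the mapping that the TLB entries would provide. */
struct uboot_mapping {
    uint64_t phys_base; /* where the image really sits: flash, L2 or RAM */
};

/* "Relocate" by pointing the window at a new 4 MiB-aligned physical base. */
static bool uboot_remap(struct uboot_mapping *m, uint64_t new_phys_base)
{
    if (new_phys_base % UBOOT_WINDOW_SIZE != 0)
        return false; /* only 4 MiB boundaries are accepted */
    m->phys_base = new_phys_base;
    return true;
}

/* Translate a U-Boot virtual address to the physical address behind it. */
static bool uboot_virt_to_phys(const struct uboot_mapping *m, uint32_t vaddr,
                               uint64_t *paddr)
{
    if (vaddr < UBOOT_LINK_VADDR ||
        vaddr - UBOOT_LINK_VADDR >= UBOOT_WINDOW_SIZE)
        return false; /* outside the mapped window */
    *paddr = m->phys_base + (vaddr - UBOOT_LINK_VADDR);
    return true;
}
```

In this model the code and its pointers never change; only phys_base does,
which is why nothing has to be patched or copied at the C level.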

I have attached a copy of our existing start.S code. It needs a bit of work 
for the new U-Boot since currently locking the cache and allocating GD on the 
stack are done in board_init_f(). The changes are fairly easy to make. I also 
need to strip out the code for CN6XXX and earlier.

> > The assembly code is significantly different and is far more extensive.
> > 
> > Additionally, the way Octeon Linux is booted is different.
> > 
> > The generic start.S is not usable in our case.
> > 
> > We have a significant amount of code for dealing with the cache and for
> > things like copying U-Boot from flash into the L2 cache. We also have to
> > deal with taking other cores out of reset in our start.S. Our exception
> > handler has also been extended to handle multiple cores.
> it's hard to discuss this without example code but I still think the
> basic principles of cache and exception handling can't be that different
> from generic MIPS cores. Locking cache lines and loading code to it
> could be useful for other MIPS platforms and should be added as generic
> feature. BTW the exception handler code is a port of the Linux one, I
> only skipped the stack trace output because of the complicated stack
> unwinding code. I think the current dump of general and CP0 and EPC
> registers is more than feasible for a bootloader. It already helped me
> multiple times to quickly locate code locations with e.g. null pointer
> dereferencing.
I have attached our start.S code which includes this. In addition, our version 
also dumps out the stack. NULL pointers aren't the easiest to catch since 
typically 0 is a valid memory location. I suppose I could just add a TLB entry 
to mark the first 4K memory as invalid.

> > Some other things we have included are a native API that allows Simple
> > Executive applications to make calls into U-Boot for such things as
> > environment variable access as well as access to block devices and
> > filesystems.
> This is one of the parts that shouldn't be needed for basic upstream
> support. If your API is a parallel and independent implementation of the
> API that U-Boot already has for standalone applications, then I'm afraid
> this won't be accepted and should be kept in a downstream fork.
That's fine. The code is actually quite small. It has some custom APIs unique 
to our needs. We need to call into the phy code from these applications. 
I don't know if this could work with the general API or not. One reason we did 
this is because we wanted all addresses passed to U-Boot to be physical 
addresses. We need to context switch since these applications have their own 
memory mapping (hence the requirement for physical addresses). We save the TLB 
mapping of the application and set up the U-Boot TLBs then restore that 
afterwards. For pointers we just use XKPHYS addresses. With the API, though, I 
set it up so that applications are linked at another virtual address which can 
access the U-Boot virtual address directly. I think I used 0xd0000000 for 
those. This didn't require any changes to the API other than the assembly code 
and linker scripts.
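
For reference, a minimal sketch of how an XKPHYS address is formed on MIPS64
in general: bits 63:62 select the segment, bits 61:59 carry the cache
coherency attribute and the low bits hold the physical address. The helper
names are hypothetical, and which attribute value applies on a given Octeon
part is left as a parameter rather than something stated above.

```c
#include <stdint.h>

#define XKPHYS_BASE      0x8000000000000000ull /* bits 63:62 = 0b10 */
#define XKPHYS_ADDR_MASK 0x07ffffffffffffffull /* bits 58:0 */

/* Build an XKPHYS virtual address from a cache attribute and a physical address. */
static uint64_t phys_to_xkphys(unsigned int cca, uint64_t phys)
{
    return XKPHYS_BASE | ((uint64_t)(cca & 0x7u) << 59) |
           (phys & XKPHYS_ADDR_MASK);
}

/* Recover the physical address by dropping the segment and attribute bits. */
static uint64_t xkphys_to_phys(uint64_t vaddr)
{
    return vaddr & XKPHYS_ADDR_MASK;
}
```

Because the translation is a fixed bit pattern, such a pointer stays valid no
matter which TLB mappings are installed at the time of the call.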

> > We used to have our Octeon SDK available for download but it seems this
> > has been taken down :( I'm trying to find out how I can make it available
> > but I'm getting pushback in sharing our GPLed U-Boot even though it is GPL.
> > 
> >> In principle you could compile an own start.S in your mach-octeon
> >> directory, but you should try to use the generic start.S which is
> >> already customisable and extensible. If needed, we could add more
> >> extension points to it. Booting from any custom memory address is
> >> already supported and very common for other MIPS based SoC's. Exception
> >> support is also already there.
> > 
> > The bootloader needs to be able to start from multiple memory locations
> > without recompiling. Our existing bootloader can run from any 4MB boundary
> > without recompiling or relocation. It can start out of flash (from any
> > sector boundary, not just 0) or L2 cache. Starting from L2 cache is
> > supported by eMMC, SPI and PCI target bootloaders. Additionally the same
> > bootloader can be started from RAM such as when the failsafe bootloader
> > starts the main bootloader. In most cases, the failsafe is the same
> > full-featured bootloader since it fits entirely within the L2 cache. Our
> > only bootloader requirement is that it fits in the L2 cache (except when
> > booting from Flash, though this is preferred for speed) and that it
> > remain under 4 MiB in size.
> > 
> > I believe our exception handling is more extensive than the standard
> > U-Boot
> > exception handler. It includes the stack output as well as numerous COP0
> > registers and decoding the cause of the exception. The exception handler
> > is
> > also independent of a working C environment. We also need to handle
> > exceptions occurring on multiple cores as they're brought out of reset
> > and not all cases are exceptions.
> as I wrote above, the current exception handling is already feasible in
> almost all cases to quickly locate code bugs and doesn't need much code.
> Adding stack trace output would require adding a lot more code. But
> if you are only missing some registers or want to dump the stack itself,
> feel free to extend the current code.

That's fine. The only other thing we do is we carve out a bit of the L1 cache 
for a temporary stack. That way the exception handler has zero dependency on 
memory. Currently it's all in assembly language as well.
> > Cores are first powered on and kept in a halted state, then

We do more than that. We need to take the cores out of the halted state and do 
some more processing before starting applications. I hope to provide some 
examples later.

> > later when we start the Linux kernel or simple executive applications, the
> > exception handler is updated (via a bootbus moveable memory region)  and
> > an
> > NMI is generated for the cores where they will begin executing code out of
> > start.S before moving to the code that sets up the environment for booting
> > Linux and/or simple executive applications. In the latter case, TLB
> > entries
> > are programmed in for each core.
> > 
> >>> The new Octeon U-Boot will be native 64-bit instead of how the earlier
> >>> one was 32-bit using the N32 ABI (so 64-bit addresses could be
> >>> accessed). We had to jump through some hoops to make a 32-bit U-Boot
> >>> fully support 64-bit hardware.
> >> 
> >> We have 64 bit support for MIPS. I even sync'ed the asm/io stuff from
> >> Linux in the past (which includes support for Octeon) so that you would
> >> be able to use the standard IO primitives and ioremap stuff and hook in
> >> your platform-specific memory mappings.
> > 
> > That is good to know. What I have run into is the fact that many drivers
> > do
> > not support I/O remapping. I.e. XHCI assumes that a pointer is a DMA
> > address. Also, does the 64-bit support handle multiple cores in U-Boot?
> we already have stuff like dev_remap_addr(struct udevice* dev) as part
> of the driver model API to map your physical addresses from device tree
> to virtual addresses. This is used in all drivers compatible with MIPS.
> That function is backed by the MIPS specific ioremap_nocache() function
> (also ported from Linux) so that you can hook in platform specific
> mapping code. If you want to use existing drivers which don't do
> remapping yet, you have to patch them. But this should be simple, we
> recently did that on Broadcom or Mediatek platforms, which are sharing
> drivers between their MIPS and ARM CPUs.
That's what we take advantage of :) This allows the drivers to work fine when 
virtual memory is used.

> For XHCI you probably only need to patch the xhci_readl() and
> xhci_writel() functions and establish the memory mappings in your
> platform specific glue code. But USB support shouldn't be your first
> priority ;)
The readl and writel are used for accessing the registers. Those aren't the 
problem. The problem comes when setting up the descriptors in memory. The 
descriptors need to use the memory mapping. That's the part that's missing. 
It's not difficult to fix. I think I also found a few endian issues as well 
since we run in big endian mode.
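
A rough sketch of the two points above, under stated assumptions: the
descriptor must carry a bus (DMA) address rather than the CPU pointer, and its
fields must be stored little-endian even on a big-endian CPU. The descriptor
layout and all names here are simplified stand-ins, not the real XHCI TRB
definition or an existing U-Boot helper. demo_virt_to_dma() simply masks the
address the way KSEG0/KSEG1 addresses map to physical addresses on 32-bit
MIPS; a real platform would use its own mapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a platform hook that maps a CPU pointer to a bus address. */
static uint64_t demo_virt_to_dma(const void *ptr)
{
    return (uint64_t)(uintptr_t)ptr & 0x1fffffffull;
}

/* Store values little-endian byte by byte, independent of CPU endianness. */
static void put_le32(uint8_t *dst, uint32_t v)
{
    for (size_t i = 0; i < 4; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

static void put_le64(uint8_t *dst, uint64_t v)
{
    for (size_t i = 0; i < 8; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
}

/* Simplified descriptor: 64-bit buffer address, 32-bit length, 32-bit flags. */
struct demo_desc {
    uint8_t raw[16];
};

static void demo_desc_fill(struct demo_desc *d, uint64_t dma_addr,
                           uint32_t len, uint32_t flags)
{
    put_le64(&d->raw[0], dma_addr); /* bus address, not the CPU pointer */
    put_le32(&d->raw[8], len);
    put_le32(&d->raw[12], flags);
}
```

A caller would pass demo_virt_to_dma(buf) as dma_addr; writing the raw pointer
value instead is the kind of bug described above.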

> > I agree about using the standard ioremap stuff. I'm only pointing out that
> > there are places where it is missing in the common U-Boot code. Where it
> > is
> > present, there won't be any issues since traditionally I used those
> > methods to call our platform specific remapping. I will look to see what
> > is present and if it will work or not.
> yes, those places need some patching anyway. There is already an ongoing
> task to address this:
> https://gitlab.denx.de/u-boot/custodians/u-boot-mips/issues/15

I think I can help there. I've already spent a fair bit of time on this with 
XHCI which I backported. I still have a major common XHCI issue to fix when 
short packets are received. The U-Boot code does not handle this case 
properly. It's easy to reproduce the case. Use a USB to Ethernet adapter and 
have the receive buffer cross a 64K boundary and bad things will happen.
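
As an illustration of why such a buffer is special, here is a small sketch
that works out where a buffer would have to be split so that no piece crosses
a 64 KiB boundary, assuming the usual xHCI constraint on transfer buffers. It
only shows the splitting arithmetic, not the short-packet handling itself, and
the function names are hypothetical.

```c
#include <stdint.h>

#define XFER_BOUNDARY 0x10000ull /* 64 KiB */

/* Bytes that fit between addr and the next 64 KiB boundary, capped at remaining. */
static uint32_t chunk_len(uint64_t addr, uint32_t remaining)
{
    uint64_t room = XFER_BOUNDARY - (addr & (XFER_BOUNDARY - 1));

    return remaining < room ? remaining : (uint32_t)room;
}

/* Number of pieces a buffer at addr with length len is split into. */
static unsigned int count_chunks(uint64_t addr, uint32_t len)
{
    unsigned int n = 0;

    while (len > 0) {
        uint32_t c = chunk_len(addr, len);

        addr += c;
        len -= c;
        n++;
    }
    return n;
}
```

A 2 KiB receive buffer that starts 16 bytes below a boundary ends up as two
pieces, so a short packet can complete in the first piece while the second is
still queued.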

> >>> I think we can shrink the code by removing support for starting "simple
> >>> executive" tasks. Simple executive tasks are bare metal applications
> >>> that can run on dedicated cores beside Linux (or without Linux). I will
> >>> also not be porting any support for anything older than Octeon3.
> >>> 
> >>> We also make heavy use of our SDK in order to perform hardware
> >>> initialization and networking. In our old U-Boot, we have almost 900K
> >>> lines of code. I can cut out much of this but much will remain.
> >>> 
> >>> We also have added extensive infrastructure for handling SFP and QSFP
> >>> cables as well as very extensive phy support for phys from
> >>> Aquantia/Marvell, Vitesse/Microsemi, Inphi/Cortina and an Avago gearbox.
> >>> Our customer wants us to port all of this to the new U-Boot and upstream
> >>> it. I'm worried about the sheer amount of code since it is absolutely
> >>> massive.
> >> 
> >> Maybe you should cut down your customer's expectations a bit. According
> >> to sloccount we currently have 1.6M SLOC for the whole U-Boot. I guess
> >> Tom or Wolfgang wouldn't agree with adding another 900k only for one
> >> CPU. Actually what should be upstream is the basic CPU, driver and board
> >> support to be able to boot a mainline kernel. Everything else like
> >> custom bare metal applications or the SFP/PHY handling stuff mentioned
> >> below could also be maintained in a downstream tree. Maybe Wolfgang is
> >> willing to host one on gitlab.denx.de.
> > 
> > I will try and cut it down. Much of the code is register definitions. The
> > register definition files are auto-generated and tend to be huge. They're
> > fully commented and include both big and little endian bitfields. In this
> > case I can do like I did for OcteonTX and modify the scripts that
> > generate these headers to strip out the little-endian and comments. There
> > is a huge amount of code for configuring our QLM hardware interfaces. We
> > also have a lot of code for SFP/QSFP ports.
> > 
> > There are some other huge files that can also be eliminated by dropping
> > support for Octeon II and earlier. The error handling files are massive
> > for
> > those chips.
> > 
> > Much of the rest can be shrunk somewhat, but a lot of that code is still
> > required.
> > 
> > There is a huge amount of code for dealing with our quad-lane modules
> > (QLMs). The QLMs can be configured to run in a variety of modes, from
> > more. There is a lot of tuning and configuration code needed in order to
> > handle different clocks, equalization, gain, AGC and a whole host of
> > other serdes issues.
> > 
> > The MAC code is also quite large and complex since there are many
> > coprocessors that must be configured. These chips are designed as network
> > processors. While it makes their networking quite powerful and fast, it
> > also means that a lot of programming is needed before they will work.
> > There are input parser engines, buffer management engines, queueing
> > engines, output engines and more that must be fully configured before any
> > packets can be sent or received.
> what I meant was that your customer shouldn't expect to get his custom
> code merged upstream as it is with only some cleanups. Of course a
> user/customer can decide to use U-Boot as system management and hardware
> initialisation tool but that doesn't correspond with U-Boot's design. I
> think most people would agree that a proper OS like Linux should be
> doing the heavy network initialisation and hardware-offloading stuff as
> well as booting all remaining CPU cores. U-Boot's responsibility should
> only be to boot that OS in the first CPU ;)
> > There is a fair bit of code used to bring additional cores out of reset.
> > In
> > our biggest configuration, there can be two Octeon CN78XX chips connected
> > in tandem where each chip has 48 cores. In this case there is a lot of
> > tuning that needs to happen with the lanes connecting the two chips
> > before this configuration works reliably. There is a tuning process that
> > is required to run on both sides (and the second chip runs a small binary
> > image as well to perform its half of the tuning).
> > 
> > I do not know if this will change or not but the way the Linux kernel is
> > booted on Octeon is not compatible with the standard boot commands. Part
> > of
> > this is due to the fact that Linux can be run in parallel with Simple
> > Executive applications. It's even possible to run two copies of Linux
> > simultaneously on different cores. To go along with this, there is also a
> > mechanism with named memory blocks that is used. When bringing cores out of
> > reset for SE applications, the TLB entries need to be configured. There
> > also is a fair bit of code dealing with core masks when choosing which
> > cores are used for what.
> > 
> > We also have a named memory block feature which is used by Linux and
> > simple
> > executive applications where blocks of memory can be carved up. U-Boot
> > needs to tie into this.
> > 
> > There are also numerous other I/O interfaces that we need to
> > initialize. Unfortunately we also have some errata we need to work around
> > and a few are non-trivial.
> > 
> > The DRAM initialization code is also massive.  It handles DDR3 and DDR4
> > for
> > both registered and unregistered memory with ECC.
> > 
> > In many cases, the reason for the size of the code is due to the
> > complexity of the SoC and the platforms built around it. You can think of
> > CN78XX as being more like an enterprise-class server than a simple
> > embedded device. The CN73XX is not too far behind the CN78XX. The only
> > reason our Octeon TX2 U-Boot is so much smaller is that most of the early
> > initialization takes place before U-Boot is started and the fact that a
> > lot of the networking support (such as SFP management and PHY support) is
> > handled by ATF as well as on-chip management cores. This is necessary
> > because Linux does not have any SFP management support
> last year the PHY framework was reworked into a phylink framework
> which supports hot-plugging and dynamic linking of PHY drivers with
> MAC drivers, especially to support SFP modules. An SFP module driver is
> there as well. There was a talk at ELCE 2018 about this:

I will look at this. The code I wrote can handle some really crazy 
configurations. I may want to modify some of the drivers we have to be 
"virtual MACs" such as Inphi. Also of note is that not all phys use MDIO. Two of 
the ones I work with use i2c and there has been talk of using other methods of 
communicating with the phy.

> https://events19.linuxfoundation.org/wp-content/uploads/2017/12/chevallier-tenart-from-the-ethernet-mac-to-the-link-partner.pdf
> > nor can it handle the complex topologies we're frequently running into
> > today.  The requirements of Redhat also preclude any additional software
> > being installed in order for the networking support to run.
> > 
> > One thing I may need to re-introduce to U-Boot is the temperature sensor
> > support for devices like this, since thermal monitoring is important.
> this should be easy as U-Boot already has a thermal uclass within the
> driver model.
I just noticed that. It looked like for a while it was removed. :)

> > Some boards require a background task to perform periodic monitoring for
> > certain events, including the board that needs to be upstreamed. I haven't
> > checked if anything is available now, but what I did in the past was hook
> > into the input function and while waiting for input it calls a
> > user-defined polling function.
> > 
> > If interrupts are supported it makes the polling job easier.
> > 
> >>> Some of these phy drivers are extremely complex and need to tie
> >>> into the SFP management. We also need to use a background polling thread
> >>> while at the command prompt. A fair bit of our phy code is not in the
> >>> normal phy drivers because it did not fit the model. Some of these phy
> >>> drivers need to interact with the SFP support code in order to handle
> >>> hot plug events in order to reconfigure themselves based on the cable
> >>> type. The existing SFP code handles everything from SFP to SFP28 as well
> >>> as QSFP and 100G QSFP (never tested).
> >>> 
> >>> In the old U-Boot the PHY support had to be significantly enhanced due
> >>> to requirements for hot-plugging and how some of the PHYs are
> >>> configured. It gets quite complicated with phys like the Inphi where one
> >>> phy can handle either four ports (XFI/SGMII) or a single 4-lane port
> >>> (XLAUI). It gets even worse since in some boards we use reclocking chips
> >>> and there is one chip that handles the receive path of a QSFP and
> >>> another that handles the transmit path. Further complicating things,
> >>> with a QSFP it can be treated either as XLAUI or as four XFI ports, so
> >>> you can have four ports spread across two chips, with each port using
> >>> different slices of each chip. In the case of the Inphi/Cortina chip, a
> >>> single device can handle one or four ports based on the configuration
> >>> and it is configured by "slice" which is basically an offset into the
> >>> MDIO register space. We had to jump through hoops in order to have this
> >>> stuff work in a sane way in the device tree. We added entries for SFP
> >>> and QSFP slots in the device tree which point to the MACs, GPIOs and I2C
> >>> bus because pointing them to the phys just got too insane. This will
> >>> need to be ported to the new U-Boot. It should not break the existing
> >>> support since most of it was implemented outside of the core PHY
> >>> handling code. In the port, it would be far better if this could be
> >>> integrated in. The SFP management code is architecture agnostic as is
> >>> all of the PHY support. The callbacks for the SFP support are used by
> >>> the MAC which then notifies the PHY since the MAC often needs to
> >>> reconfigure itself. It can handle some crazy configurations.
> >>> 
> >>> While I see some phy drivers that we also support, i.e. Cortina, our
> >>> drivers tend to have a lot more functionality. For example, all of our
> >>> phy drivers that support firmware support commands for upgrading the
> >>> firmware as well as things like cable testing and other features.
> >> 
> >> PHY drivers and ethernet drivers should be really reduced to the
> >> required functionality to enable basic networking like Ping, DHCP, TFTP.
> >> U-Boot is still "just" a bootloader and not a system management tool ;)
> >> You should do that stuff either in Linux or in a downstream fork.
> > 
> > This is the case for the most part. Unfortunately, many of these drivers
> > require a lot of code and some require frequent monitoring to make
> > adjustments. The SFP support is required to monitor what cable type is
> > plugged in and to reprogram the phy as needed based on the type of cable.
> > The 10G and 25G phys need different settings for optical/active vs
> > passive copper vs SFP connectors. In addition, some require different
> > settings based on the cable length and in some cases exceptions are
> > needed for certain modules (there are a series of Avago SFP to Gigabit
> > modules that require autonegotiation to be disabled in 1000Base-X mode).
> > In at least one case there needs to be frequent polling to make
> > adjustments (25G) as the equalization settings can change based on
> > temperature. The SFP management code identifies the type of cable
> > connected and its parameters so that the phy driver can adjust the
> > appropriate settings. The SFP management code is generic and not tied to
> > any one type of phy or MAC or brand of module. It also monitors all of
> > the GPIO pins and will make callbacks when needed. Many phys lack the
> > support for doing this themselves. Phys I have worked with that need this
> > support include Cortina/Inphi and several Microsemi/Vitesse devices.
> > 
> > The Inphi devices will typically handle four XFI lanes with four
> > bidirectional slices with each slice given a different register range.
> > Further complicating matters is that a QSFP port can either be four XFI
> > interfaces or a single XLAUI interface. We have code to update the
> > firmware for the Inphi chips, but this is small compared to the rest of
> > the initialization code. These chips require that equalization and gain
> > be configured on each slice based on the board and cable characteristics
> > as well as LED configuration.
> > 
> > With the Microsemi reclocking chips, each chip has four unidirectional
> > lanes. For a QSFP port, two chips are required with one chip configured
> > for ingress and the other for egress. This can support either XLAUI or
> > four XFI interfaces. When it is configured for XFI there are four XFI
> > interfaces, since now four MACs are shared with two chips with each MAC
> > going to one lane on each chip.
> > 
> > Also making things fun is that Inphi and the reclocking chips do not
> > conform to the clause 45 standard at all. In the case of Inphi, the ID
> > registers are 0.0 and 0.1 instead of 1.2 and 1.3 as they are in Clause
> > 45.
> > 
> > The MAC drivers are also non-trivial. The Octeon chips are designed as
> > network processors with a lot of hardware offloading and coprocessors.
> > Bringing up a "simple Ethernet" interface is anything but simple. There
> > are numerous offload engines that must be configured before it will work.
> > While we do have one "simple" interface that can be configured, it often
> > isn't because it's usually only good for a management port and many
> > boards do not have this and the customers desire to be able to use any
> > port.
> > 
> > Just configuring the interface between the MAC and PHY is also
> > non-trivial.
> > The Octeon (and later CPUs) have what are called "QLMs" or quad lane
> > modules. These QLMs contain programmable serdes which can be configured
> > for PCIe, SATA, XFI, XAUI, RXAUI, SGMII, 1000Base-X, XLAUI and a whole
> > host of other interface types with a lot of tuning for things like
> > equalization and clocks. The amount of QLM initialization code is quite
> > large but necessary. There are a lot of clock and analog tuning
> > parameters and sequences that must be run.
> > 
> > Sadly all of this is needed just for basic ping and DHCP. This isn't like
> > a
> > simple e1000 NIC or the NICs common with most SoCs.
> as already stated this heavy networking stuff should be the task of an
> OS. I understand why you chose another way because Linux only recently
> got real support for SFP or more hardware-offloading capabilities but
> maybe you should take the chance and update your system design and
> submit missing functionality to Linux rather than adding a lot of
> network management stuff to U-Boot.
Unfortunately, without the support in U-Boot, networking just won't work at 
all. The U-Boot drivers do not use any of the heavy lifting features. 
Unfortunately there is still a lot of code that needs to execute just for 

> > Think of scaling from a Raspberry Pi to a dual-CPU XEON enterprise-class
> > server with 96 cores and 256GiB of RAM with 10, 25 and 40Gbe ports but
> > without a BCM or MCU to handle low-level board changes while also having
> > many enterprise-class requirements for RAS, etc. That is why our code is
> > so large and complex. There are a lot of hardware engines for offloading
> > a lot of tasks since the chips are often used in security appliances.
> > There are engines for ZIP compression, hardware regex engines, packet
> > ordering engines, packet parsing engines, buffer management engines, RAID
> > engines and a whole host of others. Many are not used in U-Boot, but a
> > fair number are required for basic packet I/O.
> > 
> > For example, one of the boxes contains a CN78XX with 8 10G ports (where
> > either can also be configured in XLAUI using 4 to 1 using a QSFP to SFP+
> > splitter cable). It has 128GiB of registered DDR4 DIMMs, 4 SATA drives,
> > redundant power supplies and a whole host of other things including
> > multiple temperature monitors. This uses an Inphi/Cortina phy chip that
> > requires full SFP management support. With Inphi phys, the phy cannot
> > drive LEDs based on traffic since it has no concept of packets,
> > especially in XLAUI mode since each lane is independent of the others.
> > 
> > Another board, one I specifically have been told to upstream is a NIC that
> > contains a CN73XX and two 10G/25G ports that go through a complex gearbox
> > chip. Since there is no hardware support for LEDs in the Octeon SoC to
> > indicate link and packet I/O this must be done in software (including
> > U-Boot, customer requirement) and SFP port management is also a must. The
> > phy is not at all a traditional phy. It uses i2c instead of MDIO and
> > requires frequent monitoring of the link parameters (it's an older custom
> > gearbox chip, there are newer and better chips that don't require this
> > now). I have a hook while U-Boot is sitting at the prompt which allows
> > for background tasks to operate while it's sitting.
> > 
> > I have several other NICs to support that use a Microsemi reclocking chip
> > that has four unidirectional lanes per chip. The chip has zero
> > intelligence and is shared between ports (and on some devices, multiple
> > chips are shared between ports). Everything must be tuned based on the
> > SFP/QSFP module type and cable length. LEDs also must be software driven.
> > (The software driving of LEDs is eliminated in OcteonTX2). These chips
> > have no way to drive the LEDs themselves to indicate packet I/O or link
> > status.
> > 
> > There are also other boards that use the Microsemi reclocking chips. They
> > were chosen in part due to the power budget and these chips are very low
> > power (and inexpensive).
> > 
> > In all of these phy cases, all of the parameters are maintained in the
> > device tree so the drivers are generic. Unfortunately these drivers also
> > require SFP and QSFP management support.
> > 
> > I figure if there are several boards I need to upstream, it's not much
> > more
> > effort to port all of the boards to the new U-Boot. I've worked hard to
> > minimize the board-specific code and make as much of it generic and based
> > on the device tree as possible.
> > 
> > Someday I would love for SFP/QSFP infrastructure to get into Linux. Some
> > NIC cards do it in their drivers, but I'd like to see generic
> > infrastructure (like my U-Boot support). This might make it harder for
> > some drivers to only support certain brands of modules too :) The generic
> > code I wrote works with most modules except Intel (because they have bad
> > checksums, but counterfeit Intel modules work fine!). It still can be
> > expanded at some point since there is no support for module diagnostics
> > other than identifying if it is present. Pretty much all it does is
> > monitor the GPIO pins and parse and decode the EEPROM. The SFP code is
> > generic enough such that any phy driver that needs it can easily hook
> > into it.
> as already noted this is already in Linux:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/phy/phylink.c
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/phy/sfp.c

Unfortunately, for high speed interfaces (which our customers use in U-Boot
for tftpboot), a fair bit needs to be implemented just to work. The way the
code is architected, there isn't much impact on the existing U-Boot code
unless it needs to take advantage of it.
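
For reference, the EEPROM check touched on in the quoted text (modules with
bad checksums being rejected) could look roughly like the sketch below. This
is only an illustration, not the actual Octeon U-Boot code. It assumes the
SFF-8472 serial ID layout, where byte 63 (CC_BASE) holds the low 8 bits of
the sum of bytes 0-62 and byte 95 (CC_EXT) holds the low 8 bits of the sum
of bytes 64-94. The names sfp_sum8() and sfp_eeprom_checksums_ok() are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets in the serial ID page, per the SFF-8472 layout assumed above. */
#define SFP_CC_BASE_OFFSET	63	/* checksum over bytes 0..62 */
#define SFP_CC_EXT_OFFSET	95	/* checksum over bytes 64..94 */
#define SFP_ID_MIN_LEN		96	/* bytes needed to check both sums */

/* Low 8 bits of the sum of buf[first..last], both ends inclusive. */
static uint8_t sfp_sum8(const uint8_t *buf, size_t first, size_t last)
{
	uint8_t sum = 0;

	for (size_t i = first; i <= last; i++)
		sum = (uint8_t)(sum + buf[i]);
	return sum;
}

/*
 * Hypothetical helper: returns true only if both CC_BASE and CC_EXT
 * match the data they cover. A short or missing buffer is rejected.
 */
bool sfp_eeprom_checksums_ok(const uint8_t *eeprom, size_t len)
{
	if (eeprom == NULL || len < SFP_ID_MIN_LEN)
		return false;

	return sfp_sum8(eeprom, 0, SFP_CC_BASE_OFFSET - 1) ==
		       eeprom[SFP_CC_BASE_OFFSET] &&
	       sfp_sum8(eeprom, SFP_CC_BASE_OFFSET + 1,
			SFP_CC_EXT_OFFSET - 1) == eeprom[SFP_CC_EXT_OFFSET];
}
```

A caller would read the first 96 bytes of the module EEPROM over i2c and
pass them in. Whether a failed checksum should reject the module outright or
merely log a warning is a policy decision this sketch leaves open.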

> >>> Our bootloader needs to be able to be booted from a variety of sources,
> >>> including SPI, eMMC, NOR flash and booting over the PCI bus from a host
> >>> system. This is one reason we use virtual memory. The other reason is
> >>> that it eliminates the need to perform relocation. Our start.S code
> >>> handles all of these different cases as well as exception handling.
> >> 
> >> This is already supported for MIPS. You should try to use the generic
> >> SPL framework for that. Whether you like the relocation or not, it's one
> >> of the basic design principles of U-Boot. I guess it likely won't be
> >> accepted if you circumvent this. In fact by now we're sharing the same
> >> technology as Linux to have relocatable binaries without using gcc's
> >> -fPIC or -mabicalls to reduce the binary footprint. You can configure
> >> gd->ram_top to any address of your liking as reference address for the
> >> relocation.
> > 
> > I will look into this. One other complication is the fact that we require
> > both a failsafe as well as a default bootloader. With the older U-Boot we
> > got around all of this by just using TLB entries to map U-Boot to always
> > run in the same virtual address regardless of the physical address. It
> > eliminated any need for -fPIC and helped keep the binary small. For our
> > older bootloader, it always executes at 0xC0000000 regardless of where it
> > sits in physical memory. Using virtual memory also helps keep U-Boot
> > simple and small.
> > 
> >>> I will also say up front that the memory initialization code is a mess
> >>> and quite large (it was written by a hardware engineer who never heard
> >>> of functions).
> >>> 
> >>> One thing is that this will break mips unless it is refactored like ARM
> >>> is, for example, separating armv7 and armv8. This way we could have
> >>> arch/mips/cpu/octeon. I did this with the old bootloader to separate our
> >>> stuff. I'm open to suggestions as for the naming. I don't see how we can
> >>> share much of the code with the other MIPS CPUs.
> >> 
> >> We have the same mach directory handling as in Linux MIPS. So you could
> >> easily add all your platform specific code (except drivers) to
> >> arch/mips/mach-octeon or (-cavium). Inside that directory you can have
> >> an include directory for your custom header files; you can even override
> >> the generic files from arch/mips/include like in Linux. arch/mips/cpu
> >> and arch/mips/lib should only contain generic code. As already mentioned
> >> you could provide an own start.S inside arch/mips/mach-octeon but if
> >> possible you should try to reuse or extend the generic variant.
> > 
> > We can't use the existing start.S. We have a lot of requirements that are
> > not supported there as well as a fair bit of code dedicated to dealing
> > with the cache and TLBs and bringing additional cores out of reset. We
> > make use of a boot bus movable region in order to do this and handle
> > other cases like NMIs and the watchdog. Our start.S currently sits at
> > around 3800 lines of code. Some is common but most is not.
> > 
> > Our start.S is designed to be able to boot both a failsafe and
> > non-failsafe image and supports adjusting the flash mapping in order to
> > start from an offset other than zero in the flash. There is also a fair
> > bit of code for copying the image out of flash into the L2 cache for a
> > significant speedup for DRAM initialization. I'm trying to get permission
> > to share our existing code but I'm getting push-back (even though it's
> > GPL!?!). How they want me to upstream it without sharing the code is
> > beyond me.
> > 
> > While U-Boot has an exception handler, I believe ours is more
> > comprehensive. It is written entirely in assembler and is not dependent
> > on a working C runtime environment. It also dumps more information than
> > just the registers such as the stack and a number of other exception
> > registers and does some exception decoding. It's quite a bit better than
> > the ARMv8 exception handler IMHO.
> > 
> > Putting this under mach-octeon will make it much easier. I'll try and
> > re-use where I can.
> > 
> >>> All in all, I think the final port will add between 500K-1M lines of
> >>> code for the Octeon CPU. It is much more extensive than what is required
> >>> for OcteonTX since in the latter case most of the hardware
> >>> initialization is done by earlier stage bootloaders and the ATF handles
> >>> things like SFP port management and many of the networking operations.
> >>> 
> >>> I'm not sure how well I'll be able to upstream all of this code at this
> >>> point since I was just handed this task. We already have at least 1M
> >>> lines of code added to the old U-Boot which is based off of 2013.08 with
> >>> a lot of backports.
> > 
> > I'm trying to get our existing code made available someplace online. I'm
> > getting pushback even though U-Boot is GPL and the license on our SDK is
> > BSD-like (i.e. do whatever you want but don't hold us responsible). It
> > looks like it used to be available but was taken down. I don't
> > understand lawyers. All of the code I wrote is GPL. There is some U-Boot
> > specific code in our SDK, but none was copied from U-Boot. There also is
> > some duplication of functionality between U-Boot and our SDK that I'll
> > try and eliminate.
> > 
> > I have implemented just about every feature in U-Boot I could with our
> > Octeon SoC. That's another reason it's so large. Some customer always
> > comes back and says they want feature X to work. Fortunately, the changes
> > to the U-Boot supplied code are generally minimal, despite it being so
> > large.
> > 
> > I likely will need to add some more hooks to board_f.c and board_r.c. I
> > have run into many cases where we need a specific order of initialization
> > that does not match the normal U-Boot order. Perhaps make init_sequence_f
> > and init_sequence_r weak so that they can be overridden if needed by a
> > specific board or architecture. While much of the current init order
> > works, we need some things initialized as quickly as possible and others
> > initialized later. For example, the first thing we call is an
> > early_errate_workaround function in the init sequence before anything
> > else is called.
> I guess overriding the complete generic board init code is not
> acceptable. It was once hard work to unify this. A hook like
> early_errate_workaround() sounds reasonable but could also be called
> from start.S before handing over to board_init_f(). But everything else
> should fit into the existing init hooks. There are quite a lot.

I agree. I did some more research and noticed that it's not uncommon to have 
other functions called before board_init_f by the start code. I also noticed 
that there appear to be quite a few places where custom board_init_f functions 
are defined. I will try and avoid this. Back when I did this port in 2012 
things were a lot more limited.
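
To make the ordering concrete, here is a minimal sketch of the arrangement
discussed above: an early hook runs first, and only then is the unmodified
generic list walked. It is modeled loosely on the NULL-terminated list of
init functions in common/board_f.c, but it is not U-Boot code. The names
run_init_sequence(), boot_entry() and the example_* steps are hypothetical,
and boot_entry() merely stands in for the hand-off from start.S.

```c
#include <assert.h>
#include <stddef.h>

/* One init step; returns 0 on success, non-zero on failure. */
typedef int (*init_fnc_t)(void);

/* Walk a NULL-terminated list and stop at the first failing step. */
int run_init_sequence(const init_fnc_t *seq)
{
	assert(seq != NULL);

	for (; *seq != NULL; seq++) {
		int ret = (*seq)();

		if (ret != 0)
			return ret;
	}
	return 0;
}

/*
 * Hypothetical entry point: run an optional early hook before the
 * generic sequence, leaving the generic sequence itself untouched.
 */
int boot_entry(init_fnc_t early_hook, const init_fnc_t *generic_seq)
{
	if (early_hook != NULL) {
		int ret = early_hook();

		if (ret != 0)
			return ret;
	}
	return run_init_sequence(generic_seq);
}

/* Example steps that only record the order in which they ran. */
int example_order[4];
int example_nsteps;

int example_early_workaround(void) { example_order[example_nsteps++] = 1; return 0; }
int example_init_console(void)     { example_order[example_nsteps++] = 2; return 0; }
int example_init_dram(void)        { example_order[example_nsteps++] = 3; return 0; }

const init_fnc_t example_seq[] = {
	example_init_console,
	example_init_dram,
	NULL,
};
```

Calling boot_entry(example_early_workaround, example_seq) runs the
workaround first and then the two generic steps in list order. The point of
this shape is that a platform gets its early call without replacing or
reordering the shared list.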

Would marking a few functions as weak be acceptable? This would help keep 
#ifdefs to a minimum. I have found that doing this as well as adding hooks in 
some key places can really minimize the use of #ifdefs and keep the code 
cleaner. In our common board code I did this a lot. That way there is nothing 
specific to any single board in there and any board can override whatever 
functionality it needs to do. Our existing U-Boot supports 83 boards, though 
many of these will go away (and some are no longer tested).
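
As an illustration of the weak-function pattern asked about above, here is
a minimal sketch. It assumes GCC or Clang targeting ELF, where
__attribute__((weak)) is available. U-Boot spells this attribute __weak; a
local macro is used here so the example stands alone. The function names
board_early_fixups() and common_board_init() are hypothetical.

```c
/* Local stand-in for U-Boot's __weak; assumes GCC/Clang on an ELF target. */
#define WEAK_DEFAULT __attribute__((weak))

/*
 * Default hook in the common board code. It does nothing and reports
 * success, so boards that need no special handling provide no code.
 */
WEAK_DEFAULT int board_early_fixups(void)
{
	return 0;
}

/*
 * Common code calls the hook unconditionally. No #ifdef is needed
 * because the linker picks the board's version when one exists.
 */
int common_board_init(void)
{
	return board_early_fixups();
}

/*
 * In a separate board file (not compiled here), a normal, non-weak
 * definition with the same name replaces the default at link time:
 *
 *	int board_early_fixups(void)
 *	{
 *		... board-specific work ...
 *		return 0;
 *	}
 */
```

With only this file linked, common_board_init() returns 0 through the weak
default. One trade-off worth keeping in mind: a misspelled override
silently leaves the default in place, since the linker has no way to flag
it.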

-------------- next part --------------
A non-text attachment was scrubbed...
Name: start.S
Type: text/x-csrc
Size: 95688 bytes
Desc: start.S
URL: <http://lists.denx.de/pipermail/u-boot/attachments/20191030/93718458/attachment.c>
