[PATCH v2 00/12] mips: Add initial Octeon MIPS64 base support

Fri Jun 19 09:51:06 CEST 2020

Hi Daniel,

On 16.06.20 19:27, Daniel Schwierzeck wrote:
>>>> Actually I wanted to do a more complete header sync with Linux and
>>>> submit some minimal refactorings of start.S so that you have a better
>>>> working base. Unfortuneately I hadn't time to do so in the last 3 weeks.
>>>
>>> I understand. Is there something I can help you with, to speed this
>>> up?
>>
>> Again, can I assist with some of your tasks? I really would like to see
>> the base Octeon support appear in mainline U-Boot in the upcoming
>> merge window. Please see below.
> 
> my current work state is pushed to u-boot-mips in branches
> header_sync_v2 and lowlevel_refactoring. That should be enough for the
> first refactoring step for the next merge window. I'll try to send
> patches soon. With this you can also drop patches 2/12 and 8/12.

Okay. Thanks for working on this.

>>
>>>> Regarding the copying to L2 cache: could we do this later and add the
>>>> init stuff at first? So we could have functionality at first and then
>>>> boot time optimization later. This also would give me some more time to
>>>> refactor the start.S hooks which would give you a better base for
>>>> implementing such optimizations.
>>>
>>> IIUYC, then your main reason that I should remove the L2 cache copy
>>> is, that you would like to refactor start.S first and add some hooks
>>> for e.g. this L2C cache copy, where it would fit better than currently
>>> in mips_sram_init(). I definitely promise to re-work this Octeon early
>>> boot code to fit into your refactored start.S, once its ready. This
>>> way, you wouldn't be pressed in time to get this rework done. And we
>>> have a chance to get this Octeon base support integrated in a "decent"
>>> state (bootup time, please see DDR4 init below) soon.
>>>
>>> There are other reasons, beside the boot speedup, for the L2 cache
>>> copy:
>>>
>>> Running from L2 cache also helps with re-mapping the CFI flash
>>> bootmap, so that the 8 MiB flash can be fully accessed. Otherwise
>>> the flash is too big for the initial bootmap size and things like
>>> environement can't be accessed.
>>>
>>> The DDR init code that I'm currently working on - I'm actually
>>> preparing the first patchset right now - is only tested with this
>>> configration right now.
>>>
>>> I'm also working on other peripheral drivers for Octeon. Some have
>>> already started (like USB) and others are on my list. All this work
>>> is pretty painful without the boot speedup, as e.g. the DDR4 init
>>> will take very long (many minutes compared to a few seconds, as I've
>>> been told).
>>>
>>> But of course this is your call. If you insist on removing the L2 cache
>>> copy from the base Octeon port, then I will re-visit the patchet and
>>> try to address it.
> 
> I had a deeper look into it, but I still can't understand why and how
> this L2 copy even works. Especially after a cold-boot from flash without
> any Octeon specific CPU initialization.
> 
> U-Boot is a position-dependent binary and all branches and jumps are
> implemented with absolute addresses. You can't simply add an offset to
> $pc (except when changing KSEG) and expect U-Boot to be running from the
> new location. That's why we have the relocation step and the reloc table
> where all absolute addresses are fixed relative to the relocation address.

Yes, I am aware of this. That's why TEXT_BASE is already set to the
"destination" address in L2 cache and not to the boot space (see below).

> Assembly dump of the current patch series:
> 
> ffffffff80000000 <_start>:
> ENTRY(_start)
> 	/* U-Boot entry point */
> 	b	reset
> ffffffff80000000:	1000013f 	b	ffffffff80000500 <reset>
> 	 mtc0	zero, CP0_COUNT	# clear cp0 count for most accurate boot timing
> ffffffff80000004:	40804800 	mtc0	zero,$9
> 	...
> ffffffff80000500 <reset>:
> 1:	mfc0	t0, CP0_EBASE
> ffffffff80000500:	400c7801 	mfc0	t0,$15,1
> 	and	t0, t0, EBASE_CPUNUM
> ffffffff80000504:	318c03ff 	andi	t0,t0,0x3ff
> 
> 	/* Hang if this isn't the first CPU in the system */
> 2:	beqz	t0, 4f
> ffffffff80000508:	11800004 	beqz	t0,ffffffff8000051c <reset+0x1c>
> 	 nop
> ffffffff8000050c:	00000000 	nop
> 3:	wait
> ffffffff80000510:	42000020 	wait
> 	b	3b
> ffffffff80000514:	1000fffe 	b	ffffffff80000510 <reset+0x10>
> 	 nop
> ffffffff80000518:	00000000 	nop
> 
> 	/* Init CP0 Status */
> 4:	mfc0	t0, CP0_STATUS
> ffffffff8000051c:	400c6000 	mfc0	t0,$12
> ...
> 
> If you boot off from flash at ffffffffbfc00000, the very first
> instruction will jump to ffffffff80000500. This can only work if there
> is already some magic memory/TLB mapping from ffffffffbfc00000 to
> ffffffff80000000 or I-caches are already active.

I'm assuming that the first "b reset" will translate into an relative
machine opcode, so a difference in TEXT_BASE and current PC won't be a
problem. Its impossible to code an absolute 64bit address in this 32bit
opcode. Could this be the case (you know MIPS far better than me)?

The first "real" call is to mips_sram_init(), which I addresses by
re-calculating the address in this patch:

     mips: start.S: Add CONFIG_MIPS_INIT_JUMP_OFFSET

But I might have missed something. Still it works reliably on my
platform this way.

> Could you describe the basic system design (or share some documents)
> i.e. how CPUs, busses, cache coherency engine, memory controller, flash
> are connected and at which addresses them are mapped? And how the boot
> process works? Are the I-caches already active when booting from flash?
> This would really help me for understanding the code ;)

I agree. Unfortunately I can't share the manuals. As for the caches,
I can only assume that the I-cache is disabled at boot as booting is
so slow without this L2 copy.

> Also the complete copy of the U-Boot binary to L2 cache breaks the
> U-Boot design philosophy.

Yes, I am aware of this. And I agree that its not perfect. But the
necessary changes to the MIPS common code are very minimal.

> The task of start.S is to initialize the CPU,
> initialize the external memory, prepare global_data/BSS and setup the
> initial stack space. Originally the MIPS start.S required to implement
> everything in assembly. Later we added the option to use global_data and
> initial stack from some SRAM to be able to use a full C environment for
> the lowlevel_init() code. That's why start.S is now cluttered and needs
> some refactoring.

I agree. start.S has become quite complex with all the #ifdef's.

> Wouldn't it possible for the first Octeon patch series to simply fit in
> this scheme and use the normal relocation step to copy the U-Boot binary
> to RAM or effectively L2 cache? Without the DDR init code there is not
> much code which would run slowly from flash.

Yes. But relocation is done to L2 cache in this case. And with the
DDR init code, relocation needs to be done to DDR (after DDR init).
So this would need a 2nd "relocation", which is cumbersome and not
supported. I also "played" with SPL for a while to overcome this
2nd relocation. But this didn't work very good AFAIR.

Or do you see a way to do this in a "clean way"?

> While writing this, it occurred to me that you maybe just work-around a
> non-working or not properly initialized I-cache when booting from flash.

Frankly, I don't really know or remember. I'll take another look at the
MIPS common cache setup and will try to use it, as I-cache is enabled
there.

> Or maybe fetching single instructions from flash to I-cache is slower
> than copying the code at once? Can't that be fixed? Would something like
> arch/mips/mach-mscc/cpu.c:vcoreiii_tlb_init() work with Octeon?

Maybe. I'll take a look at this code again to see, if it fits for Octeon
as well.

> Or was
> something like that already done in Aaron's custom start.S but not in a
> much understandable way (when not knowing the Octeon internals)?

Its a combination of both AFAIR. TLB setup and copy to L2 cache.

My reasoning was to reduce the complexity and "only" keep this L2 cache
copy seemed to work quite good. I also didn't want to introduce some
TLB mapping(s), as this also increases the complexity again. But as
mentioned above, I'll check again, what can be done here.

>>
>> Do you have some guidance for me on this? How should we continue here?
>> Can I help with the Linux header sync or start.S refactoring? Did you
>> start with this work already or do you have a rough schedule?
>>
>> Sorry for being so persistent on this. I really don't want to rush
>> you, but as mentioned above, I really would like to see at least the
>> Octeon base port in mainline U-Boot in the next merge window. And if
>> there is something that I can do to help you, then please let me know.
>>
> 
> I also want to merge the base port as soon as possible.

Great. ;)

> Than we could
> figure out how to properly accelerate the DDR init code respectively all
> code running from flash. But I want to avoid adding a second and custom
> relocation step if there are better ways. And the discussion about this
> should not block the initial patch series.
> 
> Could you rebase your patch series to u-boot-mips/header_sync_v2 and
> move the L2 copy code to the DDR init patch series?

Good idea.

> Than I could apply
> the base port to u-boot-mips/next and you could still develop based on
> the current L2 copy code?

Yes, I'll rework the code and will send a new base port v3 patchset
very soon.

Thanks for all your input and your assistance to improve the quality
of the code.

Thanks,
Stefan