[U-Boot] [PATCH 0/9] arm64: Unify MMU code

Mon Feb 22 21:15:10 CET 2016

On 02/22/2016 12:09 PM, Alexander Graf wrote:
> 
> 
> On 22.02.16 20:52, york sun wrote:
>> On 02/22/2016 11:42 AM, Alexander Graf wrote:
>>>
>>>
>>> On 22.02.16 19:39, york sun wrote:
>>>> On 02/22/2016 10:31 AM, Alexander Graf wrote:
>>>>>
>>>>> On Feb 22, 2016, at 7:12 PM, york sun <york.sun at nxp.com> wrote:
>>>>>
>>>>>> On 02/22/2016 10:02 AM, Alexander Graf wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 22.02.2016 at 18:37, york sun <york.sun at nxp.com> wrote:
>>>>>>>>
>>>>>>>>> On 02/21/2016 05:57 PM, Alexander Graf wrote:
>>>>>>>>> Howdy,
>>>>>>>>>
>>>>>>>>> Currently on arm64 there is a big pile of mess when it comes to MMU
>>>>>>>>> support and page tables. Each board does its own little thing and the
>>>>>>>>> generic code is pretty dumb and nobody actually uses it.
>>>>>>>>>
>>>>>>>>> This patch set tries to clean that up. After this series is applied,
>>>>>>>>> all boards except for the FSL Layerscape ones are converted to the
>>>>>>>>> new generic page table logic and have icache+dcache enabled.
>>>>>>>>>
>>>>>>>>> The new code always uses 4k page size. It dynamically allocates 1G or
>>>>>>>>> 2M pages for ranges that fit. When a dcache attribute request comes in
>>>>>>>>> that requires a smaller granularity than our previous allocation could
>>>>>>>>> fulfill, pages get automatically split.
>>>>>>>>>
>>>>>>>>> I have tested and verified the code works on HiKey (bare metal),
>>>>>>>>> vexpress64 (Foundation Model) and zynqmp (QEMU). The TX1 target is
>>>>>>>>> untested, but given the simplicity of the maps I doubt it'll break.
>>>>>>>>> ThunderX in theory should also work, but I haven't tested it. I would
>>>>>>>>> be very happy if people with access to those systems could give the patch
>>>>>>>>> set a try.
>>>>>>>>>
>>>>>>>>> With this we're a big step closer to a good base line for EFI payload
>>>>>>>>> support, since we can now just require that all boards always have dcache
>>>>>>>>> enabled.
>>>>>>>>>
>>>>>>>>> I would also be incredibly happy if some Freescale people could look
>>>>>>>>> at their MMU code and try to unify it into the now cleaned up generic
>>>>>>>>> code. I don't think we're far off here.
>>>>>>>>
>>>>>>>> Alex,
>>>>>>>>
>>>>>>>> Unified MMU will be great for all of us. The reason we started with our own MMU
>>>>>>>> table was size and performance. I don't know much about other ARMv8 SoCs. For
>>>>>>>> our use, we enable cache very early to speed up running, especially for
>>>>>>>> pre-silicon development on emulators. We don't have DDR to use for the early
>>>>>>>> stage and we have very limited on-chip SRAM. I believe we can use the unified
>>>>>>>> structure for our 2nd stage MMU when DDR is up.
>>>>>>>
>>>>>>> Yup, and I think it should be fairly doable to move the early generation into the same table format - maybe even fully reuse the generic code.
>>>>>>
>>>>>> What's the size for the MMU tables? I think it may be simpler to use static
>>>>>> tables for our early stage.
>>>>>
>>>>> The size is determined dynamically from the memory map using some code that (as Steven found) is not 100% sound, but works well enough so far :).
>>>>
>>>> That's the part I can't live with. Since we have very limited on-chip RAM, we
>>>> have to limit the size. But again, I do see the benefit of using the unified
>>>> structure for the 2nd stage.
>>>
>>> I'm not quite sure I see how your current code works any differently.
>>> While the code to determine the page table pool size is dynamic, the
>>> outcome is static depending on your memory map. So the same memory map
>>> always means the same page table pool size.
>>>
>>> We could also just hard code the size for the early phase for you I guess.
>>
>> We can definitely try.
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> The thing that I tripped over while attempting conversion was that you don't always map phys==virt, unlike other boards, and I didn't fully understand why.
>>>>>>>
>>>>>> True. We have some complication on the address mapping. For compatibility, each
>>>>>> device is mapped (partially) under 32-bit space. If the device is too large to
>>>>>
>>>>> Compatibility with what? Do we really need this in an AArch64 world?
>>>>
>>>> It's not up to me. The SoC was designed this way. By the way, this SoC can work
>>>> in AArch32 mode.
>>>
>>> I think I'm slowly grasping what the problem is.
>>>
>>> The fact that the SoC can run in AArch32 mode doesn't actually make a
>>> difference here though, since we're talking about U-Boot internal memory
>>> maps. The only reason to keep things mapped reachable from 32bits is if
>>> you want to run 32bit code with the U-Boot maps. I don't think you'd
>>> want to do that, no? :)
>>
>> I don't really want to run 32-bit code. My point is the SoC was designed that
>> way. We have DDR under 32-bit space, and in high region. We have the same for
>> flash controller where NOR is connected. Explained later below.
>>>
>>>>
>>>>>
>>>>> For 32bit code I can definitely understand why you'd want to have phys != virt. But in a pure 64bit world (which this target really is, no?) I see little benefit in it.
>>>>>
>>>>>> fit, the rest is mapped to high regions. I remember one particular case off the
>>>>>> top of my head. It is the NOR flash we use for environment variables. U-Boot uses
>>>>>> that address for saving, but also uses it for loading during boot. In our
>>>>>> case, the NOR flash doesn't fit well in the low region, so it is remapped to the
>>>>>> high region after booting. To make the environment variables accessible during
>>>>>> boot, we mapped the high region phys with a different virt, so U-Boot doesn't have
>>>>>> to know the low region address.
>>>>>
>>>>> I might be missing the obvious, but why can't the environment variables live in high regions?
>>>>>
>>>>
>>>> It is in high region. But as I tried to explain, the default physical mapping of
>>>> NOR flash (not MMU) is in low region out of reset.
>>>
>>> I see. So the problem is during the transitioning phase from uncached to
>>> MMU enabled, where we'd end up at a different address.
>>
>> Not exactly. We enable cache very early for a performance boost on the emulator. It
>> may sound trivial but it makes a big difference when debugging software on
>> emulators. Since we still use emulators for new products, I am not ready to drop
>> the early MMU approach.
> 
> I'm surprised it is that slow for you. Running the Foundation model
> (which doesn't do early mmu FWIW) seemed to be fast enough.

Foundation model is a simulator, not an emulator. Our emulator runs on hardware.
It is much, much slower than a simulator, but more accurate at the lower level.

> 
>> But you get the idea, the difference is before and after relocation. After
>> U-Boot relocates itself into DDR, we remap the flash controller physical address
>> to the high region.
>>
>>>
>>> Could we just configure NOR to be in high memory in early asm init code,
>>> then always use the high physical NOR address range and jump to it from
>>> asm very early on? Then we could ignore the 32bit map and everything
>>> could just stay 1:1 mapped.
>>>
>>
>> Out of reset, if booting from NOR flash, the flash controller is pre-configured
>> to use the low region address. We can only reprogram the controller when U-Boot
>> is not running on it.
> 
> I see, so you keep the low map alive until you make the switch-over to
> DDR. Makes a lot of sense.
> 
> I guess I can give the conversion another stab now whenever I get a free
> night :). If I understand you correctly we'd only need to do non-1:1
> maps for the early code, right?

So far, yes. But we don't want to block ourselves from using non-1:1 mapping
down the road, do we?
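
To make the point concrete, here is a minimal sketch of a map description that
keeps virtual and physical addresses as separate fields, so a 1:1 map is just
the special case virt == phys. The struct and function names are hypothetical
and are not taken from the patch set.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical region descriptor, for illustration only. */
struct map_region {
	uint64_t virt;	/* virtual base address */
	uint64_t phys;	/* physical base address; equals virt for a 1:1 map */
	uint64_t size;	/* length of the region in bytes */
};

/*
 * Translate a virtual address by walking a table of regions.
 * Returns 0 and stores the physical address in *pa on success,
 * or -1 if no region covers va.
 */
static int virt_to_phys(const struct map_region *map, size_t count,
			uint64_t va, uint64_t *pa)
{
	for (size_t i = 0; i < count; i++) {
		if (va >= map[i].virt && va - map[i].virt < map[i].size) {
			*pa = map[i].phys + (va - map[i].virt);
			return 0;
		}
	}
	return -1;
}
```

With a table like this, the early map could hold one non-1:1 entry for the
flash window while every other entry stays 1:1.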

> 
>> I see you are trying to maintain the 1:1 mapping for MMU. Why so? I think the
>> framework should allow different mapping.
> 
> Mostly for the sake of simplicity. It wouldn't be very difficult to
> extend the logic to support setting of va != pa, but I find code vastly
> easier to debug and understand if the address I see is the address I access.
> 

Agreed.

York
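
For reference, the block-size choice described in the cover letter quoted above
(4k pages, with 1G or 2M blocks for ranges that fit) can be sketched roughly as
follows. This is an illustration under stated assumptions, not the code from
the patch set: all names are hypothetical, the range is assumed to be 4k
aligned, and the splitting of blocks on later dcache attribute changes is not
modelled. Because the result depends only on the base and size of each range,
the same memory map always yields the same counts.

```c
#include <stdint.h>

#define SZ_4K	(1ULL << 12)
#define SZ_2M	(1ULL << 21)
#define SZ_1G	(1ULL << 30)

/* Hypothetical helper: largest of 1G / 2M / 4K that is aligned at addr
 * and still fits into the remaining length. */
static uint64_t pick_block_size(uint64_t addr, uint64_t remaining)
{
	if (!(addr & (SZ_1G - 1)) && remaining >= SZ_1G)
		return SZ_1G;
	if (!(addr & (SZ_2M - 1)) && remaining >= SZ_2M)
		return SZ_2M;
	return SZ_4K;
}

/* Number of mappings of each size used to cover one range. */
struct block_counts {
	uint64_t n_1g;
	uint64_t n_2m;
	uint64_t n_4k;
};

/* Walk a 4k-aligned range, always taking the largest block that fits. */
static struct block_counts count_blocks(uint64_t base, uint64_t size)
{
	struct block_counts c = { 0, 0, 0 };
	uint64_t addr = base;
	uint64_t remaining = size;

	while (remaining >= SZ_4K) {
		uint64_t bs = pick_block_size(addr, remaining);

		if (bs == SZ_1G)
			c.n_1g++;
		else if (bs == SZ_2M)
			c.n_2m++;
		else
			c.n_4k++;

		addr += bs;
		remaining -= bs;
	}
	return c;
}
```

A range that starts on a 1G boundary and is 1G + 2M + 4k long is covered by one
mapping of each size, while a 2M range that starts at a 4k offset falls back to
4k pages throughout.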