[U-Boot] [Patch v4 2/5] ARMv8: Adjust MMU setup

Thu Jun 5 20:34:23 CEST 2014

On 06/05/2014 10:41 AM, Mark Rutland wrote:
> On Thu, Jun 05, 2014 at 04:07:17PM +0100, York Sun wrote:
>> On 06/05/2014 03:09 AM, Mark Rutland wrote:
>>> On Wed, Jun 04, 2014 at 05:27:30PM +0100, York Sun wrote:
>>>> On 06/02/2014 11:01 AM, Mark Rutland wrote:
>>>>> On Mon, Jun 02, 2014 at 05:06:13PM +0100, York Sun wrote:
>>>>>> On 06/02/2014 04:34 AM, Mark Rutland wrote:
>>>>>>> On Thu, May 29, 2014 at 09:49:05PM +0100, York Sun wrote:
>>>>>>>> Make MMU functions reusable. Platform code can setup its own MMU tables.
>>>>>>>
>>>>>>> What exactly does platform code need to setup its own tables for?
>>>>>>
>>>>>> The general ARMv8 MMU table is not detail enough to control memory attribute
>>>>>> like cache for all addresses. We have devices mapping to addresses with
>>>>>> different requirement for cache control.
>>>>>
>>>>> And there are no APIs for creating device mappings rather than exporting
>>>>> the raw pagetable accessors and hard-coding them differently in every
>>>>> board file?
>>>>>
>>>>
>>>> That's a good question. At this point, only two platforms are using ARMv8 code.
>>>> I am expecting FSL ARMv8 implementation will stay similar, i.e. covered by the
>>>> file I added. If that's not the case, or more ARMv8 SoCs need special MMU table,
>>>> we then should introduce such API. Having a full function MMU API may be an
>>>> overkill for U-boot. We don't need dynamic MMU anyway.
>>>
>>> Maybe. It just seems to me that it would be possible to pre-allocate an
>>> empty table that we could place device (nGnRnE?) mappings in. Then all
>>> you'd need to call from board code is a function to map a range, rather
>>> than having to duplicate logic for creating the tables you want.
>>
>> It sounds good, but not the case. For the three level tables I am using (level0,
>> level1, level2), I don't have level2 table for every address, that will be too
>> many. Instead, I have a lot of blocks for level1. When I need some fine control
>> within a level1 block range, I have to create a new level2 table. It is doable,
>> but I will hold on that if I can use static table.
> 
> While my suggestion might not be the best, I'm not sure I follow, unless
> you always want to idmap devices?
> 
> If you don't idmap devices, then you can place all of the disparate
> physical mappings within a single table unless you have very large
> peripherals to map?

If you mean identical map as idmap, yes I am creating identical map for devices.
I got your point. For this particular SoC, if I can get it work with these
simple static tables, I will stay with them. But if I need to maintain the
tables for various SoCs, I will convert to dynamic API.

> 
>>
>>>
>>>>
>>>>>>
>>>>>>>
>>>>>>>> Also fix a typo of "TCR_EL3_IPS_BITS" in cache_v8.c.
>>>>>>>>
>>>>>>>> Signed-off-by: York Sun <yorksun at freescale.com>
>>>>>>>> CC: David Feng <fenghua at phytium.com.cn>
>>>>>>>> ---
>>>>>>>> Change log:
>>>>>>>>  v4: new patch, splitted from v3 2/4
>>>>>>>>      Revise set_pgtable_section() to be reused by platform MMU code
>>>>>>>>      Add inline function set_ttbr_tcr_mair() to be used by this and platform mmu code
>>>>>>>>
>>>>>>>>  arch/arm/cpu/armv8/cache_v8.c    |   49 ++++++++++++++++----------------------
>>>>>>>>  arch/arm/include/asm/armv8/mmu.h |   23 ++++++++++++++++++
>>>>>>>>  2 files changed, 43 insertions(+), 29 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/arm/cpu/armv8/cache_v8.c b/arch/arm/cpu/armv8/cache_v8.c
>>>>>>>> index a96ecda..67dbd46 100644
>>>>>>>> --- a/arch/arm/cpu/armv8/cache_v8.c
>>>>>>>> +++ b/arch/arm/cpu/armv8/cache_v8.c
>>>>>>>> @@ -12,15 +12,14 @@
>>>>>>>>  DECLARE_GLOBAL_DATA_PTR;
>>>>>>>>
>>>>>>>>  #ifndef CONFIG_SYS_DCACHE_OFF
>>>>>>>> -
>>>>>>>> -static void set_pgtable_section(u64 section, u64 memory_type)
>>>>>>>> +void set_pgtable_section(u64 *page_table, u64 index, u64 section,
>>>>>>>> +                         u64 memory_type)
>>>>>>>>  {
>>>>>>>> -        u64 *page_table = (u64 *)gd->arch.tlb_addr;
>>>>>>>>          u64 value;
>>>>>>>>
>>>>>>>> -        value = (section << SECTION_SHIFT) | PMD_TYPE_SECT | PMD_SECT_AF;
>>>>>>>> +        value = section | PMD_TYPE_SECT | PMD_SECT_AF;
>>>>>>>>          value |= PMD_ATTRINDX(memory_type);
>>>>>>>> -        page_table[section] = value;
>>>>>>>> +        page_table[index] = value;
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  /* to activate the MMU we need to set up virtual memory */
>>>>>>>> @@ -28,10 +27,13 @@ static void mmu_setup(void)
>>>>>>>>  {
>>>>>>>>          int i, j, el;
>>>>>>>>          bd_t *bd = gd->bd;
>>>>>>>> +        u64 *page_table = (u64 *)gd->arch.tlb_addr;
>>>>>>>>
>>>>>>>>          /* Setup an identity-mapping for all spaces */
>>>>>>>> -        for (i = 0; i < (PGTABLE_SIZE >> 3); i++)
>>>>>>>> -                set_pgtable_section(i, MT_DEVICE_NGNRNE);
>>>>>>>> +        for (i = 0; i < (PGTABLE_SIZE >> 3); i++) {
>>>>>>>> +                set_pgtable_section(page_table, i, i << SECTION_SHIFT,
>>>>>>>> +                                    MT_DEVICE_NGNRNE);
>>>>>>>> +        }
>>>>>>>>
>>>>>>>>          /* Setup an identity-mapping for all RAM space */
>>>>>>>>          for (i = 0; i < CONFIG_NR_DRAM_BANKS; i++) {
>>>>>>>> @@ -39,36 +41,25 @@ static void mmu_setup(void)
>>>>>>>>                  ulong end = bd->bi_dram[i].start + bd->bi_dram[i].size;
>>>>>>>>                  for (j = start >> SECTION_SHIFT;
>>>>>>>>                       j < end >> SECTION_SHIFT; j++) {
>>>>>>>> -                        set_pgtable_section(j, MT_NORMAL);
>>>>>>>> +                        set_pgtable_section(page_table, j, j << SECTION_SHIFT,
>>>>>>>> +                                            MT_NORMAL);
>>>>>>>>                  }
>>>>>>>>          }
>>>>>>>>
>>>>>>>>          /* load TTBR0 */
>>>>>>>>          el = current_el();
>>>>>>>>          if (el == 1) {
>>>>>>>> -                asm volatile("msr ttbr0_el1, %0"
>>>>>>>> -                             : : "r" (gd->arch.tlb_addr) : "memory");
>>>>>>>> -                asm volatile("msr tcr_el1, %0"
>>>>>>>> -                             : : "r" (TCR_FLAGS | TCR_EL1_IPS_BITS)
>>>>>>>> -                             : "memory");
>>>>>>>> -                asm volatile("msr mair_el1, %0"
>>>>>>>> -                             : : "r" (MEMORY_ATTRIBUTES) : "memory");
>>>>>>>> +                set_ttbr_tcr_mair(el, gd->arch.tlb_addr,
>>>>>>>> +                                  TCR_FLAGS | TCR_EL1_IPS_BITS,
>>>>>>>> +                                  MEMORY_ATTRIBUTES);
>>>>>>>>          } else if (el == 2) {
>>>>>>>> -                asm volatile("msr ttbr0_el2, %0"
>>>>>>>> -                             : : "r" (gd->arch.tlb_addr) : "memory");
>>>>>>>> -                asm volatile("msr tcr_el2, %0"
>>>>>>>> -                             : : "r" (TCR_FLAGS | TCR_EL2_IPS_BITS)
>>>>>>>> -                             : "memory");
>>>>>>>> -                asm volatile("msr mair_el2, %0"
>>>>>>>> -                             : : "r" (MEMORY_ATTRIBUTES) : "memory");
>>>>>>>> +                set_ttbr_tcr_mair(el, gd->arch.tlb_addr,
>>>>>>>> +                                  TCR_FLAGS | TCR_EL2_IPS_BITS,
>>>>>>>> +                                  MEMORY_ATTRIBUTES);
>>>>>>>>          } else {
>>>>>>>> -                asm volatile("msr ttbr0_el3, %0"
>>>>>>>> -                             : : "r" (gd->arch.tlb_addr) : "memory");
>>>>>>>> -                asm volatile("msr tcr_el3, %0"
>>>>>>>> -                             : : "r" (TCR_FLAGS | TCR_EL2_IPS_BITS)
>>>>>>>> -                             : "memory");
>>>>>>>> -                asm volatile("msr mair_el3, %0"
>>>>>>>> -                             : : "r" (MEMORY_ATTRIBUTES) : "memory");
>>>>>>>> +                set_ttbr_tcr_mair(el, gd->arch.tlb_addr,
>>>>>>>> +                                  TCR_FLAGS | TCR_EL3_IPS_BITS,
>>>>>>>> +                                  MEMORY_ATTRIBUTES);
>>>>>>>>          }
>>>>>>>>
>>>>>>>>          /* enable the mmu */
>>>>>>>> diff --git a/arch/arm/include/asm/armv8/mmu.h b/arch/arm/include/asm/armv8/mmu.h
>>>>>>>> index 1193e76..7de4ff9 100644
>>>>>>>> --- a/arch/arm/include/asm/armv8/mmu.h
>>>>>>>> +++ b/arch/arm/include/asm/armv8/mmu.h
>>>>>>>> @@ -108,4 +108,27 @@
>>>>>>>>                                  TCR_IRGN_WBWA |         \
>>>>>>>>                                  TCR_T0SZ(VA_BITS))
>>>>>>>>
>>>>>>>> +#ifndef __ASSEMBLY__
>>>>>>>> +void set_pgtable_section(u64 *page_table, u64 index,
>>>>>>>> +                         u64 section, u64 memory_type);
>>>>>>>> +static inline void set_ttbr_tcr_mair(int el, u64 table, u64 tcr, u64 attr)
>>>>>>>> +{
>>>>>>>> +        asm volatile("dsb sy;isb");
>>>>>>>
>>>>>>> Huh? This wasn't anywhere before. Is the isb necessary?
>>>>>>
>>>>>> Probably not, but to make sure all previous instructions finish.
>>>>>
>>>>> Which instructions do you care about completing?
>>>>>
>>>>> You'll certainly want any page table writes to complete, but is anything
>>>>> else in-flight at this point in time?
>>>>
>>>> Before calling this inline function, MMU tables were created. We want the
>>>> writing completes before setting these registers.
>>>
>>> Sure, but the dsb sy guarantees that.
>>>
>>> The dsb sy will ensure that all reads and writes before it have become
>>> visible to all observers in the full system shareability domain
>>> (including the MMU) before any subsequent instructions (in program
>>> order) can execute.
>>>
>>> I'd only expect to see a "dsb $DOMAIN; isb" sequence is when we're
>>> changing things that affect the instruction stream (I-cache maintenance,
>>> changing the mapping underlying the currently executing stream).
>>
>> Thanks. I can drop "isb". Please comment on v5 patches.
> 
> Will do momentarily.
> 
>>
>>>
>>>>>
>>>>>>>
>>>>>>> When is this function expected to be called? Is the MMU expected to be
>>>>>>> on?
>>>>>>
>>>>>> The general ARMv8 code calls this when MMU is off. For the SoC I am enabling, it
>>>>>> is called when MMU is off for the first time, but MMU on for the 2nd time.
>>>>>
>>>>> Ok.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +        if (el == 1) {
>>>>>>>> +                asm volatile("msr ttbr0_el1, %0" : : "r" (table) : "memory");
>>>>>>>> +                asm volatile("msr tcr_el1, %0" : : "r" (tcr) : "memory");
>>>>>>>> +                asm volatile("msr mair_el1, %0" : : "r" (attr) : "memory");
>>>>>>>
>>>>>>> If the MMU is on, this looks really scary -- what are you expecting to
>>>>>>> change in a single invocation?
>>>>>>
>>>>>> It is not scary for general ARMv8 code. MMU is off then this is called. For FSL
>>>>>> SoC, I have these called for the 2nd time, when MMU is on. The 2nd time call
>>>>>> points to a new MMU table.
>>>>>
>>>>> Oh but it is ;)
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +        } else if (el == 2) {
>>>>>>>> +                asm volatile("msr ttbr0_el2, %0" : : "r" (table) : "memory");
>>>>>>>> +                asm volatile("msr tcr_el2, %0" : : "r" (tcr) : "memory");
>>>>>>>> +                asm volatile("msr mair_el2, %0" : : "r" (attr) : "memory");
>>>>>>>> +        } else if (el == 3) {
>>>>>>>> +                asm volatile("msr ttbr0_el3, %0" : : "r" (table) : "memory");
>>>>>>>> +                asm volatile("msr tcr_el3, %0" : : "r" (tcr) : "memory");
>>>>>>>> +                asm volatile("msr mair_el3, %0" : : "r" (attr) : "memory");
>>>>>>>> +        } else {
>>>>>>>> +                hang();
>>>>>>>> +        }
>>>>>>>
>>>>>>> And no synchronisation to ensure that these writes are complete or even
>>>>>>> ordered w.r.t. each other?
>>>>>>>
>>>>>>
>>>>>> That's why I added asm volatile("dsb sy;isb") before them. The order of these
>>>>>> write doesn't matter. See the code before my change
>>>>>> http://git.denx.de/?p=u-boot.git;a=blob;f=arch/arm/cpu/armv8/cache_v8.c;h=a96ecda7e3146996d20a2e46b975dced2769255c;hb=HEAD
>>>>>
>>>>> Ok, so that orders them against everything _before_ them, but not with
>>>>> respect to each other, or anything _after_ them.
>>>>>
>>>>> When the MMU is off, you are correct that the ordering of these
>>>>> operations w.r.t. each other doesn't matter. With the MMU on that's not
>>>>> true as the CPU can rerder the operations and can walk the page tables
>>>>> asynchronously.
>>>>>
>>>>> Changing the MAIR at runtime is somewhat scary because you may be
>>>>> changing attributes for entries which are active in the cache and/or
>>>>> TLBs. I'm not sure I follow why you need to change it once the MMU is on
>>>>> -- there are plenty of subfields in the MAIR that you could configure
>>>>> before turning the MMU on for use later.
>>>>>
>>>>> Likewise changing the TCR at runtime is somewhat scary because you can
>>>>> change attributes of active cache and/or TLB entries, change fields that
>>>>> affect the way page tables are walked (TG*. T*SZ), all asynchronously
>>>>> w.r.t. the logic that walsk the page tables.
>>>>>
>>>>> Changing the TTBR is fine, except that we didn't invalidate old entries
>>>>> with a TLBI, so we may even have partial translations cahced in the
>>>>> TLBs, and we can get odd translations generated from the combination of
>>>>> the old and new tables.
>>>>
>>>>
>>>> Maybe a more proper way to do so is to change TTBR only. That's actually what I
>>>> really need. Or if I really need to change all of them, I should turn off MMU first.
>>>
>>> In fact, if you had an API for creating device mappings, you wouldn't
>>> even need to change the TTBR -- you'd just need to replace some existing
>>> invalid entries with the new mapping, make them visible to the MMU with
>>> a dsb, and then carry on. The TLBs aren't allowed to cache invalid
>>> mappings, so on the next access to those mappings, the appropriate page
>>> table entry will be read into the TLBs.
>>>
>>> If you alter multiple levels of a live table then you'll need barriers
>>> (DMBs) between populating those levels. If you back device mappings by a
>>> single level of table then you can avoid that. Even better, with an API
>>> you can do the simple thing today then make it more advanced in future
>>> if necessary without having to change your board code.
>>>
>>
>> No objection here on the idea. But again this is not the case. My first MMU
>> table is in SRAM, which is small and will be used for other purpose. The 2nd MMU
>> table is in DDR. I could copy the table and do the maintenance as you said. For
>> now, I want to stick with the static table and only create the API when I have to.
> 
> Sure, if your tables are in SRAM then trying to do a load of dynamic
> allocation isn't going to work.
> 
> My fear is that while that sounds OK with a single user to do a quick
> havk and poke the tables directly, we'll end up with everyone doing
> that, and no-one will try to unify things. It is very diffifcult to
> unify such variation after the fact.

That's a good reason. Let me start to code the API. It will take a while to
cover the complexity of the multilevel tables and sizes. It will be a separated
patch for later review. I don't want that to delay this patch set. I am hoping
to get this set in for 2014.07 release.

> 
>>>>>> For the 2nd write used by FSL SoC, I need to flush the cache to make sure the
>>>>>> table is in main DDR and perform TLB invalidation. That's when the loading of
>>>>>> new MMU table happens.
>>>>>
>>>>> I missed the TLB invaliation -- where is that meant to happen?
>>>>>
>>>> After flushing the new MMU table into DDR. To your point above, I will send a
>>>> new version with updating TTBR only.
>>>
>>> Unfortunately, that's too early. The MMU is asynchronous w.r.t. the CPU
>>> and so it can re-read the existing tables into the TLBs _before_ it has
>>> been programmed with the new table address. The TLBI has to happen
>>> _after_ the old tables are no longer pointed to by the TTBR.
>>>
>>> What you need to do to replace the active set of tables (assuming that
>>> the new mapping has the instruction stream mapped in an identical way)
>>> is:
>>>
>>> - Write the tables.
>>>
>>> - DSB to make them visible to the MMU.
>>>
>>> - Write to the appropriate TTBR_*.
>>>
>>> - ISB to complete the write to the TTBR_*.
>>>
>>> - TLBI to invalidate the old mappings the the TLBs.
>>>
>>> - DSB to complete the TLBI.
>>>
>>> - If you've changed the instruction stream or system state that will
>>>   affect the instruction stream, ISB to flush the CPU pipeline.
>>>
>>>
>> Here is the flow I have (as of v5 patch)
>>
>> Write the tables
>>
>> (I removed dsb here in v5, need to add back)
>>
>> Write TTBR
>>
>> (I missed isb here, need to add)
>>
>> Flush dcache (otherwise the table will not be in DDR. Yes, I verified)
> 
> This looks odd -- why do we need the tables to be in DDR? Why would we
> flush them here, when the address is partially visible to the MMU?

This sounds odd but it actually makes sense. Let's say we have new tables
created by u-boot. The new tables are in the address of DDR with D-cache
enabled. If we don't flush the cache, the moment TLBs are invalidated, MMU will
fetch the tables from TTBR which points to DDR. Without a valid TLB, MMU cannot
get the table from D-cache, it has to fetch from the DDR directly. I have
verified this by checking waveforms of the SoC and exercised code in both ways.

> 
>> TLBI
>>
>> (DSB and ISB TLBI function)
> 
> Why another TLBI?

Not another one. I just point out the DSB and ISB are part of the existing
function in u-boot.
> 
>>
>> I will need to send v6 patch to add the missing parts.
> 
> Sure, I'll hold off replies here until then.
> 

v6 is coming soon today.

York