[U-Boot] [Resend RFC PATCH 1/2] armv8: Fix dcache disable function
york sun
york.sun at nxp.com
Fri Oct 21 21:31:52 CEST 2016
On 10/20/2016 01:34 PM, Stephen Warren wrote:
> On 10/19/2016 11:06 PM, york sun wrote:
>> On 10/19/2016 06:01 PM, Stephen Warren wrote:
>>> On 10/19/2016 04:32 PM, york sun wrote:
>>>> On 10/19/2016 12:18 PM, Stephen Warren wrote:
>>>>> On 10/19/2016 09:25 AM, Stephen Warren wrote:
>>>>>> On 10/14/2016 02:17 PM, York Sun wrote:
>>>>>>> Current code turns off the d-cache first, then flushes all levels
>>>>>>> of cache. This results in data loss. As soon as the d-cache is off,
>>>>>>> the dirty cache data is discarded, according to the test on
>>>>>>> LS2080A. This issue was not seen as long as the external L3 cache
>>>>>>> was flushed to push the data to main memory. However, the external
>>>>>>> L3 cache is not guaranteed to have the data. To fix this, flush the
>>>>>>> d-cache by set/way first to make sure the cache is clean before
>>>>>>> turning it off.
>>>>>>
>>>>>>> diff --git a/arch/arm/cpu/armv8/cache_v8.c
>>>>>>> b/arch/arm/cpu/armv8/cache_v8.c
>>>>>>
>>>>>>> @@ -478,9 +478,9 @@ void dcache_disable(void)
>>>>>>
>>>>>>> + flush_dcache_all();
>>>>>>> set_sctlr(sctlr & ~(CR_C|CR_M));
>>>>>>>
>>>>>>> - flush_dcache_all();
>>>>>>> __asm_invalidate_tlb_all();
>>>>>>
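>>>>>> For reference, with that hunk applied dcache_disable() would read
>>>>>> roughly as follows (a sketch; assumes the rest of the function in
>>>>>> cache_v8.c is unchanged):
>>>>>>
>>>>>> void dcache_disable(void)
>>>>>> {
>>>>>>         uint32_t sctlr;
>>>>>>
>>>>>>         sctlr = get_sctlr();
>>>>>>
>>>>>>         /* if cache isn't enabled no need to disable */
>>>>>>         if (!(sctlr & CR_C))
>>>>>>                 return;
>>>>>>
>>>>>>         /* patch: clean/invalidate while the cache is still on */
>>>>>>         flush_dcache_all();
>>>>>>
>>>>>>         set_sctlr(sctlr & ~(CR_C|CR_M));
>>>>>>
>>>>>>         __asm_invalidate_tlb_all();
>>>>>> }
>>>>>>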
>>>>>> I talked to Mark Rutland at ARM, and I believe the current code is
>>>>>> correct. Here's my interpretation of what he said:
>>>>>>
>>>>>> The dcache must be disabled first. This prevents allocation of new
>>>>>> entries in the cache during the flush operation, which prevents the race
>>>>>> conditions that were mentioned in the other thread.
>>>>>>
>>>>>> Then, the flush operation must be invoked. Since the cache is now
>>>>>> disabled, this can fully flush the cache without worrying about racing
>>>>>> with things being added to the cache.
>>>>>>
>>>>>> This all implies that the implementation of dcache_disable(),
>>>>>> set_sctlr(), flush_dcache_all(), and any code they call must not access
>>>>>> data in DRAM at all: because the dcache is off, any DRAM access
>>>>>> will[1] read potentially stale data from DRAM, rather than any dirty
>>>>>> data that might be in the cache.
>>>>>>
>>>>>> [1] I'm not sure if that's "will" or "may", i.e. whether this is
>>>>>> architecturally guaranteed in ARMv8 or is implementation defined. At
>>>>>> least the Cortex A72 TRM says "will" for that CPU; not sure about others.
>>>>>>
>>>>>> Perhaps the most obvious upshot of this is that the stack can't be used.
>>>>>> This implies to me that we need to recode all those functions purely in
>>>>>> assembly, or just possibly play some tricks to 100% force gcc not to
>>>>>> touch memory anywhere inside dcache_disable() or the functions it calls.
>>>>>> We're just getting lucky here right now since everything happens to be
>>>>>> inlined, but I don't think we're doing anything to 100% guarantee this.
>>>>>>
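>>>>>> To illustrate what that might look like, one option is to fold the
>>>>>> critical sequence into a single asm block -- a hypothetical sketch,
>>>>>> not current U-Boot code, assuming we run at EL2 and the cache is
>>>>>> known to be on:
>>>>>>
>>>>>> static void dcache_off_no_dram_access(void)
>>>>>> {
>>>>>>         asm volatile(
>>>>>>                 "mrs    x0, sctlr_el2\n"
>>>>>>                 "bic    x0, x0, #(1 << 2)\n"    /* clear SCTLR.C */
>>>>>>                 "bic    x0, x0, #(1 << 0)\n"    /* clear SCTLR.M */
>>>>>>                 "msr    sctlr_el2, x0\n"
>>>>>>                 "isb\n"
>>>>>>                 ::: "x0", "memory");
>>>>>>         /* the flush that follows must obey the same no-DRAM rule */
>>>>>> }
>>>>>>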
>>>>>> What worries me here is that at least on Tegra, a "flush the entire
>>>>>> dcache" operation requires an SMC call to the secure monitor. That will
>>>>>> certainly access DRAM when the secure monitor runs, but perhaps this
>>>>>> doesn't matter since that's at a different exception level, and we know
>>>>>> the secure monitor accesses DRAM regions that are separate from U-Boot's
>>>>>> DRAM? I suspect life isn't that convenient. I'm wondering if this all
>>>>>> implies that, like patch 2 in this series, we need to get 100% away from
>>>>>> flush-by-set/way, even with SoC-specific hooks to make that work
>>>>>> reliably, and just flush everything by VA, which IIRC is architecturally
>>>>>> guaranteed to work without SoC-specific logic. That way, we can
>>>>>> encapsulate everything into an assembly function without worrying about
>>>>>> calling SMCs or SoC-specific hook functions without using DRAM. Of
>>>>>> course, how that assembly function knows which VAs to flush without
>>>>>> decoding the page tables or other data structure is another matter:-(
>>>>>
>>>>> Re: the last paragraph there:
>>>>>
>>>>> After reading the ARMv8 ARM, I see that EL1, EL2, and EL3 all have
>>>>> separate cache enable bits in SCTLR_ELx. I believe U-Boot only needs to
>>>>> flush/... its own RAM out of the dcache, since that's all that's
>>>>> relevant at the EL U-Boot is running at, and doesn't care about anything
>>>>> EL3 might have mapped cached. So, it's safe to invoke SMCs from the
>>>>> cache flush code in U-Boot even if the EL3 code touches its own DRAM.
>>>>> There might be a corner case where this isn't true if EL3 has some
>>>>> EL1/EL2-owned RAM mapped, but I don't expect that to be the case here.
>>>>>
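>>>>> U-Boot's own accessor already reflects that per-EL split -- roughly,
>>>>> from its armv8 asm/system.h, it picks SCTLR_ELx based on the current
>>>>> EL:
>>>>>
>>>>> static inline unsigned int get_sctlr(void)
>>>>> {
>>>>>         unsigned int el, val;
>>>>>
>>>>>         el = current_el();
>>>>>         if (el == 1)
>>>>>                 asm volatile("mrs %0, sctlr_el1" : "=r" (val) : : "cc");
>>>>>         else if (el == 2)
>>>>>                 asm volatile("mrs %0, sctlr_el2" : "=r" (val) : : "cc");
>>>>>         else
>>>>>                 asm volatile("mrs %0, sctlr_el3" : "=r" (val) : : "cc");
>>>>>
>>>>>         return val;
>>>>> }
>>>>>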
>>>>> Re: my other series to add more cache hooks: I'll re-implement the Tegra
>>>>> hook in assembly so it's guaranteed not to touch RAM, retest, and resubmit.
>>>>>
>>>>> If it turns out that dcache_disable() ever starts touching DRAM at the
>>>>> wrong time, we can deal with that then; it doesn't now at least.
>>>>>
>>>>
>>>> Stephen,
>>>>
>>>> I think one of our differences is whether the data is returned
>>>> correctly with a dirty dcache being disabled. With the current code
>>>> flow, the dcache is disabled, followed by flushing/invalidating by
>>>> set/way, then the L3 cache is flushed. I see a correct stack after
>>>> these steps. During my recent attempt to remove the L3 flush, I added
>>>> code to flush the stack by VA (for which I already confirmed the data
>>>> ends up in main memory) to replace flushing L3. I found that as soon
>>>> as the dcache is disabled, the dirty cache data is lost when trying
>>>> to access it from the core (debugged with a JTAG tool). This makes me
>>>> think the order may be wrong. By reversing the order, i.e. flushing
>>>> the dcache first and then disabling it, I can see correct data in
>>>> main memory. I understand the proposed new code requires not touching
>>>> the stack or any data after the first flush. I prefer not to make
>>>> this change if we have an explanation of what I saw.
>>>
>>> I think the confusion is: what does "lost" mean?
>>>
>>> On ARMv8 at least:
>>>
>>> Starting off with dcache enabled and containing some dirty data:
>>>
>>> CPU reads will read the dirty data from the cache.
>>>
>>> Now if dcache is disabled:
>>>
>>> CPU reads will be of the uncached/uncachable type since dcache is off.
>>> The dcache will not respond. So, the CPU will receive (potentially)
>>> stale data directly from RAM.
>>>
>>> In this state, it is not possible to access the dirty data in the
>>> caches. However, it's not (permanently) lost...
>>>
>>> Now if the dcache is fully flushed:
>>>
>>> All dirty data that was in the dcache will be written to RAM. After this
>>> process is complete, any reads from RAM will return the most up-to-date
>>> possible data; the temporarily "lost" dirty data is now no longer lost.
>>>
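>>> As a concrete (hypothetical) illustration, for a cacheable location p
>>> whose line is dirty:
>>>
>>>         *p = 42;            /* dirty line in dcache; RAM may hold 0  */
>>>         /* dcache disabled:   reads of *p go to RAM, return stale 0  */
>>>         flush_dcache_all(); /* dirty line written back to RAM        */
>>>         /* reads of *p now return 42; nothing was permanently lost   */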
>>>
>>> The cache disable process must proceed as above; we can't flush the
>>> dcache and then disable it. Attempting to do things in that order would
>>> introduce race conditions. At all times before the dcache is disabled,
>>> any CPU writes to memory will allocate new lines in the dcache. Also,
>>> the dcache is fully responsive to the cache coherency protocol, so some
>>> other agent on the bus (another CPU, cluster, DMA agent, ...) could
>>> cause a (clean or dirty) line to be brought into the cache. This might
>>> happen after that part of the cache would have been flushed. In other
>>> words, it's impossible to fully flush the cache (by set/way) while it's
>>> enabled.
>>>
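>>> Schematically, the race during a set/way flush with the cache enabled:
>>>
>>>         /* 1. c/i set S, way W: the line holding address A is cleaned */
>>>         /* 2. a store, or incoming coherency traffic, re-allocates    */
>>>         /*    A's line (possibly dirty) into a set/way already done   */
>>>         /* 3. the loop completes, yet dirty data remains in the cache */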
>>>
>>> I'm not sure that flushing by VA instead of by set/way solves anything
>>> much here. I believe flushing by VA runs less risk of interference due
>>> to the coherency protocol bringing lines back into the cache, since
>>> flush by VA will affect other caches? However this still doesn't allow
>>> use of the CPU's own stack or other DRAM writes; if we:
>>>
>>> a) Flush by VA
>>> (this step is assumed to perform some function calls/returns, or other
>>> DRAM writes)
>>>
>>> b) Disable cache
>>>
>>> ... then irrespective of how step (a) was performed, the DRAM
>>> writes performed in that step are likely still in the cache as dirty
>>> data, so step (b) will cause them to be "lost" until a full flush
>>> operation is performed. Thus we must always flush after disabling the
>>> cache, so there's no point also flushing before.
>>>
>>>
>> Stephen,
>>
>> Sorry for late response. I am still traveling...
>
> I didn't think your response was slow:-)
>
>> I understand the data in dirty cache is not lost when the dcache is
>> disabled. It is just not accessible. In my case, after flushing L1/L2 by
>> way/set, the data is probably in L3 cache. Without flushing L3, I have
>> stalled data (including the stack) in main memory.
>
> I assume "stale" not "stalled".
>
> Yes, I agree: If you have only flushed L1/L2 by set/way (or attempted to
> do everything, but the L3 cache doesn't implement by set/way
> operations), then L3 can indeed still contain dirty data, and hence main
> memory can be stale.
>
>> My previous test was trying to prove I can skip flushing L3 if I flush
>> the necessary VA.
>
> It depends whether your L3 cache is before/after the Level of
> Unification and/or the Level of Coherency for your HW, and whether you
> flush by VA to the PoU or PoC.
>
> Looking at your "[PATCH 2/2] armv8: Fix flush_dcache_all function", it
> uses __asm_flush_dcache_range() which cleans and invalidates to the
> point of coherency (it invokes the dc civac instruction).
>
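> For reference, that operation amounts to a loop like this (simplified
> sketch; the real code derives the line size from CTR_EL0 rather than
> assuming 64 bytes, and the function name here is made up):
>
> static void flush_range_to_poc(unsigned long start, unsigned long end)
> {
>         unsigned long line = 64;        /* assumed D-cache line size */
>         unsigned long addr;
>
>         for (addr = start & ~(line - 1); addr < end; addr += line)
>                 asm volatile("dc civac, %0" : : "r" (addr) : "memory");
>         asm volatile("dsb sy" : : : "memory");
> }
>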
>> Now when I recall it carefully, I think I made a
>> mistake by flushing by VA _after_ flushing the cache by set/way. I might
>> have a positive result if I flushed the cache by VA first. I will repeat
>> this test when I get back to prove this theory.
>
> I'll assume your L3 cache is before the Point of Coherency. If so, then
> performing a clean/invalidate by VA to the PoC will affect all of L1,
> L2, and L3 caches. As such, you need only perform the c/i by VA, and you
> can entirely skip the c/i by set/way; I believe it would be redundant,
> assuming that the by VA operations cover all relevant VAs.
I believe the PoC and PoU are before L3 for me. I can clean/invalidate by
VA, but it may not cover all the cache lines, so by set/way is still needed.
>
>> That being said, do you or anyone else see any value in skipping the
>> L3 flush? My target was to avoid making an SMC call for EL2/EL1, and
>> to avoid implementing an L3 flush function in U-Boot (even though we
>> already have one for CCN-504). I understand this is different from
>> what dcache_disable() implies.
>
> I /think/ that if we can modify U-Boot to do all cache-wide operations
> by VA to the PoC, then we can indeed do that instead of any by-set/way
> operations, and in turn that will mean U-Boot doesn't need to invoke any
> SMC calls or SoC-specific logic when enabling/disabling the cache. My
> remaining questions would be:
>
> a) How can we determine the set of VA ranges we need to operate on?
>
> In the case where U-Boot is invalidating the dcache prior to enabling
> it, I think the set of VA ranges to operate on is determined by the MMU
> configuration of the SW that ran before U-Boot, since that defines what
> dirty data might be present in the dcache. U-Boot won't be able to
> determine that.
>
> Perhaps we only care about the set of VAs that U-Boot will actually
> enable in its own MMU configuration, since those will be the only VAs
> the CPU will be allowed to architecturally access. So, this might
> actually be a solvable problem.
>
> Perhaps U-Boot requires that the previous SW has already cleaned all
> relevant parts of the dcache (the fact U-Boot attempts to invalidate the
> entire dcache before enabling it implies this is not the case though).
>
> In the case where U-Boot has already programmed its own MMU
> configuration, then that MMU configuration can determine the set of VAs
> to operate on. This case is easy (easier) since we definitely have all
> the required configuration knowledge.
>
> b) How can we implement the by VA code in a way that doesn't touch DRAM?
>
> Implementing by-set/way is fairly constrained in that all the
> configuration data is in a few registers, and the algorithm just
> requires a few nested loops.
>
> A generic by VA implementation seems like it would require walking
> U-Boot's struct mm_region *mem_map data structure (or the CPU's
> translation tables) in RAM. Perhaps that's OK since it's read-only...
>
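> Something like this, perhaps (untested sketch; assumes mem_map[] ends
> with a zero-size terminator entry as the armv8 mmu code expects, and
> that __asm_flush_dcache_range() itself touches no DRAM):
>
> static void flush_all_mapped_regions(void)
> {
>         struct mm_region *r;
>
>         /* clean+invalidate every mapped region, by VA, to the PoC */
>         for (r = mem_map; r->size; r++)
>                 __asm_flush_dcache_range(r->virt, r->virt + r->size);
> }
>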
I agree in general with your points, but it may not always be practical
to flush everything by VA. U-Boot may map a huge amount of memory, and
walking through the MMU table to flush it all would take too long.
Without flushing all memory, we really cannot say we flushed the entire
dcache. On the other hand, for U-Boot itself to operate, we don't really
have to flush everything. I guess the key question is whether we need to
flush all of it. For Linux to boot, we don't. We can flush some memory
(including the U-Boot stack) and turn off the MMU. As soon as the kernel
boots up, it enables the dcache and everything is back. One can argue
that if users put something in memory and need it before the dcache is
re-enabled, then it makes sense to flush the memory concerned, unless it
is not practical to do so.
At this moment, I can drop these two patches. I will still run the test I
described earlier.
York