[PATCH v5 10/13] video: Only dcache flush damaged lines
Alexander Graf
agraf at csgraf.de
Tue Aug 22 00:44:38 CEST 2023
On 22.08.23 00:10, Simon Glass wrote:
> Hi Alex,
>
> On Mon, 21 Aug 2023 at 13:59, Alexander Graf <agraf at csgraf.de> wrote:
>>
>> On 21.08.23 21:11, Simon Glass wrote:
>>> Hi Alper,
>>>
>>> On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiyasak at gmail.com> wrote:
>>>> From: Alexander Graf <agraf at csgraf.de>
>>>>
>>>> Now that we have a damage area tells us which parts of the frame buffer
>>>> actually need updating, let's only dcache flush those on video_sync()
>>>> calls. With this optimization in place, frame buffer updates - especially
>>>> on large screen such as 4k displays - speed up significantly.
>>>>
>>>> Signed-off-by: Alexander Graf <agraf at csgraf.de>
>>>> Reported-by: Da Xue <da at libre.computer>
>>>> [Alper: Use damage.xstart/yend, IS_ENABLED()]
>>>> Co-developed-by: Alper Nebi Yasak <alpernebiyasak at gmail.com>
>>>> Signed-off-by: Alper Nebi Yasak <alpernebiyasak at gmail.com>
>>>> ---
>>>>
>>>> Changes in v5:
>>>> - Use xstart, ystart, xend, yend as names for damage region
>>>> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
>>>>
>>>> Changes in v2:
>>>> - Fix dcache range; we were flushing too much before
>>>> - Remove ifdefs
>>>>
>>>> drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
>>>> 1 file changed, 36 insertions(+), 5 deletions(-)
>>> This is a little strange, since flushing the whole cache will only
>>> actually write out data that was actually written (to the display). Is
>>> there a benefit to this patch, in terms of performance?
>>
>> I'm happy to see you go through the same thought process I went through
>> when writing these: "This surely can't be the problem, can it?". The
>> answer is "simple" in hindsight:
>>
>> Have a look at the ARMv8 cache flush function. It does the only "safe"
>> thing you can expect it to do: Clean+Invalidate to POC because we use it
>> for multiple things, clearing modified code among others:
>>
>> ENTRY(__asm_flush_dcache_range)
>> mrs x3, ctr_el0
>> ubfx x3, x3, #16, #4
>> mov x2, #4
>> lsl x2, x2, x3 /* cache line size */
>>
>> /* x2 <- minimal cache line size in cache system */
>> sub x3, x2, #1
>> bic x0, x0, x3
>> 1: dc civac, x0 /* clean & invalidate data or unified
>> cache */
>> add x0, x0, x2
>> cmp x0, x1
>> b.lo 1b
>> dsb sy
>> ret
>> ENDPROC(__asm_flush_dcache_range)
>>
>>
>> Looking at the "dc civac" call, we find this documentation page from
>> ARM:
>> https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Instructions/DC-CIVAC--Data-or-unified-Cache-line-Clean-and-Invalidate-by-VA-to-PoC
>>
>> This says we're writing any dirtyness of this cache line up to the POC
>> and then invalidate (remove the cache line) also up to POC. That means
>> when you look at a typical SBC, this will either be L2 or system level
>> cache. Every read afterwards needs to go and pull it all the way back to
>> L1 to modify it (or not) on the next character write and then flush it
>> again.
>>
>> Even worse: Because of the invalidate, we may even evict it from caches
>> that the display controller uses to read the frame buffer. So depending
>> on the SoC's cache topology and implementation, we may force the display
>> controller to refetch the full FB content on its next screen refresh cycle.
>>
>> I faintly remember that I tried to experiment with CVAC instead to only
>> flush without invalidating. I don't fully remember the results anymore
>> though. I believe CVAC just behaved identical to CIVAC on the A53
>> platform I was working on. And then I looked at Cortex-A53 errata like
>> [1] and just accepted that doing anything but restricting the flushing
>> range is a waste of time :)
> Yuck I didn't know it was invalidating too. That is horrible. Is there
> no way to fix it?
Before building all of this damage logic, I tried, but failed. I'd
welcome anyone else to try again :). I'm not even convinced yet that it
is actually fixable: Depending on the SoC's internal cache logic, it may
opt to always invalidate I think.
That said, this patch set really also makes sense outside of the
particular invalidate problem. It creates a generic abstraction between
the copy and non-copy code path and allows us to reduce the amount of
work spent for both, generically for any video sync operation.
Alex
More information about the U-Boot
mailing list