[PATCH v5 10/13] video: Only dcache flush damaged lines

Alper Nebi Yasak alpernebiyasak at gmail.com
Wed Aug 30 21:12:44 CEST 2023


On 2023-08-22 02:03 +03:00, Simon Glass wrote:
> Hi Alex,
> 
> On Mon, 21 Aug 2023 at 16:44, Alexander Graf <agraf at csgraf.de> wrote:
>>
>>
>> On 22.08.23 00:10, Simon Glass wrote:
>>> Hi Alex,
>>>
>>> On Mon, 21 Aug 2023 at 13:59, Alexander Graf <agraf at csgraf.de> wrote:
>>>>
>>>> On 21.08.23 21:11, Simon Glass wrote:
>>>>> Hi Alper,
>>>>>
>>>>> On Mon, 21 Aug 2023 at 07:51, Alper Nebi Yasak <alpernebiyasak at gmail.com> wrote:
>>>>>> From: Alexander Graf <agraf at csgraf.de>
>>>>>>
>>>>>> Now that we have a damage area tells us which parts of the frame buffer
>>>>>> actually need updating, let's only dcache flush those on video_sync()
>>>>>> calls. With this optimization in place, frame buffer updates - especially
>>>>>> on large screen such as 4k displays - speed up significantly.
>>>>>>
>>>>>> Signed-off-by: Alexander Graf <agraf at csgraf.de>
>>>>>> Reported-by: Da Xue <da at libre.computer>
>>>>>> [Alper: Use damage.xstart/yend, IS_ENABLED()]
>>>>>> Co-developed-by: Alper Nebi Yasak <alpernebiyasak at gmail.com>
>>>>>> Signed-off-by: Alper Nebi Yasak <alpernebiyasak at gmail.com>
>>>>>> ---
>>>>>>
>>>>>> Changes in v5:
>>>>>> - Use xstart, ystart, xend, yend as names for damage region
>>>>>> - Use IS_ENABLED() instead of CONFIG_IS_ENABLED()
>>>>>>
>>>>>> Changes in v2:
>>>>>> - Fix dcache range; we were flushing too much before
>>>>>> - Remove ifdefs
>>>>>>
>>>>>>    drivers/video/video-uclass.c | 41 +++++++++++++++++++++++++++++++-----
>>>>>>    1 file changed, 36 insertions(+), 5 deletions(-)
>>>>> This is a little strange, since flushing the whole cache will only
>>>>> actually write out data that was actually written (to the display). Is
>>>>> there a benefit to this patch, in terms of performance?
>>>>
>>>> I'm happy to see you go through the same thought process I went through
>>>> when writing these: "This surely can't be the problem, can it?". The
>>>> answer is "simple" in hindsight:
>>>>
>>>> Have a look at the ARMv8 cache flush function. It does the only "safe"
>>>> thing you can expect it to do: Clean+Invalidate to POC because we use it
>>>> for multiple things, clearing modified code among others:
>>>>
>>>> ENTRY(__asm_flush_dcache_range)
>>>>           mrs     x3, ctr_el0
>>>>           ubfx    x3, x3, #16, #4
>>>>           mov     x2, #4
>>>>           lsl     x2, x2, x3              /* cache line size */
>>>>
>>>>           /* x2 <- minimal cache line size in cache system */
>>>>           sub     x3, x2, #1
>>>>           bic     x0, x0, x3
>>>> 1:      dc      civac, x0       /* clean & invalidate data or unified
>>>> cache */
>>>>           add     x0, x0, x2
>>>>           cmp     x0, x1
>>>>           b.lo    1b
>>>>           dsb     sy
>>>>           ret
>>>> ENDPROC(__asm_flush_dcache_range)
>>>>
>>>>
>>>> Looking at the "dc civac" call, we find this documentation page from
>>>> ARM:
>>>> https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Instructions/DC-CIVAC--Data-or-unified-Cache-line-Clean-and-Invalidate-by-VA-to-PoC
>>>>
>>>> This says we're writing any dirtyness of this cache line up to the POC
>>>> and then invalidate (remove the cache line) also up to POC. That means
>>>> when you look at a typical SBC, this will either be L2 or system level
>>>> cache. Every read afterwards needs to go and pull it all the way back to
>>>> L1 to modify it (or not) on the next character write and then flush it
>>>> again.
>>>>
>>>> Even worse: Because of the invalidate, we may even evict it from caches
>>>> that the display controller uses to read the frame buffer. So depending
>>>> on the SoC's cache topology and implementation, we may force the display
>>>> controller to refetch the full FB content on its next screen refresh cycle.
>>>>
>>>> I faintly remember that I tried to experiment with CVAC instead to only
>>>> flush without invalidating. I don't fully remember the results anymore
>>>> though. I believe CVAC just behaved identical to CIVAC on the A53
>>>> platform I was working on. And then I looked at Cortex-A53 errata like
>>>> [1] and just accepted that doing anything but restricting the flushing
>>>> range is a waste of time :)
>>> Yuck I didn't know it was invalidating too. That is horrible. Is there
>>> no way to fix it?
>>
>>
>> Before building all of this damage logic, I tried, but failed. I'd
>> welcome anyone else to try again :). I'm not even convinced yet that it
>> is actually fixable: Depending on the SoC's internal cache logic, it may
>> opt to always invalidate I think.
> 
> Wow, that is crazy! How is anyone supposed to make the system run well
> with logic like that??!
> 
>>
>> That said, this patch set really also makes sense outside of the
>> particular invalidate problem. It creates a generic abstraction between
>> the copy and non-copy code path and allows us to reduce the amount of
>> work spent for both, generically for any video sync operation.
> 
> Sure...my question was really why it helps so much, given what I
> understood the caches to be doing. If they are invalidating, then it
> is amazing anything gets done...

I don't really know cache mechanisms and terminology, but AFAIU there's
nothing actionable for this patch regarding this discussion, right?

Meanwhile I noticed this patch only flushes priv->fb, and think it also
needs to flush priv->copy_fb if VIDEO_COPY.


More information about the U-Boot mailing list