[U-Boot] [PATCH 2/8] armv7: cache maintenance operations for armv7

Aneesh V aneesh at ti.com
Thu Jan 13 12:10:18 CET 2011


On Thursday 13 January 2011 12:48 AM, Albert ARIBAUD wrote:
> (I realize I did not answer the other ones)
>
> Le 08/01/2011 11:06, Aneesh V a écrit :
>
>>> Out of curiosity, can you elaborate on why the compiler would optimize
>>> better in these cases?
>>
>> While counting down the termination condition check is against 0. So
>> you can just decrement the loop count using a 'subs' and do a 'bne'.
>> When you count up you have to do a comparison with a non-zero value. So
>> you will need one 'cmp' instruction extra:-)
>
> I would not try to be too smart about what instructions are generated
> and how by a compiler such as gcc which has rather complex code
> generation optimizations.

IMHO, on ARM comparing with 0 is always going to be efficient than
comparing with a non-zero number for a termination condition, assuming
a decent compiler.

>
>> bigger loop inside because that reduces the frequency at which your
>> outer parameter changes and hence the overall number of instructions
>> executed. Consider this:
>> 1. We encode both the loop counts along with other data into a register
>> that is finally written to CP15 register.
>> 2. outer loop has the code for shifting and ORing the outer variable to
>> this register.
>> 3. Inner loop has the code for shifting and ORing the inner variable.
>> Step (3) has to be executed 'way x set' number of times anyways.
>> But having bigger loop inside makes sure that 2 is executed fewer times!
>

It's not a constant calculation. It's based on loop index. And this
optimization is not relying on compiler specifics. This is a logic
level optimization. It should generally give good results with all
compilers. Perhaps I was wrong in stating that it helps in getting
better assembly. It just helps in better run-time efficiency.

 > Here too it seems like you're underestimating the compiler's optimizing
 > capabilities -- your explanation seems to amount to extracting a
 > constant calculation from a loop, something that is rather usual in code
 > optimizing.

Actually, in my experience(in this same context) GCC does a terrible
job at this! For instance:

+		for (set = num_sets - 1; set >= 0; set--) {
+			setway = (level << 1) | (set << log2_line_len) |
+				 (way << way_shift);

Here, way_shift = 32 - log2_num_ways

But if you substitute way_shift with the latter, GCC will put the 
subtraction instruction inside the loop! - where as it is clearly loop 
invariant. So, I had to move it explicitly out of the loop!
In fact, I was thinking of giving this feedback to GCC.

>
>> With these tweaks the assembly code generated by this C code is as good
>> as the original hand-written assembly code with my compiler.
>
> How about other compilers?
>

I haven't tested other compilers. However, as I mentioned above the
latter one is a logic optimization. The former hopefully should help
all ARM compilers.

As you must be knowing, existing code for cache maintenance was in
assembly. When I moved it to C I wanted to make sure that the generated
code is as good as the original assembly for this critical piece of
code (I expected some criticism about moving it to C :-)). That's
why I checked the generated code and did these ,hopefully, minor tweaks
to make it better. I hope they don't have any serious drawbacks.

Best regards,
Aneesh


More information about the U-Boot mailing list