[U-Boot] [PATCH 8/8] x86: quark: Optimize MRC execution time

Bin Meng bmeng.cn at gmail.com
Tue Sep 1 05:04:06 CEST 2015


Hi Simon,

On Tue, Sep 1, 2015 at 10:57 AM, Simon Glass <sjg at chromium.org> wrote:
> Hi Bin,
>
> On 31 August 2015 at 19:40, Bin Meng <bmeng.cn at gmail.com> wrote:
>> Hi Simon,
>>
>> On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <sjg at chromium.org> wrote:
>>> Hi Bin,
>>>
>>> On 31 August 2015 at 08:04, Bin Meng <bmeng.cn at gmail.com> wrote:
>>>> Hi Simon,
>>>>
>>>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg at chromium.org> wrote:
>>>>> Hi Bin,
>>>>>
>>>>> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn at gmail.com> wrote:
>>>>>> Hi Simon,
>>>>>>
>>>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg at chromium.org> wrote:
>>>>>>> Hi Bin,
>>>>>>>
>>>>>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn at gmail.com> wrote:
>>>>>>>> Boot time performance degradation is observed with the conversion
>>>>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>>>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>>>>>> Each MRC register programming requires indirect access via pci bus.
>>>>>>>> With dm pci, accessing pci configuration space has some overhead.
>>>>>>>> Unfortunately this single access overhead gets accumulated in the
>>>>>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>>>>>> than before (12 seconds).
>>>>>>>>
>>>>>>>> To speed up the boot, create an optimized version of pci config
>>>>>>>> read/write routines without bothering to go through driver model.
>>>>>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>>>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>>>>>
>>>>>>>> Signed-off-by: Bin Meng <bmeng.cn at gmail.com>
>>>>>>>> ---
>>>>>>>>
>>>>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>>>>
>>>>>>> Before I delve into the patch - with driver model we are using the I/O
>>>>>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>>>>>> or is it just general driver model overhead.
>>>>>>
>>>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>>>>>> controller. Inside msg_port.c, pci_write_config_dword() and
>>>>>> pci_read_config_dword() are called.
>>>>>>
>>>>>> With driver model, the overhead is:
>>>>>>
>>>>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>>>>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>>>>>> call pci_x86_read_config().
>>>>>>
>>>>>> Without driver model, there is still some overhead (so previously the
>>>>>> MRC time was about 12 seconds)
>>>>>>
>>>>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>>>>
>>>>>> With my optimized version, pci_write_config_dword() directly calls a
>>>>>> hardcoded dword size pci config access, without the need to consider
>>>>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>>>>>
>>>>> What about if we use dm_pci_read_config32()? We should try to move PCI
>>>>> access to driver model to avoid the uclass_get_device_by_seq()
>>>>> everywhere.
>>>>
>>>> I don't think that helps. dm_pci_read_config32() requires a dm driver.
>>>> MRC is just something that program a bunch of registers with pci
>>>> config rw call.
>>>
>>> My question is really what takes the time? It's not clear whether it
>>> is the driver model overhead or something else. The code you add in
>>> qrk_pci_write_config_dword() looks very similar to
>>> pci_x86_read_config().
>>>
>>
>> It is the driver model overhead. In order to get to
>> pci_x86_read_config(), we need go through a bunch of function calls
>> (see above). Yes, my version is very similar to pci_x86_read_config(),
>> but my version is more simpler as it only needs to deal with dword
>> size thus no need to do offset mask and switch/case. If you look at
>> the Quark MRC codes, there are thousands of calls to msg_port_read()
>> and msg_port_write().
>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> If the former then perhaps we should change this. If the latter then
>>>>>>> we have work to do...
>>>>>>>
>>>>
>>>> Also for this patch, I just realized that not only it helps to reduce
>>>> the boot time, but also it helps to support PCIe root port in future
>>>> patches.
>>>>
>>>> By checking the Quark SoC manual, I found something that needs to be
>>>> done to get the two PCIe root ports on Quark SoC to work. In order to
>>>> get PCIe root ports show up in the PCI configuration space, we need
>>>> program some registers in the message bus (using APIs in
>>>> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
>>>> pci_write_config_dword() in the msg_port.c, we end up triggering PCI
>>>> enumeration process first before we actually write something to the
>>>> configuration space, however the enumeration process will hang when
>>>> scanning to the PCIe root port if we don't properly initialize the
>>>> message bus registers. This is a chicken-egg problem.
>>>
>>> Sure, although I see that as a separate problem.
>>>
>>
>> Yes, it is a separate problem. It came to me when I was reading the
>> manual after I submitted the patch.
>>
>>> We can't have a driver model implementation that is really slow, so
>>> I'd like to clear that up first. Of course your patch makes sense for
>>> other reasons, but I don't want to play whack-a-mole here.
>>>
>>
>> Agreed. So far the driver model PCI is used on x86 boards and sandbox.
>> The performance issue was not obvious on these targets, but it is
>> quite noticeable on Intel Quark. These PCI config read/write routines
>> will go through lots of function calls before we actually touch the
>> I/O ports 0xcf8 and 0xcfc, especially when the device is not on the
>> root bus (it needs to go through its parent followed by its parent's
>> parent).
>>
>> But anyway I think this optimization for Quark is needed. I doubt we
>> can optimize driver model pci to such an extent.
>
> Can we use this as an opportunity to try a few things? If we use the
> dm_pci functions that should cut out some overhead. Can you try an
> experiment to see how much difference it makes?
>
> dm_pci_read_config32()

We can't use this API as MRC is not a dm driver.

>    pci_get_bdf()
> pci_read_config()
>    for loop which probably terminates immediately
> pci_bus_read_config()
> read_config(), which is pci_x86_read_config()
>
> So it's not great but it doesn't look too bad.
>
> Also is this an Intel Gallileo gen 2 development board? I'm thinking
> of getting one.
>

Yes, this is an Intel Galileo gen2 development board. Although there
is an gen1 board in the past and the same u-boot.rom can boot on both
gen1 and gen2 board, Intel is now shipping only gen2 board.

>>
>>> Also we should start to move things away from the non-driver-model pci
>>> functions.
>>>
>>

Regards,
Bin


More information about the U-Boot mailing list