[U-Boot] [PATCH 8/8] x86: quark: Optimize MRC execution time

Simon Glass sjg at chromium.org
Wed Sep 2 04:48:28 CEST 2015


Hi Bin,

On 1 September 2015 at 04:29, Bin Meng <bmeng.cn at gmail.com> wrote:
> Hi Simon,
>
> On Tue, Sep 1, 2015 at 11:22 AM, Bin Meng <bmeng.cn at gmail.com> wrote:
>> Hi Simon,
>>
>> On Tue, Sep 1, 2015 at 11:12 AM, Simon Glass <sjg at chromium.org> wrote:
>>> Hi Bin,
>>>
>>> On 31 August 2015 at 21:04, Bin Meng <bmeng.cn at gmail.com> wrote:
>>>> Hi Simon,
>>>>
>>>> On Tue, Sep 1, 2015 at 10:57 AM, Simon Glass <sjg at chromium.org> wrote:
>>>>> Hi Bin,
>>>>>
>>>>> On 31 August 2015 at 19:40, Bin Meng <bmeng.cn at gmail.com> wrote:
>>>>>> Hi Simon,
>>>>>>
>>>>>> On Tue, Sep 1, 2015 at 7:12 AM, Simon Glass <sjg at chromium.org> wrote:
>>>>>>> Hi Bin,
>>>>>>>
>>>>>>> On 31 August 2015 at 08:04, Bin Meng <bmeng.cn at gmail.com> wrote:
>>>>>>>> Hi Simon,
>>>>>>>>
>>>>>>>> On Mon, Aug 31, 2015 at 9:54 PM, Simon Glass <sjg at chromium.org> wrote:
>>>>>>>>> Hi Bin,
>>>>>>>>>
>>>>>>>>> On 31 August 2015 at 07:43, Bin Meng <bmeng.cn at gmail.com> wrote:
>>>>>>>>>> Hi Simon,
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 31, 2015 at 9:20 PM, Simon Glass <sjg at chromium.org> wrote:
>>>>>>>>>>> Hi Bin,
>>>>>>>>>>>
>>>>>>>>>>> On 31 August 2015 at 03:52, Bin Meng <bmeng.cn at gmail.com> wrote:
>>>>>>>>>>>> Boot time performance degradation is observed with the conversion
>>>>>>>>>>>> to use dm pci. Intel Quark SoC has a low end x86 processor with
>>>>>>>>>>>> only 400MHz frequency and the most time consuming part is with MRC.
>>>>>>>>>>>> Each MRC register programming requires indirect access via pci bus.
>>>>>>>>>>>> With dm pci, accessing pci configuration space has some overhead.
>>>>>>>>>>>> Unfortunately this single access overhead gets accumulated in the
>>>>>>>>>>>> whole MRC process, and finally leads to twice boot time (25 seconds)
>>>>>>>>>>>> than before (12 seconds).
>>>>>>>>>>>>
>>>>>>>>>>>> To speed up the boot, create an optimized version of pci config
>>>>>>>>>>>> read/write routines without bothering to go through driver model.
>>>>>>>>>>>> Now it only takes about 3 seconds to finish MRC, which is really
>>>>>>>>>>>> fast (8 times faster than dm pci, or 4 times faster than before).
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Bin Meng <bmeng.cn at gmail.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>
>>>>>>>>>>>>  arch/x86/cpu/quark/msg_port.c | 59 +++++++++++++++++++++++++++----------------
>>>>>>>>>>>>  1 file changed, 37 insertions(+), 22 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> Before I delve into the patch - with driver model we are using the I/O
>>>>>>>>>>> method - see pci_x86_read_config(). Is that the source of the slowdown
>>>>>>>>>>> or is it just general driver model overhead.
>>>>>>>>>>
>>>>>>>>>> The MRC calls APIs in arch/x86/cpu/quark/msg_port.c to program DDR
>>>>>>>>>> controller. Inside msg_port.c, pci_write_config_dword() and
>>>>>>>>>> pci_read_config_dword() are called.
>>>>>>>>>>
>>>>>>>>>> With driver model, the overhead is:
>>>>>>>>>>
>>>>>>>>>> pci_write_config_dword() -> pci_write_config32() -> pci_write_config()
>>>>>>>>>> -> uclass_get_device_by_seq() then pci_bus_write_config() will finally
>>>>>>>>>> call pci_x86_read_config().
>>>>>>>>>>
>>>>>>>>>> Without driver model, there is still some overhead (so previously the
>>>>>>>>>> MRC time was about 12 seconds)
>>>>>>>>>>
>>>>>>>>>> pci_write_config_dword() -> pci_hose_write_config_dword() ->
>>>>>>>>>> TYPE1_PCI_OP(write, dword, u32, outl, 0)
>>>>>>>>>>
>>>>>>>>>> With my optimized version, pci_write_config_dword() directly calls a
>>>>>>>>>> hardcoded dword size pci config access, without the need to consider
>>>>>>>>>> offset and mask, and dereferencing hose->cfg_addr/cfg->data.
>>>>>>>>>
>>>>>>>>> What about if we use dm_pci_read_config32()? We should try to move PCI
>>>>>>>>> access to driver model to avoid the uclass_get_device_by_seq()
>>>>>>>>> everywhere.
>>>>>>>>
>>>>>>>> I don't think that helps. dm_pci_read_config32() requires a dm driver.
>>>>>>>> MRC is just something that program a bunch of registers with pci
>>>>>>>> config rw call.
>>>>>>>
>>>>>>> My question is really what takes the time? It's not clear whether it
>>>>>>> is the driver model overhead or something else. The code you add in
>>>>>>> qrk_pci_write_config_dword() looks very similar to
>>>>>>> pci_x86_read_config().
>>>>>>>
>>>>>>
>>>>>> It is the driver model overhead. In order to get to
>>>>>> pci_x86_read_config(), we need go through a bunch of function calls
>>>>>> (see above). Yes, my version is very similar to pci_x86_read_config(),
>>>>>> but my version is more simpler as it only needs to deal with dword
>>>>>> size thus no need to do offset mask and switch/case. If you look at
>>>>>> the Quark MRC codes, there are thousands of calls to msg_port_read()
>>>>>> and msg_port_write().
>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If the former then perhaps we should change this. If the latter then
>>>>>>>>>>> we have work to do...
>>>>>>>>>>>
>>>>>>>>
>>>>>>>> Also for this patch, I just realized that not only it helps to reduce
>>>>>>>> the boot time, but also it helps to support PCIe root port in future
>>>>>>>> patches.
>>>>>>>>
>>>>>>>> By checking the Quark SoC manual, I found something that needs to be
>>>>>>>> done to get the two PCIe root ports on Quark SoC to work. In order to
>>>>>>>> get PCIe root ports show up in the PCI configuration space, we need
>>>>>>>> program some registers in the message bus (using APIs in
>>>>>>>> arch/x86/cpu/quark/msg_port.c). With driver model, if we still call
>>>>>>>> pci_write_config_dword() in the msg_port.c, we end up triggering PCI
>>>>>>>> enumeration process first before we actually write something to the
>>>>>>>> configuration space, however the enumeration process will hang when
>>>>>>>> scanning to the PCIe root port if we don't properly initialize the
>>>>>>>> message bus registers. This is a chicken-egg problem.
>>>>>>>
>>>>>>> Sure, although I see that as a separate problem.
>>>>>>>
>>>>>>
>>>>>> Yes, it is a separate problem. It came to me when I was reading the
>>>>>> manual after I submitted the patch.
>>>>>>
>>>>>>> We can't have a driver model implementation that is really slow, so
>>>>>>> I'd like to clear that up first. Of course your patch makes sense for
>>>>>>> other reasons, but I don't want to play whack-a-mole here.
>>>>>>>
>>>>>>
>>>>>> Agreed. So far the driver model PCI is used on x86 boards and sandbox.
>>>>>> The performance issue was not obvious on these targets, but it is
>>>>>> quite noticeable on Intel Quark. These PCI config read/write routines
>>>>>> will go through lots of function calls before we actually touch the
>>>>>> I/O ports 0xcf8 and 0xcfc, especially when the device is not on the
>>>>>> root bus (it needs to go through its parent followed by its parent's
>>>>>> parent).
>>>>>>
>>>>>> But anyway I think this optimization for Quark is needed. I doubt we
>>>>>> can optimize driver model pci to such an extent.
>>>>>
>>>>> Can we use this as an opportunity to try a few things? If we use the
>>>>> dm_pci functions that should cut out some overhead. Can you try an
>>>>> experiment to see how much difference it makes?
>>>>>
>>>>> dm_pci_read_config32()
>>>>
>>>> We can't use this API as MRC is not a dm driver.
>>>
>>> OK, probably I need to dig in and understand this a little better. Is
>>> it running pre-relocation with the early PCI stuff? We could make a
>>> driver with UCLASS_RAM perhaps.
>>
>> Yes, it is running pre-relocation with the early PCI stuff. But I
>> doubt the need to create a UCLASS_RAM for x86 targets as most x86
>> targets uses FSP to initialize the RAM. The best candidate to
>> implement UCLASS_RAM that I can think of now is the Freescale DDR
>> controller driver on powerpc, and on ARM recently. It supports both
>> SPD and memory-down. On ARM, most RAM targets' DDR initialization is
>> memory-down I believe.
>>
>
> Some updates today when trying to support PCIe root ports in the v2:
>
> I moved qrk_pci_write_config_dword() and qrk_pci_read_config_dword()
> from msg_port.c to quark.c and update all codes in quark.c to call
> these two routines to avoid the chicken & egg problem. With this
> change, I noticed that the MRC execution time changed from 3 seconds
> to 4 seconds. So 1 additional second is needed. I disassembled u-boot
> and found in v1 since qrk_pci_write_config_dword() and
> qrk_pci_read_config_dword() are declared static in msg_port.c, they
> are inlined by the compiler into these APIs in msg_port.c. But with v2
> changes, they are no longer inline but normal function call, which
> causes this additional 1 second.
>
> So even inline makes some improvement for MRC, not to mention if we
> can avoid these multiple call chains in the driver model PCI APIs.

OK so at this point I think we should give up and go with your
original code. I hope that in future we can figure out a way to make
this more efficient, but for now, let's run with it.

Regards,
Simon


More information about the U-Boot mailing list