[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 5/5] RFC: pci: Migrate pci_mmcfg_{read,write} to pci.c
- To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
- From: Teddy Astie <teddy.astie@xxxxxxxxxx>
- Date: Tue, 19 May 2026 15:42:46 +0200
- Authentication-results: eu.smtp.expurgate.cloud; dkim=pass header.s=selector1 header.d=vates.tech header.i="@vates.tech" header.h="From:Subject:Date:Message-ID:To:Cc:MIME-Version:Content-Type:In-Reply-To:References:Feedback-ID"
- Autocrypt: addr=teddy.astie@xxxxxxxxxx; keydata= xsDNBGn5sK8BDACuzSrrTjpVf4ay06OYB6yY0J1PqKffihoNMtrQRZjAHxoAPC7LTBVHV/XO Zw5HJc+9R71z1JV+iYg6z3jPziGKzX8Fj3ZXlzJPmpf1PuETH3KdbvtJT4ny+OGntnJntUoR KRPhTirr6yNeBk/637O3CQXjtqFUPZnko8OI/o1yawIBhJJAWicutjkkUgd28Bh6HV9EIumH tCBgn5/1A/fpm9624MMgYLsA8qjC4XsoovQvFCaO8HEhvfzrrTZHjn/nPeB9SigxIxXW8YaT VqMdqul07o72m3eA2mf+LMu9a04FX/d4wbxBLtELm+1jIrbtyaFZEMOLv/haSiS/Lj3btJH/ EoucejoZ5SH49ksmVAmKOLktOaTQ8b2gEvP7iaKiIiszCCtOSRohr+2GvDsDeLvVZnlR3I+S PhHar7TPKjFz0G3DPNolyjXywNqOAMpomSPi8lSwjAFsxOtQbcck/qRGRSNk4DAmH70pA+89 MXfQXZ3qt1Q01B1+sU0I8xsAEQEAAc0kVGVkZHkgQXN0aWUgPHRlZGR5LmFzdGllQHZhdGVz LnRlY2g+wsENBBMBCAA3FiEEGAIew9LzHY3pdrqtZg+p0QLLz9AFAmn5sK8FCQWjmoACGwME CwkIBwUVCAkKCwUWAgMBAAAKCRBmD6nRAsvP0ID6DACGOktArFbLKHNzuyOVCskwfUZPla6Z pd3GZ8r61SrAKePIr2BnpgPkd0hV3bSRkRLIrgjzR2NRCzfp0x0HfuhcYfAYPR46XHTvjaJE v99sT/vGUG1BZguYDOScSEpgSNaNlYum3RKZbMuROxdK8G+YHccJY8PvWSq2K2yiae2KGiAv 1yjnZxug9/PtDfX8vQFUSg2w1ukRDf50wvDohN1zUQfFtofOP2xCRsDZiHAlQ0pF+aUjXQhP eP3IdpfWc8cyRLXF06Rk46YMYCytweGtGdHcqAfrVthl84129ZPN422k/voW0sm14gjYlGcT UwgnYlFRk2FLq0QeKEDcS0aj3o3EVAQCrayoGzi1pnlIKE3PRGUcUzjGVvzQ/po24gOjwba9 Egr/Wmu3MQlx/7A8zT5QBzF/n+RYdLNQ0Eu6YnUwf0Z1uieqNaon+olyIRFiLb/hCZHO6ekN f5vrm2clHUbQAYaPQebknujoKBo6ZLHg0WM1gZS01Gz+aUpKsUfOwM0EafmwsAEMAKiQiZa3 yQMmc/h3sDbfVHPSiBA4IMI/NAB7IotzPHq1GzCpsoVILAhF/INbWjxJ3DbVf+en3/FvdVZg 2S38xtnth0njNdlVKpyxm054phKjbdoFDwaknWolS4hrddTmetSG5/52AjtmPFtlXAk0NmLv fJnW3seXVQbgM7sW/MNXPP5UKDpkGnLhnvej+GU0s3109sJeXT5ImVdphFs9cvyZyBT9t1Pb Rowv58EgV0zE4hbAeVkULAbxFV5b/ExTjjGVHoX7CVhWxvCiTqCUoXZRkUE9C3FnkzEFRkKb Yu6NCfiHfEyB3Xyg9hfdrRgjMRq907zCof+nDtWxGz1MSEuvTj1g9GZ049Bennqzjc/Q+0ov XoK4jm+Py0FiUGUaA6yhexficjH+kCR/xDbVnWrMhSLB4AuTBT9HjfZI6gk3uYLhoT8Pig4/ eVtR2Q1wZIJsFToR6ofGuyECwFcs+PUXN7fmGRSiPXgjAr/zIUBdW0VWCE3OGPNqtRk2E5s6 IQARAQABwsD8BBgBCAAmFiEEGAIew9LzHY3pdrqtZg+p0QLLz9AFAmn5sLAFCQWjmoACGwwA CgkQZg+p0QLLz9DncQwAg76IehTemLIfrB8T9WIBZrI4kUV7G7a4rjiVoUiHYN5QwhnbZnsa JDlt+Ezoqy/510eo2bCSzvW5xXYPgyjcuOPwgQo1Qp764QxyX6rld2f2RcWkDuBHun55ZWXj by8o21ginPRwruBVYY5rVf3DV1iBu4NurUeHtyFk/dS0XTOQi2wVUb17sW/+ybCEokdVacZG zOqP/OmwHrF8ylXlXnhQq6e3r+J+T8fuoGJelm/CJiMwyP6cEWE8sxVqX/iqwjwUYkuOCpE+ lOWSvdNHgoEkWR0RXBPQjnGmLKbfTl/QDXLk6NP2/r9uxm2HL6Ei3QJKSEdrp+XZaVnk/Off O485NOTKwGOxyWb006cTMh53xPkAJFQu4Tvdj+odsHz88jqw5wfPG0BYWx0I/FspYj7N9kZR 8ULR9nX0LvpzJ/kB4NgHIUt8YtIL6ZSfM2dbF7fKzvx1UqFfvozJZwFzfEieJLXa4nlGgR6D x9fhaZEsniw8/bYgC3igkk5YJiOa
- Cc: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>
- Delivery-date: Tue, 19 May 2026 13:42:59 +0000
- Feedback-id: default:8631fc262581453bbf619ec5b2062170:Sweego
- List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
Le 18/05/2026 à 19:38, Andrew Cooper a écrit :
On 18/05/2026 4:21 pm, Teddy Astie wrote:
Key parts of MMCFG access bits are in mmconfig_64.c (in particular
pci_mmcfg_{read,write}()) while PCI configuration primitives (used accross the
codebase) are in pci.c.
This leads to situations where the compiler cannot optimize the `switch (len)`
for MMCFG access for all pci_conf_{read,write}N(), because they are not from
the same file.
Move the pci_mmcfg_{read,write} in pci.c and hint the compiler to inline these
functions such that it's more likely that the compiler eliminates the
`switch (len)``.
Also take the opportunity to migrate to pci_sbdf_t to reduce the parameter count
and drop many parameter domains checks.
On GCC 16.1, this leads to codegen where pci_conf_{read,write}N() doesn't call
pci_mmcfg_{read,write}() anymore and directly perform the MMIO RW.
<pci_conf_read32>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
53 push %rbx
89 f8 mov %edi,%eax
89 f3 mov %esi,%ebx
c1 ef 10 shr $0x10,%edi
81 fe ff 00 00 00 cmp $0xff,%esi
77 26 ja ffff82d040301fab <pci_conf_read32+0x3a>
85 ff test %edi,%edi
75 22 jne ffff82d040301fab <pci_conf_read32+0x3a>
0f b7 f8 movzwl %ax,%edi
c1 e7 08 shl $0x8,%edi
83 e3 fc and $0xfffffffc,%ebx
09 df or %ebx,%edi
81 cf 00 00 00 80 or $0x80000000,%edi
ba 04 00 00 00 mov $0x4,%edx
be 00 00 00 00 mov $0x0,%esi
e8 2a 1c 03 00 call ffff82d040333bd3 <pci_conf_read>
eb 22 jmp ffff82d040301fcd <pci_conf_read32+0x5c>
81 fb ff 0f 00 00 cmp $0xfff,%ebx
77 24 ja ffff82d040301fd7 <pci_conf_read32+0x66>
0f b6 d0 movzbl %al,%edx
0f b6 f4 movzbl %ah,%esi
0f b7 ff movzwl %di,%edi
e8 f5 fd ff ff call ffff82d040301db6 <pci_dev_base>
48 85 c0 test %rax,%rax
74 18 je ffff82d040301fde <pci_conf_read32+0x6d>
89 db mov %ebx,%ebx
48 01 d8 add %rbx,%rax
8b 00 mov (%rax),%eax
48 8b 5d f8 mov -0x8(%rbp),%rbx
c9 leave
e9 89 12 f0 ff jmp ffff82d040203260 <__x86_return_thunk>
b8 ff ff ff ff mov $0xffffffff,%eax
eb ef jmp ffff82d040301fcd <pci_conf_read32+0x5c>
b8 ff ff ff ff mov $0xffffffff,%eax
eb e8 jmp ffff82d040301fcd <pci_conf_read32+0x5c>
This is not the whole function because it's missing a pop %rbx. Also,
right at the bottom here are the -1's from bad error paths (discussed
later).
But, this should be after the ---. Disassembly this long isn't
interesting to stay in the commit message.
Yes, I guess I could just summary it as the function is now inlined
properly.
diff --git a/xen/arch/x86/x86_64/mmconfig_64.c
b/xen/arch/x86/x86_64/mmconfig_64.c
index 940cf6d747..483dff9c2c 100644
--- a/xen/arch/x86/x86_64/mmconfig_64.c
+++ b/xen/arch/x86/x86_64/mmconfig_64.c
@@ -133,6 +46,25 @@ static void __iomem *mcfg_ioremap(const struct
acpi_mcfg_allocation *cfg,
return (void __iomem *) virt;
}
+char __iomem *pci_mmcfg_base(unsigned int seg, unsigned int *bus)
+{
+ struct acpi_mcfg_allocation *cfg;
+ int cfg_num;
+
+ for (cfg_num = 0; cfg_num < pci_mmcfg_config_num; cfg_num++) {
+ cfg = pci_mmcfg_virt[cfg_num].cfg;
+ if (cfg->pci_segment == seg &&
+ (cfg->start_bus_number <= *bus) &&
+ (cfg->end_bus_number >= *bus)) {
+ *bus -= cfg->start_bus_number;
+ return pci_mmcfg_virt[cfg_num].virt;
+ }
+ }
+
+ /* Fall back to type 0 */
+ return NULL;
+}
This is a horrid function. Accessing and modifying bus like that causes
poor code generation, and by now having this in a separate translation
unit, the optimiser can't fold it into it's single caller and undo the
poor decisions which went into writing this function.
Instead, you want:
void __iomem *pci_mmcfg_base(pci_sbdf_t sbdf)
{
...
}
base which takes care of the bus adjustment internally. This can be
broken out into a separate patch, and take the opportunity to write it
to Xen style.
That's a good idea, IIUC you want this function to now return a pointer
to the ECAM MMIO region for this SBDF. So we won't have to compute it
afterward.
diff --git a/xen/arch/x86/x86_64/pci.c b/xen/arch/x86/x86_64/pci.c
index 8d33429103..c37e3edade 100644
--- a/xen/arch/x86/x86_64/pci.c
+++ b/xen/arch/x86/x86_64/pci.c
@@ -11,13 +11,123 @@
#define PCI_CONF_ADDRESS(sbdf, reg) \
(0x80000000U | ((sbdf).bdf << 8) | ((reg) & ~3))
+/*
+ * AMD Fam10h CPUs are buggy, and cannot access MMIO config space
+ * on their northbrige except through the * %eax register. As such, you MUST
+ * NOT use normal IOMEM accesses, you need to only use the magic mmio-config
+ * accessor functions.
+ * In fact just use pci_config_*, nothing else please.
I know you're just copying an existing comment, but it's mostly an
opinion and not terribly helpful in place.
"AMD Fam10h CPUs can only make MMCFG accesses via MOV %eax/%ax/%al",
would be better, except...
... this claim cannot be true. It's been made since the K8 RevF BKWG
and exists even into the latest PPRs, but that's simply not how
load/store ops work in the pipeline.
It was added to Linux in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3320ad994afb2c44ad34b3b34c3c5cf0da297331
but without adequate justification. I've made some enquiries.
Recent AMD PPR state that you must use eax/ax/al to access MMIO
configuration space, so I suppose it's not actually a bug (or was in the
past one, but just became a part of the spec ?)
> MMIO configuration space accesses must use the uncacheable (UC) memory
> type.
> Instructions used to read MMIO configuration space are required to
> take the following form:
> mov eax/ax/al, any_address_mode;
> Instructions used to write MMIO configuration space are required to
> take the following form:
> mov any_address_mode, eax/ax/al;
> No other source/target registers may be used other than eax/ax/al.
I can't find anything regarding Intel though.
If that happens to not be the case in practice, and it's fine to use
whichever register, that would be preferable; some other parts of the
code rely on just volatile pointers to do it.
+ */
+static inline unsigned char mmio_config_readb(void __iomem *pos)
+{
+ u8 val;
+ asm volatile("movb (%1),%%al" : "=a" (val) : "r" (pos));
+ return val;
+}
These need corrections, either in this patch or a followup.
Switch to Xen types, and correct the memory operand constraint to be "m"
(*(uint8_t *)ptr).
The Fam10h BKWG states that any memory encoding is acceptable, and this
allows the optimiser more flexibility (which will get used).
Looks good to me.
I was thinking about using a macro here, but I didn't like the end
result. So I guess I will stay with these inline functions.
Agree with styling change (I guess there are (too) many places that want
such refactor).
+
+static inline unsigned short mmio_config_readw(void __iomem *pos)
+{
+ u16 val;
+ asm volatile("movw (%1),%%ax" : "=a" (val) : "r" (pos));
+ return val;
+}
+
+static inline unsigned int mmio_config_readl(void __iomem *pos)
+{
+ u32 val;
+ asm volatile("movl (%1),%%eax" : "=a" (val) : "r" (pos));
+ return val;
+}
+
+static inline void mmio_config_writeb(void __iomem *pos, u8 val)
+{
+ asm volatile("movb %%al,(%1)" :: "a" (val), "r" (pos) : "memory");
+}
+
+static inline void mmio_config_writew(void __iomem *pos, u16 val)
+{
+ asm volatile("movw %%ax,(%1)" :: "a" (val), "r" (pos) : "memory");
+}
+
+static inline void mmio_config_writel(void __iomem *pos, u32 val)
+{
+ asm volatile("movl %%eax,(%1)" :: "a" (val), "r" (pos) : "memory");
+}
+
+static char __iomem *pci_dev_base(unsigned int seg, unsigned int bus, unsigned
int devfn)
+{
+ char __iomem *addr;
+
+ addr = pci_mmcfg_base(seg, &bus);
+ if (!addr)
+ return NULL;
+ return addr + ((bus << 20) | (devfn << 12));
+}
+
+static inline
+int pci_mmcfg_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len, u32
*value)
+{
+ char __iomem *addr;
+
+ /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
+ if (unlikely(reg > 4095)) {
+err: *value = -1;
+ return -EINVAL;
+ }
+
+ addr = pci_dev_base(sbdf.seg, sbdf.bus, sbdf.devfn);
+ if (!addr)
+ goto err;
+
+ switch (len) {
+ case 1:
+ *value = mmio_config_readb(addr + reg);
+ break;
+ case 2:
+ *value = mmio_config_readw(addr + reg);
+ break;
+ case 4:
+ *value = mmio_config_readl(addr + reg);
+ break;
+ }
+
+ return 0;
+}
Again, for this patch or a later cleanup, drop the output-by-pointer and
return value directly. The optimiser is hopefully doing this already
but it also makes the function simpler.
At best, we want ASSERT_UNREACHBLE()'s in the error paths (including no
mapping), and to consistently return -1. Returning 0 for a bad length
is bogus.
Strictly speaking we also need to check reg & (len - 1) because accesses
must be naturally aligned, but even with ASSERT_UNREACHABLE() and a
failsafe -1, that's still logic emitted and I'm not sure if it's worth
having. Amongst other things you really need to know that len is 1, 2
or 4 before the alignment check reads correctly.
Making this function inline itself could actually work at allowing the
compiler to eliminate many of these checks, but it's not ideal.
We can eventually leave alignment/length asserts to debug code.
+
+inline int pci_mmcfg_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int
len, u32 value)
+{
+ char __iomem *addr;
+
+ /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
+ if (unlikely(reg > 4095))
+ return -EINVAL;
+
+ addr = pci_dev_base(sbdf.seg, sbdf.bus, sbdf.devfn);
+ if (!addr)
+ return -EINVAL;
+
+ switch (len) {
+ case 1:
+ mmio_config_writeb(addr + reg, value);
+ break;
+ case 2:
+ mmio_config_writew(addr + reg, value);
+ break;
+ case 4:
+ mmio_config_writel(addr + reg, value);
+ break;
+ }
+
+ return 0;
+}
+
uint8_t pci_conf_read8(pci_sbdf_t sbdf, unsigned int reg)
{
uint32_t value;
if ( sbdf.seg || reg > 255 )
{
- pci_mmcfg_read(sbdf.seg, sbdf.bus, sbdf.devfn, reg, 1, &value);
+ pci_mmcfg_read(sbdf, reg, 1, &value);
return value;
}
Not for this patch, but we also need to junk the if() condition here and
elsewhere.
We should be using MMCFG unilaterally if it's available; the IO port
pairs require use of a global spinlock, and behind the scenes all the
CPU is doing is translating it back into MMCFG-like accesses.
At this juncture we should probably change it at the start of the 4.23
dev window to give it maximum time to settle before getting into a
release, so probably best to tack it on as a final commit in this series?
It's actually the plan for [1]. I was thinking of splitting this
specific patch and merge it into a new one dedicated to MMCFG
refactoring and keep the PCI refactoring around pci_sbdf_t separate (the
more I look to PCI subsystem, the more stuff I find to improve there).
~Andrew
Teddy
[1]
https://lore.kernel.org/xen-devel/cover.1767804090.git.teddy.astie@xxxxxxxxxx/
Attachment:
OpenPGP_0x660FA9D102CBCFD0.asc
Description: OpenPGP public key
Attachment:
OpenPGP_signature.asc
Description: OpenPGP digital signature
|