
[Xen-devel] Question about partitioning shared cache in Xen



Hi,

[Goal]
I want to investigate the impact of the shared cache on the
performance of workloads in guest domains.
I also want to partition the shared cache via a page coloring
mechanism, so that guest domains use disjoint cache colors of the
shared cache and do not interfere with each other in the shared cache.

[Motivation: Why do I want to partition the shared cache?]
Because the shared cache is shared among all guest domains (I assume
the machine has multiple cores sharing the same LLC; for example, the
Intel(R) Xeon(R) CPU E5-1650 v2 has 6 physical cores sharing a 12MB L3
cache), the workload in one domU can interfere with another domU's
memory-intensive workload on the same machine via the shared cache.
This shared-cache interference makes the execution time of the
workload in a domU non-deterministic and much longer. (If we assume
the worst case, the worst-case execution time of the workload will be
too pessimistic.) A stable execution time is very important in
real-time computing, where a real-time program, such as a control
program in an automobile, has to produce its result within a deadline.

I did some quick measurements to show how the shared cache can be used
by a hostile domain to interfere with the execution time of another
domain's workload. I pinned the VCPUs of two domains to different
physical cores and used one domain to pollute the shared cache. The
result shows that shared-cache interference can slow down the
execution time of another domain's workload by 4x. The full experiment
results can be found at
https://github.com/PennPanda/cis601/blob/master/project/data/boxplot_cache_v2.pdf
 . (The workload in the figure is a program reading a large array. I
ran the program 100 times and plotted the latency of accessing the
array as a box plot. The first column, named "alone-d1v1", is the
latency when the program in dom1 runs alone. The fourth column,
"d1v1d2v1-pindiffcore", is the latency when the program in dom1 runs
alongside another program in dom2, and the two domains use different
cores. dom1 and dom2 each have 1 VCPU with budget equal to period. The
scheduler is the credit scheduler.)
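
(For reference, below is a minimal sketch of the kind of array-reading
workload I mean. The array size, stride, and timing method are
illustrative assumptions, not the exact benchmark used in the
experiment.)

#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE (64 * 1024 * 1024)  /* much larger than the 12MB LLC */
#define STRIDE     64                  /* one access per 64-byte cache line */

static char array[ARRAY_SIZE];

int main(void)
{
    struct timespec start, end;
    volatile char sink;
    long i;

    /* Touch every page once so the timed loop excludes demand faults. */
    for ( i = 0; i < ARRAY_SIZE; i += 4096 )
        array[i] = 1;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for ( i = 0; i < ARRAY_SIZE; i += STRIDE )
        sink = array[i];               /* read one byte per cache line */
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)sink;

    printf("array access time: %.3f ms\n",
           (end.tv_sec - start.tv_sec) * 1e3 +
           (end.tv_nsec - start.tv_nsec) / 1e6);
    return 0;
}

When a cache-polluting program runs on another core sharing the LLC,
the timed loop above slows down because its lines keep being evicted.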

[Idea of how to partition the shared cache]
When a PV guest domain is created, the toolstack calls
xc_dom_boot_mem_init() to allocate memory for the domain, which
eventually calls xc_domain_populate_physmap_exact() to allocate memory
pages from the domheap in Xen.
The idea of partitioning the shared cache is as follows:
1) xl tool change: add an option to the domain's configuration file
that specifies which cache colors this domain should use. (I have done
this; when I use xl create --dry-run, I can see the parameters are
parsed into the build information.)
2) hypervisor change: add another hypercall, XENMEM_populate_physmap_ca,
wrapped by xc_domain_populate_physmap_exact_ca(), which takes one more
parameter, i.e., the cache colors this domain should use. I also need
to reserve a memory pool that sorts the reserved memory pages by their
cache color.

When a PV domain is created, I can specify the cache colors it uses.
The xl tool then calls xc_domain_populate_physmap_exact_ca() to
allocate to this domain only memory pages with the specified cache
colors.
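
(To make the coloring concrete, here is a minimal sketch of the color
computation this design assumes, mirroring the RTXEN_LLC_CC_MASK
definition in the patch below. With a 2MB, 16-way LLC slice and 4KB
pages, a page's color is the low set-index bits of its machine frame
number, giving 2MB / 16 / 4KB = 32 colors.)

#include <stdio.h>

#define LLC_SLICE_SIZE (2 * 1024 * 1024)  /* bytes per LLC slice (assumed) */
#define LLC_ASSOC      16                 /* slice associativity (assumed) */
#define PAGE_SIZE_4K   4096
#define LLC_CC_MASK    (LLC_SLICE_SIZE / LLC_ASSOC / PAGE_SIZE_4K - 1) /* = 31 */
#define LLC_CC_NUM     (LLC_CC_MASK + 1)                               /* 32 colors */

/* A page's cache color: the set-index bits of the machine frame number
 * above the page offset. Two pages can conflict in the LLC only if
 * they have the same color. */
static unsigned int page_color(unsigned long mfn)
{
    return mfn & LLC_CC_MASK;
}

int main(void)
{
    unsigned long mfn;

    /* Colors repeat every LLC_CC_NUM frames: 0, 1, ..., 31, 0, 1, ... */
    for ( mfn = 0; mfn < 2 * LLC_CC_NUM; mfn++ )
        printf("mfn %3lu -> color %2u\n", mfn, page_color(mfn));
    return 0;
}

Two domains restricted to disjoint color sets then never map to the
same LLC sets, so they cannot evict each other's lines (ignoring Xen's
own allocations).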

[Quick implementation]
I attached my quick implementation patch at the end of this email.

[Issues and Questions]
After applying the patch to Xen commit
36174af3fbeb1b662c0eadbfa193e77f68cc955b and running it on my machine,
dom0 cannot boot up. :-(
The error message from dom0 is:
[    0.000000] Kernel panic - not syncing: Failed to get contiguous
memory for DMA from Xen!

[    0.000000] You either: don't have the permissions, do not have
enough free memory under 4GB, or the hypervisor memory is too
fragmented! (rc:-12)

I tried printing messages in every function I touched in order to
figure out where it goes wrong, but failed. :-(
The thing I cannot understand is this: my implementation does not
reserve any memory pages in the cache-aware memory pool before the
system boots up. Basically, none of the functions I modified are
called before the system boots up, yet the system crashes. :-( (The
system boots up and works perfectly before applying my patch.)

I would really appreciate it if any of you could point out what I
missed or misunderstood. :-)

Thank you very very much!

Best,

Meng

====
The full crash message is as follows:

Xen 4.5.0-rc

(XEN) Xen version 4.5.0-rc (root@) (gcc (Ubuntu/Linaro 4.6.3-1ubuntu5)
4.6.3) debug=y Sun Jan 11 11:39:23 EST 2015

(XEN) Latest ChangeSet: Sun Jan 4 22:19:40 2015 -0500 git:962a13f-dirty

(XEN) Bootloader: GRUB 1.99-21ubuntu3.14

(XEN) Command line: placeholder dom0_memory=512M sched=credit
console=tty0 com1=115200n8 console=com1

(XEN) Video information:

(XEN)  VGA is text mode 80x25, font 8x16

(XEN) Disc information:

(XEN)  Found 1 MBR signatures

(XEN)  Found 1 EDD information structures

(XEN) Xen-e820 RAM map:

(XEN)  0000000000000000 - 000000000009fc00 (usable)

(XEN)  000000000009fc00 - 00000000000a0000 (reserved)

(XEN)  00000000000f0000 - 0000000000100000 (reserved)

(XEN)  0000000000100000 - 00000000dfff0000 (usable)

(XEN)  00000000dfff0000 - 00000000e0000000 (ACPI data)

(XEN)  00000000fffc0000 - 0000000100000000 (reserved)

(XEN)  0000000100000000 - 0000000120000000 (usable)

(XEN) ACPI: RSDP 000E0000, 0024 (r2 VBOX  )

(XEN) ACPI: XSDT DFFF0030, 003C (r1 VBOX   VBOXXSDT        1 ASL        61)

(XEN) ACPI: FACP DFFF00F0, 00F4 (r4 VBOX   VBOXFACP        1 ASL        61)

(XEN) ACPI: DSDT DFFF0480, 1B96 (r1 VBOX   VBOXBIOS        2 INTL 20100528)

(XEN) ACPI: FACS DFFF0200, 0040

(XEN) ACPI: APIC DFFF0240, 006C (r2 VBOX   VBOXAPIC        1 ASL        61)

(XEN) ACPI: SSDT DFFF02B0, 01CC (r1 VBOX   VBOXCPUT        2 INTL 20100528)

(XEN) System RAM: 4095MB (4193852kB)

(XEN) No NUMA configuration found

(XEN) Faking a node at 0000000000000000-0000000120000000

(XEN) RTXEN:init_node_heap: called

(XEN) Domain heap initialised

(XEN) found SMP MP-table at 0009fff0

(XEN) DMI 2.5 present.

(XEN) Using APIC driver default

(XEN) ACPI: PM-Timer IO Port: 0x4008

(XEN) ACPI: SLEEP INFO: pm1x_cnt[1:4004,1:0], pm1x_evt[1:4000,1:0]

(XEN) ACPI:             wakeup_vec[dfff020c], vec_size[20]

(XEN) ACPI: Local APIC address 0xfee00000

(XEN) ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)

(XEN) Processor #0 6:5 APIC version 20

(XEN) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)

(XEN) Processor #1 6:5 APIC version 20

(XEN) ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)

(XEN) Processor #2 6:5 APIC version 20

(XEN) ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)

(XEN) Processor #3 6:5 APIC version 20

(XEN) ACPI: IOAPIC (id[0x04] address[0xfec00000] gsi_base[0])

(XEN) IOAPIC[0]: apic_id 4, version 17, address 0xfec00000, GSI 0-23

(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)

(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)

(XEN) ACPI: IRQ0 used by override.

(XEN) ACPI: IRQ2 used by override.

(XEN) ACPI: IRQ9 used by override.

(XEN) Enabling APIC mode:  Flat.  Using 1 I/O APICs

(XEN) ERST table was not found

(XEN) Using ACPI (MADT) for SMP configuration information

(XEN) SMP: Allowing 4 CPUs (0 hotplug CPUs)

(XEN) IRQ limits: 24 GSI, 760 MSI/MSI-X

(XEN) Using scheduler: SMP Credit Scheduler (credit)

(XEN) Detected 2288.242 MHz processor.

(XEN) Initing memory sharing.

(XEN) CPU0: No MCE banks present. Machine check support disabled

(XEN) alt table ffff82d0802d8b30 -> ffff82d0802d9b50

(XEN) I/O virtualisation disabled

(XEN) ENABLING IO-APIC IRQs

(XEN)  -> Using new ACK method

(XEN) ..TIMER: vector=0xF0 apic1=0 pin1=2 apic2=-1 pin2=-1

(XEN) Platform timer is 3.579MHz ACPI PM Timer

(XEN) Allocated console ring of 32 KiB.

(XEN) CPU1: No MCE banks present. Machine check support disabled

(XEN) CPU2: No MCE banks present. Machine check support disabled

(XEN) CPU3: No MCE banks present. Machine check support disabled

(XEN) Brought up 4 CPUs

(XEN) CPUIDLE: disabled due to no HPET. Force enable with 'cpuidle'.

(XEN) ACPI sleep modes: S3

(XEN) xenoprof: Initialization failed. Intel processor family 6 model
69is not supported

(XEN) Dom0 has maximum 600 PIRQs

(XEN) *** LOADING DOMAIN 0 ***

(XEN) elf_parse_binary: phdr: paddr=0x1000000 memsz=0xbfb000

(XEN) elf_parse_binary: phdr: paddr=0x1c00000 memsz=0x10e0f0

(XEN) elf_parse_binary: phdr: paddr=0x1d0f000 memsz=0x152c0

(XEN) elf_parse_binary: phdr: paddr=0x1d25000 memsz=0x6d6000

(XEN) elf_parse_binary: memory: 0x1000000 -> 0x23fb000

(XEN) elf_xen_parse_note: GUEST_OS = "linux"

(XEN) elf_xen_parse_note: GUEST_VERSION = "2.6"

(XEN) elf_xen_parse_note: XEN_VERSION = "xen-3.0"

(XEN) elf_xen_parse_note: VIRT_BASE = 0xffffffff80000000

(XEN) elf_xen_parse_note: ENTRY = 0xffffffff81d251e0

(XEN) elf_xen_parse_note: HYPERCALL_PAGE = 0xffffffff81001000

(XEN) elf_xen_parse_note: FEATURES = "!writable_page_tables|pae_pgdir_above_4gb"

(XEN) elf_xen_parse_note: PAE_MODE = "yes"

(XEN) elf_xen_parse_note: LOADER = "generic"

(XEN) elf_xen_parse_note: unknown xen elf note (0xd)

(XEN) elf_xen_parse_note: SUSPEND_CANCEL = 0x1

(XEN) elf_xen_parse_note: HV_START_LOW = 0xffff800000000000

(XEN) elf_xen_parse_note: PADDR_OFFSET = 0x0

(XEN) elf_xen_addr_calc_check: addresses:

(XEN)     virt_base        = 0xffffffff80000000

(XEN)     elf_paddr_offset = 0x0

(XEN)     virt_offset      = 0xffffffff80000000

(XEN)     virt_kstart      = 0xffffffff81000000

(XEN)     virt_kend        = 0xffffffff823fb000

(XEN)     virt_entry       = 0xffffffff81d251e0

(XEN)     p2m_base         = 0xffffffffffffffff

(XEN)  Xen  kernel: 64-bit, lsb, compat32

(XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x23fb000

(XEN) PHYSICAL MEMORY ARRANGEMENT:

(XEN)  Dom0 alloc.:   0000000114000000->0000000118000000 (974366 pages
to be allocated)

(XEN)  Init. ramdisk: 000000011d21a000->000000011ffffe00

(XEN) VIRTUAL MEMORY ARRANGEMENT:

(XEN)  Loaded kernel: ffffffff81000000->ffffffff823fb000

(XEN)  Init. ramdisk: ffffffff823fb000->ffffffff851e0e00

(XEN)  Phys-Mach map: ffffffff851e1000->ffffffff85987020

(XEN)  Start info:    ffffffff85988000->ffffffff859884b4

(XEN)  Page tables:   ffffffff85989000->ffffffff859ba000

(XEN)  Boot stack:    ffffffff859ba000->ffffffff859bb000

(XEN)  TOTAL:         ffffffff80000000->ffffffff85c00000

(XEN)  ENTRY ADDRESS: ffffffff81d251e0

(XEN) Dom0 has maximum 4 VCPUs

(XEN) elf_load_binary: phdr 0 at 0xffffffff81000000 -> 0xffffffff81bfb000

(XEN) elf_load_binary: phdr 1 at 0xffffffff81c00000 -> 0xffffffff81d0e0f0

(XEN) elf_load_binary: phdr 2 at 0xffffffff81d0f000 -> 0xffffffff81d242c0

(XEN) elf_load_binary: phdr 3 at 0xffffffff81d25000 -> 0xffffffff81e73000

(XEN) Scrubbing Free RAM on 1 nodes using 4 CPUs

(XEN) .........done.

(XEN) Initial low memory virq threshold set at 0x4000 pages.

(XEN) Std. Loglevel: All

(XEN) Guest Loglevel: All

(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch
input to Xen)

(XEN) Freed 292kB init memory.

mapping kernel into physical memory

about to get started...

[    0.000000] Kernel panic - not syncing: Failed to get contiguous
memory for DMA from Xen!

[    0.000000] You either: don't have the permissions, do not have
enough free memory under 4GB, or the hypervisor memory is too
fragmented! (rc:-12)

[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted
3.11.0-15-generic #25~precise1-Ubuntu

[    0.000000] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
VirtualBox 12/01/2006

[    0.000000]  ffffffff81b44658 ffffffff81c01dc8 ffffffff8173bc5e
0000000000000000

[    0.000000]  ffffffff81b7cffd ffffffff81c01e48 ffffffff8172e8d8
00000000fffffff4

[    0.000000]  0000000000000018 ffffffff81c01e58 ffffffff81c01df8
ffffffff81c01e48

[    0.000000] Call Trace:

[    0.000000]  [<ffffffff8173bc5e>] dump_stack+0x46/0x58

[    0.000000]  [<ffffffff8172e8d8>] panic+0xc1/0x1d7

[    0.000000]  [<ffffffff817275ec>] xen_swiotlb_init+0x33c/0x340

[    0.000000]  [<ffffffff81d3c413>] ? pci_swiotlb_detect_4gb+0x2c/0x2c

[    0.000000]  [<ffffffff81d2cb60>] pci_xen_swiotlb_init+0x1c/0x2e

[    0.000000]  [<ffffffff81d2ffa9>] pci_iommu_alloc+0x57/0x6e

[    0.000000]  [<ffffffff81d3f16a>] mem_init+0x11/0xa2

[    0.000000]  [<ffffffff81d25cfa>] start_kernel+0x1de/0x414

[    0.000000]  [<ffffffff81d259ae>] ? repair_env_string+0x5a/0x5a

[    0.000000]  [<ffffffff81d255e8>] x86_64_start_reservations+0x2a/0x2c

[    0.000000]  [<ffffffff81d29454>] xen_start_kernel+0x49a/0x49c

(XEN) Domain 0 crashed: rebooting machine in 5 seconds.


==== My implementation patch for partitioning the shared cache ====
This is a preliminary attempt to partition the shared cache among
guest domains by applying a page coloring mechanism.
---
 tools/libxc/include/xc_dom.h  |   10 ++
 tools/libxc/include/xenctrl.h |   16 +++
 tools/libxc/xc_dom_x86.c      |   46 +++++--
 tools/libxc/xc_domain.c       |   59 ++++++++
 tools/libxl/libxl_dom.c       |   15 ++-
 xen/common/memory.c           |  103 ++++++++++++++
 xen/common/page_alloc.c       |  299 ++++++++++++++++++++++++++++++++++++++++-
 xen/include/public/memory.h   |   19 ++-
 xen/include/xen/mm.h          |    2 +
 9 files changed, 552 insertions(+), 17 deletions(-)

diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
index 07d7224..14bd7be 100644
--- a/tools/libxc/include/xc_dom.h
+++ b/tools/libxc/include/xc_dom.h
@@ -171,6 +171,16 @@ struct xc_dom_image {
     struct xc_dom_arch *arch_hooks;
     /* allocate up to virt_alloc_end */
     int (*allocate) (struct xc_dom_image * dom, xen_vaddr_t up_to);
+
+    /* rtxen cache color */
+    /* TODO: walk around to include same cache config file */
+#define RTXEN_PAGE_BITS             12
+#define RTXEN_PAGE_SIZE             4096
+#define RTXEN_LLC_SLICE_SIZE        (2*1024*1024)
+#define RTXEN_LLC_ASSOC             (16)
+#define RTXEN_LLC_CC_MASK           (RTXEN_LLC_SLICE_SIZE / RTXEN_LLC_ASSOC / RTXEN_PAGE_SIZE - 1)
+#define RTXEN_LLC_CC_NUM            (RTXEN_LLC_CC_MASK + 1)
+    int32_t cache_colors[RTXEN_LLC_CC_NUM];
 };

 /* --- pluggable kernel loader ------------------------------------- */
diff --git a/tools/libxc/include/xenctrl.h b/tools/libxc/include/xenctrl.h
index 0ad8b8d..94785de 100644
--- a/tools/libxc/include/xenctrl.h
+++ b/tools/libxc/include/xenctrl.h
@@ -1363,6 +1363,22 @@ int xc_domain_populate_physmap_exact(xc_interface *xch,
                                      unsigned int mem_flags,
                                      xen_pfn_t *extent_start);

+int xc_domain_populate_physmap_ca(xc_interface *xch,
+                                  uint32_t domid,
+                                  unsigned long nr_extents,
+                                  unsigned int extent_order,
+                                  unsigned int mem_flags,
+                                  xen_pfn_t *extent_start,
+                                  int32_t *cache_colors);
+
+int xc_domain_populate_physmap_exact_ca(xc_interface *xch,
+                                        uint32_t domid,
+                                        unsigned long nr_extents,
+                                        unsigned int extent_order,
+                                        unsigned int mem_flags,
+                                        xen_pfn_t *extent_start,
+                                        int32_t *cache_colors);
+
 int xc_domain_claim_pages(xc_interface *xch,
                                uint32_t domid,
                                unsigned long nr_pages);
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index bf06fe4..219c6e8 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -760,6 +760,18 @@ int arch_setup_meminit(struct xc_dom_image *dom)
 {
     int rc;
     xen_pfn_t pfn, allocsz, i, j, mfn;
+    int32_t *cache_colors = dom->cache_colors; /* short reference */
+    int num_cache_colors = 0;
+    int k;
+
+    printf("%s: domid=%d cache colors:\n", __FUNCTION__, dom->guest_domid);
+    printf("color used\n");
+    for ( k = 0; k < RTXEN_LLC_CC_NUM; k++ )
+    {
+        printf("%d %d\n", k, cache_colors[k]);
+        if ( cache_colors[k] == 1 )
+            num_cache_colors++;
+    }

     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
@@ -813,16 +825,32 @@ int arch_setup_meminit(struct xc_dom_image *dom)
             dom->p2m_host[pfn] = pfn;

         /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
+        printf("%s: dom=%d num_cache_colors = %d\n", __FUNCTION__,
dom->guest_domid, num_cache_colors);
+        if ( num_cache_colors == 0 )
         {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
+            for ( i = rc = allocsz = 0;
+                  (i < dom->total_pages) && !rc;
+                  i += allocsz )
+            {
+                allocsz = dom->total_pages - i;
+                if ( allocsz > 1024*1024 )
+                    allocsz = 1024*1024;
+                rc = xc_domain_populate_physmap_exact(
+                    dom->xch, dom->guest_domid, allocsz,
+                    0, 0, &dom->p2m_host[i]);
+            }
+        }
+        else
+        {
+            for ( i = rc = allocsz = 0;
+                (i < dom->total_pages) && !rc;
+                i += allocsz )
+            {
+                allocsz = 1; /* TODO: allocate multiple pages at a time once the memory pool exists */
+                rc = xc_domain_populate_physmap_exact_ca(
+                    dom->xch, dom->guest_domid, allocsz,
+                    0, 0, &dom->p2m_host[i], cache_colors);
+            }
         }

         /* Ensure no unclaimed pages are left unused.
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index b864872..0d5a707 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -1008,6 +1008,65 @@ int xc_domain_populate_physmap_exact(xc_interface *xch,
     return err;
 }

+int xc_domain_populate_physmap_ca(xc_interface *xch,
+                                  uint32_t domid,
+                                  unsigned long nr_extents,
+                                  unsigned int extent_order,
+                                  unsigned int mem_flags,
+                                  xen_pfn_t *extent_start,
+                                  int32_t *cache_colors)
+{
+    int err, i;
+    DECLARE_HYPERCALL_BOUNCE(extent_start, nr_extents * sizeof(*extent_start), XC_HYPERCALL_BUFFER_BOUNCE_BOTH);
+    struct xen_memory_reservation reservation = {
+        .nr_extents   = nr_extents,
+        .extent_order = extent_order,
+        .mem_flags    = mem_flags,
+        .domid        = domid,
+    };
+
+    for ( i = 0; i < RTXEN_LLC_CC_NUM; i++ )
+        reservation.cache_colors[i] = cache_colors[i];
+
+    if ( xc_hypercall_bounce_pre(xch, extent_start) )
+    {
+        PERROR("Could not bounce memory for XENMEM_populate_physmap_ca hypercall");
+        return -1;
+    }
+    set_xen_guest_handle(reservation.extent_start, extent_start);
+
+    err = do_memory_op(xch, XENMEM_populate_physmap_ca, &reservation, sizeof(reservation));
+
+    xc_hypercall_bounce_post(xch, extent_start);
+    return err;
+}
+
+int xc_domain_populate_physmap_exact_ca(xc_interface *xch,
+                                        uint32_t domid,
+                                        unsigned long nr_extents,
+                                        unsigned int extent_order,
+                                        unsigned int mem_flags,
+                                        xen_pfn_t *extent_start,
+                                        int32_t *cache_colors)
+{
+    int err;
+
+    err = xc_domain_populate_physmap_ca(xch, domid, nr_extents,
+                                        extent_order, mem_flags, extent_start, cache_colors);
+    if ( err == nr_extents )
+        return 0;
+
+    if ( err >= 0 )
+    {
+        DPRINTF("Failed allocation for dom %d: %ld extents of order %d\n",
+                domid, nr_extents, extent_order);
+        errno = EBUSY;
+        err = -1;
+    }
+
+    return err;
+}
+
 int xc_domain_memory_exchange_pages(xc_interface *xch,
                                     int domid,
                                     unsigned long nr_in_extents,
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 74ea84b..973a05c 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -517,7 +517,7 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
 {
     libxl_ctx *ctx = libxl__gc_owner(gc);
     struct xc_dom_image *dom;
-    int ret;
+    int ret, i;
     int flags = 0;

     xc_dom_loginit(ctx->xch);
@@ -568,6 +568,19 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
     dom->xenstore_evtchn = state->store_port;
     dom->xenstore_domid = state->store_domid;
     dom->claim_enabled = libxl_defbool_val(info->claim_mode);
+    /* init dom's cache colors */
+    memset(dom->cache_colors, 0, sizeof(dom->cache_colors));
+    LOG(DEBUG, "%s: dom=%d num_cache_colors=%d\n", __FUNCTION__, domid, info->num_cache_colors);
+    if ( info->num_cache_colors > 0 ) {
+        for ( i = 0; i < info->num_cache_colors; i++ ) {
+            assert(info->cache_colors[i] < RTXEN_LLC_CC_NUM);
+            dom->cache_colors[info->cache_colors[i]] = 1;
+        }
+    } else {
+        for ( i = 0; i < RTXEN_LLC_CC_NUM; i++ ) {
+            dom->cache_colors[i] = -1;
+        }
+    }

     if ( (ret = xc_dom_boot_xen_init(dom, ctx->xch, domid)) != 0 ) {
         LOGE(ERROR, "xc_dom_boot_xen_init failed");
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 234dae6..a5292ab 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -37,6 +37,7 @@ struct memop_args {
     unsigned int nr_extents;   /* Number of extents to allocate or free. */
     unsigned int extent_order; /* Size of each extent. */
     unsigned int memflags;     /* Allocation flags. */
+    int32_t *cache_colors;      /* RTXEN: cache colors for a domain */

     /* INPUT/OUTPUT */
     unsigned int nr_done;    /* Number of extents processed so far. */
@@ -175,6 +176,94 @@ out:
     a->nr_done = i;
 }

+static void populate_physmap_ca(struct memop_args *a)
+{
+    struct page_info *page;
+    unsigned long i, j;
+    xen_pfn_t gpfn, mfn;
+    struct domain *d = a->domain;
+    int32_t *cache_colors = a->cache_colors;
+
+    printk("%s: called\n", __FUNCTION__);
+    if ( !guest_handle_subrange_okay(a->extent_list, a->nr_done,
+                                     a->nr_extents-1) )
+        return;
+
+    if ( a->memflags & MEMF_populate_on_demand ? a->extent_order > MAX_ORDER :
+         !multipage_allocation_permitted(current->domain, a->extent_order) )
+        return;
+
+    for ( i = a->nr_done; i < a->nr_extents; i++ )
+    {
+        if ( i != a->nr_done && hypercall_preempt_check() )
+        {
+            a->preempted = 1;
+            goto out;
+        }
+
+        if ( unlikely(__copy_from_guest_offset(&gpfn, a->extent_list, i, 1)) )
+            goto out;
+
+        if ( a->memflags & MEMF_populate_on_demand )
+        {
+            if ( guest_physmap_mark_populate_on_demand(d, gpfn,
+                                                       a->extent_order) < 0 )
+                goto out;
+        }
+        else
+        {
+            if ( is_domain_direct_mapped(d) ) /* RTXEN: should always be false */
+            {
+                mfn = gpfn;
+                if ( !mfn_valid(mfn) )
+                {
+                    gdprintk(XENLOG_INFO, "Invalid mfn %#"PRI_xen_pfn"\n",
+                             mfn);
+                    goto out;
+                }
+
+                page = mfn_to_page(mfn);
+                if ( !get_page(page, d) )
+                {
+                    gdprintk(XENLOG_INFO,
+                             "mfn %#"PRI_xen_pfn" doesn't belong to the"
+                             " domain\n", mfn);
+                    goto out;
+                }
+                put_page(page);
+            }
+            else
+                page = alloc_domheap_pages_ca(d, a->extent_order, a->memflags, cache_colors); /* RTXEN: alloc a page with one of the requested cache colors */
+
+            if ( unlikely(page == NULL) )
+            {
+                if ( !opt_tmem || (a->extent_order != 0) )
+                    gdprintk(XENLOG_INFO, "Could not allocate order=%d extent:"
+                             " id=%d memflags=%x (%ld of %d)\n",
+                             a->extent_order, d->domain_id, a->memflags,
+                             i, a->nr_extents);
+                goto out;
+            }
+
+            mfn = page_to_mfn(page);
+            guest_physmap_add_page(d, gpfn, mfn, a->extent_order);
+
+            if ( !paging_mode_translate(d) )
+            {
+                for ( j = 0; j < (1 << a->extent_order); j++ )
+                    set_gpfn_from_mfn(mfn + j, gpfn + j);
+
+                /* Inform the domain of the new page's machine address. */
+                if ( unlikely(__copy_to_guest_offset(a->extent_list, i, &mfn, 1)) )
+                    goto out;
+            }
+        }
+    }
+
+out:
+    a->nr_done = i;
+}
+
 int guest_remove_page(struct domain *d, unsigned long gmfn)
 {
     struct page_info *page;
@@ -702,11 +791,13 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     domid_t domid;
     unsigned long start_extent = cmd >> MEMOP_EXTENT_SHIFT;
     int op = cmd & MEMOP_CMD_MASK;
+    int i;

     switch ( op )
     {
     case XENMEM_increase_reservation:
     case XENMEM_decrease_reservation:
+    case XENMEM_populate_physmap_ca:
     case XENMEM_populate_physmap:
         if ( copy_from_guest(&reservation, arg, 1) )
             return start_extent;
@@ -742,6 +833,10 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
              && (reservation.mem_flags & XENMEMF_populate_on_demand) )
             args.memflags |= MEMF_populate_on_demand;

+        if ( op == XENMEM_populate_physmap_ca
+             && (reservation.mem_flags & XENMEMF_populate_on_demand) )
+            args.memflags |= MEMF_populate_on_demand;
+
         d = rcu_lock_domain_by_any_id(reservation.domid);
         if ( d == NULL )
             return start_extent;
@@ -762,6 +857,14 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         case XENMEM_decrease_reservation:
             decrease_reservation(&args);
             break;
+        case XENMEM_populate_physmap_ca: /* RTXen: cache-aware memory allocation */
+            args.cache_colors = reservation.cache_colors;
+            printk("%s, ===XENMEM_populate_physmap_ca called===\n",
__FUNCTION__);
+            printk("colorID use\n");
+            for ( i = 0; i < RTXEN_LLC_CC_NUM; i++ )
+                printk("%d %d\n", i, reservation.cache_colors[i]);
+            populate_physmap_ca(&args);
+            break;
         default: /* XENMEM_populate_physmap */
             populate_physmap(&args);
             break;
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 7b4092d..a20b230 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -287,6 +287,14 @@ static heap_by_zone_and_order_t *_heap[MAX_NUMNODES];
 static unsigned long *avail[MAX_NUMNODES];
 static long total_avail_pages;

+/* RTXEN: Cache-Aware Memory Pool */
+typedef struct page_list_head ca_heap_by_zone_and_color_t[NR_ZONES][RTXEN_LLC_CC_NUM+1];
+static ca_heap_by_zone_and_color_t *_ca_heap[MAX_NUMNODES];
+#define ca_heap(node, zone, color) ((*_ca_heap[node])[zone][color])
+
+static unsigned long *ca_avail[MAX_NUMNODES];
+static long ca_total_avail_pages;
+
 /* TMEM: Reserve a fraction of memory for mid-size (0<order<9) allocations.*/
 static long midsize_alloc_zone_pages;
 #define MIDSIZE_ALLOC_FRAC 128
@@ -410,7 +418,9 @@ static unsigned long init_node_heap(int node, unsigned long mfn,
 {
     /* First node to be discovered has its heap metadata statically alloced. */
     static heap_by_zone_and_order_t _heap_static;
+//    static ca_heap_by_zone_and_color_t _ca_heap_static;
     static unsigned long avail_static[NR_ZONES];
+//    static unsigned long ca_avail_static[NR_ZONES];
     static int first_node_initialised;
     unsigned long needed = (sizeof(**_heap) +
                             sizeof(**avail) * NR_ZONES +
@@ -420,10 +430,13 @@ static unsigned long init_node_heap(int node, unsigned long mfn,
 #endif
     int i, j;

+    printk("RTXEN:%s: called\n", __FUNCTION__);
     if ( !first_node_initialised )
     {
         _heap[node] = &_heap_static;
         avail[node] = avail_static;
+//        _ca_heap[node] = &_ca_heap_static;
+//        ca_avail[node] = ca_avail_static;
         first_node_initialised = 1;
         needed = 0;
     }
@@ -458,15 +471,21 @@ static unsigned long init_node_heap(int node, unsigned long mfn,
         _heap[node] = xmalloc(heap_by_zone_and_order_t);
         avail[node] = xmalloc_array(unsigned long, NR_ZONES);
         BUG_ON(!_heap[node] || !avail[node]);
+//        _ca_heap[node] = xmalloc(ca_heap_by_zone_and_color_t);
+//        ca_avail[node] = xmalloc_array(unsigned long, NR_ZONES);
+//        BUG_ON(!_ca_heap[node] || !ca_avail[node]);
         needed = 0;
     }

     memset(avail[node], 0, NR_ZONES * sizeof(long));
+//    memset(ca_avail[node], 0, NR_ZONES * sizeof(long));

     for ( i = 0; i < NR_ZONES; i++ )
         for ( j = 0; j <= MAX_ORDER; j++ )
+        {
             INIT_PAGE_LIST_HEAD(&(*_heap[node])[i][j]);
-
+//            INIT_PAGE_LIST_HEAD(&(*_ca_heap[node])[i][j]);
+        }
     return needed;
 }

@@ -750,6 +769,246 @@ static struct page_info *alloc_heap_pages(
     return pg;
 }

+/**
+ * RTXEN: reserve memory pages as a cache-aware memory pool
+ */
+static int rtxen_has_reserve_ca_mem_pool = 0;
+static int rtxen_reserve_ca_mem_pool ( int rtxen_ca_mem_pool_pg_num )
+{
+    int i;
+    unsigned int dma_zone, zone_hi;
+
+    printk("========RTXEN Reserve CA_MEM_POOL START=========\n");
+    zone_hi = NR_ZONES - 1;
+    dma_zone = bits_to_zone(dma_bitsize);
+    for ( i = 0; i < rtxen_ca_mem_pool_pg_num; i++ )
+    {
+        struct page_info *pg = alloc_heap_pages(dma_zone + 1, NR_ZONES - 1, 0, 0, NULL);
+        unsigned int node, zone, color;
+
+        /* TODO: speed up the allocation, not urgent since only exe once */
+        if ( pg == NULL )
+            return -ENOMEM;
+
+        node = phys_to_nid(page_to_maddr(pg));
+        zone = page_to_zone(pg);
+        color = page_to_mfn(pg) & RTXEN_LLC_CC_MASK;
+
+        ca_avail[node][zone] += 1;
+        ca_total_avail_pages += 1;
+        page_list_add_tail(pg, &ca_heap(node, zone, color));
+    }
+
+    printk("========RTXEN Reserve CA_MEM_POOL DONE=========\n");
+    return 0;
+}
+
+/**
+ * Allocate 2^@order contiguous pages.
+ * order should always be 0 for cache-aware memory allocation,
+ * i.e., always allocate one page
+ */
+static struct page_info *alloc_heap_pages_ca(
+    unsigned int zone_lo, unsigned int zone_hi,
+    unsigned int order, unsigned int memflags,
+    struct domain *d, int32_t *cache_colors)
+{
+    unsigned int first_node, i, j, zone = 0, nodemask_retry = 0;
+    unsigned int node = (uint8_t)((memflags >> _MEMF_node) - 1);
+    unsigned long request = 1UL << order;
+    struct page_info *pg;
+    nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
+    bool_t need_tlbflush = 0;
+    uint32_t tlbflush_timestamp = 0;
+
+    printk("%s called\n", __FUNCTION__);
+    if ( !rtxen_has_reserve_ca_mem_pool || ca_total_avail_pages < 1024 )
+    {
+        /* reserve 1GB memory */
+        int rtxen_ca_mem_pool_pg_num = 256 * 1024;
+        if ( rtxen_reserve_ca_mem_pool( rtxen_ca_mem_pool_pg_num ) < 0 )
+        {
+            printk("%s, failed to reserve ca_mem_pool\n", __FUNCTION__);
+            rtxen_has_reserve_ca_mem_pool = 0;
+        } else
+            rtxen_has_reserve_ca_mem_pool = 1;
+    }
+
+    if ( node == NUMA_NO_NODE )
+    {
+        memflags &= ~MEMF_exact_node;
+        if ( d != NULL )
+        {
+            node = next_node(d->last_alloc_node, nodemask);
+            if ( node >= MAX_NUMNODES )
+                node = first_node(nodemask);
+        }
+        if ( node >= MAX_NUMNODES )
+            node = cpu_to_node(smp_processor_id());
+    }
+    first_node = node;
+
+    ASSERT(node >= 0);
+    ASSERT(zone_lo <= zone_hi);
+    ASSERT(zone_hi < NR_ZONES);
+
+    if ( unlikely(order > MAX_ORDER) )
+        return NULL;
+
+    if ( unlikely(order > 0) )
+    {
+        printk("%s, order(%d) should always be 0\n", __FUNCTION__, order);
+        return NULL;
+    }
+
+    /* TODO: use ca_heap_lock */
+    spin_lock(&heap_lock);
+
+    /*
+     * Claimed memory is considered unavailable unless the request
+     * is made by a domain with sufficient unclaimed pages.
+     */
+    if ( (outstanding_claims + request >
+          total_avail_pages + tmem_freeable_pages()) &&
+          (d == NULL || d->outstanding_pages < request) )
+        goto not_found;
+
+    /*
+     * TMEM: When available memory is scarce due to tmem absorbing it, allow
+     * only mid-size allocations to avoid worst of fragmentation issues.
+     * Others try tmem pools then fail.  This is a workaround until all
+     * post-dom0-creation-multi-page allocations can be eliminated.
+     */
+    if ( opt_tmem && ((order == 0) || (order >= 9)) &&
+         (total_avail_pages <= midsize_alloc_zone_pages) &&
+         tmem_freeable_pages() )
+        goto try_tmem;
+
+    /*
+     * Start with requested node, but exhaust all node memory in requested
+     * zone before failing, only calc new node value if we fail to find memory
+     * in target node, this avoids needless computation on fast-path.
+     */
+    for ( ; ; )
+    {
+        zone = zone_hi;
+        do {
+            /* Check if target node can support the allocation. */
+            if ( !ca_avail[node] || (ca_avail[node][zone] < request) )
+                continue;
+
+            /* allocate memory pages with the specified colors */
+            for ( j = 0; j < RTXEN_LLC_CC_NUM; j++ )
+            {
+                if ( cache_colors[j] == 0 )
+                    continue;
+                if ( (pg = page_list_remove_head(&ca_heap(node, zone, j))) )
+                    goto found;
+            }
+        } while ( zone-- > zone_lo ); /* careful: unsigned zone may wrap */
+
+        if ( memflags & MEMF_exact_node )
+            goto not_found;
+
+        /* Pick next node. */
+        if ( !node_isset(node, nodemask) )
+        {
+            /* Very first node may be caller-specified and outside nodemask. */
+            ASSERT(!nodemask_retry);
+            first_node = node = first_node(nodemask);
+            if ( node < MAX_NUMNODES )
+                continue;
+        }
+        else if ( (node = next_node(node, nodemask)) >= MAX_NUMNODES )
+            node = first_node(nodemask);
+        if ( node == first_node )
+        {
+            /* When we have tried all in nodemask, we fall back to others. */
+            if ( nodemask_retry++ )
+                goto not_found;
+            nodes_andnot(nodemask, node_online_map, nodemask);
+            first_node = node = first_node(nodemask);
+            if ( node >= MAX_NUMNODES )
+                goto not_found;
+        }
+    }
+
+ try_tmem:
+    /* Try to free memory from tmem */
+    if ( (pg = tmem_relinquish_pages(order, memflags)) != NULL )
+    {
+        /* reassigning an already allocated anonymous heap page */
+        spin_unlock(&heap_lock);
+        return pg;
+    }
+
+ not_found:
+    /* No suitable memory blocks. Fail the request. */
+    spin_unlock(&heap_lock);
+    return NULL;
+
+ found:
+    /* RTXEN: always allocate one page, no need to chunk the allocated page */
+    j = 0; order = 0;
+    /* We may have to halve the chunk a number of times. */
+    while ( j != order )
+    {
+        PFN_ORDER(pg) = --j;
+        page_list_add_tail(pg, &heap(node, zone, j));
+        pg += 1 << j;
+    }
+
+    ASSERT(ca_avail[node][zone] >= request);
+    ca_avail[node][zone] -= request;
+    ca_total_avail_pages -= request;
+    ASSERT(ca_total_avail_pages >= 0);
+
+    check_low_mem_virq();
+
+    if ( d != NULL )
+        d->last_alloc_node = node;
+
+    for ( i = 0; i < (1 << order); i++ )
+    {
+        /* Reference count must continuously be zero for free pages. */
+        BUG_ON(pg[i].count_info != PGC_state_free);
+        pg[i].count_info = PGC_state_inuse;
+
+        if ( pg[i].u.free.need_tlbflush &&
+             (pg[i].tlbflush_timestamp <= tlbflush_current_time()) &&
+             (!need_tlbflush ||
+              (pg[i].tlbflush_timestamp > tlbflush_timestamp)) )
+        {
+            need_tlbflush = 1;
+            tlbflush_timestamp = pg[i].tlbflush_timestamp;
+        }
+
+        /* Initialise fields which have other uses for free pages. */
+        pg[i].u.inuse.type_info = 0;
+        page_set_owner(&pg[i], NULL);
+
+        /* Ensure cache and RAM are consistent for platforms where the
+         * guest can control its own visibility of/through the cache.
+         */
+        flush_page_to_ram(page_to_mfn(&pg[i]));
+    }
+
+    spin_unlock(&heap_lock);
+
+    if ( need_tlbflush )
+    {
+        cpumask_t mask = cpu_online_map;
+        tlbflush_filter(mask, tlbflush_timestamp);
+        if ( !cpumask_empty(&mask) )
+        {
+            perfc_incr(need_flush_tlb_flush);
+            flush_tlb_mask(&mask);
+        }
+    }
+
+    return pg;
+}
+
 /* Remove any offlined page in the buddy pointed to by head. */
 static int reserve_offlined_page(struct page_info *head)
 {
@@ -1703,6 +1962,44 @@ struct page_info *alloc_domheap_pages(
     return pg;
 }

+/**
+ * alloc one page whose size is 2^order * 4KB
+ * the page's color is in cache_colors
+ */
+struct page_info *alloc_domheap_pages_ca(
+    struct domain *d, unsigned int order, unsigned int memflags, int32_t *cache_colors)
+{
+    struct page_info *pg = NULL;
+    unsigned int bits = memflags >> _MEMF_bits, zone_hi = NR_ZONES - 1;
+    unsigned int dma_zone;
+
+    ASSERT(!in_irq());
+    ASSERT(order == 0); /* RTXEN: only alloc mem in one page granularity */
+
+    printk("%s called\n", __FUNCTION__);
+    bits = domain_clamp_alloc_bitsize(d, bits ? : (BITS_PER_LONG+PAGE_SHIFT));
+    if ( (zone_hi = min_t(unsigned int, bits_to_zone(bits), zone_hi)) == 0 )
+        return NULL;
+
+    /* RTXEN: give DMA contiguous memory */
+    if ( dma_bitsize && ((dma_zone = bits_to_zone(dma_bitsize)) < zone_hi) )
+        pg = alloc_heap_pages(dma_zone + 1, zone_hi, order, memflags, d);
+
+    if ( (pg == NULL) &&
+         ((memflags & MEMF_no_dma) ||
+          ((pg = alloc_heap_pages_ca(MEMZONE_XEN + 1, zone_hi, order,
+                                     memflags, d, cache_colors)) == NULL)) )
+         return NULL;
+
+    if ( (d != NULL) && assign_pages(d, pg, order, memflags) )
+    {
+        free_heap_pages(pg, order);
+        return NULL;
+    }
+
+    return pg;
+}
+
 void free_domheap_pages(struct page_info *pg, unsigned int order)
 {
     struct domain *d = page_get_owner(pg);
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index ffc2eef..ff0e040 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -29,6 +30,13 @@

 #include "xen.h"

+#define RTXEN_PAGE_BITS             12
+#define RTXEN_PAGE_SIZE             4096
+#define RTXEN_LLC_SLICE_SIZE        (2*1024*1024)
+#define RTXEN_LLC_ASSOC             (16)
+#define RTXEN_LLC_CC_MASK           (RTXEN_LLC_SLICE_SIZE / RTXEN_LLC_ASSOC / RTXEN_PAGE_SIZE - 1)
+#define RTXEN_LLC_CC_NUM            (RTXEN_LLC_CC_MASK + 1)
+
 /*
  * Increase or decrease the specified domain's memory reservation. Returns the
  * number of extents successfully allocated or freed.
@@ -89,6 +97,8 @@ struct xen_memory_reservation {
      * Unprivileged domains can specify only DOMID_SELF.
      */
     domid_t        domid;
+    /* RTXen: cache color */
+    int32_t cache_colors[RTXEN_LLC_CC_NUM];
 };
 typedef struct xen_memory_reservation xen_memory_reservation_t;
 DEFINE_XEN_GUEST_HANDLE(xen_memory_reservation_t);
@@ -573,12 +583,9 @@ typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
 DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);

 /* Next available subop number is 27 */
-#define RTXEN_PAGE_BITS             12
-#define RTXEN_PAGE_SIZE             4096
-#define RTXEN_LLC_SLICE_SIZE        (2*1024*1024)
-#define RTXEN_LLC_ASSOC             (16)
-#define RTXEN_LLC_CC_MASK           (RTXEN_LLC_SLICE_SIZE / RTXEN_LLC_ASSOC / RTXEN_PAGE_SIZE - 1)
-#define RTXEN_LLC_CC_NUM            (RTXEN_LLC_CC_MASK + 1)
+#define XENMEM_populate_physmap_ca     27
+
+/* Next available subop number is 28 */

 #endif /* __XEN_PUBLIC_MEMORY_H__ */

diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 74a65a6..4181001 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -65,6 +65,8 @@ void get_outstanding_claims(uint64_t *free_pages, uint64_t *outstanding_pages);
 void init_domheap_pages(paddr_t ps, paddr_t pe);
 struct page_info *alloc_domheap_pages(
     struct domain *d, unsigned int order, unsigned int memflags);
+struct page_info *alloc_domheap_pages_ca(
+    struct domain *d, unsigned int order, unsigned int memflags, int32_t *cache_colors);
 void free_domheap_pages(struct page_info *pg, unsigned int order);
 unsigned long avail_domheap_pages_region(
     unsigned int node, unsigned int min_width, unsigned int max_width);

-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel