
Re: [Xen-devel] VMX status report. Xen:26323 & Dom0:3.7.1



On Jan 11, 2013, at 3:34 AM, Jan Beulich <jbeulich@xxxxxxxx> wrote:

>>>> On 10.01.13 at 18:10, Andres Lagar-Cavilla <andreslc@xxxxxxxxxxxxxx> wrote:
>> On Jan 10, 2013, at 3:57 AM, "Jan Beulich" <JBeulich@xxxxxxxx> wrote:
>> 
>>>>>> On 10.01.13 at 08:51, "Ren, Yongjie" <yongjie.ren@xxxxxxxxx> wrote:
>>>> New issue(1)
>>>> ==============
>>>> 1. sometimes live migration failed and reported call trace in dom0
>>>> http://bugzilla.xen.org/bugzilla/show_bug.cgi?id=1841 
>>> 
>>> For the failed allocation, the only obvious candidate appears to be
>>> 
>>>     err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
>>> 
>>> which quite obviously can be of (almost) arbitrary size because
>>> 
>>>     nr_pages = m.num;
>>>     if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>>>             return -EINVAL;
>>> 
>>> really only checks for completely insane values.
>>> 
>>> This got introduced by Andres' "xen/privcmd: add PRIVCMD_MMAPBATCH_V2
>>> ioctl" and is becoming worse with Mukesh's recent "xen: privcmd:
>>> support autotranslated physmap guests", which added another
>>> similar (twice as large) allocation in alloc_empty_pages().
>> 
>> Perhaps it's the err_array in this case, since alloc_empty_pages only 
>> happens for auto-translated dom0s.
>> 
>> Not familiar with whether libxl changes (or is even capable of changing) 
>> parameters of the migration code. But right now in libxc, mapping is done in 
>> MAX_BATCH_SIZE batches, which are of size 1024. So we are talking about 1024 
>> ints, which is *one* page.
>> 
>> So is the kernel really incapable of allocating one measly page?
>> 
>> This leads me to think that it might be gather_array, but that one would 
>> allocate a grand total of two pages.
> 
> The log indicated a failed order-4 allocation though. So maybe not
> that allocation after all (but the basically unbounded size here is a
> problem in any case).

In my mind this casts serious doubt on err_array being the culprit behind this 
test failure. I would look into other candidates.
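
For the record, the size math (a back-of-the-envelope sketch; 4 KiB pages and 
4-byte ints are my assumptions, the figures are not from the log):

    /* Sketch: the m.num at which kcalloc(m.num, sizeof(int)) becomes
     * an order-4 request, i.e. 1 << 4 = 16 contiguous pages. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long page_size   = 4096;            /* assumed */
        unsigned long order4_size = 16 * page_size;  /* 64 KiB */

        /* Prints 16384: sixteen of libxc's 1024-entry batches. */
        printf("%lu\n", order4_size / sizeof(int));
        return 0;
    }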

Having said that, a kcalloc of a potentially multi-page-sized array is a 
problem.
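
As an aside, a generic way to defuse that kind of allocation would be a 
vmalloc fallback. A minimal sketch, assuming a 3.7-era kernel; this is not 
what the patch below does, since the patch drops the array altogether:

    /* Hypothetical fallback, not part of the RFC: try the physically
     * contiguous allocation first, then fall back to vzalloc() so a
     * large m.num never requires high-order pages. */
    int *err_array = kcalloc(m.num, sizeof(int),
                             GFP_KERNEL | __GFP_NOWARN);
    if (err_array == NULL)
        err_array = vzalloc(m.num * sizeof(int));
    if (err_array == NULL)
        return -ENOMEM;

    /* ... use err_array as before ... */

    /* Free with the primitive matching the allocator that succeeded. */
    if (is_vmalloc_addr(err_array))
        vfree(err_array);
    else
        kfree(err_array);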

Pasted below is an RFC patch to fix this. I've expanded the cc line to add Mats 
Peterson, who is also looking into some improvements to privcmd, and IanC for 
general feedback.

The RFC patch reduces the overall amount of code and cleans up the logic. I did 
change the behavior w.r.t. the classic implementations when it comes to handling 
errors and EFAULT: instead of doing all the mapping work and then copying 
results back to user space, I copy back each individual mapping error as soon 
as it arises, and I short-circuit and quit the whole operation as soon as the 
first EFAULT arises.

Short-circuiting on the first EFAULT is the right thing to do IMHO. There is no 
point in continuing to work if we can't tell the caller about the result. 
Further, libxc will just munmap the vma and undo all the work upon receiving 
EFAULT, so any further work is wasted.
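
For context, the V2 caller side is roughly the following (field names as I 
read them in privcmd.h; illustrative only, not verbatim libxc code):

    /* Illustrative V2 invocation from user space. */
    struct privcmd_mmapbatch_v2 m;
    int err[1024];                  /* one result code per page */

    m.num  = nr_pages;              /* pages in this batch */
    m.dom  = domid;                 /* foreign domain to map from */
    m.addr = (unsigned long)vaddr;  /* where the pages get mapped */
    m.arr  = gmfn_array;            /* guest frames to map */
    m.err  = err;                   /* kernel writes per-page errors */

    rc = ioctl(privcmd_fd, IOCTL_PRIVCMD_MMAPBATCH_V2, &m);

    /* If the kernel hits EFAULT writing err[], the caller cannot see
     * the per-page outcome anyway, so aborting the batch early does
     * not lose any useful work. */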

Please advise, thanks
Andres

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 3421f0d..9433396 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c 
@@ -254,18 +254,16 @@ struct mmap_batch_state {
    unsigned long va;
    struct vm_area_struct *vma;
    int index;
-   /* A tristate: 
-    *      0 for no errors                 
-    *      1 if at least one error has happened (and no
-    *          -ENOENT errors have happened)
-    *      -ENOENT if at least 1 -ENOENT has happened.
-    */             
-   int global_error;   
+   /* Whether at least one enoent has happened. */
+   int enoent;     
    /* An array for individual errors */
    int *err;           
                    
-   /* User-space mfn array to store errors in the second pass for V1. */
+   int version;
+   /* User-space mfn array to store errors for V1. */
    xen_pfn_t __user *user_mfn;
+   /* User space int array to store errors for V2. */
+   int __user *user_err;
 };
 
 /* auto translated dom0 note: if domU being created is PV, then mfn is
@@ -287,40 +285,35 @@ static int mmap_batch_fn(void *data, void *state)
                     st->vma->vm_page_prot, st->domain,
                     &cur_page);
 
-   /* Store error code for second pass. */
-   *(st->err++) = ret;
-
-   /* And see if it affects the global_error. */
-   if (ret < 0) {
-       if (ret == -ENOENT)
-           st->global_error = -ENOENT;
-       else {
-           /* Record that at least one error has happened. */
-           if (st->global_error == 0)
-               st->global_error = 1;
+   /* See if error affects the global enoent flag. */
+   if (ret == -ENOENT) {
+       st->enoent = 1;
+   }
+
+   /* And, if error, store in user space (version dependent). */
+   if (ret) {
+       int efault = 0;
+       if (st->version == 1) {
+           xen_pfn_t mfn_err = *mfnp;
+           mfn_err |= (ret == -ENOENT) ?
+                       PRIVCMD_MMAPBATCH_PAGED_ERROR :
+                       PRIVCMD_MMAPBATCH_MFN_ERROR;
+           efault = __put_user(mfn_err, st->user_mfn++);
+       } else { /* st->version == 2 */
+           efault = __put_user(ret, st->user_err++);
        }
+       /* If copy to user failed, short circuit now. */
+       if (efault)
+           return efault;
+   } else {
+       st->user_mfn++;
+       st->user_err++;
    }
-   st->va += PAGE_SIZE;

+   st->va += PAGE_SIZE;
    return 0;
 }

-static int mmap_return_errors_v1(void *data, void *state)
-{
-   xen_pfn_t *mfnp = data;
-   struct mmap_batch_state *st = state;
-   int err = *(st->err++);
-
-   /*
-    * V1 encodes the error codes in the 32bit top nibble of the
-    * mfn (with its known limitations vis-a-vis 64 bit callers).
-    */
-   *mfnp |= (err == -ENOENT) ?
-               PRIVCMD_MMAPBATCH_PAGED_ERROR :
-               PRIVCMD_MMAPBATCH_MFN_ERROR;
-   return __put_user(*mfnp, st->user_mfn++);
-}
-
 /* Allocate pfns that are then mapped with gmfns from foreign domid. Update
  * the vma with the page info to use later.
  * Returns: 0 if success, otherwise -errno
@@ -357,7 +350,6 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
    struct vm_area_struct *vma;
    unsigned long nr_pages;
    LIST_HEAD(pagelist);
-   int *err_array = NULL;
    struct mmap_batch_state state;

    if (!xen_initial_domain())
@@ -396,10 +388,12 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
        goto out;
    }

-   err_array = kcalloc(m.num, sizeof(int), GFP_KERNEL);
-   if (err_array == NULL) {
-       ret = -ENOMEM;
-       goto out;
+   /* Zero the V2 array of errors, so we only write return codes on error. */
+   if (version == 2) {
+       if (clear_user(m.err, sizeof(int) * m.num)) {
+           ret = -EFAULT;
+           goto out;
+       }
    }

    down_write(&mm->mmap_sem);
@@ -426,38 +420,22 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
    state.vma           = vma;
    state.va            = m.addr;
    state.index         = 0;
-   state.global_error  = 0;
-   state.err           = err_array;
+   state.enoent        = 0;
+   state.user_mfn      = (xen_pfn_t *)m.arr;
+   state.user_err      = m.err;
+   state.version       = version;

-   /* mmap_batch_fn guarantees ret == 0 */
-   BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
-                &pagelist, mmap_batch_fn, &state));
+   ret = traverse_pages(m.num, sizeof(xen_pfn_t),
+                        &pagelist, mmap_batch_fn, &state);

    up_write(&mm->mmap_sem);

-   if (version == 1) {
-       if (state.global_error) {
-           /* Write back errors in second pass. */
-           state.user_mfn = (xen_pfn_t *)m.arr;
-           state.err      = err_array;
-           ret = traverse_pages(m.num, sizeof(xen_pfn_t),
-                        &pagelist, mmap_return_errors_v1, &state);
-       } else
-           ret = 0;
-
-   } else if (version == 2) {
-       ret = __copy_to_user(m.err, err_array, m.num * sizeof(int));
-       if (ret)
-           ret = -EFAULT;
-   }
-
    /* If we have not had any EFAULT-like global errors then set the global
     * error to -ENOENT if necessary. */
-   if ((ret == 0) && (state.global_error == -ENOENT))
+   if ((ret == 0) && (state.enoent))
        ret = -ENOENT;

 out:
-   kfree(err_array);
    free_page_list(&pagelist);

    return ret;


> 
>> In any case, both functions allocate an arbitrary number of pages, and that 
>> is the fundamental problem.
>> 
>> What is the approach in the forward-ported kernel wrt gather_array?
> 
> There's no gather_array there. It simply sets up things one page
> at a time. (And you can always go and check linux-2.6.18-xen.hg).
> 
> Jan
> 




 

