Xen project Mailing List

Re: [Xen-devel] [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash

To: David Vrabel <david.vrabel@xxxxxxxxxx>

From: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>

Date: Mon, 11 Aug 2014 12:03:07 +0200

Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Andrew Jones <drjones@xxxxxxxxxx>

Delivery-date: Mon, 11 Aug 2014 10:03:30 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

David Vrabel <david.vrabel@xxxxxxxxxx> writes:

> On 08/08/14 16:17, Vitaly Kuznetsov wrote:
>> David Vrabel <david.vrabel@xxxxxxxxxx> writes:
>>
>>> On 08/08/14 15:22, Vitaly Kuznetsov wrote:
>>>> When EVTCHNOP_reset is being performed last_vcpu_id attribute is not being
>>>> cleaned by __evtchn_close(). In case last_vcpu_id != 0 for a particular
>>>> event channel and this event channel is going to be used for event delivery
>>>> (for another vcpu) before EVTCHNOP_init_control for vcpu == last_vcpu_id
>>>> was done the following crash is observed:
>>>>
>>>> ...
>>>> (XEN) Xen call trace:
>>>> (XEN) [<ffff82d080127785>] _spin_lock_irqsave+0x5/0x70
>>>> (XEN) [<ffff82d0801097db>] evtchn_fifo_set_pending+0xdb/0x370
>>>> (XEN) [<ffff82d080107146>] evtchn_send+0xd6/0x160
>>>> (XEN) [<ffff82d080107df9>] do_event_channel_op+0x6a9/0x16c0
>>>> (XEN) [<ffff82d0801ce800>] vmx_intr_assist+0x30/0x480
>>>> (XEN) [<ffff82d080219e99>] syscall_enter+0xa9/0xae
>>>>
>>>> This happens because lock_old_queue() does not check VCPU's control
>>>> block existence and after EVTCHNOP_reset they are all cleaned.
>>>>
>>>> I suggest we fix the issue twice: reset last_vcpu_id to 0 in
>>>> __evtchn_close() and add appropriate check to lock_old_queue() as lost
>>>> event is much better than hypervisor crash.
>>>>
>>>> Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
>>>> ---
>>>>  xen/common/event_channel.c | 3 +++
>>>>  xen/common/event_fifo.c    | 9 +++++++++
>>>>  2 files changed, 12 insertions(+)
>>>>
>>>> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
>>>> index a7becae..67b9d53 100644
>>>> --- a/xen/common/event_channel.c
>>>> +++ b/xen/common/event_channel.c
>>>> @@ -578,6 +578,9 @@ static long __evtchn_close(struct domain *d1, int port1)
>>>>      chn1->state = ECS_FREE;
>>>>      chn1->notify_vcpu_id = 0;
>>>>
>>>> +    /* Reset last_vcpu_id to vcpu0 as control block can be freed */
>>>> +    chn1->last_vcpu_id = 0;
>>>
>>> This is broken if the event channel is closed and rebound while the
>>> event is linked.
>>>
>>> You can only safely clear chn->last_vcpu_id during evtchn_fifo_destroy().
>>>
>>> You also need to clear last_priority.
>>>
>>
>> Thanks, alternatively I can do that in evtchn_reset() after
>> evtchn_fifo_destroy() as it is the only path leading to the issue. I
>> wanted to avoid that to exclude additional loop for all event channels.
>>
>>>> +
>>>>      xsm_evtchn_close_post(chn1);
>>>>
>>>>  out:
>>>> diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c
>>>> index 51b4ff6..e4bef80 100644
>>>> --- a/xen/common/event_fifo.c
>>>> +++ b/xen/common/event_fifo.c
>>>> @@ -61,6 +61,15 @@ static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
>>>>      for ( try = 0; try < 3; try++ )
>>>>      {
>>>>          v = d->vcpu[evtchn->last_vcpu_id];
>>>> +
>>>> +        if ( !v->evtchn_fifo )
>>>> +        {
>>>> +            gdprintk(XENLOG_ERR,
>>>> +                     "domain %d vcpu %d has no control block!\n",
>>>> +                     d->domain_id, v->vcpu_id);
>>>> +            return NULL;
>>>> +        }
>>>
>>> I think this check needs to be in evtchn_fifo_init() to prevent the
>>> event from being bound to VCPU that does not have a control block.
>>>
>>
>> I *think* it is not the issue here - the event is being bound to VCPU
>> with this block initialized. But last_vcpu_id for this particular event
>> channel points to some other VCPU which has not initialized its control
>> block yet (so d->vcpu[evtchn->last_vcpu_id]->evtchn_fifo is NULL). There
>> is no path to get in such situation (after we clear last_vcpu_id), I
>> just wanted to put reasonable message here in case something will change
>> in future.
>
> Then evtchn_fifo_init() needs to check both the new VCPU and
> last_vcpu_id have control blocks.
>
> I much prefer failing the bind up front than detecting the problem later.

I discovered an issue with this approach: xen_setup_timer() is called
earlier than EVTCHNOP_init_control for all secondary VCPUs in the current
Linux kernel (but after VCPU0 has switched the ABI to FIFO). This means
evtchn_fifo_init() is called for a VIRQ event channel whose
notify_vcpu_id points to a VCPU with an uninitialized control block. And
if we reset it to VCPU0 we'll break everything...

-- 
Vitaly

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
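
For readers following the thread, here is a minimal standalone sketch of the two ideas discussed above: refusing to touch the old queue when the VCPU named by last_vcpu_id has no control block, and clearing last_vcpu_id together with last_priority when the FIFO state is torn down. Everything in it is an assumption made for illustration: the model_* types, function names and array sizes are hypothetical stand-ins, not the real Xen structures or the patch that was eventually applied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes, chosen only for this model. */
#define MODEL_MAX_VCPUS 4
#define MODEL_MAX_PRIO  16

/* Hypothetical stand-ins for the hypervisor's per-VCPU FIFO state. */
struct model_queue { int dummy; };

struct model_control_block {
    struct model_queue queue[MODEL_MAX_PRIO];
};

struct model_vcpu {
    /* NULL until the guest has done the init_control step for this VCPU. */
    struct model_control_block *evtchn_fifo;
};

struct model_domain {
    struct model_vcpu vcpu[MODEL_MAX_VCPUS];
};

/* Per-channel record of where the event was last linked. */
struct model_evtchn {
    unsigned int last_vcpu_id;
    unsigned int last_priority;
};

/*
 * Return the queue the event was last linked on, or NULL when the VCPU
 * recorded in last_vcpu_id has no control block. Returning NULL models
 * "lose the event rather than dereference a missing control block".
 */
struct model_queue *model_old_queue(const struct model_domain *d,
                                    const struct model_evtchn *chn)
{
    assert(chn->last_vcpu_id < MODEL_MAX_VCPUS);
    assert(chn->last_priority < MODEL_MAX_PRIO);

    const struct model_vcpu *v = &d->vcpu[chn->last_vcpu_id];

    if (v->evtchn_fifo == NULL)
        return NULL;

    return &v->evtchn_fifo->queue[chn->last_priority];
}

/*
 * Clear the stale link state of every channel. In this model it is meant
 * to run once the FIFO state has been destroyed, so no event can still
 * be linked; both fields are cleared, as requested in the review.
 */
void model_clear_last_link(struct model_evtchn *chns, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        chns[i].last_vcpu_id = 0;
        chns[i].last_priority = 0;
    }
}
```

In this model, a channel whose last_vcpu_id names a VCPU without a control block yields NULL from model_old_queue() instead of a bad dereference, and after model_clear_last_link() the lookup falls back to VCPU0's queue 0, provided VCPU0 has a control block.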
