
RE: [Xen-devel] segfault in VM



That's comforting. I was starting to think of looking for gcc bugs and the like.
 
Even so, it might be useful to collect the gcc versions of anyone who has either seen the bug or tried to reproduce it and couldn't. Running "gcc --version" here reports "gcc (GCC) 3.3.4 (Debian 1:3.3.4-2)".
 
James


From: Keir Fraser
Sent: Fri 23/07/2004 11:11 AM
To: James Harper
Cc: Keir Fraser; Derek Glidden; xen-devel@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] segfault in VM

Yeah, it turns out I can reproduce this bug trivially by md5summing a
file just slightly bigger than dom0's memory allocation, while
floodpinging dom1.
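
For anyone wanting to script the checking half of this, here is a minimal sketch, not the procedure actually used above. It hashes the same file repeatedly and counts digests that differ from the first pass. The function names and the small demo file are assumptions made for illustration. To exercise the bug as described, the file would need to be slightly larger than dom0's memory allocation, and the flood ping (typically "ping -f", run as root) would be started separately against dom1.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_mismatches(path, rounds):
    """Hash `path` `rounds` times; count digests that differ from the first."""
    reference = md5_of_file(path)
    mismatches = 0
    for _ in range(rounds - 1):
        if md5_of_file(path) != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Demo only: a 1 MiB scratch file stands in for the large test file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(1 << 20))
    try:
        print("mismatches:", count_mismatches(tmp.name, rounds=5))
    finally:
        os.unlink(tmp.name)
```

On a healthy system the count stays at zero. A non-zero count under network load would match the corrupted md5sums reported elsewhere in this thread, though it says nothing about where the corruption happens.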

I'm trying out a few things right now, so hopefully I'll be able to
report progress on this evil bug r.s.n. :-)

 -- Keir

> I just made a change so that the skbuf is always copied in netif_be_start_xmit but it still crashes, which means most likely that bit is fine or at least isn't the only code containing bugs.
> 
> As another test I also moved the 'goto done;' to after the 'if ( skb_shared(skb) || skb_cloned(skb) || ...' block (so the receive is still blocked, just later) and there were no crashes, so I'm comfortable that we've exhausted netif_be_start_xmit as a source of bugs.
> 
> So I guess that leaves net_rx_action. I'm unsure about one thing though: how and where do the pages that get passed from dom0 to domU get recycled back to dom0? Is it possible that domU could still write to a page that dom0 thought it was free to use for something else? If so, where would that be?
> 
> Keir: have you been able to reproduce these errors at all?
> 
> James
> 
> 
> 
> 
> From: Keir Fraser
> Sent: Fri 23/07/2004 3:48 AM
> To: Derek Glidden
> Cc: xen-devel@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] segfault in VM
> 
> 
> It's useful to have the extra data points -- it adds to our confidence
> that it's the network driver that is somehow at fault here.
> 
> Quite how to proceed in narrowing down the problem is
> unclear. One approach is to perturb the backend driver's data path
> (e.g., always copying packets into a known-safe page-sized buffer, as
> a check that our current copy-avoidance checks are not at fault; and
> replacing the current high-performance but convoluted code for
> batching hypercalls with something slower but easier to grok). The
> latter is useful because if the bug goes away then we have a smaller
> chunk of code to look at; if the bug remains then we end up with a
> less complex data path that is easier to instrument and bughunt.
> 
> If anyone is interested in pursuing this bug independently, the
> functions most under suspicion are netif_be_start_xmit and
> net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c.
> These two form the data path for packets getting sent to guest OSes.
> 
>  -- Keir
> 
> 
> > 
> > On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
> > >
> > > Anyway - currently sounds like the bug resides in the most complex
> > > half of the most complex driver. Who'd've thought it? ;-)
> > 
> > At this point this data is surely redundant but...
> > 
> > When I went to sleep last night I let my box run dom0 and four VMs 
> > doing md5sum checks on a couple of large files, hammering the heck out 
> > of the block i/o drivers and CPU but with all the ifaces/vifs on the 
> > machine down.  When I woke up, all compares had been correct for the 
> > six hours or so it ran.  I re-upped the ifaces and started to ping dom0 
> > and the VMs and within a minute of the pings starting dom0 started to 
> > report incorrect md5sums.
> > 
> > -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> > "We all enter this world in the    | Support Electronic Freedom
> > same way: naked; screaming; soaked |        http://www.eff.org/
> > in blood. But if you live your     |  http://www.anti-dmca.org/
> > life right, that kind of thing     |---------------------------
> > doesn't have to stop there." -- Dana Gould
> > 
> > 
> > 
> > -------------------------------------------------------
> > This SF.Net email is sponsored by BEA Weblogic Workshop
> > FREE Java Enterprise J2EE developer tools!
> > Get your free copy of BEA WebLogic Workshop 8.1 today.
> > http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@xxxxxxxxxxxxxxxxxxxxx
> > https://lists.sourceforge.net/lists/listinfo/xen-devel
> 


 

