Re: [Xen-devel] Greater than 16 xvd devices for blkfront
On Tue, May 06, 2008 at 01:36:05PM -0400, Chris Lalancette wrote:
> All,
>      We've had a number of requests to increase the number of xvd devices
> that a PV guest can have.  Currently, if you try to connect > 16 disks,
> you get an error from xend.  The problem ends up being that both xend and
> blkfront assume that for dev_t, major/minor is 8 bits each, where in fact
> there are actually 10 bits for major and 22 bits for minor.
>      Therefore, it shouldn't really be a problem giving lots of disks to
> guests.  The problem is in backwards compatibility, and the details.  What
> I am initially proposing to do is to leave things where they are for
> /dev/xvd[a-p]; that is, still put the xenstore entries in the same place,
> and use 8 bits for the major and 8 bits for the minor.  For anything above
> that, we would end up putting the xenstore entry in a different place, and
> pushing the major into the top 10 bits (leaving the bottom 22 bits for the
> minor); that way old guests won't fire when the entry is added, and we
> will add code to newer guests' blkfront so that they will fire when they
> see that entry.  Does anyone see any problems with this setup, or have
> any ideas how to do it better?

Looking at the blkfront code, I think we can increase the minor numbers
available for xvdX devices without requiring changes to where stuff is
stored. The key is that in blkfront we can reliably detect the overflow
triggered by the 17th disk, because the next major number, 203, doesn't
clash with any of the other major numbers blkfront is looking for.

Consider the 17th disk, which has the name 'xvdq'; this gives a device
number in xenstore of 51968. Upon seeing this, current blkfront code will
use

    #define BLKIF_MAJOR(dev) ((dev)>>8)
    #define BLKIF_MINOR(dev) ((dev) & 0xff)

and so get back a major number of 203 and a minor number of 0. The
xlbd_get_major_info(int vdevice) function has a switch on major numbers,
and the xvdX case is handled as the default:

    major = BLKIF_MAJOR(vdevice);
    minor = BLKIF_MINOR(vdevice);

    switch (major) {
    case IDE0_MAJOR: index = 0; break;
    ....snipped...
    case IDE9_MAJOR: index = 9; break;
    case SCSI_DISK0_MAJOR: index = 10; break;
    case SCSI_DISK1_MAJOR ... SCSI_DISK7_MAJOR:
        index = 11 + major - SCSI_DISK1_MAJOR;
        break;
    case SCSI_CDROM_MAJOR: index = 18; break;
    default: index = 19; break;
    }

So the 17th disk in fact gets treated as the 1st disk: the frontend
assigns it the name 'xvda', and the kernel then promptly panics because
xvda already exists in sysfs:

    kobject_add failed for xvda with -EEXIST, don't try to register
    things with the same name in the same directory.

    Call Trace:
     [<ffffffff80336951>] kobject_add+0x16e/0x199
     [<ffffffff8025ce3c>] exact_lock+0x0/0x14
     [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
     [<ffffffff802f393e>] register_disk+0x43/0x198
     [<ffffffff8029b271>] keventd_create_kthread+0x0/0xc4
     [<ffffffff8032e453>] add_disk+0x34/0x3d
     [<ffffffff88074eb8>] :xenblk:backend_changed+0x110/0x193
     [<ffffffff803a4029>] xenbus_read_driver_state+0x26/0x3b

Now, this kernel panic isn't a huge problem (though blkfront ought to
handle the kobject_add failure gracefully), because we can never do
anything to make existing frontends deal with > 16 disks. If an admin
tries to add more than 16 disks to an existing guest they should already
expect doom.
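To make the clash concrete, here's a small Python sketch (mine, purely
illustrative; the helper names are not from xend or blkfront, and it
ignores the '/dev/' prefix and partition numbers):

    # Illustrative only: mimic xend's name -> device number math and
    # blkfront's current 8/8 bit decode, to show where the 17th disk
    # goes wrong.
    XVD_MAJOR = 202

    def xvd_name_to_devnum(name):
        # xend: 202 * 256 + 16 * disk_index
        return XVD_MAJOR * 256 + 16 * (ord(name[3]) - ord('a'))

    def blkfront_decode(dev):
        # blkfront: BLKIF_MAJOR(dev), BLKIF_MINOR(dev)
        return dev >> 8, dev & 0xff

    print(xvd_name_to_devnum('xvdq'))                   # 51968
    print(blkfront_decode(xvd_name_to_devnum('xvdq')))  # (203, 0)
    print(blkfront_decode(xvd_name_to_devnum('xvda')))  # (202, 0)

With minor 0 in both cases, blkfront derives the same disk name 'xvda'
for the 1st and the 17th disk, hence the sysfs collision above.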
For future frontends, though, it looks like we can adapt the
switch (major) in xlbd_get_major_info() so that it detects the overflow
of the minor numbers and re-adjusts the major/minor numbers to their
intended values, e.g. change

    case SCSI_CDROM_MAJOR: index = 18; break;
    default: index = 19; break;
    }

to

    case SCSI_CDROM_MAJOR: index = 18; break;
    default:
        index = 19;
        if (major > XLBD_MAJOR_VBD_START) {
            /* each overflowed major carries 256 minors
             * (16 disks x 16 partitions); fold them back
             * into xvd's base major */
            minor += 256 * (major - XLBD_MAJOR_VBD_START);
            major = XLBD_MAJOR_VBD_START;
        }
        break;
    }

Now, I've not actually tested this, and there are a few other places in
blkfront needing similar tweaks, but I don't see anything in the code
which fundamentally stops this overflow detection & fixup.

As far as the XenD backend is concerned, all we need to do is edit the
blkdev_name_to_number() function in tools/python/xen/util/blkif.py to
relax the regex to allow > xvdp, and adapt the math so it overflows onto
the major numbers following xvd's 202. E.g., changing

    if re.match( '/dev/xvd[a-p]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

to

    if re.match( '/dev/xvd[a-z]([1-9]|1[0-5])?', n):
        return 202 * 256 + 16 * (ord(n[8:9]) - ord('a')) + int(n[9:] or 0)

gets you to 26 disks. This is how I got the guest to boot and the
frontend to crash on the 17th disk, 'xvdq'. It is a little more complex
to cope with 2-letter drive names, but there's no show stopper there; a
rough sketch follows below.

So, unless I'm missing something obvious, we can keep compatibility with
existing guests for the first 16 disks and still (indirectly) make use of
the larger 10/22 dev_t split for the 17th+ disk, without needing to
change how or where stuff is stored in XenStore.
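For those 2-letter names, here's a rough, untested sketch (again mine,
not xend's code) of how the blkdev_name_to_number() math could be
extended, assuming 'xvdaa' simply continues the sequence as disk 26
after 'xvdz':

    import re

    # Rough sketch, untested: extend the xvd name -> device number
    # math to 2-letter disk names, so 'xvdaa' follows 'xvdz' as
    # disk 26.
    def xvd_name_to_number(n):
        m = re.match('/dev/xvd([a-z])([a-z])?([1-9]|1[0-5])?$', n)
        if not m:
            return None
        first, second, part = m.groups()
        if second is None:
            disk = ord(first) - ord('a')
        else:
            disk = (ord(first) - ord('a') + 1) * 26 + \
                   (ord(second) - ord('a'))
        return 202 * 256 + 16 * disk + int(part or 0)

That stays consistent with the frontend fixup: 'xvdaa' encodes to 52128,
which the 8/8 split decodes as major 203 / minor 160, and the fixup folds
back to minor 416, i.e. disk 26.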
Regards,
Daniel.
-- 
|: Red Hat, Engineering, Boston   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel