Xen project Mailing List

Fwd: [Xen-devel] RFC v1: Xen block protocol overhaul - problem statement (with pictures!)

To: Mirage List <cl-mirage@xxxxxxxxxxxxxxx>

From: Anil Madhavapeddy <anil@xxxxxxxxxx>

Date: Wed, 23 Jan 2013 14:32:49 +0000

List-id: MirageOS development <cl-mirage.lists.cam.ac.uk>

An interesting overhaul of the block protocol in Xen, which will also affect Mirage positively. -anil Begin forwarded message: > From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> > Subject: [Xen-devel] RFC v1: Xen block protocol overhaul - problem statement > (with pictures!) > Date: 18 December 2012 14:31:09 GMT > To: xen-devel@xxxxxxxxxxxxxxxxxxx, martin.petersen@xxxxxxxxxx, > felipe.franciosi@xxxxxxxxxx, matthew@xxxxxx, axboe@xxxxxxxxx > > Hey, > > I am including some folks that are not always on Xen-devel to see if they have > some extra ideas or can correct my misunderstandings. > > This is very much RFC - so there is bound to be some bugs. > The original is here > https://docs.google.com/document/d/1Vh5T8Z3Tx3sUEhVB0DnNDKBNiqB_ZA8Z5YVqAsCIjuI/edit > in case one wishes to modify/provide comment on that one. > > > There are outstanding issues we have now with the block protocol: > Note: I am assuming 64-bit guest/host - as the sizeâs of the structures > change on 32-bit. I am also attaching the header for the blkif ring > as of today. > > A) Segment size is limited to 11 pages. It means we can at most squeeze > in 44kB per request. The ring can hold 32 (next power of two below 36) > requests, meaning we can do 1.4M of outstanding requests. > > B). Producer and consumer index is on the same cache line. In present hardware > that means the reader and writer will compete for the same cacheline causing a > ping-pong between sockets. > > C). The requests and responses are on the same ring. This again causes the > ping-pong between sockets as the ownership of the cache line will shift > between > sockets. > > D). Cache alignment. Currently the protocol is 16-bit aligned. This is awkward > as the request and responses sometimes fit within a cacheline or sometimes > straddle them. > > E). Interrupt mitigation. We are currently doing a kick whenever we are > done âprocessingâ the ring. There are better ways to do this - and > we could use existing network interrupt mitigation techniques to make the > code poll when there is a lot of data. > > F). Latency. The processing of the request limits us to only do 44kB - which > means > that a 1MB chunk of data - which on contemporary devices would only use I/O > request > - would be split up in multiple ârequestsâ inadvertently delaying the > processing of > said block. > > G) Future extensions. DIF/DIX for integrity. There might > be other in the future and it would be good to leave space for extra > flags TBD. > > H). Separate the response and request rings. The current > implementation has one thread for one block ring. There is no reason why > there could not be two threads - one for responses and one for requests - > and especially if they are scheduled on different CPUs. Furthermore this > could also be split in multi-queues - so two queues (response and request) > on each vCPU. > > I). We waste a lot of space on the ring - as we use the > ring for both requests and responses. The response structure needs to > occupy the same amount of space as the request structure (112 bytes). If > the request structure is expanded to be able to fit more segments (say > the âstruct blkif_sring_entry is expanded to ~1500 bytes) that still > requires us to have a matching size response structure. We do not need > to use that much space for one response. Having a separate response ring > would simplify the structures. > > J). 32 bit vs 64 bit. Right now the size > of the request structure is 112 bytes under 64-bit guest and 102 bytes > under 32-bit guest. It is confusing and furthermore requires the host > to do extra accounting and processing. > > The crude drawing displays memory that the ring occupies in offset of > 64 bytes (cache line). Of course future CPUs could have different cache > lines (say 32 bytes?)- which would skew this drawing. A 32-bit ring is > a bit different as the âstruct blkif_sring_entryâ is of 102 bytes. > > > The A) has two solutions to this (look at > http://comments.gmane.org/gmane.comp.emulators.xen.devel/140406 for > details). One proposed by Justin from Spectralogic has to negotiate > the segment size. This means that the âstruct blkif_sring_entryâ > is now a variable size. It can expand from 112 bytes (cover 11 pages of > data - 44kB) to 1580 bytes (256 pages of data - so 1MB). It is a simple > extension by just making the array in the request expand from 11 to a > variable size negotiated. > > > The math is as follow. > > > struct blkif_request_segment { > uint32 grant; // 4 bytes uint8_t > first_sect, last_sect;// 1, 1 = 6 bytes > } > (6 bytes for each segment) - the above structure is in an array of size > 11 in the request. The âstruct blkif_sring_entryâ is 112 bytes. The > change is to expand the array - in this example we would tack on 245 extra > âstruct blkif_request_segmentâ - 245*6 + 112 = 1582. If we were to > use 36 requests (so 1582*36 + 64) we would use 57016 bytes (14 pages). > > > The other solution (from Intel - Ronghui) was to create one extra > ring that only has the âstruct blkif_request_segmentâ in them. The > âstruct blkif_requestâ would be changed to have an index in said > âsegment ringâ. There is only one segment ring. This means that the > size of the initial ring is still the same. The requests would point > to the segment and enumerate out how many of the indexes it wants to > use. The limit is of course the size of the segment. If one assumes a > one-page segment this means we can in one request cover ~4MB. The math > is as follow: > > > First request uses the half of the segment ring - so index 0 up > to 341 (out of 682). Each entry in the segment ring is a âstruct > blkif_request_segmentâ so it occupies 6 bytes. The other requests on > the ring (so there are 35 left) can use either the remaining 341 indexes > of the sgement ring or use the old style request. The old style request > can address use up to 44kB. For example: > > > sring[0]->[uses 0->341 indexes in the segment ring] = 342*4096 = 1400832 > sring[1]->[use sthe old style request] = 11*4096 = 45056 > sring[2]->[uses 342->682 indexes in the segment ring] = 1392640 > sring[3..32] -> [uses the old style request] = 29*4096*11 = 1306624 > > > Total: 4145152 bytes Naturally this could be extended to have a bigger > segment ring to cover more. > > > > > > The problem with this extension is that we use 6 bytes and end up > straddling a cache line. Using 8 bytes will fix the cache straddling. This > would mean fitting only 512 segments per page. > > > There is yet another mechanism that could be employed - and it borrows > from VirtIO protocol. And that is the âindirect descriptorsâ. This > very similar to what Intel suggests, but with a twist. > > > We could provide a new BLKIF_OP (say BLKIF_OP_INDIRECT), and the âstruct > blkif_sringâ (each entry can be up to 112 bytes if needed - so the > old style request would fit). It would look like: > > > /* so 64 bytes under 64-bit. If necessary, the array (seg) can be > expanded to fit 11 segments as the old style request did */ struct > blkif_request_indirect { > uint8_t op; /* BLKIF_OP_* (usually READ or WRITE */ > // 1 blkif_vdev_t handle; /* only for read/write requests */ > // 2 > #ifdef CONFIG_X86_64 > uint32_t _pad1; /* offsetof(blkif_request,u.rw.id) > == 8 */ // 2 > #endif > uint64_t id; /* private guest value, echoed in resp */ > grant_ref_t gref; /* reference to indirect buffer frame if > used*/ > struct blkif_request_segment_aligned seg[4]; // each is 8 bytes > } __attribute__((__packed__)); > > > struct blkif_request { > uint8_t operation; /* BLKIF_OP_??? */ > union { > struct blkif_request_rw rw; > struct blkif_request_indirect > indirect; â other .. > } u; > } __attribute__((__packed__)); > > > > > The âoperationâ would be BLKIF_OP_INDIRECT. The read/write/discard, > etc operation would now be in indirect.op. The indirect.gref points to > a page that is filled with: > > > struct blkif_request_indirect_entry { > blkif_sector_t sector_number; > struct blkif_request_segment seg; > } __attribute__((__packed__)); > //16 bytes, so we can fit in a page 256 of these structures. > > > This means that with the existing 36 slots in the ring (single page) > we can cover: 32 slots * each blkif_request_indirect covers: 256 * 4096 > ~= 32M. If we donât want to use indirect descriptor we can still use > up to 4 pages of the request (as it has enough space to contain four > segments and the structure will still be cache-aligned). > > > > > B). Both the producer (req_* and rsp_*) values are in the same > cache-line. This means that we end up with the same cacheline being > modified by two different guests. Depending on the architecture and > placement of the guest this could be bad - as each logical CPU would > try to write and read from the same cache-line. A mechanism where > the req_* and rsp_ values are separated and on a different cache line > could be used. The value of the correct cache-line and alignment could > be negotiated via XenBus - in case future technologies start using 128 > bytes for cache or such. Or the the producer and consumer indexes are in > separate rings. Meaning we have a ârequest ringâ and a âresponse > ringâ - and only the âreq_prodâ, âreq_eventâ are modified in > the ârequest ringâ. The opposite (resp_*) are only modified in the > âresponse ringâ. > > > C). Similar to B) problem but with a bigger payload. Each > âblkif_sring_entryâ occupies 112 bytes which does not lend itself > to a nice cache line size. If the indirect descriptors are to be used > for everything we could âslim-downâ the blkif_request/response to > be up-to 64 bytes. This means modifying BLKIF_MAX_SEGMENTS_PER_REQUEST > to 5 as that would slim the largest of the structures to 64-bytes. > Naturally this means negotiating a new size of the structure via XenBus. > > > D). The first picture shows the problem. We now aligning everything > on the wrong cachelines. Worst in â of the cases we straddle > three cache-lines. We could negotiate a proper alignment for each > request/response structure. > > > E). The network stack has showed that going in a polling mode does improve > performance. The current mechanism of kicking the guest and or block > backend is not always clear. [TODO: Konrad to explain it in details] > > > F). The current block protocol for big I/Os that the backend devices can > handle ends up doing extra work by splitting the I/O in smaller chunks, > and then reassembling them. With the solutions outlined in A) this can > be fixed. This is easily seen with 1MB I/Os. Since each request can > only handle 44kB that means we have to split a 1MB I/O in 24 requests > (23 * 4096 * 11 = 1081344). Then the backend ends up sending them in > sector-sizes- which with contemporary devices (such as SSD) ends up with > more processing. The SSDs are comfortable handling 128kB or higher I/Os > in one go. > > > G). DIF/DIX. This a protocol to carry extra âchecksumâ information > for each I/O. The I/O can be a sector size, page-size or an I/O size > (most popular are 1MB). The DIF/DIX needs 8 bytes of information for > each I/O. It would be worth considering putting/reserving that amount of > space in each request/response. Also putting in extra flags for future > extensions would be worth it - however the author is not aware of any > right now. > > > H). Separate response/request. Potentially even multi-queue per-VCPU > queues. As v2.6.37 demonstrated, the idea of WRITE_BARRIER was > flawed. There is no similar concept in the storage world were the > operating system can put a food down and say: âeverything before this > has to be on the disk.â There are ligther versions of this - called > âFUAâ and âFLUSHâ. Depending on the internal implementation > of the storage they are either ignored or do the right thing. The > filesystems determine the viability of these flags and change writing > tactics depending on this. From a protocol level, this means that the > WRITE/READ/SYNC requests can be intermixed - the storage by itself > determines the order of the operation. The filesystem is the one that > determines whether the WRITE should be with a FLUSH to preserve some form > of atomicity. This means we do not have to preserve an order of operations > - so we can have multiple queues for request and responses. This has > show in the network world to improve performance considerably. > > > I). Wastage of response/request on the same ring. Currently each response > MUST occupy the same amount of space that the request occupies - as the > ring can have both responses and requests. Separating the request and > response ring would remove the wastage. > > > J). 32-bit vs 64-bit (or 102 bytes vs 112 bytes). The size of the ring > entries is different if the guest is in 32-bit or 64-bit mode. Cleaning > this up to be the same size would save considerable accounting that the > host has to do (extra memcpy for each response/request).

Attachment: blkif.h
Description: Text document

> _______________________________________________ > Xen-devel mailing list > Xen-devel@xxxxxxxxxxxxx > http://lists.xen.org/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.