
Fwd: [Xen-devel] RFC v1: Xen block protocol overhaul - problem statement (with pictures!)



An interesting overhaul of the block protocol in Xen, which will also
affect Mirage positively.

-anil

Begin forwarded message:

> From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
> Subject: [Xen-devel] RFC v1: Xen block protocol overhaul - problem statement 
> (with pictures!)
> Date: 18 December 2012 14:31:09 GMT
> To: xen-devel@xxxxxxxxxxxxxxxxxxx, martin.petersen@xxxxxxxxxx, 
> felipe.franciosi@xxxxxxxxxx, matthew@xxxxxx, axboe@xxxxxxxxx
> 
> Hey,
> 
> I am including some folks that are not always on Xen-devel to see if they have
> some extra ideas or can correct my misunderstandings.
> 
> This is very much RFC - so there are bound to be some bugs.
> The original is here
> https://docs.google.com/document/d/1Vh5T8Z3Tx3sUEhVB0DnNDKBNiqB_ZA8Z5YVqAsCIjuI/edit
> in case one wishes to modify/provide comment on that one.
> 
> 
> There are outstanding issues we have now with the block protocol:
> Note: I am assuming a 64-bit guest/host - as the sizes of the structures
> change on 32-bit. I am also attaching the header for the blkif ring
> as of today.
> 
> A) A request is limited to 11 segments (pages), meaning we can squeeze
> at most 44kB into one request. The ring can hold 32 requests (the next
> power of two below 36), so we can have at most ~1.4MB of I/O outstanding.
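> 
> For reference, the relevant structures in today's blkif.h look roughly
> like this (paraphrased from the attached header - see it for the exact
> definitions):
> 
>        #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
> 
>        struct blkif_request_segment {
>                grant_ref_t gref;       /* reference to I/O buffer frame */
>                /* first_sect/last_sect: first and last sector in the
>                 * frame to transfer (inclusive). */
>                uint8_t     first_sect, last_sect;
>        };
> 
>        struct blkif_request {
>                uint8_t        operation;     /* BLKIF_OP_*                    */
>                uint8_t        nr_segments;   /* number of segments            */
>                blkif_vdev_t   handle;        /* only for read/write requests  */
>                uint64_t       id;            /* private guest value, echoed
>                                               * in the response               */
>                blkif_sector_t sector_number; /* start sector idx on disk      */
>                struct blkif_request_segment seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>        };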
> 
> B). The producer and consumer indexes are on the same cache line. On present
> hardware that means the reader and writer will compete for the same cacheline,
> causing it to ping-pong between sockets.
> 
> C). The requests and responses are on the same ring. This again causes
> ping-pong between sockets as the ownership of the cache line shifts
> between them.
> 
> D). Cache alignment. Currently the protocol is 16-bit aligned. This is awkward
> as requests and responses sometimes fit within a cacheline and sometimes
> straddle it.
> 
> E). Interrupt mitigation. We currently do a kick whenever we are
> done 'processing' the ring. There are better ways to do this - we
> could use existing network interrupt mitigation techniques to make the
> code poll when there is a lot of data.
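> 
> Today the kick is gated only by the event counters in ring.h - roughly
> like this on the producer side (paraphrasing blkfront; 'ring' and 'irq'
> stand in for the driver's own state):
> 
>        int notify;
> 
>        /* Publish the new requests; set notify only if the other end
>         * has not yet seen new work since it last updated req_event. */
>        RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&ring, notify);
>        if (notify)
>                notify_remote_via_irq(irq);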
> 
> F). Latency. The 44kB limit per request means that a 1MB chunk of data -
> which on contemporary devices would need only one I/O request - is split
> up into multiple 'requests', inadvertently delaying the processing of
> said block.
> 
> G) Future extensions. DIF/DIX for integrity. There might
> be others in the future, and it would be good to leave space for extra
> flags TBD.
> 
> H). Separate the response and request rings. The current
> implementation has one thread for one block ring. There is no reason why
> there could not be two threads - one for responses and one for requests -
> especially if they are scheduled on different CPUs. Furthermore this
> could be extended to multi-queue - two queues (request and response)
> per vCPU.
> 
> I). We waste a lot of space on the ring - as we use the
> ring for both requests and responses. The response structure needs to
> occupy the same amount of space as the request structure (112 bytes). If
> the request structure is expanded to be able to fit more segments (say
> the 'struct blkif_sring_entry' is expanded to ~1500 bytes) that still
> requires us to have a matching-size response structure. We do not need
> to use that much space for one response. Having a separate response ring
> would simplify the structures.
> 
> J). 32-bit vs 64-bit. Right now the size
> of the request structure is 112 bytes under a 64-bit guest and 102 bytes
> under a 32-bit guest. It is confusing and furthermore requires the host
> to do extra accounting and processing.
> 
> The crude drawing displays the memory that the ring occupies, in offsets of
> 64 bytes (one cache line). Of course future CPUs could have different cache
> line sizes (say 32 bytes?) - which would skew this drawing. A 32-bit ring is
> a bit different as the 'struct blkif_sring_entry' is 102 bytes.
> 
> 
> Problem A) has two proposed solutions (look at
> http://comments.gmane.org/gmane.comp.emulators.xen.devel/140406 for
> details). One, proposed by Justin from Spectralogic, is to negotiate
> the segment size. This means that the 'struct blkif_sring_entry'
> is now of variable size. It can expand from 112 bytes (covering 11 pages of
> data - 44kB) to 1582 bytes (256 pages of data - so 1MB). It is a simple
> extension: the array in the request expands from 11 entries to a
> negotiated variable size.
> 
> 
> The math is as follows:
> 
> 
>        struct blkif_request_segment {
>                uint32_t grant;                  /* 4 bytes         */
>                uint8_t  first_sect, last_sect;  /* 1 + 1 = 6 bytes */
>        }
> 
> 
> (6 bytes for each segment) - the above structure is in an array of size
> 11 in the request. The 'struct blkif_sring_entry' is 112 bytes. The
> change is to expand the array - in this example we would tack on 245 extra
> 'struct blkif_request_segment' entries: 245*6 + 112 = 1582 bytes. If we were to
> use 36 requests (so 1582*36 + 64) we would use 57016 bytes (14 pages).
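> 
> As a strawman, the negotiated-size request could be expressed with a
> flexible array member (a sketch only - names are illustrative, not final):
> 
>        /* nr_segments is bounded by the value negotiated over XenBus,
>         * so an entry occupies 24 bytes of header plus
>         * nr_segments * sizeof(struct blkif_request_segment). */
>        struct blkif_request_variable {
>                uint8_t        operation;     /* BLKIF_OP_*                   */
>                uint8_t        nr_segments;   /* up to the negotiated maximum */
>                blkif_vdev_t   handle;        /* only for read/write requests */
>                uint64_t       id;            /* echoed in the response       */
>                blkif_sector_t sector_number; /* start sector idx on disk     */
>                struct blkif_request_segment seg[];  /* negotiated length     */
>        };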
> 
> 
> The other solution (from Intel - Ronghui) was to create one extra
> ring that only has 'struct blkif_request_segment' entries in it. The
> 'struct blkif_request' would be changed to carry an index into said
> 'segment ring'. There is only one segment ring. This means that the
> size of the initial ring stays the same. Each request would point
> into the segment ring and enumerate how many of the indexes it wants to
> use. The limit is of course the size of the segment ring. If one assumes a
> one-page segment ring, the requests on the ring can together cover ~4MB.
> The math is as follows:
> 
> 
> The first request uses half of the segment ring - so index 0 up
> to 341 (out of 682). Each entry in the segment ring is a 'struct
> blkif_request_segment', so it occupies 6 bytes. The other requests on
> the ring (there are 35 left) can use either the remaining 341 indexes
> of the segment ring or the old style request. An old style request
> can address up to 44kB. For example:
> 
> 
> sring[0] -> [uses indexes 0->341 in the segment ring] = 342*4096 = 1400832
> sring[1] -> [uses the old style request] = 11*4096 = 45056
> sring[2] -> [uses indexes 342->682 in the segment ring] = 340*4096 = 1392640
> sring[3..32] -> [use the old style request] = 29*4096*11 = 1306624
> 
> 
> Total: 4145152 bytes. Naturally this could be extended with a bigger
> segment ring to cover more.
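> 
> The shapes involved might look like this (a sketch, not a final layout -
> the names and field widths are illustrative):
> 
>        /* One page of these forms the segment ring:
>         * 4096 / 6 = 682 entries per page. */
>        struct blkif_segment_ring_entry {
>                uint32_t grant;                  /* grant reference */
>                uint8_t  first_sect, last_sect;  /* 6 bytes total   */
>        };
> 
>        /* The main-ring request now carries an index instead of an
>         * embedded segment array. */
>        struct blkif_request_indexed {
>                uint8_t        operation;
>                uint16_t       seg_index;     /* first entry in the segment ring */
>                uint16_t       nr_segments;   /* consecutive entries used        */
>                uint64_t       id;
>                blkif_sector_t sector_number;
>        };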
> 
> 
> 
> 
> 
> The problem with this extension is that each entry uses 6 bytes and ends up
> straddling a cache line. Using 8 bytes per entry fixes the straddling, but
> means fitting only 512 segments per page.
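> 
> An 8-byte entry could look like this (padding added so an entry never
> straddles a 64-byte cache line; 4096 / 8 = 512 entries per page):
> 
>        struct blkif_request_segment_aligned {
>                uint32_t grant;                  /* 4 bytes        */
>                uint8_t  first_sect, last_sect;  /* 1 + 1 bytes    */
>                uint16_t _pad;                   /* pad to 8 bytes */
>        } __attribute__((__packed__));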
> 
> 
> There is yet another mechanism that could be employed - and it borrows
> from the VirtIO protocol: 'indirect descriptors'. This is
> very similar to what Intel suggests, but with a twist.
> 
> 
> We could provide a new BLKIF_OP (say BLKIF_OP_INDIRECT) inside the 'struct
> blkif_sring' (each entry can be up to 112 bytes if needed - so the
> old style request would still fit). It would look like:
> 
> 
> /* so 64 bytes under 64-bit. If necessary, the array (seg) can be
>  * expanded to fit 11 segments as the old style request did. */
> struct blkif_request_indirect {
>        uint8_t        op;           /* BLKIF_OP_* (usually READ or WRITE)   */ /* 1 */
>        blkif_vdev_t   handle;       /* only for read/write requests         */ /* 2 */
> #ifdef CONFIG_X86_64
>        uint32_t       _pad1;        /* offsetof(blkif_request,u.rw.id) == 8 */
> #endif
>        uint64_t       id;           /* private guest value, echoed in resp  */
>        grant_ref_t    gref;         /* reference to indirect buffer frame, if used */
>        struct blkif_request_segment_aligned seg[4];  /* each is 8 bytes     */
> } __attribute__((__packed__));
> 
> 
> struct blkif_request {
>        uint8_t        operation;    /* BLKIF_OP_???  */
>        union {
>                struct blkif_request_rw       rw;
>                struct blkif_request_indirect indirect;
>                /* ... other ... */
>        } u;
> } __attribute__((__packed__));
> 
> 
> 
> 
> The 'operation' would be BLKIF_OP_INDIRECT. The read/write/discard,
> etc. operation would now be in indirect.op. The indirect.gref points to
> a page that is filled with:
> 
> 
> struct blkif_request_indirect_entry {
>        blkif_sector_t sector_number;
>        struct blkif_request_segment seg;
> } __attribute__((__packed__));
> /* 16 bytes, so we can fit 256 of these structures in a page. */
> 
> 
> This means that with the existing single-page ring (36 entries fit, 32
> usable as the ring size must be a power of two) we can cover:
> 32 slots * (256 * 4096 bytes per blkif_request_indirect) ~= 32MB. If we
> don't want to use indirect descriptors we can still address up to 4 pages
> in a single request (as it has enough space to contain four segments and
> the structure will still be cache-aligned).
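> 
> To illustrate, a hypothetical frontend sketch (fill_indirect_page and the
> constants are made up for the example; it assumes the structures above
> and one granted 4kB data page per entry):
> 
>        #define SECTORS_PER_PAGE (4096 / 512)
> 
>        /* Fill a shared page with up to 256 indirect entries that
>         * together describe one large, contiguous I/O. */
>        static void fill_indirect_page(struct blkif_request_indirect_entry *page,
>                                       const grant_ref_t *grefs,
>                                       blkif_sector_t start, unsigned int nr)
>        {
>                for (unsigned int i = 0; i < nr && i < 256; i++) {
>                        page[i].sector_number  = start + i * SECTORS_PER_PAGE;
>                        page[i].seg.grant      = grefs[i];  /* one data page */
>                        page[i].seg.first_sect = 0;
>                        page[i].seg.last_sect  = SECTORS_PER_PAGE - 1;
>                }
>        }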
> 
> 
> 
> 
> B). Both the request (req_*) and response (rsp_*) index values are in the same
> cache-line. This means that we end up with the same cacheline being
> modified by two different guests. Depending on the architecture and
> placement of the guests this could be bad - as each logical CPU would
> try to write and read from the same cache-line. A mechanism where
> the req_* and rsp_* values are separated onto different cache lines
> could be used. The correct cache-line size and alignment could
> be negotiated via XenBus - in case future technologies start using 128
> bytes for a cache line or such. Alternatively, the producer and consumer
> indexes could live in separate rings. Meaning we have a 'request ring' and
> a 'response ring' - and only 'req_prod' and 'req_event' are modified in
> the 'request ring'. The opposite (rsp_*) are only modified in the
> 'response ring'.
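> 
> A sketch of the separated index layout (sizes assume a 64-byte cache
> line; the name is illustrative):
> 
>        struct blkif_ring_indexes {
>                uint32_t req_prod;
>                uint32_t req_event;
>                uint8_t  _pad0[56];  /* req_* get their own 64-byte line */
>                uint32_t rsp_prod;
>                uint32_t rsp_event;
>                uint8_t  _pad1[56];  /* rsp_* on a separate line         */
>        };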
> 
> 
> C). Similar to the B) problem but with a bigger payload. Each
> 'blkif_sring_entry' occupies 112 bytes, which does not lend itself
> to a nice cache line size. If the indirect descriptors are to be used
> for everything we could 'slim down' the blkif_request/response to
> be up to 64 bytes. This means modifying BLKIF_MAX_SEGMENTS_PER_REQUEST
> to 5, as that would slim the largest of the structures to 64 bytes
> (24 bytes of header plus five 8-byte segment entries).
> Naturally this means negotiating the new size of the structure via XenBus.
> 
> 
> D). The first picture shows the problem. We are now aligning everything
> on the wrong cachelines. Worst, in half of the cases we straddle
> three cache-lines. We could negotiate a proper alignment for each
> request/response structure.
> 
> 
> E). The network stack has shown that going into a polling mode does improve
> performance. The current mechanism of kicking the guest and/or block
> backend is not always clear.  [TODO: Konrad to explain it in detail]
> 
> 
> F). The current block protocol, for big I/Os that the backend devices could
> handle in one go, ends up doing extra work by splitting the I/O into smaller
> chunks and then reassembling them. With the solutions outlined in A) this can
> be fixed. This is easily seen with 1MB I/Os. Since each request can
> only handle 44kB, we have to split a 1MB I/O into 24 requests
> (24 * 4096 * 11 = 1081344 bytes >= 1MB). Then the backend ends up sending
> them in sector sizes - which with contemporary devices (such as SSDs) ends
> up with more processing. The SSDs are comfortable handling 128kB or bigger
> I/Os in one go.
> 
> 
> G). DIF/DIX. This is a protocol to carry extra 'checksum' information
> for each I/O. The I/O can be a sector size, a page size, or the full I/O size
> (most popular is 1MB). DIF/DIX needs 8 bytes of information for
> each I/O. It would be worth considering putting/reserving that amount of
> space in each request/response. Also putting in extra flags for future
> extensions would be worth it - however the author is not aware of any
> right now.
> 
> 
> H). Separate response/request rings. Potentially even multi-queue, with
> per-vCPU queues. As v2.6.37 demonstrated, the idea of WRITE_BARRIER was
> flawed. There is no similar concept in the storage world where the
> operating system can put a foot down and say: "everything before this
> has to be on the disk." There are lighter versions of this - called
> 'FUA' and 'FLUSH'. Depending on the internal implementation
> of the storage they are either ignored or do the right thing. The
> filesystems determine the viability of these flags and change writing
> tactics depending on them. From a protocol level, this means that the
> WRITE/READ/SYNC requests can be intermixed - the storage itself
> determines the order of the operations. The filesystem is the one that
> determines whether the WRITE should carry a FLUSH to preserve some form
> of atomicity. This means we do not have to preserve an order of operations
> - so we can have multiple queues for requests and responses. This has been
> shown in the network world to improve performance considerably.
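> 
> A per-vCPU layout might be shaped like this (purely illustrative names,
> to make the idea concrete):
> 
>        /* One request/response ring pair per vCPU, each with its own
>         * event channel, so queues never share cache lines or IRQs. */
>        struct blkif_queue {
>                struct blkif_request_ring  *req_ring;  /* guest produces   */
>                struct blkif_response_ring *rsp_ring;  /* backend produces */
>                uint32_t                    evtchn;    /* per-queue event channel */
>        };
> 
>        struct blkif_dev {
>                unsigned int        nr_queues;  /* typically == number of vCPUs */
>                struct blkif_queue *queues;
>        };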
> 
> 
> I). Wastage from responses/requests on the same ring. Currently each response
> MUST occupy the same amount of space that the request occupies - as the
> ring holds both responses and requests. Separating the request and
> response rings would remove the wastage.
> 
> 
> J). 32-bit vs 64-bit (or 102 bytes vs 112 bytes). The size of the ring
> entries is different depending on whether the guest is in 32-bit or 64-bit
> mode. Making them the same size would save considerable accounting that the
> host has to do (an extra memcpy for each response/request).
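> 
> One way to get identical layouts is to use only fixed-width types with
> explicit padding, e.g. (a sketch, not a proposed final layout):
> 
>        /* Identical on 32-bit and 64-bit guests: every hole is spelled
>         * out, so the compiler inserts no mode-dependent padding. */
>        struct blkif_request_fixed {
>                uint8_t  operation;
>                uint8_t  nr_segments;
>                uint16_t handle;
>                uint32_t _pad;           /* explicit, instead of the implicit
>                                          * 64-bit alignment padding         */
>                uint64_t id;
>                uint64_t sector_number;  /* fixed width on both modes        */
>        } __attribute__((__packed__));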

PNG image

PNG image

Attachment: blkif.h
Description: Text document

> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> http://lists.xen.org/xen-devel


 

