Xen project Mailing List

Re: Block protocol incompatibilities with 4K logical sector size disks

To: Roger Pau Monné <roger.pau@xxxxxxxxxx>

From: Anthony PERARD <anthony.perard@xxxxxxxxxx>

Date: Fri, 30 Aug 2024 16:09:25 +0000

Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Paul Durrant <paul@xxxxxxx>, Owen Smith <owen.smith@xxxxxxxxx>, Mark Syms <mark.syms@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>

Delivery-date: Fri, 30 Aug 2024 16:09:37 +0000

Feedback-id: 30504962:30504962.20240830:md

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Thu, Aug 29, 2024 at 05:42:45PM +0200, Roger Pau Monné wrote:
> On Thu, Aug 29, 2024 at 01:15:42PM +0000, Anthony PERARD wrote:
> > On Thu, Aug 29, 2024 at 12:59:43PM +0200, Roger Pau Monné wrote:
> > > The following table attempts to summarize in which units the following fields
> > > are defined for the analyzed implementations (please correct me if I got some
> > > of this wrong):
> > >
> > >                         │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
> > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > FreeBSD blk{front,back} │ sector-size         │ sector-size            │ 512
> > > Linux blk{front,back}   │ 512                 │ 512                    │ 512
> > > QEMU blkback            │ sector-size         │ sector-size            │ sector-size
> > > Windows blkfront        │ sector-size         │ sector-size            │ sector-size
> > > MiniOS                  │ sector-size         │ 512                    │ 512
> > > tapdisk blkback         │ 512                 │ sector-size            │ 512

Tapdisk situation seems more like:

    tapdisk blkback         │ ??????????          │ ???????????            │ ?????

I've looked at the implementation at xapi-project/blktap[1], and the way
sector_number and {first,last}_sect are used seems to vary depending on
which backend is used (block-vhd, block-nbd, block-aio).

[1] https://github.com/xapi-project/blktap

block-vhd seems to mostly use sectors of 512, recalculated with "s->spb"
(sector per block?), but still, a sector seems to be only 512.
block-nbd seems to always set "sector-size" to 512, but uses "sector-size"
for sector_number and {first,last}_sect.

The weirdest one is block-aio, where on read it multiplies sector_number and
{first,last}_sect by 512, but on write, those are multiplied by
"sector-size", with "sector-size" set by ioctl(BLKSSZGET).

At least, it seems "sectors" is a multiple of 512 on all those, like in the
table, but I've only looked at those 3 "drivers".

> > There's OVMF as well, which copied MiniOS's implementation, and looks
> > like it's still the same as MiniOS for the table above:
> >
> > OVMF (based on MiniOS)  │ sector-size         │ 512                    │ 512
> >
> > >
> > > It's all a mess, I'm surprised we didn't get more reports about brokenness when
> > > using disks with 4K logical sectors.
> > >
> > > Overall I think the in-kernel backends are more difficult to update (as it
> > > might require a kernel rebuild), compared to QEMU or blktap.  Hence my slight
> > > preference would be to adjust the public interface to match the behavior of
> > > Linux blkback, and then adjust the implementation in the rest of the backends
> > > and frontends.
> >
> > I would add that making a "sector-size" different from 512 illegal
> > makes going forward easier, as every implementation will work with a
> > "sector-size" of 512, and it is probably going to be the most common sector
> > size for a while longer.
>
> My main concern is the amount of backends out there that already
> expose a "sector-size" different than 512.  I fear any changes here
> will take time to propagate to in-kernel backends, and hence my
> approach was to avoid modifying Linux blkback, because (as seen in the
> FreeBSD bug report) there are already instances of 4K logical sector
> disks being used by users.  Modifying the frontends is likely easier,
> as that's under the control of the owner of the VM.
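
To make the unit mismatch from the table concrete, here is a minimal sketch
(not taken from any of the implementations discussed; the helper name and the
sample values are made up for illustration) of how the same sector_number
lands on different byte offsets depending on which unit each side assumes:

```python
# Hypothetical illustration only: names and values are assumptions,
# not code from blkfront/blkback, QEMU, or blktap.
SECTOR_512 = 512


def byte_offset(sector_number: int, unit: int) -> int:
    """Byte offset of a request, given the unit sector_number is counted in."""
    return sector_number * unit


sector_size = 4096   # logical sector size advertised in "sector-size"
sector_number = 10   # value the frontend places in the ring request

# A frontend counting in "sector-size" units means byte 40960 ...
frontend_view = byte_offset(sector_number, sector_size)
# ... while a backend counting in 512-byte units reads it as byte 5120.
backend_view = byte_offset(sector_number, SECTOR_512)

print(frontend_view, backend_view)
```

With a 512-byte "sector-size" both views agree, which is presumably why the
divergence went unnoticed until 4K logical sector disks showed up.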
>
> > > There was an attempt in 2019 to introduce a new frontend feature flag to signal
> > > whether the frontend supported `sector-size` xenstore nodes different than 512 [0].
> > > However that was only ever implemented for QEMU blkback and Windows blkfront,
> > > all the other backends will expose `sector-size` different than 512 without
> > > checking if `feature-large-sector-size` is exposed by the frontend.  I'm afraid
> > > it's now too late to retrofit that feature into existing backends, seeing as
> > > they already expose `sector-size` nodes greater than 512 without checking if
> > > `feature-large-sector-size` is reported by the frontend.
> >
> > Much before that, "physical-sector-size" was introduced (2013):
> >
> > https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=a67e2dac9f8339681b30b0f89274a64e691ea139
> >
> > Linux seems to implement it, but QEMU or OVMF don't have it.
>
> Yeah, I was aware of this, normal disks already have a physical sector
> size (optimal sector size) and a logical sector size (minimal size
> supported by the drive).  Some implement a smaller logical than
> physical sector size by doing read-modify-write.
>
> > > My proposal would be to adjust the public interface with:
> > >
> > >  * Disk size is calculated as: `sectors` * 512 (`sectors` being the contents of
> > >    such xenstore backend node).
> > >
> > >  * All the sector related fields in blkif ring requests use a 512b base sector
> > >    size, regardless of the value in the `sector-size` xenstore node.
> > >
> > >  * The `sector-size` contains the disk logical sector size.  The frontend must
> > >    ensure that all request segment addresses are aligned and their length is
> > >    a multiple of such size.  Otherwise the backend will refuse to process the
> > >    request.
> >
> > You still want to try to have a "sector-size" different from 512? To me
> > this just adds confusion to the confusion.
> > There would be no way for the
> > backend or frontend to know if setting something other than 512 is going
> > to work.
>
> But that's already the case, most (all?) backends except QEMU will set
> "sector-size" to the underlying block storage logical sector size

QEMU, only if feature-large-sector-size is set, indeed; otherwise it just
returns an error if it has to set "sector-size" to a value different from
512.

Otherwise, yes for Linux, FreeBSD, and maybe yes for blktap. For blktap it
seems to depend on the storage, more or less:
    - block-vhd: always "sector-size" = 512
    - block-nbd: always "sector-size" = 512
    - block-aio: physical storage sector size

> without any way to tell if the frontend supports sector-sizes != 512.
> So the issue is not inherently with the setting of the "sector-size"
> node to a value different than 512, but rather how different
> implementations have diverged regarding which is the base unit of
> other fields.
>
> > Also, it is probably easier to update the backend than the frontend, so
> > it is just likely that something is going to lag behind and break.
>
> Hm, I'm not convinced, sometimes the owner of a VM has no control over
> the version of the backends if it's not the admin of the host.  OTOH
> the owner of a VM could always update the kernel in order to
> workaround such blkfront/blkback incompatibility issues.  Hence my
> preference was for solutions that didn't involve changing Linux
> blkback, as I believe that's the most commonly used backend.

Going the Linux way might be the least bad option indeed. Sectors in
requests have been described as 512-byte units for a long while. It's only
"sectors" that has been described as being in "sector-size"-byte units.

> > Why not make use of the node "physical-sector-size" that has existed
> > for 10 years, even if unused or unadvertised, and if an IO request isn't
> > aligned on it, it is just going to be slow (as the backend would have to
> > read, update, write instead of just writing sectors).
>
> I don't really fancy implementing read-modify-write on the backends,
> as it's going to add more complexity to blkback implementations,
> specially the in-kernel ones I would assume.
>
> All frontends I've looked into support "sector-size" != 512, but
> there's a lack of uniformity on whether other units used in the
> protocol are based on the blkback exposed "sector-size", or hardcoded
> to 512.
>
> So your suggestion would be to hardcode "sector-size" to 512 and use
> the "physical-sector-size" node value to set the block device logical
> sector size in the frontends?
>
> If we go that route I would suggest that backends are free to refuse
> requests that aren't a multiple of "physical-sector-size".

After looking in more detail at the different implementations, including the
Linux one, I don't think changing the meaning of "physical-sector-size" is
going to be helpful.

What to do about "feature-large-sector-size"? Should the backend refuse to
connect to the frontend if that flag is set and "sector-size" wants to be
different than 512? This would just be the Windows frontend I guess. (Just
as a helper for an updated backend.)

So yes, after more research, having sectors in the protocol be 512 bytes in
size seems the least bad option. "sector_number" and "{first,last}_sect"
have been described as such for a long while. Only "sectors" for the size
has been described as a "sector-size" quantity.

Cheers,

-- 
Anthony Perard | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech
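
For illustration, the 512-byte convention discussed in this thread could be
sketched roughly as follows. This is only one reading of the proposal, not
code from any frontend or backend; all function names are hypothetical, and
the alignment rule is my interpretation of "aligned and a multiple of
`sector-size`":

```python
# Hypothetical sketch of the proposed convention; every name is an assumption.
BASE_SECTOR = 512  # fixed unit for "sectors" and all ring request fields


def disk_size_bytes(sectors: int) -> int:
    """Disk size from the "sectors" xenstore node, always in 512-byte units."""
    return sectors * BASE_SECTOR


def start_is_aligned(sector_number: int, sector_size: int) -> bool:
    """True if the request start (in 512-byte units) falls on a logical sector."""
    ratio = sector_size // BASE_SECTOR
    return sector_number % ratio == 0


def segment_is_aligned(first_sect: int, last_sect: int, sector_size: int) -> bool:
    """True if an inclusive {first,last}_sect range (512-byte units) covers
    whole logical sectors of size sector_size."""
    ratio = sector_size // BASE_SECTOR
    length = last_sect - first_sect + 1
    return first_sect % ratio == 0 and length > 0 and length % ratio == 0


# With a 4K logical sector size, only whole groups of eight 512-byte
# sectors would be acceptable; a backend could refuse anything else.
print(disk_size_bytes(2048))
print(segment_is_aligned(0, 7, 4096), segment_is_aligned(0, 3, 4096))
```

Under this reading a frontend on a 4K disk keeps counting in 512-byte units
but only ever issues requests that start and end on 4K boundaries.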
