[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Block protocol incompatibilities with 4K logical sector size disks



On Fri, Aug 30, 2024 at 04:09:25PM +0000, Anthony PERARD wrote:
> On Thu, Aug 29, 2024 at 05:42:45PM +0200, Roger Pau Monné wrote:
> > On Thu, Aug 29, 2024 at 01:15:42PM +0000, Anthony PERARD wrote:
> > > On Thu, Aug 29, 2024 at 12:59:43PM +0200, Roger Pau Monné wrote:
> > > > The following table attempts to summarize in which units the following 
> > > > fields
> > > > are defined for the analyzed implementations (please correct me if I 
> > > > got some
> > > > of this wrong):
> > > >
> > > >                         │ sectors xenbus node │ requests sector_number 
> > > > │ requests {first,last}_sect
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > FreeBSD blk{front,back} │     sector-size     │      sector-size       
> > > > │           512
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > Linux blk{front,back}   │         512         │          512           
> > > > │           512
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > QEMU blkback            │     sector-size     │      sector-size       
> > > > │       sector-size
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > Windows blkfront        │     sector-size     │      sector-size       
> > > > │       sector-size
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > MiniOS                  │     sector-size     │          512           
> > > > │           512
> > > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > > tapdisk blkback         │         512         │      sector-size       
> > > > │           512
> 
> Tapdisk situation seems more like:
> 
>      tapdisk blkback         │      ??????????     │      ???????????       │ 
>         ?????
> 
> I've looks at the implementation at xapi-project/blktat[1] and the way
> sector_number or {first,last}_sect seems to be used varied on which
> backend is used (block-vhd, block-nbd, block-aio).
> 
> [1] https://github.com/xapi-project/blktap
> 
> block-vhd seems mostly sectors of 512 but recalculated with "s->spb"
> (sector per block?) but still, sector seems to be only 512.
> 
> block-nbd seems to set "sector-size" to always 512, but uses
> "sector-size" for sector_number and {first,last}_sect.
> 
> The weirdest one is block-aio, where on read it multiply sector_number
> and {first,last}_sect by 512, but on write, those are multiplied by
> "sector-size". With "sector-size" set by ioctl(BLKSSZGET)
> 
> At least, is seems "sectors" is a multiple of 512 on all those, like in
> the table, but I've only look at those 3 "drivers".

You looked more than myself, I've just looked at the block-aio write
path I think, and I was happy enough to see it was yet different from
the other implementations.

I think it's clear enough that every implementation has slight
differences, and I don't plan to fix all of them.

> 
> > > There's OVMF as well, which copied MiniOS's implementation, and looks
> > > like it's still the same as MiniOS for the table above:
> > >
> > > OVMF (base on MiniOS)   │     sector-size     │          512           │  
> > >          512
> > >
> > > >
> > > > It's all a mess, I'm surprised we didn't get more reports about 
> > > > brokenness when
> > > > using disks with 4K logical sectors.
> > > >
> > > > Overall I think the in-kernel backends are more difficult to update (as 
> > > > it
> > > > might require a kernel rebuild), compared to QEMU or blktap.  Hence my 
> > > > slight
> > > > preference would be to adjust the public interface to match the 
> > > > behavior of
> > > > Linux blkback, and then adjust the implementation in the rest of the 
> > > > backends
> > > > and frontends.
> > >
> > > I would add that making "sector-size" been different from 512 illegal
> > > makes going forward easier, has every implementation will work with a
> > > "sector-size" of 512, and it probably going to be the most common sector
> > > size for a while longer.
> >
> > My main concern is the amount of backends out there that already
> > expose a "sector-size" different than 512.  I fear any changes here
> > will take time to propagate to in-kernel backends, and hence my
> > approach was to avoid modifying Linux blkback, because (as seen in the
> > FreeBSD bug report) there are already instances of 4K logical sector
> > disks being used by users.  Modifying the frontends is likely easier,
> > as that's under the owner of the VM control.
> >
> > > > There was an attempt in 2019 to introduce a new frontend feature flag 
> > > > to signal
> > > > whether the frontend supported `sector-size` xenstore nodes different 
> > > > than 512 [0].
> > > > However that was only ever implemented for QEMU blkback and Windows 
> > > > blkfront,
> > > > all the other backends will expose `sector-size` different than 512 
> > > > without
> > > > checking if `feature-large-sector-size` is exposed by the frontend.  
> > > > I'm afraid
> > > > it's now too late to retrofit that feature into existing backends, 
> > > > seeing as
> > > > they already expose `sector-size` nodes greater than 512 without 
> > > > checking if
> > > > `feature-large-sector-size` is reported by the frontend.
> > >
> > > Much before that, "physical-sector-size" was introduced (2013):
> > >     
> > > https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=a67e2dac9f8339681b30b0f89274a64e691ea139
> > >
> > > Linux seems to implement it, but QEMU or OVMF don't have it.
> >
> > Yeah, I was aware of this, normal disks already have a physical sector
> > size (optimal sector size) and a logical sector size (minimal size
> > supported by the drive).  Some implement a smaller logical than
> > physical sector size by doing read-modify-write.
> >
> > > > My proposal would be to adjust the public interface with:
> > > >
> > > >  * Disk size is calculated as: `sectors` * 512 (`sectors` being the 
> > > > contents of
> > > >    such xenstore backend node).
> > > >
> > > >  * All the sector related fields in blkif ring requests use a 512b base 
> > > > sector
> > > >    size, regardless of the value in the `sector-size` xenstore node.
> > > >
> > > >  * The `sector-size` contains the disk logical sector size.  The 
> > > > frontend must
> > > >    ensure that all request segments addresses are aligned and it's 
> > > > length is
> > > >    a multiple of such size.  Otherwise the backend will refuse to 
> > > > process the
> > > >    request.
> > >
> > > You still want to try to have a "sector-size" different from 512? To me
> > > this just add confusion to the confusion. There would be no way fro
> > > backend or frontend to know if setting something other than 512 is going
> > > to work.
> >
> > But that's already the case, most (all?) backends except QEMU will set
> > "sector-size" to the underlying block storage logical sector size
> 
> QEMU, only if feature-large-sector-size is set, indeed, otherwise it
> just return an error if it have to set "sector-size" to a value
> different from 512.
> 
> Otherwise, yes for Linux, FreeBSD, and maybe yes for blktap. For blktap
> it seems to depend of the storage, more or less:
>     - block-vhd: always "sector-size" = 512
>     - block-nbd: always "sector-size" = 512
>     - block-aio: physical storage sector size
> 
> > without any way to tell if the frontend supports sector-sizes != 512.
> > So the issue is not inherently with the setting of the "sector-size"
> > node to a value different than 512, but rather how different
> > implementations have diverged regarding which is the base unit of
> > other fields.
> >
> > > Also, it is probably easier to update backend than frontend, so
> > > it is just likely that something is going to lag behind and broke.
> >
> > Hm, I'm not convinced, sometimes the owner of a VM has no control over
> > the version of the backends if it's not the admin of the host.  OTOH
> > the owner of a VM could always update the kernel in order to
> > workaround such blkfront/blkback incompatibility issues.  Hence my
> > preference was for solutions that didn't involve changing Linux
> > blkback, as I believe that's the most commonly used backend.
> 
> Going the Linux way might be the least bad option indeed. sectors in
> requests has been described as a 512-bytes for a long while. It's only
> "sectors" that have been described as "sector-size"-bytes size.
> 
> > > Why not make use of the node "physical-sector-size" that have existed
> > > for 10 years, even if unused or unadvertised, and if an IO request isn't
> > > aligned on it, it is just going to be slow (as backend would have to
> > > read,update,write instead of just write sectors).
> >
> > I don't really fancy implementing read-modify-write on the backends,
> > as it's going to add more complexity to blkback implementations,
> > specially the in-kernel ones I would assume.
> >
> > All frontends I've looked into support "sector-size" != 512, but
> > there's a lack of uniformity on whether other units used in the
> > protocol are based on the blkback exposed "sector-size", or hardcoded
> > to 512.
> >
> > So your suggestion would be to hardcode "sector-size" to 512 and use
> > the "physical-sector-size" node value to set the block device logical
> > sector size the frontends?
> >
> > If we go that route I would suggest that backends are free to refuse
> > requests that aren't a multiple of "physical-sector-size".
> 
> After looking in more detail in the different implementations, and linux
> one, I don't think changing "physical-sector-size" meaning is going to
> be helpful.
> 
> What to do about "feature-large-sector-size"? Should backend refuse to
> connect to the front end if that flag is set and "sector-size" want to
> be different than 512? This would just be Windows frontend I guess.
> (Just as an helper for updated backend)

I think it's likely too late to mandate exposing
"feature-large-sector-size" in the frontends, as for example Linux
blkfront will handle "sector-size" != 512, yet it doesn't expose
"feature-large-sector-size".  If we retroactively push this change to
the backends we might break setups that were working fine
previously.

> 
> So yes, after more research, having sector in the protocol been a
> 512-byte size seems the least bad option. "sector_number" and
> "{first,last}_sect" have been described as is for a long while. Only
> "sectors" for the size has been described as a "sector-size" quantity.

Thanks for your input.  I would also like to hear from the blktap and
Windows PV drivers maintainers, as the change that I'm proposing here
will require changes to their implementations.

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.