[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Block protocol incompatibilities with 4K logical sector size disks



On Thu, Aug 29, 2024 at 05:42:45PM +0200, Roger Pau Monné wrote:
> On Thu, Aug 29, 2024 at 01:15:42PM +0000, Anthony PERARD wrote:
> > On Thu, Aug 29, 2024 at 12:59:43PM +0200, Roger Pau Monné wrote:
> > > The following table attempts to summarize in which units the following 
> > > fields
> > > are defined for the analyzed implementations (please correct me if I got 
> > > some
> > > of this wrong):
> > >
> > >                         │ sectors xenbus node │ requests sector_number │ 
> > > requests {first,last}_sect
> > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > FreeBSD blk{front,back} │     sector-size     │      sector-size       │  
> > >          512
> > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > Linux blk{front,back}   │         512         │          512           │  
> > >          512
> > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > QEMU blkback            │     sector-size     │      sector-size       │  
> > >      sector-size
> > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > Windows blkfront        │     sector-size     │      sector-size       │  
> > >      sector-size
> > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > MiniOS                  │     sector-size     │          512           │  
> > >          512
> > > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > > tapdisk blkback         │         512         │      sector-size       │  
> > >          512

Tapdisk situation seems more like:

     tapdisk blkback         │      ??????????     │      ???????????       │   
      ?????

I've looks at the implementation at xapi-project/blktat[1] and the way
sector_number or {first,last}_sect seems to be used varied on which
backend is used (block-vhd, block-nbd, block-aio).

[1] https://github.com/xapi-project/blktap

block-vhd seems mostly sectors of 512 but recalculated with "s->spb"
(sector per block?) but still, sector seems to be only 512.

block-nbd seems to set "sector-size" to always 512, but uses
"sector-size" for sector_number and {first,last}_sect.

The weirdest one is block-aio, where on read it multiply sector_number
and {first,last}_sect by 512, but on write, those are multiplied by
"sector-size". With "sector-size" set by ioctl(BLKSSZGET)

At least, is seems "sectors" is a multiple of 512 on all those, like in
the table, but I've only look at those 3 "drivers".


> > There's OVMF as well, which copied MiniOS's implementation, and looks
> > like it's still the same as MiniOS for the table above:
> >
> > OVMF (base on MiniOS)   │     sector-size     │          512           │    
> >        512
> >
> > >
> > > It's all a mess, I'm surprised we didn't get more reports about 
> > > brokenness when
> > > using disks with 4K logical sectors.
> > >
> > > Overall I think the in-kernel backends are more difficult to update (as it
> > > might require a kernel rebuild), compared to QEMU or blktap.  Hence my 
> > > slight
> > > preference would be to adjust the public interface to match the behavior 
> > > of
> > > Linux blkback, and then adjust the implementation in the rest of the 
> > > backends
> > > and frontends.
> >
> > I would add that making "sector-size" been different from 512 illegal
> > makes going forward easier, has every implementation will work with a
> > "sector-size" of 512, and it probably going to be the most common sector
> > size for a while longer.
>
> My main concern is the amount of backends out there that already
> expose a "sector-size" different than 512.  I fear any changes here
> will take time to propagate to in-kernel backends, and hence my
> approach was to avoid modifying Linux blkback, because (as seen in the
> FreeBSD bug report) there are already instances of 4K logical sector
> disks being used by users.  Modifying the frontends is likely easier,
> as that's under the owner of the VM control.
>
> > > There was an attempt in 2019 to introduce a new frontend feature flag to 
> > > signal
> > > whether the frontend supported `sector-size` xenstore nodes different 
> > > than 512 [0].
> > > However that was only ever implemented for QEMU blkback and Windows 
> > > blkfront,
> > > all the other backends will expose `sector-size` different than 512 
> > > without
> > > checking if `feature-large-sector-size` is exposed by the frontend.  I'm 
> > > afraid
> > > it's now too late to retrofit that feature into existing backends, seeing 
> > > as
> > > they already expose `sector-size` nodes greater than 512 without checking 
> > > if
> > > `feature-large-sector-size` is reported by the frontend.
> >
> > Much before that, "physical-sector-size" was introduced (2013):
> >     
> > https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=a67e2dac9f8339681b30b0f89274a64e691ea139
> >
> > Linux seems to implement it, but QEMU or OVMF don't have it.
>
> Yeah, I was aware of this, normal disks already have a physical sector
> size (optimal sector size) and a logical sector size (minimal size
> supported by the drive).  Some implement a smaller logical than
> physical sector size by doing read-modify-write.
>
> > > My proposal would be to adjust the public interface with:
> > >
> > >  * Disk size is calculated as: `sectors` * 512 (`sectors` being the 
> > > contents of
> > >    such xenstore backend node).
> > >
> > >  * All the sector related fields in blkif ring requests use a 512b base 
> > > sector
> > >    size, regardless of the value in the `sector-size` xenstore node.
> > >
> > >  * The `sector-size` contains the disk logical sector size.  The frontend 
> > > must
> > >    ensure that all request segments addresses are aligned and it's length 
> > > is
> > >    a multiple of such size.  Otherwise the backend will refuse to process 
> > > the
> > >    request.
> >
> > You still want to try to have a "sector-size" different from 512? To me
> > this just add confusion to the confusion. There would be no way fro
> > backend or frontend to know if setting something other than 512 is going
> > to work.
>
> But that's already the case, most (all?) backends except QEMU will set
> "sector-size" to the underlying block storage logical sector size

QEMU, only if feature-large-sector-size is set, indeed, otherwise it
just return an error if it have to set "sector-size" to a value
different from 512.

Otherwise, yes for Linux, FreeBSD, and maybe yes for blktap. For blktap
it seems to depend of the storage, more or less:
    - block-vhd: always "sector-size" = 512
    - block-nbd: always "sector-size" = 512
    - block-aio: physical storage sector size

> without any way to tell if the frontend supports sector-sizes != 512.
> So the issue is not inherently with the setting of the "sector-size"
> node to a value different than 512, but rather how different
> implementations have diverged regarding which is the base unit of
> other fields.
>
> > Also, it is probably easier to update backend than frontend, so
> > it is just likely that something is going to lag behind and broke.
>
> Hm, I'm not convinced, sometimes the owner of a VM has no control over
> the version of the backends if it's not the admin of the host.  OTOH
> the owner of a VM could always update the kernel in order to
> workaround such blkfront/blkback incompatibility issues.  Hence my
> preference was for solutions that didn't involve changing Linux
> blkback, as I believe that's the most commonly used backend.

Going the Linux way might be the least bad option indeed. sectors in
requests has been described as a 512-bytes for a long while. It's only
"sectors" that have been described as "sector-size"-bytes size.

> > Why not make use of the node "physical-sector-size" that have existed
> > for 10 years, even if unused or unadvertised, and if an IO request isn't
> > aligned on it, it is just going to be slow (as backend would have to
> > read,update,write instead of just write sectors).
>
> I don't really fancy implementing read-modify-write on the backends,
> as it's going to add more complexity to blkback implementations,
> specially the in-kernel ones I would assume.
>
> All frontends I've looked into support "sector-size" != 512, but
> there's a lack of uniformity on whether other units used in the
> protocol are based on the blkback exposed "sector-size", or hardcoded
> to 512.
>
> So your suggestion would be to hardcode "sector-size" to 512 and use
> the "physical-sector-size" node value to set the block device logical
> sector size the frontends?
>
> If we go that route I would suggest that backends are free to refuse
> requests that aren't a multiple of "physical-sector-size".

After looking in more detail in the different implementations, and linux
one, I don't think changing "physical-sector-size" meaning is going to
be helpful.

What to do about "feature-large-sector-size"? Should backend refuse to
connect to the front end if that flag is set and "sector-size" want to
be different than 512? This would just be Windows frontend I guess.
(Just as an helper for updated backend)


So yes, after more research, having sector in the protocol been a
512-byte size seems the least bad option. "sector_number" and
"{first,last}_sect" have been described as is for a long while. Only
"sectors" for the size has been described as a "sector-size" quantity.

Cheers,

--

Anthony Perard | Vates XCP-ng Developer

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.