[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 1/2 linux-next] Revert "ufs: fix deadlocks introduced by sb mutex merge"



On Sat 20-06-15 00:07:39, Al Viro wrote:
> On Wed, Jun 17, 2015 at 09:31:16PM +0100, Al Viro wrote:
> > On Wed, Jun 17, 2015 at 10:57:15AM +0200, Jan Kara wrote:
> > > > Joy...  Folks, is anybody actively maintaining fs/ufs these days?
> > > 
> > > Looking into the changelog there wasn't anyone seriously looking into UFS
> > > for at least 5-6 years... Fabian did some cleanups recently but they were
> > > mostly cosmetic. So I don't think it's really maintained. OTOH clearly
> > > there are some users since occasionally someone comes with a bug report.
> > > Welcome to the world where we have ~50 on disk / network filesystems :-|
> > 
> > FWIW, I've a partial rewrite of that area (completely untested, etc.)
> > in vfs.git#ufs; it's still in progress, but hopefully it'll be done
> > by tomorrow.  Basically, it switches the sucker to locking scheme
> > similar to ext2 (modulo seqlock instead of rwlock for protection of
> > block pointers, having to be more careful due to 64bit assignments not
> > being atomic and being unable to fall back to ->truncate_mutex in
> > case of race on the read side) and starts cleaning the hell out of
> > the truncate (and eventually create side of get_block as well).
> > 
> > As far as I can see, the root of the problems with that sucker is the
> > misplaced handling of tail-packing logics.  That had ballooned into
> > a lot of convoluted codepaths ;-/  I'm not blaming Evgeniy - it *is*
> > painful to massage into saner shape and the damn thing kept missing
> > cleanups that were done to similar filesystems due to that...
> > 
> > One thing I'm somewhat unhappy about in what Evgeniy had done is dealing 
> > with
> > UFS vs. UFS2 differences at such a low level.  Sure, the way we did 
> > endianness
> > handling in there provoked exactly that, but it might have been better to
> > go the other way round; see e.g. fs/minix/itree*.c for opposite approach.
> > 
> > All this stuff is currently completely untested; I'll be using generic
> > parts of xfstests for beating it up, but any assistance would be welcome.
> > Note: all testing I'll be realistically able to do will be for FreeBSD
> > UFS variants - I don't have Solaris left on any boxen, to start with...
> 
> FWIW, I've got to one bloody unpleasant part of UFS we'd never managed to
> get right and it seems that I have a potential solution.
> 
> Problem: UFS has larger-than-page allocation units.  And when we allocate
> a block in a hole, we *must* zero the whole thing out.  The trouble being,
> we notice that inside writepage(), where we are already holding a lock on
> our page and where we can't lock other pages without risking a deadlock.
> 
> What's more, as soon as we'd inserted a reference to that block into inode,
> we are open to readpage on another page backed by the same block coming
> and seeing a normal mapped area.  So UFS block allocator ends up zeroing
> them out via _buffer_ cache, waiting for writes to finish before returning
> the block to caller.  It doesn't wait for metadata blocks (they are accessed
> via buffer cache), but for data we end up with zeroes being written to disk,
> no matter what.
> 
> That works, but it obviously hurts the performance.  Badly.  Writes into
> a hole are relatively rare, but the same thing hurts normal sequential
> writes to the end of file, which is not rare at all.
> 
> How about the following approach:
>       * Let allocation of data block at the EOF return the sucker without
> writing zeroes out.
>       * Let ->write_iter() check if it's about to create a hole at the
> (current) EOF.  In such case start with writing zeroes (in normal fashion,
> via ->begin_write()/memset()/->end_write() through the page cache) from the
> EOF to block boundary (or beginning of the area we were going to write to,
> whichever's closer).  Then proceed in normal fashion.  Incidentally, it will
> deal with tail unpacking for free.
>       * Do the same upon extending truncate.
> 
> Does anybody see holes in that?  readpage() crossing the EOF isn't a problem -
> it will not even attempt to map past the current ->i_size and will zero the
> page tail out just fine.  We get junk past the EOF on disk, but we zero it
> out (or overwrite with our data) before increasing ->i_size.
> 
> Dealing with sparse files is also possible (taking allocations from
> ->writepage() to ->page_mkwrite(), where we are not holding a page lock
> and thus can grab all affected pages), but it would be bigger, trickier
> and benefit only a relatively rare case.
> 
> Comments?

Looks good to me. BTW also ext4 (with BIGALLOC feature) and OCFS2 can have
block allocation unit (called cluster) larger than page size. However the
block size of both filesystems is still <= page size. So at least ext4
handles fun with partially initialized clusters by just marking parts
of the cluster as uninitialized in the extent tree. But the code is still
pretty messy to be honest.

                                                                Honza

-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.