
Re: [PATCH 11/20] block: reference struct block_device from struct hd_struct



Hello,

This is great. So much simpler & better. Some nits below.

> diff --git a/block/partitions/core.c b/block/partitions/core.c
> index a02e224115943d..0ba0bf44b88af3 100644
> --- a/block/partitions/core.c
> +++ b/block/partitions/core.c
> @@ -340,12 +340,11 @@ void delete_partition(struct hd_struct *part)
>       device_del(part_to_dev(part));
>  
>       /*
> -      * Remove gendisk pointer from idr so that it cannot be looked up
> -      * while RCU period before freeing gendisk is running to prevent
> -      * use-after-free issues. Note that the device number stays
> -      * "in-use" until we really free the gendisk.
> +      * Remove the block device from the inode hash, so that it cannot be
> +      * looked up while waiting for the RCU grace period.
>        */
> -     blk_invalidate_devt(part_devt(part));
> +     remove_inode_hash(part->bdev->bd_inode);

I don't think this is necessary now that the bdev and inode lifetimes are
one. Before, punching out the association early was necessary because we
could end up successfully looking up a part in the idr and then trying to
pin the associated disk, which may already have been freed. With the new
code, the lookup goes through the inode, whose lifetime is one and the same
as the gendisk's, so a use-after-free isn't possible and __blkdev_get() will
reliably reject such open attempts.

...
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4c4d6c30382c06..e94633dc6ad93b 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -870,34 +870,50 @@ void __init bdev_cache_init(void)
>       blockdev_superblock = bd_mnt->mnt_sb;   /* For writeback */
>  }
>  
> -static struct block_device *bdget(dev_t dev)
> +struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
>  {
>       struct block_device *bdev;
>       struct inode *inode;
>  
> -     inode = iget_locked(blockdev_superblock, dev);
> +     inode = new_inode(blockdev_superblock);
>       if (!inode)
>               return NULL;
>  
> -     bdev = &BDEV_I(inode)->bdev;
> +     bdev = I_BDEV(inode);
> +     spin_lock_init(&bdev->bd_size_lock);
> +     bdev->bd_disk = disk;
> +     bdev->bd_partno = partno;
> +     bdev->bd_contains = NULL;
> +     bdev->bd_super = NULL;
> +     bdev->bd_inode = inode;
> +     bdev->bd_part_count = 0;
> +
> +     inode->i_mode = S_IFBLK;
> +     inode->i_rdev = 0;
> +     inode->i_bdev = bdev;
> +     inode->i_data.a_ops = &def_blk_aops;

Missing the call to mapping_set_gfp_mask(&inode->i_data, GFP_USER) that the
old code had.

>  
> -     if (inode->i_state & I_NEW) {
> -             spin_lock_init(&bdev->bd_size_lock);
> -             bdev->bd_contains = NULL;
> -             bdev->bd_super = NULL;
> -             bdev->bd_inode = inode;
> -             bdev->bd_part_count = 0;
> -             bdev->bd_dev = dev;
> -             inode->i_mode = S_IFBLK;
> -             inode->i_rdev = dev;
> -             inode->i_bdev = bdev;
> -             inode->i_data.a_ops = &def_blk_aops;
> -             mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> -             unlock_new_inode(inode);
> -     }
>       return bdev;
>  }
...
>  /**
>   * bdgrab -- Grab a reference to an already referenced block device
>   * @bdev:    Block device to grab a reference to.
> @@ -957,6 +973,10 @@ static struct block_device *bd_acquire(struct inode 
> *inode)
>               bd_forget(inode);
>  
>       bdev = bdget(inode->i_rdev);
> +     if (!bdev) {
> +             blk_request_module(inode->i_rdev);
> +             bdev = bdget(inode->i_rdev);
> +     }

One side effect here is that, before, a device which uses the traditional
consecutive devt range would reserve all minors for the partitions whether
they exist or not and fail open requests without requesting the matching
module. After the change, trying to open a non-existent partition would
trigger a module probe. I don't think the behavior change is consequential
in any sane not-crazily-arcane setup, but it might be worth mentioning in
the commit log.

Thank you.

-- 
tejun
