
Re: [Xen-users] Cheap IOMMU hardware and ECC support importance

  • To: xen-users@xxxxxxxxxxxxx
  • From: Gordan Bobic <gordan@xxxxxxxxxx>
  • Date: Mon, 30 Jun 2014 10:04:05 +0100
  • Delivery-date: Mon, 30 Jun 2014 09:04:19 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

On 06/29/2014 06:07 AM, lee wrote:
Gordan Bobic <gordan@xxxxxxxxxx> writes:

On 06/28/2014 12:25 PM, lee wrote:
Kuba <kuba.0000@xxxxx> writes:

SSD caching
means two extra disks for the cache (or what happens when the cache disk

For ZIL (write caching), yes, you can use a mirrored device. For read
caching it obviously doesn't matter.

That's not so obvious --- when the read cache fails, ZFS would
automatically have to resort to the disks.

and ZFS doesn't increase the number of SAS/SATA ports you have.

No, but it does deprecate the RAID and caching parts of a controller,

Why does it deprecate them?

Because its RAID is far more advanced, and it makes far better use of the caches built into the disks. For example, ZFS parity RAID (n+1, n+2, n+3) avoids the parity RAID write hole, where a partial stripe write requires two operations:

1) The data write itself, plus a read of the rest of the stripe to recompute the parity
2) A write of the updated parity block

Since ZFS uses variable-width stripes, every write is always a single operation.
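For anyone wanting to try this, a minimal sketch of creating an n+2 (RAID-Z2) pool; the pool name and device names here are hypothetical, and you obviously need ZFS installed and spare disks:

```shell
# "tank" and sda..sdf are placeholder names - substitute your own.
# A raidz vdev writes variable-width stripes, so there is no
# read-modify-write cycle on partial stripe writes (no write hole).
zpool create tank raidz2 sda sdb sdc sdd sde sdf

# Shows the raidz2-0 vdev and its member disks.
zpool status tank
```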

so you might as well just use an HBA (cheaper). Covering the whole
stack, ZFS can also make much better use of on-disk caches (my 4TB
HGSTs have 64MB of RAM each. If you have 20 of them on a 4-port SATA
card with a 5-port multiplier on each port,

There are multipliers for SATA ports?


Can you connect SAS disks to them as well?

No. You cannot plug SAS disks into SATA ports, with or without multipliers.

Do the disks show up individually or bundled when you use one?

Individually - unless you get a more advanced multiplier that makes them into one big logical RAID-ed device - but don't do that. :)

Aren't they getting into each others ways, filling up the bandwidth of
the port?

If your HBA/PMP/HDD don't support FIS+NCQ, then yes: operations cannot be multiplexed, and performance degrades with every disk you add, because each operation holds the bus until it completes.

If your HBA/PMP/HDD do support FIS+NCQ (the models I mentioned do), then the bandwidth is effectively multiplexed on demand. It works a bit like VLAN tagging. A command gets issued, but while that command is completing (~8ms on a 7200rpm disk) you can issue commands to other disks, so multiple commands on multiple disks can be completing at the same time. As each disk completes a command and returns data, the same happens in reverse. Because commands get interleaved this way, adding more disks increases the upstream port utilization (up to its capacity, if your command mix can saturate that much bandwidth).

Saturating the upstream port is only really an issue if all of your disks are doing large linear transfers. With typical I/O patterns, you spend most of the time waiting for the rotational latency, so in realistic use, the fact that you don't have a dedicated port's worth of bandwidth for each disk doesn't matter as much.

SAS expanders work in a similar way.
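To make the saturation arithmetic concrete (the device path and throughput figures below are illustrative assumptions, not measurements):

```shell
# A queue depth greater than 1 means the kernel can keep multiple NCQ
# commands in flight on this drive (assumes /dev/sda exists).
cat /sys/block/sda/device/queue_depth 2>/dev/null

# Back-of-the-envelope port arithmetic: five disks at ~150 MB/s
# sequential behind one 6 Gbit/s (~600 MB/s) link can oversubscribe
# it - but only under pure streaming loads.
awk 'BEGIN { print 5 * 150 " MB/s offered vs ~600 MB/s of port bandwidth" }'
```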

How does it do the checksumming?

Every block is checksummed, and this is stored and checked on every
read of that block. In addition, every block (including its checksum)
is encoded for whatever extra redundancy you specified (e.g. mirroring
or n+1, n+2 or n+3). So if you read the block, you also read the
checksum stored with it, and if it checks out, you hand the data to
the app with nothing else to be done. If the checksum doesn't match
the data (silent corruption), or the read of one of the disks
containing a piece of the block fails (non-silent corruption, failed
sector), ZFS will go

And? Correct the error?

Sorry, did that get truncated?
Yes, indeed: ZFS will initiate recovery, find a combination of blocks that, when assembled, matches the checksum, return the data, recalculate the damaged block, and write it back to the disk that didn't return the correct data.
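You can watch this self-healing happen in practice; a sketch, assuming a pool named "tank":

```shell
# A scrub forces ZFS to re-read and verify every allocated block
# against its checksum; anything repaired from redundancy shows up
# in the CKSUM column and the "scan:" / "errors:" lines of status.
zpool scrub tank
zpool status -v tank
```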

So it's like RAID built into the file system?  What about all the CPU

It's CPU-expensive, for a file system. In reality, I use a number of 1.3GHz N36L HP MicroServers with 4-8 disks each in RAIDZ2 (n+2 redundancy, similar to RAID6), and even during weekly disk scrubs they never come close to running out of CPU.

Read everything after it's been written to verify?

No, just written with a checksum on the block and encoded for extra

That means you don't really know whether the data has been written as
expected before it's read.

No, you don't - but you don't with any kind of RAID either. The only feature available for that is the disk's own Write-Read-Verify (WRV) feature, which most disks don't support. But at least with ZFS you get a decent chance of getting the data back intact even if some of it ended up as a phantom write.

If you have Seagate disks that support the feature you can
enable Write-Read-Verify at disk level. I wrote a patch for hdparm for
toggling the feature.
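For reference, toggling it looks roughly like this - assuming an hdparm build that carries the Write-Read-Verify patch (the exact flag spelling may differ by version, so check hdparm(8) on your system; /dev/sdX is a placeholder):

```shell
# Enable WRV on a drive whose firmware supports it (Seagates, mostly).
# Flag name assumed from the patched hdparm; verify against your man page.
hdparm --write-read-verify 1 /dev/sdX

# Confirm the drive reports the feature in its identify data.
hdparm -I /dev/sdX | grep -i write-read-verify
```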

Only 4 small SAS disks are Seagates (I only put two of them in), the
rest is WD SATAs --- and I'm starting to suspect that the RAID
controller in the server doesn't like the WD disks at all, which causes
the crashes.  Those disks weren't made at all for this application.

This is another problem with clever controllers, especially hardware RAID. RAID controllers typically wait around 8-9 seconds for the disk to return the data; if it doesn't, they kick the disk out of the array. A while back, most disks shipped with Time Limited Error Recovery (TLER). Nowadays most disks ship with feature-crippled firmware so that manufacturers can charge extra for disks which have TLER enabled (e.g. WD Reds do, other WDs don't).

My 1TB drives have it (got some from all manufacturers, but most are up to 5 years old). Recent drives generally don't unless they are the ones marketed for use in NAS applications. HGST are one exception to the rule - I have a bunch of their 4TB drives, and they only make one 4TB model, which has TLER. Most other manufacturers make multiple variants of the same drive, and most are selectively feature-crippled.
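On drives where the feature isn't firmware-crippled, you can query and set the error recovery timers (ATA calls this ERC, SCT Error Recovery Control) with smartctl; /dev/sdX is a placeholder:

```shell
# Show the current read/write error recovery timers, if supported.
smartctl -l scterc /dev/sdX

# Set both timers to 7.0 seconds (units of 100 ms), comfortably under
# the ~8 second timeout after which RAID controllers drop the disk.
smartctl -l scterc,70,70 /dev/sdX
```

Note that on many desktop drives this setting doesn't persist across a power cycle, so it needs re-applying at boot.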

Xen-users mailing list
