Tuesday, February 5, 2013

Optimizing for IOPS

So we recently discovered that if a user accidentally forgets to point his/her jobs at the local scratch space on the worker nodes, it's not too hard to overload a shared /home filesystem with IOPS. Right now the home filesystem is a basic 2-drive RAID0 configuration that is mirrored to a backup host with DRBD (I might describe that setup in a later post). While the SATA drives with write cache enabled seemed able to handle a large number of IOPS (the cluster averaged 700-2000 IOPS at the time), the iostat output showed that the drives were saturated most of the time:

Device:         rrqm/s   wrqm/s    r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await  svctm  %util
sdx              49.00  3047.00  222.00  888.00    6.25   15.37    39.90     4.02   3.50   0.85  94.30
sdy              26.00  2930.00  226.00  825.00    6.16   14.67    40.59     2.30   2.25   0.77  81.20
md1               0.00     0.00  518.00 7690.00   12.23   30.04    10.55     0.00   0.00   0.00   0.00

You can see this output with iostat -dmx sdx sdy md1 1

where -d means disk utilization information, -m shows the numbers in MB instead of blocks and -x shows extended statistics. The flags are followed by the list of drives to monitor; otherwise iostat reports on all drives (and this node had 24 of them). The final 1 means it reports in 1-second intervals. The first report is always averaged since system boot; the following ones cover just the interval.
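Under the hood iostat derives these per-interval numbers from the raw counters in /proc/diskstats. As a minimal sketch of that (the counter values below are made up for illustration, but field 8 of a disk line really is the completed-writes counter), here is how a w/s figure falls out of two samples taken one second apart:

```shell
# Two hypothetical /proc/diskstats samples for sdx, one second apart.
# After major, minor and name, field 8 is the completed-writes counter.
sample1="   8   16 sdx 104001 49 832010 22000 500000 3047 41600000 91000 0 61000 113000"
sample2="   8   16 sdx 104223 49 833786 22350 500888 3090 41672000 91310 0 61120 113660"

w1=$(echo "$sample1" | awk '{print $8}')
w2=$(echo "$sample2" | awk '{print $8}')
echo "w/s: $((w2 - w1))"   # matches the 888 w/s seen for sdx above
```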

So while the shared home doesn't really need much capacity (a few TB is fine), it does need IOPS if it is to stay responsive while cluster jobs hit it alongside users doing interactive things like compilation, git operations etc., all of which generate a reasonable amount of IOPS without requiring much bandwidth. So we decided to repurpose a few new storage servers. Each server has 36 drives of 3TB (2.73TB usable), for a raw capacity of ca 98.3TB. The usual way we exploit this is to mount each drive separately and give it to Hadoop as a data brick; that way if a single drive fails we only lose that part and Hadoop replicates the blocks elsewhere. We still want to be able to do that, so we configured this using LVM.

First we create 36 PVs (physical volumes), which tell the volume manager which backing store units it can use.

for i in `cat list`; do
  pvcreate /dev/$i
done

Here I used a file called list to hold all 36 device names (sdc, sdd and so on) to ease the creation.
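If you want to check the loop before touching real devices, a dry run works; here echo stands in for pvcreate so the sketch is safe to run anywhere, and the three device names are just illustrative:

```shell
# Hypothetical "list" file: one device name per line.
printf 'sdc\nsdd\nsde\n' > list

# Dry run of the pvcreate loop; drop the echo to actually run it.
while read -r dev; do
  echo pvcreate "/dev/$dev"
done < list
```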

We then create one large volume group on top of those 36 drives, allowing logical volumes to span the whole system.

vgcreate vg0 `cat list|xargs -i echo -n "/dev/{} "`

Again, we just parse the list file to add all the devices to the command. Basically it looks like:

vgcreate vg0 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi ...
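If the xargs pipeline looks opaque, an equivalent way to build that argument string is with sed (the device names here are again illustrative):

```shell
# Hypothetical "list" file with three device names.
printf 'sdc\nsdd\nsde\n' > list

# Prefix each name with /dev/ and join them with spaces.
args=$(sed 's|^|/dev/|' list | tr '\n' ' ')
echo vgcreate vg0 $args
```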

So now we have one huge volume group that can take all that we need:

# vgs
  VG   #PV #LV #SN Attr   VSize  VFree 
  vg0   36   0   0 wz--n- 98.23T 98.23T

So let's first create the RAID10 mirror for high IOPS. As I want it to span all the drives, I have to tell lvcreate to keep the mirror log in memory, as there is no spare drive left to hold it. If you create a smaller LV, drop the --mirrorlog core part and it will just use a few MB on some other drive in the VG to keep the log.

lvcreate --mirrors 1 --stripes 18 --mirrorlog core -L 2T -n home vg0

This creates a mirror where each side is striped across 18 drives, effectively a RAID10. It is 2TB in size, keeps the mirror log in memory, is named home (i.e. /dev/vg0/home) and is created on the volume group called vg0.
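A quick sanity check of that geometry, using the numbers above:

```shell
# 18 stripes per mirror leg, 2 legs: all 36 drives participate,
# and each drive holds 1/18th of the 2TB leg it belongs to.
stripes=18; mirrors=2; lv_tb=2
echo "drives used: $((stripes * mirrors))"
awk -v s="$stripes" -v t="$lv_tb" \
  'BEGIN { printf "per-drive share: %.1f GB\n", t * 1024 / s }'
```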

One can see the status with lvs command:

# lvs
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert
  home vg0  mwi-a- 2.00T                          0.01    

You'll notice that it has started synchronizing the volume. This takes a while, and during that time the drives will be somewhat busy, so if you want to take any performance measurements it's better to wait until the synchronization has completed.

For further details on the volume you can look with the lvdisplay command:

# lvdisplay 
  --- Logical volume ---
  LV Name                /dev/vg0/home
  VG Name                vg0
  LV UUID                xhqUt8-vkbv-UUnZ-Hcvu-8ADX-8y8V-dtwucm
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                2.00 TB
  Current LE             524304
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

Once the synchronization has completed we can do some performance measurements. A sample tool to generate load is iozone (first you have to make a filesystem on the new RAID and mount it, though).

So after mkfs.ext4 -m 0 /dev/vg0/home and mount /dev/vg0/home /mnt we have the setup we need.
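If you want the volume mounted permanently rather than under /mnt, an fstab entry along these lines would do (the /home mount point and the noatime option are my assumptions, not part of the setup described here):

```
/dev/vg0/home  /home  ext4  defaults,noatime  0  2
```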

We run iozone in one window with the basic -a option (automatic mode, which cycles through various tests) and iostat in another. If you want to figure out which device to monitor for the RAID volume in the iostat output, look at the lvdisplay output above: it says block device 253:2. You can then look at /proc/diskstats to find the name it maps to:

 253    2 dm-2 487 0 3884 22771 52117628 0 416941024 3330370634 0 477748 3330507046

So our 253:2 corresponds to dm-2 device. You can then run for example:
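That lookup can also be scripted by matching on the first two fields of /proc/diskstats; the sample line below is the one shown above:

```shell
# Resolve major:minor 253:2 to its kernel device name.
line=" 253    2 dm-2 487 0 3884 22771 52117628 0 416941024 3330370634 0 477748 3330507046"
echo "$line" | awk '$1 == 253 && $2 == 2 { print $3 }'
```

On most systems `ls -l /dev/dm-2` or `dmsetup ls` will show the same mapping.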

iostat -dmx sdd sdm sdal dm-2 10

this shows three individual drives and the full volume in 10-second intervals, with extended statistics and sizes in MB. Here's a sample output after letting it run for a few iterations:

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdd               0.00  3290.70  0.00 122.10     0.00    12.80   214.77    18.39  147.23   4.33  52.88
sdm               0.00  3521.30  0.80 121.50     0.07    13.70   230.47    17.91  143.31   4.42  54.08
sdal              0.00  3299.10  0.00 108.40     0.00    12.83   242.30    14.85  133.63   5.27  57.10
dm-2              0.00     0.00  0.50 65955.50     0.00   257.64     8.00 28801.15  400.63   0.01  60.08

As you can see I caught it during the write tests: the RAID wrote about 260MB/s and was about 60% utilized during the 10-second measurement window. It averaged almost 66000 write requests per second to the volume, and the OS was able to merge many of those: the wrqm/s column shows 3300-3500 merged write requests per drive collapsing into the ca 110-120 actual writes seen in the w/s column. So thanks to the merging of write requests the volume was able to take almost 66000 IOPS while the actual drives averaged around 120-150 IOPS each, which is about what you'd expect from ordinary 7200 rpm SATA drives. In a 36-drive volume you can therefore expect read IOPS to go as high as 5400 to the metal (and thanks to merging of requests probably 10-30x that) and about half of that for writes. In any case it gives you capacity AND IOPS. One doesn't necessarily need to invest in SSDs immediately, though we do have a few SSDs in our system as well.
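The back-of-envelope IOPS budget from the paragraph above, spelled out (150 IOPS per drive is the rough figure for 7200 rpm SATA used here):

```shell
drives=36; per_drive_iops=150
reads=$((drives * per_drive_iops))   # every spindle can serve reads
writes=$((reads / 2))                # each write lands on both mirror legs
echo "raw read IOPS:  $reads"
echo "raw write IOPS: $writes"
```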

Here's an example from running iozone on one of those SSDs:

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00 30508.00  0.00 481.00     0.00   118.58   504.88     5.08    9.99   1.60  77.20

As you can see it was able to merge over 30k requests/s into 480 writes/s all by itself, so SSDs are impressive, but we reached twice that in our RAID10 config using ordinary SATA drives.

I'll cover the making of DRBD mirroring of the data and possibly setting up a clustered NFS server in a separate post.
