Tuesday, February 5, 2013

DRBD mirrored disk

So today I'm going to talk about how to set up replication between two servers. In my case I've created an LVM RAID10 volume on both servers, and now I want to use DRBD to keep the block devices synchronized and therefore protect against a server failure.

We start by installing the DRBD packages. I've found it easiest to just use the elrepo repository: http://elrepo.org/. Follow the instructions there to set up the repository in yum.

You then install the userland tools and the kernel module. If you have a recent enough kernel the kmod package may not be necessary; check the DRBD website to be sure.

yum install -y drbd84-utils kmod-drbd84

Next up we need to configure the DRBD resource that we want to mirror. In our case we want to keep /home mirrored, so we create a new resource file in /etc/drbd.d/ called home.res with the following contents:


resource home {
  device /dev/drbd1;
  disk /dev/vg0/home;
  meta-disk internal;
  on se3 {
    address 192.168.1.243:7789;
  }
  on se5 {
    address 192.168.1.245:7789;
  }
}

What this says is that DRBD will use the local device /dev/vg0/home and expose a new block device /dev/drbd1, which is the one you should then use. In addition we ask for internal metadata, which means the metadata is kept on the same device as the data. This, however, means that you will have to re-create the filesystem on the device. If you want to create a DRBD mirror of an already existing filesystem, you have to use external metadata, and I suggest you consult Chapter 17 (DRBD Internals) of the DRBD documentation, which discusses metadata usage. There are also downsides to keeping the metadata on the same block device if that device is a single disk rather than a RAID volume, so take this into consideration.
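For reference, an external-metadata variant of the same resource would look roughly like the sketch below; /dev/sdb1 is just a placeholder for whatever dedicated partition you would keep the metadata on, and the [0] index is optional (check the drbd.conf man page for the exact semantics on your version):

resource home {
  device /dev/drbd1;
  disk /dev/vg0/home;
  meta-disk /dev/sdb1[0];   # metadata on a separate partition instead of internal
  on se3 {
    address 192.168.1.243:7789;
  }
  on se5 {
    address 192.168.1.245:7789;
  }
}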

So our home.res describes a DRBD resource called home on two servers, with their respective IPs and the port on which to talk to the other host. A separate replication/heartbeat network is preferred if possible, but it works quite well over a single network interface that is also used for other traffic.

Now that we have described the resource we have to enable it for the first time. Choose one of the nodes and ask DRBD to create the resource's metadata:

# drbdadm create-md home

As we use internal metadata it will warn you that you are possibly destroying existing data. Remember that you really are if you had anything previously stored on the device; in that case you had better consult the DRBD manual on how to set things up with external metadata.

Now that the metadata is created we can bring the resource up. For that we first load the drbd kernel module and then ask for the resource to be brought up:

# modprobe drbd
# drbdadm up home

We can now check the status:


[root@se5 drbd.d]# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R5, 2012-09-06 08:15:57

 1: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----s
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:2147483608


So right now it shows that it's waiting for a connection to the other host (we've not yet installed or configured it there) and it considers itself inconsistent, which is normal at this point.
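As a side note, drbd-utils also ships some commands that give a more compact view of the same state; a quick sketch (nothing here is required for the setup):

# drbd-overview        # one-line summary per resource
# drbdadm cstate home  # connection state (e.g. WFConnection, Connected)
# drbdadm role home    # local/peer roles (e.g. Secondary/Unknown)
# drbdadm dstate home  # disk state (e.g. Inconsistent/DUnknown)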

Now let's install drbd on the other server and copy over the configuration so that we can also bring up the other part of our distributed mirror.

rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://elrepo.org/elrepo-release-5-4.el5.elrepo.noarch.rpm
yum install -y drbd84-utils kmod-drbd84
scp se5:/etc/drbd.d/home.res /etc/drbd.d/home.res
drbdadm create-md home
modprobe drbd
drbdadm up home

Now that we've repeated the steps, the two nodes should be in communication, but still unable to decide which one is the master from which to do the initial synchronization:


[root@se3 ~]# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R5, 2012-09-06 08:15:57

 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:2147483608

So now let's pick one server (in our case se3) to act as the primary and force DRBD to consider it such:

drbdadm primary --force home

Now DRBD knows which node to copy the data from (even though we haven't really created any data on it yet) and starts the synchronization:


[root@se3 ~]# cat /proc/drbd
version: 8.4.2 (api:1/proto:86-101)
GIT-hash: 7ad5f850d711223713d6dcadc3dd48860321070c build by dag@Build64R5, 2012-09-06 08:15:57

 1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:4025152 nr:0 dw:0 dr:4025152 al:0 bm:245 lo:1 pe:13 ua:0 ap:0 ep:1 wo:f oos:2143460504
[>....................] sync'ed:  0.2% (2093220/2097148)M
finish: 5:59:58 speed: 99,236 (82,104) K/sec

You can already create a filesystem on the new device, mount it and start using it, but remember that it is still syncing, so performance may be degraded until the synchronization completes.
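For example (a sketch; the mount point is an assumption, and note that the filesystem goes on the DRBD device, not directly on /dev/vg0/home):

# mkfs.ext4 /dev/drbd1
# mount /dev/drbd1 /home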

What I have not covered here are the default settings and what they mean. I recommend reading up on the DRBD website, but in short the default protocol DRBD uses (protocol C) means synchronous writes: while the devices are connected, any write to the primary also initiates a write on the secondary, and the OK is only returned once both have completed. Replication can be made asynchronous if needed, but that requires changing the protocol. One can also set up an active-active (i.e. primary/primary) configuration allowing both devices to be used at the same time, but caution is needed in such setups to avoid split brain situations. In our case we want an active-passive configuration with NFS server failover, but that we will discuss in a separate post.
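As a rough sketch of where that knob lives (the exact placement has changed between DRBD versions, so double-check the drbd.conf man page for yours), it would look something like this in the resource definition:

resource home {
  net {
    protocol C;    # synchronous: a write is acknowledged only once both nodes have it (the default)
    # protocol A;  # asynchronous: acknowledged after the local write and handing the data to the network
  }
  # device/disk/on sections as in the resource file above
}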


Optimizing for IOPS

So we recently discovered that if a user accidentally forgets to make his or her jobs use the local scratch space on the worker nodes, it's not too hard to overload a shared /home filesystem with IOPS. Right now the home filesystem is a basic 2-drive RAID0 configuration that is mirrored to a backup host with DRBD (I might describe that setup in a later post). While the SATA drives with cache enabled seemed able to handle a huge number of IOPS (the cluster averaged 700-2000 IOPS at the time), the iostat output showed that the drives were saturating most of the time:

Device:         rrqm/s   wrqm/s    r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdx              49.00  3047.00 222.00  888.00     6.25    15.37    39.90     4.02    3.50   0.85  94.30
sdy              26.00  2930.00 226.00  825.00     6.16    14.67    40.59     2.30    2.25   0.77  81.20
md1               0.00     0.00 518.00 7690.00    12.23    30.04    10.55     0.00    0.00   0.00   0.00

You can see this output with iostat -dmx sdx sdy md1 1

where -d means report device utilization, -m shows the numbers in MB instead of blocks, and -x shows extended statistics. The list of devices limits the output to just those (otherwise it reports all of them, and this node had 24 drives), and the final 1 means it reports in 1-second intervals. The first report is always the average since system boot; the following ones show the results for each interval.

So while the shared home doesn't really need much capacity (a few TB are fine), it does need IOPS if it is to stay responsive while cluster jobs hit it and users do interactive things like compilation and git operations, all of which generate a reasonable amount of IOPS without requiring much bandwidth. So we decided to repurpose a few new storage servers. Each server has 36 drives of 3TB (2.73TB usable), so the raw capacity is ca 98.3TB. The usual way we exploit this is to mount each drive separately and give it to Hadoop as a data brick; this way, if a single drive fails we only lose that part and Hadoop will replicate the blocks elsewhere. We still want to do that, so we configured the whole thing using LVM.

First we create 36 PVs (physical volumes), which basically tell the volume manager which backing store units it can use.

for i in `cat list`; 
do 
  pvcreate /dev/$i; 
done

Here I used a file called list holding all 36 device names (sdc, sdd, etc.) to ease the creation.
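If you want to build such a list quickly, something along these lines could work; the glob patterns are just an assumption for this particular box, so verify that the result matches exactly the drives you intend to use:

ls /dev/sd[c-z] /dev/sda[a-l] 2>/dev/null | sed 's|^/dev/||' > list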

We then create one large volume group on top of those 36 drives, allowing logical volumes to span the whole system.

vgcreate vg0 `cat list|xargs -i echo -n "/dev/{} "`

Again, we just parse the list file to add all the devices to the command. Basically it looks like:

vgcreate vg0 /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi ...

So now we have one huge volume group that can take all that we need:


# vgs
  VG   #PV #LV #SN Attr   VSize  VFree 
  vg0   36   0   0 wz--n- 98.23T 98.23T

So let's first create the RAID10 mirror for high IOPS. As I want it to span all the drives, I have to tell lvcreate to keep the mirror log in memory, since there are no spare drives left to hold it. If you create a smaller LV you can drop the --mirrorlog core part and it will just use a few MB on some other drive in the VG to keep the log.

lvcreate --mirrors 1 --stripes 18 --mirrorlog core -L 2T -n home vg0

This creates a mirror where each side is striped across 18 drives, effectively a RAID10. The size is 2TB, the mirror log is kept in core memory, and the volume is named home (i.e. /dev/vg0/home) and created on the volume group vg0.
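If you want to double-check which physical drives ended up under each mirror leg, listing the devices of the (sub)volumes should show it, for example:

lvs -a -o +devices vg0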

One can see the status with the lvs command:

# lvs
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert
  home vg0  mwi-a- 2.00T                          0.01    

You'll notice that it has started synchronizing the volume (the Copy% column). It will take a while, and during that time the drives will be somewhat busy, so if you want to measure performance it's better to wait until the synchronization has completed.
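To keep an eye on the progress, something as simple as this works (just a convenience):

watch -n 60 'lvs vg0'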

For further details on the volume you can use the lvdisplay command:

# lvdisplay 
  --- Logical volume ---
  LV Name                /dev/vg0/home
  VG Name                vg0
  LV UUID                xhqUt8-vkbv-UUnZ-Hcvu-8ADX-8y8V-dtwucm
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                2.00 TB
  Current LE             524304
  Mirrored volumes       2
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

Once the mirror has finished synchronizing we can do some performance measurements. A sample tool to use for generating load is iozone (first you have to make a filesystem on the new RAID and mount it, though).

So after mkfs.ext4 -m 0 /dev/vg0/home and mount /dev/vg0/home /mnt we have the setup we need.
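For reference, an iozone run along these lines is what I mean below by the basic -a option (a sketch; run it from the freshly mounted filesystem):

cd /mnt
iozone -a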

Running iozone in one window with the basic -a option (the automatic mode that cycles through the various tests), we run iostat in another window. If you want to figure out from the iostat output which device to monitor for the RAID volume, look at the lvdisplay output: it says block device 253:2. You can then look at /proc/diskstats to find what name it goes by:

 253    2 dm-2 487 0 3884 22771 52117628 0 416941024 3330370634 0 477748 3330507046
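One way to pull out just that line (assuming major 253 and minor 2 as reported above) would be:

awk '$1 == 253 && $2 == 2' /proc/diskstats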

So our 253:2 corresponds to the dm-2 device. You can then run, for example:

iostat -dmx sdd sdm sdal dm-2 10

This shows three individual drives and the full volume in 10-second intervals, with extended statistics and sizes in MB. Here's a sample output after letting it run for a few iterations:


Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdd               0.00  3290.70  0.00 122.10     0.00    12.80   214.77    18.39  147.23   4.33  52.88
sdm               0.00  3521.30  0.80 121.50     0.07    13.70   230.47    17.91  143.31   4.42  54.08
sdal              0.00  3299.10  0.00 108.40     0.00    12.83   242.30    14.85  133.63   5.27  57.10
dm-2              0.00     0.00  0.50 65955.50     0.00   257.64     8.00 28801.15  400.63   0.01  60.08

As you can see I caught it during the write tests: the RAID wrote about 260MB/s and was about 60% utilized during the 10-second measurement window, averaging almost 66,000 write requests per second to the volume. The OS was able to merge many of those: the wrqm/s column shows 3300-3500 write requests per drive being merged down to the ca 110-120 actual writes seen in the w/s column. So thanks to merging of write requests the volume was able to take almost 66,000 IOPS, while the actual drives average around 120-150 IOPS each, which is about what you'd expect from ordinary 7200 rpm SATA drives. In a 36-drive volume you can therefore expect read IOPS to go as high as ca 5400 to the metal (and thanks to merging of requests probably 10-30x that) and about half of that for writes, since every write lands on both mirror legs. In any case it gives you capacity AND IOPS. One doesn't necessarily need to invest in SSDs immediately, though we do have a few SSDs in our system as well.

Here's an example from running iozone on one of those SSDs:


Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00 30508.00  0.00 481.00     0.00   118.58   504.88     5.08    9.99   1.60  77.20

As you can see it was able to merge over 30k write requests per second into 480 actual writes/s all by itself, so SSDs are impressive, but we reached twice that in our RAID10 config using ordinary SATA drives.

I'll cover setting up the DRBD mirroring of this data and possibly a clustered NFS server in a separate post.