Building active-active RedHat 6 Cluster with GFS2 over DRBD

Prepare nodes

Install RH6 with a minimal configuration on two VMs (see the HOWTO on aligning VMware Linux VMDK files). Add a (vmdk) disk to each node for the application. My configuration looks as follows (on both nodes):

/dev/sda	128m	-> First partition /dev/sda1 used for /boot
/dev/sdb	8g	-> Whole disk used as PV for rootvg
/dev/sdc	30g	-> Whole disk used as PV for GFS2 shared file system

Build a single /etc/hosts containing all IPs and names relevant to the cluster, and put it on both nodes.
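
Before going further it can help to confirm that every cluster name really made it into the file on each node. The following is only a minimal sketch; the helper name missing_hosts is my own invention, and it takes the file as a parameter so that it can be tried on a copy first:

```shell
# missing_hosts FILE NAME...
# Print every NAME that has no entry (as hostname or alias) in FILE.
# Returns 0 only when nothing is missing. (Hypothetical helper.)
missing_hosts() (
    file=$1; shift
    rc=0
    for name in "$@"; do
        # Strip comments, then look for NAME in any column after the IP.
        if ! sed 's/#.*//' "$file" |
             awk -v n="$name" '{ for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                               END { exit !found }'; then
            echo "$name"
            rc=1
        fi
    done
    exit $rc
)
```

For example, "missing_hosts /etc/hosts vorh6t01.domain.com vorh6t02.domain.com" should print nothing on both nodes.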

Copy host SSH keys from one node to another:

vorh6t01 # scp vorh6t02:/etc/ssh/ssh_host_\* /etc/ssh/
...
vorh6t01 # service sshd restart

Generate root SSH keys and exchange them between the cluster nodes:

vorh6t01 # ssh-keygen -t rsa -b 1024 -C "root@vorh6t"
.....
vorh6t01 # cat .ssh/id_rsa.pub >> .ssh/authorized_keys
vorh6t01 # scp -pr .ssh vorh6t02:

Installing DRBD software

DRBD (Distributed Replicated Block Device) will make the two /dev/sdc disks, one dedicated to each node, behave like shared storage. This makes our VMs storage independent and allows the RH cluster to work.

There is still no binary distribution for RH6; however, you can purchase it with support from its author, LINBIT. You can also still compile it from source (thanks to the GPL):

# yum install make gcc kernel-devel flex rpm-build libxslt
# cd /tmp && wget -q -O - http://oss.linbit.com/drbd/8.4/drbd-8.4.4.tar.gz | tar zxvf -
# cd drbd-8.4.4/
# ./configure --with-utils --with-km --with-udev --with-rgmanager --with-bashcompletion \
              --prefix=/usr --localstatedir=/var --sysconfdir=/etc
# make
# make install

Note: You have to recompile the kernel module every time you upgrade the kernel:

# make module
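
To notice a forgotten rebuild before DRBD fails to load, one can check whether a drbd module exists for a given kernel version. This is just a sketch; the function name is hypothetical, and the second parameter exists only so the check can be pointed at a test directory instead of /lib/modules:

```shell
# drbd_module_built KVER [MODROOT]
# Succeed if a drbd kernel module file exists for kernel version KVER.
# MODROOT defaults to /lib/modules. (Hypothetical helper.)
drbd_module_built() {
    find "${2:-/lib/modules}/$1" -name 'drbd.ko*' 2>/dev/null | grep -q .
}

# Example, left commented out so the snippet has no side effects:
# drbd_module_built "$(uname -r)" || echo "rebuild the DRBD module for $(uname -r)"
```

Keep in mind that uname -r reports the running kernel; right after a kernel upgrade, pass the version of the newly installed kernel instead.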

Configuring DRBD

DRBD will not be part of the cluster in this configuration. It only supplies the infrastructure for the GFS2 cluster running on top of it. So it works on the raw disk, is configured as active-active, and provides a raw block device that simulates a shared storage disk.

You can put everything in /etc/drbd.conf; however, the practice recommended by LINBIT is to separate the common and the resource configuration using include directives:

# cat /etc/drbd.conf
# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example

include "drbd.d/global_common.conf";
include "drbd.d/*.res";

Copy global_common.conf from the distribution to /etc/drbd.d and edit it to fit your needs.

# cat /etc/drbd.d/global_common.conf 
global {
        usage-count no;
}

common {
        handlers { }

        startup {
                wfc-timeout 300;
                degr-wfc-timeout 0;
                become-primary-on both;
        }

        options { }

        disk { }

        net {
                protocol        C;
                cram-hmac-alg   sha1;
                shared-secret   "9szdFmSkQEoXU1s7UNVbpqYrhhIsGjhQ4MxzNeotPku3NkJEq3LovZcHB2pITRy";
                use-rle yes;
                allow-two-primaries yes;
        }
}

Some security is not a bad idea: set "shared-secret" (together with "cram-hmac-alg") so that the peers authenticate each other.
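
If you need a fresh secret, something like the following can produce one. It is a sketch, not the only way to do it; the function name is hypothetical, and the length of 63 is chosen to stay within the 64 characters that drbd.conf allows for shared-secret:

```shell
# gen_drbd_secret
# Print a random 63-character alphanumeric string for use as shared-secret.
# 4096 random bytes leave far more than 63 characters after filtering.
gen_drbd_secret() {
    head -c 4096 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-63
}
```

Paste the printed string between the quotes of shared-secret; the same value must be present on both nodes.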

# cat /etc/drbd.d/export.res
resource export {
        device    /dev/drbd1;
        disk      /dev/sdc;
        meta-disk internal;

        disk {
                resync-rate 40M;
                fencing resource-and-stonith;
        }
        net {
                csums-alg sha1;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        handlers {
                fence-peer      "/usr/lib/drbd/rhcs_fence";
        }

        on vorh6t01.domain.com { address   10.10.10.240:7789; }
        on vorh6t02.domain.com { address   10.10.10.241:7789; }
}

I've added a dedicated NIC on the 10.10.10.0/24 LAN to both VMs, for replication use only.

The disk is referred to here as /dev/sdc. This name is not guaranteed to persist between reboots, so you can use any other (more persistent) reference that you find in /dev/disk/by-{id,label,path,uuid}.

Replicate the configuration to the second node:
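
To find the persistent names of a given disk, one can look for the symlinks that resolve to it. A minimal sketch follows; the function name is hypothetical, and the directory is a parameter so that by-path or by-uuid can be searched as well:

```shell
# persistent_names DEV [BYDIR]
# List the symlinks under BYDIR (default /dev/disk/by-id) that resolve to DEV.
# (Hypothetical helper; relies on readlink -f as found on Linux.)
persistent_names() (
    target=$(readlink -f "$1") || exit 1
    for link in "${2:-/dev/disk/by-id}"/*; do
        [ -L "$link" ] || continue
        [ "$(readlink -f "$link")" = "$target" ] && echo "$link"
    done
    exit 0
)
```

For example, "persistent_names /dev/sdc /dev/disk/by-path" prints candidates that could replace /dev/sdc in the "disk" line of export.res.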

root@vorh6t01:~ # scp -pr /etc/drbd.* root@vorh6t02:/etc/

Initialize DRBD:

root@vorh6t01:~ # drbdadm create-md export
...
root@vorh6t02:~ # drbdadm create-md export
...
root@vorh6t01:~ # drbdadm up export
root@vorh6t02:~ # drbdadm up export
# cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 74402fecf24da8e5438171ee8c19e28627e1c98a build by root@vorh6t01.domain.com, 2014-03-18 12:05:58

 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:31456284

As you can see, it is in the Connected state, and both sides are marked as Secondary and Inconsistent.

Let's help DRBD make a decision:

root@vorh6t01:~ # drbdadm primary --force export
root@vorh6t01:~ # cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 74402fecf24da8e5438171ee8c19e28627e1c98a build by root@vorh6t01.domain.com, 2014-03-18 12:05:58

 1: cs:SyncSource ro:Primary/Secondary ds:UpToDate/Inconsistent C r-----
    ns:2169856 nr:0 dw:0 dr:2170520 al:0 bm:132 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:27475996
        [>...................] sync'ed:  7.4% (26832/28948)M
        finish: 0:11:03 speed: 41,416 (27,464) K/sec

OK, vorh6t01 has become Primary and UpToDate, and synchronization has begun.

Wait for the initial synchronization to finish, then tell the second node to become primary too:

root@vorh6t02:~ # cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 74402fecf24da8e5438171ee8c19e28627e1c98a build by root@vorh6t02.domain.com, 2014-09-23 07:12:46

 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:31456284 al:0 bm:1920 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
root@vorh6t02:~ # drbdadm primary export
root@vorh6t02:~ # cat /proc/drbd
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 74402fecf24da8e5438171ee8c19e28627e1c98a build by root@vorh6t02.domain.com, 2014-09-23 07:12:46

 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:31456948 al:0 bm:1920 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
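
The waiting step above can be scripted instead of watching /proc/drbd by hand. This is a sketch under the assumption that /proc/drbd keeps the layout shown in the listings above; both function names are hypothetical:

```shell
# drbd_dstate MINOR
# Read /proc/drbd-style text on stdin and print the ds: field
# (local/peer disk state) of the given minor number.
drbd_dstate() {
    awk -v m="$1" '
        $1 == (m ":") {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ds:/) { print substr($i, 4); exit }
        }'
}

# wait_drbd_sync MINOR [INTERVAL]
# Poll /proc/drbd every INTERVAL seconds (default 10) until both sides
# of MINOR report UpToDate. Loops forever if DRBD is not loaded.
wait_drbd_sync() {
    until [ "$(drbd_dstate "$1" < /proc/drbd)" = "UpToDate/UpToDate" ]; do
        sleep "${2:-10}"
    done
}
```

On vorh6t02 this would be used as "wait_drbd_sync 1 && drbdadm primary export".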

Fix the chkconfig line of the /etc/init.d/drbd script on both nodes. Also remove any hint lines between ### BEGIN INIT INFO and ### END INIT INFO. This fix moves the drbd start/stop to the correct place (for RH6), between network and clvmd.

...
# chkconfig: 2345 23 77
...
### BEGIN INIT INFO
# Provides: drbd
### END INIT INFO
...
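
The same edit can be done with a small filter rather than by hand. This is a sketch that assumes the script has a "# chkconfig:" line and one INIT INFO block, as in the excerpt above; the function name is hypothetical, and it writes the result to standard output instead of touching the script:

```shell
# fix_drbd_init FILE
# Print FILE with the chkconfig line set to "2345 23 77" and with every
# line between BEGIN/END INIT INFO removed, except "# Provides:".
fix_drbd_init() {
    awk '
        /^# chkconfig:/            { print "# chkconfig: 2345 23 77"; next }
        /^### BEGIN INIT INFO/     { in_info = 1; print; next }
        /^### END INIT INFO/       { in_info = 0; print; next }
        in_info && !/^# Provides:/ { next }
        { print }
    ' "$1"
}
```

A cautious way to apply it is "fix_drbd_init /etc/init.d/drbd > /tmp/drbd.new", review the difference with diff, and then "cat /tmp/drbd.new > /etc/init.d/drbd" so that the file mode of the script stays as it was.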

Make DRBD start at boot time on both nodes:

# chkconfig --add drbd
# chkconfig drbd on

Cluster software

Install these RPMs on both nodes (with all dependencies):

# yum install lvm2-cluster ccs cman rgmanager gfs2-utils

Setting Cluster

vorh6t01 and vorh6t02 are the two nodes of a cluster named vorh6t. Take care to make all names resolvable by DNS, and add all names to /etc/hosts on both nodes.

Define cluster:

# ccs_tool create -2 vorh6t

The command above creates the /etc/cluster/cluster.conf file. It can be edited by hand and has to be redistributed to every node in the cluster. The -2 option is required for a two-node cluster; the usual configuration assumes more than two nodes, so that quorum is unambiguous.

Open the file and change the node names to the real names. The resulting file should look like this:

<?xml version="1.0"?>
<cluster name="vorh6t" config_version="1">

  <cman two_node="1" expected_votes="1" transport="udpu" />
  <clusternodes>
    <clusternode name="vorh6t01.domain.com" votes="1" nodeid="1">
      <fence>
        <method name="single">
        </method>
      </fence>
    </clusternode>
    <clusternode name="vorh6t02.domain.com" votes="1" nodeid="2">
      <fence>
        <method name="single">
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
  </fencedevices>

  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

I am using transport="udpu" here because my network does not support multicast, and broadcasts are not welcome either. Without this option, my cluster behaves unpredictably. Check:

# ccs_tool lsnode

Cluster name: vorh6t, config_version: 1

Nodename                        Votes Nodeid Fencetype
vorh6t01.domain.com                1    1    
vorh6t02.domain.com                1    2    
# ccs_tool lsfence
Name             Agent

Copy /etc/cluster/cluster.conf to second node:

vorh6t01 # scp /etc/cluster/cluster.conf vorh6t02:/etc/cluster/cluster.conf

You can start the cluster services now to see them working. Start them with /etc/init.d/cman start on both nodes. Check /var/log/messages. See the clustat output:

vorh6t01 # clustat 
Cluster Status for vorh6t @ Thu Sep 27 15:04:58 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vorh6t01.domain.com                                                1 Online, Local
 vorh6t02.domain.com                                                2 Online

vorh6t02 # clustat 
Cluster Status for vorh6t @ Thu Sep 27 15:05:07 2012
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vorh6t01.domain.com                                                1 Online
 vorh6t02.domain.com                                                2 Online, Local

Add the cluster service to the init scripts on both nodes, so that the cluster starts at boot:

# chkconfig --add cman
# chkconfig cman on

This cluster will not manage any resources; it just provides the infrastructure for a shared clustered file system, so no additional configuration is required. Fencing should probably be added, though.

LVM settings

Enable cluster features on both nodes and start clvmd:

# lvmconf --enable-cluster
# /etc/init.d/clvmd start
# chkconfig --add clvmd
# chkconfig clvmd on
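
lvmconf --enable-cluster works by setting locking_type to 3 in /etc/lvm/lvm.conf, so a quick check on each node can confirm that the change is in place. A minimal sketch, with a hypothetical function name:

```shell
# lvm_cluster_locking [CONF]
# Succeed if CONF (default /etc/lvm/lvm.conf) has an uncommented
# "locking_type = 3" line, the setting used for clustered locking.
lvm_cluster_locking() {
    grep -Eq '^[[:space:]]*locking_type[[:space:]]*=[[:space:]]*3([[:space:]]|$)' \
        "${1:-/etc/lvm/lvm.conf}"
}
```

For example: "lvm_cluster_locking || echo 'cluster locking is not enabled'".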

Fix the filter line in /etc/lvm/lvm.conf to explicitly include the drbd device and exclude all others. If LVM locks the underlying device, drbd will not start, and a split brain will occur. Here is an example of my "filter" line, which accepts only the "rootvg" device and the drbd device:

filter = [ "a|^/dev/drbd|", "a|^/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0|", "r/.*/" ]
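
The filter is evaluated left to right, and the first pattern that matches a path decides whether it is accepted ("a") or rejected ("r"). The sketch below imitates that logic for a single path, with the patterns copied from the line above; the function name is hypothetical, and LVM itself may also look at other path names of the same device, which this sketch ignores:

```shell
# lvm_filter_verdict PATH
# Imitate the filter above for one device path: patterns are tried in
# order and the first match wins. Prints "accept" or "reject".
lvm_filter_verdict() {
    if   printf '%s\n' "$1" | grep -Eq '^/dev/drbd'; then
        echo accept
    elif printf '%s\n' "$1" | grep -Eq '^/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:0:0'; then
        echo accept
    else                      # "r/.*/" rejects everything that is left
        echo reject
    fi
}
```

With this filter, /dev/drbd1 is accepted while the backing /dev/sdc is rejected, which is exactly what keeps LVM away from the disk underneath DRBD.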

As recommended by the DRBD documentation, disable the LVM write cache on both nodes by editing /etc/lvm/lvm.conf:

...
write_cache_state = 0
...

and drop stale cache:

# rm -f /etc/lvm/cache/.cache

Create the PV and the clustered LV on one node. Take care to use our /dev/drbd1 device, not the underlying /dev/sdc:

root@vorh6t01:~ # pvcreate --dataalignment 4k /dev/drbd1
  Physical volume "/dev/drbd1" successfully created
root@vorh6t01:~ # vgcreate exportvg /dev/drbd1
  Clustered volume group "exportvg" successfully created
root@vorh6t01:~ # lvcreate -n export -l100%FREE /dev/exportvg
  Logical volume "export" created

Use the pvs, vgs and lvs commands to check that everything exists on the second node too.

GFS2 Settings

Create GFS2 on one node as follows:

vorh6t01:~ # mkfs.gfs2 -p lock_dlm -t vorh6t:export -j 2 /dev/exportvg/export
This will destroy any data on /dev/exportvg/export.
It appears to contain: symbolic link to `../dm-6'

Are you sure you want to proceed? [y/n] y

Device:                    /dev/exportvg/export
Blocksize:                 4096
Device Size                30.00 GB (7863296 blocks)
Filesystem Size:           30.00 GB (7863294 blocks)
Journals:                  2
Resource Groups:           120
Locking Protocol:          "lock_dlm"
Lock Table:                "vorh6t:export"
UUID:                      43b39c8b-cb8b-f7d7-c35d-91a909bc3ade

where vorh6t is the cluster name, export is the file system name, and -j 2 creates two journals, as we have two nodes.
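
The cluster part of the lock table has to match the cluster name in cluster.conf, otherwise the mount will be refused. A small check before running mkfs.gfs2 might look like this; it is a sketch with hypothetical function names, and it assumes the name attribute sits on the same line as the <cluster tag, as in the file shown earlier:

```shell
# cluster_name [CONF]
# Print the name attribute of the <cluster> element in cluster.conf.
cluster_name() {
    sed -n 's/.*<cluster [^>]*name="\([^"]*\)".*/\1/p' \
        "${1:-/etc/cluster/cluster.conf}" | head -n 1
}

# lock_table_ok TABLE [CONF]
# Succeed if the part of TABLE before ":" equals the cluster name.
lock_table_ok() {
    [ "${1%%:*}" = "$(cluster_name "$2")" ]
}
```

For example: "lock_table_ok vorh6t:export || echo 'lock table does not match the cluster name'".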

Then, mount it:

# mkdir /export
# mount -o noatime,nodiratime -t gfs2 /dev/exportvg/export /export
# echo "/dev/exportvg/export   /export  gfs2   noatime,nodiratime   0 0" >> /etc/fstab
# chkconfig --add gfs2 ; chkconfig gfs2 on

The /etc/init.d/gfs2 script, part of gfs2-utils, will mount/umount GFS2 file systems from /etc/fstab at the appropriate time: after the cluster has started and before it goes down.

Testing

Reboot the nodes and check that "/export" is mounted afterwards. If it is not, recheck that you have the correct line in /etc/fstab, the correct "filter" in /etc/lvm/lvm.conf, and that you fixed /etc/init.d/drbd to start/stop between network and clvmd (check the real numbers in /etc/rc.d/rc{1,3}.d). All of these were described above.
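
The check of the start numbers can be scripted as well. The sketch below only looks at the start links (S??name) of one runlevel directory and leaves the stop links to a manual check; the function names are hypothetical:

```shell
# start_prio RCDIR SERVICE
# Print the two-digit start priority taken from the S??SERVICE link.
start_prio() (
    for f in "$1"/S[0-9][0-9]"$2"; do
        [ -e "$f" ] || [ -L "$f" ] || exit 1   # the glob matched nothing
        f=${f##*/S}
        echo "${f%"$2"}"
        exit 0
    done
)

# drbd_order_ok [RCDIR]
# Succeed if drbd starts after network and before clvmd in RCDIR
# (default /etc/rc.d/rc3.d).
drbd_order_ok() (
    dir=${1:-/etc/rc.d/rc3.d}
    net=$(start_prio "$dir" network) || exit 1
    drbd=$(start_prio "$dir" drbd)   || exit 1
    clvm=$(start_prio "$dir" clvmd)  || exit 1
    [ "$net" -lt "$drbd" ] && [ "$drbd" -lt "$clvm" ]
)
```

For example: "drbd_order_ok || echo 'drbd is not started between network and clvmd'".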

Adding VMware fencing

Prerequisites:

# yum install openssl-devel

Install the VI Perl Toolkit on both nodes; VMware sometimes calls it the vSphere SDK, CLI or whatever. It should install /usr/lib/vmware-vcli/apps/ and other tools in /usr/bin. The package called "VMware-vSphere-Perl-SDK-5.5.0*" worked for me.

Fix /etc/cluster/cluster.conf

<?xml version="1.0"?>
<cluster name="vorh6t" config_version="3">

  <cman two_node="1" expected_votes="1" transport="udpu" />
  <clusternodes>
    <clusternode name="vorh6t01.domain.com" votes="1" nodeid="1">
      <fence>
        <method name="single">
                <device name="vmware" port="vorh6t01" />
        </method>
      </fence>
    </clusternode>
    <clusternode name="vorh6t02.domain.com" votes="1" nodeid="2">
      <fence>
        <method name="single">
                <device name="vmware" port="vorh6t02" />
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
        <fencedevice name="vmware" agent="fence_vmware" ipaddr="VCNAME" action="off" login="VCUSER" passwd="PASSWORD" /> 
  </fencedevices>

  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

port is the name of the VM in vCenter (VC); ipaddr is the name or IP of the VC.

Copy the file to the neighbour and propagate the changes:

vorh6t01:~ # scp /etc/cluster/cluster.conf vorh6t02:/etc/cluster/cluster.conf
vorh6t01:~ # cman_tool version -r -S
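
cman_tool version -r only accepts a configuration whose config_version is higher than the one currently running, so the number has to be increased on every edit. The following sketch does the increment mechanically; the function name is hypothetical, and it prints the result instead of changing the file:

```shell
# bump_config_version CONF
# Print CONF with the first config_version="N" replaced by N+1.
bump_config_version() {
    awk '
        !bumped && match($0, /config_version="[0-9]+"/) {
            # config_version=" is 16 characters; the closing quote is 1 more.
            old = substr($0, RSTART + 16, RLENGTH - 17)
            $0 = substr($0, 1, RSTART - 1) "config_version=\"" (old + 1) "\"" substr($0, RSTART + RLENGTH)
            bumped = 1
        }
        { print }
    ' "$1"
}
```

One possible use is "bump_config_version /etc/cluster/cluster.conf > /tmp/cluster.conf.new", followed by a review and a copy into place before the scp and cman_tool steps above.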

Resolving DRBD split brain manually

A split brain may occur while you are playing with the cluster, until it is configured perfectly.

Let's assume the data on node 02 is not important and will be dropped:

vorh6t02:~ # umount /export
vorh6t01:~ # umount /export
vorh6t02:~ # vgchange -a n exportvg
vorh6t02:~ # drbdadm secondary export
vorh6t02:~ # drbdadm connect --discard-my-data export
vorh6t01:~ # drbdadm connect export
vorh6t01:~ # cat /proc/drbd
vorh6t02:~ # drbdadm primary export
vorh6t02:~ # vgchange -ay exportvg
vorh6t02:~ # mount /export
vorh6t01:~ # mount /export

Updated on Thu Dec 11 17:48:22 IST 2014