Building GFS2 cluster on RedHat (6.9)

This is the second edition of Building cluster with shared FS (GFS2) on RedHat6.

Prepare nodes

This time, the POC will be built in a KVM environment. Two nodes with two shared disks connected via multipath will simulate a complete SAN environment. Read the KVM recipes to see how to implement it; here we will assume that the SAN is already simulated properly.

One shared disk (1G in size) will hold the data and will be formatted as GFS2. The second disk (10M) will be used as a quorum disk to resolve a split-brain caused by a network failure.

There is no DNS in this POC, so put the node names into /etc/hosts.

root@node1:~ # cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.122.198 node1
192.168.122.181 node2

Generate root SSH keys and exchange them between the cluster nodes:

root@node1:~ # ssh-keygen -t rsa -b 1024 -C "root@vorh6t0x"
.....
root@node1:~ # cat .ssh/id_rsa.pub >> .ssh/authorized_keys
root@node1:~ # scp -pr .ssh node2:

Do not forget to disable the firewall (iptables) and SELinux.
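
For reference, one way to do that on both nodes (setenforce switches SELinux to permissive immediately; the sed line disables it at the next boot):

# service iptables stop ; chkconfig iptables off
# service ip6tables stop ; chkconfig ip6tables off
# setenforce 0
# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config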

Software installation

Install these RPMs on both nodes (with all dependencies):

# yum install openssh-clients wget rsync ntp ntpdate vim-common gfs2-utils \
      device-mapper-multipath pcs pacemaker fence-agents lvm2-cluster

This list includes not only the cluster software, but also the packages required by this POC that are not present in a minimal installation of the OS.

Setting up Cluster

RedHat 7 uses pacemaker with corosync as its default cluster software. Prior to this, rgmanager and cman were used instead. The first edition of this article covered the configuration of rgmanager and cman. This time we will use pacemaker with corosync, on RedHat 6.9.

It is important to understand what we expect from the cluster when it has to resolve a split-brain.

In addition to the usual fencing, I'm going to add a quorum disk that will help resolve the split-brain caused by a network failure. This additional vote will help the cluster understand which node should be fenced.

Authorizing pcsd

I do not like the password-based method recommended by the vendor. Here's a workaround that avoids using a password:

root@node1:~ # /etc/init.d/pcsd start
Starting pcsd:                                             [  OK  ]
root@node1:~ # /etc/init.d/pcsd stop 
Stopping pcsd:                                             [  OK  ]
root@node1:~ # cd /var/lib/pcsd 
root@node1:/var/lib/pcsd # ll
total 12
-rwx------. 1 root root   60 Oct 10 18:04 pcsd.cookiesecret
-rwx------. 1 root root 1180 Oct 10 18:04 pcsd.crt
-rwx------. 1 root root 1679 Oct 10 18:04 pcsd.key

We started and immediately stopped the pcsd daemon. As a result, some files were created in the /var/lib/pcsd directory. The next step is to create the missing authorization files:

root@node1:/var/lib/pcsd # TOKEN=$(uuidgen)
root@node1:/var/lib/pcsd # cat > pcs_users.conf << EOFcat
[
 {
   "creation_date": "$(date)",
   "username": "hacluster",
   "token": "$TOKEN"
 }
]
EOFcat

root@node1:/var/lib/pcsd # cat > tokens << EOFcat
{
  "format_version": 2,
  "data_version": 2,
  "tokens": {
    "node1": "$TOKEN",
    "node2": "$TOKEN"
  }
}
EOFcat
root@node1:/var/lib/pcsd # chmod 600 tokens
root@node1:/var/lib/pcsd # ll
total 20
-rw-r--r--. 1 root root  141 Oct 10 18:06 pcs_users.conf
-rwx------. 1 root root   60 Oct 10 18:04 pcsd.cookiesecret
-rwx------. 1 root root 1180 Oct 10 18:04 pcsd.crt
-rwx------. 1 root root 1679 Oct 10 18:04 pcsd.key
-rw-------. 1 root root  224 Oct 10 18:07 tokens

Finally, copy the entire /var/lib/pcsd directory to the other node, then enable and start the pcsd daemon:

root@node1:~ # rsync -a /var/lib/pcsd/ node2:/var/lib/pcsd/
root@node1:~ # for h in node{1,2} ; do
 ssh $h "chkconfig pcsd on ; /etc/init.d/pcsd start"
done

Verify that the authorization works:

root@node1:~ # pcs cluster auth node1 node2
node1: Already authorized
node2: Already authorized

Initial cluster configuration

root@node1:~ # pcs cluster setup --start --enable --name mycluster node1 node2 --transport udpu
Warning: Using udpu transport on a RHEL 6 cluster, cluster restart is required after node add or remove
Destroying cluster on nodes: node1, node2...
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node2: Successfully destroyed cluster
node1: Successfully destroyed cluster

Sending cluster config files to the nodes...
node1: Updated cluster.conf...
node2: Updated cluster.conf...

Starting cluster on nodes: node1, node2...
node1: Starting Cluster...
node2: Starting Cluster...
node1: Cluster Enabled
node2: Cluster Enabled

Synchronizing pcsd certificates on nodes node1, node2...
node1: Success
node2: Success

Restarting pcsd on the nodes in order to reload the certificates...
node1: Success
node2: Success

I am using transport="udpu" here because my network does not support multicast, and broadcasts are not welcome either. Without this option, my cluster behaves unpredictably.

Check the results:

root@node1:~ # pcs status 
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false
Stack: cman
Current DC: node1 (version 1.1.15-5.el6-e174ec8) - partition with quorum
Last updated: Tue Oct 10 18:15:54 2017          Last change: Tue Oct 10 18:14:51 2017 by root via crmd on node1

2 nodes and 0 resources configured

Online: [ node1 node2 ]

No resources


Daemon Status:
  cman: active/disabled
  corosync: active/disabled
  pacemaker: active/enabled
  pcsd: active/enabled

The status shows cman used as the stack. This is good, because CLVM only knows how to work with cman. However, you can also see a running corosync process; it is started and controlled by cman, so there is no need to configure corosync separately.
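
If you want to double-check that cman is driving the membership layer, the cman_tool utility (from the cman package) reports the cluster state, nodes and votes:

root@node1:~ # cman_tool status
root@node1:~ # cman_tool nodes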

Multipath and LVM setup

We will create the LVM structure on a multipath device, rather than on the underlying SCSI disks. It is important to configure the correct filter string in the /etc/lvm/lvm.conf file to explicitly include the multipath devices and exclude everything else, otherwise you will get "Duplicate PV found" warnings and LVM may decide to use a single-path disk instead of the multipath device. Here is an example of my "filter" line, allowing only the "rootvg" device and the multipath devices:

filter = [ "a|^/dev/vda2$|", "a|^/dev/mapper/pv_|", "r|.*|" ]

Replicate the configuration file to the other node:

root@node1:~ # rsync -a /etc/lvm/lvm.conf node2:/etc/lvm/lvm.conf

Create the /etc/multipath.conf configuration file. As usual, I set up names (aliases) for the multipath devices for easier management:

defaults {
		user_friendly_names		yes
		flush_on_last_del		yes
		queue_without_daemon	no
		no_path_retry			fail
}

blacklist {
		wwid "*"
}

blacklist_exceptions {
        wwid "0QEMU    QEMU HARDDISK   1010101"
        wwid "0QEMU    QEMU HARDDISK   1010102"
}

multipaths {
        multipath {
                wwid    "0QEMU    QEMU HARDDISK   1010101"
                alias   pv_gfs
        }
        multipath {
                wwid    "0QEMU    QEMU HARDDISK   1010102"
                alias   quorum
        }
}

Your wwids will differ from mine. Do not forget to add them to the blacklist_exceptions section too, not only to the aliases list.
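
To find out the wwids of your own disks, one option (assuming the scsi_id utility shipped with udev) is to query each path directly; the wwid of every path also appears in the verbose "multipath -v3" output:

root@node1:~ # /lib/udev/scsi_id --whitelisted --device=/dev/sda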

Replicate the configuration file to the other node:

root@node1:~ # rsync -a /etc/multipath.conf node2:/etc/multipath.conf

Start multipathd and make it start at system boot (on both nodes):

# /etc/init.d/multipathd start
# chkconfig --add multipathd
# chkconfig multipathd on
# multipath -F
# multipath
create: quorum (0QEMU    QEMU HARDDISK   1010102) undef QEMU,QEMU HARDDISK
size=10M features='0' hwhandler='0' wp=undef
|-+- policy='round-robin 0' prio=1 status=undef
| `- 2:0:1:0 sdb 8:16 undef ready running
`-+- policy='round-robin 0' prio=1 status=undef
  `- 3:0:1:0 sdd 8:48 undef ready running
create: pv_gfs (0QEMU    QEMU HARDDISK   1010101) undef QEMU,QEMU HARDDISK
size=1.0G features='0' hwhandler='0' wp=undef
|-+- policy='round-robin 0' prio=1 status=undef
| `- 2:0:0:0 sda 8:0  undef ready running
`-+- policy='round-robin 0' prio=1 status=undef
  `- 3:0:0:0 sdc 8:32 undef ready running

As you can see, my data LUN appears as /dev/mapper/pv_gfs, exactly matching the LVM filter line.

Enable the LVM cluster features on both nodes and start clvmd. Make it start at system boot.

# lvmconf --enable-cluster
# /etc/init.d/clvmd start
# chkconfig --add clvmd
# chkconfig clvmd on
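
The lvmconf --enable-cluster call switches LVM to cluster-wide locking; a quick way to confirm that it took effect on each node is to check for locking_type = 3 in lvm.conf:

# grep -E '^[[:space:]]*locking_type' /etc/lvm/lvm.conf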

Create the PV, clustered VG and LV on one node:

root@node1:~ # pvcreate --dataalignment 4k /dev/mapper/pv_gfs 
  Physical volume "/dev/mapper/pv_gfs" successfully created
root@node1:~ # vgcreate -c y vg_gfs /dev/mapper/pv_gfs
  Clustered volume group "vg_gfs" successfully created
root@node1:~ # lvcreate -n export -l100%FREE /dev/vg_gfs
  Logical volume "export" created.

Check on the second node with the pvs, vgs and lvs commands that everything is visible there too.

root@node2:~ # vgs
  VG     #PV #LV #SN Attr   VSize    VFree 
  rootvg   1   2   0 wz--n-   19.80g 15.89g
  vg_gfs   1   1   0 wz--nc 1020.00m     0 
root@node2:~ # lvs
  LV     VG     Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  slash  rootvg -wi-ao----    2.93g                                                    
  swap   rootvg -wi-ao---- 1000.00m                                                    
  export vg_gfs -wi-a----- 1020.00m

The c attribute in the vgs output indicates that this VG is clustered. If you stop clvmd, the clustered VG will not be shown. Start it again; we will need it in the next step.
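
If you want to see this behaviour for yourself, a quick check on either node (do not forget the final start) looks like this:

root@node2:~ # /etc/init.d/clvmd stop
root@node2:~ # vgs
root@node2:~ # /etc/init.d/clvmd start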

GFS2 Settings

Create GFS2 on one node as following:

root@node1:~ # mkfs.gfs2 -p lock_dlm -t mycluster:export -j 2 /dev/vg_gfs/export 
This will destroy any data on /dev/vg_gfs/export.
It appears to contain: symbolic link to `../dm-4'

Are you sure you want to proceed? [y/n] y

Device:                    /dev/vg_gfs/export
Blocksize:                 4096
Device Size                1.00 GB (261120 blocks)
Filesystem Size:           1.00 GB (261118 blocks)
Journals:                  2
Resource Groups:           4
Locking Protocol:          "lock_dlm"
Lock Table:                "mycluster:export"
UUID:                      729885fb-9052-77c0-ae7b-da37ac498c1c

where mycluster is the cluster name, export is the filesystem name, and -j 2 creates two journals because we have two nodes.

Then, mount it (on both nodes):

# mkdir /export
# mount -o noatime,nodiratime -t gfs2 /dev/vg_gfs/export /export

Copy some data there on node1 and read it back on node2:

root@node1:~ # rsync -av /etc /export/
 ..
root@node2:~ # find /export/ -ls
 ..

Add the GFS2 FS to /etc/fstab on both nodes, like this:

# grep gfs2 /etc/fstab
/dev/vg_gfs/export      /export         gfs2    noatime,nodiratime      0 0
# chkconfig --add gfs2 ; chkconfig gfs2 on

The /etc/init.d/gfs2 script, part of "gfs2-utils", will mount and unmount the GFS2 filesystems listed in /etc/fstab at the appropriate time: after the cluster has started and before it goes down.

Cluster configuration

First of all, stop (on both nodes) all services that we will later define as cluster resources:

# umount /export
# /etc/init.d/clvmd stop

Adding Quorum disk

Our quorum disk has already been defined in the multipath part; it only remains to format it as a quorum device.

root@node1:~ # mkqdisk -c /dev/mapper/quorum -l QD1
mkqdisk v3.0.12.1

Writing new quorum disk label 'QD1' to /dev/mapper/quorum.
WARNING: About to destroy all data on /dev/mapper/quorum; proceed [N/y] ? y
Initializing status block for node 1...
Initializing status block for node 2...
Initializing status block for node 3...
Initializing status block for node 4...
Initializing status block for node 5...
Initializing status block for node 6...
Initializing status block for node 7...
Initializing status block for node 8...
Initializing status block for node 9...
Initializing status block for node 10...
Initializing status block for node 11...
Initializing status block for node 12...
Initializing status block for node 13...
Initializing status block for node 14...
Initializing status block for node 15...
Initializing status block for node 16...
root@node1:~ #

Check that node2 can see the quorum device too:

root@node2:~ # mkqdisk -L
mkqdisk v3.0.12.1

/dev/block/253:3:
/dev/disk/by-id/dm-name-quorum:
/dev/disk/by-id/dm-uuid-mpath-0QEMU\x20\x20\x20\x20QEMU\x20HARDDISK\x20\x20\x201010102:
/dev/dm-3:
/dev/mapper/0QEMU    QEMU HARDDISK   1010102:
/dev/mapper/quorum:
        Magic:                eb7a62c2
        Label:                QD1
        Created:              Wed Oct 11 14:07:55 2017
        Host:                 node1
        Kernel Sector Size:   512
        Recorded Sector Size: 512

root@node2:~ #

There is no tool to define the quorum disk online, so it's time to shut down the cluster:

root@node1:~ # pcs cluster stop --all
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (cman)...
node1: Stopping Cluster (cman)...

Open /etc/cluster/cluster.conf in your favorite text editor and fix it:

 ..
  <cman broadcast="no" expected_votes="2" transport="udpu"/>
  <quorumd interval="1" label="QD1" tko="9" votes="1">
    <heuristic program="ping -c1 -W1 -w1 192.168.122.1" interval="1" score="1" tko="3" />
  </quorumd>
  <totem token="20000"/>
 ..

Find the cman definition and correct it: remove the two_node attribute and increase expected_votes. Then add the quorumd section; its label is what you created in the mkqdisk step. The heuristic ping targets the default gateway; your gateway will be different. The totem token must be large enough to cover the quorumd timeouts.

Copy the configuration file to node2 and start the cluster on both nodes:

root@node1:~ # rsync -a /etc/cluster/cluster.conf node2:/etc/cluster/cluster.conf
root@node1:~ # pcs cluster start                                                 
Starting Cluster...
root@node2:~ # pcs cluster start                                                 
Starting Cluster...
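
Once both nodes are up again, it is worth verifying that the quorum disk is actually counted. If qdiskd registered correctly, cman should now report three expected votes: one per node plus one for the quorum device:

root@node1:~ # cman_tool status | grep -i votes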

Adding fencing

It is time to add fencing to your cluster. Because I am in a KVM environment, I'll use fence_xvm as described in the KVM recipes. You must use fencing that fits your environment. The cluster will not work properly without fencing.

To see all fencing methods available to you:

root@node1:~ # pcs stonith list

Pick one suitable for you and check its configuration options:

root@node1:~ # pcs stonith describe fence_xvm

Simply because my guest names and node names are the same, I do not need to provide mapping information or define separate fencing per node. It is enough for me to define a very generic fencing method:

root@node1:~ # pcs stonith create kvm-kill fence_xvm

It is time to do some tests. Turn the network off on one node. Cause a kernel crash on one node (HINT: echo c > /proc/sysrq-trigger).
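
You can also trigger fencing by hand before the destructive tests. Assuming fence_virtd on the KVM host is configured as in the KVM recipes, list the guests visible to the fence device, then ask the cluster to fence the other node; it should be power-cycled:

root@node1:~ # fence_xvm -o list
root@node1:~ # pcs stonith fence node2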

Adding resources

The only resource will be a gfs2-healthcheck script that we will create and put into /etc/init.d (on both nodes):

# cat /etc/init.d/gfs2-healthcheck 
#!/bin/bash
#
# chkconfig: - 24 76
# description: Check if GFS2 FS healthy
# Short-Description:    Check if GFS2 FS healthy
# Description:          Check if GFS2 FS healthy

# Whatever action we are called with (start/stop/status), perform the same
# read-write check on every mounted GFS2 filesystem.
rtrn=0
MOUNTS=$(awk '/ gfs2 /{print $2}' /proc/mounts)
for M in $MOUNTS ; do
        # Check for RW access: create and remove a temporary file
        touch "$M/.healthcheck.$$" || rtrn=1
        rm -f "$M/.healthcheck.$$" || rtrn=1
done

exit $rtrn
# chmod +x /etc/init.d/gfs2-healthcheck

Then add it to the cluster:

root@node1:~ # pcs resource create gfs-check lsb:gfs2-healthcheck clone

This script checks whether the GFS2 FS is still available for RW operations. If it is not, the bad node will be fenced and the second node will continue its job.
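
A quick way to confirm that the clone is active everywhere is pcs itself; the gfs-check clone should be reported as Started on both nodes:

root@node1:~ # pcs status resources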


Updated on Thu Oct 12 18:38:18 IDT 2017. More documentation here.