Stretch NFS cluster

I was asked to create a stretch NFS cluster in which one node is at the first site and serves its local clients, and another node is at the second site and serves its local clients. No real failover should occur, but the content must be shared.

The proposed solution uses DRBD (to emulate a shared disk) and GFS2 as a clustered file system. A third node at a third location will serve as a quorum node, and a disconnected node should commit suicide on its own. A real fencing device cannot be used, because a site disconnection means the fencing device becomes unreachable as well. This time I chose CentOS 7 as the base OS.

Preparing POC

As usual, the POC will run in a KVM environment, so you need to create three routed networks that mimic the three sites:

# cat iso1.xml
<network>
  <name>iso1</name>
  <forward mode='open'/>
  <bridge name='virbr1' stp='off' delay='0'/>
  <domain name='local'/>
  <ip address='192.168.101.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.101.128' end='192.168.101.254'/>
    </dhcp>
  </ip>
</network>
# cat iso2.xml
<network>
  <name>iso2</name>
  <forward mode='open'/>
  <bridge name='virbr2' stp='off' delay='0'/>
  <domain name='local'/>
  <ip address='192.168.102.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.102.128' end='192.168.102.254'/>
    </dhcp>
  </ip>
</network>
# cat iso3.xml
<network>
  <name>iso3</name>
  <forward mode='open'/>
  <bridge name='virbr3' stp='off' delay='0'/>
  <domain name='local'/>
  <ip address='192.168.103.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.103.128' end='192.168.103.254'/>
    </dhcp>
  </ip>
</network>
# for n in iso{1,2,3} ; do 
    virsh net-define $n.xml 
    virsh net-start $n 
    virsh net-autostart $n 
done

For more information, see my KVM recipes.

Now create CentOS 7 disk clones from the template:

# for n in node{1,2,3} ; do
    qemu-img create -f qcow2 -b /var/lib/libvirt/images/template/CentOS7.qcow2 \
     /var/lib/libvirt/images/$n.qcow2
done

Finally, create virtual machines:

# for n in 1 2 3 ; do
virt-install --name node$n --memory 2048 --vcpus 2 \
        --import --disk /var/lib/libvirt/images/node$n.qcow2 \
        --network network=iso$n --os-type linux --os-variant rhel7
done

Run the helper script to view the IP addresses:

# kvm-guests-ip.sh 
node1 => ? (192.168.101.240) at 52:54:00:3c:95:21 [ether] on virbr1
node2 => ? (192.168.102.203) at 52:54:00:7f:cc:17 [ether] on virbr2
node3 => ? (192.168.103.154) at 52:54:00:83:af:9a [ether] on virbr3

Prepare a hosts file with the received data:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.101.240 node1.local node1
192.168.102.203 node2.local node2
192.168.103.154 node3.local node3
Copy it to all three nodes as /etc/hosts. Do not forget to fix the hostname on each node as well (on CentOS 7, that means editing /etc/hostname).
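
The three host entries can also be generated from the addresses reported by the helper script, which avoids typos. A quick sketch (the IPs are this POC's; substitute your own):

```shell
# Build the cluster part of /etc/hosts from node:IP pairs (this POC's data).
HOSTS_SNIPPET=$(mktemp)
for e in 1:192.168.101.240 2:192.168.102.203 3:192.168.103.154 ; do
    n=${e%%:*}                       # node number
    ip=${e#*:}                       # its IP address
    printf '%s node%s.local node%s\n' "$ip" "$n" "$n" >> "$HOSTS_SNIPPET"
done
cat "$HOSTS_SNIPPET"
```

Append the result to the standard localhost lines and distribute the file.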

Exchange SSH keys for convenient management:

root@node1:~ # ssh-keygen -t rsa -b 1024 -C "root@cluster"
...
root@node1:~ # cat .ssh/id_rsa.pub >> .ssh/authorized_keys
root@node1:~ # ssh-keyscan $(hostname -s) | \
 sed -e 's/'$(hostname -s)'/'$(hostname -f),$(hostname -s)'/' | \
 awk '/'$(hostname -s)'/ {a=$1;gsub("node1","node2",a);b=$1;gsub("node1","node3",b);print a","b","$0;}' \
 >> .ssh/known_hosts
root@node1:~ # scp -pr .ssh node2:
root@node1:~ # scp -pr .ssh node3:

Configure and run any NTP client; the nodes' clocks have to be synchronized.
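
Since the ntp package is installed in the next step anyway, a minimal /etc/ntp.conf fragment could look like the one below. The server address is an assumption for illustration; point it at whatever NTP source all three sites can actually reach:

```
# Minimal ntpd configuration sketch; substitute a reachable time source.
server 192.168.101.1 iburst
driftfile /var/lib/ntp/drift
```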

Creating cluster

Because these virtual machines cannot reach the Internet in the current POC configuration, installing packages can be tricky. You can install all the necessary software in the template before cloning, use a proxy on the host, or temporarily move a VM to another network.

# yum install pcs pacemaker fence-agents-all rsync vim-common wget lvm2-cluster ntp ntpdate gfs2-utils

Authorizing pcsd

I do not like the password-based method recommended by the vendor. Here's a workaround that avoids using a password:

root@node1:~ # systemctl start pcsd.service
root@node1:~ # systemctl stop pcsd.service
root@node1:~ # cd /var/lib/pcsd 
root@node1:/var/lib/pcsd # ll
total 12
-rwx------. 1 root root   60 Sep 16 13:24 pcsd.cookiesecret
-rwx------. 1 root root 1196 Sep 16 13:24 pcsd.crt
-rwx------. 1 root root 1675 Sep 16 13:24 pcsd.key

Starting and immediately stopping the pcsd daemon creates some files in the /var/lib/pcsd directory. The next step is to create the missing authorization files:

root@node1:/var/lib/pcsd # TOKEN=$(dd if=/dev/urandom bs=18 count=1 2>/dev/null | xxd -p)
root@node1:/var/lib/pcsd # cat > pcs_users.conf << EOFcat
[
 {
   "creation_date": "$(date)",
   "username": "hacluster",
   "token": "$TOKEN"
 }
]
EOFcat

root@node1:/var/lib/pcsd # cat > tokens << EOFcat
{
  "format_version": 2,
  "data_version": 2,
  "tokens": {
    "node1": "$TOKEN",
    "node2": "$TOKEN",
    "node3": "$TOKEN"
  }
}
EOFcat

root@node1:/var/lib/pcsd # ll
total 20
-rw-r--r--. 1 root root  141 Sep 16 13:45 pcs_users.conf
-rwx------. 1 root root   60 Sep 16 13:24 pcsd.cookiesecret
-rwx------. 1 root root 1196 Sep 16 13:24 pcsd.crt
-rwx------. 1 root root 1675 Sep 16 13:24 pcsd.key
-rw-r--r--. 1 root root  224 Sep 16 13:47 tokens

Finally, copy the entire /var/lib/pcsd directory to the neighbors, then enable and start the pcsd daemon on all nodes:

root@node1:~ # rsync -a /var/lib/pcsd/ node2:/var/lib/pcsd/
root@node1:~ # rsync -a /var/lib/pcsd/ node3:/var/lib/pcsd/
root@node1:~ # for h in node{1,2,3} ; do
 ssh $h "systemctl enable pcsd ; systemctl start pcsd "
done

Verify that the authorization works:

root@node1:~ # pcs cluster auth node1 node2 node3
node1: Already authorized
node3: Already authorized
node2: Already authorized

Yahooo !! (Ghmm... , Google ?)

Initial cluster configuration

root@node1:~ # pcs cluster setup --start --enable --name nfs node1 node2 node3 --transport udpu
Destroying cluster on nodes: node1, node2, node3...
node1: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (pacemaker)...
node3: Stopping Cluster (pacemaker)...
node3: Successfully destroyed cluster
node1: Successfully destroyed cluster
node2: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'node1', 'node2', 'node3'
node1: successful distribution of the file 'pacemaker_remote authkey'
node2: successful distribution of the file 'pacemaker_remote authkey'
node3: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
node1: Succeeded
node2: Succeeded
node3: Succeeded

Starting cluster on nodes: node1, node2, node3...
node3: Starting Cluster...
node2: Starting Cluster...
node1: Starting Cluster...
node1: Cluster Enabled
node2: Cluster Enabled
node3: Cluster Enabled

Synchronizing pcsd certificates on nodes node1, node2, node3...
node1: Success
node3: Success
node2: Success
Restarting pcsd on the nodes in order to reload the certificates...
node1: Success
node3: Success
node2: Success
NOTE: if for some reason the command failed, try running it from another node.

Check the status of the cluster:

root@node1:~ # pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: node3 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum
 Last updated: Sat Sep 16 14:11:43 2017
 Last change: Sat Sep 16 14:11:33 2017 by hacluster via crmd on node3
 3 nodes configured
 0 resources configured

PCSD Status:
  node1: Online
  node3: Online
  node2: Online

Make sure that fencing/STONITH is disabled for the duration of the setup:

root@node1:~ # pcs property set stonith-enabled=false
root@node1:~ # pcs property list
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: nfs
 dc-version: 1.1.15-11.el7_3.5-e174ec8
 have-watchdog: false
 stonith-enabled: false

Configuring fencing and quorum

After installing the cluster, it's time to configure it. First we need to understand what we want from it. The usual fencing strategy will not work for a stretch cluster: a typical failure is the disconnection of an entire site, and then the detached node cannot be fenced because the fencing device is unavailable too. Therefore, the third node at the third location will act as a quorum node and will not carry any resources. We will then configure the cluster for a quorum of 2 votes. A failed cluster partition (a single node, out of quorum) will commit suicide (reboot itself), which, of course, stops service at that site, but preserves the integrity of the data.

Searching the Internet, I found that a suicide fencing plug-in is still available only on SuSE; other distributions removed it because end users misused it. After digging through some of the ready-made /usr/sbin/fence_* Python scripts, I wrote my own fence_suicide script. It lies to the cluster that it has successfully killed a neighbor, and runs "reboot -f" if it is called for its own node. Install it on all nodes.
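
The decision logic can be sketched roughly like this (a hypothetical simplification, not the actual fence_suicide code; a real fence agent must also implement the metadata action and read its options from stdin):

```shell
# Sketch only: report success for any peer, reboot when we are the target.
fence_suicide_sketch() {
    nodename=$1                      # value of the nodename= parameter
    action=${2:-reboot}              # fencing action requested by the cluster
    case "$action" in
        monitor|status)
            return 0 ;;              # nothing to probe, always report healthy
        on|off|reboot)
            if [ "$nodename" = "$(hostname)" ] ; then
                echo "would run: reboot -f"   # the real script reboots here
            else
                echo "pretending $nodename was fenced"
            fi
            return 0 ;;              # tell the cluster fencing succeeded
        *)
            return 1 ;;
    esac
}

fence_suicide_sketch peer-node reboot
```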

# wget -O /usr/sbin/fence_suicide http://www.voleg.info/scripts/fence_suicide
# chmod +x /usr/sbin/fence_suicide

The next step is to let the cluster know that two nodes make a quorum and can start serving. Open the /etc/corosync/corosync.conf file and add the expected_votes line to the "quorum" section:

 ..
quorum {
    provider: corosync_votequorum
    expected_votes: 2
}
 ..

Copy the file to all three nodes and reboot them. In fact, it's enough to restart the services, but the reboot will take less time.

Then we define the fencing:

root@node1:~ # pcs stonith create suicide1 fence_suicide pcmk_host_list=node1 nodename=node1.local
root@node1:~ # pcs stonith create suicide2 fence_suicide pcmk_host_list=node2 nodename=node2.local
root@node1:~ # pcs stonith create suicide3 fence_suicide pcmk_host_list=node3 nodename=node3.local

The "nodename" parameter for the fence_suicide script must match the output of the "hostname" command on that node. If a node has a short hostname, use the short name.

Then, turn the fencing on:

root@node1:~ # pcs property set stonith-enabled=true
root@node1:~ # pcs property set no-quorum-policy=suicide
root@node1:~ # pcs property
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: nfs
 dc-version: 1.1.16-12.el7_4.2-94ff4df
 have-watchdog: false
 no-quorum-policy: suicide
 stonith-enabled: true

To test, I disconnected node1's network. As a result, it quickly rebooted, as designed, while the rest of the cluster continued to work.

root@HOST:~ # brctl show
bridge name     bridge id               STP enabled     interfaces
virbr0          8000.5254003e2fe3       yes             virbr0-nic
virbr1          8000.5254007a3d4d       no              virbr1-nic
                                                        vnet0
virbr2          8000.525400e087f7       no              virbr2-nic
                                                        vnet1
virbr3          8000.525400b56196       no              virbr3-nic
                                                        vnet2
root@HOST:~ # brctl delif virbr1 vnet0
root@HOST:~ # brctl addif virbr1 vnet0

Stuffing DRBD

This time I'll try precompiled RPMs from ELRepo. Install them on node1 and node2:

# rpm -ivh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
# yum list *drbd*
drbd84-utils.x86_64                9.1.0-1.el7.elrepo            elrepo
drbd84-utils-sysvinit.x86_64       9.1.0-1.el7.elrepo            elrepo
drbd90-utils.x86_64                9.1.0-1.el7.elrepo            elrepo
drbd90-utils-sysvinit.x86_64       9.1.0-1.el7.elrepo            elrepo
kmod-drbd84.x86_64                 8.4.10-1_2.el7_4.elrepo       elrepo
kmod-drbd90.x86_64                 9.0.9-1.el7_4.elrepo          elrepo

I used v8 for years and was pleased with it. We will avoid adventures with v9 for as long as possible!

# yum install kmod-drbd84.x86_64 drbd84-utils.x86_64

Prepare a backing device for DRBD. In my case, I will create another LV in the same rootvg.

root@node1:~ # lvcreate -n drbd -L3g /dev/rootvg
  Logical volume "drbd" created.
root@node2:~ # lvcreate -n drbd -L3g /dev/rootvg
  Logical volume "drbd" created.

Node3 will not carry a payload, so nothing has been created on it.

Create a resource definitions file:

# cat /etc/drbd.d/export.res
resource export {
        device    /dev/drbd1;
        disk      /dev/rootvg/drbd;
        meta-disk internal;

        disk { }
        handlers { }

        net {
                protocol        C;
                cram-hmac-alg   sha1;
                csums-alg sha1;
                shared-secret   "9szdFmSkQEoXU1s7UNVbpqYrhhIsGjhQ4MxzNeotPku3NkJEq3LovZcHB2pITRy";
                use-rle yes;
                allow-two-primaries yes;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }

        startup {
                wfc-timeout 300;
                degr-wfc-timeout 0;
                become-primary-on both;
        }

        on node1.local { address   192.168.101.240:7789; }
        on node2.local { address   192.168.102.203:7789; }
}

I think that the options I have chosen are suitable for my purpose. Read the manual to understand them.

Copy resource file to node2 (no need to copy to node3):

root@node1:~ # rsync -a /etc/drbd.d/export.res node2:/etc/drbd.d/export.res

Let's initialize DRBD:

root@node1:~ # drbdadm create-md export
 ..
root@node2:~ # drbdadm create-md export
 ..
root@node1:~ # drbdadm up export
 ..
root@node2:~ # drbdadm up export
 ..
root@node1:~ # drbdadm status
export role:Secondary
  disk:Inconsistent
  node2.local role:Secondary
    peer-disk:Inconsistent

root@node1:~ # drbdadm cstate all
Connected
root@node1:~ # drbdadm dstate all
Inconsistent/Inconsistent
root@node1:~ # cat /proc/drbd 
version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22

 1: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r-----
    ns:0 nr:0 dw:0 dr:0 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:3145596

As you can see, the resource is connected, is secondary on both nodes, and contains inconsistent content. Let's tell DRBD that the initial resynchronization is not required, because our disks are initially empty:

root@node1:~ # drbdadm -- --clear-bitmap new-current-uuid export
root@node1:~ # drbdadm dstate all
UpToDate/UpToDate
root@node1:~ # drbdadm status
export role:Secondary
  disk:UpToDate
  node2.local role:Secondary
    peer-disk:UpToDate

Make both nodes primary:

root@node1:~ # drbdadm primary export
root@node2:~ # drbdadm primary export
root@node2:~ # drbdadm status
export role:Primary
  disk:UpToDate
  node1.local role:Primary
    peer-disk:UpToDate

Now that DRBD is in the desired state, we can define it in the cluster. Stop DRBD:

root@node1:~ # drbdadm down export ; rm -f /var/lock/drbd-147-1
root@node2:~ # drbdadm down export ; rm -f /var/lock/drbd-147-1

You do not need to enable the DRBD service at boot, because it will be started by the cluster. Create a command file named drbd to configure all the necessary resources in one shot, then run it:

root@node1:~ # cat drbd 
PCS="/usr/sbin/pcs"
TCIB=$(mktemp)
PCST="$PCS -f $TCIB"

$PCS cluster cib $TCIB

$PCST resource delete drbd
$PCST resource create drbd ocf:linbit:drbd drbd_resource=export ignore_missing_notifications=true
$PCST constraint location add drbd-node3 drbd node3 -INFINITY
$PCST resource master drbd master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role="Master"

$PCS cluster cib-push $TCIB

rm -f $TCIB
root@node1:~ # sh drbd 
Error: Resource 'drbd' does not exist.
CIB updated

As a result:

root@node1:~ # pcs status 
Cluster name: nfs
Stack: corosync
Current DC: node3 (version 1.1.16-12.el7_4.2-94ff4df) - partition with quorum
Last updated: Fri Sep 22 20:36:02 2017
Last change: Fri Sep 22 20:35:56 2017 by root via cibadmin on node1

3 nodes configured
5 resources configured

Online: [ node1 node2 node3 ]

Full list of resources:

 suicide1       (stonith:fence_suicide):        Started node2
 suicide2       (stonith:fence_suicide):        Started node3
 suicide3       (stonith:fence_suicide):        Started node1
 Master/Slave Set: drbd-master [drbd]
     Masters: [ node1 node2 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

If you do not see Masters, but Slaves, then read the About selinux section.

Clustered LVM

I want to restrict LVM to scanning only the system disks and the DRBD device, by editing the "filter" line in the /etc/lvm/lvm.conf file:

 ..
filter = [ "a|^/dev/vd|", "a|^/dev/drbd|", "r|.*|" ]
 ..

Note that my system disk is vda (KVM environment), correct this if you need to.
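
The filter is a first-match list: a device is accepted by the first "a" pattern it matches, and anything that matches nothing falls through to the final reject. The effect of the two accept patterns can be illustrated with plain grep (illustration only; LVM applies its own regex matching to device paths):

```shell
# Mimic the accept patterns: keep /dev/vd* and /dev/drbd*, drop the rest.
for dev in /dev/vda2 /dev/drbd1 /dev/sda1 ; do
    if printf '%s\n' "$dev" | grep -Eq '^/dev/vd|^/dev/drbd' ; then
        echo "$dev accepted"
    else
        echo "$dev rejected"
    fi
done
```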

Now enable clustered LVM (it's also just editing lvm.conf):

root@node1:~ # lvmconf --enable-cluster

Copy /etc/lvm/lvm.conf to your neighbors:

root@node1:~ # rsync -a /etc/lvm/lvm.conf node2:/etc/lvm/lvm.conf
root@node1:~ # rsync -a /etc/lvm/lvm.conf node3:/etc/lvm/lvm.conf

Now we will define and run the cluster resources for dlm and clvmd, following this reference. As in the previous example, put everything in one file and run it:

root@node1:~ # cat clvm 
PCS="/usr/sbin/pcs"
TCIB=$(mktemp)
PCST="$PCS -f $TCIB"

$PCS cluster cib $TCIB

$PCST resource delete dlm
$PCST resource create dlm ocf:pacemaker:controld clone interleave=true ordered=true
$PCST constraint colocation add dlm-clone with drbd-master INFINITY with-rsc-role=Master

$PCST resource delete clvmd
$PCST resource create clvmd ocf:heartbeat:clvm clone interleave=true ordered=true

$PCST constraint order start dlm-clone then clvmd-clone
$PCST constraint colocation add clvmd-clone with dlm-clone

$PCS cluster cib-push $TCIB

rm -f $TCIB
root@node1:~ # sh clvm
Error: Resource 'dlm' does not exist.
Error: Resource 'clvmd' does not exist.
Adding dlm-clone clvmd-clone (kind: Mandatory) (Options: first-action=start then-action=start)
CIB updated

As a result:

root@node1:~ # pcs status
Cluster name: nfs
Stack: corosync
Current DC: node2 (version 1.1.16-12.el7_4.2-94ff4df) - partition with quorum
Last updated: Fri Sep 22 21:02:50 2017
Last change: Fri Sep 22 21:02:41 2017 by root via cibadmin on node1

3 nodes configured
11 resources configured

Online: [ node1 node2 node3 ]

Full list of resources:

 suicide1       (stonith:fence_suicide):        Started node2
 suicide2       (stonith:fence_suicide):        Started node3
 suicide3       (stonith:fence_suicide):        Started node1
 Master/Slave Set: drbd-master [drbd]
     Masters: [ node1 node2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ node1 node2 ]
     Stopped: [ node3 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ node1 node2 ]
     Stopped: [ node3 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Create a PV on the disk shared by DRBD:

root@node1:~ # pvcreate /dev/drbd/by-res/export/0 
  WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  Physical volume "/dev/drbd/by-res/export/0" successfully created.
root@node1:~ # pvs
  WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  PV         VG     Fmt  Attr PSize   PFree  
  /dev/drbd1        lvm2 ---   3.00g  3.00g
  /dev/vda2  rootvg lvm2 a--  19.75g 11.75g

Despite the warnings, the physical volume was created successfully; the warnings will disappear after the nodes are rebooted. You can do that now. Create a clustered VG (option -c y), then an LV:

root@node1:~ # vgcreate -c y exportvg /dev/drbd1
  Clustered volume group "exportvg" successfully created
root@node1:~ # vgs
  VG       #PV #LV #SN Attr   VSize   VFree  
  exportvg   1   0   0 wz--nc  3.00g  3.00g
  rootvg     1   3   0 wz--n- 19.75g 11.75g

root@node1:~ # lvcreate -l100%FREE -n export /dev/exportvg
  Logical volume "export" created.

Let's format it as GFS2.

GFS2 part

Following the Red Hat recommendations, format the LV:

root@node1:~ # mkfs.gfs2 -j2 -p lock_dlm -t nfs:export /dev/exportvg/export 
/dev/exportvg/export is a symbolic link to /dev/dm-3
This will destroy any data on /dev/dm-3
Are you sure you want to proceed? [y/n] y
Discarding device contents (may take a while on large devices): Done
Adding journals: Done 
Building resource groups: Done   
Creating quota file: Done
Writing superblock and syncing: Done
Device:                    /dev/exportvg/export
Block size:                4096
Device size:               3.00 GB (785408 blocks)
Filesystem size:           3.00 GB (785404 blocks)
Journals:                  2
Resource groups:           13
Locking protocol:          "lock_dlm"
Lock table:                "nfs:export"
UUID:                      aa75ca81-10dc-4afc-94ed-3170a7ad1325

In "-t nfs:export", nfs is the name of the cluster and export is the name of the file system. The "-j2" option creates two journals, one for each node that will mount the file system.

Continuing with the recommendations, create a file system resource:

root@node1:~ # mkdir /export
root@node2:~ # mkdir /export
root@node1:~ # cat gfs2 
PCS="/usr/sbin/pcs"
TCIB=$(mktemp)
PCST="$PCS -f $TCIB"

$PCS cluster cib $TCIB

$PCST resource delete clusterfs
$PCST resource create clusterfs Filesystem device=/dev/exportvg/export directory=/export fstype=gfs2 options=noatime clone

$PCST constraint order start clvmd-clone then clusterfs-clone
$PCST constraint colocation add clusterfs-clone with clvmd-clone

$PCS cluster cib-push $TCIB

rm -f $TCIB
root@node1:~ # sh gfs2
Error: Resource 'clusterfs' does not exist.
Assumed agent name 'ocf:heartbeat:Filesystem' (deduced from 'Filesystem')
Adding clvmd-clone clusterfs-clone (kind: Mandatory) (Options: first-action=start then-action=start)
CIB updated
root@node1:~ # pcs status 
Cluster name: nfs
Stack: corosync
Current DC: node2 (version 1.1.16-12.el7_4.2-94ff4df) - partition with quorum
Last updated: Fri Sep 22 21:25:02 2017
Last change: Fri Sep 22 21:21:46 2017 by root via cibadmin on node1

3 nodes configured
14 resources configured

Online: [ node1 node2 node3 ]

Full list of resources:

 suicide1       (stonith:fence_suicide):        Started node1
 suicide2       (stonith:fence_suicide):        Started node2
 suicide3       (stonith:fence_suicide):        Started node3
 Master/Slave Set: drbd-master [drbd]
     Masters: [ node1 node2 ]
 Clone Set: dlm-clone [dlm]
     Started: [ node1 node2 ]
     Stopped: [ node3 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ node1 node2 ]
     Stopped: [ node3 ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ node1 node2 ]
     Stopped: [ node3 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
root@node1:~ # df /export
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/exportvg-export  3.0G  259M  2.8G   9% /export
root@node2:~ # df /export
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/exportvg-export  3.0G  259M  2.8G   9% /export

NFS part

The NFS server will be started at OS boot; only the exportfs resource will be defined by the cluster.

root@node1:~ # systemctl enable nfs-server.service ; systemctl start nfs-server.service
Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service.
root@node2:~ # systemctl enable nfs-server.service ; systemctl start nfs-server.service
Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service.

Create a command file and run it:

root@node1:~ # cat export 
PCS="/usr/sbin/pcs"
TCIB=$(mktemp)
PCST="$PCS -f $TCIB"

$PCS cluster cib $TCIB

$PCST resource delete export
$PCST resource create export exportfs clientspec="*" options=rw,sync,no_root_squash directory=/export fsid=001 unlock_on_stop=1 clone

$PCST constraint order start clusterfs-clone then export-clone
$PCST constraint colocation add export-clone with clusterfs-clone

$PCS cluster cib-push $TCIB

rm -f $TCIB
root@node1:~ # sh export 
Error: Resource 'export' does not exist.
Assumed agent name 'ocf:heartbeat:exportfs' (deduced from 'exportfs')
Adding clusterfs-clone export-clone (kind: Mandatory) (Options: first-action=start then-action=start)
CIB updated
root@node1:~ # exportfs -v
/export         <world>(rw,sync,wdelay,hide,no_subtree_check,fsid=1,sec=sys,secure,no_root_squash,no_all_squash)
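
Clients on each site would then mount the export from their local node. A hypothetical client-side /etc/fstab entry (the node name and mountpoint are assumptions for illustration):

```
node1.local:/export  /mnt/export  nfs  defaults,_netdev  0 0
```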

Conclusion

The Linux cluster is a very flexible tool. You can configure it the way you want. But it is very important to understand what exactly you want.

About selinux

When selinux is in enforcing mode, the following messages appear in /var/log/messages:

node1 drbd(drbd)[4853]: ERROR: export: Called /usr/sbin/crm_master -Q -l reboot -v 10000
node1 drbd(drbd)[4853]: ERROR: export: Exit code 107
node1 drbd(drbd)[4853]: ERROR: export: Command output:
node1 lrmd[1029]:  notice: drbd_monitor_20000:4853:stderr [ Error signing on to the CIB service: Transport endpoint is not connected ]
No DRBD instance gets promoted to master; both remain slaves. Even when promoted manually, the cluster demotes it back.

Disabling SELinux solves the problem, but I want to fix it while staying in enforcing mode:

# sealert -a /var/log/audit/audit.log
 ..
--------------------------------------------------------------------------------

SELinux is preventing /usr/sbin/crm_attribute from connectto access on the unix_stream_socket @cib_rw.

*****  Plugin catchall_boolean (89.3 confidence) suggests   ******************

If you want to allow daemons to enable cluster mode
Then you must tell SELinux about this by enabling the 'daemons_enable_cluster_mode' boolean.

Do
setsebool -P daemons_enable_cluster_mode 1

*****  Plugin catchall (11.6 confidence) suggests   **************************

If you believe that crm_attribute should be allowed connectto access on the @cib_rw unix_stream_socket by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.
Do
allow this access for now by executing:
# ausearch -c 'crm_attribute' --raw | audit2allow -M my-crmattribute
# semodule -i my-crmattribute.pp


Additional Information:
Source Context                system_u:system_r:drbd_t:s0
Target Context                system_u:system_r:cluster_t:s0
Target Objects                @cib_rw [ unix_stream_socket ]
Source                        crm_attribute
Source Path                   /usr/sbin/crm_attribute
Port                          <Unknown>
Host                          <Unknown>
Source RPM Packages           pacemaker-1.1.16-12.el7_4.2.x86_64
Target RPM Packages           
Policy RPM                    selinux-policy-3.13.1-166.el7_4.4.noarch
Selinux Enabled               True
Policy Type                   targeted
Enforcing Mode                Permissive
Host Name                     node1.local
Platform                      Linux node1.local 3.10.0-693.2.2.el7.x86_64 #1 SMP
                              Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64
Alert Count                   102
First Seen                    2017-09-21 13:38:28 IDT
Last Seen                     2017-09-21 14:34:55 IDT
Local ID                      de1cae06-93ab-45fc-8cc2-c165375e48ce

Raw Audit Messages
type=AVC msg=audit(1505993695.143:66): avc:  denied  { connectto } for  pid=3650 comm="crm_attribute" path=006369625F72770000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 scontext=system_u:system_r:drbd_t:s0 tcontext=system_u:system_r:cluster_t:s0 tclass=unix_stream_socket


type=SYSCALL msg=audit(1505993695.143:66): arch=x86_64 syscall=connect success=yes exit=0 a0=5 a1=7ffe055f6e30 a2=6e a3=7ffe055f6b00 items=0 ppid=3647 pid=3650 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm=crm_attribute exe=/usr/sbin/crm_attribute subj=system_u:system_r:drbd_t:s0 key=(null)

Hash: crm_attribute,drbd_t,cluster_t,unix_stream_socket,connectto

As suggested by sealert, the permanent fix is to run the following on all three nodes:

# setsebool -P daemons_enable_cluster_mode 1
Updated on Sat Sep 23 11:24:51 IDT 2017 by Oleg Volkov