Building a cluster with a shared FS (GFS2) on RedHat 6

Prepare nodes

Name resolution is important: define the server names in DNS or hard-code them into /etc/hosts.

$ host vorh6t01
vorh6t01.domain.com has address 192.168.0.105
$ host vorh6t02
vorh6t02.domain.com has address 192.168.0.108

The servers used here are Cisco UCS servers, therefore we will define UCS fencing later.

Create a UCS profile and install RH6 with a minimal configuration on both nodes. Create a shared LUN on the storage and set up the required zoning and masking so that both servers can see the shared LUN.
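
A quick sanity check that both nodes actually see the LUN (the HITACHI vendor string matches my storage array, as also used in multipath.conf later; adjust it to yours):

# grep -i hitachi /proc/scsi/scsi
# fdisk -l 2>/dev/null | grep '^Disk /dev/sd'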

Copy the host SSH keys from one node to the other. This is more relevant for an HA (fail-over) cluster, where it makes SSH to the floating IP painless, but I still copy them even for AA (active-active) clusters:

root@vorh6t01:/etc/ssh # scp ssh_host_* root@vorh6t02:/etc/ssh/
...
root@vorh6t01:/etc/ssh # >/root/.ssh/known_hosts
root@vorh6t01:/etc/ssh # ssh vorh6t02 /etc/init.d/sshd restart
...

Fix /etc/hosts with all the cluster-relevant IPs and names and copy it between the nodes.
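
A minimal example of the relevant /etc/hosts entries, using the addresses shown above (the short aliases are just my convention):

192.168.0.105   vorh6t01.domain.com   vorh6t01
192.168.0.108   vorh6t02.domain.com   vorh6t02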

Generate root SSH keys and exchange them between the cluster nodes:

root@vorh6t01:~ # ssh-keygen -t rsa -b 1024 -C "root@vorh6t0x"
.....
root@vorh6t01:~ # cat .ssh/id_rsa.pub >> .ssh/authorized_keys
root@vorh6t01:~ # scp -pr .ssh vorh6t02:
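
A quick check that passwordless root SSH now works in both directions; each command should print the peer's hostname without asking for a password:

root@vorh6t01:~ # ssh vorh6t02 hostname
root@vorh6t02:~ # ssh vorh6t01 hostname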

Cluster software

Install these RPMs on both nodes (with all dependencies):

# yum install lvm2-cluster ccs cman rgmanager gfs2-utils

Setting up the cluster

vorh6t01 and vorh6t02 are the two nodes of a cluster named vorh6t0x. Take care to make all names resolvable by DNS and add all names to /etc/hosts on both nodes.

Define the cluster:

root@vorh6t01:~ # ccs_tool create -2 vorh6t0x

The command above creates the /etc/cluster/cluster.conf file. It can be edited by hand, but it then has to be redistributed to every node in the cluster. The -2 option builds a two-node cluster; the classic cluster configuration assumes more than two nodes to make quorum easy.

Open the file and change the node names to the real names. The resulting file should look like this:

<?xml version="1.0"?>
<cluster name="vorh6t0x" config_version="1">

  <cman two_node="1" expected_votes="1" transport="udpu" />
  <clusternodes>
    <clusternode name="vorh6t01.domain.com" votes="1" nodeid="1">
      <fence>
        <method name="single">
        </method>
      </fence>
    </clusternode>
    <clusternode name="vorh6t02.domain.com" votes="1" nodeid="2">
      <fence>
        <method name="single">
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
  </fencedevices>

  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

I am using transport="udpu" here because my network does not support multicast, and broadcasts are not welcome either. Without this option my cluster behaves unpredictably. Check the results:

root@vorh6t01:~ # ccs_config_validate
Configuration validates
root@vorh6t01:~ # ccs_tool lsnode

Cluster name: vorh6t0x, config_version: 1

Nodename                        Votes Nodeid Fencetype
vorh6t01.domain.com           1    1
vorh6t02.domain.com           1    2

Copy /etc/cluster/cluster.conf to second node:

vorh6t01 # scp /etc/cluster/cluster.conf vorh6t02:/etc/cluster/cluster.conf

You can now start the cluster services on both nodes with /etc/init.d/cman start and see the cluster working. Check /var/log/messages and look at the clustat output:

root@vorh6t01:~ # clustat
Cluster Status for vorh6t0x @ Thu Feb 26 11:37:05 2015
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vorh6t01.domain.com                                            1 Online, Local
 vorh6t02.domain.com                                            2 Online

root@vorh6t02:~ # clustat
Cluster Status for vorh6t0x @ Thu Feb 26 11:37:10 2015
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 vorh6t01.domain.com                                            1 Online
 vorh6t02.domain.com                                            2 Online, Local

Add cluster services to init scripts:

# chkconfig --add cman
# chkconfig cman on

This cluster will not manage any resources; it just provides the infrastructure for a shared clustered file system, therefore no additional resource configuration is required. Only fencing has to be configured for normal cluster functionality. These servers are UCS, therefore fence_cisco_ucs will be used.

I've created a local user "FENCEUSER" on UCSMANAGER (the UCS Manager address) with the poweroff and server-profile roles.

Check that it works:

# fence_cisco_ucs --ip=UCSMANAGER --username=FENCEUSER --password=FENCEUSERPASS \
	--ssl --suborg=org-YourSubOrgString --plug=vorh6t01 --action=status
Status: ON

You can use on or off as the action parameter to turn the neighbour server on or off. It is not smart to turn off the server you are running on.

The --suborg string is usually your "Sub-Organization" name (in Cisco terms) with the prefix "org-". For example, if you called your "Sub-Organization" "Test" in UCS Manager, the result will be --suborg=org-Test.

Once the fencing tests have succeeded, fix cluster.conf:

# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="vorh6t0x" config_version="2">
        <logging syslog_priority="error"/>

  <fence_daemon post_fail_delay="20" post_join_delay="30" clean_start="1" />
  <cman two_node="1" expected_votes="1" transport="udpu" />
  <clusternodes>
    <clusternode name="vorh6t01.domain.com" votes="1" nodeid="1">
      <fence>
        <method name="single">
                <device name="ucsfence" port="vorh6t01" action="off" />
        </method>
      </fence>
    </clusternode>                                                 
    <clusternode name="vorh6t02.domain.com" votes="1" nodeid="2">
      <fence>
        <method name="single">
                <device name="ucsfence" port="vorh6t02" action="off" />
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
        <fencedevice name="myfence" agent="fence_manual" />
	<fencedevice name="ucsfence"
		agent="fence_cisco_ucs"
		ipaddr="USCMANAGER"
		login="FENCEUSER"
		passwd="FENCEUSERPASS"
		ssl="on"
		suborg="org-YourSubOrgString"
		/>
  </fencedevices>

  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

Do not forget to increment the config_version number and save the changes. Verify the config file:

vorh6t01 # ccs_config_validate
Configuration validates

Distribute file and update cluster:

vorh6t01 # scp /etc/cluster/cluster.conf vorh6t02:/etc/cluster/cluster.conf
vorh6t01 # cman_tool version -r -S

Check that fencing works, for example by turning off the network on one node.
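
Besides pulling the network, you can trigger fencing by hand with the standard cman fence_node tool; a minimal sketch, run from the node that should stay up, using the node name as it appears in cluster.conf:

root@vorh6t01:~ # fence_node vorh6t02.domain.com

Watch /var/log/messages on the fencing node to see the fence agent being called.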

LVM settings

We will create the LVM structure on the multipath device, not on the underlying SCSI disks. It is important to set up a correct filter line in /etc/lvm/lvm.conf that explicitly includes the multipathed devices and excludes the others, otherwise you will see "Duplicate PV found" messages and LVM may decide to use a single-path disk instead of the multipathed one. Here is an example of my "filter" line, allowing only the "rootvg" device and the multipath device:

filter = [ "a|^/dev/mapper/data|", "a|^/dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0|", "r/.*/" ]
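
After changing the filter it is worth checking that LVM scans only the intended devices; standard LVM commands, output differs per system (at this stage only the local rootvg PV should show up; once the multipath device and the clustered PV are created below, they should appear exactly once, with no "Duplicate PV found" warnings):

# pvscan
# pvs -o pv_name,vg_name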

Multipath is not installed as part of the minimal installation, so add it on both nodes:

# yum install --enablerepo=updates device-mapper-multipath
# /etc/init.d/multipathd start
# chkconfig --add multipathd
# chkconfig multipathd on

My copy of /etc/multipath.conf:

defaults {
        user_friendly_names yes
        flush_on_last_del       yes
        queue_without_daemon    no
        no_path_retry           fail
}

# Local disks in UCS:
blacklist {
        device {
                vendor "LSI"
                product "UCSB-MRAID12G"
        }
}

devices {
        device {
                vendor                  "HITACHI"
                product                 "*"
                path_checker            "directio"
                path_grouping_policy    "multibus"
                path_selector           "service-time 0"
                failback                "immediate"
                rr_weight               "uniform"
                rr_min_io_rq            "128"
                features                "0"
        }
}

multipaths {
        multipath {
                wwid 360060e800756ce00003056ce00008148
                alias   data
        }
}

As you can see, my LUN with that WWID will appear as /dev/mapper/data, exactly as written in the LVM filter line.

Rescan the multipath devices with the "multipath" command and check that "/dev/mapper/data" has been created.
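
For example (standard multipath-tools commands; the -ll listing shows the paths behind the "data" alias):

# multipath -r
# multipath -ll data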

Enable the LVM cluster features on both nodes and start clvmd:

# lvmconf --enable-cluster
# /etc/init.d/clvmd start
# chkconfig --add clvmd
# chkconfig clvmd on

Create the PV, the clustered VG and an LV on one node:

vorh6t01:~ # pvcreate --dataalignment 4k /dev/mapper/data
  Physical volume "/dev/mapper/data" successfully created
vorh6t01:~ # vgcreate -c y datavg /dev/mapper/data
  Clustered volume group "datavg" successfully created
vorh6t01:~ # lvcreate -n export -L 20g /dev/datavg
  Logical volume "export" created

Check with the pvs, vgs and lvs commands that everything is visible on the second node too.
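
For example, on the second node the clustered VG should appear with the "c" bit set in the vgs attribute column (a minimal check, output omitted here):

vorh6t02:~ # vgs -o vg_name,vg_attr datavg
vorh6t02:~ # lvs datavg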

GFS2 Settings

Create the GFS2 file system on one node as follows:

vorh6t01:~ # mkfs.gfs2 -p lock_dlm -t vorh6t0x:export -j 2 /dev/datavg/export
This will destroy any data on /dev/datavg/export.
It appears to contain: symbolic link to `../dm-7'

Are you sure you want to proceed? [y/n] y

Device:                    /dev/datavg/export
Blocksize:                 4096
Device Size                20.00 GB (5242880 blocks)
Filesystem Size:           20.00 GB (5242878 blocks)
Journals:                  2
Resource Groups:           80
Locking Protocol:          "lock_dlm"
Lock Table:                "vorh6t0x:export"
UUID:                      ae65b8eb-997c-9a3f-079d-092d7d07d2ae

where vorh6t0x is the cluster name, export is the file system name, and -j 2 creates two journals because we have two nodes.
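
If a third node joins the cluster later, the file system will need one more journal; gfs2-utils provides gfs2_jadd for this, run against the mounted file system (just a sketch, not needed for the two-node setup described here):

# gfs2_jadd -j 1 /export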

Then, mount it:

# mkdir /export
# mount -o noatime,nodiratime -t gfs2 /dev/datavg/export /export
# echo "/dev/datavg/export   /export  gfs2   noatime,nodiratime   0 0" >> /etc/fstab
# chkconfig --add gfs2 ; chkconfig gfs2 on

The /etc/init.d/gfs2 script, part of gfs2-utils, will mount/umount the GFS2 file systems listed in /etc/fstab at the appropriate time: after the cluster has started and before it goes down.

Testing

Reboot the nodes and check that "/export" is mounted after the reboot. If it is not, re-check that you have the correct line in /etc/fstab and the correct "filter" in /etc/lvm/lvm.conf; all of these were described above.
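
A quick post-reboot check, in addition to /var/log/messages (output omitted, it depends on your system):

# clustat
# mount -t gfs2
# df -h /export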

Another configuration option

The previous configuration implies that GFS2 (and CLVM) take care of themselves without cluster intervention; the cluster software supplies infrastructure only.

The second configuration implements GFS2 as a cluster service. Therefore the gfs2 and clvmd system services should be disabled and /etc/fstab should not include the GFS2 lines.

# chkconfig --del gfs2
# chkconfig --del clvmd
# chkconfig --add rgmanager ; chkconfig rgmanager on
# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="vorh6t0x" config_version="3">

  <fence_daemon post_fail_delay="20" post_join_delay="30" clean_start="1" />
  <cman two_node="1" expected_votes="1" transport="udpu" />
  <clusternodes>
    <clusternode name="vorh6t01.domain.com" votes="1" nodeid="1">
      <fence>
        <method name="single">
                <device name="ucsfence" ipaddr="daucs01p" port="vorh6t01" action="off" />
        </method>
      </fence>
    </clusternode>
    <clusternode name="vorh6t02.domain.com" votes="1" nodeid="2">
      <fence>
        <method name="single">
                <device name="ucsfence" ipaddr="daucs02p" port="vorh6t02" action="off" />
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
        <fencedevice name="myfence" agent="fence_manual" />
        <fencedevice name="ucsfence"
                agent="fence_cisco_ucs"
                login="FENCEUSER"
                passwd="FENCEUSERPASS"
                ssl="on"
                suborg="org-YourSubOrgString"
                />
  </fencedevices>

  <rm>
        <resources>
                <script file="/etc/init.d/clvmd" name="clvmd"/>
                <clusterfs name="export"
                        device="/dev/datavg/export"
                        fstype="gfs2"
                        mountpoint="/export"
                        options="noatime,nodiratime"
                        force_unmount="1"
                        />
        </resources>
    <failoverdomains>
        <failoverdomain name="node01" restricted="1">
                <failoverdomainnode name="vorh6t01.domain.com"/>
        </failoverdomain>
        <failoverdomain name="node02" restricted="1">
                <failoverdomainnode name="vorh6t02.domain.com"/>
        </failoverdomain>
    </failoverdomains>
        <service name="mount01" autostart="1" recovery="restart" domain="node01">
                <script ref="clvmd">
                        <clusterfs ref="export"/>
                </script>
        </service>
        <service name="mount02" autostart="1" recovery="restart" domain="node02">
                <script ref="clvmd">
                        <clusterfs ref="export"/>
                </script>
        </service>
  </rm>
</cluster>
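
After distributing this file (again increment config_version and run cman_tool version -r), rgmanager should start one mount service per node. A minimal check and, if needed, a manual start with the standard rgmanager tools:

# clustat
# clusvcadm -e mount01 -m vorh6t01.domain.com
# clusvcadm -e mount02 -m vorh6t02.domain.com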

Problems

I still cannot find a solution for the situation when one node loses its connection to the LUN. A well-tuned multipath passes the hardware failure up to clvmd, but then LVM hangs for some reason. This also hangs the DLM protocol between the nodes, therefore ALL GFS2 nodes hang too. Turning off the "bad" node _may_ release the rest of the nodes, but usually only a total reboot solves the hang.

That is why the second configuration option was tested (the cluster takes care of the GFS2 mounting). The idea was that the cluster would fence the "bad" node and restore GFS2 functionality on the surviving nodes. But this does not happen: the hanging LVM totally breaks cluster functionality.

Mounting a snapshot/replica on a single node without clusterware

The steps are:

  1. Change the LVM VG mode from clustered to single
  2. Activate the VG
  3. Mount GFS2 in single (lock_nolock) mode:
vorh6t03:~ # vgchange --config 'global {locking_type = 0}' -c n datavg
  WARNING: Locking disabled. Be careful! This could corrupt your metadata.
  Volume group "datavg" successfully changed
vorh6t03:~ # vgchange -ay datavg
  1 logical volume(s) in volume group "datavg" now active
vorh6t03:~ # mount -t gfs2 /dev/datavg/export /export/ -o lockproto=lock_nolock,ignore_local_fs
vorh6t03:~ # df /export
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/datavg-export  100G  668M  100G   1% /export

Reverting changes:

vorh6t03:~ # umount /export
vorh6t03:~ # vgchange -an datavg
  0 logical volume(s) in volume group "datavg" now active
vorh6t03:~ # vgexport datavg
  Volume group "datavg" successfully exported

Updated on Wed Mar 25 13:55:42 IST 2015