HA cluster for SAP HANA2 on SUSE 15 SP1 (POWER architecture)

Prerequisites

You have two LPARs installed for resilience in different locations according to SAP requirements. You can use my LPAR installation guide for SAP HANA. Using my autoyast file will be enough to meet SAP requirements.
Both nodes of the future cluster have their own disks for the HANA database.
There is a small LUN (about 100 MB in size) mapped to both nodes for use as SBD.
Both nodes are connected by a separate network segment that will be used for replication.
SAP HANA database software is installed with an empty demo database.

NOTE:
Prompt node1# means the command should be executed as root on node1.
Prompt node2# means the command should be executed as root on node2.
Prompt # means the command should be executed as root on any node.
Any other prompt means the command should be run as the user described in the text.

Node preparation

The preparation should be done on both nodes.

There are several points to consider before installing a cluster:

UIDs and GIDs

User and group IDs should be the same on all nodes. This is less important for our particular case, but it is very important for services with shared storage, either active-active or failover.
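A quick way to spot mismatches is to compare the ID column of /etc/passwd (or /etc/group) from both nodes. This is only a sketch; id_diff is a hypothetical helper name, and it relies on the fact that both files keep the name in field 1 and the numeric ID in field 3.

```shell
# id_diff FILE1 FILE2
# Prints "name id_in_FILE1 id_in_FILE2" for every name that exists in both
# files but has a different numeric ID (field 3 of passwd or group format).
id_diff() {
    awk -F: 'NR == FNR { id[$1] = $3; next }
             ($1 in id) && id[$1] != $3 { print $1, id[$1], $3 }' "$1" "$2"
}

# Possible usage on node1 (as root):
#   ssh node2 cat /etc/passwd > /tmp/passwd.node2
#   id_diff /etc/passwd /tmp/passwd.node2
```

Empty output means that every account present on both nodes has the same ID.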

NTP

Both cluster nodes must be synchronized in time. It is good to keep all your servers synchronized in time in general.

SUSE 15 SP1 uses the chronyd time synchronization tool by default. Edit the /etc/chrony.conf file to fix the server line:

 ..
server <YOUR NTP SERVER NAME OR IP>
 ..

Then enable and start the chronyd daemon:

# systemctl stop chronyd
# systemctl enable --now chronyd
# chronyc sources

The last command shows whether the daemon is actually talking to the configured NTP servers. Each source line should show a "Reach" value that grows over time (up to 377) and non-zero offsets in the "Last sample" column.
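If you want to script this check, the output of chronyc sources can be filtered. The sketch below assumes the usual column layout (state, name, stratum, poll, reach) and that the currently selected source is marked with ^*; chrony_reach and chrony_synced are hypothetical helper names.

```shell
# Both helpers read the output of "chronyc sources" from standard input.

# Print "name reach" for every configured server line (lines starting with ^).
chrony_reach() {
    awk '$1 ~ /^\^/ { print $2, $5 }'
}

# Succeed only if one source is marked as currently selected (^*).
chrony_synced() {
    grep -q '^\^\*'
}

# Possible usage:
#   chronyc sources | chrony_reach
#   chronyc sources | chrony_synced && echo "time is synchronized"
```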

Name resolution

Name resolution in SUSE 15 is managed by wicked, and it is hard to make it leave the legacy /etc/resolv.conf alone. Several steps are needed:

# sed -e 's/^NETCONFIG_DNS_POLICY.*/NETCONFIG_DNS_POLICY=/' -i /etc/sysconfig/network/config
# netconfig update -f

The above commands should convince wicked to stop messing with /etc/resolv.conf. In practice, this was not enough, since the file was still a link to a wicked-managed file. Restoring a classic file solves the problem:

# rm -f /etc/resolv.conf
# vi /etc/resolv.conf

Put the usual content into the file: the search and nameserver definitions.
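For illustration, the content can also be generated instead of typed. This is a minimal sketch; make_resolv_conf is a hypothetical helper, and the domain and addresses in the usage line are placeholders from documentation ranges, not values from my setup.

```shell
# make_resolv_conf SEARCH_DOMAIN NAMESERVER...
# Prints a classic resolv.conf: one search line, one nameserver line per server.
make_resolv_conf() {
    printf 'search %s\n' "$1"
    shift
    for ns in "$@"; do
        printf 'nameserver %s\n' "$ns"
    done
}

# Possible usage:
#   make_resolv_conf example.com 192.0.2.53 192.0.2.54 > /etc/resolv.conf
```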

Even with well-functioning DNS, put the IP addresses of all hosts in /etc/hosts, including the IP addresses of the replication segment (use different names for them). Replicate the /etc/hosts to the second node.
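To make sure nothing was forgotten, the hosts file can be checked for every name the cluster will use. The sketch below is one way to do it; missing_hosts is a hypothetical helper, and names such as node1-repl in the usage line are only examples of the separate replication names.

```shell
# missing_hosts HOSTS_FILE NAME...
# Prints every NAME that does not appear as a host name or alias in HOSTS_FILE.
# Comments (everything after #) are ignored.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file" || echo "$name"
    done
}

# Possible usage on each node:
#   missing_hosts /etc/hosts node1 node2 node1-repl node2-repl
```

Empty output means that all listed names are defined.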

Host SSH keys

This is a general recommendation not related to SAP or even SUSE. Let's imagine a virtual IP that will follow the active cluster node. SSH to this IP will always fail after a cluster failover due to changes in the SSH keys of the active host. Good practice here is to use the same SSH host keys on all nodes of the cluster.

node1# rsync -av /etc/ssh/ssh_host_* node2:/etc/ssh/

Multipath configuration

If there are multipath drives shared between cluster nodes (at least in our case, this is SBD), the following changes should be applied to the multipath daemon:

# cat /etc/multipath.conf
defaults {
	no_path_retry		fail
	queue_without_daemon	no
	flush_on_last_del	yes
}
 ..

In a default configuration, the multipath daemon tolerates failed paths, even the last one. It hopes that one of the paths will return soon and does not report a faulty device to the kernel. This is probably good in a stand-alone server configuration, but it causes the cluster to ignore storage subsystem failures. The changes above correct this behavior.

Setup SBD

The STONITH block device acts as a message box between cluster nodes. This adds another message channel besides the network one. The difference is that corosync does not use this channel; it carries only fencing messages.

A special service is needed to read the message and decide on suicide. The message sent by the initiator node does not guarantee that the partner will read it and act accordingly. This leads to the idea of using a watchdog. Although Intel physical servers have a hardware watchdog timer, this is not the case in Power and virtualized environments. Therefore, you should use a software watchdog timer:

# echo softdog > /etc/modules-load.d/watchdog.conf
# systemctl restart systemd-modules-load
# lsmod | grep dog
softdog                16384  0

OPTIONAL: Define an alias in /etc/multipath.conf to get a shorter device name. This may be useful later, when you resume the cluster and need to refer to the device by a handy name.
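An alias is declared in a multipaths section that maps the WWID of the LUN to a name. The sketch below only prints such a section; multipath_alias_stanza is a hypothetical helper, and the WWID is whatever your SBD LUN reports (shown here as a placeholder).

```shell
# multipath_alias_stanza WWID ALIAS
# Prints a multipaths section that gives the device WWID the name ALIAS,
# so it shows up as /dev/mapper/ALIAS.
multipath_alias_stanza() {
    printf 'multipaths {\n'
    printf '\tmultipath {\n'
    printf '\t\twwid\t%s\n' "$1"
    printf '\t\talias\t%s\n' "$2"
    printf '\t}\n'
    printf '}\n'
}

# Possible usage (review the output before adding it to /etc/multipath.conf):
#   multipath_alias_stanza <WWID of the SBD LUN> sbd
```

With the alias sbd the device appears as /dev/mapper/sbd, which is the name used in the rest of this article.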

It is possible to initialize and configure SBD manually, but the cluster initialization script will do it better. However, you must enable the SBD service yourself, because the script will not do this:

# systemctl enable sbd.service

NOTE: These actions should be done on both nodes.

Initial cluster configuration

Use the SUSE provided ha-cluster-init script on the first node to begin configuring the cluster.

node1# ha-cluster-init -u -n CLUSTERNAME

-u forces the cluster to use unicast interconnect instead of multicast. Most production customer networks do not support multicast, so use -u anytime and anywhere.
-n CLUSTERNAME sets the cluster name to CLUSTERNAME. It is likely that your network will have more than one cluster, so give the cluster a name. This is less important in unicast mode, but it builds the useful habit of giving each cluster its own name.

The script is interactive and will ask you some questions.

/root/.ssh/id_rsa already exists - overwrite (y/n)? n

This question means that the script tried to generate an SSH key pair and discovered that one already exists. An existing pair usually means that it is already being used for some purpose. Therefore, the correct answer is n.

 Address for ring0 [xxx.xxx.xxx.xxx]

There is no question mark, but the script is waiting for your input. If the proposed IP is suitable for the main network (ring0), just press Enter. If the proposal is incorrect, it's time to specify the IP address of another interface.

/etc/pacemaker/authkey already exists - overwrite (y/n)? y

This question may arise if you have already set up a cluster and now repeat the action. In my case, I want to start over, so I overwrite the existing key by answering y.

Do you wish to use SBD (y/n)? y
  Path to storage device ...cutted output... []/dev/mapper/sbd
WARNING: All data on /dev/mapper/sbd will be destroyed!
Are you sure you wish to use this device (y/n)? y

Answer yes to use SBD and give the device name (/dev/mapper/36XXXX.. or a short alias if one is set in multipath.conf). Agree to initialize the device.

Do you wish to configure a virtual IP address (y/n)? n

We will configure two VIPs later and assign one for the active HANA database and the other for the secondary. Since we want to use a secondary database for read-only queries, the second VIP can be useful.

Join node2 to cluster

node2# ha-cluster-join

This script will ask fewer questions.

 ..
 IP address or hostname of existing node ... []node1
 ..
/root/.ssh/id_rsa already exists - overwrite (y/n)? n
 ..
 Address for ring0 [xxx.xxx.xxx.xxx]<Press Enter here>

The answers are explained in the node1 part above.

The result of both scripts is a working cluster, as shown:

# crm status
Stack: corosync
Current DC: node1

2 nodes configured
1 resource configured

Online: [ node1 node2 ]

Full list of resources:
 stonith-sbd   (stonith:external/sbd): Started node1

Security concern

During cluster initialization, a hacluster user is created with a default password. This user is very powerful, especially when connected to the HAWK interface, which is also widely available after cluster initialization. Set a strong password for the user:

node1# passwd hacluster
node2# passwd hacluster

Change SBD behaviour

The cluster for the HANA database is slightly different from a common cluster. After a failover, the former primary database becomes secondary, registers itself with the promoted database and resumes data replication in the opposite direction. It is not possible to repeat a cluster failover until the database is fully synchronized, otherwise data loss may occur. To prevent cluster ping-pong, the SBD_STARTMODE option may be useful. Edit the SBD configuration file and set the option to clean.

node1# vi /etc/sysconfig/sbd
SBD_STARTMODE=clean

Synchronize the file between nodes. The csync2 tool can be used for this job:

node1# csync2 -xv
Marking file as dirty: /etc/sysconfig/sbd
 ..
Updating /etc/sysconfig/sbd on node2
 ..
Finished with 0 errors.

As a result, the cluster software will not start on the failed server until manual intervention. Once the HANA database administrator has verified that the data replication is working properly, the block can be removed and the cluster can be resumed using the following commands (assuming node1 is the failed node):

node1# sbd -d /dev/mapper/sbd message LOCAL clear
node1# systemctl start pacemaker.service
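Before clearing the block, it can help to look at what is actually written in the node's slot. The sketch below parses the output of sbd list, assuming the column layout that the command prints in the recovery section later in this article (slot number, node name, message, sender); sbd_slot_state is a hypothetical helper name.

```shell
# sbd_slot_state NODE
# Reads "sbd -d DEVICE list" output from standard input and prints the
# message stored in NODE's slot (for example "clear" or "reset").
# Fails if NODE has no slot on the device.
sbd_slot_state() {
    awk -v node="$1" '$2 == node { print $3; found = 1 }
                      END { exit !found }'
}

# Possible usage on the failed node:
#   sbd -d /dev/mapper/sbd list | sbd_slot_state node1
```

Anything other than clear in the slot means that, with SBD_STARTMODE=clean, the cluster software will refuse to start on that node.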

Adding redundant network ring1 to corosync

Since we have a replication network between our nodes, it can be added as a redundant network in corosync. This seems like a good idea, but it causes corosync to ignore primary network failures. After a series of tests, I dropped the idea of adding a redundant ring to corosync.

Adding HMC fencing

As mentioned earlier, SBD is not a real fencing device, it is more like a message box, and the fencing operation depends on the partner’s ability to read the message and follow the order. The real fencing device does not depend on the state of the node. A good fencing device in a POWER environment is the HMC. It is good to configure both HMCs, if possible.

There are two fencing agents available for the HMC: hmchttp, which works over HTTPS using a username and password, and ibmhmc, which works over SSH and can use a passwordless connection based on a key exchange. I chose the second method, which does not use passwords.

Find the public SSH key:

# cat .ssh/id_rsa.pub
ssh-rsa Very..very..Long..String Cluster Internal

The cluster script replicates SSH keys between nodes, so the key is the same on both nodes. Log in to the first HMC as hscroot and add the key to its authorized keys:

~> mkauthkeys -a "ssh-rsa Very..very..Long..String Cluster Internal"

Repeat the same on the second HMC.

Important! You must approve the SSH host fingerprint on both nodes for both HMCs. Otherwise the fencing agent will fail on the yes/no question.

node1# ssh -l hscroot <IP-ADDR-HMC1>
node1# ssh -l hscroot <IP-ADDR-HMC2>
node2# ssh -l hscroot <IP-ADDR-HMC1>
node2# ssh -l hscroot <IP-ADDR-HMC2>

Please verify that all four combinations allow a passwordless SSH connection to the HMC.

Now it is time to define the fencing itself:

# crm configure primitive fence-hmc1 stonith:ibmhmc params ipaddr="xxx.xxx.xxx.xx1"
# crm configure primitive fence-hmc2 stonith:ibmhmc params ipaddr="xxx.xxx.xxx.xx2"

Now we have three fencing devices. The SUSE cluster will use one of them and will not continue to other devices if the first action was successful. As we have already said, writing a message on an SBD device is almost always successful, although it may not really fence a node. We must make sure that at least one HMC fence is executed. This is achieved using the fencing_topology definition. The comma in the list of fencing devices acts as the AND operator, and the space as the OR operator.

# stonith_admin -l node2
 stonith-sbd
 fence-hmc1
 fence-hmc2
3 devices found
# crm configure fencing_topology stonith-sbd,fence-hmc1 fence-hmc2
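To make the comma and space rule concrete, the sketch below spells out how I read the topology arguments: each space-separated group is one level, levels are tried in order, and all devices inside a level must succeed. explain_topology is a hypothetical helper that only prints text and does not touch the cluster.

```shell
# explain_topology GROUP...
# Each argument is one fencing level; devices joined by commas inside an
# argument must all succeed for that level to count as successful.
explain_topology() {
    level=1
    for group in "$@"; do
        printf 'level %d: %s\n' "$level" \
            "$(printf '%s' "$group" | sed 's/,/ AND /g')"
        level=$((level + 1))
    done
}

# Possible usage with the definition from this article:
#   explain_topology stonith-sbd,fence-hmc1 fence-hmc2
```

For the definition above this means: first try SBD together with the first HMC, and only if that level fails, try the second HMC.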

Change cluster quorum policy

A two-node cluster is a special kind of cluster. Even with fencing configured correctly, the surviving node can never form a majority on its own, so the cluster loses quorum as soon as one node fails. Therefore, the default quorum policy should be adapted to this fact.

# crm configure property no-quorum-policy=ignore

It's also time to turn on STONITH (or fencing):

# crm configure property stonith-enabled=true

Configure HANA replication

Most of the following commands are executed by the administrator of the HANA instance. The HANA database is identified by the SID, made up of three uppercase characters, and a two-digit value called the instance number. Location and file names are often a combination of hostname, SID, and instance number. Some environment variables hold these HANA values; for example, you can find the SID in $SAPSYSTEMNAME and the instance number in $TINSTANCE. The SID admin user, named <sid>adm (here the SID is in lower case), is created during installation. Become <sid>adm:

node1# su - <sid>adm
node1>

A HANA instance can be started automatically when the server boots. This is not suitable for a cluster configuration, so you should disable it. A parameter named "Autostart" appears in the instance profile and must be equal to zero. Use the handy alias "cdpro" to go directly to the profile location and check the Autostart parameter:

node1> cdpro
node1> grep Auto <SID>_HDB<instance>_<hostname>
Autostart = 0
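Since the same check is repeated on the second node, it can be wrapped in a small function. This is only a sketch; autostart_value is a hypothetical helper that prints the value of the Autostart line of a given profile file.

```shell
# autostart_value PROFILE_FILE
# Prints the value assigned to Autostart in the instance profile,
# with surrounding whitespace removed. Prints nothing if the
# parameter is not present in the file.
autostart_value() {
    awk -F= '/^[[:space:]]*Autostart[[:space:]]*=/ {
                 gsub(/[[:space:]]/, "", $2)
                 print $2
             }' "$1"
}

# Possible usage as <sid>adm, after cdpro:
#   autostart_value <SID>_HDB<instance>_<hostname>
```

Any output other than 0 means the instance would start at boot and the profile should be corrected.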

The next step is to create a full backup. HANA 2 works in multitenant mode, and backups can be performed for any particular part of the instance. We need a backup of the entire instance, so the backup statement includes the FOR FULL SYSTEM option, and the connection is made to the system database with the -d SYSTEMDB option.

node1> hdbsql -u SYSTEM -i <instance num> -d SYSTEMDB "BACKUP DATA FOR FULL SYSTEM USING FILE ('FIRSTFULL')"

The resulting files will be created in the $DIR_INSTANCE/backup/data directory. If you have free space elsewhere, indicate the full path in the previous command.

Replication must use the replication network; this is controlled by the system_replication_hostname_resolution section in the global.ini file. The file can be updated online using the HANA Studio tool or similar. I did not have such a tool available, so I shut down the database, changed the file directly, and then brought the database back up. The file is located in the directory for custom configuration; you can go there with the handy alias "cdcoc":

node1> HDB stop
node1> cdcoc
node1> vi global.ini

While we are editing the global.ini file, there are other nice options to include, like traffic compression:

 ..
[system_replication]
enable_log_compression = true
enable_data_compression = true
enable_log_retention = auto

[system_replication_communication]
listeninterface = .internal

[system_replication_hostname_resolution]
<Replication IP of node1 in form XXX.XXX.XXX.XXX> = node1
<Replication IP of node2 in form XXX.XXX.XXX.XXX> = node2

Save the file and start the database:

node1> HDB start

NOTE: It is possible to update the configuration online with the ALTER SYSTEM statement through hdbsql. Here is an example of such a command:
node1> echo "ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'System') SET ('system_replication_communication', 'listeninterface') = '.internal' WITH RECONFIGURE" | hdbsql -u SYSTEM -i <instance num> -d SYSTEMDB

Enable replication using the hdbnsutil command:

node1> hdbnsutil -sr_enable --name=PRODSITE
node1> netstat -tlnp

Check the output of the last command to verify that the replication processes are listening on the replication IP.

Secondary database

Perform almost the same actions as on the primary database:

node2# su - <sid>adm
node2> cdpro
node2> grep Auto <SID>_HDB<instance>_<hostname>
Autostart = 0
node2> HDB stop
node2> cdcoc
node2> vi global.ini

Make exactly the same changes in the global.ini file.

You must copy the SSFS encryption keys for successful replication. Do this as root, since root is already configured for passwordless actions between cluster nodes:

node2# rsync -av node1:/usr/sap/<SID>/SYS/global/security/rsecssfs/ /usr/sap/<SID>/SYS/global/security/rsecssfs/

NOTE: When XSA is in use, its own SSFS keys should also be copied over to node2:

node2# rsync -av node1:/usr/sap/<SID>/SYS/global/xsa/security/ssfs/ /usr/sap/<SID>/SYS/global/xsa/security/ssfs/
node2# su - <sid>adm
node2> cdcoc
node2> cat xscontroller.ini
[communication]
default_domain = <FQDN of primary cluster VIP>
api_url = https://<FQDN of primary cluster VIP>:30030

Replicate the xscontroller.ini to node1 once the cluster is set up.

node2# su - <sid>adm
node2> hdbnsutil -sr_register \
	--remoteHost=node1 \
	--remoteInstance=<instance num> \
	--replicationMode=syncmem \
	--operationMode=logreplay_readaccess \
	--name=DRSITE
node2> HDB start

Add HANA resource to cluster

SUSE provides two resource agents that help manage HANA in a high availability cluster. The first is SAPHanaTopology, which tracks and understands the current state of HANA database. The second is SAPHana, which actually handles database switching.

The agent can work with the hdbsql interface, which requires a lot of preparation of the database itself. It can also use the systemReplicationStatus.py script, which is located in /hana/shared/<SID>/HDB<instance>/exe/python_support. This option is preferred. The script is available starting from SPS9; if it is missing, then perhaps some part of the HANA software is not installed.

An easy way to configure is to use the wizard that comes with the HAWK. You can connect by browser to any node via https to port 7630. I prefer to use the mobaxterm feature to forward graphics through the SSH tunnel, so I just do:

# firefox https://localhost:7630

Log in with user hacluster - do you remember the password you set earlier? Then go to CONFIGURATION -> Wizards -> SAP -> SAP HANA SR Scale-Up Performance-Optimized. Fill the form: enter the HANA SID, instance number and virtual IP address that will follow the primary database. Click Verify, check the contents of the proposal and click Apply.

Add VIP for secondary database

Since we are setting up replication with read-only request access, we need an additional IP address for this service.

# crm configure primitive rsc_ip_<SID>_RO IPaddr2 params ip=xxx.xxx.xxx.xxx cidr_netmask=24
# crm configure colocation col_saphana_ip_<SID>_RO 2000: rsc_ip_<SID>_RO:Started msl_SAPHana_<SID>_HDB<instance>:Slave

Fail over cluster node and recovery

If you have followed the procedure so far, the Master database is on node1. Start the cluster monitor on node2, which holds the secondary HANA database (displayed as a slave in the cluster state).

node2# crm_mon

Let's stop node1's network:

node1# ifdown eth0

On the monitor screen, you will see how quickly node2 fences node1 and promotes itself to primary.

Recovering the fenced node

Remember that we configured SBD to keep a fenced node from rejoining the cluster, so it will never appear online in the cluster monitor without our intervention.

Log in to the fenced node (this is node1 in the context of my article) and make sure that the cluster services are not working:

node1# crm status
ERROR: status: crm_mon (rc=102): Error: cluster is not available on this node

Good.
Let's register the HANA database for replication, start it and wait for full synchronization.

node1# su - <sid>adm
node1> hdbnsutil -sr_register \
        --remoteHost=node2 \
        --remoteInstance=<instance num> \
        --replicationMode=syncmem \
        --operationMode=logreplay_readaccess \
        --name=PRODSITE
node1> HDB start

Check the replication status using HANA tools until the database is in synchronized state.
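One possible tool for this check is the systemReplicationStatus.py script mentioned earlier, run on the current primary (node2 after the failover). The sketch below translates its return code into text. The mapping is an assumption based on the return codes commonly documented for this script, so verify it against your HANA revision; sr_status_text is a hypothetical helper name, and the invocation in the comment is likewise an assumed form.

```shell
# sr_status_text RETURN_CODE
# Translates the exit code of systemReplicationStatus.py into a short text.
# The code-to-state mapping is an assumption, check it for your revision.
sr_status_text() {
    case "$1" in
        10) echo "no system replication" ;;
        11) echo "error" ;;
        12) echo "unknown" ;;
        13) echo "initializing" ;;
        14) echo "syncing" ;;
        15) echo "active" ;;
        *)  echo "unexpected return code $1" ;;
    esac
}

# Possible usage as <sid>adm on the current primary:
#   cd /hana/shared/<SID>/HDB<instance>/exe/python_support
#   python systemReplicationStatus.py > /dev/null; sr_status_text $?
```

Under this assumption, the database is ready for the cluster only when the result is active.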

Once you are ready to bring the node back into the cluster, remove the SBD block and start the cluster services:

node1# sbd -d /dev/mapper/sbd list
0	node1	reset	node2
1	node2	clear
node1# sbd -d /dev/mapper/sbd message LOCAL clear
node1# sbd -d /dev/mapper/sbd list
0	node1	clear	node1
1	node2	clear
node1# systemctl start pacemaker.service

The cluster will start HANA resource agents that detect normal database behavior and will not try to stop or start HANA. The cluster will only start the missing secondary IP address and begin to monitor the status of the cluster.

Updated on Sun May 24 16:11:15 IDT 2020