Diskless SUSE

The scope of the POC

Having discovered the world of HPC, I ran into the problem of installing many identical nodes and maintaining them. Of course, you can install all of them using the same kickstart (or AutoYaST, in SUSE's case). Further management can be done with Ansible, Puppet, or Salt.

But what if it were possible to create a diskless installation? The benefits are obvious.

As for the disadvantages, they surfaced during testing.

Unlike many of my POCs, this description was written after the testing itself. During the experiments, several problems were discovered and a couple of ideas were added. The grandest was the idea of mounting an overlay filesystem as root. The shared, read-only root file system thus becomes a template for each node. Locally written files are stored in the node's memory and are lost on poweroff or reboot. As a downside, the overlay's lower (base) filesystem cannot be changed on the fly without breaking the filesystem. This means that every maintenance of the shared root file system requires a reboot of all connected nodes.

Also, a bug was discovered in the dracut NFS-root module. A patch for it, including the overlay solution, has been added to the body of the article. I'll describe the entire flow as a proven working recipe and add a chapter describing how the dracut bug was found.

In this POC I will build a PXE server; an NFS server, which can be replaced with any existing reliable NFS server (e.g. NetApp) but lives on the same PXE server here for simplicity; and a management server, also placed on the PXE server for simplicity.

Here is the same procedure for Red Hat 8.5

First, I deployed a minimal installation of SUSE 15 SP3 and added a second NIC to it, connected to an isolated network. This network will be used for PXE boot and NFS traffic. I gave the NIC an IP address on a Class B private network.
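For concreteness, here is what the isolated NIC configuration could look like on SUSE. The interface name eth1 and the 172.16.0.0/16 addressing are assumptions for illustration; adjust them to your network:

```
# /etc/sysconfig/network/ifcfg-eth1 (addresses are hypothetical)
BOOTPROTO='static'
IPADDR='172.16.0.1/16'
STARTMODE='auto'
```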

NFS server

If you are installing an NFS server on SUSE 15 SP3, you will need to install the following packages:

# zypper in yast2-nfs-server nfs-kernel-server

We'll use NFS version 3 for simplicity, so we'll disable v4. Create the NFS export and set its permissions:

# mkdir -p /srv/nfs/root
# echo "/srv/nfs/root <management-ip>(rw,no_root_squash,sec=sys) <client-network>(ro,no_root_squash,sec=sys)" > /etc/exports
# sed -e 's/^NFS4_SUPPORT=.*/NFS4_SUPPORT="no"/' -i /etc/sysconfig/nfs
# systemctl enable --now rpcbind.service nfs-server.service
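If you want to see what the sed expression above does before touching the real /etc/sysconfig/nfs, you can try it on a scratch copy (the temporary file here is a stand-in for the real config):

```shell
# Scratch copy imitating the relevant line of /etc/sysconfig/nfs
tmp=$(mktemp)
echo 'NFS4_SUPPORT="yes"' > "$tmp"

# Same substitution as above: force NFSv4 support off
sed -e 's/^NFS4_SUPPORT=.*/NFS4_SUPPORT="no"/' -i "$tmp"

cat "$tmp"    # NFS4_SUPPORT="no"
rm -f "$tmp"
```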

Only the management server has read/write permission for this export.
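For reference, a concrete /etc/exports implementing that policy could look like this (the 172.16.0.1 management address and the 172.16.0.0/16 client network are illustrative; substitute your own):

```
/srv/nfs/root 172.16.0.1(rw,no_root_squash,sec=sys) 172.16.0.0/16(ro,no_root_squash,sec=sys)
```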

Management server

Install the packages needed for NFS mounts, mount the NFS root volume, and do an initial install on it, like so:

# zypper in nfs-client
# systemctl enable --now nfs-mountd.service
# mkdir /mnt/net
# mount <nfs-server-ip>:/srv/nfs/root /mnt/net
# zypper --installroot /mnt/net install -y patch rsync vim kernel-default nfs-client +pattern:base

Next, create a chroot helper script and enter the chroot:

# cat > /mnt/net/command_mount_chroot << EOF
mount -o bind /proc proc
mount -o bind /sys sys
mount -o bind /dev dev
chroot . /bin/bash -i
umount dev
umount sys
umount proc
EOF
# cd /mnt/net
# sh command_mount_chroot

Apply the patch to the dracut NFS module:

chroot# cd /usr/lib/dracut/modules.d/95nfs
chroot# patch -p1 << 'EOFpatch'
diff -Naur 95nfs/module-setup.sh 95nfs.patch/module-setup.sh
--- 95nfs/module-setup.sh       2022-01-11 16:21:14.000000000 +0200
+++ 95nfs.patch/module-setup.sh 2022-01-28 17:03:38.712711159 +0200
@@ -92,6 +92,7 @@
     inst_hook cmdline 90 "$moddir/parse-nfsroot.sh"
     inst_hook pre-udev 99 "$moddir/nfs-start-rpc.sh"
     inst_hook cleanup 99 "$moddir/nfsroot-cleanup.sh"
+    inst_hook mount 95 "$moddir/nfs-overlay-mount.sh"
     inst "$moddir/nfsroot.sh" "/sbin/nfsroot"
     inst "$moddir/nfs-lib.sh" "/lib/nfs-lib.sh"
     mkdir -m 0755 -p "$initdir/var/lib/nfs/rpc_pipefs"
diff -Naur 95nfs/nfs-overlay-mount.sh 95nfs.patch/nfs-overlay-mount.sh
--- 95nfs/nfs-overlay-mount.sh  1970-01-01 02:00:00.000000000 +0200
+++ 95nfs.patch/nfs-overlay-mount.sh    2022-01-28 17:01:54.755365530 +0200
@@ -0,0 +1,3 @@
+mkdir -p /run/{lower,upper,work}
+nfsroot lo $netroot /run/lower
+mount -t overlay overlay -o rw,lowerdir=/run/lower,upperdir=/run/upper,workdir=/run/work $NEWROOT
diff -Naur 95nfs/parse-nfsroot.sh 95nfs.patch/parse-nfsroot.sh
--- 95nfs/parse-nfsroot.sh      2022-01-11 16:21:14.000000000 +0200
+++ 95nfs.patch/parse-nfsroot.sh        2022-01-28 17:06:03.320999432 +0200
@@ -116,7 +116,7 @@
 # confused by having /dev/nfs[4]
-echo '[ -e $NEWROOT/proc ]' > $hookdir/initqueue/finished/nfsroot.sh
+#echo '[ -e $NEWROOT/proc ]' > $hookdir/initqueue/finished/nfsroot.sh
 mkdir -p /var/lib/rpcbind
 chown rpc:rpc /var/lib/rpcbind
EOFpatch
chroot# chmod 755 nfs-overlay-mount.sh

Create your first initrd with NFS root support:

chroot# dracut --no-hostonly --no-hostonly-cmdline --nofscks \
--add-drivers "virtio_pci overlay" \
--install more \
-m "nfs network base" \
--force /boot/sle15sp3nfs.ird $(basename $(ls -1d /lib/modules/* | tail -1))
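One caveat in the command above: `ls` sorts version strings lexically, so once two kernels are installed, `tail -1` may pick the wrong one (e.g. 5.3.9 sorts after 5.3.18). A sketch of the pitfall and a `sort -V` fix, using made-up kernel version directories in a scratch directory:

```shell
# Simulate two installed kernels in a scratch directory
tmp=$(mktemp -d)
mkdir -p "$tmp/5.3.18-24.102-default" "$tmp/5.3.9-24.5-default"

# Lexical sort: "9" > "1", so the older kernel comes last
ls -1 "$tmp" | tail -1               # 5.3.9-24.5-default

# Version sort picks the true newest kernel
ls -1 "$tmp" | sort -V | tail -1     # 5.3.18-24.102-default

rm -rf "$tmp"
```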

PXE server

Install the packages required for the PXE environment, namely a DHCP server and a TFTP server:

# zypper in dhcp-server yast2-dhcp-server yast2-tftp-server tftp syslinux

Allow the DHCP server to use only the second NIC:

# sed -e 's/^DHCPD_INTERFACE=.*/DHCPD_INTERFACE=eth1/' -i /etc/sysconfig/dhcpd

Here is an example of minimal DHCP server configuration:

# /etc/dhcpd.conf
allow booting;
allow bootp;
ddns-update-style none;
default-lease-time 14400;
deny unknown-clients;
ignore client-updates;
update-static-leases on;
get-lease-hostnames true;
use-host-decl-names on;

subnet <network> netmask <netmask> {
        #option domain-name "diskless.domain.com";
        #option domain-name-servers <dns-ip>;
        #option routers <gateway-ip>;
        #option ntp-servers <ntp-ip>;
        option subnet-mask <netmask>;
        filename        "pxelinux.0";
        pool {
                range dynamic-bootp <range-start> <range-end>;
                host dc1 {
                        hardware ethernet be:19:00:a0:d8:70;
                }
        }
}
Make sure the "dc1" MAC address reflects your actual hardware.

You can simplify the /etc/dhcpd.conf file by excluding the MAC addresses and per-client configuration. In that case, all nodes will have the same hostname (probably localhost), which is not very convenient. This can be fixed by resolving hostnames via DNS, but then you need to install a DNS server and configure all the names there. You may prefer that approach so you don't have to worry about MAC addresses at all.
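A simplified pool section without per-host entries might look like the fragment below (the address range is hypothetical; since the global section above denies unknown clients, the pool must explicitly allow them):

```
pool {
        allow unknown-clients;
        range dynamic-bootp 172.16.1.100 172.16.1.200;
}
```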

Now you are ready to start the DHCP and TFTP servers:

# systemctl enable --now dhcpd.service tftp.socket

Prepare the TFTP root as follows:

# ln -s /srv/tftpboot /
# cp -v /usr/share/syslinux/pxelinux.0 /srv/tftpboot/
# cp -v /usr/share/syslinux/vesamenu.c32 /srv/tftpboot/
# mkdir /srv/tftpboot/pxelinux.cfg
# cat > /srv/tftpboot/pxelinux.cfg/default << EOF
default vesamenu.c32
timeout 15

LABEL linux
  kernel sle15sp3nfs.krl
  append initrd=sle15sp3nfs.ird splash=none root=nfs:<nfs-server-ip>:/srv/nfs/root:vers=3,sec=sys,nolock
EOF

And finally, transfer the kernel and the newly created initrd from /mnt/net on the management server to /tftpboot on the PXE server:

# cp -fLv /mnt/net/boot/vmlinuz /tftpboot/sle15sp3nfs.krl 
# cp -fLv /mnt/net/boot/sle15sp3nfs.ird /tftpboot/
# chmod 644 /tftpboot/sle15sp3nfs*

Now you can boot your first "dc1" server from the network and check that everything works.

Working with the root filesystem, using SSH enablement as an example

All of the following is done on the management server!

You can set a root password. You probably don't need this if you enable passwordless login with an SSH key instead; a password can be useful for console login when the network is down. But if the network is down, there is no NFS root and no OS to run anyway.

# cd /mnt/net
# sh command_mount_chroot
chroot# passwd

SSH cannot be enabled with the "systemctl enable" command, since systemd is not running in the chrooted environment. You should enable it manually by creating the symbolic link, just as systemd does:

# cd /mnt/net/etc/systemd/system/multi-user.target.wants/
# ln -s /usr/lib/systemd/system/sshd.service
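The same symlink trick works for any unit you want enabled in the NFS root. A minimal sketch of the pattern, exercised on a scratch directory standing in for /mnt/net (note that `systemctl --root=/mnt/net enable sshd`, which performs an offline enable, should achieve the same result where available):

```shell
# Scratch stand-in for the NFS root (/mnt/net in the article)
root=$(mktemp -d)
wants="$root/etc/systemd/system/multi-user.target.wants"
mkdir -p "$wants"

# "Enable" sshd the way systemd would: a symlink into the wants dir
ln -s /usr/lib/systemd/system/sshd.service "$wants/sshd.service"

readlink "$wants/sshd.service"   # /usr/lib/systemd/system/sshd.service
rm -rf "$root"
```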

The host SSH keys are generated at node start and kept in the /etc/ssh/ directory, which is dropped at the next restart. If you need them to stay constant across reboots, place them on the root file system:

# rsync -av /etc/ssh/ /mnt/net/etc/ssh/

You can add passwordless SSH access from the management server to any node with a few simple commands:

# ssh-keygen -t rsa -b 2048
# mkdir -m700 /mnt/net/root/.ssh
# cat /root/.ssh/id_rsa.pub > /mnt/net/root/.ssh/authorized_keys

Just to remind you: changing the lower overlay filesystem will break all functionality, and every node will then need to be rebooted.

Hunting the dracut NFS bug

Just before applying the patch above, the boot would hang for a long time, then crash, creating /run/initramfs/rdsosreport.txt and suggesting adding rd.shell and rd.debug to the boot options. I added the recommended options but still couldn't read the debug file because the minimal tools were missing, so I rebuilt the image to include the more tool; you can see it in the final version as well. After examining /run/initramfs/rdsosreport.txt, I found that the boot process was stuck in the /lib/dracut/hooks/initqueue/finished/nfsroot.sh script, which checks for the existence of $NEWROOT/proc, assuming the new root filesystem is already mounted. But nothing is actually mounted in this phase, so the check is unnecessary and obviously failed.

Since the problems occur at the initqueue stage, I added the rd.break=initqueue boot parameter to the already existing rd.shell and rd.debug. This option opens a shell prompt when entering this phase.

initqueue:/# rm /lib/dracut/hooks/initqueue/finished/nfsroot.sh
initqueue:/# exit

This helps the boot process advance, but the root mount fails due to another dracut bug: the NFS mount script exists but does not run for some reason. Let's check the possibility of mounting the NFS root manually by adding the boot option rd.break=mount to all the existing options.

initqueue:/# rm /lib/dracut/hooks/initqueue/finished/nfsroot.sh
initqueue:/# exit
mount:/# nfsroot lo $netroot $NEWROOT
mount:/# exit

This time the server boots fine, but the root filesystem remains read-only and no services can be started. Let's repeat the boot, implementing the overlay idea:

initqueue:/# rm /lib/dracut/hooks/initqueue/finished/nfsroot.sh
initqueue:/# exit
mount:/# mkdir /run/{lower,upper,work}
mount:/# nfsroot lo $netroot /run/lower
mount:/# mount -t overlay overlay -o lowerdir=/run/lower,upperdir=/run/upper,workdir=/run/work $NEWROOT

The server responds with an error that the overlay filesystem is unknown: the kernel module is probably not included in the initramfs. You can find it explicitly included in the final dracut command shown in the main recipe above.

Updated on Sat Jan 29 15:16:42 IST 2022. More documentation here.