ZFS recipes

Software installation

Install the ZFS software as explained on the ZFS on Linux site. For the tests I will use the platform created during the Redundant disks without MDRAID POC. Fedora 25 is installed there, so I followed this instruction to install ZFS.

After installation, you will have a number of disabled services:

[root@lvmraid ~]# systemctl list-unit-files | grep zfs
zfs-import-cache.service                    disabled 
zfs-import-scan.service                     disabled 
zfs-mount.service                           disabled 
zfs-share.service                           disabled 
zfs-zed.service                             disabled 
zfs.target                                  disabled

Enable a few of them. I will not use the sharing services from ZFS, only mounting.

[root@lvmraid ~]# systemctl enable zfs.target
[root@lvmraid ~]# systemctl enable zfs-mount.service
[root@lvmraid ~]# systemctl enable zfs-import-cache.service
[root@lvmraid ~]# systemctl start zfs-mount.service 
Job for zfs-mount.service failed because the control process exited with error code.
See "systemctl status zfs-mount.service" and "journalctl -xe" for details.
[root@lvmraid ~]# zpool status
The ZFS modules are not loaded.
Try running '/sbin/modprobe zfs' as root to load them.
[root@lvmraid ~]# modprobe zfs
[root@lvmraid ~]# zpool status
no pools available

It looks like the services do not load the ZFS kernel module at startup. After studying the sources you can see that the module is loaded automatically once a zpool is defined. Instead of relying on that, we will force-load the zfs module at every boot, following Fedora's recommendations:

[root@lvmraid ~]# cat > /etc/sysconfig/modules/zfs.modules << EOFcat
#!/bin/sh
exec /usr/sbin/modprobe zfs
EOFcat
[root@lvmraid ~]# chmod 755 /etc/sysconfig/modules/zfs.modules
[root@lvmraid ~]# reboot

The zfs-auto-snapshot script helps automate scheduled snapshot creation and retention. I highly recommend installing this tool on production systems.

[root@lvmraid ~]# wget https://github.com/zfsonlinux/zfs-auto-snapshot/archive/master.zip
[root@lvmraid ~]# unzip master.zip
[root@lvmraid ~]# cd zfs-auto-snapshot-master/
[root@lvmraid zfs-auto-snapshot-master]# make install
[root@lvmraid zfs-auto-snapshot-master]# cd
[root@lvmraid ~]# rm -f /etc/cron.d/zfs-auto-snapshot /etc/cron.hourly/zfs-auto-snapshot \
	/etc/cron.weekly/zfs-auto-snapshot /etc/cron.monthly/zfs-auto-snapshot

ZFS snapshots use redirect-on-write technology (often mistakenly called copy-on-write). The schedule installed by these scripts creates a lot of snapshots that in practice do not help and instead consume considerable resources: the oldest snapshot accumulates a lot of disk space, and rotating frequent tiny snapshots costs computing resources. So I keep only the daily snapshots and have deleted the rest of the schedule.
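For reference, the daily job that remains after the cleanup may look like this (a sketch following the project's default cron.daily script; the --keep value is a policy choice, adjust it to your retention needs):

```shell
#!/bin/sh
# /etc/cron.daily/zfs-auto-snapshot
# Keep 31 daily snapshots of every dataset that has snapshots enabled
# (the "//" argument means: all datasets honoring com.sun:auto-snapshot).
exec zfs-auto-snapshot --quiet --syslog --label=daily --keep=31 //
```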

Creating pool

The ZFS pool is the place where file systems (or volumes) are created. The pool spreads data across the physical disks and takes care of redundancy. Although you can create a pool without any redundancy, this is not common. We will create a RAID5-like raidz from the third partition of each of our disks.

NOTE: If you have many disks and want to control the size of each raid group, simply repeat the keyword "raidz" after the desired number of disks to start a new group.
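A sketch of such a layout (the disk names beyond vdd3 are hypothetical; this would require eight disks):

```shell
# Two 4-disk raidz groups in one pool; ZFS stripes data across the
# groups, similar to RAID50.
zpool create -m none export \
    raidz /dev/vda3 /dev/vdb3 /dev/vdc3 /dev/vdd3 \
    raidz /dev/vde3 /dev/vdf3 /dev/vdg3 /dev/vdh3
```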

I used the -m none option so that the pool itself is not mounted. export is the name of the created pool; I plan to mount its file systems under the /export hierarchy, hence the name.

[root@lvmraid ~]# zpool create -m none export raidz /dev/vd?3
[root@lvmraid ~]# zpool list export -v
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
export  11.9T   412K  11.9T         -     0%     0%  1.00x  ONLINE  -
  raidz1  11.9T   412K  11.9T         -     0%     0%
    vda3      -      -      -         -      -      -
    vdb3      -      -      -         -      -      -
    vdc3      -      -      -         -      -      -
    vdd3      -      -      -         -      -      -
[root@lvmraid ~]# zpool status export -v
  pool: export
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    ONLINE       0     0     0
            vdd3    ONLINE       0     0     0

errors: No known data errors
[root@lvmraid ~]# zpool history
History for 'export':
2017-06-25.17:11:49 zpool create -m none export raidz /dev/vda3 /dev/vdb3 /dev/vdc3 /dev/vdd3
[root@lvmraid ~]# 

The last command is very useful if you rarely deal with ZFS or share this duty with someone else. Another useful command is iostat for zpool:

[root@lvmraid ~]# zpool iostat export -v 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
export       412K  11.9T      0      0      0    470
  raidz1     412K  11.9T      0      0      0    470
    vda3        -      -      0      0     17  1.20K
    vdb3        -      -      0      0     17  1.19K
    vdc3        -      -      0      0     17  1.20K
    vdd3        -      -      0      0     17  1.20K
----------  -----  -----  -----  -----  -----  -----

You can check the version of ZFS and the enabled features with the zpool upgrade -v command.

We will return to zpool in this article when we move on to simulate disk failure.

Creating filesystems

[root@lvmraid ~]# zfs create -o mountpoint=/export/data export/data
[root@lvmraid ~]# df -hP
Filesystem                Size  Used Avail Use% Mounted on
 ..
export/data               8.7T  128K  8.7T   1% /export/data

This is an example of creating a simple file system. The file system will be automatically mounted if the mountpoint option is specified.

Working with snapshot

Let's create the very first snapshot of this FS. It is useless by itself, because the FS contains no data yet, but it will demonstrate how snapshots consume disk space.

[root@lvmraid ~]# zfs snap export/data@initial

In export/data@initial, export/data is the file system being snapshotted and initial is the desired snapshot name.

Now, let's copy some data into the FS, then take another snapshot:

[root@lvmraid ~]# rsync -a /etc /export/data/
[root@lvmraid ~]# df -hP /export/data/
Filesystem      Size  Used Avail Use% Mounted on
export/data     8.7T   18M  8.7T   1% /export/data
[root@lvmraid ~]# zfs snap export/data@etc_copied
[root@lvmraid ~]# zfs list export/data -o space
NAME         AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data  8.65T  17.4M     22.4K   17.4M              0          0
[root@lvmraid ~]# zfs list -r export/data -t snapshot -o space
NAME                    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data@initial         -  22.4K         -       -              -          -
export/data@etc_copied      -      0         -       -              -          -

As you can see, the new data (18M according to "df") does not affect the disk usage of the snapshots. A snapshot holds previously deleted data. Let's delete something to demonstrate.

[root@lvmraid ~]# rm -rf /export/data/etc
[root@lvmraid ~]# df -hP /export/data
Filesystem      Size  Used Avail Use% Mounted on
export/data     8.7T  128K  8.7T   1% /export/data
[root@lvmraid ~]# zfs list export/data -o space
NAME         AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data  8.65T  17.4M     17.4M   25.4K              0          0
[root@lvmraid ~]# zfs list -r export/data -t snapshot -o space
NAME                    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data@initial         -  22.4K         -       -              -          -
export/data@etc_copied      -  17.4M         -       -              -          -

The amount of deleted data is subtracted from the live data of the FS (as shown by "df") and added to the snapshot usage ("USEDSNAP" column). The more detailed listing shows that this space belongs to the "etc_copied" snapshot. The initial snapshot still uses almost no space, because the deleted data did not yet exist when that snapshot was created.

You can roll the whole FS back only to the latest snapshot. To revert to an earlier snapshot, you first have to remove the snapshots that are newer than it.

[root@lvmraid ~]# zfs rollback export/data@etc_copied
[root@lvmraid ~]# df -hP /export/data
Filesystem      Size  Used Avail Use% Mounted on
export/data     8.7T   18M  8.7T   1% /export/data
[root@lvmraid ~]# zfs list export/data -o space
NAME         AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data  8.65T  17.4M     23.9K   17.4M              0          0
[root@lvmraid ~]# zfs list -r export/data -t snapshot -o space
NAME                    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
export/data@initial         -  22.4K         -       -              -          -
export/data@etc_copied      -  1.50K         -       -              -          -

The FS was reverted, and the snapshot's disk usage was counted back as live data usage.
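To roll back past the latest snapshot in one step, the -r flag destroys the snapshots newer than the target for you. A sketch (here etc_copied would be destroyed along the way):

```shell
# Roll back to the initial snapshot, destroying any newer snapshots.
zfs rollback -r export/data@initial
```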

[root@lvmraid ~]# rm /export/data/etc/passwd
rm: remove regular file '/export/data/etc/passwd'? y
[root@lvmraid ~]# zfs snap export/data@passwd_removed
[root@lvmraid ~]# zfs diff export/data@etc_copied export/data@passwd_removed
M       /export/data/etc
-       /export/data/etc/passwd
-       /export/data/etc/passwd/<xattrdir>
-       /export/data/etc/passwd/<xattrdir>/security.selinux

The output above speaks for itself. zfs diff is a very nice command!

Let's restore one file from the snapshot, copying it back:

[root@lvmraid ~]# cd /export/data/.zfs/snapshot
[root@lvmraid snapshot]# ll
total 0
dr-xr-xr-x. 1 root root 0 Jun 25 20:32 etc_copied
dr-xr-xr-x. 1 root root 0 Jun 25 20:32 initial
dr-xr-xr-x. 1 root root 0 Jun 25 20:32 passwd_removed
[root@lvmraid snapshot]# rsync -av etc_copied/etc/passwd /export/data/etc/
sending incremental file list
passwd

sent 1,182 bytes  received 35 bytes  2,434.00 bytes/sec
total size is 1,090  speedup is 0.90
[root@lvmraid snapshot]# df -hP
Filesystem                Size  Used Avail Use% Mounted on
 ..
export/data               8.7T   18M  8.7T   1% /export/data
export/data@etc_copied    8.7T   18M  8.7T   1% /export/data/.zfs/snapshot/etc_copied
[root@lvmraid snapshot]# zfs diff export/data@etc_copied 
M       /export/data/etc
-       /export/data/etc/passwd
-       /export/data/etc/passwd/<xattrdir>
-       /export/data/etc/passwd/<xattrdir>/security.selinux
+       /export/data/etc/passwd
+       /export/data/etc/passwd/<xattrdir>
+       /export/data/etc/passwd/<xattrdir>/security.selinux

The hidden .zfs directory automatically mounts the required snapshot for you, and then you can copy a single file from there. The "zfs diff" command proves that this is not a real revert: the snapshot still references the deleted data blocks, and the file (with exactly the same name and metadata) was newly created in fresh data blocks of the FS.

Working with clones

First, we will find all the snapshots belonging to the FS we need to clone. The zfs list command can be very slow on a loaded system; a much faster way to check snapshot names is to list the .zfs/snapshot pseudo-directory.

[root@lvmraid ~]# zfs list -r export/data -t snapshot
NAME                         USED  AVAIL  REFER  MOUNTPOINT
export/data@initial         22.4K      -  25.4K  -
export/data@etc_copied      35.2K      -  17.4M  -
export/data@passwd_removed  28.4K      -  17.4M  -
[root@lvmraid ~]# zfs clone -o mountpoint=/clone/data export/data@etc_copied export/data_clone
[root@lvmraid ~]# df -hP
 ..
export/data               8.7T   18M  8.7T   1% /export/data
export/data_clone         8.7T   18M  8.7T   1% /clone/data

The clone was created using the snapshot as its basis. And, of course, you can mount it somewhere else.

It is not obvious which dataset is a clone and which snapshot it is based on. Here is one way to find out:

[root@lvmraid ~]# zfs list -o name,origin,clones -r -t snapshot export/data
NAME                        ORIGIN  CLONES
export/data@initial         -       
export/data@etc_copied      -       export/data_clone
export/data@passwd_removed  -       
[root@lvmraid ~]# zfs list -o name,origin,clones export/data_clone
NAME               ORIGIN                  CLONES
export/data_clone  export/data@etc_copied  -

ZFS has an interesting feature that I will demonstrate here:

[root@lvmraid ~]# zfs destroy -r export/data
cannot destroy 'export/data': filesystem has dependent clones
use '-R' to destroy the following datasets:
export/data_clone
[root@lvmraid ~]# zfs promote export/data_clone
[root@lvmraid ~]# zfs list -o name,origin,clones export/data
NAME         ORIGIN                        CLONES
export/data  export/data_clone@etc_copied  -
[root@lvmraid ~]# zfs list -o name,origin,clones export/data_clone -r -t snapshot
NAME                          ORIGIN  CLONES
export/data_clone@initial     -       
export/data_clone@etc_copied  -       export/data

The clone and its base switched their roles, and the clone inherited all the previous snapshots. Now it is possible to remove the origin FS:

[root@lvmraid ~]# zfs destroy -r export/data
[root@lvmraid ~]# zfs list -o name,origin,clones export/data_clone -r -t snapshot
NAME                          ORIGIN  CLONES
export/data_clone@initial     -       
export/data_clone@etc_copied  - 

Remote replication

We will use the same ZFS system as both origin and target, so the sending process is simply piped into the receiving process. You can use another ZFS system to replicate data over the network: SSH can serve as the channel if you want additional protection on the wire, or netcat if you want raw copy efficiency.
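A sketch of both network variants (the host name backup.example.com and the backup/data dataset are placeholders; the receiving side must also run ZFS):

```shell
# Over SSH: encrypted, simple, somewhat slower.
zfs send -R export/data_clone@etc_copied | \
    ssh backup.example.com zfs recv -v backup/data

# Over netcat: no encryption, better throughput on a trusted network.
# On the receiver:
#   nc -l 3333 | zfs recv -v backup/data
# On the sender:
#   zfs send -R export/data_clone@etc_copied | nc backup.example.com 3333
```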

First we need to select the snapshot to start from:

[root@lvmraid ~]# zfs list -r -t snapshot export/data_clone
NAME                           USED  AVAIL  REFER  MOUNTPOINT
export/data_clone@initial     22.4K      -  25.4K  -
export/data_clone@etc_copied  22.4K      -  17.4M  -
[root@lvmraid ~]# zfs send -R export/data_clone@etc_copied | zfs recv -v export/data
receiving full stream of export/data_clone@initial into export/data@initial
received 39.9KB stream in 1 seconds (39.9KB/sec)
receiving incremental stream of export/data_clone@etc_copied into export/data@etc_copied
received 20.0MB stream in 1 seconds (20.0MB/sec)
cannot mount '/clone/data': directory is not empty
[root@lvmraid ~]# zfs set mountpoint=/export/data export/data
[root@lvmraid ~]# zfs mount export/data
[root@lvmraid ~]# df -hP
 ..
export/data_clone         8.7T   18M  8.7T   1% /clone/data
export/data               8.7T   18M  8.7T   1% /export/data
[root@lvmraid ~]# zfs list -r -t snapshot export/data
NAME                     USED  AVAIL  REFER  MOUNTPOINT
export/data@initial     22.4K      -  25.4K  -
export/data@etc_copied      0      -  17.4M  -

The mount point was occupied by the original FS; after I changed it to another one, the mount succeeded.

As you can see, the FS is copied completely, including the contents of its snapshots. This is a good way to transfer data from one ZFS system to another. Incremental copies are supported as well.
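An incremental follow-up transfer may look like this (a sketch; the @today snapshot is hypothetical, and the target must already hold the @etc_copied snapshot from the previous full send):

```shell
# Take a new snapshot, then send only the delta between the common
# snapshot (@etc_copied) and the new one.
zfs snap export/data_clone@today
zfs send -i @etc_copied export/data_clone@today | zfs recv -v export/data
```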

Testing redundancy

Let's remove one disk. ZFS does not detect the problem until it accesses the disk, and then shows:

[root@lvmraid log]# zpool status
  pool: export
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    UNAVAIL      1   432     0  corrupted data
            vdd3    ONLINE       0     0     0

errors: No known data errors
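Output like this is also easy to consume from scripts. A minimal sketch that extracts the pool state from captured `zpool status` output (the sample text is copied from the listing above):

```shell
#!/bin/sh
# Parse the "state:" field out of saved `zpool status` output,
# e.g. for a simple monitoring check.
status_output='  pool: export
 state: DEGRADED
  scan: none requested'

state=$(printf '%s\n' "$status_output" | awk '/^ *state:/ {print $2}')
echo "$state"
```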

Now reconnect the disconnected disk. ZFS does not see the reconnected disk by itself; it probably needs to be rescanned somehow. I was too lazy to read the manual, so I just rebooted the server. Everything returned to normal:

[root@lvmraid ~]# zpool status
  pool: export
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    ONLINE       0     0    95
            vdd3    ONLINE       0     0     0

errors: No known data errors
[root@lvmraid ~]# zpool clear export
[root@lvmraid ~]# zpool status
  pool: export
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    ONLINE       0     0     0
            vdd3    ONLINE       0     0     0

errors: No known data errors
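Instead of rebooting, it may be possible to tell ZFS directly that the device is back (a hedged sketch; I have not verified this on the test platform):

```shell
# Bring the reattached device back online, then clear the error counters.
zpool online export vdc3
zpool clear export
```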

Now, the hard part. I am going to replace the disk with an empty one. First, we need to copy the partition table from one of the other disks, as described in Redundant disks without MDRAID.
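One way to copy the partition table is with sgdisk (a sketch; it assumes GPT partitioning, vdd is a healthy disk and vde is the replacement):

```shell
# Replicate vdd's partition table onto the new disk vde, then
# randomize the GUIDs so the two disks do not share identifiers.
sgdisk -R=/dev/vde /dev/vdd
sgdisk -G /dev/vde
```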

Then fix ZFS by replacing device:

[root@lvmraid ~]# zpool status 
  pool: export
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        export      DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vdc3    UNAVAIL      1   333     0  corrupted data
            vdd3    ONLINE       0     0     0

errors: No known data errors
[root@lvmraid ~]# zpool replace export /dev/vdc3 /dev/vde3
[root@lvmraid ~]# zpool status 
  pool: export
 state: ONLINE
  scan: resilvered 21.9M in 0h0m with 0 errors on Mon Jun 26 17:35:56 2017
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            vda3    ONLINE       0     0     0
            vdb3    ONLINE       0     0     0
            vde3    ONLINE       0     0     0
            vdd3    ONLINE       0     0     0

errors: No known data errors

The hard part turns out to be a piece of cake.


Updated on Tue Jun 27 14:23:41 IDT 2017 by Oleg Volkov. More documentation here.