HOWTO: NetApp LUNs on Linux using native tools.

Here is an example of a particular installation. It is not a generic HOWTO; adapt it to your needs.

Preparing

An HP server with QLogic FC adapters was used, with RedHat 5.6 installed on it. No special software was installed.

Eight FC zones were created. The server has two FC adapters connected to two fabric switches, and each switch is connected to each NetApp head through two FC cards. Therefore the total number of zones is 8: four for the active NetApp controller and four for the partner head.

As follows from the above, the server can see every LUN through eight paths: four active and four backup. Four LUNs were created instead of a single LUN to boost performance. This number was chosen as a compromise between performance and easy maintenance.

UPDATE: Using a striped volume did not show the expected performance boost; the IO bottleneck simply moved to the LVM device aggregating the stripes. Today I'd recommend a single LUN when working with central storage.

netapp> igroup create -f -t linux $IGNAME $WWN1 $WWN2
netapp> igroup set $IGNAME alua yes
netapp> vol create $VOLNAME $AGGRNAME 600g
netapp> exportfs -z /vol/$VOLNAME
netapp> vol options $VOLNAME minra on
netapp> vol autosize $VOLNAME -m 14t on
netapp> qtree create /vol/$VOLNAME/LUN0
netapp> qtree create /vol/$VOLNAME/LUN1
netapp> qtree create /vol/$VOLNAME/LUN2
netapp> qtree create /vol/$VOLNAME/LUN3
netapp> lun create -s 250g -t linux -o noreserve /vol/$VOLNAME/LUN0/lun0
netapp> lun create -s 250g -t linux -o noreserve /vol/$VOLNAME/LUN1/lun1
netapp> lun create -s 250g -t linux -o noreserve /vol/$VOLNAME/LUN2/lun2
netapp> lun create -s 250g -t linux -o noreserve /vol/$VOLNAME/LUN3/lun3
netapp> lun map /vol/$VOLNAME/LUN0/lun0 $IGNAME 0
netapp> lun map /vol/$VOLNAME/LUN1/lun1 $IGNAME 1
netapp> lun map /vol/$VOLNAME/LUN2/lun2 $IGNAME 2
netapp> lun map /vol/$VOLNAME/LUN3/lun3 $IGNAME 3

The second command defines my host as ALUA-enabled; I'll configure ALUA on the server side later. The LUNs were created with the "noreserve" option, which simulates thin provisioning when combined with the volume "autosize" option.
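
These settings can be double-checked on the filer (a quick verification, assuming the same 7-mode CLI as above; "igroup show -v" should report ALUA: Yes, and "lun show -v" should show space reservation disabled):

netapp> igroup show -v $IGNAME
netapp> lun show -v /vol/$VOLNAME/LUN0/lun0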

Rescan FC/SCSI for changes

These commands have been tested a number of times and usually worked well. However, they can hang the server or cause other damage. Use them in a maintenance window, with a good backup at hand, and not too frequently on the same server.

Common actions for SCSI (and FC) devices are: remove old (dead) SCSI devices, rescan for newly added devices, and resize existing devices. Sometimes devices are still in use by the OS, and then a reboot may be required.

This small script will try to remove all SCSI devices. The kernel will not allow it to remove devices that are still in use. A device can be "in use" if a partition is still mounted, LVM uses it, or multipath is doing I/O through it right now. Paths with no I/O on them will be removed successfully, so rescan devices immediately after running this script.

Using this script is dangerous; the server can hang and data loss may occur. Try to remove only the relevant devices instead.

# ( cd /sys/class/scsi_device/ ; for d in * ; do
echo "scsi remove-single-device $(echo $d|tr ':' ' ')"  > /proc/scsi/scsi
done )
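
A safer alternative is to remove a single, known device through sysfs instead (a sketch; the H:C:T:L address 0:0:1:0 below is a hypothetical example, substitute your own):

# echo 1 > /sys/class/scsi_device/0:0:1:0/device/delete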

The next step is to rescan for new devices. This command will reset the FC loop, and all LUNs will be re-discovered. Due to the event-driven nature of modern Linux, you may not have to rescan the SCSI hosts (next command) as well.

# for FC in /sys/class/fc_host/host?/issue_lip ; do echo "1" > $FC ; sleep 5 ; done ; sleep 20

This command will rescan SCSI devices (not FC; useful when adding disks to a VM):

# for SH in /sys/class/scsi_host/host?/scan ; do echo "- - -" > $SH ; done

Linux native multipath

Enable multipath daemon:

Update for RH6: Replace /dev/mpath with /dev/mapper throughout this document.

# chkconfig --add multipathd
# chkconfig multipathd on
# /etc/init.d/multipathd restart
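
To confirm the daemon is registered and running (a quick check):

# chkconfig --list multipathd
# /etc/init.d/multipathd status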

Here is the content of a working /etc/multipath.conf. I've tried to make the comments informative:

defaults {
        user_friendly_names     yes
        # turn this to yes in a clustered environment:
        #flush_on_last_del      yes
        # multipathd should always be running:
        queue_without_daemon    no
}

## Blacklist non-SAN devices
blacklist {
        # Common non-disk devices:
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        # We are on HP HW, cciss is our single boot device:
        devnode "^cciss!c[0-9]d[0-9]*"
        # This HW has card reader, identified by:
        device {
                vendor "Single"
                product "*"
        }
}

# See /usr/share/doc/device-mapper-multipath-X.X.X/ for compiled defaults.
# This section is no longer needed on modern distributions.
# Override NETAPP's default definitions to use ALUA:
devices {
        device {
                vendor                  "NETAPP"
                product                 "*"
                path_grouping_policy    group_by_prio
                getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
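                # "prio alua" is the RHEL6 syntax; "prio_callout" is the RHEL5 equivalent: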
		prio			alua
                prio_callout            "/sbin/mpath_prio_alua /dev/%n"
                features                "3 queue_if_no_path pg_init_retries 50"
                path_checker            tur
                hardware_handler        "1 alua"                          
                failback                immediate
        }
}

# Use fixed names for our LUNs:
multipaths {
        multipath {
                wwid    360a98000646564494b34653545345552
                alias   nlun0
        }
        multipath {
                wwid    360a98000646564494b34653545347261
                alias   nlun1
        }
        multipath {
                wwid    360a98000646564494b34653545353938
                alias   nlun2
        }
        multipath {
                wwid    360a98000646564494b34653545356151
                alias   nlun3
        }
}

Aliases for LUN names are very helpful for maintenance tasks and will be used widely later in this memo. You can see these WWIDs, for example, here:

/dev/disk/by-id # ll
total 0
lrwxrwxrwx 1 root root  9 Jul 18 16:46 scsi-360a98000646564494b34653545345552 -> ../../sda
lrwxrwxrwx 1 root root  9 Jul 18 16:46 scsi-360a98000646564494b34653545347261 -> ../../sdb
lrwxrwxrwx 1 root root  9 Jul 18 16:46 scsi-360a98000646564494b34653545353938 -> ../../sdc
lrwxrwxrwx 1 root root  9 Jul 18 16:46 scsi-360a98000646564494b34653545356151 -> ../../sdd
lrwxrwxrwx 1 root root 10 Jul 18 16:46 usb-Single_Flash_Reader_058F63356336 -> ../../sdag

You can use this NetApp LUN serial to WWID converter if you work with NetApp.
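
If the converter is not at hand: the WWID is just the NAA prefix 360a98000 followed by the hex encoding of the 12-character LUN serial number (a sketch of the conversion, assuming the usual NetApp NAA scheme; the serial below is the one behind nlun0's WWID):

# SERIAL=dedIK4e5E4UR
# printf '360a98000%s\n' $(printf '%s' $SERIAL | od -An -tx1 | tr -d ' \n')
360a98000646564494b34653545345552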

Flush and rebuild multipath configuration:

# multipath -F
# multipath -v3 > multipath.out
# view multipath.out

Check the command's output (multipath.out) to verify that ALUA is in use and that paths are grouped by priority:

# multipath -ll nlun0
nlun0 (360a98000646564494b34653545345552) dm-9 NETAPP,LUN
[size=250G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=200][active]
 \_ 0:0:1:0 sde  8:64   [active][ready]
 \_ 0:0:2:0 sdi  8:128  [active][ready]
 \_ 1:0:0:0 sdq  65:0   [active][ready]
 \_ 1:0:1:0 sdu  65:64  [active][ready]
\_ round-robin 0 [prio=40][enabled]
 \_ 0:0:0:0 sda  8:0    [active][ready]
 \_ 0:0:3:0 sdm  8:192  [active][ready]
 \_ 1:0:2:0 sdy  65:128 [active][ready]
 \_ 1:0:3:0 sdag 66:0   [active][ready]

The long number in brackets is the WWID mentioned above.
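
The same WWID can be obtained directly from any of its path devices, using the callout configured in multipath.conf above (RHEL5 syntax; on RHEL6 it is /lib/udev/scsi_id --whitelisted --device=/dev/sda):

# /sbin/scsi_id -g -u -s /block/sda
360a98000646564494b34653545345552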

LVM configuration

Edit /etc/lvm/lvm.conf to fix the following lines:

......
    # By default we accept every block device:
    #filter = [ "a/.*/" ]
    filter = [ "a|/dev/mpath/nlun|","a|/dev/cciss/|","r/.*/" ]

    #filter = [ "a|/dev/mapper/pv_|", "a|^/dev/disk/by-path/pci-0000:01:00.0-scsi-0:2:0:0|", "r/.*/" ]
    #filter=["a|/dev/mpath/mroot|", "a|/dev/mapper/mroot|", "r/.*/"]

......

My root VG resides on a cciss device; my data VG will be on multipath devices. I do not want LVM to start using SCSI disks directly, so all other devices are ignored. Moreover, the pattern is narrowed to the "nlun" aliases only, which will be important later.
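
A quick way to verify the filter took effect (devices rejected by the filter should disappear from the listing):

# lvmdiskscan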

Create PVs and the VG with these commands:

# for i in 0 1 2 3 ; do pvcreate --dataalignment 4k /dev/mpath/nlun$i ; done
# vgcreate vg_data /dev/mpath/nlun{0,1,2,3}

The --dataalignment 4k option should align data blocks to NetApp blocks (to be verified). The 64k stripe size was chosen to fit into the DMA size (probably outdated). The FS block size is set to 4k, which is good for modern storage and good for Oracle DBF, since the default Oracle block size has become 8k. Create the striped volume and format it:

# lvcreate -i 4 -I 64k -n oradbs -L 600g /dev/vg_data
# mkfs.ext3 -j -m0 -b4096 /dev/vg_data/oradbs
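
To check where LVM placed the first physical extent after --dataalignment (a quick sanity check, not part of the original procedure):

# pvs -o +pe_start --units k /dev/mpath/nlun0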

Fix /etc/fstab with the _netdev mount option:

/dev/vg_data/oradbs    /oradbs	ext3    _netdev 2 2

The _netdev option delays mounting this filesystem until the multipath service is up, which is important at boot time.

Rebuild the initrd image to reflect the lvm.conf and multipath.conf changes. This is not necessary, but it makes the boot cleaner.

# mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
Update for RH6:
# mkinitrd -f /boot/initramfs-$(uname -r).img $(uname -r)

Reboot server to see if everything works well at boot time.

Benchmarking

Update: Not really a benchmark, just dd. It reflects nothing: everything runs in memory, as the server has more than 10 GB of RAM.

# dd if=/dev/zero of=10g.file bs=1024k count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 33.2477 seconds, 323 MB/s
# sync # flush local cache
# dd if=10g.file of=/dev/null bs=1024k
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 5.01478 seconds, 2.1 GB/s
# # PAM installed on this NetApp
# dd if=10g.file of=/dev/null bs=1024k
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 4.69177 seconds, 2.3 GB/s
# # Local cache also adds something

Online LUN resizing

Resize NetApp's LUNs:

netapp> lun resize /vol/$VOL/LUN0/lun0 400g
netapp> lun resize /vol/$VOL/LUN1/lun1 400g
netapp> lun resize /vol/$VOL/LUN2/lun2 400g
netapp> lun resize /vol/$VOL/LUN3/lun3 400g

Rescan SCSI devices:

# for device in /sys/block/sd* ; do echo 1 > $device/device/rescan ; sleep 2 ; done

Check /var/log/messages for capacity change messages like this:

kernel: sdz: detected capacity change from 268435456000 to 429496729600
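
For example, to pull these lines out quickly:

# grep 'detected capacity change' /var/log/messages | tail -4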

Ask the multipath daemon to rescan its slave devices:

# multipath
: nlun3 (360a98000646564494b34653545356151)  NETAPP,LUN
[size=400G][features=1 queue_if_no_path][hwhandler=0][n/a]
 ....

Update for RH6: A special command for resizing was introduced:

# multipathd -k'resize map nlun0'
Issue this command for every resized multipathed device.
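
For example, a loop over our four LUNs:

# for i in 0 1 2 3 ; do multipathd -k"resize map nlun$i" ; done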

Resize the physical volumes using the LVM command:

# pvresize /dev/mpath/nlun0
  Physical volume "/dev/mpath/nlun0" changed
  1 physical volume(s) resized / 0 physical volume(s) not resized
# pvscan
  PV /dev/mpath/nlun0    VG vg_data   lvm2 [399.97 GB / 162.47 GB free]
  PV /dev/mpath/nlun1    VG vg_data   lvm2 [249.97 GB / 12.47 GB free]
  PV /dev/mpath/nlun2    VG vg_data   lvm2 [249.97 GB / 12.47 GB free]
  PV /dev/mpath/nlun3    VG vg_data   lvm2 [249.97 GB / 12.47 GB free]
  PV /dev/cciss/c0d0p2   VG rootvg     lvm2 [136.56 GB / 130.47 GB free]
  Total: 5 [1.26 TB] / in use: 5 [1.26 TB] / in no VG: 0 [0   ]
# pvresize /dev/mpath/nlun1
# pvresize /dev/mpath/nlun2
# pvresize /dev/mpath/nlun3

Resize logical volume and FS:

# lvresize -L 1500g /dev/vg_data/oradbs 
  Using stripesize of last segment 8.00 MB
  Extending logical volume oradbs to 1.46 TB
  Logical volume oradbs successfully resized
# resize2fs /dev/vg_data/oradbs
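
Verify the new file system size (a convenience check):

# df -h /oradbs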

Mounting backup snapshots

Create a FlexClone on NetApp using the desired snapshot:

netapp> vol clone create $SNAPVOL -s none -b $VOL $SNAPNAME
netapp> lun map /vol/$SNAPVOL/LUN0/lun0 $IGNAME
netapp> lun online /vol/$SNAPVOL/LUN0/lun0
netapp> lun map /vol/$SNAPVOL/LUN1/lun1 $IGNAME
netapp> lun online /vol/$SNAPVOL/LUN1/lun1
netapp> lun map /vol/$SNAPVOL/LUN2/lun2 $IGNAME
netapp> lun online /vol/$SNAPVOL/LUN2/lun2
netapp> lun map /vol/$SNAPVOL/LUN3/lun3 $IGNAME
netapp> lun online /vol/$SNAPVOL/LUN3/lun3

Rescan FC devices

# for FC in /sys/class/fc_host/host?/issue_lip ; do echo "1" > $FC ; done ; sleep 20

Rescan multipath devices

# multipath -F
# multipath
# multipath -ll

You should see four more mpathXX devices alongside the nlunX devices.

Verify that LVM still uses the nlunX devices:

# pvscan
  PV /dev/mpath/nlun0    VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/nlun1    VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/nlun2    VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/nlun3    VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/cciss/c0d0p2   VG rootvg     lvm2 [136.56 GB / 130.47 GB free]
  Total: 5 [1.11 TB] / in use: 5 [1.11 TB] / in no VG: 0 [0   ]

If you see "Found duplicate PV qCNLd6KGQmdVPwnwQAadJ8IH7KWo8xEy: using /dev/mpath14 not /dev/nlun0", you are probably in trouble. This should not happen, because our /etc/lvm/lvm.conf includes only the "nlunX" devices.

Replicate lvm.conf:

# mkdir /lvmtemp 
# cp /etc/lvm/lvm.conf /lvmtemp 
# export LVM_SYSTEM_DIR=/lvmtemp/
# vi /lvmtemp/lvm.conf

Replace the filter line so that LVM sees the mpathXX devices instead of the nlunX devices:

....
    #filter = [ "a|/dev/mpath/nlun|","a|/dev/cciss/|","r/.*/" ]
    filter = [ "a|/dev/mpath/mpath|","a|/dev/cciss/|","r/.*/" ]
    #filter = [ "a|/dev/mapper/mpath|","a|/dev/cciss/|","r/.*/" ]
....

Verify that you now see only the mpathXX devices:

# pvscan
  PV /dev/mpath/mpath24   VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/mpath25   VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/mpath26   VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/mpath27   VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/cciss/c0d0p2    VG rootvg     lvm2 [136.56 GB / 130.47 GB free]
  Total: 5 [1.11 TB] / in use: 5 [1.11 TB] / in no VG: 0 [0   ]

Rename the VG and change the PVIDs using the vgimportclone script, which comes with the LVM2 package:

# vgimportclone -n vg_restore /dev/mpath/mpath* 
  WARNING: Activation disabled. No device-mapper interaction will be attempted.
  Physical volume "/tmp/snap.knd24669/vgimport3" changed
  1 physical volume changed / 0 physical volumes not changed
  WARNING: Activation disabled. No device-mapper interaction will be attempted.
  Physical volume "/tmp/snap.knd24669/vgimport2" changed
  1 physical volume changed / 0 physical volumes not changed
  WARNING: Activation disabled. No device-mapper interaction will be attempted.
  Physical volume "/tmp/snap.knd24669/vgimport1" changed
  1 physical volume changed / 0 physical volumes not changed
  WARNING: Activation disabled. No device-mapper interaction will be attempted.
  Physical volume "/tmp/snap.knd24669/vgimport0" changed
  1 physical volume changed / 0 physical volumes not changed
  WARNING: Activation disabled. No device-mapper interaction will be attempted.
  Volume group "vg_data" successfully changed
  Volume group "vg_data" successfully renamed to "vg_restore"
  Reading all physical volumes.  This may take a while...
  Found volume group "vg_restore" using metadata type lvm2
  Found volume group "rootvg" using metadata type lvm2

Activate the VG and mount the file system:

# vgchange -ay vg_restore
  1 logical volume(s) in volume group "vg_restore" now active
# mount /dev/vg_restore/oradbs /oradbs_restore

DO NOT FORGET TO REVERT ALL CHANGES WHEN FINISHED!!! See the next chapter.

Unmounting backup snapshot

Set the LVM configuration directory back to the directory used in the previous chapter. Verify that you see the mpathXX devices in the pvscan output:

# export LVM_SYSTEM_DIR=/lvmtemp/
# pvscan
  PV /dev/mpath/mpath24   VG vg_restore   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/mpath25   VG vg_restore   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/mpath26   VG vg_restore   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/mpath27   VG vg_restore   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/cciss/c0d0p2    VG rootvg       lvm2 [136.56 GB / 130.47 GB free]
  Total: 5 [1.11 TB] / in use: 5 [1.11 TB] / in no VG: 0 [0   ]

Note that the multipath devices to be removed are mpath2{4,5,6,7}; we will refer to them through the rest of this chapter. Your output may vary; use your own device names instead.

Unmount the file system(s) and deactivate the VG; the output should show 0 active LVs:

# umount /oradbs_restore
# vgchange -an vg_restore
  0 logical volume(s) in volume group "vg_restore" now active

Save the list of SCSI devices to be removed:

# rm -f /tmp/scsi-disks ; for m in mpath2{4,5,6,7} ; do
multipath -ll $m | cut -c6-13 | grep ":.:" | tee -a /tmp/scsi-disks
done

Flush the multipath devices (only the desired maps):

# for m in mpath2{4,5,6,7} ; do multipath -f $m ; done
# cd /dev/mapper ; ls

On success you should not see the flushed maps in the /dev/mapper directory; otherwise you have to check what keeps the maps busy, resolve the problem, and repeat the flush. Verify that the relevant FS is unmounted and the LVM PVs are not in use (inactive or exported).
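
If a map refuses to flush, these commands can help find what still holds it (mpath24 is the example map name from above):

# dmsetup info -c mpath24    # an Open count above 0 means the map is still in use
# dmsetup ls --tree          # shows what is stacked on top of each map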

Now remove the SCSI devices previously saved in /tmp/scsi-disks:

# for d in $(cat /tmp/scsi-disks) ; do
echo "scsi remove-single-device $(echo $d|tr ':' ' ')"  > /proc/scsi/scsi
done

Destroy the clone on NetApp:

netapp> vol offline $SNAPVOL
netapp> vol destroy $SNAPVOL -f

Recheck configuration:

# unset LVM_SYSTEM_DIR
# pvscan
  PV /dev/mpath/nlun0    VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/nlun1    VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/nlun2    VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/mpath/nlun3    VG vg_data   lvm2 [249.97 GB / 99.97 GB free]
  PV /dev/cciss/c0d0p2   VG rootvg    lvm2 [136.56 GB / 130.47 GB free]
  Total: 5 [1.11 TB] / in use: 5 [1.11 TB] / in no VG: 0 [0   ]

Summarizing, the adding procedure looks like:

LUN -> SCSI -> multipath -> LVM -> FS

and the removing procedure should be exactly the reverse:

FS -> LVM -> multipath -> SCSI -> LUN

See also Boot from SAN RedHat 5
Updated on Sun Nov 8 18:29:15 IST 2015.