NetApp MPHA - you might not know there is a fault

We recently discovered an MPHA (MultiPath HA) fault that had existed for some time on a controller, which could see only a single path to its disks. After digging under the covers, it seems there's a gap in both ONTAP's and NetApp's ability to determine when a genuine MPHA fault has occurred. There is certainly some logic in NetApp "ignoring" single path conditions, as it is normal to lose a path when hot adding a shelf, replacing a cable, replacing an IOM or replacing a disk drive. That said, neither ONTAP nor NetApp seems able to check for a persistent fault - say a path that has been offline for more than 1 day. SNMP based monitoring will not identify such faults either, as no OIDs offer monitoring of MPHA as far as I've been able to tell.

When managing a single system (e.g. a school, small business etc) an administrator tends to log onto the system very regularly and will have a good chance of detecting this condition quickly.
On the other hand, when managing multiple systems (e.g. 50 or 100), it is important to be able to rely on solid alerting mechanisms, as it is not possible to manually log in to check these sorts of things.

Linux Scripts for checking Disk Loops

The scripting examples below assume that:

- a Linux management server is available for parsing text through *NIX utilities such as awk
- remote key based SSH authentication is configured, to allow scripts to be run against ONTAP from the Linux management host
- a config file (/etc/filer_list) has been established to define all NetApp controllers in the environment
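
Before relying on the scripts, it can help to confirm that these prerequisites actually hold for every controller. The following is a minimal sketch, not part of the original tooling: the function name check_filers is made up, the list file is assumed to hold one controller hostname per line, and the 7-mode "version" command is used only as a harmless test command.

```shell
#!/bin/bash
# Pre-flight check: confirm every controller in the list accepts key based SSH.
# Assumptions: one hostname per line in the list file; "version" is used
# purely as a harmless command to prove that a remote command can run.
check_filers() {
        local list="${1:-/etc/filer_list}"
        local controller rc=0
        for controller in $(cat "${list}")
        do
                # BatchMode=yes makes ssh fail instead of prompting for a password
                if ssh -n -o BatchMode=yes -o ConnectTimeout=10 -l root "${controller}" version >/dev/null 2>&1
                then
                        echo "${controller} OK"
                else
                        echo "${controller} FAILED"
                        rc=1
                fi
        done
        return ${rc}
}
```

Calling check_filers with no argument reads /etc/filer_list; a non-zero return code means at least one controller could not be reached non-interactively.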

Print all disks where a single path is unavailable

This one-liner can be executed against a single controller to determine which disks are affected by a path fault.

[remoteadmin@mgthost]# ssh -l root filer01 storage show disk -p | egrep -v "PRIMARY|----" | awk '{if ($2 == "A" && $4 != "B") print $0; else if ($2 == "B" && $4 != "A") print $0}'

1a.11.0    B                     1    0
1a.11.0    B                     1    1
1a.11.0    B                     1    2
...

This excerpt shows three disks that have only a single path from filer01.

If only a couple of disks (e.g. 2 disks in a shelf of 24) show the fault, you may have either a disk or backplane fault. Either way, NetApp should be engaged for further diagnosis of the condition.

If whole shelves show only a single path (e.g. every disk in a DS2246 shelf, or multiple shelves in a stack) then it is likely you have a fault either in the SAS HBA, a SAS cable, or a shelf IOM. Again, NetApp will be able to assist with further troubleshooting and (if required) hardware replacement.
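
To tell these two cases apart quickly, the single-path rows can be tallied per shelf. This is a rough sketch under one assumption: on a single-path row the SECONDARY and second PORT columns are blank, so awk sees four fields (disk, port, shelf, bay), as in the excerpt above. The function name single_path_by_shelf is made up for illustration.

```shell
#!/bin/bash
# Summarise single-path disks per shelf from `storage show disk -p` output on stdin.
# Assumption: a single-path row has four whitespace-separated fields
# (disk, port, shelf, bay), so the shelf number is field 3.
single_path_by_shelf() {
        grep -Ev "PRIMARY|----" | awk '
                ($2 == "A" && $4 != "B") || ($2 == "B" && $4 != "A") { count[$3]++ }
                END { for (shelf in count) print "shelf " shelf ": " count[shelf] " single-path disk(s)" }
        ' | sort
}
```

It would be used as: ssh -l root filer01 storage show disk -p | single_path_by_shelf. A count close to the number of bays in a shelf points towards the HBA, cable or IOM; a count of one or two points towards a disk or backplane.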

Identify which controllers have a fault

This script came in handy for an environment with over 70 controllers. I've kept it simple: once a controller with an MPHA fault is identified, the previous one-liner still needs to be run against it to determine precisely which disks are affected.

#!/bin/bash
#
#
## Check for Email reporting requirement (i.e. only need to send an Email if this is executed via cron)
##
send_mail=no
if [ "$#" -eq 1 ] && [ "$1" = "email" ]
then
        send_mail=yes
fi
## Initialise notification state so the tests below never see an empty value
notifyflag=0
notifycontrollers=""
notifydetails=""
##
## 7-mode checks
##
for controller in $(cat /etc/filer_list)
do
        # Count disks where port A has no matching B path, or port B has no matching A path
        errors=$(ssh -l root "${controller}" storage show disk -p | egrep -v "PRIMARY|-----" | awk 'BEGIN{ERRORCOUNT=0} {if ($2 == "A" && $4 != "B") ++ERRORCOUNT; else if ($2 == "B" && $4 != "A") ++ERRORCOUNT} END{print ERRORCOUNT}')
        echo "${controller} ${errors}"
        if [ "${errors:-0}" -gt 0 ]
        then
                notifyflag=1
                notifycontrollers="${notifycontrollers}${controller},"
                notifydetails="${notifydetails}${controller} - ${errors} failed paths, "
        fi
        #sleep 1
done

##
## Email Notification (only when faults were found and "email" was passed as the first argument)
##
if [ "${notifyflag}" -eq 1 ] && [ "${send_mail}" = "yes" ]
then
        echo "MultiPath HA faults were identified for the following controllers: ${notifydetails} using the script /usr/local/bin/multipath_check.sh" | mailx -s "MultiPath HA Faults detected for ${notifycontrollers}" -r "${HOSTNAME}@yourcompany.com.au" storageadmins@yourcompany.com.au
fi

This script checks the output of storage show disk -p for either A or B to be populated in the 2nd column, then for the alternate path in column 4 (i.e. if A was shown in column 2, B must be shown in column 4).
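
The column test can be tried offline without touching a controller. The sketch below wraps the same filter and awk test in a function (count_single_path is a made-up name) and feeds it a few illustrative rows; the sample data is invented for the demonstration and its layout is an assumption based on the excerpt earlier in this post.

```shell
#!/bin/bash
# Count single-path disks in `storage show disk -p` output read from stdin.
# Uses the same column test as the script; the sample rows below are made up.
count_single_path() {
        grep -Ev "PRIMARY|-----" | awk '
                BEGIN { n = 0 }
                ($2 == "A" && $4 != "B") || ($2 == "B" && $4 != "A") { n++ }
                END { print n }'
}

sample='PRIMARY   PORT  SECONDARY  PORT  SHELF  BAY
-------   ----  ---------  ----  -----  ---
0a.00.0   A     0b.00.0    B     0      0
0a.00.1   A     0b.00.1    B     0      1
1a.11.0   B                      1      0'

# The first two rows have both paths; only the last row is counted.
printf '%s\n' "${sample}" | count_single_path
```

On the last row the blank SECONDARY and PORT columns cause the shelf number to land in column 4, which is why the "column 4 is not the alternate port" test catches it.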

If any disk has only a single path, the number of affected disks is counted and the total printed to STDOUT, and a notification flag is set so that the script sends an Email notification if Email reporting was requested.
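
Since Email is only wanted when the script runs unattended, a cron entry along the following lines could drive it. This is only a sketch: the daily 07:00 schedule is an assumption, and the script path is the one quoted in the notification text.

```
# m h dom mon dow  command
0 7 * * * /usr/local/bin/multipath_check.sh email >/dev/null 2>&1
```

Run by hand without the email argument, the script just prints the per-controller counts.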