Friday, 1 August 2014

Hardware RAID sets with Kickstart and megaraid controllers

We buy some of our kit with hardware RAID adapters to manage the disk storage. This tends to be service kit, or systems where we want fast local storage. On HPC compute nodes we're not so bothered about this: if we lose a node to a disk failure, the user can resubmit the job.

We use xcat to bare-metal provision our systems running Scientific Linux 6.x. Internally this just uses kickstart files, so the approach should work on anything that uses kickstart to install EL-based distributions. It's not possible to configure the RAID controller directly, and the disks aren't visible as JBOD drives. We might also have a number of systems coming in for a project, so we don't want to configure them all by hand.

First off, I took inspiration for this from Samuel Kielek's blog post on the subject. My needs are a bit different: we have multiple controllers in some of the kit, and we want RAID 10 sets rather than straight mirrors. I also found a page at Cambridge which documents parts of MegaCLI quite nicely.

Second, I've only tested it with our System x kit with MegaRAID adapters. You'll need a copy of the MegaCLI utility; we're running on IBM hardware, so we grabbed it from IBM's support site. You probably want to find your own vendor's release if you are planning on using this.

Be warned: using this script will clear the config on your MegaRAID adapters, i.e. it will remove any RAID sets you may have configured!

I'll explain a little about the script first. Basically, it downloads the MegaCLI tools from our xcat server, scans for adapters and disks, and builds a hash of them. Actually, as it's bash, it's not really a multidimensional array but a hacked approach to one: an associative array keyed on the adapter and slot numbers.
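As a minimal sketch of that "hacked" two-dimensional array: the script stores disks as matrix[$adap,$slot]=$enc, so the key is just the string "adapter,slot". The enclosure ID 252 and the slot numbers below are made-up sample values, and lookup is a hypothetical helper that does not appear in the real script. This needs bash 4 or later for declare -A.

```shell
#!/bin/bash
# Emulate a two-dimensional array with a bash associative array.
# Key: "<adapter>,<slot>"   Value: enclosure device ID.
declare -A matrix

# Made-up sample data: adapter 0 has disks in slots 0-3 of enclosure 252.
matrix[0,0]=252
matrix[0,1]=252
matrix[0,2]=252
matrix[0,3]=252

# Hypothetical helper: print the enclosure ID for an adapter/slot pair.
# A key that was never set expands to an empty string.
lookup() {
  local adap=$1 slot=$2
  echo "${matrix[$adap,$slot]}"
}

lookup 0 2   # prints 252
lookup 1 0   # prints an empty line, as there is no such disk
```

The empty-string result for a missing key is what the main script relies on when it walks slots 0 to 15 and skips the ones with no disk.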

Once we have found our adapters and disks, we split the disks on each adapter into two sets and use them to build a RAID 10 set. If we only have one adapter, then we have our RAID 10 set and we're done.
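The split into two sets can be sketched on its own, without touching a controller. This is only an illustration: split_disks is a hypothetical function name, the enclosure:slot pairs are sample values, and the output simply mirrors the -Array0/-Array1 arguments that the script passes to MegaCli64.

```shell
#!/bin/bash
# Split a comma-separated list of enclosure:slot pairs into two halves
# and print them in the form the script hands to -CfgSpanAdd.
split_disks() {
  local list half i a b d
  list=$(echo "$1" | tr ',' ' ')
  set -- $list                 # one positional parameter per disk
  half=$(($# / 2))
  i=0
  a=
  b=
  for d in "$@"; do
    if [ "$i" -lt "$half" ]; then
      a="${a:+$a,}$d"          # first half goes to Array0
    else
      b="${b:+$b,}$d"          # the rest goes to Array1
    fi
    i=$((i + 1))
  done
  echo "-Array0 [$a] -Array1 [$b]"
}

split_disks "252:0,252:1,252:2,252:3"
# prints: -Array0 [252:0,252:1] -Array1 [252:2,252:3]
```

With an odd number of disks the extra one lands in the second set, which is also what the main script would do.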

If we don't find an adapter, that's fine as well: it means the disk isn't controlled by a RAID controller.

Now a few of the systems (x3950 X6) have two adapters with half the drives on each. This is essentially because you can partition the x3950 into two discrete systems, but it does mean that our drives are split across two controllers and we can't have a single fully hardware RAID 10 set.

Initially, I was going to build two hardware stripes and then mirror them in software RAID, but my colleague suggested it would be better to build two hardware RAID 10 sets and then stripe across them. With a RAID 0 stripe on each controller, we could only lose one drive per controller before we hit a problem; with a hardware RAID 10 set on each, we could lose up to 2 drives on each controller, and the overall space available is the same.
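The "same space" claim is easy to check with a bit of shell arithmetic. The numbers here are assumptions for illustration only (2 controllers, 4 drives per controller, 600 GB per drive), not a description of our actual kit.

```shell
#!/bin/bash
# Compare usable capacity of the two layouts (illustrative numbers only).
controllers=2
drives_per_controller=4
drive_gb=600

# Layout A: RAID 0 stripe on each controller, mirrored in software.
# A mirror of two stripes keeps one stripe's worth of space.
stripe_gb=$((drives_per_controller * drive_gb))
layout_a_gb=$stripe_gb

# Layout B: hardware RAID 10 on each controller, striped in software.
# Each RAID 10 set keeps half of its drives' space; the stripe adds them up.
raid10_gb=$((drives_per_controller / 2 * drive_gb))
layout_b_gb=$((controllers * raid10_gb))

echo "mirror of stripes:      ${layout_a_gb} GB"
echo "stripe of RAID 10 sets: ${layout_b_gb} GB"
```

Both layouts come out at 2400 GB with these numbers, so the choice between them is about failure tolerance rather than capacity.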

As we want to use the same script for all the systems we have, we match the product name string from dmidecode and use that to download a file containing the kickstart disk partitioning.
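To see what the model matching does, here is the same grep and sed pipeline the script uses, wrapped in a function so it can be tried against sample strings instead of live dmidecode output. The function name model_from_product_name and the sample product names are hypothetical; check what dmidecode actually reports on your own hardware.

```shell
#!/bin/bash
# Extract a model such as "x3650" from a dmidecode "Product Name" line.
# Prints nothing if the line does not contain x followed by 3+ digits.
model_from_product_name() {
  echo "$1" | grep 'Product Name' | grep -E 'x[0-9]+' \
    | sed 's/.*\(x[0-9][0-9][0-9][0-9]*\).*/\1/'
}

# Made-up sample lines standing in for real dmidecode output.
model_from_product_name "    Product Name: System x3650 M4"   # prints x3650
model_from_product_name "    Product Name: System x3950 X6"   # prints x3950

# A name that does not match leaves MODEL empty, so it is worth checking
# before building a download URL from it.
MODEL=$(model_from_product_name "    Product Name: PowerEdge R720")
if [ -n "$MODEL" ]; then
  echo "would fetch partition-table/$MODEL"
else
  echo "no model matched, nothing to fetch"
fi
```

The main script does not check for an empty MODEL, so on unmatched hardware the wget and mv simply fail and /tmp/disk-partition is never created.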

First off, edit the kickstart file you use and change your disk partition lines to be:
%include /tmp/disk-partition

This file gets downloaded by the script. If you are using plain kickstart, place a copy of the script below in the %pre section of your kickstart file. We use xcat, which allows files to be included, so I have:
#INCLUDE:/install/data/MegaCLI/include-in-template#

When a machine is installed, xcat builds a copy of the kickstart file for the machine and includes the file above verbatim. This means I can keep the script in a separate location and use it for multiple template files.
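One thing the %include approach depends on is that /tmp/disk-partition actually exists by the time anaconda reads it. The following is only a sketch of a guard you could add to %pre, not something our script does: ensure_partition_file is a hypothetical helper, and the fallback layout (zerombr, clearpart, autopart) is an assumption that wipes every disk it sees, so swap in whatever suits your site.

```shell
#!/bin/bash
# Write a fallback partition file if the download left nothing behind.
ensure_partition_file() {
  local file=$1
  if [ ! -s "$file" ]; then
    # Assumed fallback layout; this is destructive on all visible disks.
    cat > "$file" <<'EOF'
zerombr
clearpart --all --initlabel
autopart
EOF
  fi
}

# Demonstrate against a scratch file rather than the real path.
demo=$(mktemp)
ensure_partition_file "$demo"
cat "$demo"
```

In a real %pre section the call would be ensure_partition_file /tmp/disk-partition, placed after the download of the model-specific file.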

And finally, the script which will do the magic for us:

#------------------------------------------------------------------------------#
#                     PRE-INSTALL HARDWARE RAID SETUP                          #
#------------------------------------------------------------------------------#

DEBUG=0
#
SERVER='10.30.0.254';
SRCDIR='/install/data/MegaCLI'

MODEL=`dmidecode | grep 'Product Name' | egrep 'x[0-9]+' | sed 's/.*\(x[0-9][0-9][0-9][0-9]*\).*/\1/'`
wget -q http://$SERVER$SRCDIR/partition-table/$MODEL
mv $MODEL /tmp/disk-partition

if [ $DEBUG == 0 ]; then
  exec < /dev/tty3 > /dev/tty3 2>&1
  chvt 3
fi

echo -e "\nConfiguring the MegaRAID SAS controller ..."

cd /tmp
[ -d mega-cli ] && /bin/rm -rf mega-cli
[ ! -d mega-cli ] && mkdir mega-cli
cd mega-cli

wget -q http://$SERVER$SRCDIR/MegaCli64
wget -q http://$SERVER$SRCDIR/libsysfs.so.2.0.2
wget -q http://$SERVER$SRCDIR/libstorelibir-2.so.13.05-0

mirror=/tmp/mirror_disks

declare -A matrix

if [ $DEBUG == 0 ]; then
  [ -f $mirror ] && rm -f $mirror
fi

if [ -f MegaCli64 ]; then
  chmod +x MegaCli64
  # Probe for disks
  if [ $DEBUG != 0 ]; then
    echo "./MegaCli64 -PDList -aALL | egrep 'Adapter|Enclosure Device ID|Slot Number'"
    if [ ! -f $mirror ]; then
      ./MegaCli64 -PDList -aALL | egrep 'Adapter|Enclosure Device ID|Slot Number' > $mirror
    fi
  else
    ./MegaCli64 -PDList -aALL | egrep 'Adapter|Enclosure Device ID|Slot Number' > $mirror
  fi

  if [ ! -f $mirror ]; then
    echo -e "\n\nNo MegaRAID device file\n"
    exit
  fi

  c=`grep -c Adapter $mirror`
  if [ $c == 0 ]; then
    echo -e "\n\nNo MegaRAID devices found\n"
    exit
  fi

  oIFS=$IFS
  IFS="$(echo -e "\n\r")"

  adap=foo
  enc=foo

  lowadap=999
  # Figure out where the disks are
  for line in $(cat $mirror); do
    if [[ $line =~ Adapter ]]; then
      adap=$( echo $line | sed -e 's/^\s*Adapter #\([0-9]\+\).*$/\1/' )
      enc=foo
      if [ $adap -le $lowadap ]; then
        lowadap=$adap
      fi
    fi

    if [ "$adap" != "foo" ]; then
      if [[ $line =~ Enclosure ]]; then
        enc=$( echo $line | awk -F: '/Enclosure/{print $2}' | sed -e 's/^\s*//' -e 's/\s*$//' )
      fi
      if [ "$enc" != "foo" ]; then
        if [[ $line =~ Slot ]]; then
          slot=$( echo $line | awk -F: '/Slot/{print $2}' | sed -e 's/^\s*//' -e 's/\s*$//' )
          matrix[$adap,$slot]=$enc
        fi
      fi
      #(( c++ ))
    fi
  done

  IFS=$oIFS

  for ((j=$lowadap;j<=$adap;j++)) do
    # Clear any existing configuration
    if [ $DEBUG != 0 ]; then
      echo ./MegaCli64 -CfgClr -Force -a$j
    else
      ./MegaCli64 -CfgClr -Force -a$j
    fi
    devs=
    devc=0
    for ((s=0;s<16;s++)) do
      if [ "${matrix[$j,$s]}" != '' ]; then
        devs=$devs,${matrix[$j,$s]}:$s
        devc=$((devc + 1))
      fi
    done

    disksperarray=$(($devc / 2))
    devsa=
    devsb=
    collected=0
    for ((s=0;s<16;s++)) do
      if [ "${matrix[$j,$s]}" != '' ]; then
        if [ $collected -lt $disksperarray ]; then
          devsa=$devsa,${matrix[$j,$s]}:$s
        else
          devsb=$devsb,${matrix[$j,$s]}:$s
        fi
        collected=$(($collected + 1))
      fi
    done

    devs=`echo $devs | sed 's/^,//'`
    devsa=`echo $devsa | sed 's/^,//'`
    devsb=`echo $devsb | sed 's/^,//'`

    # Ensure the drives are in a good state before creating the logical device
    if [ $DEBUG != 0 ]; then
      echo ./MegaCli64 -PDMakeGood -PhysDrv "[$devs]" -Force -a$j
    else
      ./MegaCli64 -PDMakeGood -PhysDrv "[$devs]" -Force -a$j
    fi
    # Create a RAID10 logical device from the detected disks 
    if [ $DEBUG != 0 ]; then
      echo ./MegaCli64 -CfgSpanAdd -r10 -Array0 "[$devsa]" -Array1 "[$devsb]" -a$j
    else
      ./MegaCli64 -CfgSpanAdd -r10 -Array0 "[$devsa]" -Array1 "[$devsb]" -a$j
    fi

    #sdev=`echo $j | tr 0123456789 abcdefghij`
    #if [ $DEBUG != 0 ]; then
    #  echo test -b /dev/sd$sdev  || mknod /dev/sd$sdev  b 8 0
    #else
    #  test -b /dev/sd$sdev  || mknod /dev/sd$sdev  b 8 0
    #fi
  done

  if [ $DEBUG != 0 ]; then
    echo python /mnt/runtime/usr/lib/anaconda/isys.py driveDict
  else
    python /mnt/runtime/usr/lib/anaconda/isys.py driveDict
  fi
  echo -e "\nCOMPLETED - Configuring MegaRAID SAS controller ...\n\n"

else
  echo "FAILED - Could not obtain MegaCli utility from HTTP server ($SERVER)"
fi

if [ $DEBUG == 0 ]; then
  chvt 1
fi

I've tested this on systems with multiple controllers, a single controller and no controller, and it seems to work fine. What I haven't tested is systems where the disk config isn't an even number of drives, or where it is asymmetric across controllers. That's not a config I see us ever using.
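If you do have kit with odd or very small disk counts, a guard along these lines could be called with the per-adapter disk count (devc in the script) before the RAID 10 set is created. This is an untested-on-hardware sketch: check_disk_count is a hypothetical helper, and the minimum of 4 is the usual floor for RAID 10 rather than anything specific to MegaCLI.

```shell
#!/bin/bash
# Return 0 if the disk count can be split into two equal sets for RAID 10.
check_disk_count() {
  local devc=$1
  if [ "$devc" -lt 4 ]; then
    echo "need at least 4 disks for RAID 10, found $devc"
    return 1
  fi
  if [ $((devc % 2)) -ne 0 ]; then
    echo "odd number of disks ($devc), refusing to split"
    return 1
  fi
  return 0
}

check_disk_count 8 && echo "8 disks: ok"
check_disk_count 5 || echo "5 disks: skipped"
```

Skipping the adapter on failure, rather than exiting, would leave any other controllers in the system free to be configured.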

And a final note: if you aren't using System x hardware, you may want to have a look at the dmidecode regex, as it is written to match the names of the systems we use in order to work out which disk partition file to download.
