Thursday, 21 August 2014

Thoughts on Storwize V3700 and GPFS

I've recently taken delivery of our first Storwize V3700 storage array. Prior to this we were using DS3500 series controllers and shelves (which I understand were essentially OEM'd NetApp products). The V3700 is developed by IBM, and apparently the Storwize software is developed at IBM's Hursley labs in the UK.
V3700 with 2.5" drive option, image IBM Redbook

It's a relatively low-cost storage array and has recently been upgraded to support 9 expansion storage shelves, giving up to 132 drives from a single controller head. The one for this project consists of 24 1TB SAS drives and 84 4TB NL-SAS disks.

The V3700 has a lot of features given the price point: it's dual controller (or canister), and supports features such as clustering, mirroring and auto-tiering (some via features on demand). We're planning to use GPFS (or Elastic Storage, as it's now being branded) though, so these features aren't actually of use to us. I do have a lot of respect for IBM for hitting this price point whilst allowing you to uplift to more advanced features should you want to.

The canisters can act in active/active mode: each has a set of volumes (LUNs) that it provides, but these will fail over to the other canister if a canister fails. This means it's possible to distribute the IO over both canisters.

As standard the V3700 has four SAS ports on each canister: ports 1-3 are used for host connection and port 4 is used for the SAS loop between shelves. It also has a PCIe slot which can take either a SAS card or an FC-AL card, so you can use it on a SAN if you want. For our use case we're only going to have two GPFS NSD servers attached, so it makes sense to just use the SAS ports (two cards in each server, one port attached to each canister). Due to the vagaries of the config tools, we also ended up with the extra SAS cards. What is important to note is that the SAS ports on the V3700 are mini-SAS HD (SFF-8644), and the cables we were initially sent were mini-SAS HD at both ends, whereas our HBAs needed SFF-8088.

The GUI!

I must say I'm not overjoyed by the web GUI, but it's significantly more responsive than the Storage Manager for the older kit. It's a web-based GUI and seems to have been made to look pretty. One of the things I don't like is not being able to easily create multiple mdisks (RAID sets) whilst specifying the size of each set. You can select the number of drives to add, but then Storwize decides how it will build arrays under that: for example, using 84 drives I'd like 4x spare and 8x RAID 6 (8+2p) arrays, but it wanted to build several 12 disk arrays. Anyway, that's easily worked around if a little tedious (yes, I could do it via the CLI, but I was playing around with the GUI), by creating 10 disk arrays one at a time and manually marking the spare drives.

One other comment on the GUI: I found it quite hard at times to navigate around, and I'm not sure it's entirely intuitive, but once you get used to where things are it's actually OK to use, and as I mentioned, significantly quicker than Storage Manager.

mdisks, volumes and pools

The normal way of using the V3700 is to create mdisks (RAID sets), put these into pools and then create volumes (LUNs) from the pools. If you are interested in tiering in the hardware then this is a neat feature, but with GPFS we'll use placement policies to drive this. We essentially make an mdisk, assign it to a pool, and then create a volume - each pool contains exactly one mdisk and each volume is carved from exactly one pool.
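
Just for reference, the same thing can be scripted over ssh with the Storwize CLI rather than clicking through the GUI. A minimal sketch - the drive IDs, pool/volume names and sizes below are made up for illustration, not our actual layout:

# Illustrative only: drive IDs, names and sizes are assumptions
mkmdiskgrp -name data_pool_0 -ext 256                       # empty pool with 256MB extents
mkarray -level raid6 -drive 0:1:2:3:4:5:6:7:8:9 data_pool_0 # one 8+2p RAID 6 mdisk into that pool
mkvdisk -mdiskgrp data_pool_0 -iogrp 0 -size 29000 -unit gb -name data_vol_0
mkvdiskhostmap -host nsd01 data_vol_0                       # assumes the host object nsd01 already exists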

We're expecting to get about 250TB of usable space, but right now I'm a bit unsure about the size and number of files - it's going to be running under OpenStack with Glance images. The general guideline for GPFS metadata is 5-10% of your storage; I'm going with 4%, which is handily ~10TB of metadata that we can make from 2x RAID 10 sets over the 1TB SAS drives, which also leaves us 4 spare drives.

For the bulk data, I've provisioned 8x RAID 6 sets with 10 drives per set, which leaves 4 drives spare.

I've left the strip size at the standard 256KB in all the RAID sets, but will probably go with a GPFS block size of 1024KB, which should allow GPFS blocks to align with the RAID strips.

When assigning the LUNs to the two GPFS servers, I've changed the preferred canister from automatic in order to balance the LUNs over the controllers. So, for example, metadata LUN0 will be preferred on canister 1 and metadata LUN1 will be preferred on canister 2.

Similarly, having 8 data LUNs means I can evenly balance 4 LUNs per canister, so hopefully IO over the two canisters, over the two NSD servers and over the two SAS cards in each NSD server should be relatively evenly balanced.
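
That balancing then gets expressed on the GPFS side in the NSD stanza file, by alternating the NSD server order per LUN and tagging the metadata and data LUNs appropriately. A rough sketch, with made-up device paths, NSD names and server names rather than our actual config:

# nsd.stanza - illustrative only
%nsd: device=/dev/mapper/mpatha nsd=meta01 servers=nsd01,nsd02 usage=metadataOnly failureGroup=1 pool=system
%nsd: device=/dev/mapper/mpathb nsd=meta02 servers=nsd02,nsd01 usage=metadataOnly failureGroup=2 pool=system
%nsd: device=/dev/mapper/mpathc nsd=data01 servers=nsd01,nsd02 usage=dataOnly failureGroup=1 pool=data
%nsd: device=/dev/mapper/mpathd nsd=data02 servers=nsd02,nsd01 usage=dataOnly failureGroup=2 pool=data

followed by something like mmcrnsd -F nsd.stanza and then mmcrfs gpfs0 -F nsd.stanza -B 1M -T /gpfs0 to pick up the 1MB block size mentioned above.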

Each of the NSD servers is of course using multipathd, so we should have some degree of fault tolerance against various failures. The only failure I'm not sure about is if half a controller fails - traditionally we'd use top-down/bottom-up loops, but the cabling docs for the V3700 don't list this, and in fact the supplied cables are too short to implement top-down/bottom-up cabling. In all honesty I'm not sure this matters - we don't have enough shelves to stripe down the shelves and be able to sustain a shelf loss without disruption, so we're probably as safe as we can be.
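
For what it's worth, the multipath side just needs a device stanza for the Storwize family (which, as I understand it, presents a SCSI product ID of 2145) in /etc/multipath.conf on each NSD server. Something along these lines - the values here are a starting point from memory, so check IBM's recommended settings for your code level rather than taking this as gospel:

devices {
    device {
        vendor "IBM"
        product "2145"
        path_grouping_policy group_by_prio
        prio alua
        path_checker tur
        failback immediate
        no_path_retry 5
    }
}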

I'd be interested in any thoughts people have on performance tuning the V3700 and IO balancing when using for GPFS.

A more detailed spec of the V3700 is of course in the IBM Redbook.

Tuesday, 12 August 2014

On OpenStack networking with a provider network ...

I've been playing around with OpenStack recently; it's a proof-of-concept development right now which will hopefully turn into something good over the next few months and years. It's a bit of a hybrid HPC/cloud type system (dynamic provisioning of large-scale VMs for data processing) - more details on that later in the year!

For the PoC, I've got a bunch of ports available on a properly routed public net block and a couple of machines behind that to provide the tin for the VMs. Ultimately these will be attached to 10GbE switches with public network connectivity; I'm not sure if these will use IP assignment and the Neutron L3 agent, or if they'll be direct external networks. For this PoC we don't need to use per-tenant networks - perhaps when we get a bit further towards production deployment, then maybe. To be honest, I'm not sure why using Neutron L3 routing is necessary when I have real routers and need to provide relatively high bandwidth external connections into the kit for large data transfers.

To be clear, in the PoC these are flat provider networks, i.e. no segregation of the network into VLANs, and a real router connected to the public internet. When we move towards production I'm expecting (hoping, maybe!) to get a class C net block for the VMs; maybe we'll chop that up a bit into smaller tenant networks, but I'm not sure right now.

The kit is running Scientific Linux 6.5, and I figured packstack/RedHat RDO was a good place to start.

Packstack has a whole bunch of features and config file options, some of which aren't actually implemented.

Now whilst I want my kit running the VMs to be publicly connected, I also don't want those boxes to have public IP addresses. So I have one NIC connected to a management network and a second NIC connected to the public switch. The public switch side is outside of my control at present - it's provided by our networks team - so VLAN tagging etc. is out for the PoC system.

In my packstack-answers file, I set:
CONFIG_NEUTRON_L3_EXT_BRIDGE=br-eth1
CONFIG_NEUTRON_ML2_MECHANISM_DRIVERS=openvswitch
CONFIG_NEUTRON_ML2_VLAN_RANGES=physnet1:1:1
CONFIG_NEUTRON_L2_AGENT=openvswitch
CONFIG_NEUTRON_LB_INTERFACE_MAPPINGS=physnet1:br-eth1
CONFIG_NEUTRON_OVS_VLAN_RANGES=physnet1:1:1
CONFIG_NEUTRON_OVS_BRIDGE_IFACES=br-eth1:eth1
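
(For completeness, the answers file is generated and then applied with packstack itself - the path here is just an example:)

packstack --gen-answer-file=/root/answers.txt
# edit the settings above into /root/answers.txt, then
packstack --answer-file=/root/answers.txt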

Now one might expect this to attach eth1 to the Open vSwitch bridge br-eth1, but this doesn't happen by magic and needs configuring.

Create/edit /etc/sysconfig/network-scripts/ifcfg-br-eth1
DEVICE=br-eth1
DEVICETYPE=ovs
TYPE=OVSBridge
ONBOOT=yes
OVSBOOTPROTO=none

And as eth1 is connected to the public network, change ifcfg-eth1 to be:
DEVICE=eth1
BOOTPROTO=static
NM_CONTROLLED=no
ONBOOT=yes
TYPE=OVSPort
DEVICETYPE=ovs
OVS_BRIDGE=br-eth1
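
Once both files are in place, a network restart should plumb eth1 into the bridge; a quick sanity check:

service network restart
ovs-vsctl show                   # the br-eth1 bridge should exist
ovs-vsctl list-ports br-eth1     # eth1 should be listed (plus whatever port the OVS agent adds to patch it to br-int)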

Note that I don't have an IP assigned on br-eth1. This means services like sshd on the hosting node aren't listening on the public interface, which means my bare metal tin is relatively safe from the outside world; it is, however, perfectly possible to run VMs which are bound to this br-eth1 public bridge and have public-facing services. The only exception to this is the box running Neutron networking - whilst we don't actually use the L3 agent for routing, it does need to be 'up' so that things like DHCP work on the network.

As we won't be using the L3 agent, we need to ensure that the DHCP server provides a route to the metadata server (which provides config to the VM images, things like ssh key pairs). Edit /etc/neutron/dhcp_agent.ini and set:
enable_isolated_metadata = True

Then restart the neutron-dhcp-agent service (note that if you re-run packstack at any point, this will revert to False).

(At this point I should note that in dhcp_agent.ini, use_namespaces = True. I've seen the DHCP agent fail to bind correctly, for example after a reboot, and the solution seems to be to set this to False, restart the service, set it back to True and restart the service again.)
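
Both of those tweaks are easy to script (handy after a packstack re-run) using openstack-config from the openstack-utils package - a sketch:

# Re-assert the metadata setting after a packstack run
openstack-config --set /etc/neutron/dhcp_agent.ini DEFAULT enable_isolated_metadata True
service neutron-dhcp-agent restart

# The namespace toggle workaround mentioned above
openstack-config --set /etc/neutron/dhcp_agent.ini DEFAULT use_namespaces False
service neutron-dhcp-agent restart
openstack-config --set /etc/neutron/dhcp_agent.ini DEFAULT use_namespaces True
service neutron-dhcp-agent restart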

The /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini should contain something like:
[OVS]
enable_tunneling=False
integration_bridge=br-int

bridge_mappings=physnet1:br-eth1

At this point we want to create the flat network. I've swapped out my real IP addresses here for 10.10.10.0/24 - yes, this isn't a public net block, but swap it for your own real public class C.
neutron net-create --provider:physical_network=physnet1 --provider:network_type=flat --shared SHAREDNET
neutron subnet-create SHAREDNET 10.10.10.0/24 --name NET10 --no-gateway --host-route destination=0.0.0.0/0,nexthop=10.10.10.1 --allocation-pool start=10.10.10.20,end=10.10.10.250 --dns-nameservers list=true 10.20.0.125

Assume that 10.10.10.1 is your router IP address and 10.20.0.125 is your DNS server (or list of servers).

Note that we need to use --no-gateway: if you don't, then the route to the metadata server (the 169.254 address) won't get injected by the DHCP server, even if you set enable_isolated_metadata = True as above.

And that should be that, start a VM on NET10 and it should talk directly on the eth1 public interface.
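
Booting a test instance onto it is then just a case of handing nova the net id (the image and flavour names here are examples):

NETID=$(neutron net-list | awk '/SHAREDNET/ {print $2}')
nova boot --flavor m1.small --image cirros --nic net-id=$NETID testvm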

Of course at this point, br-eth1 could contain an Ethernet bond underneath, or different physical interfaces.

With reference and thanks to the docs at https://developer.rackspace.com/blog/neutron-networking-simple-flat-network/. Once the PoC is a bit more developed, I'll take a further look at this networking config, in particular possibly chopping up the real network with vlan provider networks https://developer.rackspace.com/blog/neutron-networking-vlan-provider-networks/.

Monday, 11 August 2014

Quick and dirty hacking hosts.equiv with xcat

I've mentioned before that we use xCAT for auto discovery and genesis of nodes in our environment. We've recently needed to add a second domain for a separate cluster of systems, and we need to be able to ssh without keys around those systems, which is pretty simple using hosts.equiv - though it needs updating regularly as we add nodes to the system. So a quick and dirty shell script, called from cron every day, dumps all the nodes in the relevant group. The script writes its output into a directory which is in the synclists for the xCAT profile, so it gets pushed to new nodes on initial install, and xdcp copies it out to the existing nodes each day (the cron and synclist glue is sketched after the script below).

Oh, and there's a quick check to make sure the file looks vaguely sane before distributing it, just in case something bad happens with the nodels...

#!/bin/bash

PATH=/sbin:/bin:/usr/bin:/usr/sbin:/opt/xcat/bin
GROUP=foo
OUTPUT=/install/data/$GROUP/etc/hosts.equiv
DOMAIN=baa.cluster
MASTER=xcatmaster
date=`date`

cat <<- EOT > $OUTPUT
# autogenerated
# on $date
$MASTER
$MASTER.main.cluster
$MASTER-foo
$MASTER-foo.$DOMAIN
EOT

for node in `nodels $GROUP`; do
  cat <<- EOT >> $OUTPUT
$node
$node.$DOMAIN
EOT
done
WC=`wc -l $OUTPUT | awk '{print $1}'` 

if [ $WC -lt 12 ]; then
  echo "WARNING: Unexpcted number of lines in $OUTPUT"
else
  xdcp $GROUP -Q $OUTPUT /etc/hosts.equiv

fi
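
For completeness, the glue around the script looks something like the following - the script path and synclist location here are examples rather than anything you need to copy:

# /etc/cron.d entry on the xCAT master - regenerate and push daily
30 6 * * * root /usr/local/sbin/update-hosts-equiv.sh

# line in the synclist for the profile, so new installs get the file too
/install/data/foo/etc/hosts.equiv -> /etc/hosts.equiv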

Friday, 1 August 2014

Hardware RAID sets with Kickstart and megaraid controllers

Some of the kit we purchase uses hardware RAID adapters to control the disk storage. This tends to be service kit, or places where we want fast local storage attached to a system. On HPC compute nodes we're not so bothered about this - if we lose a node due to disk failure, well, a user can resubmit the job.

We use xCAT to bare-metal provision our systems running Scientific Linux 6.x; internally this just uses kickstart files, so this approach should work on anything using kickstart to install EL-based distributions. It's not possible to configure the RAID controller directly from the installer, the disks aren't visible as JBOD drives, and we might have a number of systems coming in for a project, so we don't want to be configuring them all by hand.

First off, I'll start by saying that I took inspiration for this from Samuel Kielek's blog post on the subject. My needs are a bit different: we have multiple controllers in some of the kit, and we also want RAID 10 sets rather than straight mirrors. I also found a page at Cambridge which documents parts of MegaCLI quite nicely.

Second, I've only tested this with our System x kit with MegaRAID adapters. You'll need a copy of the MegaCLI utility; we're running on IBM hardware, so you can grab it from IBM's support site, but you probably want to find your own vendor's release if you are planning on using this.

Be warned: using this script will clear the config on your MegaRAID adapters, i.e. it will remove any RAID sets you may have configured!

I'll explain a little about the script first. Basically it downloads the MegaCLI tools from our xCAT server, scans for adapters and disks, and builds a hash of them. Actually, as it's bash, it's not really a multidimensional array but a hacked approach at one...

Once we have found our adapters and disks, we split the disks on each adapter into two sets and use those to build a RAID 10 set. If we only have one adapter then we have a RAID 10 set already and we're done.

If we don't find an adapter, that's fine as well, as it means the disks aren't controlled by a RAID controller.

Now, a few of the systems (x3950 X6) have two adapters with half the drives on each adapter. This is essentially because you can partition the x3950 into two discrete systems, but it does mean that our drives are split across two controllers and we can't have a fully hardware RAID 10 set.

Now initially I was going to build a RAID 0 stripe on each controller and then mirror the two in software RAID, but my colleague suggested it would be better to build a hardware RAID 10 set on each controller and then stripe across them in software. Basically, with a RAID 0 stripe per controller, a single drive failure kills that controller's entire stripe and leaves us with no redundancy, so a failure on each controller would lose the lot; with a hardware RAID 10 set per controller, we can lose a drive from every mirrored pair - potentially two drives on both controllers - and the overall usable space is the same.

As we want to use the same script for all the systems we have, we do some matching of the device model string from dmidecode and use that to download a file containing the kickstart disk partitioning.

First off, edit the kickstart file you use and change your disk partition lines to be:
%include /tmp/disk-partition
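
The per-model files are just ordinary kickstart partitioning directives. As an illustration only (the sizes and layout here are made up, not our real scheme), one might contain:

zerombr
clearpart --all --initlabel --drives=sda
part /boot --fstype=ext4 --size=512 --ondisk=sda
part swap --size=8192 --ondisk=sda
part / --fstype=ext4 --size=1 --grow --ondisk=sda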

The partition file itself gets downloaded by the script. If you are using plain kickstart, place a copy of the script below in the %pre section of your kickstart file; we use xCAT, which allows files to be included, so I have:
#INCLUDE:/install/data/MegaCLI/include-in-template#

When a machine is installed, xCAT builds a copy of the kickstart file for the machine and includes the file above verbatim; this means I can keep the script in a separate location and use it for multiple template files.

And finally, the script which will do the magic for us:

#------------------------------------------------------------------------------#
#                     PRE-INSTALL HARDWARE RAID SETUP                          #
#------------------------------------------------------------------------------#

DEBUG=0
#
SERVER='10.30.0.254';
SRCDIR='/install/data/MegaCLI'

MODEL=`dmidecode | grep 'Product Name' | egrep 'x[0-9]+' | sed 's/.*\(x[0-9][0-9][0-9][0-9]*\).*/\1/'`
wget -q http://$SERVER$SRCDIR/partition-table/$MODEL
mv $MODEL /tmp/disk-partition

if [ $DEBUG == 0 ]; then
  exec < /dev/tty3 > /dev/tty3 2>&1
  chvt 3
fi

echo -e "\nConfiguring the MegaRAID SAS controller ..."

cd /tmp
[ -d mega-cli ] && /bin/rm -rf mega-cli
[ ! -d mega-cli ] && mkdir mega-cli
cd mega-cli

wget -q http://$SERVER$SRCDIR/MegaCli64
wget -q http://$SERVER$SRCDIR/libsysfs.so.2.0.2
wget -q http://$SERVER$SRCDIR/libstorelibir-2.so.13.05-0

mirror=/tmp/mirror_disks

declare -A matrix

if [ $DEBUG == 0 ]; then
  [ -f $mirror ] && rm -f $mirror
fi

if [ -f MegaCli64 ]; then
  chmod +x MegaCli64
  # Probe for disks
  if [ $DEBUG != 0 ]; then
    echo ./MegaCli64 -PDList -aALL | egrep 'Adapter|Enclosure Device ID|Slot Number'
    if [ ! -f $mirror ]; then
      ./MegaCli64 -PDList -aALL | egrep 'Adapter|Enclosure Device ID|Slot Number' > $mirror
    fi
  else
    ./MegaCli64 -PDList -aALL | egrep 'Adapter|Enclosure Device ID|Slot Number' > $mirror
  fi

  if [ ! -f $mirror ]; then
    echo -e "\n\nNo MegaRAID device file\n"
    exit
  fi

  c=`grep -c Adapter $mirror`
  if [ $c == 0 ]; then
    echo -e "\n\nNo MegaRAID devices found\n"
    exit
  fi

  oIFS=$IFS
  IFS="$(echo -e "\n\r")"

  adap=foo
  enc=foo

  lowadap=999
  # Figure out where the disks are
  for line in $(cat $mirror); do
    if [[ $line =~ Adapter ]]; then
      adap=$( echo $line | sed -e 's/^\s*Adapter #\([0-9]\+\).*$/\1/' ) # -e 's/\s*$//' -e 's/#//g' )
      enc=foo
      if [ $adap -le $lowadap ]; then
        lowadap=$adap
      fi
    fi

    if [ "$adap" != "foo" ]; then
      if [[ $line =~ Enclosure ]]; then
        enc=$( echo $line | awk -F: '/Enclosure/{print $2}' | sed -e 's/^\s*//' -e 's/\s*$//' )
      fi
      if [ "$e" != "foo" ]; then
        if [[ $line =~ Slot ]]; then
          slot=$( echo $line | awk -F: '/Slot/{print $2}' | sed -e 's/^\s*//' -e 's/\s*$//' )
          matrix[$adap,$slot]=$enc
        fi
      fi
      #(( c++ ))
    fi
  done

  IFS=$oIFS

  for ((j=$lowadap;j<=$adap;j++)) do
    # Clear any existing configuration
    if [ $DEBUG != 0 ]; then
      echo ./MegaCli64 -CfgClr -Force -a$j
     else
      ./MegaCli64 -CfgClr -Force -a$j
    fi
    devs=
    devc=0
    for ((s=0;s<16;s++)) do
      if [ "${matrix[$j,$s]}" != '' ]; then
        devs=$devs,${matrix[$j,$s]}:$s
        devc=$((devc + 1))
      fi
    done

    disksperarray=$(($devc / 2))
    devsa=
    devsb=
    collected=0
    for ((s=0;s<16;s++)) do
      if [ "${matrix[$j,$s]}" != '' ]; then
        if [ $collected -lt $disksperarray ]; then
          devsa=$devsa,${matrix[$j,$s]}:$s
        else
          devsb=$devsb,${matrix[$j,$s]}:$s
        fi
        collected=$(($collected + 1))
      fi
    done

    devs=`echo $devs | sed 's/^,//'`
    devsa=`echo $devsa | sed 's/^,//'`
    devsb=`echo $devsb | sed 's/^,//'`

    # Ensure the drives are in a good state before creating the logical device
    if [ $DEBUG != 0 ]; then
      echo ./MegaCli64 -PDMakeGood -PhysDrv "[$devs]" -Force -a$j
    else
      ./MegaCli64 -PDMakeGood -PhysDrv "[$devs]" -Force -a$j
    fi
    # Create a RAID10 logical device from the detected disks 
    if [ $DEBUG != 0 ]; then
      echo ./MegaCli64 -CfgSpanAdd -r10 -Array0 "[$devsa]" -Array1 "[$devsb]" -a$j
    else
      ./MegaCli64 -CfgSpanAdd -r10 -Array0 "[$devsa]" -Array1 "[$devsb]" -a$j
    fi

    #sdev=`echo $j | tr 0123456789 abcdefghij`
    #if [ $DEBUG != 0 ]; then
    #  echo test -b /dev/sd$sdev  || mknod /dev/sd$sdev  b 8 0
    #else
    #  test -b /dev/sd$sdev  || mknod /dev/sd$sdev  b 8 0
    #fi
  done

  if [ $DEBUG != 0 ]; then
    echo python /mnt/runtime/usr/lib/anaconda/isys.py driveDict
  else
    python /mnt/runtime/usr/lib/anaconda/isys.py driveDict
  fi
  echo -e "\nCOMPLETED - Configuring MegaRAID SAS controller ...\n\n"

else
  echo "FAILED - Could not obtain MegaCli utility from HTTP server ($SERVER)"
fi

if [ $DEBUG == 0 ]; then
  chvt 1
fi

I've tested this on systems with multiple controllers, a single controller and no controller, and it seems to work fine. What I haven't tested is systems where the disk config isn't an even number of drives, or where it is asymmetric across controllers - that's not a config I see us ever using.

And a final note: if you aren't using System x hardware, then you may want to have a look at the dmidecode regex, as that is written to pattern match the names of the systems we use in order to work out which disk partition file to download.

IBM x3950 x6 and shared mode IMM

For a project, we've recently installed an x3950 x6 system.
x3950 x6, image IBM Redbook


Now these systems are very cool in their own right: it's an 8U chassis which is basically made up of a pair of x3850 X6 systems. The x3950 takes 8 compute modules with E7 v2 CPUs and supports up to 12TB of RAM (not that we have that much!).

Being System x, it supports IMM2 for out-of-band management (IPMI with more stuff). On the rest of our System x kit we run the IMM in shared mode with VLAN tagging, where the IMM is attached to the primary Ethernet port; this reduces the number of cables we have to run to each system.

Normally you can configure this via the UEFI interface: you go in, switch it from dedicated to shared, set the IP and VLAN tag, and that's it.

On the X6, shared mode isn't listed as an option.

To interject here: we don't normally configure these by hand, as we use xCAT to discover and "genesis" the systems. I won't go into xCAT here, but for this part it works well for us. Genesis uses ipmitool internally to configure the IMM to run in shared mode and set the IP and VLAN. Except it didn't work on the X6 - the IPMI status was listed as "set in progress", which from experience means something has been sent to the IMM that it doesn't understand.

OK, these are new systems and we're running version 1.0 of the UEFI (though why was there a newer 1.0 released without the version number changing?!).

It was at this point I went to check the IMM interface and found that only dedicated mode is available.

A call went in to our integrator, and a case was logged with IBM support.

The engineer I spoke to didn't know what shared mode was, but to be fair after I explained he went off to read up about it.

He did come back with a workaround, and we did find our own workaround as well.

We got it escalated to L2 support, who have confirmed it as a bug, with new firmware to fix it expected later this year.


So the workarounds...


The first is to set the IMM up in dedicated mode so that you can ssh to the IMM interface, then run
ifconfig eth0 -nic shared_option_1

(You can do all VLAN config etc from this interface as well, see the IMM docs but in short:
ifconfig eth0 -vlan enabled -vlanid 3002 -nic shared_option_1)

The second was to tweak the xCAT config. In the ipmi table you can define parameters to determine shared mode. The xCAT docs list using 0 in the bmcport field to use shared mode on the first LAN port; this was what we had set, but it doesn't work on the X6. The docs also list other options, one of which is "2 0", meaning use shared mode on the first port of the mezzanine card. When I set this, I found that genesis completed correctly and automatically.

The row we now have in the ipmi table in xcat is:
"x3950x6","|\A|bmc|","2 0","3002",,,,,

We then add the machine to the x3950x6 xCAT group; this auto-provisions the IMM on the first shared mezzanine port and assigns it to VLAN 3002.
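
Assuming the group row is in the ipmi table as above, bringing a new machine in is then just a case of dropping it into the group (the node name here is an example):

chdef -t node -o newx6node -p groups=x3950x6   # add the node to the group
tabdump ipmi | grep x3950x6                    # check the row it will pick up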

So now I have an autoconfig solution and the promise of updated firmware. Yes, I could have settled for workaround 1, but then I wouldn't have a solution for any future X6 systems ...