Monday, 29 September 2014

xcat, configuration management and salt

As I've mentioned several times, we use xcat for node discovery in our clusters. With the OpenStack bits, I've been finding it more and more difficult to get the massively diverse configs correct by hacking postscripts etc. from xcat.

So we've implemented saltstack for config management. We don't just use xcat for the OpenStack install but also for a general HPC cluster and for a few other supporting systems. I'm fairly agnostic about the choice of config management tools, but one of my colleagues is familiar with salt, so that's the way we've gone.

It's taken a couple of weeks to butcher the legacy xcat configs into something approaching manageability with salt, which is why there's been a lack of blog posts for a while.

Anyway, I now have a nice data structure in salt pillar data which I can use to configure my OpenStack services as well as building the configs for keepalived and haproxy. For example I have something along these lines:

    glance:
      dbusers:
        glance:
          password: DBPASSWORD
          hosts:
            - localhost
            - "%.climb.cluster"
          grants:
            "%.climb.cluster": "glance.*"
      osusers:
        glance: OSPASSWORD
      databases:
        - glance
      hosts:
        server1: glance-1.climb.cluster
        server2: glance-2.climb.cluster
      vipname: climb-glance.climb.cluster
      backendport:
        glanceregistry: 9191
        glanceapi: 9292
      realport:
        glanceregistry: 9191
        glanceapi: 9292

Basically this allows me to define the databases and user/passwords that are required for the system. We also define the users in OpenStack here, as well as info relating to the HA configs (for example the backend server port and the haproxy port that will be used in the system configs).

One thing to note is that I avoid specifying IP addresses in the config files as far as possible (e.g. the keystone auth url can use a name), so we use vipname to specify what would be used by clients. The hosts: section defines the internal IP address that the service should bind to. This is important in cases where haproxy is running on the same nodes as those providing the service.
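As an illustration, a haproxy frontend/backend pair can be templated straight from this pillar data. This is only a sketch, not our exact production template: the backendport key and the hosts layout are assumptions based on the sample pillar above.

```jinja
{%- set g = pillar['climb']['openstackconfig']['glance'] %}
listen glance-api
    bind {{ g['vipname'] }}:{{ g['realport']['glanceapi'] }}
    mode tcp
{%- for name, host in g['hosts'].items() %}
    server {{ name }} {{ host }}:{{ g['backendport']['glanceapi'] }} check
{%- endfor %}
```

The point being that the same pillar keys drive the OpenStack service config, haproxy and keepalived, so a port or name only ever changes in one place.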

Once we've specified this outline config, we then have templates in salt along the following lines:

{% set OS_CONFIG="/etc/glance/glance-registry.conf" %}
{% set OS_CONFIGAPI="/etc/glance/glance-api.conf" %}
{% set OS_CONFIGCACHE="/etc/glance/glance-cache.conf" %}
{% set OS_GLANCE=pillar['climb']['openstackconfig']['glance']['osusers'].keys()[0] %}
{% set OS_AUTHINT='http://' + pillar['climb']['openstackconfig']['keystone']['vipname'] + ':' + pillar['climb']['openstackconfig']['keystone']['realport']['public']|string + '/v2.0/' %}
{% set OS_AUTHSRV=pillar['climb']['openstackconfig']['keystone']['vipname'] %}
{% set OS_AUTHPORT=pillar['climb']['openstackconfig']['keystone']['realport']['admin']|string %}
{% set OS_GLANCE_URL='http://' + pillar['climb']['openstackconfig']['glance']['vipname'] + ':' + pillar['climb']['openstackconfig']['glance']['realport']['glanceapi']|string %}

# create the database if it doesn't exist
glance-db-init:
  cmd:
    - run
    - name: glance-manage db_sync
    - runas: glance
    - unless: echo 'select * from images' | mysql {{ pillar['climb']['openstackconfig']['glance']['databases'][0] }}
    - require:
      - pkg: openstack-glance-pkg

# ensure the openstack user for the service is present
glance-os-user:
  keystone.user_present:
    - name: {{ OS_GLANCE }}
    - password: "{{ pillar['climb']['openstackconfig']['glance']['osusers'][OS_GLANCE] }}"
    - email: "{{ pillar['climb']['openstackconfig']['config']['adminmail'] }}"
    - tenant: {{ pillar['climb']['openstackconfig']['config']['tenant']['service'] }}
    - roles:
      - {{ pillar['climb']['openstackconfig']['config']['tenant']['service'] }}:
        - {{ pillar['climb']['openstackconfig']['config']['role'][pillar['climb']['openstackconfig']['config']['tenant']['service']] }}

# ensure the endpoint is present
glance-endpoint:
  keystone.endpoint_present:
    - name: glance
    - publicurl: "{{ OS_GLANCE_URL }}"
    - adminurl: "{{ OS_GLANCE_URL }}"
    - internalurl: "{{ OS_GLANCE_URL }}"
    - region: {{ pillar['climb']['openstackconfig']['config']['region'] }}

# define the database config to use
glance-db-connection:
  openstack_config.present:
    - name: connection
    - filename: {{ OS_CONFIG }}
    - section: database
    - value: "mysql://{{ pillar['climb']['openstackconfig']['glance']['dbusers'].keys()[0] }}:{{ pillar['climb']['openstackconfig']['glance']['dbusers'][pillar['climb']['openstackconfig']['glance']['dbusers'].keys()[0]]['password'] }}@{{ pillar['climb']['openstackconfig']['database']['vipname'] }}/{{ pillar['climb']['openstackconfig']['glance']['databases'][0] }}"
    - require:
      - pkg: openstack-glance-pkg
    - require_in:
      - service: openstack-glance-registry
      - cmd: glance-db-init

We of course set a lot more properties in the glance registry and api configs, but as far as possible we use the salt state openstack_config.present to abstract this. I've only put a bit of sample config here.

Whilst ideally I'd like to be able to build the whole OpenStack cluster from salt, it's not really possible. For example, we have a GPFS file-system sitting under it, and getting salt to set up the GPFS file-system is kinda scary; similarly getting the HA MariaDB database up, or the swift ring built, is scary. So my compromise is that there are a few bits and pieces that need setting up in a kind of 'chicken/egg' situation, but salt can be used to re-provision any node in the OpenStack cluster (assuming we didn't lose everything), and the re-provision should leave the node in a fully working state. Basically this means a bit of manual intervention, e.g. setting up the swift ring, but those bits get copied back into salt and pushed out from there.

One thing I will say about jinja templates with salt is that I feel {{ overload }}...

It will be interesting to see how it performs when we have all the HPC and OpenStack nodes running via salt.

Wednesday, 10 September 2014

Render farm management with PipelineFX Qube!

Something a bit different from recent ramblings on OpenStack! As part of our research support infrastructure we've planned to provide a render farm to allow high resolution stills or video to be rendered. Right now it's a small render farm made up of 1 controller and a couple of worker nodes. We got quite a bit of the kit some time ago, but other projects have taken priority, so we placed the workers into a general purpose HPC queue so they weren't being wasted.

I'll try to be careful to refer to it as a render farm rather than cluster, but if I mention cluster, read that as farm ;-)

Getting the render farm up and running has now bubbled to the top of my list to get it to proof of concept stage.

For various reasons we're using PipelineFX Qube! as the render manager software. It integrates with some 3d rendering software and runs across Windows, OS/X and Linux.

The farm is made up of 1 controller node (the supervisor) running Linux and two render (worker) nodes running Windows 7. Initially I'd hoped to run it all under Linux, but one of the applications we have licensed (Autodesk 3ds Max) is only available under Windows. The render nodes have a couple of applications directly installed on them for rendering (Blender, Autodesk 3ds Max and Autodesk Maya); if we get demand we'll add more later.

Qube! includes its own installer for all platforms to install the supervisor, worker and client applications. However, we like to deploy our Linux boxes with zero touch, so we use xcat to deploy the software; we also use it to deploy the config and license files so we can keep them in a centrally backed up repository.

Qube! requires mysql for its data warehouse backend; the installer will try to install this for you, so we include it as part of our xcat image, with a standard script to lock down the mysql database install. Ideally I'd like to point the database at our clustered HA database service, but as of Qube! 6.5, they only support using MyISAM tables, which don't work with Galera clustering. I did ask one of the tech guys about this and they suggested something along the lines of MyISAM being better for performance. Whilst that may have been true many years ago, I'm not sure that holds now. Still, we are where we are.

As far as possible, we push all the config into the server side of things; the qbwrk.conf file can be used to specify config options for classes of workers (e.g. Windows, Linux) as well as for specific nodes. This means we have to do very little config on the workers, and it's one of the nice things about Qube!. The basic xcat package list looks like:

Once installed, you need to do some basic configuration. I have an /etc/qbwrk.conf file which includes config to be pushed to workers:
[default]
worker_description = "Render Farm"
proxy_execution_mode = user
worker_logmode = mounted

[linux]
worker_description = "Linux Render Farm"
worker_logpath = "/gpfs/qube/logs"

[winnt]
worker_description = "Windows Render Farm"
worker_logpath = "\\\\fileshare\qubelogs"

We use the shared log path config; this is the recommended configuration from PipelineFX, and means the workers write directly to the log directory rather than via the supervisor. A couple of things to note on this: our Linux log path "/gpfs/qube/logs" is the same directory shared via Samba as \\fileshare\qubelogs. The thing I really don't like about this is that it needs to be Full Control/o+rwx to allow logging to work, which also means users can see other users' log files (and potentially interfere with them!).
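Since the same GPFS directory is exported to the Windows workers over Samba, the share definition is roughly this sketch - the share name and path are from the text above, but the mask options are assumptions to give the (unfortunately) world-writable behaviour described:

```
[qubelogs]
   path = /gpfs/qube/logs
   browseable = no
   writable = yes
   create mask = 0666
   directory mask = 0777
```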

Other than that in /etc/qb.conf there's very little config required:
qb_domain = d1
qb_supervisor = supervisor.cluster
client_logpath = /gpfs/qube/logs
supervisor_default_cluster = /d1
supervisor_default_priority = 9950
supervisor_highest_user_priority = 9950
supervisor_default_security = 
supervisor_host_policy = "restricted"
supervisor_logpath = /gpfs/qube/logs
supervisor_preempt_policy = passive
mail_administrator = admin@domain
mail_domain = domain
mail_host = smtpserver.cluster
mail_from = admin@domain

A couple of things to note here: supervisor_host_policy = "restricted" requires the worker to be defined in the qbwrk.conf file (just to prevent someone accidentally adding a worker to the cluster). We also set the default priority of jobs to 9950, and the highest a user can set their own priority to is also 9950. Basically this allows us as admins to bump the priority of jobs up if we need to, and allows a user to drop the priority of their own jobs if they want another to run in preference. The scheduler for Qube! isn't particularly complicated (more or less highest priority first, with first in first out). There's also no way to integrate it into another scheduling system.

The default permissions seem a bit scary to me (I think users can interact with other users' jobs!), so we locked things down by default and then created our admin users, which map to our other admin accounts. For example, to create an admin account we'd do:
/usr/local/pfx/qube/sbin/qbusers -set -all -admin -sudo -impersonate -lock <ADMINUSER>

To clean up the default users:
/usr/local/pfx/qube/sbin/qbusers -set administrator
/usr/local/pfx/qube/sbin/qbusers -drop qube
/usr/local/pfx/qube/sbin/qbusers -drop qubesupe
/usr/local/pfx/qube/sbin/qbusers -drop system
/usr/local/pfx/qube/sbin/qbusers -drop root

We also restrict what normal users can do by default and so users have to be specifically registered with something like:
/usr/local/pfx/qube/sbin/qbusers -set -submitjob -kill -remove -modify -block -interrupt -unblock -suspend -resume -retry -requeue -fail -retire -reset <USERNAME>

On the Windows 7 worker nodes, we do use the installer to install components for us. We use a basic Windows 7 Enterprise image which is joined to our Active Directory. The installer is pretty good, it allows you to use an offline cache of the packages and will generate "kickstart" style files for replaying the install on multiple machines. It comes with pre-defined classes of system, e.g. worker, client etc.

As well as the Qube! service itself, there are also a number of job templates which can be installed; these are wrapper scripts to allow Qube! to better integrate with various applications. On the client side, some of these include plugins to allow direct submission from the application to the farm.

Pretty much all we need to do when installing the Windows farm systems is to give the name of the render server. Our render farm nodes are on a private network, the supervisor is on both a public and private network, so I just have to be a little careful at this point to specify the internal name for the farm nodes to prevent the traffic traversing the NAT gateways and back in again!

The Windows machines obviously also need the applications installed on them, and the Autodesk suite is BIG. Ideally we'd push these in from something like SCCM or wpkg, but with only a few nodes, right now we're doing them by hand. (One thing I hate is that whilst the Autodesk apps support flexlm licenses, there is no way to specify the port on the license manager from inside the installer, so I have to go back and edit it later!)
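For my own notes: the Autodesk apps pick the license server up in the standard flexlm port@host form (via the ADSKFLEX_LICENSE_FILE setting), which is the bit I end up editing after the install. The port and server name below are illustrative:

```
ADSKFLEX_LICENSE_FILE=27000@licserver.cluster
```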

Qube! has several ways in which the worker node can run. One is "desktop" mode, which runs jobs as the user currently logged into the system. The second is "service" mode, which in itself has two modes: proxy user and run as user. Proxy user uses a local, hardcoded user to run jobs as; run as user requires the end user to cache credentials into Qube!. We're using the latter. Either of the other two modes requires "other" users to be able to access your files and write to your output folders. Whilst this may be OK in a company where everyone is working on the same projects, it doesn't work in our environment, so it's better to run as the real end user. The only downside is that it requires caching of the user's password, which is a pain when they change it, though ultimately it's no worse than our Windows HPC environment I guess.

The only other thing we really need to do on the worker nodes is to allow access via the Windows firewall:
netsh advfirewall firewall add rule name="Qube! 50001 TCP" protocol=TCP dir=in localport=50001 action=allow
netsh advfirewall firewall add rule name="Qube! 50001 UDP" protocol=UDP dir=in localport=50001 action=allow
netsh advfirewall firewall add rule name="Qube! 50002 TCP" protocol=TCP dir=in localport=50002 action=allow
netsh advfirewall firewall add rule name="Qube! 50002 UDP" protocol=UDP dir=in localport=50002 action=allow
netsh advfirewall firewall add rule name="Qube! 50011 TCP" protocol=TCP dir=in localport=50011 action=allow
netsh advfirewall firewall add rule name="Qube! 50011 UDP" protocol=UDP dir=in localport=50011 action=allow

So really, this is just a basic overview of the initial setup and some of the features I think are worth looking at. The Qube! docs are pretty good and I've found their support people pretty responsive for the few occasions I've needed to get in touch. More details on Qube! are available from PipelineFX. They also run regular training courses (usually free), maybe a couple of times per year.

Monday, 8 September 2014

HA RabbitMQ for OpenStack

Next up is to get HA queues for OpenStack across our servers; this is pretty simple to get working. It does need haproxy and keepalived; as I've already done that, I just need to add the config to haproxy after rabbitmq is installed.

Again using xcat, we want some extra packages:

On each node, configure the /etc/rabbitmq/rabbitmq-env.conf file, I'm doing this to bind the rabbitmq to the IP address on the VLAN tagged 40GbE network I'm using for OpenStack management traffic:
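The file itself is tiny; a minimal sketch using rabbitmq's standard variable name, with a placeholder for the per-host address:

```
# /etc/rabbitmq/rabbitmq-env.conf
RABBITMQ_NODE_IP_ADDRESS=CHANGEIP
```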

Obviously the IP address is different on different systems.

On one of the nodes start the rabbitmq-server.

Enable queues to be mirrored by default:
rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'

Next you need to sync the erlang cookie to all your nodes (copy /var/lib/rabbitmq/.erlang.cookie to each node).
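A sketch of the copy, assuming passwordless root ssh between the nodes (hostname illustrative). The cookie must keep its ownership and tight permissions or rabbitmq will refuse to start:

```shell
# push the shared erlang cookie to the other cluster member
scp -p /var/lib/rabbitmq/.erlang.cookie server2:/var/lib/rabbitmq/.erlang.cookie
ssh server2 'chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie; chmod 400 /var/lib/rabbitmq/.erlang.cookie'
```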

On the next server node start rabbitmq, stop the queue, add to the cluster and restart the queue:
service rabbitmq-server start
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@server1
rabbitmqctl start_app

Check that the queue policy is visible from the second node:
% rabbitmqctl list_policies
Listing policies ...
/ HA ^(?!amq\\.).* {"ha-mode":"all"} 0


At this point, we probably want rabbitmq to be able to start at boot time and join the cluster, so create /etc/rabbitmq/rabbitmq.config on all nodes of the form:
[{rabbit,
  [{cluster_nodes, {['rabbit@server1', 'rabbit@server2'], ram}}]}].

haproxy config

Assuming you already have haproxy installed and configured at least basically, something like the following should be added to the haproxy config file:
listen rabbitmq
    mode tcp
    balance roundrobin
    option tcpka
    server server1-osmgmt check inter 5s rise 2 fall 3
    server server2-osmgmt check inter 5s rise 2 fall 3

Once this is setup in haproxy, reload the haproxy config and then point the various OpenStack components at the HA IP address (or preferably the name ;-)).

Well, hopefully it should all work, I don't have quite enough of OpenStack up at the time of writing to test it all!

Rabbitmq authentication

By default there is a guest rabbitmq user; change the password for this to something random:
rabbitmqctl change_password guest `openssl rand -hex 10`

We also want to create a user for OpenStack to use, generate a password (and remember it!) and then create the user:
rabbitmqctl add_user <USERNAME> <PASSWORD>
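The new user also needs permissions on the vhost before OpenStack can use it; a sketch, with an illustrative username:

```shell
# grant configure/write/read on the default vhost
rabbitmqctl set_permissions -p / openstack ".*" ".*" ".*"
```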

DR'ing a node

If you lose a node (say DR!), on one of the other rabbit cluster nodes, run something like:
rabbitmqctl forget_cluster_node rabbit@server2

Then follow the steps to add server2 back into the cluster.

If you reinstall a node which was already part of the cluster (or lose the contents of /var/lib/rabbitmq), it should be possible to rejoin the cluster by creating /etc/rabbitmq/rabbitmq.config, starting the daemon, then:
rabbitmqctl stop_app
rabbitmqctl change_cluster_node_type disc
rabbitmqctl start_app

Note that if you don't change the node cluster type, then it will be a RAM-only cluster node and the queues won't be written to disk.

Highly available MariaDB for OpenStack

On our OpenStack installation, we'd like to have an HA core set of services, this is the first in (probably!) a series of blog posts on getting our core stack of services into an HA environment. Wherever possible, I'd like them to be running active/active rather than using something like pacemaker to hand over services.

OpenStack supports both MySQL and Postgres for its underlying database. Whilst I'm generally a Postgres-preferring person, it doesn't really do active/active, most of the OpenStack distros seem to utilise MySQL, and the docs are based around MySQL. We also run an HA MySQL cluster for another service, so it's something I'm familiar with.

Ideally you want at least three nodes in the cluster, however for this project, I'm planning to use the two GPFS servers for the database cluster, I'm also opting for the MariaDB fork rather than MySQL itself. My testing is based around 10.0.13 using the RPMs from MariaDB, I'm also using the MariaDB Galera cluster version.

To avoid split brain conditions with a two node cluster, we also use garb, the Galera arbitrator daemon (in the galera rpm).

Initial build and config

I'm using xcat to install the machines, and have added the following packages to the profile:

Now as I'm running a two node cluster, I also added percona-xtrabackup (2.2.3) from Percona. This backup tool integrates with the Galera wsrep provider and allows the initial database server state transfer to occur without blocking one of the cluster nodes - using the mysqldump method would block access to one server when a new server joined, to allow the state transfer to occur. In general we won't be doing this often, but what we would like is for a reinstall of a server node to be possible without intervention and without interruption to the service, i.e. we can do a DR on a server node without disruption.

We use xcat to push a couple of config files out to the nodes, and a postbootscript which locks down the local database config (based on the mysql_secure_installation script). The first is /etc/my.cnf:


[mysqld]
datadir = /var/lib/mysql

# include all files from the config directory
!includedir /etc/my.cnf.d

And the second is /etc/my.cnf.d/server.cnf:
[mysqld]
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links = 0
# settings recommended for OpenStack
collation-server = utf8_general_ci
init-connect = 'SET NAMES utf8'
character-set-server = utf8
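For completeness, the Galera side of the server config (same [mysqld] section) looks something like this sketch - the provider path, ports and SST settings are standard Galera/MariaDB values rather than our exact production config:

```
binlog_format = ROW
default-storage-engine = InnoDB
innodb_autoinc_lock_mode = 2
bind-address = CHANGEIP
wsrep_provider = /usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address = gcomm://server1-osmgmt,server2-osmgmt
wsrep_node_address = CHANGEIP
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = galerasync:CHANGEPASSWORD
```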

A script substitutes the two CHANGEIP references with the host's IP address on the OpenStack management network (which is a tagged VLAN on the 40GbE network we have) - the host's name on that network is also what appears in the wsrep_cluster_address line. The postbootscript also adds the galerasync user, sets the password and gives it access to all databases; this user will be used to sync data between cluster nodes. (Before I did this, wsrep picked the IP on the first network interface, which happens in my case to be bond0, which is 'up' but currently not plumbed into a switch so can't pass traffic!) By binding the database and wsrep to a specific IP, we can control where the traffic goes; we can also use haproxy and keepalived to load-balance an additional IP over the systems without needing to run the database on a different port number.

Now you may note that we have datadir defined in both my.cnf and server.cnf; this is a quirk of using xtrabackup - that tool doesn't traverse included config files, so needs it defined at the 'top' level.

Starting up the cluster

The names of both servers are included in the config file, but to start the cluster from cold, you'll need to logon to one of the nodes and start the cluster by hand:
mysqld --wsrep_cluster_address=gcomm:// --user=mysql

Once this is done you can start the second node of the cluster. Initially I test by starting it by hand, which allows me to diagnose what is going on with the cluster:
mysqld --wsrep_cluster_address=gcomm://server1-osmgmt,server2-osmgmt --user=mysql

Once I can see everything is happy, I'll stop one node with the init script, restart it with the init script and then do the same procedure on the second node. A reboot of any node will now come up normally. Obviously if we ever shut all servers down, then one will need bringing up by hand to get quorum in place.

Just one other thing to note: I planned to use a separate partition for /var/lib/mysql - that didn't quite work out.

Arbitrating to prevent split brain

Once we're happy basic two node operation is working, we then add the garb arbitrator node. This is pretty simple: just install the galera rpm on another system and configure /etc/sysconfig/garb:
# A space-separated list of node addresses (address[:port]) in the cluster
GALERA_NODES="server1-osmgmt:4567 server2-osmgmt:4567"

# Galera cluster name, should be the same as on the rest of the nodes.
GALERA_GROUP="<CLUSTERNAME>"

I actually push this config out to all of my nova compute nodes, but I only have 1 running garb at any point in time. The only reason for needing garb is so that if one of my server nodes dies, we don't end up with a split brain condition; the database will continue running and we'll be able to re-add the server node again.

Load balancing and HA

We now have an active/active cluster in place, but this doesn't help us with load balancing or fail-over, as OpenStack can only be configured to 'talk' to one database system. To help with this we'll now add an additional IP address which uses haproxy and keepalived to keep the service up and to load balance traffic to the real database IP addresses. This is where binding the database instance to a specific IP helps, as we can now add another haproxy service listening on port 3306 which redirects to the backend databases.

We now need the following RPMs installing (again I get them onto the nodes from xcat):
# For HA


First up, for haproxy we will want to have a user which can connect to the database service, so start the mysql client on one of the servers and add the user:
CREATE USER 'haproxy'@'server%.climb.cluster';

Note that we use the CREATE USER syntax for this rather than directly inserting into the mysql.user table. CREATE USER is propagated across the cluster, whereas an insert directly into the mysql table isn't.

We now need to configure keepalived. This will be used for the HA IP address and will automatically control moving it between the servers in the event of a failure. My /etc/keepalived/keepalived.conf looks something like this:
! Configuration File for keepalived
! generated by xcat postscript

global_defs {
   notification_email {
     admin@domain
   }
   notification_email_from foo@bar
   smtp_connect_timeout 30
   router_id CLIMB_server1
}

vrrp_script OSMGMT_HA {
   script "killall -0 haproxy"
   interval 2
   weight 2
}

vrrp_instance VIP_OSMGMT {
    state BACKUP
    interface mlx0.3006
    virtual_router_id 10
    priority 121
    virtual_ipaddress {
        CHANGEIP dev mlx0.3006
    }
    track_script {
        OSMGMT_HA
    }
    authentication {
      auth_type PASS
      auth_pass somerandomstring
    }
}

(Actually, again, an xcat postscript populates it with the correct interfaces and addresses. Note that the system with the higher priority will prefer to be the master.)

On server2, the priority is set to a higher number and the router_id has a different name in there. Obviously change the auth_pass to something else, it should be the same on all your nodes providing the same service.

Now we can start keepalived and keep an eye on the messages file; once it's sorted itself out, we should see the extra IP address on server2:
% ip addr
12: mlx0.3006@mlx0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether f4:52:14:8b:50:72 brd ff:ff:ff:ff:ff:ff
    inet brd scope global mlx0.3006
    inet scope global mlx0.3006
    inet6 fe80::f452:1400:18b:5072/64 scope link 
       valid_lft forever preferred_lft forever
We can see that the HA address is up on the interface. Test that you can ping it from another system as well!

Before we setup haproxy we need to tweak a sysctl setting to allow it to bind to addresses which aren't present - as the HA IP address floats, it may not be present when haproxy starts, so edit /etc/sysctl.conf and add:
net.ipv4.ip_nonlocal_bind = 1

To activate it now, run sysctl -p.

Finally we need to configure haproxy, this is in /etc/haproxy/haproxy.cfg:
# Global settings
global
    log local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/
    maxconn     4000
    user        haproxy
    group       haproxy
    stats socket /var/lib/haproxy/stats

# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000

listen climbdbcluster
    balance source
    mode tcp
    option tcpka
    option mysql-check user haproxy
    server server1-osmgmt check weight 1
    server server2-osmgmt check weight 1

Note that I'm logging out to an external syslog server. The climbdbcluster listen address is the HA address, and we list the two real backend database servers, each having equal weight. Once haproxy is started it should be possible to connect to the database from other nodes using the HA IP address. If you see errors in the messages file about a server not being available, it's possible you forgot to give the haproxy user access to authenticate to the database; it doesn't need permission to do anything other than complete the authentication stage.

Sunday, 7 September 2014

Breaking GPFS by reinstalling a server...!

My GPFS servers are direct attached to the storage luns via sas cards.

As part of my DR testing, I reinstalled one of them and GPFS failed to mount. Nor would it remount on other nodes in the cluster. A bit of digging in the log files indicated that GPFS thought the disks were corrupt (though data was still visible on the GPFS file-system).

A bit more digging and I worked out what had happened - the xcat kickstart template has initlabel in there, and as Anaconda could see the LUNs, that's what it did: it wrote new disk labels over all visible disks, wiping the GPFS disk descriptors.

So the safest solution is to add addkcmdline=mpt2sas.blacklist=yes in the xcat config for the server nodes. This blacklists the sas card driver to prevent it from seeing (and wiping) the LUNs. Basically this gets passed on the kernel parameters when the system boots off the network into Anaconda for install.

You of course also need a postscript to clean up the /etc/modprobe.d/anaconda.conf file so the LUNs appear on first boot.
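The cleanup postscript boils down to something like this sketch (the function name is mine, and the real script does a bit more):

```shell
#!/bin/sh
# Remove the mpt2sas blacklist entry that the install-time kernel
# arguments leave in anaconda.conf, so the SAS LUNs appear on first boot.
clean_blacklist() {
    # $1: path to the modprobe config to clean
    sed -i '/blacklist mpt2sas/d' "$1"
}

# on a real node this would be:
# clean_blacklist /etc/modprobe.d/anaconda.conf
```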

I have a script to rebuild the GPFS file-system, and it didn't really have any data on it yet. Anyway, it shows the importance of testing the DR process before you need it...

Hat off to Laurence at OCF for the tip to blacklist.

Bitten by MySQL/MariaDB and lost+found

I should have remembered this, but MySQL (or MariaDB in my case) isn't entirely happy running on its own partition under Linux, mostly as it tries to load the lost+found directory as a database.

For my OpenStack build, I need a database, and MariaDB with Galera clustering can provide an HA option for this - we already have a pair of GPFS servers, so this seems a reasonable place to hold the relatively low load database to underpin it. More on that elsewhere though!

Anyway, I'd ideally like it to run from a separate logical volume under Linux, and whilst MariaDB will do this (with the odd whinge in a log file due to the lost+found), I found that xtrabackup (part of Percona, for hot db backups) isn't so happy.

I was using this with the Galera clustering to do node synchronisation, but it kept failing with a permission denied error.

After a bit of digging around, I found it was caused by xtrabackup trying to sync the lost+found directory, it also appears that it doesn't like the ignore-db-dir config option.

So I've reverted to having /var/lib as a partition. Best compromise I can find really...!

One other quirk I found with xtrabackup: it needs the datadir defining and reads it from the MySQL config. The only problem is that on EL6 etc. this is in an included file, and xtrabackup appears not to traverse the included file, so I've ended up defining it twice.

Anyway, a good half day wasted trying to get the sync working...! (Yes, I don't expect to have to reinstall any db nodes later, but I'd like them to be able to DR using xcat.)

Tuesday, 2 September 2014

Setting up a new GPFS file-system (for OpenStack)

Yes, this stuff is posted all over the place; this post is mostly for my records. It's also got some discussion of my choice of parameters for the use-case, i.e. to use the OpenStack cinder driver for GPFS.

One thing to note, make sure all your nodes have the same version of GPFS installed ... I've switched to testing with GPFS 4.1 for this project, but some of my nodes in the cluster were installed before the move to 4.1 so still had some 3.5 remnant packages on them.

Oh, and secondly, if you have GPFS Standard or Advanced licenses, make sure you also add the gpfs.ext package (I got an error about pools not being supported in GPFS Express Edition because, well, gpfs.ext didn't exist before and so wasn't in the xcat package list)!

Cluster config

All the systems in the cluster are equipped with Mellanox Connect-X 3 VPI cards, these are dual personality cards supporting FDR Infiniband/40GbE, we also have a 1GbE management network used for general tasks. For GPFS, we're planning to use Verbs running over the Infiniband fabric, falling back to 40GbE and finally 1GbE if needed. There's an SX6036 FDR switch, SX1036 40GbE switch and an IBM 1GbE switch for the management side of things.

This will be a cluster with two NSD server nodes; each of these is an IBM x3650m4 system which is direct SAS attached to the v3700 storage array. I've already blogged about the v3700 LUN config, so won't go over it here.

I'll assume at this point that the GPFS RPMs are already installed (gpfs.base, gpfs.gpl, gpfs.ext, gpfs.msg, and gpfs.gplbin appropriately built for the kernel in use; I also have gpfs.docs on my servers so I have the man pages available).

First up is to create a node file listing all the nodes that are in the cluster (note that the nodes must be booted and ssh'able to as root). The format of the file is something like:
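For illustration, a node file in the "NodeName:NodeDesignations" format looks something like this — the hostnames and role assignments below are made up, substitute your own (the two NSD servers would typically be quorum-manager nodes):

```
# NodeName:NodeDesignations
server1:quorum-manager
server2:quorum-manager
node001:client
node002:client
```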

Now from one of the servers, create a new GPFS cluster by running:
mmcrcluster -N climb.nodefile.gpfs --ccr-disable -p server1 -s server2 -A -C climbgpfs -r /usr/bin/ssh -R /usr/bin/scp -U climb.cluster

I also created a couple of node list files to pass into mmchlicense; it can't read the node-designation syntax of the file used to create the cluster, so I have two plain lists of hostnames, climb.serverlist.gpfs:

and climb.nodelist.gpfs:

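For reference, these license list files are just plain hostnames, one per line, matching the names used at cluster creation; the names below are illustrative:

```
# climb.serverlist.gpfs
server1
server2

# climb.nodelist.gpfs
node001
node002
```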
We now need to confirm that we have appropriately licensed all of our nodes:
mmchlicense server --accept -N climb.serverlist.gpfs 
mmchlicense client --accept -N climb.nodelist.gpfs

Now before we get on with the process of creating NSDs and filesystems, there's a bunch of cluster settings we want to configure. First we're going to restrict the port range used for some GPFS admin-type commands; this will be handy if we ever get around to firewalling, or if we need to expose the cluster over IP to a remote cluster.
mmchconfig tscCmdPortRange=30000-30100

We also want to configure verbs so we can use RDMA over the Infiniband network:
mmchconfig verbsPorts="mlx4_0/1"
mmchconfig verbsRdma=enable
mmchconfig verbsRdmaSend=yes

The systems all have IvyBridge-based CPUs, so NUMA domains are likely to be present. Set the flag to allow GPFS to interleave memory usage across domains, to prevent running out of memory in a single domain:
mmchconfig numaMemoryInterleave=yes
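To confirm the settings took, mmlsconfig can be queried for specific attributes, and on a node where GPFS is running mmdiag shows the live in-memory values (a quick sanity check I'd suggest after a batch of mmchconfig changes):

```shell
# Show the configured values for the attributes we just changed
mmlsconfig verbsRdma verbsPorts numaMemoryInterleave

# On a node with the daemon running, show the live configuration
mmdiag --config | grep -i verbs
```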

Note that our systems have a single dual port ConnectX-3 card, port 1 is connected to the IB network and port 2 connected to the 40GbE network.

Now, I mentioned earlier that we also have 1GbE for the management network. According to the GPFS docs, it's possible to set an admin node name to tell GPFS to use that network for admin traffic. However, as we created the cluster using the 'normal' host names on the 1GbE network, it's difficult for us to specify a different name unless we have another network dedicated to admin traffic, which isn't what we want.

The solution here is the GPFS subnets config option; this allows us to specify the subnet of the high-performance storage network (the 40GbE one), which GPFS will then use in preference for node communication. So we can work around not having a separate admin network name by using this:
mmchconfig privateSubnetOverride=yes
mmchconfig subnets=""

In our system, the subnet specified is the network assigned to the VLAN-tagged interface for storage on the 40GbE cards. In fact Sven (IBM GPFS Team) confirmed that GPFS will carry traffic in the preference order "RDMA, subnets, default", so our data traffic will ideally run over the Infiniband, then the 40GbE network, and finally over the 1GbE network as a last resort. Additionally, as the admin name is the main hostname, the admin traffic should run over the 1GbE network and be kept separate, though I concede it's not clear from the docs whether admin traffic will still prefer subnets.

Block sizes and metadata space

Initially I was sizing metadata requirements at 5-10% of usable storage, but I've since come across a couple of docs indicating that this isn't a great way of sizing. The first is by Scott (an IBM GPFSer), and the second is really a summary. In short, the worst case for metadata is 16KB per file/directory, so for 40 million files, double replicated, that's about 1.3TB of metadata — a lot less than the ~12TB estimated for 250TB usable space at 5% ish. OK, so we'll be using snapshots as well for glance image clones, but I don't expect the image blocks to actually vary massively once provisioned, and we have ~10TB of metadata space. I guess the worst case is we have to disable metadata replication at some point in the future if we need more metadata space!
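As a sanity check on that 1.3TB figure, the worst-case arithmetic works out as:

```shell
files=40000000        # expected number of files/directories
kb_per_file=16        # worst-case metadata per file, in KB
replicas=2            # metadata replication factor
total_kb=$((files * kb_per_file * replicas))
echo "${total_kb} KB"   # 1280000000 KB, i.e. ~1.28 TB
```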

The data going onto the filesystem is likely to be mostly big files, as it's either VM images for OpenStack, ephemeral nova disks, or genome data (200Mb-3Gb files), so I decided to go with quite a large block size (8Mb). This is of course a multiple of the RAID strip size (256kb), and 1/32nd of it (the sub-block size) is 256kb, so it should align nicely with the underlying RAID controller strip.
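The alignment claim is easy to verify, since a GPFS sub-block is 1/32 of the full block size:

```shell
block_kb=8192                      # 8MB GPFS data block size
sub_block_kb=$((block_kb / 32))    # GPFS sub-block = 1/32 of a block
strip_kb=256                       # RAID strip size on the v3700
echo "sub-block = ${sub_block_kb} KB"   # 256 KB, matching the strip size
```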

Which brings me on to inode size. I was originally going with the default, but Jim at IBM suggested I think about 4k inodes. GPFS is nice in that small files can actually be contained in the inode as part of the metadata, so this seems like a nice compromise for the 8Mb data block size: the few config files we might have can sit inside the inode in metadata, while the majority of large files will fit nicely in the 8Mb data blocks.

Bearing in mind the big data block size, we need to tweak a couple of config options: the first to increase the max block size, and the second to increase the page pool. The default page pool is 64M for 256k blocks; 8192/256 = 32, so we want a page pool sized 32 * 64 = 2048Mb:
mmchconfig maxblocksize=8192K
mmchconfig pagepool=2048M

I'm using the stanza format for NSD definitions in a file; we'll define both the NSDs and the pools in the file (some of the lines are ignored for NSD creation, but they are used when creating the file system). A sample of the stanza file is:
%nsd: device=dm-2
  nsd=climb_v3700_clds01_md_lun01
  servers=server1,server2
  usage=metadataOnly
  failureGroup=1
  pool=system

%nsd: device=dm-3
  nsd=climb_v3700_clds01_md_lun02
  servers=server1,server2
  usage=metadataOnly
  failureGroup=2
  pool=system

%nsd: device=dm-4
  nsd=climb_v3700_clds01_data_lun01
  servers=server1,server2
  usage=dataOnly
  failureGroup=3
  pool=nlsas

# ... further dataOnly stanzas for the remaining NL-SAS LUNs ...

%pool: pool=nlsas blockSize=8M usage=dataOnly
In this file you can see that the first two NSDs are for metadata only and are in different failure groups to allow replication of metadata. There's then a number of NSDs which are data only and are the NL-SAS LUNs from the v3700 array. Finally the pools are defined; this isn't used by mmcrnsd, but is used by mmcrfs later. One thing to note is the device name dm-X: this is the multipath device name. Look very carefully at these — each refers to the device name as seen on the first listed server (the names may vary across servers); GPFS writes onto the disk header so the other server can find it.

Now actually create the NSDs:
mmcrnsd -F climb.nsd.gpfs -v yes

Update (02-09-2014) - I've since changed the NSDs so that the server order isn't always server1,server2; it's now balanced across the two servers and across the canisters, such that each server has an equal number of primary NSDs and they are evenly distributed over the LUNs by their preferred canister owner.

As we have a two-node cluster, we need to have tiebreaker disks enabled, so we're just going to use the two NSDs we are planning to use for metadata:
mmchconfig tiebreakerDisks="climb_v3700_clds01_md_lun01;climb_v3700_clds01_md_lun02"

And actually create the file system!

Now we need to actually create the filesystem; to do this, we need GPFS running on both of the GPFS server systems. We're using the same NSD stanza file defined above; when creating the file system, it assigns the NSDs into pools and sets the pool and underlying storage block sizes.
mmcrfs climbgpfs -F climb.nsd.gpfs --filesetdf --perfileset-quota -Q yes -A yes -z yes -D nfs4 -i 4K -B 8M -m 2 -k all -n 26 -r 1 -T /climb --metadata-block-size 256K
mmchfs climbgpfs -z no

Just to clarify those options:
--filesetdf - df on a file-set will return the quota of the file-set, not of the whole filesystem
--perfileset-quota - enable quotas on file-sets
-Q yes - activate quotas on file-system mount
-A yes - automatically mount the file-system on GPFS startup
-z yes - enable DMAPI
-D nfs4 - deny-write locks for NFS4 clients, not sure if we will use NFSv4, but needed if we will
-i 4K - 4k inode size
-B 8M - GPFS block size
-m 2 - 2 metadata replicas by default
-k all - allow (NFSv4 and POSIX) ACLs
-n 26 - 26 nodes to be in the cluster (we don't expect it to go significantly higher)
-r 1 - 1 replica of data by default
-T /climb - mount point
--metadata-block-size 256K - block size for metadata blocks.

Note that if you want GPFS data block sizes that differ from the metadata block size, then you need different pools for data and metadata.

And now let's mount the file system... note that initially the file-system failed to mount; this was because I had "-z yes" enabled at file-system creation time, which enables DMAPI. My understanding from talking to people previously is that DMAPI needs to be enabled at creation time if you plan to use it; I'm not sure we will for this project, so I enabled it anyway. But because there is no HSM component installed, the file-system can't mount with DMAPI active, hence the mmchfs -z no command above.
mmmount all

At this point I was momentarily stumped by not being able to create new files on the file system. Well, actually I could create some files, but not big ones (nor vi swap files actually!). Of course this was a file placement problem. As my system pool only contains NSDs which are marked metadataOnly, there's no space available for actual data files; small files can still be created because they fit in the inode itself and hence go into the system pool. The solution is to create a GPFS file placement policy. Mine is simple, just a single rule in the file at present to put all files into the nlsas pool. Rules are evaluated in sequential order in the file, so if we had other pools, we could have placement rules for specific file-sets or file extensions.
/* files are placed on nlsas */
RULE 'default' SET POOL 'nlsas'

And activate the policy file:
mmchpolicy climbgpfs climb.policy.gpfs
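One thing worth knowing: mmchpolicy can parse and check a policy file without actually installing it, which is handy before activating rules on a live file-system:

```shell
# Validate the policy file without installing it
mmchpolicy climbgpfs climb.policy.gpfs -I test

# Show the currently installed policy rules
mmlspolicy climbgpfs -L
```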

Finally, we want a couple of file-sets to contain our data. We may want to apply quotas eventually, and as we enabled quotas at file-system creation time, we can do that later without shutting the cluster down or unmounting the file-system:
mmcrfileset climbgpfs openstack-swift -t "data for swift"
mmlinkfileset climbgpfs openstack-swift -J /climb/openstack-swift

mmcrfileset climbgpfs openstack-data -t "data for glance/cinder/nova"
mmlinkfileset climbgpfs openstack-data -J /climb/openstack-data

mmcrfileset climbgpfs climb-data -t "general CLIMB data"
mmlinkfileset climbgpfs climb-data -J /climb/climb-data

Now, I've mentioned that we will be running the GPFS driver for OpenStack on top of this file-system, so we have two file-sets: one which will be used for Swift data and a second for glance/cinder/nova ephemeral disks. The theory behind this is that the GPFS driver can use snapshot clones when provisioning glance images, so by placing glance and cinder on the same file-set, a snapshot provision of a glance image onto cinder block storage should happen almost instantly regardless of the size of the image. Placing the Nova ephemeral disks onto GPFS also allows live migration of VMs, as the ephemeral disks are on shared storage. Swift is on a separate file-set to allow ease of management, backup etc.

The file-set config is based on a suggestion from a contact inside IBM, and it makes sense, so in the absence of other guidelines, I'm happy to run with it.

And a little performance testing...

Just to test out the file-system and v3700, I built the gpfsperf tool and ran a couple of tests from an NSD client node in the GPFS cluster. I'm fairly sure we can run the storage array flat out over the Infiniband network; the following creates a ~1GB file with a random write pattern over GPFS:
./gpfsperf create rand /climb/climb-data/perf
  recSize 1045773310 nBytes 1045773310 fileSize 1045773310
  nProcesses 1 nThreadsPerProcess 1
  file cache flushed before test
  not using data shipping
  not using direct I/O
  offsets accessed will cycle through the same file segment
  not using shared memory buffer
  not releasing byte-range token after open
  no fsync at end of test
    Data rate was 887524.23 Kbytes/sec, thread utilization 0.999

So about 870MB/sec, which I think is about the maximum we can expect from a 6Gbit SAS controller ...
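For reference, converting gpfsperf's KB/s figure to MB/s is just:

```shell
# Integer conversion of the reported 887524.23 KB/s to MB/s
echo $((887524 / 1024))   # 866, i.e. roughly 870 MB/s
```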

I also ran the test having shut down openibd, so only the 1GbE link was available:
./gpfsperf create rand /climb/climb-data/perf
  recSize 1045773310 nBytes 1045773310 fileSize 1045773310
  nProcesses 1 nThreadsPerProcess 1
  file cache flushed before test
  not using data shipping
  not using direct I/O
  offsets accessed will cycle through the same file segment
  not using shared memory buffer
  not releasing byte-range token after open
  no fsync at end of test
    Data rate was 190084.41 Kbytes/sec, thread utilization 0.980

I'm not sure that's as fast as we could get over 1GbE, but it shows that the IB link was working nicely. I haven't yet managed to test with just the 40GbE link up; that would mean walking down and unplugging the FDR cable, as stopping openibd unloads the mlx4_core driver and so the 40GbE link also drops.

Suffice to say, I think we should get some nice performance out of the array. Whether or not I've picked the right magic numbers for block size etc. for the use-case remains to be seen, but the whole project is a bit of an experiment. We might have to rebuild the GPFS file-system later, but if we do, it's not the end of the world!

Finally, I've not made reference to file-system descriptor quorum. This is also important: GPFS will normally write three copies of the descriptor across NSDs in different failure groups, and if these are lost, the file-system becomes unusable. As we only have one storage array behind GPFS, I'm not too worried, since losing the storage array means we'll lose data anyway. If there were more storage arrays, then I might worry a bit more about this, as well as about different pools for different file-sets across the arrays.

I'll post more on actually using GPFS with OpenStack when I get a chance to configure and test it!

UPDATE (Jan 2015):

I'm thinking about reducing the block size to 2MB for the GPFS file-system used for OpenStack images. This is because the VM images are likely to be doing small Linux inode updates, and an 8MB block size means that if an update lands in a block that's not been used yet, GPFS will have to zero the full 8MB block, which could be a significant overhead for what should be a small write.


You may also be interested in my post on using and testing the GPFS Cinder driver!