Saturday, 24 October 2015

Meet the Devs comes to Scotland

(This is reproduced from my blog post on the Spectrum Scale user group website).

Well, this is the first time I've made it to a Meet the Devs meeting. And I chose probably the furthest location from me, Edinburgh. But it's the last of the Meet the Devs tour for 2015* and we like to have a User Group exec member present, so an early start and a short flight later I was in Edinburgh. Hopefully we'll be back in 2016 with some more meet the dev dates; we've managed 4 this year, which is about what we set out to do, once a quarter. If you are interested in meet the devs coming to your area, please get in touch with the group.

The event was well attended, and I'd like to thank Orlando for hosting us at University of Edinburgh.

As usual we went in with a general agenda, some topics to discuss, but we let the informal sessions go where they want. There was a mixture of re-attendees and people who hadn't been before; it's good to see people coming back, as it shows they must be getting value out of the sessions. This time we had Rick Welp and Joseph Taylor from IBM; thanks to both for coming along and leading the day.

I know Rick or Joseph were planning a blog post as well, so I'll leave them to cover what they discussed. We did however get into a few side discussions, and this is one of the great things about the user group and meet the devs: it's informal and allows us to go where the discussion wants to. And the pizza company letting us down didn't break the day, as we just carried on discussing in some smaller groups whilst we waited.

We got into some in depth discussion of protocol support, and as I'm running this in production at Birmingham, I was able to provide quite a bit of feedback to the others there on my experience with running it as well as how it works under the hood.

We also got into discussion on backup and ILM, and I think it's amazing how everyone does these things in their own slightly different way. I think this might be an interesting area for discussion over on the group mailing list. There are a lot of options and different ways to do things!

At the end of the day, I had a long list of questions people had raised and as usual we'll take these back with IBM and hopefully get some answers for the people who made it there.

Thanks again to Orlando, Rick and Joseph and hopefully meet the dev will return in 2016. (We'll even try and get the pizza company to deliver on time!)

Simon (Group Chair)

*OK, so we also have the user group meeting at Computing Insight UK in December, but you have to be attending the conference and it's not quite a meet the dev session.

Friday, 11 September 2015

Characters on the keyboard

A slight diversion for this post: we run a couple of internal training courses for our users, one of these being an introduction to remote command line Linux, which I delivered this week to a group of postgrad students and staff. They were non-traditional HPC users, but still, something surprised me.

We found that people don't seem to know what the names of those "funny characters" on the keyboard are. With a technical audience, I'm sure many readers are familiar with using & to background processes.
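For anyone who hasn't met it, backgrounding with & looks like this (a trivial shell sketch, not taken from the course material):

```shell
# Start a slow command in the background with &
sleep 1 &
bgpid=$!                        # $! is the PID of the last backgrounded job
echo "backgrounded as PID ${bgpid}"
wait "${bgpid}"                 # block until the background job finishes
echo "done"
```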

Now maybe it's because I work in the tech field and these are characters we use all the time, but I'm surprised that people don't know what they are called! And it was more than one person in the group; we had a couple of occasions of saying to people "use ampersand", "what's that?", "well, shift+7", "oh, you mean the and".

All a bit bizarre, so for the next time I run the course, I think I'm going to add a slide:

  • & - ampersand - "and"
  • ~ - tilde - "squiggly line"
  • ^ - caret - "like a little hat"
  • | - pipe - moves around on your keyboard

But it was good feedback to get "thanks, it was really useful, and I liked learning the names of those characters"...

Tuesday, 21 July 2015

SMB protocol support with Spectrum Scale (aka GPFS 4.1.1)

In a break from posting on OpenStack and GPFS, I've been working on one of my other GPFS related projects.

First, let's get the naming out of the way: I've been using GPFS for a few years, and it will forever stay GPFS, but with version 4.1.1, which was released in June 2015, the product was renamed to Spectrum Scale. Now we've got that out of the way, I can get on with posting about protocol support!

4.1.1 was the first release to come with official protocol support for SMB and Object (it also includes the new Ganesha NFS server). One of my projects has been to build a research data store, and naturally we looked at GPFS for this: it's scale-out storage after all, and has nice features like the policy engine, tiering and tape support, meaning we can automatically move files which have aged to cheaper storage tiers and down to tape, yet have them come back online automatically if a user requests them.

Before I go on to talk about IBM Spectrum Scale protocol support, a bit of history first!

Plan 1. Use SerNet samba precompiled packages.

Initially the client presentation layer was built on SerNet samba, as this was the only pre-compiled SMB package set to include the GPFS tools in the build (AFAIK RedHat Enterprise packages aren't built with the GPFS VFS module).

The plan was to use the pre-compiled binaries with CTDB to do IP address fail-over, and this all worked when we were building on CentOS 6.3.

However as we moved on in time, we looked at upgrading to CentOS 7.0 and this is where our CTDB woes started. 7.0 releases come with CTDB 2.x and the SerNet binaries had dependencies on 1.x. OK, SerNet also provide some CTDB packages (if you dig around on their site enough) based on 1.x, however these didn't support systemd and seemed rather unstable for us.

At this point we looked at two options: recompile the src rpms, or roll back to CentOS 6.x. The second, it quickly became clear, wasn't really an option, as moving to the latest 6.x release also brought in CTDB 2.x based packages, so essentially the same problem. Which brought us to...

Plan 2. Re-compile SerNet samba packages for CTDB 2.x

This was actually quite easy, I used to do a lot of rpm building in a previous role, so I know my way round a spec file and it was pretty simple to tweak the spec to allow CTDB 2 packages, strip out the bits that conflicted with the OS CTDB packages and go with that.
An hour or so later, I had some working packages that we deployed and tested, which all seemed to work fine. A little cautiously we decided to proceed on this basis, though uneasy about the prospect of regularly having to fix the spec file and rebuild, we felt we didn't have much choice.

Along comes Spectrum Scale 4.1.1!

In May I was talking at the GPFS User Group in York, and Scott Fadden from IBM was talking through the GPFS roadmap (see slides), mentioning the release date of the long-promised protocol support. This got me thinking, and I decided to push back the pilot phase of our data store for a few weeks to try out the upcoming 4.1.1 release including protocol support.

We were moving from 4.1.0, and the 4.1.1 upgrade wasn't the smoothest GPFS upgrade I've ever done: I managed to deadlock the file-system. I'm putting that down to me doing something silly with quorum or quorum nodes at the time, but I'd strongly suggest you test this process carefully before doing it on a live GPFS system as a non-destructive upgrade. As an aside, I've had 4.1.1 deadlock since, when we were doing some DR/HA testing of our solution, but in circumstances I wouldn't expect to see in normal operation, following a number of quite convoluted DR tests. (We've tested many failure modes, like split-braining the cluster, cutting the fibres between data centres, and pulling parts of the storage systems.)

But overall it was fine. As we were not yet piloting the system, I was OK to shut down all the nodes, and it restarted fine.

Getting SMB protocol support working

The next step was to actually install IBM SMB protocol support. The expectation currently is that this is done using the new installer. We use a separate config management tool, and being able to reinstall nodes in the system and get them working is essential to us, so I unpicked the chef recipes (as that is how the installer is implemented) to work out that really we just need to add the gpfs.smb package, which is provided in the protocol release of Spectrum Scale.

I posted a few messages to the GPFS User Group list about getting things working and got some guidance back from some IBMers (thanks!).

SMB support is provided as part of Cluster Export Services (CES); the cluster needs to be running on EL7, has to be running CCR, and needs the LATEST file-system features.

CCR worried me at first as in the previous release you couldn't use mmsdrrestore to add a node that had been reinstalled into a CCR based cluster, however Bob on the GPFS UG mailing list pointed out that this was fixed in 4.1.1 - thanks Bob!

The rest of the requirements were just a few GPFS commands, these were run on my NSD server cluster:
mmchconfig release=LATEST
mmcrfileset gpfs ces -t "ces shared root for GPFS protocols"
mmlinkfileset gpfs ces -J /gpfs/.ces-root
mmchfs gpfs -k nfs4

CES needs a space to store its config. The documentation suggests using a separate file-system, but it works fine with a file-set. We might revisit this at some point in the future and create a small file-system with local replicas on our protocol cluster.

Then a few more config commands on the protocol cluster to setup CES:
mmchconfig cesSharedRoot=/gpfs/.ces-root
mmchcluster --ccr-enable
mmchnode -N <NODECLASS> --ces-enable

CES will handle IP address allocation for the protocol cluster. We have 4 protocol servers and 4 floating IP addresses with a DNS round robin name pointing to the 4 servers to provide some level of client load balancing. Adding IP addresses is a simple process:
mmces address add --ces-ip
mmces address add --ces-ip

and mmces address list will show how the addresses are currently distributed.
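For illustration, the address setup might look like this (the IP addresses here are hypothetical placeholders; the original post elides the real ones):

```
mmces address add --ces-ip 10.0.1.101
mmces address add --ces-ip 10.0.1.102
mmces address add --ces-ip 10.0.1.103
mmces address add --ces-ip 10.0.1.104
mmces address list    # shows how the floating addresses are spread across the nodes
```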

Once CES is enabled, it is then necessary to enable the SMB service on the protocol nodes. Again, just a single GPFS command is needed: mmces service enable SMB

Once enabled, authentication needs configuring for the SMB services, and Spectrum Scale provides a number of options for this (pure AD with SFU or RFC2307 identity mapping, or LDAP + Kerberos), but neither of these fits our requirement. Like many research institutions, we use AD for authentication but local LDAP settings for identity, and this isn't available as one of the pre-defined authentication schemes; however, user-defined authentication is possible.

One thing to note here: if you are using the "pure" approach, starting and stopping GPFS will change the contents of nsswitch.conf and also krb5.conf. There's also currently an issue where nsswitch.conf gets edited on shutdown even in user-defined authentication (IBM have a ticket open on this, and as it's a ksh script that does it, I fixed this locally for now).

To use user defined authentication, krb5.conf and nsswitch.conf need to be configured appropriately for your AD and chosen identity source (I use nslcd, but sssd would also work).
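As a rough sketch (the domain names and KDC here are hypothetical examples, not from my setup), the relevant fragments might look like:

```
# /etc/nsswitch.conf -- identity from local LDAP via nslcd
passwd: files ldap
group:  files ldap

# /etc/krb5.conf -- authentication against AD
[libdefaults]
    default_realm = FULL.DOMAIN
[realms]
    FULL.DOMAIN = {
        kdc = ad-dc.full.domain
    }
```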

We now need to configure CES to use user defined for file:
mmuserauth service create --type userdefined --data-access-method file

At this point there is an SMB cluster, but it's not joined to the domain and needs a little tweaking to get it working. The mmsmb command is provided to manipulate the samba registry, but it restricts which properties you can set; however, the net command is also shipped, so it's possible to use that to change the samba registry directly. Some of the following might not be needed, as I played with various config options and mmuserauth settings before getting to a stable and working state.
net conf delparm global "idmap config * : backend"
net conf delparm global "idmap config * : range"
net conf delparm global "idmap config * : rangesize"
net conf delparm global "idmap config * : read only"
net conf delparm global "idmap:cache"

net conf setparm global "netbios name" my-netbios-server-name
net conf setparm global "realm" DOMAINSHORTNAME.FULL.DOMAIN
net conf setparm global "workgroup" DOMAINSHORTNAME
net conf setparm global "security" ADS

net ads join -U myadminaccount

One final thing to note is that I also required winbind running on the protocol servers for authentication to work. This is provided by the gpfs-winbind service, which is started along with gpfs-smb when CES starts up on a node; however, it's only started if you are using a pre-defined authentication type, and it's not possible to enable it from systemd directly as it requires CES to be running first and the file-system to be mounted. It would be nice to have a flag in the CES config to enable the service for user-defined mode, but there is a workaround, and that is to set gpfs-winbind as a dependency in systemd for gpfs-smb:
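A sketch of that dependency as a systemd drop-in (the unit names gpfs-smb and gpfs-winbind are as I found them on 4.1.1; treat the file path as an assumption for your release):

```
# /etc/systemd/system/gpfs-smb.service.d/winbind.conf
# Pull in and order gpfs-winbind before gpfs-smb, since CES
# won't start winbind itself in user-defined auth mode.
[Unit]
Requires=gpfs-winbind.service
After=gpfs-winbind.service
```

followed by a systemctl daemon-reload to pick up the drop-in.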

The only other command worth mentioning is creating a share, which is simply done with:
mmsmb export add shareName "/gpfs/path/to/directory"

Offline (HSM) files just work!

We have a mix of Windows, Linux and OS X clients to contend with, but providing SMB access is a reasonable compromise for all our users, as everyone has a centrally provided AD account. Historically our experience of other tiered file-systems is that they mostly work with Windows clients when the archive bit is set, but OS X Finder users tended to trigger file recall whenever they accessed a folder for preview. However, with SMB3 there is also an offline file flag, which seems to be respected on my Mac, and files weren't recalled until accessed (in Windows they even show up with a little X icon to show they are offline).

I'm very impressed that this has all been thought through and the GPFS VFS module for samba does all these things!

One area I still need to look at is VSS and previous versions - in theory GPFS snapshots should appear as previous versions of files, but I just haven't had time to verify this yet.

CES compared to Samba/CTDB

With CES, IP address fail-over is handled by CES rather than by the CTDB process as in normal clustered samba. There are various policies for CES to move IP addresses around, for example even-coverage (the default), or based on load on the protocol server. One slight downside is that CES node failure is handled in the same way as a GPFS expel, and so it can take a bit longer for an IP address to be moved over in the event the protocol server fails. In normal operation you can move the IP address off a node and disable CES if you plan to reboot the node, so it's only really in an HA failure handover where it's slower to complete.

One thing that did stump me for a while was accessing shares from my Mac client; it repeatedly failed to connect, but I eventually worked out that I was still using the legacy cifs:// paths, whereas, as CES only supports SMB2 and SMB3, you in fact need to use smb://.

Overall impressions

Overall, I'm very happy with the SMB protocol support; with a few niggles and tweaks to the documentation, I think it will be an excellent addition to the GPFS product. I've had a few issues with it, for example if the multi-cluster file-system goes inquorate then CES fails, which I'd expect; however, it doesn't restart when the file-system remounts, and I think a predefined callback would be a good way to resolve this.

And I'm pretty happy with the support I've had from IBM, I've had a couple of con-calls with various people in the UK and USA on my experience, and provided them with my feedback directly on a number of things I think need tweaking in the docs to get things moved on. So thanks IBM GPFS team for listening and taking an interest!

I'm very interested in picking up the Object support on our protocol nodes, now I just need to find some time in my schedule! I'm hoping that will be pretty easy to do, and I might even try out the installer to get the first node into the system to see what packages need adding.

Monday, 22 June 2015

STFC SCD Seminar

A few weeks ago I was invited to give a seminar at STFC's Daresbury Lab for the Scientific Computing Division, this was mostly on the CLIMB project with a focus on how we're using GPFS.

Wednesday, 17 June 2015

Tweaking v3700 memory

I was doing some work on one of our v3700 arrays today creating a bunch of new RAID sets and got back the message:

"The command cannot be initiated because there is insufficient free memory that is available to the I/O group"

Which confused me as at this point I wasn't assigning the raid sets to Pools or Volumes, just purely trying to create some new raid sets.

Digging around, the IBM docs don't give a lot of clues on how to fix this other than to "increase the amount of memory that is allocated to the I/O group".

Looking at the config on the array (and you'll need to delve in by ssh to do this), there are 5 pre-defined I/O groups:

id name            node_count vdisk_count host_count 
0  io_grp0         2          21          2          
1  io_grp1         0          0           0          
2  io_grp2         0          0           0          
3  io_grp3         0          0           0          
4  recovery_io_grp 0          0           0  

And by default we see that they are using 40MB memory for RAID services:
>lsiogrp -delim : 0

I had to increase the raid_total_memory to 80MB before I could create the new RAID sets (something smaller would probably have done, but I was in a hurry!). You do this with:
>chiogrp -feature raid -size 80 io_grp0

This got me thinking, this memory is carved out of the cache available on the system, and as I'm not using flash copy, remote copy and mirroring, or I/O groups 1/2/3, can I reclaim this memory? Well the answer appears to be yes:
>chiogrp -feature remote -size 0 io_grp0
>chiogrp -feature flash -size 0 io_grp0
>chiogrp -feature mirror -size 0 io_grp0

(and repeat for the other unused I/O groups)

Dealing with faults with storage ... migrating data without downtime!

One of my V3700 storage arrays has been having issues recently and now it looks like one of the canisters in the controller needs to be replaced. This process looks like it might be disruptive to the service running on top of it as the fault might be a software issue that requires us to reboot both canisters in the controller to resolve it.

This is the storage array running our GPFS for our OpenStack cloud.

But of course this is GPFS, and I have a spare storage array waiting to join the CLIMB storage here, so I'm planning to move all the data over to the new storage array before doing the maintenance on the controller.

Why do I have a spare controller, you ask? Well, it was bought to add to the file system, but we wanted to do some testing with block sizes on these controllers before doing that; actually, we'll probably end up rebuilding the GPFS file system at some point to reduce the block size to 1MB. For various time reasons I haven't done this, so I have a fully decked v3700 with no data on it.

Now when I originally set up the CLIMB file system here, I set metadata to be replicated across two RAID 10 LUNs on the controller.

On the new controller, I've instead set up a number of RAID 1 sets. Eventually this will happen on the original controller too, instead of the RAID 10s.

Now for the magic of software defined storage.... I've added the LUNs as new NSD disks in the same failure group as one of the RAID 10s holding metadata.

I then simply issue a "mmdeldisk climbgpfs diskname", and hey presto, GPFS replicates all the metadata from the LUN on the one v3700 to the new LUNs on the new v3700.

Once that is complete, I plan to use mmrpldisk to replace the disks on the faulty v3700, and GPFS will magically move all the data to the replacement v3700.
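As a sketch of the flow (disk, device and NSD names here are hypothetical; the stanza format is the one used by GPFS 4.1):

```
# new-meta.stanza -- describe a LUN on the replacement v3700,
# placed in the same failure group as the disk being drained
%nsd: device=/dev/mapper/new_v3700_lun1
  nsd=nsd_new_meta1
  usage=metadataOnly
  failureGroup=2

mmcrnsd -F new-meta.stanza              # create the NSDs
mmadddisk climbgpfs -F new-meta.stanza  # add them to the file system
mmdeldisk climbgpfs old_meta_nsd        # drain and re-replicate metadata off the old LUN
```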

All with no disruption to service. Nice!

Wednesday, 20 May 2015

Saturday, 9 May 2015

GPFS User Group

The Spring 2015 GPFS User Group will be taking place in York on 20th May, I hear there are only a few places left now, if you are a UK based GPFS user (or want to travel!), then this is a great event to find out what is happening with GPFS development, talk with the developers and other customers.

Oh, and incidentally I'm on the agenda for the May meeting. Customer experiences of using GPFS.

Shh! Its working!

Well, after a couple of very fragile weeks, it looks like our OpenStack environment is working and seems stable at the moment!

We've been running some test VMs on it whilst we work through the initial teething issues, including some users from Warwick building an environment on there. They were using our IceHouse config until it ate itself: RabbitMQ got upset, and that was that. As long as we didn't want to change anything the VMs carried on running, and we managed to pull all the images into the Juno install. I'm not sure there is a proper process for that, but what we did worked.

But this weekend CLIMB are running a Hackathon with users who aren't from our inner circle of alpha testers. I thought I'd have a quick check to see how they were getting on, and there appear to be VMs running for the weekend, and no emails in my mailbox (or WhatsApp messages either) complaining that stuff wasn't working!

I won't say it hasn't been a lot of work to get to this point, but I'm pretty happy to see VMs for others running on there!

Thursday, 7 May 2015

Directing the OpenStack scheduler ... aka helping it place our flavours!

We have two distinct types of hardware for our OpenStack environment:

  • large memory (3TB RAM, up to 240vCPUS)
  • standard memory (512GB RAM, 64vCPUS)
What we really want to do is reserve the large memory machines for special flavours (or flavors!) which only particular users are allowed to use. From an HPC background, this is something that would be trivial to do, and actually we can get OpenStack to do it using host aggregates and filters, but I think the docs are lacking and some of the examples don't necessarily work as listed, particularly if you have multiple filters enabled!

Just to note that, out of the box, the scheduler actually prefers the fat nodes, as there is a weighting algorithm which prefers hypervisors with higher memory-to-CPU ratios. That may be fine, but we don't want to "waste" the fat nodes: we'd prefer them to be available to users with big needs, without having to worry about migrating small VMs to make space.

First up, we're going to use host aggregates. These are arbitrary groups of hosts; they can be used with availability zones, but don't have to be. When you create a host aggregate without setting an availability zone, it's not something that is visible to an end user.

This ticks our first requirement: I don't want a user to have to remember which availability zone to select in the drop-down in Horizon based on the flavour they are using.

So to implement what we want to do, we are also going to use the "AggregateInstanceExtraSpecsFilter" filter. First we need to enable this on our controller nodes; we need to do this on any node which is running the openstack-nova-scheduler service. Edit /etc/nova/nova.conf and change:

scheduler_default_filters=RetryFilter,AggregateInstanceExtraSpecsFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter

Your current list of filters may be different from this, but I added it after the RetryFilter, which was the first in the default list. Restart the service on each node to ensure the new filter is loaded.

Note also that we have the ComputeCapabilitiesFilter enabled, and I think this is why the examples of usage online don't work out of the box. In a minute I'll be adding a metadata match requirement to the flavour; we need to use a namespace on that match, otherwise the nodes pass the AggregateInstanceExtraSpecsFilter but then fail as they don't match the ComputeCapabilitiesFilter, and you end up failing to schedule with "no host found" errors.

I'm going to show what to do using the command line tools, you can also do this from Horizon using the metadata fields and Host Aggregates view.

Create a host aggregate

(keystone_admin)]# nova aggregate-create stdnodes
| Id | Name     | Availability Zone | Hosts | Metadata |
| 11 | stdnodes | -                 |       |          |
Add a node (a hypervisor can belong to multiple aggregates)
(keystone_admin)]# nova aggregate-add-host stdnodes cl0903u01.climb.cluster
Host cl0903u01.climb.cluster has been successfully added for aggregate 11 
| Id | Name     | Availability Zone | Hosts                     | Metadata |
| 11 | stdnodes | -                 | 'cl0903u01.climb.cluster' |          |
Now add some metadata to the aggregate:
(keystone_admin)]# nova aggregate-set-metadata 11 stdmem=true
Metadata has been successfully updated for aggregate 11.
| Id | Name     | Availability Zone | Hosts                     | Metadata      |
| 11 | stdnodes | -                 | 'cl0903u01.climb.cluster' | 'stdmem=true' |
Now I'm going to specify that my existing flavour needs to run on hypervisors with the aggregate property stdmem=true; first check the flavour:
(keystone_admin)]# nova flavor-show m1.tiny
| Property                   | Value   |
| OS-FLV-DISABLED:disabled   | False   |
| OS-FLV-EXT-DATA:ephemeral  | 0       |
| disk                       | 1       |
| extra_specs                | {}      |
| id                         | 1       |
| name                       | m1.tiny |
| os-flavor-access:is_public | True    |
| ram                        | 512     |
| rxtx_factor                | 1.0     |
| swap                       |         |
| vcpus                      | 1       |
Now we want to add that m1.tiny should have stdmem=true applied to it:
(keystone_admin)]# nova flavor-key m1.tiny set aggregate_instance_extra_specs:stdmem=true

(keystone_admin)]# nova flavor-show m1.tiny
| Property                   | Value                                             |
| OS-FLV-DISABLED:disabled   | False                                             |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                 |
| disk                       | 1                                                 |
| extra_specs                | {"aggregate_instance_extra_specs:stdmem": "true"} |
| id                         | 1                                                 |
| name                       | m1.tiny                                           |
| os-flavor-access:is_public | True                                              |
| ram                        | 512                                               |
| rxtx_factor                | 1.0                                               |
| swap                       |                                                   |
| vcpus                      | 1                                                 |
Note that the examples in the OpenStack docs don't include the namespace "aggregate_instance_extra_specs:" in front of the key name; as I mentioned above, when there are multiple filters this may cause scheduling to fail, as although a node passes the aggregate filter, it fails the compute capabilities filter.

So to summarise: to ensure small VMs don't land on the fat nodes, for each "small" flavour we specify that the stdmem=true property is required; this causes the filter to exclude the fat nodes when considering hosts for scheduling.

Bumpy ride! ... troubleshooting Neutron

The last few weeks have been a bit of a bumpy ride with OpenStack! We've had it working, then suddenly stop working, with failures on some hypervisors. It's, er, been interesting!

We've been replacing our IceHouse install here with Juno. This was actually a full rip and replace, as we re-architected the controller solution slightly, so we now have three control/network nodes. We ran both installs in parallel for a while with different interfaces.

So, a couple of the issues we've faced are below, which may serve to help someone else!


We had the Juno config working, with VXLAN instead of VLAN this time, but with only 1 hypervisor attached to it. When we went to provision more hypervisors into the config, we were getting the evil "vif_type=binding_failed" out of Nova when instances were spawned on the new hypervisor.

Now if you Google for this, you'll get a lot of people complaining and a lot of "works for me" type answers. So just to be clear: this error message comes out any time you get any sort of error from the Neutron networking layer. It may just be a bad config for Neutron, so first off, go and check that your Neutron config looks sane!

Ours did, but still there are some issues that you can get depending on how you configured it, so we've had varying success with the following resolutions:

1. Go nuclear on Neutron

I've seen the need to do this when the Neutron database sync and services were started before Neutron was configured. So if you think you might have done this, this might help. Note this will WIPE any networks you may have configured, and will confuse the heck out of OpenStack if you have running VMs when you do it!

  • stop all neutron services on controller/network nodes, as well as neutron agents running on hypervisor nodes
  • drop the neutron database from mysql
  • re-create the neutron database and assign permissions for the neutron db user
  • re-sync the Neutron database by running

  neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade juno

  • At this point, restart the Neutron services. It may work, but it may also still be confused, so you may want to try this step again, combined with the steps below to remove all the OVS bridges
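The steps above, sketched as commands (the database name, user and password are assumptions; adjust to match your deployment):

```
# Stop Neutron services first, on all controller/network nodes
# (and the openvswitch agent on hypervisors)
systemctl stop neutron-server.service neutron-openvswitch-agent.service

# Drop and re-create the database (hypothetical credentials)
mysql -u root -p <<'SQL'
DROP DATABASE neutron;
CREATE DATABASE neutron;
GRANT ALL PRIVILEGES ON neutron.* TO 'neutron'@'%' IDENTIFIED BY 'NEUTRON_DBPASS';
SQL

# Re-sync the schema to the Juno level
neutron-db-manage --config-file /etc/neutron/neutron.conf \
    --config-file /etc/neutron/plugin.ini upgrade juno
```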

2. Remove all the OVS bridge interfaces and try again

I've needed to do this on both hypervisors and network nodes to actually get things working. Whilst it may appear that the VXLAN tunnels have come up, they may not actually be working correctly. To fix this:
  • stop neutron services (or agent if on a hypervisor)
  • delete the OVS bridges
  • re-create the OVS bridges
  • reboot the node
Note that on hypervisors I found that if we didn't reboot, it still didn't always work correctly. So do try that, although it is extreme!

Typically this process would look something like (on a network node):
  systemctl stop neutron-dhcp-agent.service neutron-l3-agent.service neutron-metadata-agent.service neutron-openvswitch-agent.service neutron-server.service
  ovs-vsctl del-br br-int
  ovs-vsctl del-br br-tun
  ovs-vsctl add-br br-int
  ovs-vsctl add-br br-tun
  systemctl restart neutron-dhcp-agent.service neutron-l3-agent.service neutron-metadata-agent.service neutron-openvswitch-agent.service neutron-server.service

3. /etc/hostname

So this one surprised me and took a while to dig out. We're running CentOS 7 on our hypervisors, and one of our install scripts puts the short hostname into /etc/hostname. Somehow this breaks things and caused the vif_type=binding_failed error. Seriously, I don't understand this, as resolv.conf contains the DNS search suffix for the private network.

But still, this caused us issues. Check that you have the fully qualified name in the hostname file and try again!

This is why it was working fine with one hypervisor, but adding additional ones wouldn't work. Interestingly, if the scheduler tried to schedule on a "broken" hypervisor, it failed (due to the hostname), and if it then subsequently tried a known-good hypervisor, it would also fail on that hypervisor with the same error. I don't get it either!

Slow external network access

We have 10GbE bonded links for our hypervisors, but were getting really slow off-site network access. I eventually tracked this down to MTU issues. The default MTU for CentOS et al is 1500, and our VMs also had an MTU of 1500. Except of course we are using VXLAN, so there's an extra 50 bytes or so tacked on for the VXLAN encapsulation. As far as I can see, what was happening is that the VMs were emitting 1500-byte frames, the hypervisor was encapsulating the VXLAN traffic adding 50 bytes, and so the emitted network frame would be 1550 bytes, higher than the interface MTU.
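The arithmetic behind that is simple enough to sketch (the 50-byte figure assumes VXLAN over IPv4 with untagged inner Ethernet frames):

```shell
GUEST_MTU=1500
VXLAN_OVERHEAD=$((20 + 8 + 8 + 14))   # outer IPv4 + UDP + VXLAN header + inner Ethernet
NEEDED=$((GUEST_MTU + VXLAN_OVERHEAD))
echo "encapsulated frame needs ${NEEDED} bytes"   # 1550, which exceeds a 1500-byte physical MTU
```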

Of course with 10GbE interfaces we probably want larger MTUs anyway, so I went ahead and changed the MTU to 9000 and restarted networking.
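On CentOS 7 the persistent way to do this is a one-line change in the interface config file; a minimal sketch, assuming a bonded interface named bond0 (your interface names will differ):

```
# /etc/sysconfig/network-scripts/ifcfg-bond0 (example interface name)
MTU=9000
```

followed by a network restart (systemctl restart network).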

This caused chaos and broke things, and it took a while to track down! Chaos meaning things like the MariaDB cluster no longer working: bringing a node back into the cluster would start to work and then fail during the state transfer. After a lot of head scratching and pondering, we eventually worked out that the MTU on the switch was set to something like 2500, which seems like a strange default for a 10GbE switch to me! Anyway, increasing it to over 9000 (with the caveat that different manufacturers count the MTU differently!) made the problem go away.

Can we get full line rate out of the VXLAN interfaces? Well, actually I don't know, but I did some very unscientific speed testing from both a network node and from inside a VM (before changing the MTU, the VM managed something like 6.7Kb/sec):

[centos@vm-test ~]$ speedtest-cli
Retrieving configuration...
Retrieving server list...
Testing from University of Birmingham (
Selecting best server based on latency...
Hosted by Warwicknet Ltd. (Coventry) [28.29 km]: 15.789 ms
Testing download speed........................................
Download: 267.39 Mbit/s
Testing upload speed..................................................
Upload: 154.21 Mbit/s

[centos@controller-1]# ./speedtest-cli
Retrieving configuration...
Retrieving server list...
Testing from University of Birmingham (
Selecting best server based on latency...
Hosted by RapidSwitch (Leicester) [55.83 km]: 11.392 ms
Testing download speed........................................
Download: 251.66 Mbit/s
Testing upload speed..................................................
Upload: 156.93 Mbit/s

So I don't know if that is the peak we can get from inside a VM, as the controller also seems to peak around the same value. It's something I'll probably come back to another day.

No external DNS resolution from inside VMs

By default, the DHCP agent used on the network nodes will enable dnsmasq to act as a DNS server. This means that tenant-network names will resolve locally on the VMs, however we found that our VMs couldn't resolve external names.
This took a little bit of digging to resolve, but at least I have a plausible-sounding cause for this.
The network nodes are attached to a private network and they use a DNS server on that private network for names rather than a publicly routable DNS server.
The dnsmasq instance is running inside a network namespace and that network namespace has its own routing rules, i.e. VMs can't route to our private management network. This also means that the dnsmasq instance can't use the DNS servers in /etc/resolv.conf as they have no route to actually do the name resolution.
Luckily, the dhcp agent service can be told to use a different set of DNS servers, so on our network nodes, we now have in /etc/neutron/dhcp_agent.ini:

  dnsmasq_dns_servers = 8.8.8.8,8.8.4.4

Then restart the neutron-dhcp-agent service. These IPs are of course those of the Google public DNS, but you could use others.

Wednesday, 1 April 2015

IBM Spectrum Scale ...

Just a few comments on Spectrum Scale, or Elastic Storage as it was called when I recorded this!

Friday, 13 March 2015

Making OpenStack services wait for GPFS with systemd

A little while ago I posted about how OpenStack Cinder and Glance services fail to start properly when using GPFS for the underlying storage layer.

I looked and asked around if anyone had any suggestions on resolving this, as GPFS uses a SysV init script. Someone on the GPFS user group suggested that it should be possible to make a systemd service wait on a SysV-init-style script.

I've eventually found time to look into this a bit more and found a solution.

Bear in mind that I'm running this system on CentOS 7, so it should work on RHEL, but it might be a bit different for other distributions. Basically, we're going to add additional requirements to the systemd units to make this work. What we don't want to do is edit the default unit files, as these may get overwritten on upgrade. Luckily, we can add "local" changes as follows:

cd /etc/systemd/system/
mkdir openstack-glance-api.service.d
cd openstack-glance-api.service.d

Now create a file named gpfs.conf with the following contents (gpfs.service being the unit systemd generates from the GPFS SysV init script):

  [Unit]
  Requires=gpfs.service
  After=gpfs.service

Basically what we are saying here is that in addition to the default settings for the Glance API, we also require GPFS to be started. Note that "Requires" on its own doesn't guarantee that GPFS is started first, which is why we also have the "After" setting. We're also using Requires rather than Wants: this means that if GPFS is stopped, the Glance API service will also be stopped.

This example obviously only covers the Glance API; you probably also want to do it for:
  • openstack-cinder-api.service.d
  • openstack-cinder-volume.service.d
  • openstack-glance-api.service.d
  • openstack-glance-registry.service.d
  • openstack-glance-scrubber.service.d
  • openstack-swift-object.service.d
  • (possibly some more Swift services as well)
Once you've created these, you need to also call:
systemctl daemon-reload
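The whole set of drop-ins can be scripted; here's a sketch. The target directory is parameterised so it can be dry-run in a scratch directory, and the gpfs.service unit name is an assumption based on the SysV script name on our systems:

```shell
#!/bin/sh
# Create a gpfs.conf drop-in for each OpenStack service that needs GPFS.
# The root directory is a parameter so the script can be tested safely;
# on a real system it would be /etc/systemd/system.
make_gpfs_dropins() {
    root="$1"
    for svc in openstack-cinder-api openstack-cinder-volume \
               openstack-glance-api openstack-glance-registry \
               openstack-glance-scrubber openstack-swift-object; do
        mkdir -p "$root/$svc.service.d"
        cat > "$root/$svc.service.d/gpfs.conf" <<'EOF'
[Unit]
Requires=gpfs.service
After=gpfs.service
EOF
    done
}

# Real usage (assumes the SysV script shows up as gpfs.service):
#   make_gpfs_dropins /etc/systemd/system
#   systemctl daemon-reload
```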

Incidentally, we also use a custom GPFS init script; specifically, in the start section we use:
      for i in `seq 1 24`; do   # Wait up to 2 mins for the IB to be ready (24x5s=120s)
          grep -q gpfs /proc/mounts && break
          sleep 5
      done


And status is rewritten as:
      if lsmod | grep -q mmfs; then
          # module loaded - report whether the filesystem is actually mounted
          grep -q gpfs /proc/mounts
          exit $?
      fi
      exit 1


Red Hat OpenStack Technical Workshop slides

Yesterday I spoke at the Red Hat OpenStack Technical Workshop held in London. A couple of people asked about my slides; they are available below:

Tuesday, 10 March 2015

What difference does a name make?

As you may have recently heard, IBM have announced that GPFS has a new name. It's now IBM Spectrum Scale, part of the Spectrum Storage family.

It's been on the cards for a while that it was getting a new name - in fact it became Elastic Storage for a while, but I believe that was the code name for the rebranding project.

In any case, I'm sure most of us who've used it for a while will continue to call it GPFS!

It will be interesting to see if the Power ESS (Elastic Storage Server) will get renamed as well at some point!

Tuesday, 27 January 2015

Using the GPFS Cinder driver with OpenStack

I've blogged a couple of times about using GPFS with OpenStack; in this post I'm going to focus on setting up the GPFS Cinder driver. This was tested using Juno with RDO.

First I'd like to send out thanks to Dean Hildebrand (IBM Cloud Storage Software team), who I met at SC14 and who put me in touch with Bill Owen of the GPFS and OpenStack development team; Bill helped me work out what was going on and how to check it was working correctly.

I'll assume you have both Glance and Cinder installed. These should be putting their image stores onto a GPFS file-system, using the same fileset for both; for example I have a fileset "openstack-bham-data" containing cinder and glance directories, and the fileset is mounted at /climb/openstack-bham-data.

The basic magic is that the Cinder driver uses mmclone to create copy-on-write copies of the Glance images, which can be done almost instantly and is very space efficient. It will also only work on raw images from the Glance store.

# ls -l /climb/openstack-bham-data/
total 0
drwxr-xr-x 2 cinder cinder 4096 Jan 27 20:14 cinder
drwxr-xr-x 2 glance glance 4096 Jan 27 19:08 glance

In the /etc/cinder/cinder.conf file, we need a few config parameters setting:
gpfs_mount_point_base = /climb/openstack-bham-data/cinder
volume_driver = cinder.volume.drivers.ibm.gpfs.GPFSDriver
gpfs_sparse_volumes = True
gpfs_images_dir = /climb/openstack-bham-data/glance
gpfs_images_share_mode = copy_on_write
gpfs_max_clone_depth = 8
gpfs_storage_pool = nlsas

(Docs on the parameters are online).
Of course my Glance instance is also configured (in /etc/glance/glance-api.conf) to use:
filesystem_store_datadir = /climb/openstack-bham-data/glance

A couple of things to note here: nlsas is one of my storage pools, and you can use this parameter to determine which pool cinder volumes are placed in. We're also using copy_on_write, which means we only copy blocks as they change, giving better storage utilisation.

Now that we have cinder configured, restart the services:
# systemctl restart openstack-cinder-api.service
# systemctl restart openstack-cinder-scheduler.service
# systemctl restart openstack-cinder-volume.service

(At this point I should note that this is running on CentOS 7, so it's systemd based, while the GPFS init script is a traditional SysV init script - it would be nice for it to be systemdified so that we could make swift, glance and cinder depend on GPFS being active.)

We'll now do a basic cinder test to ensure cinder is working:
# cinder create --display-name demo-volume1 1
|       Property      |                Value                 |
|     attachments     |                  []                  |
|  availability_zone  |                 nova                 |
|       bootable      |                false                 |
|      created_at     |      2015-01-27T20:32:25.244948      |
| display_description |                 None                 |
|     display_name    |             demo-volume1             |
|      encrypted      |                False                 |
|          id         | f7f5c7a1-bf56-41a1-b9f9-a7c74cac748d |
|       metadata      |                  {}                  |
|         size        |                  1                   |
|     snapshot_id     |                 None                 |
|     source_volid    |                 None                 |
|        status       |               creating               |
|     volume_type     |                 None                 |
# cinder list
|                  ID                  |   Status  | Display Name | Size | Volume Type | Bootable | Attached to |
| f7f5c7a1-bf56-41a1-b9f9-a7c74cac748d | available | demo-volume1 |  1   |     None    |  false   |             |
# ls -l /climb/openstack-bham-data/cinder
-rw-rw---- 1 root root 1073741824 Jan 27 20:32 volume-f7f5c7a1-bf56-41a1-b9f9-a7c74cac748d
# cinder delete f7f5c7a1-bf56-41a1-b9f9-a7c74cac748d

OK, so basic cinder is working; now let's try out the GPFS driver. Remember, it will only work with raw images.

We need to define the GPFS driver type in cinder:
# cinder type-create gpfs
# cinder type-list
|                  ID                  | Name |
| a7db8364-9051-4e40-99a5-c43842443ef7 | gpfs |

If we don't have a raw image in glance, lets add one:
# glance image-create --name 'CentOS 7 x86_64' --disk-format raw --container-format bare --is-public true --copy-from

# glance image-list
| ID                                   | Name                | Disk Format | Container Format | Size       | Status |
| e3b37c2d-5ee1-4bac-a204-051edbc34c31 | CentOS 7 x86_64     | qcow2       | bare             | 8587706368 | active |
| bf756074-ab13-45bb-b899-c83586df4ea8 | CentOS 7 x86_64 raw | raw         | bare             | 8589934592 | active |

Note that I have two images here, one qcow2, the other raw. It's important that the image actually is raw - I found that the CentOS image I'd downloaded as "raw" was actually qcow2 and things didn't work properly for me, so I had to convert the image file before the mmclone would work.
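Converting is a one-liner with qemu-img (the filenames below are hypothetical). As a cheap pre-check, qcow2 files start with the 3-byte magic "QFI" followed by 0xfb, so a rough format probe can be sketched like this:

```shell
#!/bin/sh
# Cheap probe for the qcow2 magic at the start of an image file;
# anything else we treat as "possibly raw" (qemu-img info is authoritative).
looks_like_qcow2() {
    [ "$(head -c 3 "$1")" = "QFI" ]
}

# Hypothetical filenames - convert a qcow2 image to raw for the GPFS driver:
#   qemu-img convert -f qcow2 -O raw CentOS-7-x86_64.qcow2 CentOS-7-x86_64.raw
```

Then upload the converted .raw file to Glance with --disk-format raw.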

So to reiterate, the image must actually be raw - the glance --disk-format option will accept raw even if the image isn't, and it doesn't check. To be sure it is raw, let's check:
# qemu-img info /climb/openstack-bham-data/glance/bf756074-ab13-45bb-b899-c83586df4ea8 
image: /climb/openstack-bham-data/glance/bf756074-ab13-45bb-b899-c83586df4ea8
file format: raw
virtual size: 8.0G (8589934592 bytes)
disk size: 8.0G

OK, we're happy it is raw, so let's now create a cinder volume from it:
# cinder create --volume-type gpfs --image-id bf756074-ab13-45bb-b899-c83586df4ea8 8
|       Property      |                Value                 |
|     attachments     |                  []                  |
|  availability_zone  |                 nova                 |
|       bootable      |                false                 |
|      created_at     |      2015-01-27T20:43:09.213366      |
| display_description |                 None                 |
|     display_name    |                 None                 |
|      encrypted      |                False                 |
|          id         | 91d8028a-c2cb-4e2c-a336-aa0636488b88 |
|       image_id      | bf756074-ab13-45bb-b899-c83586df4ea8 |
|       metadata      |                  {}                  |
|         size        |                  8                   |
|     snapshot_id     |                 None                 |
|     source_volid    |                 None                 |
|        status       |               creating               |
|     volume_type     |                 gpfs                 |
The volume should be ready within a second or so - even with an 8GB image (if it takes a minute or two, you almost certainly have a problem with the mmclone). Let's take a look at the volumes we have:
# cinder list
|                  ID                  |   Status  | Display Name | Size | Volume Type | Bootable | Attached to |
| 91d8028a-c2cb-4e2c-a336-aa0636488b88 | available |     None     |  8   |     gpfs    |   true   |             |
And let's also check that it is a cloned image:
# mmclone show /climb/openstack-bham-data/cinder/volume-91d8028a-c2cb-4e2c-a336-aa0636488b88 
Parent  Depth   Parent inode   File name
------  -----  --------------  ---------
    no      1          449281  /climb/openstack-bham-data/cinder/volume-91d8028a-c2cb-4e2c-a336-aa0636488b88

If it isn't a clone, you'll get something like:
# mmclone show /climb/openstack-bham-data/cinder/*
Parent  Depth   Parent inode   File name
------  -----  --------------  ---------

Note there is no parent inode listed. We can also check that the glance image is now a parent mmclone file:
# mmclone show /climb/openstack-bham-data/glance/bf756074-ab13-45bb-b899-c83586df4ea8
Parent  Depth   Parent inode   File name
------  -----  --------------  ---------
   yes      0                  /climb/openstack-bham-data/glance/bf756074-ab13-45bb-b899-c83586df4ea8

Just to compare timings, using the mmclone method on my 8GB image took less than a second to be ready (as quick as I could type cinder list), whereas the traditional copy method took a couple of minutes. I guess this will vary based on how busy the GPFS file system is, but mmclone is always going to be quicker than copying the whole image over.

To cover a couple of troubleshooting tips, if you get errors in the cinder volume.log like:
VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: Could not find GPFS file system device: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128).

This means that you probably forgot to do the "cinder type-create gpfs" step to create the volume type.

Second, if you find that it isn't cloning the image, check that the source is actually of type raw using "qemu-img info" to verify the type actually is raw.

Those are the only two problems I ran into, but enabling debug and verbose in the cinder.conf file should help with diagnosing problems.

Once again, thanks to Dean and Bill at IBM GPFS team for helping me get this working properly.

Update - Nilesh at IBM pointed out that we might also want to set default_volume_type = gpfs so that new volumes default to GPFS; this is important if more than one volume type is defined.