Wednesday, 20 May 2015

Saturday, 9 May 2015

GPFS User Group

The Spring 2015 GPFS User Group will be taking place in York on 20th May. I hear there are only a few places left now. If you are a UK-based GPFS user (or want to travel!), this is a great event to find out what is happening with GPFS development and to talk with the developers and other customers.

Oh, and incidentally I'm on the agenda for the May meeting: customer experiences of using GPFS.

Shh! It's working!

Well, after a couple of very fragile weeks, it looks like our OpenStack environment is working and seems stable at the moment!

We've been running some test VMs on it whilst we work through the initial teething issues, including some users from Warwick building an environment on there. They were using our IceHouse config until it ate itself: RabbitMQ got upset and that was that. As long as we didn't want to change anything, the VMs carried on running, and we managed to pull all the images into the Juno install. I'm not sure there is a proper process for that, but what we did worked.

But this weekend CLIMB are running a Hackathon with users who aren't from our inner circle of alpha testers. I thought I'd have a quick check to see how they were getting on, and there appear to be VMs running for the weekend, and no emails in my mailbox (or WhatsApp messages either) complaining that stuff wasn't working!

I won't say it hasn't been a lot of work to get to this point, but I'm pretty happy to see VMs for others running on there!

Thursday, 7 May 2015

Directing the OpenStack scheduler ... aka helping it place our flavours!

We have two distinct types of hardware for our OpenStack environment:

  • large memory (3TB RAM, up to 240 vCPUs)
  • standard memory (512GB RAM, 64 vCPUs)

What we really want to do is reserve the large memory machines for special flavours (or flavor!) which only special users are allowed to use. Coming from an HPC background, this is something that would be trivial to do, and actually we can get OpenStack to do it using host aggregates and filters, but I think the docs are lacking and some of the examples don't necessarily work as listed, particularly if you have multiple filters enabled!

Note that out of the box the scheduler actually prefers the fat nodes, as there is a weighting algorithm that favours hypervisors with higher memory to CPU ratios. That may be fine, but we don't want to "waste" the fat nodes: we'd prefer them to be available to users with big needs, and not have to worry about migrating small VMs to make space.

First up, we're going to use host aggregates. These are arbitrary groups of hosts. They can be used with availability zones, but don't have to be: if you create a host aggregate without setting an availability zone, it's not something that is visible to an end user.

This ticks our first requirement - I don't want a user to have to remember which availability zone to select in the drop down in Horizon based on the flavour they are using.

So to implement what we want, we are also going to use the "AggregateInstanceExtraSpecsFilter" filter. First we need to enable this on our controller nodes, that is, on any node which is running the openstack-nova-scheduler service. Edit /etc/nova/nova.conf and change:


Your current list of filters may be different from this, but I added it after the RetryFilter, which was the first in the default list. Restart the service on each node to ensure that the new filter is loaded.
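
As a rough illustration of that change, the sketch below inserts the new filter straight after RetryFilter in a comma-separated filter list. The option name scheduler_default_filters and the sample list are assumptions based on a stock Juno nova.conf, and add_aggregate_filter is a hypothetical helper name, so compare against your own file before copying anything.

```shell
# Hypothetical helper: add AggregateInstanceExtraSpecsFilter directly after
# RetryFilter in a comma-separated filter list. A list that already contains
# the new filter, or has no RetryFilter, is printed unchanged.
add_aggregate_filter() {
    case ",$1," in
        *,AggregateInstanceExtraSpecsFilter,*)
            printf '%s\n' "$1" ;;
        *)
            printf '%s\n' "$1" |
                sed -E 's/(^|,)RetryFilter(,|$)/\1RetryFilter,AggregateInstanceExtraSpecsFilter\2/' ;;
    esac
}

# Assumed example list; yours may well differ.
current="RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter"
echo "scheduler_default_filters = $(add_aggregate_filter "$current")"
```

The printed line is what would go into /etc/nova/nova.conf on each scheduler node, under the assumptions above.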

Note also that we have the ComputeCapabilitiesFilter enabled, and I think this is why the examples of usage online don't work out of the box. In a minute I'll be adding a metadata match requirement to the flavour. We need to use a namespace on that match, otherwise the nodes pass the AggregateInstanceExtraSpecsFilter but then fail as they don't match the ComputeCapabilitiesFilter, and you end up failing to schedule with "no host found" errors.

I'm going to show what to do using the command line tools, but you can also do this from Horizon using the metadata fields and Host Aggregates view.

Create a host aggregate

(keystone_admin)]# nova aggregate-create stdnodes
| Id | Name     | Availability Zone | Hosts | Metadata |
| 11 | stdnodes | -                 |       |          |
Add a node (a hypervisor can belong to multiple aggregates)
(keystone_admin)]# nova aggregate-add-host stdnodes cl0903u01.climb.cluster
Host cl0903u01.climb.cluster has been successfully added for aggregate 11 
| Id | Name     | Availability Zone | Hosts                     | Metadata |
| 11 | stdnodes | -                 | 'cl0903u01.climb.cluster' |          |
Now add some metadata to the aggregate:
(keystone_admin)]# nova aggregate-set-metadata 11 stdmem=true
Metadata has been successfully updated for aggregate 11.
| Id | Name     | Availability Zone | Hosts                     | Metadata      |
| 11 | stdnodes | -                 | 'cl0903u01.climb.cluster' | 'stdmem=true' |
Now I'm going to specify that an existing flavour needs to run on hypervisors with the aggregate property stdmem=true. First check the flavour:
(keystone_admin)]# nova flavor-show m1.tiny
| Property                   | Value   |
| OS-FLV-DISABLED:disabled   | False   |
| OS-FLV-EXT-DATA:ephemeral  | 0       |
| disk                       | 1       |
| extra_specs                | {}      |
| id                         | 1       |
| name                       | m1.tiny |
| os-flavor-access:is_public | True    |
| ram                        | 512     |
| rxtx_factor                | 1.0     |
| swap                       |         |
| vcpus                      | 1       |
Now we want to add that m1.tiny should have stdmem=true applied to it:
(keystone_admin)]# nova flavor-key m1.tiny set aggregate_instance_extra_specs:stdmem=true

(keystone_admin)]# nova flavor-show m1.tiny
| Property                   | Value                                             |
| OS-FLV-DISABLED:disabled   | False                                             |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                 |
| disk                       | 1                                                 |
| extra_specs                | {"aggregate_instance_extra_specs:stdmem": "true"} |
| id                         | 1                                                 |
| name                       | m1.tiny                                           |
| os-flavor-access:is_public | True                                              |
| ram                        | 512                                               |
| rxtx_factor                | 1.0                                               |
| swap                       |                                                   |
| vcpus                      | 1                                                 |
Note that the examples in the OpenStack docs don't include the namespace "aggregate_instance_extra_specs:" in front of the key name. As I mentioned above, when there are multiple filters enabled this may cause scheduling to fail: although a node passes the aggregate filter, it fails the compute capabilities filter.

So to summarise: to ensure small VMs don't land on the fat nodes, for each "small" flavour we specify that the stdmem=true property is required. This causes the filter to exclude the fat nodes when considering hosts for scheduling.
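
Doing this for every "small" flavour is easy to script. The sketch below is a dry run that only prints the flavor-key commands; the flavour names other than m1.tiny are hypothetical, and print_small_flavour_commands is just an illustrative helper name.

```shell
# Dry run: print (rather than execute) the namespaced flavor-key command
# for each flavour name passed as an argument.
print_small_flavour_commands() {
    for flavour in "$@"; do
        echo "nova flavor-key ${flavour} set aggregate_instance_extra_specs:stdmem=true"
    done
}

# Hypothetical flavour names; substitute the "small" flavours you actually have.
print_small_flavour_commands m1.tiny m1.small m1.medium
```

Once the output looks right it can be piped to sh from a shell that has the admin credentials loaded.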

Bumpy ride! ... troubleshooting Neutron

The last few weeks have been a bit of a bumpy ride with OpenStack! We've had it working, suddenly stop working, and failures on some hypervisors. It's, er, been interesting!

We've been replacing our IceHouse install here with Juno. This was a full rip and replace, as we re-architected the controller solution slightly so we now have three control/network nodes. We ran both in parallel for a while with different interfaces.

So a couple of the issues we've faced are below, which may serve to help someone else!


We had the Juno config working (with VXLAN instead of VLAN), but it only had 1 hypervisor attached to it. When we went to provision more hypervisors into the config, we were getting the evil "vif_type=binding_failed" out of Nova when instances were spawned on the new hypervisor.

Now if you Google for this, you'll get a lot of people complaining and a lot of people saying "works for me" type answers. So just to be clear, this error message comes out anytime you get any sort of error from the Neutron networking bit. It may just be a bad config for Neutron, so first off, go and check that your Neutron config looks sane!
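
One quick sanity check to run alongside reading the config is whether every Neutron agent is still reporting in. This is only a sketch: it assumes the output of neutron agent-list marks a dead agent with xxx in the alive column, and the sample rows below are made up for illustration.

```shell
# Reads `neutron agent-list` output on stdin and prints only the rows with a
# cell of xxx, which is how the client marks an agent that has stopped
# reporting. Prints nothing if every agent looks alive.
dead_agents() {
    grep -E '\|[[:space:]]*xxx[[:space:]]*\|' || true
}

# Made-up sample rows; normally you would run: neutron agent-list | dead_agents
printf '%s\n' \
    '| 1a2b | Open vSwitch agent | hv-a.example | :-)   | True |' \
    '| 3c4d | Open vSwitch agent | hv-b.example | xxx   | True |' | dead_agents
```

A hypervisor whose Open vSwitch agent shows up here (or is missing from the list entirely) is a likely candidate for the binding failure.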

Ours did, but still there are some issues that you can get depending on how you configured it, so we've had varying success with the following resolutions:

1. Go nuclear on Neutron

I've seen the need to do this when the Neutron database sync and services have been started before Neutron has been configured. So if you think you might have done this, this might help. Note this will WIPE any networks you may have configured and will confuse the heck out of OpenStack if you have running VMs when you do this!

  • stop all neutron services on controller/network nodes, as well as neutron agents running on hypervisor nodes
  • drop the neutron database from mysql
  • re-create the neutron database and assign permissions for the neutron db user
  • re-sync the Neutron database by running

  neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade juno

  • At this point restart Neutron services. It may work. But it may also be confused, so you may want to try this step again, this time also following the steps below to remove all OVS bridges
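
The steps above can be sketched as a dry run that only prints the commands for a controller/network node. The service names are the ones used elsewhere in this post; the MySQL statements, the 'neutron'@'%' grant and the NEUTRON_DBPASS placeholder are assumptions, so adapt them to however your database user was originally created.

```shell
# Dry run: print the reset steps instead of running them. Nothing here
# touches the database or any service.
print_neutron_reset_plan() {
    cat <<'EOF'
systemctl stop neutron-server.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-metadata-agent.service neutron-openvswitch-agent.service
mysql -u root -p -e "DROP DATABASE neutron; CREATE DATABASE neutron; GRANT ALL PRIVILEGES ON neutron.* TO 'neutron'@'%' IDENTIFIED BY 'NEUTRON_DBPASS';"
neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade juno
systemctl start neutron-server.service neutron-dhcp-agent.service neutron-l3-agent.service neutron-metadata-agent.service neutron-openvswitch-agent.service
EOF
}

print_neutron_reset_plan
```

Given how destructive this is, it is probably best to read the printed plan and run each line by hand rather than piping it anywhere.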

2. Remove all the OVS bridge interfaces and try again

I've needed to do this on both hypervisors and network nodes to actually get things working. Whilst it may appear that the VXLAN tunnels have come up, they may not actually be working correctly. To fix this:
  • stop neutron services (or agent if on a hypervisor)
  • delete the OVS bridges
  • re-create the OVS bridges
  • reboot the node
Note that on hypervisors, I found that if we didn't reboot, it still didn't always work correctly. So do try that although it is extreme!

Typically this process would look something like (on a network node):
  systemctl stop neutron-dhcp-agent.service neutron-l3-agent.service neutron-metadata-agent.service neutron-openvswitch-agent.service neutron-server.service
  ovs-vsctl del-br br-int
  ovs-vsctl del-br br-tun
  ovs-vsctl add-br br-int
  ovs-vsctl add-br br-tun
  systemctl start neutron-dhcp-agent.service neutron-l3-agent.service neutron-metadata-agent.service neutron-openvswitch-agent.service neutron-server.service
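
After re-creating the bridges, a first sanity check is simply that they exist. The sketch below filters the output of ovs-vsctl list-br; missing_bridges is a hypothetical helper name, and passing this check does not prove the VXLAN tunnels are actually carrying traffic.

```shell
# Reads bridge names on stdin (one per line, as printed by `ovs-vsctl list-br`)
# and prints each expected bridge, given as an argument, that is not present.
missing_bridges() {
    have=$(cat)
    for br in "$@"; do
        printf '%s\n' "$have" | grep -qxF -- "$br" || echo "$br"
    done
}

# Sample input for illustration; on a real node you would run:
#   ovs-vsctl list-br | missing_bridges br-int br-tun
printf 'br-int\nbr-ex\n' | missing_bridges br-int br-tun
```

An empty result means both bridges are there; anything printed is a bridge that still needs creating.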

3. /etc/hostname

So this one surprised me and took a while to dig out. We're running CentOS 7 on our hypervisors, and one of the install scripts we have puts the short hostname into /etc/hostname. Somehow this breaks things and caused the "vif_type=binding_failed" error. Seriously, I don't understand this, as the resolv.conf file contains the DNS search suffix for the private network.

But still, this caused us issues. Check that you have the fully qualified name in the hostname file and try again!

This is why it was working fine with one hypervisor, but adding additional ones wouldn't work. Interestingly, if the scheduler tried to schedule on a "broken" hypervisor, it failed (due to the hostname), and then subsequently tried on a known good hypervisor, then it would also fail on that hypervisor with the same error. I don't get it either!
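
A rough way to spot this before adding a hypervisor is to check whether the name in /etc/hostname contains a domain part. The sketch below is deliberately crude (it only looks for a dot with something either side), and both helper names are hypothetical.

```shell
# Succeeds if the name looks fully qualified, i.e. it contains a dot with at
# least one character on each side. This is a rough check, not DNS validation.
is_fqdn() {
    case "$1" in
        *?.?*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Report on the first line of a hostname file (default: /etc/hostname).
check_hostname_file() {
    name=$(head -n 1 "${1:-/etc/hostname}")
    if is_fqdn "$name"; then
        echo "ok: $name"
    else
        echo "short hostname: $name"
    fi
}

if [ -r /etc/hostname ]; then
    check_hostname_file
fi
```

Running this across all hypervisors before enrolling them would flag any node where the install script wrote only the short name.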

Slow external network access

We have 10GbE bonded links for our hypervisors, but were getting really slow off-site network access. I eventually tracked this down to likely being caused by MTU issues. The default MTU for CentOS et al is 1500, and our VMs also had an MTU of 1500. Except of course we are using VXLAN, so there's an extra 50 bytes or so tagged on for the VXLAN frame header. As far as I can see, what was happening is that the VMs were spitting out 1500 byte frames, then the hypervisor was encapsulating the VXLAN traffic, adding 50 bytes, and so the emitted network frame would be 1550 bytes, which was higher than the interface MTU.
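
The arithmetic can be written down as a small sketch. It assumes an IPv4 outer header, where the 50 bytes break down as inner Ethernet header (14) + VXLAN header (8) + UDP header (8) + outer IPv4 header (20); the function names are just for illustration.

```shell
# Bytes VXLAN adds on top of the guest's packet with an IPv4 outer header:
# inner Ethernet (14) + VXLAN (8) + UDP (8) + outer IPv4 (20) = 50
VXLAN_OVERHEAD=$((14 + 8 + 8 + 20))

# Size of the encapsulated packet for a given guest MTU.
encapsulated_size() { echo $(( $1 + VXLAN_OVERHEAD )); }

# Largest guest MTU that still fits within a given physical interface MTU.
max_guest_mtu() { echo $(( $1 - VXLAN_OVERHEAD )); }

echo "guest MTU 1500 -> encapsulated size $(encapsulated_size 1500)"
echo "physical MTU 1500 -> guest MTU at most $(max_guest_mtu 1500)"
echo "physical MTU 9000 -> guest MTU at most $(max_guest_mtu 9000)"
```

So the two ways out are either to lower the MTU inside the VMs or to raise it on the physical network.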

Of course with 10GbE interfaces we probably want larger MTUs anyway, so I went ahead and changed the MTU to 9000 and restarted networking.

This caused chaos and broke things. It took a while to track this down! Chaos meaning things like the MariaDB cluster no longer working: bringing a node back into the cluster would start to work and then fail when doing the state transfer. This took a lot of head scratching and pondering, but we eventually worked out that the MTU on the switch was set to something like 2500, which seems like a strange default for a 10GbE switch to me! Anyway, increasing it to over 9000 (caveat about different manufacturers using different calculations here!) made the problem go away.
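
A mismatch like this can be found quickly by sending pings that are not allowed to fragment. The sketch below only prints the probe command; it assumes the Linux iputils ping (where -M do forbids fragmentation), and other-node.example is a placeholder hostname.

```shell
# ICMP echo payload that exactly fills an IPv4 packet of the given MTU:
# the MTU minus the IPv4 header (20 bytes) and the ICMP header (8 bytes).
ping_payload_for_mtu() { echo $(( $1 - 20 - 8 )); }

# Print a probe command for a host and MTU; arguments are: host, MTU.
print_mtu_probe() {
    echo "ping -M do -c 3 -s $(ping_payload_for_mtu "$2") $1"
}

# other-node.example is a placeholder for another node on the same network.
print_mtu_probe other-node.example 9000
```

If the probe at 9000 fails while a smaller size works, something on the path (such as the switch) has a lower MTU than the hosts.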

Can we get full line rate out of the VXLAN interfaces? Well, actually I don't know, but I did some very unscientific speed testing from both a network node and from inside a VM (before changing the MTU this was something like 6.7Kb/sec):

[centos@vm-test ~]$ speedtest-cli
Retrieving configuration...
Retrieving server list...
Testing from University of Birmingham (
Selecting best server based on latency...
Hosted by Warwicknet Ltd. (Coventry) [28.29 km]: 15.789 ms
Testing download speed........................................
Download: 267.39 Mbit/s
Testing upload speed..................................................
Upload: 154.21 Mbit/s

[centos@controller-1]# ./speedtest-cli
Retrieving configuration...
Retrieving server list...
Testing from University of Birmingham (
Selecting best server based on latency...
Hosted by RapidSwitch (Leicester) [55.83 km]: 11.392 ms
Testing download speed........................................
Download: 251.66 Mbit/s
Testing upload speed..................................................
Upload: 156.93 Mbit/s

So I don't know if that is the peak we can get from inside a VM, as the controller also seems to be peaking around the same value. It's something I'll probably come back to another day.

No external DNS resolution from inside VMs

By default, the DHCP agent used on the network nodes will enable dnsmasq to act as a DNS server. This means that tenant-network names will resolve locally on the VMs, however we found that our VMs couldn't resolve external names.
This took a little bit of digging to resolve, but at least I have a plausible sounding cause for this.
The network nodes are attached to a private network and they use a DNS server on that private network for names rather than a publicly routable DNS server.
The dnsmasq instance is running inside a network namespace and that network namespace has its own routing rules, i.e. VMs can't route to our private management network. This also means that the dnsmasq instance can't use the DNS servers in /etc/resolv.conf as they have no route to actually do the name resolution.
Luckily, the dhcp agent service can be told to use a different set of DNS servers, so on our network nodes, we now have in /etc/neutron/dhcp_agent.ini:

  dnsmasq_dns_servers = 8.8.8.8,8.8.4.4

Then restart the neutron-dhcp-agent service. These IPs are of course those of the Google public DNS, but you could use others.
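
With several network nodes it can help to script the edit. This is a minimal sketch, assuming GNU sed and a file with a [DEFAULT] section; set_dnsmasq_dns_servers is a hypothetical helper, and 8.8.8.8 and 8.8.4.4 are the Google public DNS addresses used purely as an example.

```shell
# Hypothetical helper: set dnsmasq_dns_servers in a dhcp_agent.ini style file.
# Arguments: path to the file, comma-separated list of DNS servers.
set_dnsmasq_dns_servers() {
    conf_file="$1"
    servers="$2"
    if grep -q '^[#[:space:]]*dnsmasq_dns_servers[[:space:]]*=' "$conf_file"; then
        # Replace the existing (possibly commented-out) setting in place.
        sed -i "s|^[#[:space:]]*dnsmasq_dns_servers[[:space:]]*=.*|dnsmasq_dns_servers = ${servers}|" "$conf_file"
    else
        # Otherwise add the setting directly after the [DEFAULT] header.
        sed -i "/^\[DEFAULT\]/a dnsmasq_dns_servers = ${servers}" "$conf_file"
    fi
}

# Demonstrate on a throwaway file rather than the real config.
demo=$(mktemp)
printf '[DEFAULT]\n' > "$demo"
set_dnsmasq_dns_servers "$demo" "8.8.8.8,8.8.4.4"
cat "$demo"
rm -f "$demo"
```

On a real node the first argument would be /etc/neutron/dhcp_agent.ini, followed by the restart of neutron-dhcp-agent.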