Thursday, 7 May 2015

Bumpy ride! ... troubleshooting Neutron

The last few weeks have been a bit of a bumpy ride with OpenStack! We've had it working, had it suddenly stop working, and had failures on some hypervisors but not others. It's, er, been interesting!

We've been replacing our Icehouse install here with Juno. This was actually a full rip and replace, as we re-architected the controller solution slightly so that we now have three control/network nodes, running both stacks in parallel for a while on different interfaces.

So here are a couple of the issues we've faced, which may serve to help someone else!

vif_type=binding_failed

We had the Juno config working, with VXLAN instead of VLAN, but with only one hypervisor attached to it. When we went to provision more hypervisors into the config, we were getting the evil "vif_type=binding_failed" out of Nova when instances were spawned on a new hypervisor.

Now if you Google for this, you'll get a lot of people complaining and a lot of "works for me" type answers. So just to be clear: this error message comes out any time you get any sort of error from the Neutron networking side. It may just be a bad config for Neutron, so first off, go and check that your Neutron config looks sane!

Ours did, but still there are some issues that you can get depending on how you configured it, so we've had varying success with the following resolutions:

1. Go nuclear on Neutron

I've seen the need to do this when the Neutron database sync and services have been started before Neutron has been configured, so if you think you might have done that, this might help. Note this will WIPE any networks you may have configured and will confuse the heck out of OpenStack if you have running VMs when you do this!

  • stop all neutron services on controller/network nodes, as well as neutron agents running on hypervisor nodes
  • drop the neutron database from mysql (see the sketch after this list)
  • re-create the neutron database and assign permissions for the neutron db user
  • re-sync the Neutron database by running

  neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade juno

  • at this point, restart the Neutron services. It may work, but it may also still be confused, so you may need to try this step again, combined with removing all the OVS bridges as described in the next section
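
For reference, the database drop and re-create steps typically look something like the below on a controller. The 'neutron' db user and NEUTRON_DBPASS are placeholders, so match them to what's actually in your neutron.conf:

  mysql -u root -p
  mysql> DROP DATABASE neutron;
  mysql> CREATE DATABASE neutron;
  mysql> -- NEUTRON_DBPASS is a placeholder, use your real db password
  mysql> GRANT ALL PRIVILEGES ON neutron.* TO 'neutron'@'localhost' IDENTIFIED BY 'NEUTRON_DBPASS';
  mysql> GRANT ALL PRIVILEGES ON neutron.* TO 'neutron'@'%' IDENTIFIED BY 'NEUTRON_DBPASS';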

2. Remove all the OVS bridge interfaces and try again

I've needed to do this on both hypervisors and network nodes to actually get things working. Whilst it may appear that the VXLAN tunnels have come up, they may not actually be working correctly. To fix this:
  • stop neutron services (or agent if on a hypervisor)
  • delete the OVS bridges
  • re-create the OVS bridges
  • reboot the node
Note that on hypervisors, I found that if we didn't reboot, it still didn't always work correctly, so do try that even though it is extreme!

Typically this process would look something like (on a network node):
  systemctl stop neutron-dhcp-agent.service neutron-l3-agent.service neutron-metadata-agent.service neutron-openvswitch-agent.service neutron-server.service
  ovs-vsctl del-br br-int
  ovs-vsctl del-br br-tun
  ovs-vsctl add-br br-int
  ovs-vsctl add-br br-tun
  systemctl restart neutron-dhcp-agent.service neutron-l3-agent.service neutron-metadata-agent.service neutron-openvswitch-agent.service neutron-server.service
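
After the reboot (or restart), it's worth checking that the bridges are back and the agents have re-registered as alive (this assumes you have admin credentials sourced for the neutron CLI):

  ovs-vsctl show
  neutron agent-list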

3. /etc/hostname

So this one surprised me and took a while to dig out. We're running CentOS 7 on our hypervisors, and one of our install scripts puts the short hostname into /etc/hostname. Somehow this breaks things and caused the vif_type=binding_failed error. Seriously, I don't understand this, as resolv.conf contains the DNS search suffix for the private network.

But still, this caused us issues. Check that you have the fully qualified name in the hostname file and try again!
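
On CentOS 7, a quick way to check and fix this is hostnamectl (the FQDN below is made up, so use your own):

  hostnamectl status
  hostnamectl set-hostname hv01.cluster.example.org   # example FQDN only
  cat /etc/hostname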

This is why it was working fine with one hypervisor, but adding additional ones wouldn't work. Interestingly, if the scheduler tried an instance on a "broken" hypervisor, it failed (due to the hostname), and when it subsequently retried on a known-good hypervisor, it failed there too with the same error. I don't get it either!

Slow external network access

We have 10GbE bonded links for our hypervisors, but were getting really slow off-site network access. I eventually tracked this down to likely being caused by MTU issues. The default MTU for CentOS et al. is 1500, and our VMs also had an MTU of 1500. Except of course we are using VXLAN, so there's an extra 50 bytes or so added for the VXLAN encapsulation. As far as I can see, the VMs were emitting 1500-byte frames, the hypervisor was adding the ~50 bytes of VXLAN encapsulation, and so the resulting frame would be around 1550 bytes, higher than the interface MTU.
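
One common workaround (not the route we took, as you'll see below) is to clamp the guest MTU down so that VM frames plus the VXLAN overhead still fit in 1500 bytes, by pointing the DHCP agent at an extra dnsmasq config file. Roughly, in /etc/neutron/dhcp_agent.ini:

  dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf

and then in /etc/neutron/dnsmasq-neutron.conf:

  # DHCP option 26 is the interface MTU; 1450 leaves room for the VXLAN header
  dhcp-option-force=26,1450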

Of course with 10GbE interfaces we probably want larger MTUs anyway, so I went ahead and changed the MTU to 9000 and restarted networking.
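
On CentOS 7 that's just a case of adding an MTU line to the relevant ifcfg file and bouncing the network. bond0 is an example name here; setting the MTU on the bond master propagates it to the slave interfaces:

  # in /etc/sysconfig/network-scripts/ifcfg-bond0
  MTU=9000

  systemctl restart network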

This caused chaos and broke things, and it took a while to track down! By chaos I mean things like the MariaDB cluster no longer working: bringing a node back into the cluster would start to work and then fail during the state transfer. This took a lot of head scratching and pondering, but we eventually worked out that the MTU on the switch was set to something like 2500, which seems like a strange default for a 10GbE switch to me! Anyway, increasing it to over 9000 (with the caveat that different manufacturers calculate the maximum frame size differently!) made the problem go away.
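
A handy way to check whether jumbo frames actually make it end to end is ping with fragmentation disallowed: 8972 bytes of ICMP payload plus the 28 bytes of ICMP/IP headers makes a 9000-byte packet (<other-node> is a placeholder):

  ping -M do -s 8972 -c 3 <other-node>

If something in the middle has a smaller MTU, this fails where a normal ping succeeds.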

Can we get full line rate out of the VXLAN interfaces? Well, actually I don't know, but I did some very unscientific speed testing from both a network node and from inside a VM (before changing the MTU, the VM was getting something like 6.7Kb/sec):

[centos@vm-test ~]$ speedtest-cli
Retrieving speedtest.net configuration...
Retrieving speedtest.net server list...
Testing from University of Birmingham (147.188.xxx.xxx)...
Selecting best server based on latency...
Hosted by Warwicknet Ltd. (Coventry) [28.29 km]: 15.789 ms
Testing download speed........................................
Download: 267.39 Mbit/s
Testing upload speed..................................................
Upload: 154.21 Mbit/s

[centos@controller-1]# ./speedtest-cli
Retrieving speedtest.net configuration...
Retrieving speedtest.net server list...
Testing from University of Birmingham (147.188.xxx.xxx)...
Selecting best server based on latency...
Hosted by RapidSwitch (Leicester) [55.83 km]: 11.392 ms
Testing download speed........................................
Download: 251.66 Mbit/s
Testing upload speed..................................................
Upload: 156.93 Mbit/s

So I don't know if that is the peak we can get from inside a VM, as the controller also seems to peak around the same value. It's something I'll probably come back to another day.

No external DNS resolution from inside VMs

By default, the DHCP agent used on the network nodes will run dnsmasq to act as a DNS server. This means that tenant-network names will resolve locally on the VMs; however, we found that our VMs couldn't resolve external names.
This took a little bit of digging, but at least I have a plausible-sounding cause for it.
The network nodes are attached to a private network, and they use a DNS server on that private network rather than a publicly routable one.
The dnsmasq instance runs inside a network namespace, and that namespace has its own routing rules, i.e. it can't route to our private management network. This also means that the dnsmasq instance can't use the DNS servers in /etc/resolv.conf, as it has no route over which to actually do the name resolution.
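
You can see this from the network node itself: each tenant network gets a qdhcp- namespace, and from inside it there is no route to the management network. The network UUID and the DNS server IP below are placeholders:

  # list the DHCP namespaces on the network node
  ip netns | grep qdhcp
  # the namespace's routing table has no route to the management net
  ip netns exec qdhcp-<network-uuid> ip route
  # so the private DNS server is unreachable from in here
  ip netns exec qdhcp-<network-uuid> ping -c 1 <private-dns-ip>
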
Luckily, the dhcp agent service can be told to use a different set of DNS servers, so on our network nodes, we now have in /etc/neutron/dhcp_agent.ini:

  dnsmasq_dns_servers = 8.8.8.8,8.8.4.4

Then restart the neutron-dhcp-agent service. These IPs are of course Google's public DNS servers, but you could use others.
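
So to apply the change:

  systemctl restart neutron-dhcp-agent.service

and then from inside a VM, external names should now resolve:

  [centos@vm-test ~]$ nslookup www.google.com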
