Monday, 8 September 2014

Highly available MariaDB for OpenStack

For our OpenStack installation, we'd like an HA core set of services. This is the first in (probably!) a series of blog posts on getting our core stack of services into an HA environment. Wherever possible, I'd like them to run active/active rather than using something like Pacemaker to hand services over between nodes.

OpenStack supports both MySQL and Postgres for its underlying database. Whilst I'm generally a Postgres-preferring person, it doesn't really do active/active, most of the OpenStack distros seem to use MySQL, and the docs are based around MySQL. We also run an HA MySQL cluster for another service, so it's something I'm familiar with.

Ideally you want at least three nodes in the cluster; however, for this project I'm planning to use the two GPFS servers for the database cluster. I'm also opting for the MariaDB fork rather than MySQL itself. My testing is based around 10.0.13 using the RPMs from MariaDB, specifically the MariaDB Galera Cluster version.

To avoid split-brain conditions with a two-node cluster, we also use garb, the Galera arbitrator (shipped in the galera RPM).

Initial build and config

I'm using xcat to install the machines, and have added the following packages to the profile:
mariadb/galera
mariadb/MariaDB-Galera-server
mariadb/MariaDB-client
mariadb/percona-xtrabackup

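For reference, a rough manual equivalent of that package set would be something like this (it assumes the MariaDB and Percona yum repositories are already set up - xcat takes care of all of this for me):
yum install -y galera MariaDB-Galera-server MariaDB-client percona-xtrabackup
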
Now, as I'm running a two-node cluster, I also added percona-xtrabackup (2.2.3) from Percona. This backup tool integrates with the Galera wsrep provider and allows the initial database state transfer to occur without blocking one of the cluster nodes - the mysqldump or rsync SST methods would block access to the donor server while a newly joined server received its copy of the data. In general we won't be doing this often, but what we would like is for a reinstall of a server node to be possible without intervention and without interruption to the service, i.e. we can do a DR on a server node without disruption.

We use xcat to push a couple of config files out to the nodes, plus a postbootscript which locks down the local database config (based on the mysql_secure_installation script). The first is /etc/my.cnf:
[client-server]

[server]
datadir=/var/lib/mysql

#
# include all files from the config directory
#

!includedir /etc/my.cnf.d

And the second, /etc/my.cnf.d/server.cnf:
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
default_storage_engine=InnoDB
bind-address=CHANGEIP
# settings recommended for OpenStack
# http://docs.openstack.org/icehouse/install-guide/install/yum/content/basics-database-controller.html
innodb_file_per_table
collation-server = utf8_general_ci
init-connect = 'SET NAMES utf8'
character-set-server = utf8
[mariadb]
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
query_cache_size=0
innodb_flush_log_at_trx_commit=2
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://server1-osmgmt,server2-osmgmt
wsrep_cluster_name='climb_db_cluster'
wsrep_sst_auth=galerasync:galerasyncuserdbpassword
wsrep_node_address=CHANGEIP
wsrep_sst_method=xtrabackup-v2
[sst]
streamfmt=xbstream

A syncfile.post script substitutes the two CHANGEIP references with the host's IP address on the OpenStack management network (which is a tagged VLAN on the 40GbE network we have) - this is also the name of the host in the wsrep_cluster_address line. The postbootscript also adds the galerasync user, sets the password and gives it access to all databases; this user will be used to sync data between cluster nodes. (Before I did this, wsrep picked the IP on the first network interface, which in my case happens to be bond0 - it's 'up', but currently not plumbed into a switch, so it can't pass traffic!) By binding the database and wsrep to a specific IP, we can control where the traffic goes, and we can also use haproxy and keepalived to load-balance an additional IP across the systems without needing to run the database on a different port number.
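
For illustration, a minimal sketch of what the postscript does - the interface name and password come from the configs in this post, but treat the exact commands as an approximation rather than the real syncfile.post:
# Substitute CHANGEIP with this host's IP on the management VLAN
# (mlx0.3006 is the interface used later in the keepalived config):
MGMT_IP=$(ip -4 -o addr show dev mlx0.3006 | awk '{print $4}' | cut -d/ -f1)
sed -i "s/CHANGEIP/${MGMT_IP}/g" /etc/my.cnf.d/server.cnf

# Create the sync user referenced by wsrep_sst_auth:
mysql -u root -e "GRANT ALL PRIVILEGES ON *.* TO 'galerasync'@'localhost' IDENTIFIED BY 'galerasyncuserdbpassword';"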

Now you may note that we have datadir defined in both my.cnf and server.cnf; this is a quirk of using xtrabackup - the tool doesn't traverse included config files, so datadir needs defining at the 'top' level.

Starting up the cluster

The names of both servers are included in the config file, but to start the cluster from cold you'll need to log on to one of the nodes and bootstrap the cluster by hand:
mysqld --wsrep_cluster_address=gcomm:// --user=mysql

Once this is done you can start the second node of the cluster. Initially I start things by hand like this, as it allows me to diagnose what is going on with the cluster:
mysqld --wsrep_cluster_address=gcomm://server1-osmgmt,server2-osmgmt --user=mysql

Once I can see everything is happy, I'll stop one node with the init script, restart it with the init script, and then do the same on the second node. A rebooted node will now come up normally. Obviously, if we ever shut all the servers down, one will need bringing up by hand again to get quorum in place.

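A quick way to see that everything is happy is to check the wsrep status variables from the mysql client (these are standard Galera status variables; the values shown are what a healthy two-node cluster should report):
mysql -u root -e "SHOW STATUS LIKE 'wsrep_cluster_size';"        # expect 2
mysql -u root -e "SHOW STATUS LIKE 'wsrep_local_state_comment';" # expect Synced

# then cycle each node through the init script in turn
# (the MariaDB RPMs ship an init script called mysql):
service mysql stop
service mysql start
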
Just one other thing to note, I planned to use a separate partition for /var/lib/mysql  - that didn't quite work out.

Arbitrating to prevent split brain

Once we're happy that basic two-node operation is working, we then add the garb arbitrator node. This is pretty simple: just install the galera RPM on another system and configure /etc/sysconfig/garb:
# A space-separated list of node addresses (address[:port]) in the cluster
GALERA_NODES="server1-osmgmt:4567 server2-osmgmt:4567"

# Galera cluster name, should be the same as on the rest of the nodes.
GALERA_GROUP="climb_db_cluster"

I actually push this config out to all of my nova compute nodes, but only one of them runs garb at any point in time. The only reason for needing garb is so that if one of my server nodes dies we don't end up with a split-brain condition; the database will continue running and we'll be able to re-add the failed server node again.
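
Starting the arbitrator on the chosen node is then just a case of (assuming the init script shipped in the galera RPM matches the sysconfig file name):
service garb start
chkconfig garb on   # only on the one node chosen to run the arbitrator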

Load balancing and HA

We now have an active/active cluster in place, but this doesn't help us with load balancing or fail-over, as OpenStack can only be configured to 'talk' to one database endpoint. To help with this we'll now add an additional IP address, using haproxy and keepalived to keep the service up and to load-balance traffic to the real database IP addresses. This is where binding the database instance to a specific IP helps, as we can now add an haproxy service listening on port 3306 on the HA address which redirects to the backend databases.

We now need the following RPMs installed (again I get them onto the nodes from xcat):
# For HA
keepalived
haproxy

First up, for haproxy we want a user which can connect to the database service, so start the mysql client on one of the servers and add the user:
CREATE USER 'haproxy'@'server%.climb.cluster';

Note that we use the CREATE USER syntax for this rather than inserting directly into the mysql.user table. CREATE USER is replicated across the cluster, whereas an insert directly into the mysql.user table isn't (Galera only replicates changes to InnoDB tables, while DDL statements such as CREATE USER are replicated to every node).
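
As a quick check that this works - the user needs no privileges beyond being able to authenticate - something like the following, run from one of the server nodes (the user is restricted to hosts matching server%.climb.cluster), should succeed:
mysql -h server1-osmgmt -u haproxy -e 'SELECT 1;'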

We now need to configure keepalived. This will be used for the HA IP address and will automatically control moving it between the servers in the event of a failure. My /etc/keepalived/keepalived.conf looks something like this:
! Configuration File for keepalived
! generated by xcat postscript

global_defs {
   notification_email {
     foo@bar
   }
   notification_email_from foo@bar
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id CLIMB_server1
}

vrrp_script OSMGMT_HA {
   script "killall -0 haproxy"
   interval 2
   weight 2
}

vrrp_instance VIP_OSMGMT {
    state BACKUP
    interface mlx0.3006
    virtual_router_id 10
    priority 121
    virtual_ipaddress {
        10.30.20.52 dev mlx0.3006
    }
    track_script {
      OSMGMT_HA
    }
    authentication {
      auth_type PASS
      auth_pass somerandomstring
    }

}

(Again, an xcat postscript actually populates this with the correct interfaces. Note that the system with the higher priority will prefer to be the master.)

On server2, the priority is set to a higher number and the router_id has a different name in it. Obviously change the auth_pass to something else; it should be the same on all of your nodes providing the same service.
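
For illustration, the server2 copy would differ only in something like the following (the router_id name and priority value here are just examples - any priority higher than server1's 121 will do):
   router_id CLIMB_server2
    priority 122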

Now we can start keepalived and keep an eye on the messages file; once it's sorted itself out, we should see the extra IP address on server2:
% ip addr
12: mlx0.3006@mlx0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether f4:52:14:8b:50:72 brd ff:ff:ff:ff:ff:ff
    inet 10.30.21.66/22 brd 10.30.23.255 scope global mlx0.3006
    inet 10.30.20.52/32 scope global mlx0.3006
    inet6 fe80::f452:1400:18b:5072/64 scope link 
       valid_lft forever preferred_lft forever
We can see that 10.30.20.52 is up on the interface. Test that you can ping it from another system as well!

Before we set up haproxy we need to tweak a sysctl setting to allow it to bind to addresses which aren't present - as the HA IP address floats, it may not be on this host when haproxy starts. Edit /etc/sysctl.conf and add:
net.ipv4.ip_nonlocal_bind=1

To activate it now, run sysctl -p.

Finally we need to configure haproxy; this lives in /etc/haproxy/haproxy.cfg:
#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    log         10.30.0.254 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000

listen climbdbcluster 10.30.20.52:3306
    balance source
    mode tcp
    option tcpka
    option mysql-check user haproxy
    server server1-osmgmt 10.30.21.64:3306 check weight 1
    server server2-osmgmt 10.30.21.66:3306 check weight 1

Note that I'm logging to an external syslog server. The climbdbcluster IP address is the HA address, and we list the two real backend database servers, each having equal weight. Once haproxy is started it should be possible to connect to the database from other nodes using the HA IP address. If you see errors in the messages file about a server not being available, it's possible you forgot to give the haproxy user access to authenticate to the database; it doesn't need permission to do anything other than complete the authentication stage.
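
As a final sanity check - and a pointer to how the OpenStack services will use this later - something like the following should work; the keystone connection string is purely illustrative:
# Connect through the load-balanced address (from one of the server nodes,
# given the host restriction on the haproxy user):
mysql -h 10.30.20.52 -u haproxy -e 'SELECT 1;'

# OpenStack services then point their database connection at the HA address, e.g.:
# connection = mysql://keystone:KEYSTONE_DBPASS@10.30.20.52/keystone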

