Thursday, 7 May 2015

Directing the OpenStack scheduler ... aka helping it place our flavours!

We have two distinct types of hardware for our OpenStack environment:

  • large memory (3TB RAM, up to 240 vCPUs)
  • standard memory (512GB RAM, 64 vCPUs)
What we really want to do is reserve the large memory machines for special flavours (or flavors!) which only certain users are allowed to use. Coming from an HPC background, this is the sort of thing that would be trivial to do, and OpenStack can in fact do it using host aggregates and scheduler filters, but I think the docs are lacking and some of the published examples don't work as listed - particularly if you have multiple filters enabled!

Just to note that, out of the box, the scheduler actually prefers the fat nodes: the default weighting step favours hypervisors with more free memory, so the 3TB machines keep winning. That may be fine for some sites, but we don't want to "waste" the fat nodes - we'd rather keep them free for users with big needs than have to migrate small VMs off them later to make space.
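As an aside, if you only wanted to nudge that weighting rather than fence the fat nodes off completely, the RAM weigher can be tuned in /etc/nova/nova.conf on the scheduler nodes - a rough sketch below, not something we actually do here:

# negative values make the RAM weigher stack instances onto fuller hosts
# rather than spreading them onto the emptiest (i.e. the fat) nodes
ram_weight_multiplier = -1.0

That only changes the preference though, it doesn't stop small VMs landing on a fat node, which is why we take the aggregate and filter route instead.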

First up, we're going to use host aggregates. These are arbitrary groups of hosts; they can be tied to availability zones, but don't have to be - if you create a host aggregate without setting an availability zone, it isn't something that is visible to an end user.

This ticks our first requirement - I don't want users to have to remember which availability zone to pick from the drop-down in Horizon based on the flavour they are using.

To implement what we want, we are also going to use the "AggregateInstanceExtraSpecsFilter" filter. First we need to enable it on our controller nodes - specifically, on any node which is running the openstack-nova-scheduler service. Edit /etc/nova/nova.conf and change:

scheduler_default_filters=RetryFilter,AggregateInstanceExtraSpecsFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter


Your current list of filters may differ from this; I added AggregateInstanceExtraSpecsFilter straight after RetryFilter, which was first in the default list. Restart the scheduler service on each node to ensure the new filter is loaded.
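On our RDO-style install the scheduler runs as openstack-nova-scheduler, so the restart is just something like this (adjust for your distro and init system):

(keystone_admin)]# systemctl restart openstack-nova-scheduler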

Note also that we have the ComputeCapabilitiesFilter enabled, and I think this is why the usage examples online don't work out of the box. In a minute I'll be adding a metadata-match requirement to the flavour; we need to use a namespace on that key, otherwise the hosts pass the AggregateInstanceExtraSpecsFilter but then fail the ComputeCapabilitiesFilter (which treats the bare key as a required compute capability), and you end up failing to schedule with "no valid host" errors.
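To make that concrete, the form you'll often see in examples is the bare key with no namespace, something like the line below - don't actually run this if you have the ComputeCapabilitiesFilter enabled:

(keystone_admin)]# nova flavor-key m1.tiny set stdmem=true

The bare stdmem key also gets treated by the ComputeCapabilitiesFilter as a host capability that no hypervisor advertises, so every host is rejected. The namespaced version we set further down is only inspected by the AggregateInstanceExtraSpecsFilter.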

I'm going to show what to do using the command line tools; you can also do this from Horizon using the metadata fields and the Host Aggregates view.

Create a host aggregate


(keystone_admin)]# nova aggregate-create stdnodes
+----+----------+-------------------+-------+----------+
| Id | Name     | Availability Zone | Hosts | Metadata |
+----+----------+-------------------+-------+----------+
| 11 | stdnodes | -                 |       |          |
+----+----------+-------------------+-------+----------+
Add a node (a hypervisor can belong to multiple aggregates)
(keystone_admin)]# nova aggregate-add-host stdnodes cl0903u01.climb.cluster
Host cl0903u01.climb.cluster has been successfully added for aggregate 11 
+----+----------+-------------------+---------------------------+----------+
| Id | Name     | Availability Zone | Hosts                     | Metadata |
+----+----------+-------------------+---------------------------+----------+
| 11 | stdnodes | -                 | 'cl0903u01.climb.cluster' |          |
+----+----------+-------------------+---------------------------+----------+
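The rest of the standard hypervisors are added in exactly the same way (the hostname below is just a placeholder for the others), and you can check membership with aggregate-details:

(keystone_admin)]# nova aggregate-add-host stdnodes cl0903u02.climb.cluster
(keystone_admin)]# nova aggregate-details 11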
Now add some metadata to the aggregate:
(keystone_admin)]# nova aggregate-set-metadata 11 stdmem=true
Metadata has been successfully updated for aggregate 11.
+----+----------+-------------------+---------------------------+---------------+
| Id | Name     | Availability Zone | Hosts                     | Metadata      |
+----+----------+-------------------+---------------------------+---------------+
| 11 | stdnodes | -                 | 'cl0903u01.climb.cluster' | 'stdmem=true' |
+----+----------+-------------------+---------------------------+---------------+
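Because we never set an availability zone on the aggregate, nothing new appears for end users; a quick sanity check is that the zone list is unchanged:

(keystone_admin)]# nova availability-zone-list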
Now I'm going to specify that my existing flavour needs to run on hypervisors with the aggregate property stdmem=true. First, check the flavour:
(keystone_admin)]# nova flavor-show m1.tiny
+----------------------------+---------+
| Property                   | Value   |
+----------------------------+---------+
| OS-FLV-DISABLED:disabled   | False   |
| OS-FLV-EXT-DATA:ephemeral  | 0       |
| disk                       | 1       |
| extra_specs                | {}      |
| id                         | 1       |
| name                       | m1.tiny |
| os-flavor-access:is_public | True    |
| ram                        | 512     |
| rxtx_factor                | 1.0     |
| swap                       |         |
| vcpus                      | 1       |
+----------------------------+---------+
Now we want to add that m1.tiny should have stdmem=true applied to it:
(keystone_admin)]# nova flavor-key m1.tiny set aggregate_instance_extra_specs:stdmem=true

(keystone_admin)]# nova flavor-show m1.tiny
+----------------------------+---------------------------------------------------+
| Property                   | Value                                             |
+----------------------------+---------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                             |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                 |
| disk                       | 1                                                 |
| extra_specs                | {"aggregate_instance_extra_specs:stdmem": "true"} |
| id                         | 1                                                 |
| name                       | m1.tiny                                           |
| os-flavor-access:is_public | True                                              |
| ram                        | 512                                               |
| rxtx_factor                | 1.0                                               |
| swap                       |                                                   |
| vcpus                      | 1                                                 |
+----------------------------+---------------------------------------------------+
Note that the examples in the OpenStack docs don't include the "aggregate_instance_extra_specs:" namespace in front of the key name. As I mentioned above, with multiple filters enabled this can cause scheduling to fail: a host passes the aggregate filter but then fails the compute capabilities filter.
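A quick way to verify the end result is to boot a test instance from the flavour and, as admin, check which hypervisor it landed on - the image and network here are placeholders for whatever you have locally:

(keystone_admin)]# nova boot --flavor m1.tiny --image cirros --nic net-id=<net-uuid> schedtest
(keystone_admin)]# nova show schedtest | grep hypervisor_hostname

The OS-EXT-SRV-ATTR:hypervisor_hostname field should come back as one of the stdnodes members.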

So to summarise: to ensure small VMs don't land on the fat nodes, for each "small" flavour we specify that the stdmem=true property is required, which causes the filter to exclude the fat nodes (which aren't in the stdnodes aggregate) when scheduling.
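For completeness, the fat nodes get the mirror image of this: their own aggregate carrying something like fatmem=true, big-memory flavours keyed to that property, and those flavours made non-public so only the right projects can use them. The names and IDs below are illustrative rather than our exact setup:

(keystone_admin)]# nova aggregate-create fatnodes
(keystone_admin)]# nova aggregate-add-host fatnodes <fat-hypervisor-fqdn>
(keystone_admin)]# nova aggregate-set-metadata <aggregate-id> fatmem=true
(keystone_admin)]# nova flavor-key m1.3tbfat set aggregate_instance_extra_specs:fatmem=true
(keystone_admin)]# nova flavor-access-add m1.3tbfat <tenant-id>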
