Tuesday, 2 September 2014

Setting up a new GPFS file-system (for OpenStack)

Yes, this stuff is posted all over the place; this post is mostly for my records. It's also got some discussion of my choice of parameters for the use-case, i.e. using the OpenStack Cinder driver for GPFS.

One thing to note: make sure all your nodes have the same version of GPFS installed ... I've switched to testing with GPFS 4.1 for this project, but some of the nodes in the cluster were installed before the move to 4.1 and so still had some 3.5 remnant packages on them.

Oh, and secondly, if you have GPFS Standard or Advanced licenses, make sure you also add the gpfs.ext package (I got an error about pools not being supported in GPFS Express Edition because, well, gpfs.ext didn't exist before and so wasn't in the xCAT package list)!

Cluster config

All the systems in the cluster are equipped with Mellanox ConnectX-3 VPI cards; these are dual-personality cards supporting FDR InfiniBand/40GbE. We also have a 1GbE management network used for general tasks. For GPFS, we're planning to use verbs running over the InfiniBand fabric, falling back to 40GbE and finally 1GbE if needed. There's an SX6036 FDR switch, an SX1036 40GbE switch and an IBM 1GbE switch for the management side of things.

This will be a two-server-node cluster; each server is an IBM x3650 M4 system which is direct SAS-attached to the v3700 storage array. I've already blogged about the v3700 LUN config, so won't go over it here.

I'll assume at this point that the GPFS RPMs are already installed (gpfs.base, gpfs.gpl, gpfs.ext, gpfs.msg, and gpfs.gplbin appropriately built for the kernel in use; I also have gpfs.docs on my servers so I have the man pages available).

First up is to create a node file with all the nodes that are in the cluster (note that the nodes must be booted and ssh'able to as root). The format of the file is something like:

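To illustrate the format (the host names here are placeholders, not our real nodes): each line is a node name, optionally followed by a colon and the node designations, with quorum-manager on the two servers and the rest defaulting to non-quorum clients.

```
server1:quorum-manager
server2:quorum-manager
node01
node02
```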
Now from one of the servers, create a new GPFS cluster by running:
mmcrcluster -N climb.nodefile.gpfs --ccr-disable -p server1 -s server2 -A -C climbgpfs -r /usr/bin/ssh -R /usr/bin/scp -U climb.cluster

I also created a couple of node list files to pass into mmchlicense; it can't read the node-designation syntax of the node file used to create the cluster, so I have two files, climb.serverlist.gpfs:

and climb.nodelist.gpfs:
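These are just flat lists of host names, one per line; with the same placeholder names as above, they'd look something like:

```
# climb.serverlist.gpfs
server1
server2

# climb.nodelist.gpfs
node01
node02
```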

We now need to confirm that we have appropriately licensed all of our nodes:
mmchlicense server --accept -N climb.serverlist.gpfs 
mmchlicense client --accept -N climb.nodelist.gpfs

Now before we get on with the process of creating NSDs and filesystems, there are a bunch of cluster settings we want to configure first. To start, we're going to restrict the port range used for some GPFS admin-type commands; this will be handy if we ever get around to firewalling, or if we need to expose the cluster over IP to a remote cluster.
mmchconfig tscCmdPortRange=30000-30100

We also want to configure verbs so we can use RDMA over the Infiniband network:
mmchconfig verbsPorts="mlx4_0/1"
mmchconfig verbsRdma=enable
mmchconfig verbsRdmaSend=yes

The systems all have Ivy Bridge based CPUs in them, so NUMA domains are likely to be present. Set the flag to allow GPFS to interleave memory usage across domains, to prevent running out of memory in a single domain:
mmchconfig numaMemoryInterleave=yes

Note that our systems have a single dual-port ConnectX-3 card; port 1 is connected to the IB network and port 2 to the 40GbE network.

Now, I mentioned earlier that we also have a 1GbE management network. According to the GPFS docs, it's possible to set an admin node name to tell GPFS to use that network name for admin traffic. As we created the cluster using the 'normal' host names on the 1GbE network, it's difficult for us to specify a different name unless we have yet another network for admin traffic, which isn't what we want.

The solution here is the GPFS subnets config option; this lets us specify the network of the high-performance storage network (the 40GbE one), which GPFS will then use in preference for node communication. So we can work around not having a separate admin network name by using this:
mmchconfig privateSubnetOverride=yes
mmchconfig subnets=""

In our system, the subnet specified is the network assigned to the VLAN-tagged interface for storage on the 40GbE cards. In fact Sven (IBM GPFS team) confirmed that GPFS will run traffic in the preference order "RDMA, subnets, default", so our data traffic will ideally run over the InfiniBand, then the 40GbE network, and finally over the 1GbE network as a last resort. Additionally, as we set the admin name to the main hostname, the admin traffic should run over the 1GbE network and be separate, though I concede it's not clear from the docs whether this will still prefer subnets.

Block sizes and metadata space

Initially I was sizing metadata requirements based on 5-10% of usable storage, but I've since come across a couple of docs indicating that this isn't a great way of sizing. The first is by Scott (an IBM GPFSer), and the second is really a summary. In short, the worst case for metadata is 16KB per file or directory, so for 40 million files, double replicated, that's about 1.3TB of metadata - a lot less than the ~12TB estimated for 250TB usable space at 5% ish. OK, so we'll also be using snapshots for Glance image clones, but I don't expect the image blocks to actually vary massively once provisioned, and we have ~10TB of metadata space. I guess the worst case is we have to disable metadata replication at some point in the future if we need more metadata space!

Our data going onto the filesystem is likely to be mostly big files, as it's either VM images for OpenStack, ephemeral Nova disks, or genome data (200MB-3GB files), so I decided to go with quite a large block size (8MB). This is of course a multiple of the RAID strip size (256KB), and 1/32nd of it (the sub-block size) is 256KB, so it should align nicely with the underlying RAID controller strip.
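As a sanity check on that alignment claim, the arithmetic is trivial:

```shell
# GPFS divides each block into 32 sub-blocks (true for GPFS 4.1)
block_kb=8192   # chosen data block size: 8MB
strip_kb=256    # v3700 RAID strip size
subblock_kb=$((block_kb / 32))
echo "sub-block: ${subblock_kb}KB"
[ "$subblock_kb" -eq "$strip_kb" ] && echo "matches RAID strip"
```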

Which brings me on to inode size. I was originally going with the default, but Jim at IBM suggested I think about 4K inodes. GPFS is nice in that small files can actually be contained in the inode as part of the metadata, so this seems like a nice compromise for the 8MB data block size: the few config files we might have can sit inside the inode in metadata, while the majority of large files will fit nicely in the 8MB data blocks.

Bearing in mind the big data block size, we need to tweak a couple of config options: the first to increase the max block size, and the second to increase the page pool. The default is 64M for 256K blocks; 8192/256 = 32, so we want a page pool sized 32 * 64 = 2048MB:
mmchconfig maxblocksize=8192K
mmchconfig pagepool=2048M
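The sizing above is just scaling the default pagepool linearly with the block size; as a quick check:

```shell
# default pagepool is 64MB for the default 256KB block size;
# scale linearly for an 8MB (8192KB) block size
echo "$((8192 / 256 * 64))MB"
```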

I'm using the stanza format for NSD definition in a file; we'll define both the NSDs and the pools in the file (for NSD creation some of the lines are ignored, but they are used when creating the file system). A trimmed sample of the stanza file is below (the metadata NSD names are our real ones; the data NSD name and failure group numbers are illustrative):

 %nsd: device=dm-2
   nsd=climb_v3700_clds01_md_lun01
   servers=server1,server2
   usage=metadataOnly
   failureGroup=1
   pool=system

 %nsd: device=dm-3
   nsd=climb_v3700_clds01_md_lun02
   servers=server1,server2
   usage=metadataOnly
   failureGroup=2
   pool=system

 %nsd: device=dm-4
   nsd=climb_v3700_clds01_data_lun01
   servers=server1,server2
   usage=dataOnly
   failureGroup=10
   pool=nlsas

 %pool: pool=nlsas blockSize=8M usage=dataOnly
In this file you can see that the first two NSDs are for metadata only and are in different failure groups, to allow replication of metadata. There are then a number of NSDs which are data only: the NL-SAS LUNs from the v3700 array. Finally the pools are defined; this isn't used by mmcrnsd, but is used by mmcrfs later. One thing to note is the device name dm-X, which is the multipath device name. Look very carefully at these: the name is as seen on the first listed server (as these may vary across servers) - GPFS writes onto the disk header so the other server can find it.

And now actually create the NSDs:
mmcrnsd -F climb.nsd.gpfs -v yes

Update (02-09-2014) - I've since changed the NSDs so that the server order isn't always server1,server2; it's now balanced across the two servers and across the canisters, such that each server has an equal number of primary NSDs and they are evenly distributed over the LUNs by their preferred canister owner.

As we have a two-node cluster, we need to have tiebreaker disks enabled, so we're just going to use the two NSDs we are planning to use for metadata:
mmchconfig tiebreakerDisks="climb_v3700_clds01_md_lun01;climb_v3700_clds01_md_lun02"

And actually create the file system!

Now we need to actually create the filesystem; to do this, we need GPFS running on both of the GPFS server systems. We're using the same NSD config file defined above when creating the NSDs; this assigns the NSDs into pools and sets the pool and underlying storage block size.
mmcrfs climbgpfs -F climb.nsd.gpfs --filesetdf --perfileset-quota -Q yes -A yes -z yes -D nfs4 -i 4K -B 8M -m 2 -k all -n 26 -r 1 -T /climb --metadata-block-size 256K
mmchfs climbgpfs -z no

Just to clarify those options:
--filesetdf - df on a file-set will return the quota of the file-set, not of the whole filesystem
--perfileset-quota - enable quotas on file-sets
-Q yes - activate quotas on file-system mount
-A yes - automatically mount the file-system on GPFS startup
-z yes - enable DMAPI
-D nfs4 - deny-write locks for NFS4 clients, not sure if we will use NFSv4, but needed if we will
-i 4K - 4k inode size
-B 8M - GPFS block size
-m 2 - 2 metadata replicas by default
-k all - allow (NFSv4 and POSIX) ACLs
-n 26 - 26 nodes to be in the cluster (we don't expect it to go significantly higher)
-r 1 - 1 replica of data by default
-T /climb - mount point
--metadata-block-size 256K - block size for metadata blocks.

Note that if you want GPFS data block sizes different from the metadata block size, then you need separate pools for data and metadata.

And now let's mount the file system... note that initially the file-system failed to mount; this was because I had "-z yes" enabled at file-system creation time, which enables DMAPI. My understanding from talking to people previously is that DMAPI needs to be enabled at creation time if you plan to use it. I'm not sure whether we will for this project, so I enabled it anyway, but because there is no HSM component installed, the file-system can't mount - hence the mmchfs -z no command above.
mmmount all

At this point I was momentarily stumped by not being able to create new files on the file system. Well, actually I could create some files, but not big ones (nor vi swap files, actually!). Of course, this was a file placement problem. As my system pool only contains NSDs which are marked metadataOnly, there's no space available for actual data files; small files could still be created because they fit in the inode itself and hence can go into the system pool. The solution is to create a GPFS file placement policy. Mine is simple, just a single rule in the file at present, to put all files into the nlsas pool. Rules are evaluated in sequential order in the file, so if we had other pools, we could have placement rules for specific file-sets or file extensions.
/* files are placed on nlsas */
RULE 'default' SET POOL 'nlsas'
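If we did later add, say, a faster pool, extra rules could route data by file-set or extension ahead of the default - something along these lines (the 'ssd' pool here is hypothetical):

```
/* hypothetical: put one file-set's files on a faster pool */
RULE 'ostackfast' SET POOL 'ssd' FOR FILESET ('openstack-data')
/* everything else falls through to nlsas */
RULE 'default' SET POOL 'nlsas'
```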

And activate the policy file:
mmchpolicy climbgpfs climb.policy.gpfs

Finally, we want a couple of file-sets to contain our data. We may want to apply quotas eventually, and as we enabled quotas at file-system creation time, we can do that later without shutting the cluster down or unmounting the file-system:
mmcrfileset climbgpfs openstack-swift -t "data for swift"
mmlinkfileset climbgpfs openstack-swift -J /climb/openstack-swift

mmcrfileset climbgpfs openstack-data -t "data for glance/cinder/nova"
mmlinkfileset climbgpfs openstack-data -J /climb/openstack-data

mmcrfileset climbgpfs climb-data -t "general CLIMB data"
mmlinkfileset climbgpfs climb-data -J /climb/climb-data

Now, I've mentioned that we will be running the GPFS driver for OpenStack on top of this file-system, so we have two file-sets: one which will be used for Swift data, and a second for Glance/Cinder/Nova ephemeral disks. The theory behind this is that the GPFS driver can use snapshot clones when provisioning Glance images, so by placing Glance and Cinder on the same file-set, a snapshot provision of a Glance image onto Cinder block storage should happen almost instantly, regardless of the size of the image. Placing the Nova ephemeral disks onto GPFS also allows live migration of VMs, as the ephemeral disks are on shared storage. Swift is on a separate file-set to allow ease of management, backup etc.

The file-set config is based on a suggestion from a contact inside IBM, and it makes sense, so in the absence of other guidelines, I'm happy to run with it.

And a little performance testing...

Just to test out the file-system and v3700, I built the gpfsperf tool and ran a couple of tests from an NSD client node in the GPFS cluster. I'm fairly sure we can run the storage array flat out over the InfiniBand network; the following is creating a ~1GB file with a random write pattern, running over GPFS:
./gpfsperf create rand /climb/climb-data/perf
  recSize 1045773310 nBytes 1045773310 fileSize 1045773310
  nProcesses 1 nThreadsPerProcess 1
  file cache flushed before test
  not using data shipping
  not using direct I/O
  offsets accessed will cycle through the same file segment
  not using shared memory buffer
  not releasing byte-range token after open
  no fsync at end of test
    Data rate was 887524.23 Kbytes/sec, thread utilization 0.999

So about 870MB/sec, which I think is around the maximum we can expect from a 6Gbit SAS controller ...
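(For reference, gpfsperf reports Kbytes/sec, so converting the figure above:)

```shell
# gpfsperf reported 887524.23 Kbytes/sec; in Mbytes/sec that's roughly
echo "$((887524 / 1024)) MB/sec"
```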

I also ran the test having shut down openibd, so only the 1GbE link was available:
./gpfsperf create rand /climb/climb-data/perf
  recSize 1045773310 nBytes 1045773310 fileSize 1045773310
  nProcesses 1 nThreadsPerProcess 1
  file cache flushed before test
  not using data shipping
  not using direct I/O
  offsets accessed will cycle through the same file segment
  not using shared memory buffer
  not releasing byte-range token after open
  no fsync at end of test
    Data rate was 190084.41 Kbytes/sec, thread utilization 0.980

That's ~185MB/sec, which is actually more than a 1GbE link can carry, so I suspect some of the data was still sitting in the pagepool when the test finished (note there's no fsync at the end of the test). Still, the difference shows the IB link was doing the work in the first run. I haven't yet managed to test with just the 40GbE link up; that would mean walking down and unplugging the FDR cable, as stopping openibd unloads the mlx4_core driver and so the 40GbE link also drops.

Suffice to say, I think we should get some nice performance out of the array. Whether or not I've picked the right magic numbers for block size etc. for the use-case remains to be seen, but the whole project is a bit of an experiment. We might have to rebuild the GPFS file-system later, but if we do, it's not the end of the world!

Finally, I've not made reference to file-system descriptor quorum. This is also important: GPFS will normally write 3 primary copies of the descriptor across NSDs in different failure groups, and if these are lost, the file-system becomes unusable. As we only have one storage array behind GPFS, I'm not too worried, as losing the storage array means we'll lose data anyway. If there were more storage arrays, then I might worry a bit more about this, as well as about different pools for different file-sets across the arrays.

I'll post more on actually using GPFS with OpenStack when I get a chance to configure and test it!

UPDATE (Jan 2015):

I'm thinking about reducing the block size to 2MB for the GPFS file-system used for OpenStack images. This is because the VM images are likely to be doing small Linux inode updates, and with an 8MB block size, if an update lands in a block that's not been used before, GPFS will have to zero the full 8MB block, which could be a significant overhead for what should be a small write.
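A rough worst case, taking the assumption above that GPFS zeroes the whole block when a small (say 4KB) guest write lands in a previously unwritten block:

```shell
# write amplification: bytes zeroed per 4KB write into a fresh block
echo "8MB block: $((8 * 1024 / 4))x"
echo "2MB block: $((2 * 1024 / 4))x"
```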


You may also be interested in my post on using and testing the GPFS Cinder driver!


  1. Interesting post! If you check mmdiag --iohist while virtual machines are running on GPFS, you will notice KVM behaviour on GPFS: very small writes.

    Could you try to go for a 1M or 512K block size? I would expect a performance boost.

    1. The strip size on the v3700 storage arrays is 256KB, so with 8+2P, 2MB blocks align nicely with the stripe for writing. I need to have a look and see if the strip size is configurable on the controllers; if so, then we could get smaller block sizes.

    2. Totally agree. I'm very interested in GPFS performance with Cinder, please let me know how things go.