One thing to note: make sure all your nodes have the same version of GPFS installed. I've switched to testing with GPFS 4.1 for this project, but some of the nodes in the cluster were installed before the move to 4.1 and so still had some 3.5 remnant packages on them.
Secondly, if you have GPFS Standard or Advanced Edition licenses, make sure you also add the gpfs.ext package (I got an error about pools not being supported in GPFS Express Edition because, well, gpfs.ext didn't exist before and so wasn't in my xCAT package list)!
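A quick way to spot version mismatches (and a missing gpfs.ext) is to compare the installed GPFS packages across the nodes; something like the following, assuming an xCAT-managed cluster with a node group called 'all' (adjust the group name to suit):
xdsh all 'rpm -qa | grep ^gpfs' | sort
Any node still showing 3.5.x packages will stand out immediately.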
Cluster config
All the systems in the cluster are equipped with Mellanox ConnectX-3 VPI cards; these are dual-personality cards supporting FDR InfiniBand or 40GbE. We also have a 1GbE management network used for general tasks. For GPFS, we're planning to use verbs (RDMA) running over the InfiniBand fabric, falling back to 40GbE and finally 1GbE if needed. There's an SX6036 FDR switch, an SX1036 40GbE switch, and an IBM 1GbE switch for the management side of things.
This will be a two-server-node cluster; each server is an IBM x3650 M4 system which is direct SAS attached to the v3700 storage array. I've already blogged about the v3700 LUN config, so I won't go over it here.
I'll assume at this point that the GPFS RPMs are already installed: gpfs.base, gpfs.gpl, gpfs.ext, gpfs.msg, and gpfs.gplbin appropriately built for the kernel in use. I also have gpfs.docs on my servers so the man pages are available.
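If you still need to build the portability layer for the running kernel, the usual route (a rough sketch, assuming the standard RPM install paths) is:
cd /usr/lpp/mmfs/src
make Autoconfig
make World
make InstallImages
make rpm   # optionally build a gpfs.gplbin RPM to push out to the other nodes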
First up is to create a node file listing all the nodes that are in the cluster (note that the nodes must be booted and reachable over ssh as root). The format of the file is something like:
server1:manager-quorum
server2:manager-quorum
client1:client-nonquorum
client2:client-nonquorum
...
Now from one of the servers, create a new GPFS cluster by running:
mmcrcluster -N climb.nodefile.gpfs --ccr-disable -p server1 -s server2 -A -C climbgpfs -r /usr/bin/ssh -R /usr/bin/scp -U climb.cluster
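Once that completes, it's worth a quick sanity check that the cluster looks as expected:
mmlscluster
This should show the cluster name, the primary and secondary configuration servers, and the quorum/manager designations from the node file.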
I also created a couple of node list files to pass to mmchlicense; it can't read the syntax of the node file used to create the cluster, so I have two files. First climb.serverlist.gpfs:
server1
server2
and climb.nodelist.gpfs:
client1
client2
...
We now need to confirm that we have appropriately licensed all of our nodes:
mmchlicense server --accept -N climb.serverlist.gpfs
mmchlicense client --accept -N climb.nodelist.gpfs
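You can confirm the license designations took with something like:
mmlslicense -L
which should list the license type assigned to each node.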
Before we get on with creating NSDs and filesystems, there are a bunch of cluster settings we want to configure. First, we're going to restrict the port range used for some GPFS admin-type commands; this will be handy if we ever get around to firewalling, or if we need to expose the cluster over IP to a remote cluster.
mmchconfig tscCmdPortRange=30000-30100
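If we do get around to firewalling, the rules mainly need to allow the GPFS daemon port (1191/tcp) plus this admin range; a minimal iptables sketch (in practice you'd also restrict by source address):
iptables -A INPUT -p tcp --dport 1191 -j ACCEPT
iptables -A INPUT -p tcp --dport 30000:30100 -j ACCEPT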
We also want to configure verbs so we can use RDMA over the Infiniband network:
mmchconfig verbsPorts="mlx4_0/1"
mmchconfig verbsRdma=enable
mmchconfig verbsRdmaSend=yes
The systems all have Ivy Bridge based CPUs, so multiple NUMA domains are likely to be present; set the flag to allow GPFS to interleave its memory usage across domains and prevent running out of memory in a single domain:
mmchconfig numaMemoryInterleave=yes
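You can check the NUMA layout (and that there really is more than one domain) with:
numactl --hardware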
Note that our systems have a single dual-port ConnectX-3 card: port 1 is connected to the IB network and port 2 to the 40GbE network.
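If you want to double-check which port has which personality, ibv_devinfo is handy:
ibv_devinfo -d mlx4_0 | grep -E 'port:|link_layer'
Port 1 should report link_layer InfiniBand and port 2 should report Ethernet (the 40GbE side).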
Now I mentioned earlier that we also have a 1GbE management network. According to the GPFS docs, it's possible to use the admin node name to tell GPFS which network to use for admin traffic. However, as we created the cluster using the 'normal' host names on the 1GbE network, it's difficult for us to specify a different admin node name unless we add yet another network for admin traffic, which isn't what we want.
The solution here is to use the GPFS subnets config option; this lets us specify the high-performance storage network (the 40GbE one), which GPFS will then use in preference for node-to-node communication. So we can work around the admin node name issue by using this:
mmchconfig privateSubnetOverride=yes
mmchconfig subnets="10.30.13.0/22"
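Once the daemons are up and the nodes are talking to each other, you can confirm that connections really are being made over the 10.30.13.x addresses rather than the 1GbE ones with:
mmdiag --network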
Block sizes and metadata space
Initially I was sizing metadata requirements based on 5-10% of usable storage, but I've since come across a couple of docs indicating that this isn't a great way of sizing. The first is by Scott (an IBM GPFSer), and the second is really a summary. In short, the worst case for metadata is 16KB per file/directory, so for 40 million files, double replicated, that's about 1.3TB of metadata - a lot less than the ~12TB estimated for 250TB of usable space at 5% ish. OK, so we'll be using snapshots as well for glance image clones, but I don't expect the image blocks to actually vary massively once provisioned, and we have ~10TB of metadata space. I guess the worst case is we have to disable metadata replication at some point in the future if we need more metadata space!

Our data going onto the filesystem is likely to be mostly big files, as it's either VM images for OpenStack, ephemeral nova disks, or genome data files (200MB-3GB), so I decided to go with quite a large block size (8MB). This is of course a multiple of the RAID strip size (256KB), and 1/32nd of it (the sub-block size) is 256KB, so it should align nicely with the underlying RAID controller strip.
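To show the working behind those numbers (simple shell arithmetic, figures as above):
echo $(( 40000000 * 16 * 2 / 1000 / 1000 ))   # worst-case metadata in GB: 40M files x 16KB x 2 replicas = ~1280GB (~1.3TB)
echo $(( 8 * 1024 / 32 ))                     # sub-block size in KB: an 8MB block / 32 sub-blocks = 256KB, matching the RAID strip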
Which brings me on to inode size. I was originally going with the default, but Jim at IBM suggested I think about 4K inodes. GPFS is nice in that small files can actually be stored in the inode as part of the metadata, so this seems like a nice compromise for the 8MB data block size: the few config files we might have can sit inside the inode in metadata, while the majority of large files will fit nicely into the 8MB data blocks.
Bearing in mind the big data block size, we need to tweak a couple of config options: the first to increase the max block size, and the second to increase the page pool. The default is 64MB for 256KB blocks; 8192/256 = 32, so we want a page pool sized 32 * 64 = 2048MB:
mmchconfig maxblocksize=8192K
mmchconfig pagepool=2048M
Next we need an NSD stanza file (climb.nsd.gpfs) describing the LUNs; ours looks something like this:
%nsd: device=dm-2
nsd=climb_v3700_clds01_md_lun01
servers=server1,server2
usage=metadataOnly
failuregroup=10
pool=system
%nsd: device=dm-3
nsd=climb_v3700_clds01_md_lun02
servers=server1,server2
usage=metadataOnly
failuregroup=20
pool=system
%nsd: device=dm-4
nsd=climb_v3700_clds01_nls_lun01
servers=server1,server2
usage=dataOnly
failuregroup=11
pool=nlsas
...
%pool:
pool=system
blockSize=256K
usage=metadataOnly
%pool:
pool=nlsas
blockSize=256K
usage=dataOnly
In this file you can see that the first two NSDs are for metadata only and are in different failure groups to allow replication of the metadata. There are then a number of NSDs which are data only and are the NL-SAS LUNs from the v3700 array. Finally the pools are defined; this part isn't used by mmcrnsd, but is needed for mmcrfs later. One thing to note is the device name dm-X: this is the multipath device name. Look very carefully at these - the name refers to the device as seen on the first listed server (device names may vary across servers), and GPFS writes onto the disk header so the other server can find it.
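Before writing the stanza file, it's worth confirming what the dm-X names correspond to on that first server; for example:
multipath -ll
ls -l /dev/mapper/
will show which dm device maps to which v3700 LUN, so you don't accidentally put metadata on an NL-SAS LUN.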
Now actually create the NSDs:
mmcrnsd -F climb.nsd.gpfs -v yes
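Once that completes you can map each NSD back to its device on the server nodes with:
mmlsnsd -m
which is a good point to double-check the metadata/data split before going any further.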
As we have a two node cluster, we need tiebreaker disks enabled, so we're just going to use the two NSDs we're planning to use for metadata:
mmchconfig tiebreakerDisks="climb_v3700_clds01_md_lun01;climb_v3700_clds01_md_lun02"
And actually create the file system!
Now we need to actually create the filesystem; to do this, we need GPFS running on both of the server systems. We're reusing the same NSD config file defined above when creating the NSDs; this assigns the NSDs to pools and sets the per-pool block sizes for the underlying storage.
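If GPFS isn't already running on the servers, a quick way to bring it up and check the daemon state is:
mmstartup -a
mmgetstate -a
Both servers should report 'active'. Then create the file-system: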
mmcrfs climbgpfs -F climb.nsd.gpfs --filesetdf --perfileset-quota -Q yes -A yes -z yes -D nfs4 -i 4K -B 8M -m 2 -k all -n 26 -r 1 -T /climb --metadata-block-size 256K
mmchfs climbgpfs -z no
Just to clarify those options:
--filesetdf - df on a file-set will return the quota of the file-set, not of the whole filesystem
--perfileset-quota - enable quotas on file-sets
-Q yes - activate quotas on file-system mount
-A yes - automatically mount the file-system on GPFS startup
-z yes - enable DMAPI
-D nfs4 - deny-write open locks for NFSv4 clients; not sure if we will use NFSv4, but needed if we do
-i 4K - 4k inode size
-B 8M - GPFS block size
-m 2 - 2 metadata replicas by default
-k all - allow (NFSv4 and POSIX) ACLs
-n 26 - 26 nodes to be in the cluster (we don't expect it to go significantly higher)
-r 1 - 1 replica of data by default
-T /climb - mount point
--metadata-block-size 256K - block size for metadata blocks.
Note that if you want the GPFS data block size to differ from the metadata block size, then you need to have separate pools for data and metadata.
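Once the file-system is created, mmlsfs will read all of those settings back if you want to double-check them:
mmlsfs climbgpfs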
Now mount the file-system:
mmmount all
We also need a file placement policy: with more than one pool, new files need a rule telling GPFS which pool to place them in (the system pool here is metadata-only). So climb.policy.gpfs is simply:
/* files are placed on nlsas */
RULE 'default' SET POOL 'nlsas'
And activate the policy file:
mmchpolicy climbgpfs climb.policy.gpfs
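You can confirm the installed rules with:
mmlspolicy climbgpfs -L
With the policy in place, create and link the file-sets: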
mmcrfileset climbgpfs openstack-swift -t "data for swift"
mmlinkfileset climbgpfs openstack-swift -J /climb/openstack-swift
mmcrfileset climbgpfs openstack-data -t "data for glance/cinder/nova"
mmlinkfileset climbgpfs openstack-data -J /climb/openstack-data
mmcrfileset climbgpfs climb-data -t "general CLIMB data"
mmlinkfileset climbgpfs climb-data -J /climb/climb-data
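A quick check that the file-sets exist and are linked at the right junction paths:
mmlsfileset climbgpfs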
Now I've mentioned that we will be running the GPFS driver for OpenStack on top of this file-system, so we have two OpenStack file-sets: one which will be used for Swift data and a second for glance/cinder/nova ephemeral disks (plus a third for general CLIMB data). The theory behind this is that the GPFS driver can use snapshot clones when provisioning glance images, so by placing glance and cinder on the same file-set, a snapshot provision of a glance image onto cinder block storage should happen almost instantly regardless of the size of the image. Placing the nova ephemeral disks onto GPFS also allows live migration of VMs, as the ephemeral disks are on shared storage. Swift is on a separate file-set to allow ease of management, backup etc.
The file-set config is based on a suggestion from a contact inside IBM, and it makes sense, so in the absence of other guidelines, I'm happy to run with it.
And a little performance testing...
Just to test out the file-system and v3700, I built the gpfsperf tool and ran a couple of tests from an NSD client node in the GPFS cluster. I'm fairly sure we can run the storage array flat out over the InfiniBand network; the following is creating a large file with a random write pattern over GPFS:
./gpfsperf create rand /climb/climb-data/perf
recSize 1045773310 nBytes 1045773310 fileSize 1045773310
nProcesses 1 nThreadsPerProcess 1
file cache flushed before test
not using data shipping
not using direct I/O
offsets accessed will cycle through the same file segment
not using shared memory buffer
not releasing byte-range token after open
no fsync at end of test
Data rate was 887524.23 Kbytes/sec, thread utilization 0.999
So roughly 870MB/sec, which I think is about the maximum we can expect from a 6Gbit SAS controller ...
I also ran the test having shut down openibd, so we only have the 1GbE link available:
./gpfsperf create rand /climb/climb-data/perf
recSize 1045773310 nBytes 1045773310 fileSize 1045773310
nProcesses 1 nThreadsPerProcess 1
file cache flushed before test
not using data shipping
not using direct I/O
offsets accessed will cycle through the same file segment
not using shared memory buffer
not releasing byte-range token after open
no fsync at end of test
Data rate was 190084.41 Kbytes/sec, thread utilization 0.980
I'm not sure that's as fast as we could get over 1GbE, but it shows that the IB link was working nicely. I haven't yet managed to test with just the 40GbE link up; that would mean walking down and unplugging the FDR cable, as stopping openibd unloads the mlx4_core driver and so the 40GbE link drops as well.
Suffice to say, I think we should get some nice performance out of the array. Whether or not I've picked the right magic numbers for block size etc. for the use-case remains to be seen, but the whole project is a bit of an experiment. We might have to rebuild the GPFS file-system later, but if we do, it's not the end of the world!
Finally, I've not made reference to file-system descriptor quorum. This is also important: GPFS will normally write three copies of the file-system descriptor across NSDs in different failure groups, and if these are lost, the file-system becomes unusable. As we only have one storage array behind GPFS, I'm not too worried, as losing the storage array means we'll lose data anyway. If there were more storage arrays, then I might worry a bit more about this, as well as about using different pools for different file-sets across the arrays.
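Incidentally, you can see which NSDs currently hold the file-system descriptor copies (they're flagged 'desc' in the remarks column) with:
mmlsdisk climbgpfs -L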
I'll post more on actually using GPFS with OpenStack when I get a chance to configure and test it!
UPDATE (Jan 2015):
I'm thinking about reducing the block size to 2MB for the GPFS file-system used for OpenStack images. This is because the VM images are likely to be doing small Linux inode updates, and with an 8MB block size, if an update lands in a block that hasn't been used yet, GPFS will have to zero the full 8MB block, which could be a significant overhead for what should be a small write.
Comment: Interesting post! If you check mmdiag --iohist while a virtual machine is running on GPFS you will notice KVM's behaviour on GPFS: very small writes. Could you try going for a 1M or 512K block size? I would expect a performance boost.
Reply: The strip size on the v3700 storage arrays is 256KB, so with 8+2P, 2MB blocks align nicely with the stripe for writing. I need to have a look and see if the strip size is configurable on the controllers; if so, then we could go for smaller block sizes.
Comment: Totally agree. I'm very interested in GPFS performance with Cinder, please let me know how things go.