Sunday, 7 September 2014

Breaking GPFS by reinstalling a server...!

My GPFS servers are directly attached to the storage LUNs via SAS cards.

As part of my DR testing, I reinstalled one of them and GPFS failed to mount. Nor would it remount on the other nodes in the cluster. A bit of digging in the log files indicated that GPFS thought the disks were corrupt (though data was still visible on the GPFS file system).

A bit more digging and I worked out what had happened: the xCAT kickstart template has initlabel in it, and because Anaconda could see the LUNs during the install, that's exactly what it did. It wrote new disk labels over all visible disks, wiping the GPFS disk descriptor.

So the safest solution is to add addkcmdline=mpt2sas.blacklist=yes to the xCAT config for the server nodes. This blacklists the driver for the SAS card, so the installer can't see the LUNs and therefore can't wipe them. Basically, it gets passed as a kernel parameter when the system boots off the network into Anaconda for the install.

You of course also need a postscript to clean up the /etc/modprobe.d/anaconda.conf file so the LUNs appear on first boot.
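As a rough idea of what that cleanup might look like, here is a minimal sketch in Python. It assumes the installer left a line of the form `blacklist mpt2sas` in the file, so check what is actually there on your own nodes first. The function names are made up for illustration, and this is not the postscript I actually run.

```python
"""Sketch of a postscript helper: drop the installer's mpt2sas blacklist entry."""
import re

# Assumed path and line format; verify against what Anaconda wrote on your nodes.
CONF_PATH = "/etc/modprobe.d/anaconda.conf"
BLACKLIST_RE = re.compile(r"^\s*blacklist\s+mpt2sas\s*(#.*)?$")


def strip_blacklist(text):
    """Return text without any 'blacklist mpt2sas' lines; other lines are kept."""
    kept = [line for line in text.splitlines(True) if not BLACKLIST_RE.match(line)]
    return "".join(kept)


def clean_file(path=CONF_PATH):
    """Rewrite the file in place; return True if a blacklist line was removed."""
    with open(path) as handle:
        original = handle.read()
    cleaned = strip_blacklist(original)
    if cleaned != original:
        with open(path, "w") as handle:
            handle.write(cleaned)
    return cleaned != original
```

A postscript would just call clean_file() once at the end of the install. Only exact mpt2sas entries are removed, so any other blacklist lines in the file are left alone.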

I have a script to rebuild the GPFS file system, and it didn't really have any data on it. Anyway, it shows the importance of testing the DR process before you need it...

Hats off to Laurence at OCF for the tip to blacklist.
