January has seen a busy start to the year! And not from any of the major investments we had approved at the end of 2015!
No, this has come from our deployment of new technologies for HPC systems: in the past week or two we've taken delivery of a direct water cooled HPC rack (well, half a rack).
We're not afraid to try out new technology to help deliver our services, whether that is moving to the latest IBM Spectrum Scale release or building cloud services on OpenStack.
This has, however, probably been our longest-running new technology development, and it's entirely a hardware solution. The plan came together following a visit to Lenovo's (then just opened) site in Raleigh, NC, where they demoed their direct water cooled HPC technology. Wind on a few months and we were considering our options for new systems to replace our SandyBridge-based iDataPlex. Our strategic framework with OCF and Lenovo meant the realistic option was NeXtScale, and a few calculations, an air cooled data centre with no ability to fit rear door heat exchangers, and the resulting ~7.5kW per-rack limit meant we were seriously looking at the WCT (water cooled) hardware.

The spec of the standard WCT system wasn't quite what we wanted, so we approached Lenovo about getting storage into the systems - we use full fat OS deployments, and some of our workloads perform significantly better with local data storage. We eventually agreed on getting SSDs into the systems, and having gone through that process, we also wanted to add Mellanox 100Gb/s EDR InfiniBand. (Actually, we want some of the cool features on the ConnectX-4 ASIC.) This has proven rather harder to actually get, but we have EDR switches and cables, and will be adding water cooled EDR cards once they've finally been manufactured for us. Of course, in a system with no fans to cool other components, a lot of testing has gone into making sure the cards can be properly cooled.
Realistically, it's taken us nine months from agreeing we were going to go WCT with this kit to getting our first tests on it. It's taken longer than we anticipated (we ordered it last summer!) and we've learnt a lot from the process. We always knew we'd have a 4-6 month build on the facilities side to get planning permission and install the dry air coolers, but we also had delays getting the SSDs, the EDR kit and silly things like the right valves and hoses!
The infrastructure we've installed is designed to be scalable and to integrate with other warm water cooled systems, which have been appearing in growing numbers over the past year - looking round the SC15 show floor, there were a number of options for direct cooled systems, significantly more than at SC14. So it's good to see that the infrastructure we designed back in July will integrate with hardware platforms from other vendors.
A little tinkering with xCAT and some updates to our deployment system, and today I finally got most of the compute nodes running an Intel LINPACK test on the hardware. Monitoring the load for about 30 minutes, I managed to get our supply temperature up to ~25C with a return of 29C, so a 4C delta across 15 compute nodes running flat out. This is a lot less than we originally expected. Checking the kit, we were also getting sustained turbo to 3.0GHz on all cores of the 2x 12-core 2.6GHz SKUs we have fitted per node.
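For anyone wanting to sanity-check a delta like that against the rack's electrical load, the heat picked up by the loop follows directly from the flow rate and temperature rise. A minimal sketch - the 30 l/min flow rate here is a made-up illustrative figure, not our commissioned value:

```python
# Back-of-the-envelope loop heat load: Q = m_dot * c_p * dT.
# The flow rate below is a hypothetical example value; substitute the
# figure from your own commissioning data.

WATER_DENSITY = 997.0         # kg/m^3, approx. at 25C
WATER_SPECIFIC_HEAT = 4181.0  # J/(kg*K), approx. at 25C

def loop_heat_load_kw(flow_lpm: float, supply_c: float, return_c: float) -> float:
    """Heat carried away by the water loop, in kW."""
    mass_flow_kg_s = (flow_lpm / 1000.0 / 60.0) * WATER_DENSITY
    return mass_flow_kg_s * WATER_SPECIFIC_HEAT * (return_c - supply_c) / 1000.0

# Assumed 30 l/min loop flow, with the 25C supply / 29C return seen above:
print(round(loop_heat_load_kw(30.0, 25.0, 29.0), 1))  # ~8.3 kW
```

If the number that falls out is well below what the PDUs report, either the flow rate estimate is off or a chunk of the heat is still going to air.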
Really, I want to run the water loops much warmer than this; depending on the CPU SKU this could be up to 45C, but I think we'll aim for 40C. The thermal properties of water are funny: interestingly, the warmer it is, the better it is at carrying the heat load away with it...
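A warmer loop setpoint also buys more hours of free cooling from the dry air coolers mentioned earlier: the cooler can only bring the loop back to its supply setpoint when that setpoint sits above ambient plus the cooler's approach temperature. A rough sketch of the check - the 7C approach is an assumed design figure, not a measured one:

```python
# Free-cooling feasibility check for a dry air cooler.
# DRY_COOLER_APPROACH_C is a hypothetical design approach temperature.

DRY_COOLER_APPROACH_C = 7.0

def free_cooling_ok(loop_supply_c: float, ambient_c: float,
                    approach_c: float = DRY_COOLER_APPROACH_C) -> bool:
    """True if the dry cooler can return the loop to its supply setpoint
    at the given ambient air temperature."""
    return loop_supply_c >= ambient_c + approach_c

# A 40C setpoint still free-cools on a 30C summer day;
# a 25C setpoint would need supplementary cooling:
print(free_cooling_ok(40.0, 30.0))  # True
print(free_cooling_ok(25.0, 30.0))  # False
```

This is why the 40C target matters: the higher the setpoint, the fewer days of the year you need anything beyond the dry coolers.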
Installation and final commissioning of the rack took place last week when we worked with the Lenovo engineers, our Estates team and their contractor team to balance and air-bleed the water loops. No leaks so far!
Looking to the future, we're talking to a few technology companies about getting early access to some new hardware to support our projects over the next 18 months. We're also always open to looking at the options available from a whole range of vendors.