Wednesday, 10 September 2014

Render farm management with PipelineFX Qube!

Something a bit different from recent ramblings on OpenStack! As part of our research support infrastructure we've planned to provide a render farm to allow high resolution stills or video to be rendered. Right now it's a small render farm made up of one controller and a couple of worker nodes. We got quite a bit of the kit some time ago, but other projects have taken priority, so we placed the workers into a general purpose HPC queue so they weren't being wasted.

I'll try to be careful to refer to it as a render farm rather than cluster, but if I mention cluster, read that as farm ;-)

Getting the render farm to proof of concept stage has now bubbled to the top of my list.

For various reasons we're using PipelineFX Qube! as the render manager software. It integrates with a number of 3D rendering applications and runs across Windows, OS X and Linux.

The farm is made up of one controller node (the supervisor) running Linux and two render (worker) nodes running Windows 7. Initially I'd hoped to run it all under Linux, but one of the applications we have licensed (Autodesk 3D Studio Max) is only available under Windows. The render nodes have a couple of applications installed directly on them for rendering (Blender, Autodesk 3D Studio Max and Autodesk Maya); if we get demand we'll add more later.

Qube! includes its own installer for all platforms to install the supervisor, worker and client applications. However, we like to deploy our Linux boxes with zero touch, so we use xcat to deploy the software. We also use xcat to deploy the config and license files, so we can keep them in a centrally backed-up repository.

Qube! requires MySQL for its data warehouse backend. The installer will try to install this for you, but we include it as part of our xcat image, with a standard script to lock down the MySQL install. Ideally I'd like to run the database on our clustered HA database service, but as of Qube! 6.5 they only support MyISAM tables, which don't work with Galera clustering. I did ask one of their tech guys about this and they suggested something along the lines of MyISAM giving better performance. Whilst that may have been true many years ago, I'm not sure it holds now. Still, we are where we are.
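
For reference, the lockdown script is nothing Qube!-specific; a minimal sketch of the sort of thing it does (roughly what mysql_secure_installation would do, the details being assumptions about our image rather than anything Qube! requires) would be:

#!/bin/bash
# Basic MySQL lockdown: remove anonymous accounts and the test database,
# and restrict root to local connections only.
mysql -u root <<'EOF'
DELETE FROM mysql.user WHERE User='';
DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');
DROP DATABASE IF EXISTS test;
DELETE FROM mysql.db WHERE Db='test' OR Db='test\\_%';
FLUSH PRIVILEGES;
EOF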

As far as possible, we push all the config to the server side of things: the qbwrk.conf file can be used to specify config options for classes of workers (e.g. Windows, Linux) as well as for specific nodes. This means we have to do very little config on the workers themselves, and it's one of the nice things about Qube!. The basic xcat package list looks like:
qube/qube-supervisor
qube/qube-core
qube/qubegui
qube/qube-worker
qube/qube-mayajt
qube/qube-mentalrayjt

Once installed, you need to do some basic configuration. I have an /etc/qbwrk.conf file which includes the config to be pushed to the workers:
[default]
worker_description = "Render Farm"
proxy_execution_mode = user
worker_logmode = mounted

[linux]
worker_description = "Linux Render Farm"
worker_logpath = "/gpfs/qube/logs"

[windows]
worker_description = "Windows Render Farm"

[NODE1]
worker_logpath = "\\\\fileshare\qubelogs"

[NODE2]
worker_logpath = "\\\\fileshare\qubelogs"

We use the shared log path config; this is the recommended configuration from PipelineFX and means the workers write directly to the log directory rather than via the supervisor. A couple of things to note on this: our Linux log path "/gpfs/qube/logs" is the same directory shared via Samba as \\fileshare\qubelogs. The thing I really don't like about this is that the directory needs to be Full Control / o+rwx to allow logging to work, which also means users can see other users' log files (and potentially interfere with them!).
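
For completeness, the Samba side of that share is nothing special; a rough sketch of the sort of stanza involved (the share name and wide-open masks are illustrative, and the masks are exactly the bit I'm unhappy about):

# smb.conf: export the GPFS log directory as \\fileshare\qubelogs
[qubelogs]
   path = /gpfs/qube/logs
   read only = no
   # everything needs to be writable by all users for worker logging to work,
   # which is why users can see (and touch) each other's logs
   create mask = 0666
   directory mask = 0777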

Other than that in /etc/qb.conf there's very little config required:
qb_domain = d1
qb_supervisor = supervisor.cluster
client_logpath = /gpfs/qube/logs
supervisor_default_cluster = /d1
supervisor_default_priority = 9950
supervisor_highest_user_priority = 9950
supervisor_default_security = 
supervisor_host_policy = "restricted"
supervisor_logpath = /gpfs/qube/logs
supervisor_preempt_policy = passive
mail_administrator = admin@domain
mail_domain = domain
mail_host = smtpserver.cluster
mail_from = admin@domain

A couple of things to note here. Setting supervisor_host_policy to "restricted" requires a worker to be defined in the qbwrk.conf file before it can join (just to prevent someone accidentally adding a worker to the farm). We also set the default priority of jobs to 9950 and the highest priority a user can set to 9950 (in Qube! a lower number means a higher priority). Basically this allows us as admins to bump the priority of jobs up if we need to, and allows a user to drop the priority of their own jobs if they want another job to run in preference. The scheduler in Qube! isn't particularly complicated (more or less highest priority first, then first in first out), and there's no way to integrate it with another scheduling system.

The default permissions seem a bit scary to me (I think users can interact with other users' jobs out of the box!), so we locked things down by default and then created admin accounts which map to our other admin accounts. For example, to create an admin account we'd do:
/usr/local/pfx/qube/sbin/qbusers -set -all -admin -sudo -impersonate -lock <ADMINUSER>

To clean up the default users:
/usr/local/pfx/qube/sbin/qbusers -set administrator
/usr/local/pfx/qube/sbin/qbusers -drop qube
/usr/local/pfx/qube/sbin/qbusers -drop qubesupe
/usr/local/pfx/qube/sbin/qbusers -drop system
/usr/local/pfx/qube/sbin/qbusers -drop root

We also restrict what normal users can do by default, so users have to be specifically registered with something like:
/usr/local/pfx/qube/sbin/qbusers -set -submitjob -kill -remove -modify -block -interrupt -unblock -suspend -resume -retry -requeue -fail -retire -reset <USERNAME>
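
Since that's a long command to run per user, it's easy to wrap in a loop; a minimal sketch, assuming a plain text file of usernames (the file name is made up):

#!/bin/bash
# Register each user listed in users.txt with the standard self-service
# permissions on their own jobs.
QBUSERS=/usr/local/pfx/qube/sbin/qbusers
while read -r user; do
    [ -z "$user" ] && continue   # skip blank lines
    "$QBUSERS" -set -submitjob -kill -remove -modify -block -interrupt \
        -unblock -suspend -resume -retry -requeue -fail -retire -reset "$user"
done < users.txt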

On the Windows 7 worker nodes, we do use the installer to install components for us. We use a basic Windows 7 Enterprise image which is joined to our Active Directory. The installer is pretty good: it allows you to use an offline cache of the packages and will generate "kickstart"-style files for replaying the install on multiple machines, and it comes with pre-defined classes of system (e.g. worker, client).

As well as the Qube! service itself, there are a number of job templates which can be installed; these are wrapper scripts that allow Qube! to integrate better with various applications. On the client side, some of these include plugins to allow direct submission from the application to the cluster.

Pretty much all we need to do when installing the Windows farm systems is give the name of the render server. Our render farm nodes are on a private network and the supervisor is on both a public and a private network, so I just have to be a little careful at this point to specify the supervisor's internal name on the farm nodes, to prevent the traffic going out through the NAT gateways and back in again!
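
One simple way to make sure the workers always resolve the supervisor to its private address is to pin it in the hosts file on each worker; a trivial sketch (the address is a placeholder, the name matches qb_supervisor above):

# C:\Windows\System32\drivers\etc\hosts on the worker nodes
# pin the supervisor's internal name to its private address so render traffic
# stays on the private network rather than looping out via the NAT gateways
10.0.0.10    supervisor.cluster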

The Windows machines obviously also need the applications installed on them, and the Autodesk suite is BIG. Ideally we'd push these out from something like SCCM or wpkg, but with only a few nodes we're doing them by hand right now. (One thing I hate is that whilst the Autodesk apps support FlexLM licensing, there is no way to specify the port of the license manager from inside the installer, so I have to go back and edit it afterwards!)
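
For the record, the post-install edit itself is trivial; assuming the product uses the usual FlexLM-style license path file (the file name, location and port below are placeholders and vary by product), it's just a case of adding the license manager port to the SERVER line:

# e.g. C:\Program Files\Autodesk\<product>\LICPATH.LIC
# add the license manager port (27000 here as an example) after the host ID
SERVER licserver.cluster 000000000000 27000
USE_SERVER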

Qube! has several ways in which the worker can run jobs. One is "desktop" mode, which runs jobs as the user currently logged into the system. The second is "service" mode, which itself has two sub-modes: proxy user and run as user. Proxy user runs jobs as a local, hardcoded user; run as user requires the end user to cache their credentials into Qube!. We're using the latter (the proxy_execution_mode = user setting in qbwrk.conf above). Either of the other two modes requires "other" users to be able to access your files and write to your output folders. Whilst that may be OK in a company where everyone is working on the same projects, it doesn't work in our environment, so it's better to run as the real end user. The only downside is that it requires caching of the user's password, which is a pain when they change it, though ultimately it's no worse than our Windows HPC environment I guess.

The only other thing we really need to do on the worker nodes is to allow access via the Windows firewall:
netsh advfirewall firewall add rule name="Qube! 50001 TCP" protocol=TCP dir=in localport=50001 action=allow
netsh advfirewall firewall add rule name="Qube! 50001 UDP" protocol=UDP dir=in localport=50001 action=allow
netsh advfirewall firewall add rule name="Qube! 50002 TCP" protocol=TCP dir=in localport=50002 action=allow
netsh advfirewall firewall add rule name="Qube! 50002 UDP" protocol=UDP dir=in localport=50002 action=allow
netsh advfirewall firewall add rule name="Qube! 50011 TCP" protocol=TCP dir=in localport=50011 action=allow

netsh advfirewall firewall add rule name="Qube! 50011 UDP" protocol=UDP dir=in localport=50011 action=allow

So really, this is just a basic overview of the initial setup and some of the features I think are worth looking at. The Qube! docs are pretty good, and I've found their support people pretty responsive on the few occasions I've needed to get in touch with them. More details on Qube! are available from PipelineFX. They also run regular training courses (usually free), maybe a couple of times per year.

