🌩

So, you’ve joined a student cluster design competition

Warning: this article is, uh, kinda long. There’s a lot to talk about.

Earlier this year, I roped some friends into joining me for the training round of the national Student Cluster Competition, a supercomputing competition for undergraduate students. We made it through the first round, and went on to compete in the national finals just this past week (December 2018). Somehow, we actually won, so we’re now part of the National Student Cluster Team, and are going to compete in the International Student Cluster Competition in Germany in 2019. Now that you know that, I should let you know that we really, really were not planning on this. We did well because we’re generally competent and also quite lucky, but there are a lot of things I would change, so I’m going to compile a list of tips, hints and tricks that you might want to use should you find yourself in this situation.

First though, I’m going to put the description of the competition in the spoiler box. If you already know how the South African cluster competition works (perhaps you’re already signed up for it), then you could probably skip it. Otherwise, it has a few differences from some competitions such as the USA SCC, which I’ll detail in the box.

How does the SCC work

The SA SCC is a pretty straightforward affair compared to other countries. There’s a week-long training round where teams who made the initial cut will attend lectures to learn the absolute basics of cluster computing as well as a very, very brief introduction to Linux basics. During this time you’ll also complete a few lab activities where you’ll set up and benchmark a virtual cluster on some kind of cloud service. You’ll have to set up a fairly comprehensive environment, complete with scheduler, user management and, of course, benchmarking software packages. At the end of the training, you’ll be provided with a use case and a theoretical budget, with which you should try and design the best cluster you can and explain your reasoning.

Should you succeed here, you will be selected for the actual competition. You’ll get told another few use cases, and be provided a new budget. You’ll have to submit a design for a cluster, which will be provided to you at the start of the competition. You’ll get a few days to set up the cluster and run as many benchmarks as you can, and you’ll be scored based on how well you do relative to the best score. At the end, they’ll tally up all the scores and see who wins.

This last part varies from place to place. I know that the USA SCC doesn’t provide a budget or parts, and it’s up to the teams competing to track down sponsorships and parts any way they can. Those competitions also tend to focus on producing a working cluster that has scheduling and management, whereas the SA SCC tends to focus on benchmarking exclusively. This is in part because the budget and scope constraints of the SA SCC don’t really incentivise things like job management and monitoring. You may need to adapt some of the hints to your particular case.

By the time we got to the end, we were working on a 3-node system with 16 cores per node and 64GB of memory per node. It was all running through a not super great 10GbE network. There’s a detailed spec at the bottom of the page. It cost about ZAR 180 000, which is just under USD 13 000 at time of writing, and also coincidentally the budget we were given. If you’ve got much more budget, some of the advice might not be as valid.

Getting through round 1

Round 1 is, relatively speaking, pretty easy, and if you’re well-versed in how computers work and you know some Linux it shouldn’t be too hard. However, it may be a good idea to make sure you have a solid foundational understanding of basic Linux usage and networking, although the tutorials should cover most gaps. I was fortunate enough to have accidentally taught my entire team Linux in the months leading up to the training, and it gave us a major advantage. Some of the teams we competed with did well despite this being their first introduction to doing any kind of Linux administration.

Assuming that you make it through round 1, congratulations, you’ve probably compiled some basic MPI aware code and worked on a Linux machine for a few hours. Armed with that, let’s go into what you might want to focus on in order to excel in the main round.

Alright, so, Knowledge:

Networking Knowledge

You’ll be wiring a lot of machines together.

To this end, it’s important to know at least the basics of how a network works. Key points will be things like subnet masks and IP addressing, NAT, and name resolution. We cost ourselves several days of work due to a misunderstanding of how name resolution happens inside the /etc/hosts file, so make sure you test out your network on real hardware if you can. NAT will be useful for making sure all the parts of your network can see the internet, and you’ll need a working knowledge of IP addressing to set up a reliable, statically-addressed network of machines. If you’re competing with a larger cluster than we did, you may need DNS. Setting up a few networks and ensuring you have name resolution and fast connections between as many nodes as you can is good practice. See if you can get your hands on a few boxes and a switch from your university or organizer to practice on.
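To make the static addressing and hosts file idea a bit more concrete, here’s a minimal Python sketch that hands out consecutive addresses from a subnet and prints them as /etc/hosts lines. The subnet 10.0.0.0/24 and the names head, node1 and node2 are assumptions picked for illustration, not what we actually used.

```python
import ipaddress


def hosts_entries(subnet, names):
    """Return /etc/hosts-style lines giving each name a consecutive static IP."""
    network = ipaddress.ip_network(subnet)
    addresses = iter(network.hosts())  # usable host addresses, in order
    return [f"{next(addresses)}\t{name}" for name in names]


if __name__ == "__main__":
    # Hypothetical subnet and node names, for illustration only.
    for line in hosts_entries("10.0.0.0/24", ["head", "node1", "node2"]):
        print(line)
```

The point of generating the list once is that every node can get an identical copy, so a name resolves to the same address no matter which machine you ask from.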

Learn You Some Linux (for a great good)

Pretty much any supercomputing anyone does is under Linux, because it’s easier and cheaper and you don’t have to worry about it updating when you’re trying to do stuff. In particular, CentOS is popular because it’s backed by Red Hat, who do enterprise-type software, and that means good support and so on. If you don’t know your way around a Linux system, you’ll waste a lot of time and spend a while reading how to fix relatively simple issues.

The first class of Linux knowledge is probably utility knowledge. This is knowing how to chain together common Linux programs to perform useful tasks. You may, for instance, want to learn how to use error stream redirection and tee to let you read debug information AND save it to a file at the same time. Can you use find to locate a stray library that hasn’t been loaded properly? Or maybe you’ll want to know how to use ifconfig or ip to configure the network. You’ll probably, like we did, need to use scp to move things between nodes. Having this knowledge on hand without needing to look it up is good for speeding up the setup work you’ll need to do.
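On the command line, the redirection-plus-tee trick usually looks like `make 2>&1 | tee build.log`. If it helps to see what that pipeline is doing, here’s a rough Python equivalent. It’s only a sketch; the function name run_and_tee and the log file name are mine, not anything standard.

```python
import subprocess
import sys


def run_and_tee(cmd, log_path):
    """Run cmd, showing merged stdout/stderr live while saving it to log_path."""
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold the error stream into stdout
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # read it as it happens...
            log.write(line)         # ...and keep a copy for later
        return proc.wait()          # exit code of the command
```

Calling something like `run_and_tee(["make", "-j4"], "build.log")` would then show a build as it runs and leave the full output, errors included, in build.log.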

The second class is probably to do with Linux inner workings, that is, how Linux is put together. For instance, if you need to track down a stray library, which path might it be in? It could be /lib/ or /usr/lib/, but perhaps /lib64/ is a better bet because it’s a 64-bit library. Configuration files usually go in /etc/. Can you create a group that allows any users who are in it to access all the files in a directory like they were root? How would you go about preventing a particular service from starting on boot? Knowing where to find these files or tools and what they do will vastly improve your ability to quickly assess an error and let you better understand what causes it. Indeed, service management could almost be an entire skill block of its own.
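As an illustration of the “which path might it be in” question, here’s a small Python sketch that walks the usual library directories looking for a file name pattern, much like running find over each of them. The directory list is an assumption (I’ve added /usr/lib64 and /usr/local/lib to the ones mentioned above), and your distribution may keep things elsewhere.

```python
import fnmatch
import os

# Assumed search locations; adjust for your distribution.
DEFAULT_LIB_DIRS = ("/lib", "/lib64", "/usr/lib", "/usr/lib64", "/usr/local/lib")


def find_library(pattern, lib_dirs=DEFAULT_LIB_DIRS):
    """Return sorted paths under lib_dirs whose file name matches a glob pattern."""
    matches = []
    for directory in lib_dirs:
        # os.walk silently skips directories that are missing or unreadable.
        for dirpath, _dirnames, filenames in os.walk(directory):
            for name in fnmatch.filter(filenames, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

For example, `find_library("libfftw3*")` lists every matching file it can see. On systems where /lib is a symlink to /usr/lib, the same file can show up under both paths.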

Lastly, there’s scripting knowledge. This is mostly useful for preparation, since you can write scripts to automate annoying or repetitive tasks. I wrote a handful of scripts that would automate things like setting up the NAT, disabling certain unwanted features, or testing network interfaces. Mostly, this will be an extension of your command line knowledge, since you’ll likely be writing for bash or sh. If you find yourself searching through your old commands to repeat some line, put it in a script.

Monitoring

You’ll want some way to monitor at least the very basics of your system. You can set up a much more advanced monitoring system, and indeed some competitions will expect this, but you should aim to get a basic monitoring system up early. In our case, we used a tmux session that was connected to all three nodes with htop running in each one just to let us keep track of what was happening. It’s not much and it’s not detailed but it’s enough that if you’re about to run a long job you can look over and go “oh, someone’s already hogging ten cores with a compile, I’ll wait/move to another node.” This leads pretty well into:
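If you’d rather have a script answer “is anyone hogging this node?” than eyeball htop, something like the following Python sketch could work. It assumes key-based ssh login to each node and reads /proc/loadavg; the host names and the 0.5 threshold are hypothetical choices, not what we ran.

```python
import subprocess

NODES = ["head", "node1", "node2"]  # hypothetical host names


def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def node_load(host):
    """Read the 1-minute load of a remote node over ssh (assumes key-based login)."""
    result = subprocess.run(
        ["ssh", host, "cat", "/proc/loadavg"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_loadavg(result.stdout)


def busy_nodes(loads, cores_per_node, threshold=0.5):
    """Return hosts whose load exceeds threshold * cores_per_node."""
    return [host for host, load in loads.items()
            if load > threshold * cores_per_node]
```

Building `loads = {host: node_load(host) for host in NODES}` and passing it to `busy_nodes(loads, cores_per_node=16)` would flag any node where more than roughly half the cores are occupied, which is about the level of detail the tmux setup gave us.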

Compiling Stuff

For a lot of supercomputing codes, a significant performance boost can be had by compiling maths, networking and vector libraries from source. Compiling on your system will allow the compiler to optimise how it compiles for your hardware. There are some caveats to this that I’ll get to in the tips section, but if you’re using open source libraries like *BLAS, FFTW, LAPACK, ODEPACK or ATLAS, you’ll be compiling them from source to get the best performance. In general, there are good instructions for this, but it’s a good idea to have practiced modifying the required Makefiles to use the appropriate libraries and compilers, for instance, switching between OpenMPI and MPICH as your MPI provider. Getting some of these to compile also allows you to do practice benchmarks, even if it’s just on two laptops wired into a 100Mb/s switch with a tiny HPL run.

Communication and Teamwork

Make sure you know if anyone could be doing something that affects what you are working on. While some tasks can be done in parallel, it’s important to know who in your team is aware of what, who is good at certain jobs, and how to allocate work both fairly and effectively.

In my case, we had four team members, who each had wildly different skill focusses. Initially, we kind of haphazardly tried to do as many things as possible in parallel, which resulted in making a mess of the build environment and the cluster in general. Two days in, we restarted from scratch and started organizing who would do what and when, which allowed us to perform as many tasks as we could but no more than would allow us to keep track of what was happening. Make sure you can allocate jobs to people, but don’t let that mean they can’t ask for help. We were all constantly rotating around to make sure that if someone needed fresh eyes they’d get them.

I saw a few different approaches from other teams, such as having everyone generalise and then take shifts to work non-stop, or to constantly rotate between tasks so everyone got to work on something. Your personalities will affect this, and due to how asymmetric the distribution of knowledge in my team was, I think that our solution was good.

Tips, Tricks and Recommendations.

Using Intel Parallel Studio

Compiling sucks. Compiling the huge and fidgety linear algebra, networking and transformation libraries needed for HPC doubly so. If you can avoid it, you should, and fortunately most competitions allow you to use software like Intel Parallel Studio. This is a couple of gigabytes of Intel’s finest hand-crafted and optimized code. icc and mpiicc are the heavily optimized C/C++ compiler and the associated Intel MPI (impi) wrapper, and the Math Kernel Library is an enormously fast mixture of maths libraries that can replace things like BLAS or FFTW.

There’s also a variety of pre-compiled benchmarks, such as HPL builds designed specifically for various AVX implementations. The Intel tools are very, very good, and not having to compile your own maths libraries saves a lot of time. Keep in mind that not all software will compile under them though. For instance, the finite element analysis package FEniCS uses some gcc-exclusive implementations, and trying to compile it using icc is near impossible at the moment. While Parallel Studio is technically proprietary (and very expensive) software, you can either use the 30-day trial or your student email to get access to it for the competition.

Know Your Benchmarks

You’ll have to benchmark several different codes. You’ll almost definitely do HPL and HPCG, but you may also have to do something like WRF, FEniCS, GROMACS, or TensorFlow. Each of these loads different parts of your system, be it the RAM bandwidth, CPU count, or interconnect bandwidth. Monitoring is useful for helping diagnose bottlenecks, but having a rough idea of what affects your benchmark is a good idea. HPL depends almost wholly on core count, whereas HPCG favours lots of fast memory. Calculate estimated performance where possible.
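As a sketch of what “calculate estimated performance” can mean for HPL, here are two back-of-the-envelope helpers in Python. The first is the usual theoretical peak (nodes × cores × clock × floating point operations per cycle). The second uses a common rule of thumb for sizing the HPL matrix so it fills most of memory. The flops_per_cycle value, the 0.8 memory fraction and the block size of 192 are all assumptions you’d need to check against your own CPU and HPL configuration.

```python
import math


def theoretical_peak_gflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: nodes * cores * GHz * FLOPs per cycle per core."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


def hpl_problem_size(total_mem_gib, mem_fraction=0.8, block_size=192):
    """Largest N (a multiple of block_size) such that an N x N matrix of
    8-byte doubles fits in mem_fraction of the total memory."""
    usable_bytes = total_mem_gib * 2**30 * mem_fraction
    n = int(math.sqrt(usable_bytes / 8))  # 8 bytes per double
    return n - n % block_size


if __name__ == "__main__":
    # 3 nodes x 16 cores at 2.1 GHz as in our cluster; 16 FLOPs/cycle is an
    # assumed placeholder, and sustained clocks under vector load are lower.
    print(theoretical_peak_gflops(3, 16, 2.1, 16))
    print(hpl_problem_size(3 * 64))
```

Real HPL scores land well below the theoretical peak, so treat the first number as a ceiling to compare against, not a target you’ll hit.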

Be Friendly

Throughout this competition, there was not too much of a feeling of intense competition. That may change for the ISC, but during the last two rounds there was a lot of free sharing of information between groups. While you could abuse this to gain an edge by exclusively listening, it’s a good idea to help other groups out and in turn be helped. You’re all probably not too sure what you’re doing so you can learn a lot from each other, and it’s just a worthwhile learning experience. It wasn’t at all uncommon for people to get up, walk over to a nearby team and ask for scores to compare or hints in solving some error.

Listen to the hints you get

There will be very highly qualified HPC professionals wandering around asking you questions. When they drop hints, pay attention, because they are definitely trying to help you.

Know where to get help

Web resources are useful, but it’s nice to be able to ask skilled people for help. While support forums may be too slow for help, I managed to get a lot of good help during the competition from some chatrooms. In particular, if you can use IRC, ##linux on freenode is a good place to ask questions, and heck, maybe you can even answer someone else’s questions while you’re there. If you have a mentor, make sure they stay abreast of what you’re doing, since they’ll probably be able to provide useful advice from their experience. Shout out to doublehp from ##linux for sharing a very useful NAT setup script that saved me from DNS hell.

Don’t use bleeding edge software

If something has just come out, do not touch it. We installed Intel Parallel Studio 2019 (yes this was in 2018, business years) and there was absolutely zero support available. We ended up rolling back to 2018. Don’t go too far back as there are optimizations to be had from newer software, but don’t use anything so new that it hasn’t been tested extensively by people smarter than you.

Know Your Specialities

I alluded to this earlier, but it’s a good idea to know what people are good at and play to those strengths. Group your team as best as you can, and ideally make sure you know who to ask for help. I’m personally a fairly weak programmer and other team members are better equipped than I am to handle debugging compiles, but I have the most comprehensive Linux knowledge of all of us, so I served largely as a knowledge base who could be queried for assistance. This left me darting between teammates helping with short fixes to config files and tools rather than working on any single larger goal. Other team members worked on optimizing runs, getting TensorFlow working, or running compiles, so whenever one of us hit a task that we couldn’t handle, we’d know who to turn to. These specialities are largely organic, in the sense that it’s just what we’re good at, but you could potentially try and assign them.

Get it working first

If you can get a benchmark running, you can hand it in and get the next one. Since the grades (in the SA SCC) are calculated from the sum of all scores, it’s possible to win with mediocre scores in every category rather than with good scores in a few. If you can’t get BLAS to compile, grab the one in the repositories and fix it later if you get a chance. It’s not optimized but it’ll get you through and onto the next benchmark, which vastly improves your chances.
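To see why breadth can beat depth under summed scoring, here’s a toy Python model. It assumes each benchmark score is your result divided by the best result for that benchmark, and that higher raw results are better; the exact formula the organisers use may well differ, and the numbers in the usage note are invented purely for illustration.

```python
def relative_scores(results):
    """results maps benchmark -> {team: raw result}, higher being better.
    Each benchmark contributes result / best result; returns {team: total}."""
    totals = {}
    for per_team in results.values():
        best = max(per_team.values())
        for team, value in per_team.items():
            totals[team] = totals.get(team, 0.0) + value / best
    return totals
```

With made-up inputs like `{"hpl": {"A": 70, "B": 100}, "hpcg": {"A": 70, "B": 100}, "tensorflow": {"A": 50}}`, team A is mediocre on two benchmarks but is the only team with a third result, and it ends up ahead of team B, which posted the best scores on the two it finished.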

Don’t get stuck

A big part of why we won was that we knew when to give up. Compiling FEniCS put a serious dent in the time of practically every team because it has a lot of intricate and confusing dependencies, never mind that it needs a carefully controlled build and run environment that is very easy to accidentally mess up. That’s all you need to run it on one node. Running it on a cluster is even more fidgety setup work, and while by the end, a few groups managed to get it working, they had very little time to try and optimize existing benchmarks or tackle the unseen TensorFlow benchmark. We opted to, once we had FEniCS working on one node, forgo a score for the cluster entirely and continue. In the end, not wasting several hours on getting it working allowed us to gain a lead by having the only working TensorFlow, a big boost to our score. This was a risk, so you’ll need to consider what you know about other teams and how confident you are in your ability to tackle a task you may not have seen, since in the SA SCC you can’t go back to a task you’ve skipped.

This isn’t a working cluster

If you’ve tried to speak to general HPC experts about your cluster, you may have found them to be unhelpful, because (with the SA SCC at least) your budget is a couple of orders of magnitude smaller than theirs. You don’t get InfiniBand, and even if you could, you couldn’t afford it. You probably can’t afford GPUs, and HPC is all about those GPUs these days. Instead, you kind of need to cheat. You’re not building a real cluster so you shouldn’t build /like/ a real cluster. You can skimp on a lot of software to push up performance and save time. You don’t need broad user management through LDAP. You super don’t need a self-maintaining DNS system to help route connections. You probably don’t need any kind of friendly user interface, which will eat valuable CPU cycles. Instead, keep it as light as you can without making your job much harder. Users can be managed with useradd and usermod from the shadow package that most OSes ship with nowadays. Use a hosts file and fixed IPs to handle name resolution, since it’ll not only be lighter, but also faster.

Don’t burn out

It’s easy to end up working really long runs. I don’t think many people got much more than four hours of sleep each night, and it takes a toll. Don’t worry too much about letting teammates take a nap or a break for an hour or two if they’re feeling out of it. Pretty much all of us, at some point, stood up from our booth and said “I’ll be back in [30 minutes/an hour/two hours], I’m going to go [take a shower/have lunch/take a nap]” and it was a lot easier to work with than someone falling asleep at their terminal and getting irritable. We made a point of not pulling all-nighters, preferring to leave the cluster completely each night for some time. This a) gives you a nice time to run some longer benchmarks without having to put yourself through staring at end time estimates for two hours and b) provides you with some much needed rest. Stay hydrated, and stay fed.

Don’t take it too seriously

This is less actual advice and more just a thing we did. We, uh, never really expected to win; indeed, throughout the competition we were basically banking on coming second or third at best, because “we wanted to win something, at least.” Whether not holding a competitive spirit (and being more relaxed as a result) made us more successful is speculative and therefore meaningless, but it certainly felt more fun this way. And while cracking jokes about how you could cheat directly in front of the judges is a good way to get them to come ask you to re-run some benchmarks (haha, whoops), it’s a small price to pay for actually enjoying spending five days staring at computers with some friends.

God that’s a really long list of things to know

Yeah! This competition was genuinely hard, and at many points we were pretty sure we’d never even scrape top 5. Thanks to the way the SA SCC runs, most clusters are actually very similar, so scores are extremely close and come down to a mixture of luck, optimization and tiny hardware differences. Speaking of hardware differences, if you were wondering what we ended up building, here are some descriptions. Note that the head node was also a compute node and that frankly, this is a weird setup that was originally designed to run in a ring topology. I’ll explain more under the cut.

Cluster Design

We were provided with a hypothetical budget of ZAR 180 000 and a limited parts pricelist from Dell, along with the requirement that we use at least three nodes. There were no Infiniband or GPU cards available in the price list, presumably to simplify the competition and reduce costs.

1×Head Node

Body: Dell T440 Tower Server
CPU: 2×Intel Xeon Silver 4110 (8C @ 2.1GHz, 11M cache)
RAM: 4×16GB 2666MHz DDR4
Disk: 3×120GB SSDs in RAID0 (scary!)
RAID controller: PERC H330
Interconnect: Dual-port 10GbE SFP network card

2×Compute Node

Body: Dell T440 Tower Server
CPU: 2×Intel Xeon Silver 4110 (8C @ 2.1GHz, 11M cache)
RAM: 4×16GB 2666MHz DDR4
Disk: 1×120GB SSD
Interconnect: Dual-port 10GbE SFP network card

Software Stack

OS: CentOS 7
Filesystem: XFS
MPI: Intel MPI, MPICH
Compilers: gcc/g++/gfortran, icc
Monitoring: tmux full of htop, custom bash scripts
User Management: shadow
Math Libraries: OpenBLAS, Intel MKL, FFTW
Fileshare: NFS mount from head node

Benchmarks Run

HPL
HPCC
HPCG
GROMACS ADH_cubic and 1.5M_water
FEniCS demo_poisson
TensorFlow ResNet 50

So that’s, uh, that’s about all I wanted to get through. There’s a lot more to the competition, of course, and we still haven’t even started training, so I can’t speak to working with larger clusters and GPUs, or even to InfiniBand networking. I’ll try and update this as I go, and it should be considered in a state of flux, for now.

Now that we’ve won the nationals, we’re going to go start training for the ISC, I think. We’re also looking into increasing the number of students competing from the University of Cape Town, since we were the first team in years. In particular I’d really like to get a mixture of CS and non-CS students involved because frankly, this is mostly very particular ops work and you don’t need to know how to reverse a linked list to do ops work.

If you have any questions, or if you’re looking for help, have some comments, or just want to talk about this, you can reach me, in order of reliability, as Kalium on IRC, Twitter, or email if you really insist. This place doesn’t have comments because uuuuhhh I haven’t gotten round to it and I’m scared of databases.