LINPACK is Back

It’s been a while since I’ve done anything with Raiden. This is partly because I’ve been swamped with other projects and partly because it was summer, and running the cluster made things uncomfortably (dangerously?) warm in the lab.

Now that Makerfaire is past and winter is coming, I took some time to resume my pursuit of establishing the baseline performance of Raiden Mark I, which means round two of fighting with LINPACK.

[Image: "Winter is coming. Boot the supercomputer."]
For the uninitiated, LINPACK (specifically HPL, the High-Performance Linpack Benchmark) is a standard method of measuring supercomputer performance. It’s what top500.org uses to rank the world’s fastest computers, and while it’s not an absolute measure of performance for all applications, it’s the go-to way to compare the performance of supercomputers.

Since the goal of the Raiden project is to determine if traditional Intel server-based supercomputer clusters can be replaced by less resource-intensive ARM systems, the first step is to establish a baseline of performance to use for comparisons. Since HPL is the standard for measuring supercomputers, it makes sense to use it here.

The problem is that HPL isn’t the easiest thing to get running. This is partly because most supercomputers are specialized, custom machines. It’s also a pretty specialized piece of software with a small userbase, which means there just aren’t a lot of people out there sharing their experiences with it.

When I first built out Raiden Mark I, I kind of assumed that HPL would be part of the Rocks cluster distribution, since benchmarking is a pretty common task when building out a supercomputer. If it was included, I wasn’t able to find it, and after spending a few hours trying to get HPL and its various dependencies to compile, all I had to show for it was a highly-parallel segfault.

I’m not sure what’s changed since then, but with fresh eyes I put on the white belt and tried building HPL from scratch. After reading the included documentation and looking over my own (working) MPI programs, I was able to ask the right questions and found a tutorial that led me to successfully compiling the software. Not only that, but I was able to do a test run on a single node without errors!
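For reference, the kind of simple MPI program I used to convince myself the MPI stack itself was healthy looks roughly like this (a generic sanity check of my own, not anything taken from HPL):

```c
/* Minimal MPI sanity check: not part of HPL, just a quick way to
 * confirm that the compiler wrapper and the MPI runtime are working
 * before blaming the benchmark. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many are there?  */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```

Compile it with mpicc and launch it with mpirun; if every rank reports in, the MPI side of the house is fine. With that out of the way, here’s the output from the single-node HPL run: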

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       29184   192     2     2             692.41              2.393e+01
HPL_pdgesv() start time Fri Oct 6 12:58:50 2017

HPL_pdgesv() end time Fri Oct 6 13:10:22 2017

(The interesting part of the output is the Gflops column, which in this case reads 2.393e+01, or about 23.9 Gflops.)
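That figure is easy to double-check: HPL charges roughly 2/3·N³ + 2·N² floating-point operations for the factor-and-solve and divides by the wall time. A quick sketch that reproduces the single-node number:

```c
/* Reproduce HPL's Gflops figure from N and the reported wall time.
 * The benchmark counts roughly (2/3)*N^3 + 2*N^2 floating-point
 * operations for the LU factorization and solve. */
#include <stdio.h>

int main(void)
{
    double n    = 29184.0;   /* problem size N from HPL.dat           */
    double time = 692.41;    /* wall time in seconds from the run     */

    double flops  = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    double gflops = flops / time / 1e9;

    printf("%.3e Gflops\n", gflops);  /* prints roughly 2.393e+01 */
    return 0;
}
```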

I quickly spun up the compute nodes of the cluster and modified the machines file to run the benchmark across four nodes. However, for some reason only two nodes joined the cluster, so I decided to run with just those two and troubleshoot the missing nodes another time.

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       29184   192     2     2             589.34              2.812e+01
HPL_pdgesv() start time Wed Jan 23 18:08:43 2008

HPL_pdgesv() end time Wed Jan 23 18:18:32 2008

The results were a bit disappointing (only about 4 Gflops faster). I would have expected something closer to twice the performance from adding nodes to the cluster (as well as off-loading the benchmark from the head node). Based on these results I decided to take a look at tuning the HPL.dat file to see if I could optimize the parameters for a two-node cluster instead of a single computer.

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       41088   192     2     4            1333.90              3.467e+01
HPL_pdgesv() start time Wed Jan 23 18:25:19 2008

HPL_pdgesv() end time Wed Jan 23 18:47:33 2008

This made a significant difference: roughly 28.1 to 34.7 Gflops, or about a 23% improvement. Not surprisingly, the benchmark responds strongly to being tuned for the hardware configuration it’s running on. I knew this mattered, but I didn’t realize how dramatic the difference would be.
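For anyone fighting the same fight, the tuning advice I keep running into boils down to two rules of thumb: make the P×Q process grid as close to square as possible (with P ≤ Q), and pick N so the N×N matrix of doubles fills most, but not all, of the memory across the nodes in the run. Here’s a rough sketch of the N calculation; the 8 GB-per-node figure is only an illustrative assumption, not Raiden’s actual spec:

```c
/* Rough HPL.dat sizing helper based on the usual rules of thumb:
 * pick N so the N x N matrix of 8-byte doubles fills most (but not
 * all) of the combined RAM, then round N down to a multiple of NB.
 * The 8 GB-per-node value is an assumption for illustration only. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double mem_per_node = 8.0 * 1024 * 1024 * 1024; /* assumed bytes of RAM per node   */
    int    nodes        = 2;                        /* nodes taking part in the run    */
    double fraction     = 0.80;                     /* leave headroom for the OS       */
    int    nb           = 192;                      /* block size used in the runs above */

    /* The matrix holds N*N doubles (8 bytes each), so solve for N. */
    double n_raw = sqrt(fraction * mem_per_node * nodes / 8.0);

    /* Round down to a multiple of NB. */
    long n = ((long)n_raw / nb) * nb;

    printf("Suggested N: %ld (about %.1f GB of matrix)\n",
           n, (double)n * n * 8.0 / (1024 * 1024 * 1024));
    return 0;
}
```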

I’m very excited to have reached this point in the project. There are a number of reasons I’m eager to move on to the Mark II version of the hardware, and establishing a performance baseline for the Intel-based Mark I is a prerequisite for that next stage. There is still work to do: I need to get the other two nodes online, and I need to spend more time learning how to optimize the settings in HPL.dat. But those are much less mysterious problems than getting the benchmark to compile and run on the cluster.
