Thanks to a gracious gift of gadgetry for Father’s Day, I was able to spend last weekend increasing Mark II’s computing power to maximum capacity!
Not only did I have the parts, but I also had the time (again thanks to Jamie & Berty) and had very ambitious plans for pushing Mark II across the finish line.
Things didn’t go quite as planned (I spent one of the two days fighting O/S problems of all things) but I was able to get all 8 nodes online by the end of the weekend.
Since then I’ve been working on running high-performance Linpack (hpl) to see how Mark II’s performance stacks up against Mark I. After a few runs, I was able not only to match the best results I achieved with Mark I, but to slightly exceed them.
The best hpl performance I could get out of Mark I was 9.403 Gflops. Using a similar hpl configuration (adjusted for reduced memory size per node), I was able to get 9.525 Gflops out of Mark II. On both machines this result was achieved with a 4×4 configuration (4 nodes, 16 cores), so even though it wasn’t the maximum capacity of either machine, it’s a pretty good apples-to-apples comparison between the two architectures.
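For anyone wondering what "adjusted for reduced memory size per node" might look like in practice, here is a minimal sketch of a common HPL sizing rule of thumb: pick the problem size N so the N×N matrix of 8-byte doubles fills roughly 80% of total RAM, then round down to a multiple of the block size NB. This is not necessarily how the configs above were derived. The function name, the 80% fraction, the block size of 128, and the 1 GiB-per-node figure in the usage note are all assumptions for illustration, not values from these runs.

```python
import math


def hpl_problem_size(nodes, mem_per_node_gib, block_size=128, mem_fraction=0.8):
    """Estimate an HPL problem size N (hypothetical helper, rule of thumb only).

    The N x N matrix of 8-byte doubles should fit in `mem_fraction` of the
    cluster's total memory; N is rounded down to a multiple of `block_size`.
    """
    total_bytes = nodes * mem_per_node_gib * 1024 ** 3
    # Largest N such that 8 * N^2 <= mem_fraction * total_bytes
    n = int(math.sqrt(mem_fraction * total_bytes / 8))
    # Round down so N divides evenly into NB-sized blocks
    return n - n % block_size
```

Calling `hpl_problem_size(4, 1.0)` would size a run for four nodes with an assumed 1 GiB each. Halving the memory per node shrinks N by a factor of about √2 rather than 2, because the matrix grows with N².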
That said, there’s more performance to be had. I’ve learned a lot about running hpl since I recorded Mark I’s results and I’m confident that I can figure out what was stopping me from improving the numbers by adding nodes back then. In addition to better hpl tuning skills, I’m able to run all nodes of Mark II without throwing a circuit breaker, which was not the case with Mark I. This makes iterating on hpl configs a lot easier.
The theoretical maximum hpl performance (Rpeak) for Mark II is 76 Gflops. Achieving this is impossible due to memory constraints, transport overhead, etc. but I think 75% of Rpeak is not unreasonable, which would yield a measured peak performance (Rmax) of 57 Gflops. This would rank Mark II right around the middle of the Top 500 Supercomputer Sites… in 1999.
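The arithmetic behind those numbers can be sketched in a few lines. The 76 Gflops Rpeak, the 75% target, and the 9.525 Gflops result come from this post; the assumption that Rpeak splits evenly across the 8 nodes (so the 4-node test has half the full-machine Rpeak) is mine, as are the function names.

```python
RPEAK_GFLOPS = 76.0      # theoretical peak for the full Mark II (from the post)
TOTAL_NODES = 8          # nodes in the full machine (from the post)
MEASURED_GFLOPS = 9.525  # best result so far, on 4 nodes (from the post)


def target_rmax(rpeak_gflops, efficiency):
    """Rmax you would measure at a given fraction of Rpeak."""
    return rpeak_gflops * efficiency


def partial_rpeak(rpeak_gflops, nodes_used, total_nodes):
    """Rpeak of a subset of nodes, ASSUMING peak scales linearly per node."""
    return rpeak_gflops * nodes_used / total_nodes


def efficiency(measured_gflops, rpeak_gflops):
    """Fraction of theoretical peak actually achieved."""
    return measured_gflops / rpeak_gflops


goal = target_rmax(RPEAK_GFLOPS, 0.75)
four_node_peak = partial_rpeak(RPEAK_GFLOPS, 4, TOTAL_NODES)
current = efficiency(MEASURED_GFLOPS, four_node_peak)
print(f"Target Rmax: {goal:.1f} Gflops")
print(f"4-node Rpeak (assumed linear): {four_node_peak:.1f} Gflops")
print(f"Current 4-node efficiency: {current:.1%}")
```

Under that linear-scaling assumption, the 4-node run is sitting at roughly a quarter of its theoretical peak, which gives a sense of how much tuning headroom is left before the 75% target.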
Not too shabby for a $500.00 machine you can hold in your hand.
Still, there’s a long way to go from my current results of not-quite 10 Gflops to 57. I think there’s a lot to improve in how I’m configuring hpl, and I also think there is work to do at the O/S and hardware level. At a minimum I’m going to need to increase the machine’s cooling capacity (right now I don’t even have heatsinks on the SoCs, so I know the CPU throttle is kicking in). So if I’m able to find an hpl config that doesn’t lose performance as I add nodes to the test, and I can keep the system cool enough to make sure CPU throttling doesn’t kick in, I should be able to get much closer to that Rmax value.
But even with no improvement, this test confirms the validity of the Mark II design. The original goal for this phase of the project was to establish the difference in performance between the Mark I machine and a similar cluster built from ARM single-board computers. The difference, surprisingly, is that the ARM machine has a slight node-for-node edge over the “traditional” Mark I.
Perhaps just as important, Mark II achieves this with significantly less cost, physical size, thermal output and electricity consumption. Once I complete the power supply electronics for Mark II I’ll be able to get more precise measurements of power consumption, but even if Mark II operates at its maximum power consumption, that’s still 1/10 of the power consumed by Mark I.
This is a significant milestone in the RAIN project and marks the “official” end of the Mark II phase. I plan to finish Mark II’s implementation (wiring front-panel controls for all nodes, designing & fabricating the front-end interface board, etc.) and generate a “reference design” for others who would like to recreate my results (perhaps even offer a kit?), but with these results in hand I can confidently enter into the Mark III phase of the project as well.