Have you ever been working with a dataset, started crunching some numbers and said to yourself, “damn, I should distribute this across the cluster,” only to realize that your cluster is already saturated with your last job and will be for the next day or two? If you answered yes, then we probably share the same data-craving/slicing/mining sickness.

Well, the above scenario happens to me often enough that I want to pose the question to others. I could simply invest in a larger cluster, but that is an expensive investment, especially since the scenario usually requires only bursts of compute time. That makes an on-demand cluster a perfect solution.

On-Demand Beowulf

I had heard some chatter about Peter Skomoroch’s ElasticWulf and found myself walking through his series on creating an on-demand Beowulf cluster using Amazon’s EC2. You can find his very helpful posts here and here (with another on the way). ElasticWulf is a package of Python tools and machine images that allow you to create and manage a Beowulf cluster on Amazon’s EC2 service. Peter has done the heavy lifting for you: the machine images come loaded with essential computational Python packages like SciPy, as well as cluster middleware, so you can get up and running with only minor configuration.

The Results

After running through Amazon’s EC2 Getting Started Guide and Peter’s posts, I was up and running with a new Beowulf cluster in well under an hour. I pushed up and distributed some tests, and it seems to work. Now, it’s not fast compared to even a low-end contemporary HPC system, but it is cheap and able to scale up to 20 nodes with only a few simple calls. That’s nothing to sneeze at, and I don’t have to convince the wife or the office to allocate more space to house 20 nodes.

I don’t currently have any hard numbers on my ephemeral cluster’s performance, but it is something I am curious about. How much can these virtualized Opteron 250s dish out? It looks like Peter’s third installment will address benchmarking performance, which is something I look forward to. In the meantime I might just push up High Performance Linpack (HPL) and see how it stacks up (in the abstract) against my existing clusters.
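
For anyone who does run HPL, the usual way to read the result is to compare the measured figure (Rmax) against the theoretical peak (Rpeak). Here is a minimal sketch of that arithmetic in Python. The function names are my own, and every input below is a placeholder assumption rather than a measurement: 20 single-core nodes, the 2.4 GHz nominal clock of an Opteron 250, and 2 double-precision FLOPs per cycle. A virtualized instance may well get only a slice of the physical core, so the real ceiling could be lower.

```python
def theoretical_peak_gflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Rpeak in GFLOPS: every core retiring its maximum FLOPs on every cycle."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


def hpl_efficiency(rmax_gflops, rpeak_gflops):
    """Fraction of the theoretical peak that an HPL run actually achieved."""
    if rpeak_gflops <= 0:
        raise ValueError("rpeak_gflops must be positive")
    return rmax_gflops / rpeak_gflops


# Placeholder inputs (assumptions, not measurements from my cluster).
rpeak = theoretical_peak_gflops(
    nodes=20,           # the 20-node ceiling mentioned above
    cores_per_node=1,   # assumed: one virtual core per instance
    clock_ghz=2.4,      # assumed: nominal Opteron 250 clock
    flops_per_cycle=2,  # assumed: double-precision FLOPs per cycle
)
print(f"Rpeak: {rpeak:.1f} GFLOPS")

# Once HPL reports an Rmax, plug it in here (this value is hypothetical).
hypothetical_rmax = 48.0
print(f"Efficiency: {hpl_efficiency(hypothetical_rmax, rpeak):.0%}")
```

The efficiency number is what makes the comparison "in the abstract" possible: two clusters with very different hardware can still be compared on how much of their own peak they deliver.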

Now that I finally have a cluster up and running on EC2, I plan on immersing myself in more data. It will also be a nice place to experiment with other cluster technologies I have been meaning to investigate, like Hadoop; in fact, there are already public Amazon Machine Images for Hadoop nodes.

Exciting stuff…