GHAPACK: A Library for the Generalized Hebbian Algorithm

I recently joined a new open source project called GHAPACK. The project provides an implementation of the Generalized Hebbian Algorithm (GHA) along with tools for putting it to use. I came across it after banging my head against some of the practical limitations of Singular Value Decomposition (SVD). GHA is a Hebbian, neural-network-like algorithm that approximates the eigen decomposition you would otherwise get from SVD. Its added bonus is that it supports incremental training, so you can refine your model with new data without having to recompute over the entire dataset.

Your Trusty SVD Tool

SVD is one of those tools that every machine learning practitioner and computational geek pulls out at some time or another. It’s a powerful matrix factorization technique that lets you get at a matrix’s eigenvectors and eigenvalues. One reason it gets used so often is that it works on those pesky M x N rectangular matrices, which we data junkies tend to generate.

For most small problems I can just use scipy and numpy’s svd and never give it a second thought. LAPACK’s suite of SVD routines powers the svd functions of scipy, numpy, and MATLAB, among others. LAPACK is built for dense matrices and processes them in their entirety. What happens when you start dealing with problems in high-dimensional space? Those dense representations and full-matrix processing get expensive. So when your problem is better suited to sparse matrices, you tend to run into out-of-memory errors, non-convergence… no SVD.
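For the small dense case it really is a one-liner; here is a quick reminder of what that looks like and how the pieces relate back to eigen decomposition (the random matrix is just for illustration):

```python
import numpy as np

A = np.random.rand(500, 80)                       # a modest M x N matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # dense SVD, LAPACK under the hood

# The rows of Vt are the eigenvectors of A.T @ A, and the squared
# singular values are the corresponding eigenvalues.
eigenvalues = s ** 2
```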

At the time I was considering a problem that would be well-suited for incremental training, meaning I did not want to have to rerun the entire dataset through SVD after adding a small set of new data; GHA lets you avoid that sort of inconvenience and approximates the same outcome (as far as my problem was concerned).
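For the curious, the heart of GHA (Sanger’s rule) is a single per-sample update. A minimal numpy sketch, assuming samples arrive one at a time and the rows of W hold the current eigenvector estimates:

```python
import numpy as np

def gha_step(W, x, lr=1e-3):
    """One Generalized Hebbian Algorithm (Sanger's rule) update.

    W  : (k, n) array, rows are the current top-k eigenvector estimates
    x  : (n,) array, a single data sample
    lr : learning rate
    """
    y = W @ x                                      # project the sample
    # Hebbian term minus a lower-triangular deflation term keeps the
    # rows converging to distinct, ordered eigenvectors.
    W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W
```

New data just means more calls to the same update; nothing has to be recomputed from scratch.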

GHAPACK

GHAPACK is written by Genevieve Gorrell and based on her work using the GHA to perform Latent Semantic Analysis (LSA).

“Offline” Calculations

My first order of business upon joining the project was to get offline training working. This lets you compute a pseudo-SVD of a massive matrix without having to load the whole thing into memory. No more out-of-memory crashes; now you’re just limited by the resiliency of your hardware. This is now working.
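Conceptually, offline training just streams the matrix through the update rule one row at a time, so only a single row is ever resident in memory. A rough Python sketch of the idea (GHAPACK itself is C, and the whitespace-separated file format here is made up for illustration):

```python
import numpy as np

def offline_gha(path, n_cols, k, lr=1e-3, passes=5):
    """Train GHA by streaming rows of a huge matrix from disk,
    one whitespace-separated row per line."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(k, n_cols))      # small random start
    for _ in range(passes):
        with open(path) as fh:
            for line in fh:
                if not line.strip():
                    continue
                x = np.array(line.split(), dtype=float)   # one row at a time
                y = W @ x
                W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W
```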

Memory Management

I addressed a few memory leaks, but will likely do some restructuring to optimize memory management.

Resource Library

I would like the core of the GHA magic to be extracted into a library that others can embed in their own projects. So I intend to move the core functionality into a library and restructure the existing apps into command-line tools that use it.

Performance

GHA, off the bat, is not known for its speed compared to some other eigen decomposition approaches. On top of that, there is room for some major performance gains in GHAPACK itself. Let’s see what we can squeeze out of it, perhaps by leaning on things like BLAS.
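The inner loop is dominated by matrix-vector and outer products, which map straight onto BLAS level-2 routines. Even from Python the gap between a hand-rolled loop and a BLAS-backed call is dramatic; the sizes here are arbitrary and this is only meant to illustrate the principle:

```python
import numpy as np
from scipy.linalg import blas

W = np.random.rand(50, 2000)
x = np.random.rand(2000)

# Hand-rolled: one multiply-add at a time in the interpreter
y_slow = [sum(W[i, j] * x[j] for j in range(x.size)) for i in range(W.shape[0])]

# The same product through a BLAS level-2 routine (dgemv)
y_fast = blas.dgemv(1.0, W, x)

assert np.allclose(y_slow, y_fast)
```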

Testing Framework

A testing framework that objectively tracks performance gains while ensuring computational integrity through unit tests makes refactoring work that much less stressful.
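Even something as small as a pytest-style check against numpy’s exact SVD would catch regressions early. A sketch, using the single-component special case of GHA (Oja’s rule) so the whole test fits in a few lines:

```python
import numpy as np

def test_gha_recovers_leading_singular_vector():
    rng = np.random.default_rng(42)
    v = rng.normal(size=20)
    v /= np.linalg.norm(v)
    # Rows lie mostly along v, so the first right singular vector is unambiguous.
    A = np.outer(rng.normal(size=300), v) + 0.05 * rng.normal(size=(300, 20))

    w = rng.normal(scale=0.01, size=20)
    for _ in range(50):                      # a few dozen passes over the data
        for x in A:
            y = w @ x
            w += 1e-3 * y * (x - y * w)      # Oja's rule: GHA with k = 1

    v_ref = np.linalg.svd(A, full_matrices=False)[2][0]
    cosine = abs(np.dot(w / np.linalg.norm(w), v_ref))
    assert cosine > 0.99
```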

Lots of work to do and a hearty thanks to Professor Gorrell for letting me join her efforts.

Other SVD Resources

There are other SVD libraries out there that will carry you farther if SVD itself is what you really want, rather than just a means to an end.

ScaLAPACK has parallel SVD code, which creams LAPACK’s performance when you have access to multiple cores and/or MPI. ARPACK and SVDPACK both offer Lanczos-based SVD solvers for sparse matrices, with ARPACK being well suited to parallel processing.
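scipy exposes the ARPACK route directly through scipy.sparse.linalg.svds, which is often all you need when only the top few singular triplets of a sparse matrix matter:

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A large, very sparse matrix that would be hopeless to densify
A = sparse_random(100_000, 5_000, density=1e-4, format="csr", random_state=0)

# Lanczos-based partial SVD via ARPACK: just the 10 largest singular triplets
U, s, Vt = svds(A, k=10)
```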

Edward Tufte: Presenting Data and Information

I had a chance to attend an Edward Tufte class this past week, and it truly was a pleasure. He has published a number of beautiful books on presentation and the visualization of data, so it was quite a treat to sit in on a presentation by someone who teaches about giving presentations for a living. The class was engaging, full of content, and certainly left me with a sense of excitement.

the class

One big takeaway for me was that clutter and the sense of being overwhelmed by data are not attributes of too much information, but rather consequences of poor design. How many times have you looked at an information dashboard or a chart in a meeting, only to get a headache trying to grasp what was being communicated? And yet we are capable of navigating and internalizing large amounts of information if it is properly displayed and explained; those are the truly elegant presentational designs.

The class covers the basic principles you would want to follow to present your data in a way that tells a story, and a persuasive one at that. Things like how to lay out and present data to facilitate the basic intellectual process one goes through when considering and weighing a proposal or story. I went into the class thinking I would learn some better ways to visualize and display complex datasets. I did pick up some better ideas in that area, but only as a result of the bigger insight I walked away with on how to make a better presentation.

The other majorly cool bonus was being less than a foot away from a first-edition Galileo printing. This, along with an early printing of Euclid, helped demonstrate the power of “breaking out of flatspace” by bringing something physical into a meeting.

bringing it home

So after all this excitement I went home and looked for ways to integrate Tufte’s design principles into my own presentations and reports. Edward Tufte’s website has a rich forum called Ask E.T., which contains discussion of presentation in a number of areas, including project management. One Ask E.T. thread led me to a project on Google Code that contains a Tufte-inspired LaTeX layout.

I expect to play around with some of these designs and see how I might better polish my own reports.