I recently joined a new open source project called GHAPACK, which implements the Generalized Hebbian Algorithm (GHA) and the tools you need to use it. I came across the project after banging my head against some of the practical limitations of Singular Value Decomposition (SVD). GHA is a Hebbian, neural-network-like algorithm that approximates the eigen decomposition you would otherwise get from SVD. Its added bonus is that it trains incrementally, so you can refine your model with new data without recomputing over the entire dataset.
Your Trusty SVD Tool
SVD is one of those tools that every machine learning practitioner and computational geek will pull out at some time or another. It’s a powerful matrix factorization technique that gives you a matrix’s singular vectors and singular values, which are the eigenvectors and the square roots of the eigenvalues of A^T A and A A^T. One reason it gets used so often is that it works on those pesky M x N matrices that we data junkies tend to generate.
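To make the link between SVD and eigen decomposition concrete, here is a minimal NumPy sketch. The matrix is random illustrative data, not anything from GHAPACK; it only checks that the squared singular values of a non-square matrix match the eigenvalues of A^T A.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # a non-square M x N matrix (illustrative data)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigen decomposition of the symmetric matrix A^T A
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
eigvals = eigvals[::-1]  # eigh returns ascending order, so flip to match the SVD

# The factorization reproduces A, and s**2 equals the eigenvalues of A^T A
reconstruction_ok = np.allclose(A, U @ np.diag(s) @ Vt)
eigen_link_ok = np.allclose(s**2, eigvals)
print(reconstruction_ok, eigen_link_ok)
```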
For most small problems I can just use scipy and numpy’s svd and never give it a second thought. LAPACK’s suite of SVD routines powers the svd functions of scipy, numpy, and MATLAB, among others. LAPACK is designed for dense matrices and processes them in their entirety. What happens when you start dealing with problems in high-dimensional space? Those dense representations and full processing get expensive. So when your problem is really better suited to sparse matrices, you tend to run into out-of-memory errors, non-convergence… no SVD.
At the time I was considering a problem that was well suited to incremental training: I did not want to rerun the entire dataset through SVD after adding a small set of new data. GHA lets you avoid that sort of inconvenience and approximates the same outcome (as far as my problem was concerned).
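For a feel of what incremental training looks like, here is a minimal sketch of the GHA update (Sanger's rule) in NumPy. It is not GHAPACK's code: the function names, learning rate, and synthetic data are all assumptions made for illustration. The weights are trained on one batch, then refined with a second batch without revisiting the first, and the result is compared against a full SVD.

```python
import numpy as np

def gha_update(W, x, lr):
    """One Sanger's-rule step. W is (k, d), one row per component; x is (d,)."""
    y = W @ x
    # Hebbian term minus a lower-triangular term that decorrelates the rows
    W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

def gha_train(W, X, lr=0.002):
    """Feed samples one at a time; W is updated in place."""
    for x in X:
        gha_update(W, x, lr)
    return W

# Synthetic zero-mean data with well-separated variances along rotated axes
rng = np.random.default_rng(0)
d, k = 5, 2
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
scales = np.array([3.0, 1.5, 0.5, 0.2, 0.1])

def sample(n):
    return (rng.standard_normal((n, d)) * scales) @ Q.T

W = 0.1 * rng.standard_normal((k, d))

first_batch = sample(10000)
gha_train(W, first_batch)

# New data arrives later: keep training from the current weights
new_batch = sample(10000)
gha_train(W, new_batch)

# Reference answer: right singular vectors of all the data at once
_, _, Vt = np.linalg.svd(np.vstack([first_batch, new_batch]), full_matrices=False)
W_unit = W / np.linalg.norm(W, axis=1, keepdims=True)
cosines = np.abs(np.sum(W_unit * Vt[:k], axis=1))  # near 1 means same direction
print(cosines)
```

The fixed learning rate keeps the sketch short. A real run would likely need a decaying rate and some thought about convergence criteria.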
GHAPACK is written by Genevieve Gorrell and based on her work using GHAs to perform Latent Semantic Analysis (LSA).
My first order of business upon joining the project was to get the offline training working, and it now does. Offline training lets you compute a pseudo-SVD of a massive matrix without having to load the whole thing into memory. No more out-of-memory segfaults. Now you’re just limited by the resiliency of your hardware.
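The idea behind offline training can be sketched as follows, assuming a matrix stored on disk as a .npy file and read through a NumPy memory map. GHAPACK's actual file format and code are different; `stream_rows`, `gha_from_disk`, the chunk size, and the learning rate are hypothetical names and values chosen for this example.

```python
import os
import shutil
import tempfile
import numpy as np

def stream_rows(path, chunk_rows=256):
    """Yield rows of a .npy matrix while holding only one chunk in memory."""
    data = np.load(path, mmap_mode="r")  # memory-mapped, read-only
    for start in range(0, data.shape[0], chunk_rows):
        chunk = np.asarray(data[start:start + chunk_rows])  # read just this chunk
        for row in chunk:
            yield row

def gha_from_disk(path, k, lr=0.002, seed=0):
    """Train k GHA components from a matrix on disk, one row at a time."""
    rng = np.random.default_rng(seed)
    d = np.load(path, mmap_mode="r").shape[1]
    W = 0.1 * rng.standard_normal((k, d))
    for x in stream_rows(path):
        y = W @ x
        # Sanger's rule, same update as in-memory GHA
        W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

# Demo: write a small synthetic matrix to disk, then train without loading it whole
rng = np.random.default_rng(1)
X = rng.standard_normal((8000, 4)) * np.array([3.0, 1.0, 0.3, 0.1])
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "matrix.npy")
np.save(path, X)
W = gha_from_disk(path, k=1)
shutil.rmtree(tmp, ignore_errors=True)

# The data varies most along the first axis, so that is where W should point
top_direction = W[0] / np.linalg.norm(W[0])
print(top_direction)
```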
I addressed a few memory leaks, but will likely do some restructuring to optimize memory management.
I would like the core of the GHA magic to be extracted into a library that others could embed in their own projects. So I intend to move the core functionality into a library and restructure the existing apps into command-line tools built on that library.
Off the bat, GHA is not known for its speed compared to some other eigen decomposition approaches. Even so, there is room for some major gains in performance. Let’s see what we can squeeze out of GHAPACK, perhaps by leaning on things like BLAS.
A testing framework that objectively tracks performance gains while ensuring computational integrity through unit tests always makes refactoring work that much less stressful.
Lots of work to do and a hearty thanks to Professor Gorrell for letting me join her efforts.
Other SVD Resources
There are other SVD libraries out there that will carry you farther if SVD is what you really want and not necessarily the means to an end.
ScaLAPACK has parallel SVD code, which creams LAPACK’s performance when you have access to multiple cores and/or MPI. ARPACK and SVDPACK both offer Lanczos-based SVD solutions for sparse matrices, with ARPACK being well suited to parallel processing.
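If you are already working in scipy, the easiest way to try ARPACK on a sparse matrix is `scipy.sparse.linalg.svds`, which uses ARPACK as its default solver. The sketch below uses a random sparse matrix as stand-in data and asks for only the top three singular triplets, then checks them against a dense SVD (which is only feasible here because the example is tiny).

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# A 200 x 50 sparse matrix with about 5% non-zero entries (illustrative data)
A = sparse_random(200, 50, density=0.05, format="csr", random_state=0)

# Partial SVD: only the k largest singular values and vectors are computed
U, s, Vt = svds(A, k=3)

# svds does not promise descending order, so sort explicitly
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order]

# Sanity check against the dense SVD
dense_top = np.linalg.svd(A.toarray(), compute_uv=False)[:3]
print(s, dense_top)
```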