Kurt Grandis

Software Engineering & Entrepreneurship


django-multidb, MySQLdb, and MySQL Encoding Errors

December 3rd, 2009 · Django, Python

We recently ran into a bug involving improper encoding of Unicode data using Django, MySQL, and django-multidb. It took us a little while to track it down so I just wanted to take the opportunity to post a description of the problem and the resolution to help any others out there running into similar issues.

We were anxiously looking forward to Alex Gaynor’s multidb efforts, but needed something in the near term to help our site scale. After looking at a few options, including building our own, we settled on Mike Malone’s django-multidb. I first heard about Mike’s django-multidb in his Scaling Django Presentation. It was a perfect solution for our needs: it gave us the ability to manage master-slave databases within Django, it was very simple, and offered just the right amount of flexibility.

The Problem

We started receiving sporadic UnicodeEncodeErrors. The tracebacks were reporting that the system was unable to encode certain Unicode strings into latin1. Latin1? Who wants latin1? We use utf8 as our standard character set for both Django and MySQL (client & server).

So where was this latin1 encoding request sneaking in from? No rogue .encode('latin1') calls were popping up in codebase searches. The Django MySQL backend certainly looked like it was doing its job, but we had to validate that the cursors being generated were in fact being set with the appropriate charset. We followed the path back and eventually started intercepting a few MySQLdb cursors. Once we started debugging and probing cursors it became clear that they were in fact using latin1 as the default character set.
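For anyone who wants to do the same kind of probing, here is a minimal sketch of the check, not the exact code we ran. It relies on the character_set_name() method that MySQLdb connection objects expose; the helper names and the FakeConnection class are hypothetical, and the fake exists only so the sketch runs without a MySQL server.

```python
def connection_charset(raw_connection):
    """Return the character set name the connection reports."""
    return raw_connection.character_set_name()


def check_charset(raw_connection, expected="utf8"):
    """Return (matches, actual) for the connection's character set."""
    actual = connection_charset(raw_connection)
    # MySQL may report variants such as "utf8mb4", so compare by prefix
    return actual.startswith(expected), actual


class FakeConnection:
    """Hypothetical stand-in for a MySQLdb connection (illustration only)."""

    def __init__(self, charset):
        self._charset = charset

    def character_set_name(self):
        return self._charset


# A connection that silently fell back to the default charset
print(check_charset(FakeConnection("latin1")))
```

Inside a Django project the raw connection should be reachable as django.db.connection.connection once a cursor has been opened, so the same helper can be pointed at the real thing.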

You keep using that word. I do not think it means what you think it means.

I learned an interesting tidbit along the way that explains why it took us a while to diagnose the problem. When you don’t specify a character set or encoding, MySQL’s default encoding is called “latin1”. Except by “latin1” MySQL does not mean “latin1” of ISO 8859-1 fame, but rather the Windows cp1252 code page. This occurs even though MySQL does know what cp1252 is and is fully capable of honoring that character set separately by name. Really.

So, imagine MySQLdb asking the MySQL server what charset it prefers and the server replies “latin1”. MySQLdb then says, “Awesome, I know latin1,” and they go on chatting and passing information. This works just fine until the MySQL server passes back a bytestring representing a string once stored in its database containing the Unicode character U+2019 ( ’ ). This RIGHT SINGLE QUOTATION MARK can easily be encoded in cp1252 (as the byte 0x92), but it cannot be represented in latin1 (the real ISO 8859-1 one). MySQLdb receives the cp1252-encoded bytestring and decodes it as if it were latin1. Python’s latin1 codec accepts every byte value, so the decode itself does not fail; the 0x92 byte quietly comes back as the control character U+0092 instead of the quotation mark, and the text is corrupted without any error.

Now imagine the encoding mess occurring within a SQL query. MySQLdb opens a connection with the server, agrees to communicate using the latin1 charset, and then prepares to send a query containing a right single smart quote (U+2019). It takes the Unicode query string and attempts to .encode('latin1'). BLAM! Encoding error.
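The mismatch is easy to reproduce without a database. The following is a small sketch using only Python’s built-in codecs (written for Python 3; the variable names are just for illustration). It shows that cp1252 can encode U+2019, that real latin1 cannot, and what happens when cp1252 bytes are decoded as latin1.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the character from the tracebacks
quote = "\u2019"

# cp1252 (what MySQL calls "latin1") has a slot for it: the byte 0x92
cp1252_bytes = quote.encode("cp1252")

# Real ISO 8859-1 has no such character, so encoding raises
try:
    quote.encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# In the other direction ISO 8859-1 maps every byte to a code point, so
# decoding does not raise; 0x92 becomes the control character U+0092
misread = cp1252_bytes.decode("latin-1")

print(cp1252_bytes, encode_failed, repr(misread))
```

The encode failure is the UnicodeEncodeError described above; the decode direction is quieter and simply yields the wrong character.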

The Solution

The problem ended up being an inconspicuous bug in django-multidb that restored Django’s backend cursor settings to system defaults, which resulted in any preferred character sets being ignored. The latin1-cp1252 confusion was then free to crop up. I know many folks, including myself, looked right over the code and the bug many times without noticing. No worries. Mike Malone has already patched the django-multidb repository over at GitHub. So, at this point you just need to update your project with the latest django-multidb code and you should be good to go.

Regarding the MySQL-MySQLdb latin1 debacle, it seems simple enough to make MySQLdb call MySQL’s bluff. Maybe another time…until then I think I will avoid this issue and stick with explicitly defining utf8 as my character set of choice.
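As a rough illustration of what “explicitly defining utf8” can look like, here is a sketch of a settings fragment in the DATABASES layout used by current Django versions (older releases used different setting names). The database name, user, and password are hypothetical placeholders; the relevant part is the charset entry in OPTIONS, which Django hands to MySQLdb when it opens the connection.

```python
# settings.py fragment (sketch; connection details are placeholders)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "example_db",          # hypothetical database name
        "USER": "example_user",        # hypothetical credentials
        "PASSWORD": "example_password",
        "OPTIONS": {
            # Ask for utf8 explicitly instead of trusting server defaults
            "charset": "utf8",
        },
    },
}
```

Stating the charset on every database alias matters most in a master-slave setup, since each connection negotiates its character set separately.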


Django in the Triangle

November 20th, 2009 · Django, Local Business, Python

Jacob Kaplan-Moss recently wrote about the growing size of the Django community. It seems as though we are starting to feel some Django-related growing pains here in North Carolina’s Research Triangle Park. Given recent developments on the Triangle Zope & Python User Group (TriZPUG) mailing list I thought I would take some time to discuss the current state of Django in the Triangle, who’s using it, and what is in the pipeline.

Django Jobs in the US (Trend data provided by Indeed.com)

Django’s growing popularity

First off, it’s important to note that Django adoption is growing nationwide. The included chart shows the number of posted “Django” jobs found on Indeed.com over the past few years. Notice a trend? Jacob Kaplan-Moss estimates the Django community may have grown somewhere on the order of 2-3x from 2007 to 2009. I definitely believe it and wouldn’t be surprised if it were higher. Between the volume of phone calls from recruiters and the number of people I run into using or talking about Django, its popularity is definitely on the rise in the Triangle.

Django in Action

Here’s a short list of shops in the Triangle who use Django in their day-to-day development:

I know there are other closeted folks out there using Django without full corporate blessing or knowledge. If there are other groups out there who would like to make this list please let me know.

A Triangle Django Users Group?

A new Google Group, TriDjUG (twitter: @TriDjUG), was recently created in order to help foster a healthy Django community. At the same time some good discussion erupted on the TriZPUG mailing list. Why bother splitting our local Python community? While the intention was never to split away, some good cases were made for operating under the umbrella of TriZPUG, chief among them strengthening the Python community as a whole. One exciting sentiment that came from TriZPUG members was that non-Django Python users were interested in Django and wanted to learn more during regular TriZPUG meetings. That would give us a captive audience at an already catered and organized event. Sounds good.

Surely there is some Django-specific fun to be had… For starters, we’re looking to sponsor local Django-related sprints maybe including one for the upcoming Django 1.2 release. For our community building/growing merit badges a couple of us are developing a Django Bootcamp; let’s continue to grow.

So it sounds like it’s up to us local Djangonauts to step up, participate in TriZPUG, and build greater awareness. People want to hear about our technologies, so let’s share. If you have other ideas we’d love to hear them. If you haven’t yet introduced yourself, swing by the mailing list or IRC (#trizpug) and say hello.


GHAPACK: A Library for the Generalized Hebbian Algorithm

February 8th, 2009 · Data, Machine Learning, Software Engineering

I recently joined a new open source project called GHAPACK. The project currently provides the functionality and the means to use the Generalized Hebbian Algorithm. I came across this project after banging my head against some of the practical limitations of Singular Value Decomposition (SVD). GHA is a Hebbian-based neural network-like algorithm that approximates SVD’s ability to perform eigen decomposition. Its added bonus is that it allows for incremental training so you can refine your model with new data without having to recompute using the entire dataset.

Your Trusty SVD Tool

SVD is one of those tools that every machine learning practitioner and computational geek will pull out at some time or another. It’s a powerful matrix factorization technique that gives you singular vectors and singular values, which are the eigenvectors and (square roots of the) eigenvalues of the matrix multiplied by its own transpose. One reason it tends to be used so often is that it works on those pesky M x N matrices, which we data junkies tend to generate.

For most small problems I can just use scipy and numpy’s svd and never give it a second thought. LAPACK’s suite of SVD routines powers the svd functions of scipy, numpy, and MATLAB, among others. LAPACK is designed for dense matrices and processes them in their entirety. What happens when you start dealing with problems in high-dimensional space? Those dense representations and full processing are expensive. So, when your problem space is better suited to sparse matrices, you tend to run out of memory or fail to converge…no SVD.
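For reference, the small-problem case looks something like the sketch below. It uses numpy’s svd on a random 5 x 3 matrix (the matrix and variable names are only for illustration) and checks the link between singular values and eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.normal(size=(5, 3))  # a small dense M x N matrix

# Thin SVD: matrix = u @ diag(s) @ vt
u, s, vt = np.linalg.svd(matrix, full_matrices=False)
reconstructed = (u * s) @ vt

# Squared singular values are the eigenvalues of matrix.T @ matrix.
# eigvalsh returns them in ascending order, so reverse to match s.
eigenvalues = np.linalg.eigvalsh(matrix.T @ matrix)[::-1]

print(np.allclose(reconstructed, matrix), np.allclose(s**2, eigenvalues))
```

Everything here lives in memory as dense arrays, which is exactly what stops scaling once the matrix gets large.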

At the time I was considering a problem that would be well-suited for incremental training, meaning I did not want to have to rerun the entire dataset through SVD after adding a small set of new data; GHA lets you avoid that sort of inconvenience and approximates the same outcome (as far as my problem was concerned).
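To give a feel for the incremental idea, here is a minimal numpy sketch of the Generalized Hebbian Algorithm update (Sanger’s rule) on synthetic data. It is not GHAPACK’s code; the function name, learning rate, and data are assumptions chosen for illustration. Each sample nudges the weight rows toward the leading singular directions, so new data can be fed in without revisiting the old.

```python
import numpy as np


def gha_update(weights, sample, learning_rate):
    """Apply one step of Sanger's rule to weights in place."""
    outputs = weights @ sample
    # The lower-triangular term decorrelates each row from the rows above
    lower = np.tril(np.outer(outputs, outputs))
    weights += learning_rate * (np.outer(outputs, sample) - lower @ weights)
    return weights


rng = np.random.default_rng(0)

# Synthetic data with well-separated variances along rotated axes
basis, _ = np.linalg.qr(rng.normal(size=(3, 3)))
data = (rng.normal(size=(2000, 3)) * np.array([3.0, 1.0, 0.3])) @ basis.T

# Learn the top two directions, one sample at a time
weights = rng.normal(scale=0.1, size=(2, 3))
for _ in range(5):
    for sample in data:
        gha_update(weights, sample, learning_rate=0.002)

# Compare the first learned row with the top right singular vector
_, _, vt = np.linalg.svd(data, full_matrices=False)
alignment = abs(weights[0] @ vt[0]) / np.linalg.norm(weights[0])
print(alignment)
```

An alignment near 1 means the first row points along the same direction the batch SVD finds. Adding more data later only requires more calls to gha_update, not a full recomputation.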

GHAPACK

GHAPACK is written by Genevieve Gorrell and based on her work using GHAs to perform Latent Semantic Analysis (LSA).

“Offline” Calculations

My first order of business upon joining the project was to get the offline training working. This allows you to compute a pseudo-SVD based on a massive matrix without having to load the whole thing into memory.  No more out-of-memory segfaults. Now, you’re just limited by the resiliency of your hardware. This is now working.

Memory Management

I addressed a few memory leaks, but will likely do some restructuring to optimize memory management.

Resource Library

I would like the core of the GHA magic to be extracted into a library that others could embed in their own projects. So, I intend to move core functionality into a library and restructure the existing apps into commandline tools that utilize those libs.

Performance

GHA, off the bat, is not known for its speed compared to some other eigen decomposition approaches. Even so, there is room for some major gains in performance. Let’s see what we can squeeze out of GHAPACK and perhaps lean on things like BLAS.

Testing Framework

A testing framework that objectively keeps track of performance gains while ensuring computational integrity through unit testing always makes refactoring work that much less stressful.

Lots of work to do and a hearty thanks to Professor Gorrell for letting me join her efforts.


Other SVD Resources

There are other SVD libraries out there that will carry you farther if SVD is what you really want and not necessarily the means to an end.

ScaLAPACK has parallel SVD code, which creams LAPACK’s performance when you have access to multiple cores and/or MPI. ARPACK and SVDPACK both offer Lanczos-based SVD solutions for sparse matrices with ARPACK being well-suited for parallel processing.
