We recently ran into a bug involving improper encoding of Unicode data using Django, MySQL, and django-multidb. It took us a little while to track it down so I just wanted to take the opportunity to post a description of the problem and the resolution to help any others out there running into similar issues.
We were anxiously looking forward to Alex Gaynor’s multidb efforts, but needed something in the near term to help our site scale. After looking at a few options, including building our own, we settled on Mike Malone’s django-multidb. I first heard about Mike’s django-multidb in his Scaling Django Presentation. It was a perfect solution for our needs: it gave us the ability to manage master-slave databases within Django, it was very simple, and offered just the right amount of flexibility.
We started receiving sporadic UnicodeEncodeErrors. The tracebacks were reporting that the system was unable to encode certain Unicode strings into latin1. Latin1? Who wants latin1? We use utf8 as our standard character set for both Django and MySQL (client & server).
So where was this latin1 encoding request sneaking in from? No rogue .encode(‘latin1′)s were popping up in codebase searches. The Django MySQL backend certainly looked like it was doing its job, but we had to validate that the cursors being generated were in fact being set with the appropriate charset. We followed the path back and eventually started intercepting a few MySQLdb cursors. Once we started debugging and probing cursors it became clear that they were in fact using latin1 as the default character set.
You keep using that word. I do not think it means what you think it means.
I learned an interesting tidbit along the way that explains why it took us awhile to diagnose the problem. When you don’t specify a character set or encoding, MySQL’s default encoding is called “latin1″. Except by “latin1″ MySQL does not mean “latin1″ of ISO 8859-1 fame, but rather the Windows cp1252 code page. This occurs even though MySQL does know what cp1252 is and is fully capable of honoring that character set separately by name. Really.
So, imagine MySQLdb asking the MySQL server what charset it prefers and the server replies “latin1″. MySQLdb then says, “Awesome, I know latin1,” and they go on chatting and passing information. This works just fine until the MySQL server passes back a bytestring representing a string once stored in its database containing the Unicode entity U+2019 ( ’ ). This RIGHT SINGLE QUOTATION MARK can easily be encoded in cp1252, but it cannot be represented by latin1 (The real ISO 8859-1 one). MySQLdb receives the cp1252-encoded bytestring and attempts to decode as if it were latin1 and lo and behold it throws an exception for attempting the impossible.
Now imagine the encoding mess occurring within a SQL query. MySQLdb opens a connection with the server, agrees to communicate using the latin1 charset, and then prepares to send a query containing a right single smart quote (U+2019). It takes the Unicode query string and attempts to .encode(‘latin1′). BLAM! Encoding error.
The problem ended up being an inconspicuous bug in django-multidb that restored Django’s backend cursor settings to system defaults, which resulted in any preferred character sets being ignored. The latin1-cp1252 confusion was then free to crop up. I know many folks including myself looked right over the code and bug many times without noticing. No worries. Mike Malone has already patched the django-multidb repository over at github. So, at this point you just need to update your project with the latest django-multidb code and you should be good to go.
Regarding the MySQL-MySQLdb latin1 debacle, it seems simple enough to make MySQLdb call MySQL’s bluff. Maybe another time…until then I think I will avoid this issue and stick with explicitly defining utf8 as my character set of choice.