Friday, September 11, 2009

Ruby's Marshal and ActiveRecord and PostgreSQL bytea fields

I'm posting this because it took me about a day to figure out what was happening, and I couldn't find any writeup of the problem when Googling. Hopefully this'll save someone else some time.

The problem:

Ruby's ActiveRecord does a fair amount of processing on any data you put into a database column; the behavior is data-type dependent, so various escaping and conversion can happen (making sure integer fields are numeric, and so forth). This works great (and transparently) almost all of the time, except when it comes to the BYTEA binary type. When storing BYTEA data, ActiveRecord escapes your data in an asymmetrical way, either using the PostgreSQL C API's PQescapeByteaConn, if it's available, or a pure-Ruby implementation that does the same thing. This happens whether or not you actually call ActiveRecord::Base.save on the ActiveRecord object; it's part of ActiveRecord::Base.write_attribute.

Unfortunately, PQescapeByteaConn and its complement PQunescapeBytea aren't symmetrical (see the documentation for PQunescapeBytea. In particular, backslashes are treated poorly. You can prove this with the following snippet:
>> require 'rubygems'
=> true
>> require 'pg'
=> true
>> str = "\\"; puts PGconn.unescape_bytea(PGconn.escape_bytea(str)) + " = " + str
\\ = \
=> nil
I first discovered this when trying to write data using Marshal, which can handle unexpected double-backslashes (it just treats them as a single one), but doesn't know what to do when a backslash is replaced by an unexpected character. When you Marshal a string, it is prefixed with "\004\010X", where \004 is EOT, \010 is a linefeed, and X is the character at 8+the length of the string. If a string is 84 characters long, X is a backslash (84+8 = 92; 92.chr == "\"). Under most circumstances, this is okay, because as the string gets escaped and unescaped, you just end up with two backslashes, which Marshal deals with.

Unfortunately, if your string starts with three digits, as in "123andthen84charactersxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", the Marshaling process contains the substring "\123", which is treated by PGescapeByteaConn as an octal character, so what comes out during unescaping isn't two backslashes, but rather the octal character "\123". Marshal.restore either returns a truncated string or throws an exception, depending on the details.

The solution:

Well, the easy solution is to use a TEXT field instead of a BYTEA field; the escaping and unescaping won't happen, and backslashes won't be an issue. If you've got more stringent requirements (data other than the output of Marshal), you can either escape your backslashes manually (after Marshaling or what-have-you), or go so far as to encode your string in a backslash-free scheme, like base 64.

What really makes this bug dangerous when using Marshal is that most of the time Marshal will hide the problem by accepting two backslashes in place of one. It's only if you have an 84-character string starting with three digits that you'll see the error. (Actually, 83 is just the first collision; later Marshaled strings also have this problem, starting at some length after 12000.)

Anyway, hope this saves someone the trouble I had.

Monday, September 7, 2009

It's not that I have more free time

Several times in the past couple of weeks, I've been seized with a notion for a blog entry longer than a tweet. I've forgotten most of the topics, but I still want to talk about my new camera and some pictures I've been taking.

Looks like Fairyland - 16
A couple of weeks ago, I picked up a Canon Powershot SD780 IS, which is turning out to be a great decision. It's so tiny I keep it in my pocket next to my phone. (Admittedly, between wallet, phone, camera, and keys, I think I've used up all my available pocket space.) I also ended up springing for the extended warranty, which explicitly covers my own "negligence", so I don't feel too worried trying to take pictures while riding my bike across the UCSF Mission Bay campus in the middle of the night. Someone also finally wrote a piece of software I've been waiting on for a couple of years now that lets me attach locations to pictures (geotag) in Aperture either by browsing around a Google Map or copying over a GPX tracklog from my phone. (A side benefit of using my phone to record my location is that I can make neat maps (w/ photos) of walking trips, as in The Park Less Traveled (map | photos).

Next up, bad Rails/PostgreSQL bugs!