Friday, September 11, 2009

Ruby's Marshal and ActiveRecord and PostgreSQL bytea fields

I'm posting this because it took me about a day to figure out what was happening, and I couldn't find any writeup of the problem when Googling. Hopefully this'll save someone else some time.

The problem:

Ruby's ActiveRecord does a fair amount of processing on any data you put into a database column; the behavior is data-type dependent, so various escaping and conversion can happen (making sure integer fields are numeric, and so forth). This works great (and transparently) almost all of the time, except when it comes to the BYTEA binary type. When storing BYTEA data, ActiveRecord escapes your data in an asymmetrical way, either using the PostgreSQL C API's PQescapeByteaConn, if it's available, or a pure-Ruby implementation that does the same thing. This happens whether or not you actually call ActiveRecord::Base.save on the ActiveRecord object; it's part of ActiveRecord::Base.write_attribute.

Unfortunately, PQescapeByteaConn and its complement PQunescapeBytea aren't symmetrical (see the documentation for PQunescapeBytea. In particular, backslashes are treated poorly. You can prove this with the following snippet:
>> require 'rubygems'
=> true
>> require 'pg'
=> true
>> str = "\\"; puts PGconn.unescape_bytea(PGconn.escape_bytea(str)) + " = " + str
\\ = \
=> nil
I first discovered this when trying to write data using Marshal, which can handle unexpected double-backslashes (it just treats them as a single one), but doesn't know what to do when a backslash is replaced by an unexpected character. When you Marshal a string, it is prefixed with "\004\010X", where \004 is EOT, \010 is a linefeed, and X is the character at 8+the length of the string. If a string is 84 characters long, X is a backslash (84+8 = 92; 92.chr == "\"). Under most circumstances, this is okay, because as the string gets escaped and unescaped, you just end up with two backslashes, which Marshal deals with.

Unfortunately, if your string starts with three digits, as in "123andthen84charactersxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", the Marshaling process contains the substring "\123", which is treated by PGescapeByteaConn as an octal character, so what comes out during unescaping isn't two backslashes, but rather the octal character "\123". Marshal.restore either returns a truncated string or throws an exception, depending on the details.

The solution:

Well, the easy solution is to use a TEXT field instead of a BYTEA field; the escaping and unescaping won't happen, and backslashes won't be an issue. If you've got more stringent requirements (data other than the output of Marshal), you can either escape your backslashes manually (after Marshaling or what-have-you), or go so far as to encode your string in a backslash-free scheme, like base 64.

What really makes this bug dangerous when using Marshal is that most of the time Marshal will hide the problem by accepting two backslashes in place of one. It's only if you have an 84-character string starting with three digits that you'll see the error. (Actually, 83 is just the first collision; later Marshaled strings also have this problem, starting at some length after 12000.)

Anyway, hope this saves someone the trouble I had.

No comments: