The problem:Ruby's ActiveRecord does a fair amount of processing on any data you put into a database column; the behavior is data-type dependent, so various escaping and conversion can happen (making sure integer fields are numeric, and so forth). This works great (and transparently) almost all of the time, except when it comes to the BYTEA binary type. When storing BYTEA data, ActiveRecord escapes your data in an asymmetrical way, either using the PostgreSQL C API's
PQescapeByteaConn, if it's available, or a pure-Ruby implementation that does the same thing. This happens whether or not you actually call
ActiveRecord::Base.saveon the ActiveRecord object; it's part of
PQescapeByteaConnand its complement
PQunescapeByteaaren't symmetrical (see the documentation for
PQunescapeBytea. In particular, backslashes are treated poorly. You can prove this with the following snippet:
I first discovered this when trying to write data using Marshal, which can handle unexpected double-backslashes (it just treats them as a single one), but doesn't know what to do when a backslash is replaced by an unexpected character. When you Marshal a string, it is prefixed with
>> require 'rubygems'
>> require 'pg'
>> str = "\\"; puts PGconn.unescape_bytea(PGconn.escape_bytea(str)) + " = " + str
\\ = \
\010is a linefeed, and X is the character at 8+the length of the string. If a string is 84 characters long, X is a backslash (
84+8 = 92; 92.chr == "\"). Under most circumstances, this is okay, because as the string gets escaped and unescaped, you just end up with two backslashes, which Marshal deals with.
Unfortunately, if your string starts with three digits, as in "123andthen84charactersxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", the Marshaling process contains the substring "\123", which is treated by
PGescapeByteaConnas an octal character, so what comes out during unescaping isn't two backslashes, but rather the octal character "\123".
Marshal.restoreeither returns a truncated string or throws an exception, depending on the details.
The solution:Well, the easy solution is to use a TEXT field instead of a BYTEA field; the escaping and unescaping won't happen, and backslashes won't be an issue. If you've got more stringent requirements (data other than the output of Marshal), you can either escape your backslashes manually (after Marshaling or what-have-you), or go so far as to encode your string in a backslash-free scheme, like base 64.
What really makes this bug dangerous when using Marshal is that most of the time Marshal will hide the problem by accepting two backslashes in place of one. It's only if you have an 84-character string starting with three digits that you'll see the error. (Actually, 83 is just the first collision; later Marshaled strings also have this problem, starting at some length after 12000.)
Anyway, hope this saves someone the trouble I had.