Since we live in an imperfect world where we often have to trade ease-of-use for simplicity and abstractions for performance, while dealing with computer systems and standards you’ll often run into tasks that seem like they should be really obvious and simple only to later discover that in reality they lead down a rabbit hole. Such is the case anytime a developer comes into contact with the individual bits of a binary blob – and that can happen even when you least expect it.
The premise is simple: given a database powered by, for example, the extremely popular MySQL or SQLite RDBMS engines that do not offer “native” representation of UUIDs or GUIDs, database designers and systems programmers have a choice to make: do you store a GUID as a plaintext (
TEXT) and take the performance, memory, and storage hit that comes with it, or do you take off the gloves and dig out the
BLOB column type in your DDL?
For some quick context as to why this conundrum exists (and feel free to skip this and the next paragraph if you’re already encoding savvy): any time you see data that takes the form of a sequence of hexadecimal characters, you can bet your bottom dollar that it’s only a representation of the underlying data – and not the data itself – that you’re dealing with. As a concrete and very relevant example, the ubiquitous UUID:
What you’re seeing here is a sequence of ascii characters. If you count them, braces and dashes aside, you’ll find there are 32 of them. 32 ascii characters take up 32 bytes. But a GUID or UUID is only sixteen bytes long – how come!? That’s because a UUID is not defined as a sequence of characters, but rather a sequence of bytes – and given the fact that only seventy-nine of one hundred and twenty-eight possible values that can be stored in a single byte are “printable text” (that’s
~, and everything in between),1 we can’t actually display the GUID without resorting to binary or hexadecimal notation.
Back to the matter at hand: when you tell your database to store our UUID from before in a
TEXT field, it has no clue that you are telling it that it can take the 32 bytes you give it and shove them into a 16-byte field instead. Think about it, how can it know that after you execute the SQL query,
INSERT INTO mytable ('guid') VALUES ("4c36947a-f5ad-11e7-8f4c-7085c25fb679")
that you won’t follow that up with another:
INSERT INTO mytable ('guid') VALUES ("not a guid")
It can’t. You told it to treat this column as UTF8/ASCII text data, and that’s what it’s doing. In fact, the query we executed above won’t take up just 32 bytes, because as far as the database is concerned, the
- in the GUID is an integral part of the column value, and can’t just be dropped and reinserted willy-nilly. So it dutifully complies and stores your 16-byte GUID as a whopping 32, 36, or even 38-byte value (with the opening and closing braces and the dividing dashes), taking up 238% of the space it actually needs!
So sane developers store their GUIDs in
BLOB columns, as fixed 16-byte values. At runtime, the database driver/adapter/wrapper then converts the saved value from an array of 16 bytes to whatever datatype the software uses to represent a GUID, and the rest, as they say, is history.
But what happens when you need to manually dig into your database and look something up? Here’s what happens when we try to execute a simple query in SQLite that contains the GUID stored in a blob column:
U���&<ZE��Z�T�? What the heck is that? That’s not a GUID! That’s our GUID represented as UTF-8, and smiley face symbols aside, it ain’t pretty. But how can we get the version of the GUID that we know and love back from this abomination? And it’s not just for aesthetics – what if we need to match against another developer’s database, and they unwisely stored their GUIDs as text and not binary in their table?
The programmer, familiar as he is with ASCII, UTF-8, binary, hexadecimal, and all that other good stuff, might rush in with a solution like this,
SELECT substr(hex(guid), 1, 8) || '-' || substr(hex(guid), 9, 4) || '-' || substr(hex(guid), 13, 4) || '-' || substr(hex(guid), 17, 4) || '-' || substr(hex(guid), 21, 12) FROM [MyTable]
converting the binary value of the
guid column to hexadecimal (as that’s what our 0-9 and A-F characters are), and then splitting it up into a sequence of
8-4-4-4-12 characters (keeping in mind that every two characters form one byte), and then rejoice at the fruits of his success:
If something about those results strikes you as odd,2 that’s because they are, in fact, wrong. A smarter developer knows that when dealing with representations of binary data, it’s not enough to simply convert from binary to hexadecimal and call it a day. The smarter developer realizes that there might be some significance to the fact that UUIDs and GUIDs are always seen in the same, familiar
8-4-4-4-12 format, and think to herself, “Maybe I should check Wikipedia on this one.”
You see, GUIDs aren’t defined as “a sequence of 16 bytes, divided into 4 bytes, 2 bytes, 2 bytes, 2 bytes, and 6 bytes” but rather as
|Name||Length (bytes)||Length (hex digits)||Contents|
|time_low||4||8||integer giving the low 32 bits of the time|
|time_mid||2||4||integer giving the middle 16 bits of the time|
|time_hi_and_version||2||4||4-bit “version” in the most significant bits, followed by the high 12 bits of the time|
|clock_seq_hi_and_res clock_seq_low||2||4||1-3 bit “variant” in the most significant bits, followed by the 13-15 bit clock sequence|
|node||6||12||the 48-bit node id|
The key here is the use of the word integer, which has a very specific meaning. Unlike a single byte, which reads the same backwards and forwards, an integer has an additional gotcha: it can be stored in the memory either in little-endian or big-endian format, with its least-significant byte coming first or last when viewed in the memory.
The first 3 chunks of the UUID (the
8-4-4) are byte-order dependent, while the last two chunks (the
4-12) are not. And to keep things complicated, it gets even weirder than that: while you’d think that big endian architectures such as Intel/AMD x86 would use big endian representation for the integers while little endian architectures such as MIPS or PowerPC alone would use little endian representation, that’s not the case at all.
Instead, when the
8-4-4-4-12 UUID/GUID was first adopted, most (software) platforms explicitly used little endian (aka network byte order) for UUIDs, while Microsoft used the representation used by the architecture the OS was running on (big endian for Windows).3
There’s some method to the madness here, though: while Microsoft’s approach was in line with how the operating system treated four- and two-byte integers, other platforms were focusing on the universal part of UUID, and therefore used a hard-coded (and easier to work with!) little endian approach regardless of the architecture of the machine in question, meaning that binary representations of a UUID could be copied to a different architecture/environment and still retain the same meaning.
Today these differences are why we have two official variants of UUID, as codified in RFC 4122. Variant 1 is the scheme initially used by most of the world, with the underlying bytes in little endian format, while Variant 2 is the variant used by Microsoft (encountered when dealing with .NET/Win32/COM
And that is why simply converting a
BLOB guid column to hexadecimal with SQL’s
HEX() function and then splitting the string after a certain number of characters just won’t do. Instead, we have to use an abomination like this:
SELECT substr(hguid, 7, 2) || substr(hguid, 5, 2) || substr(hguid, 3, 2) || substr(hguid, 1, 2) || '-' || substr(hguid, 11, 2) || substr(hguid, 9, 2) || '-' || substr(hguid, 15, 2) || substr(hguid, 13, 2) || '-' || substr(hguid, 17, 4) || '-' || substr(hguid, 21, 12) AS guid FROM (SELECT hex(guid) AS hguid FROM messages)
Which finally gives us the correct result we have been looking for:
If you’re interested in further optimizing the performance of tables containing UUIDs, this article from Percona is a good read. Note however that simply using UUIDv4 completely eliminates the problem they address, as results are completely randomized.
For the people following along at home, you can easily check this for yourself without digging out an ASCII table by simply hitting F12 and typing in
'~'.charCodeAt() - '0'.charCodeAt() + 1in the repl. ↩
When’s the last time you saw so many GUIDs starting with a sequence of zeroes? ↩
Yes, we’re purposely pretending Windows Server 2000 for Alpha, with DEC Alpha’s configurable bi-endian support never happened. Now go away, you smart aleck. ↩