Three days ago, I wrote a post about stumbling across uncensored/unredacted plain-text content in the DOJ Epstein archives that could theoretically be reconstructed and reversed to reveal the uncensored originals of some of the files. Last night, I succeeded in doing just that, and was able to extract the DBC12 PDF file from the Epstein archive EFTA00400459 document. Don’t get too excited: as expected, the exfiltrated document (available for download here) turned out to be nothing too juicy and is “just” another exemplar of the cronyism and the incestuous circles of financial funding that Epstein and his colleagues engaged in.
There is a lot I could say in this post about the different approaches I took and the interesting rabbit holes I went down in trying to extract valid base64 data from the images included in the PDF; however, I am somewhat exhausted from all this and don’t have the energy to go into it all in as much detail as it deserves, so I’ll just post some bullet points with the main takeaways:
Properly extracting the PDF images
I had made a mistake in how I converted the original PDF into images for processing. u/BCMM on r/netsec reminded me of a different tool from the poppler-utils package, pdfimages, which extracts the images exactly as they were embedded in the PDF, whereas my approach with pdftoppm was unnecessarily re-rasterizing the PDF and possibly reducing the quality (as well as increasing the size).
It turns out I had rasterized the PDF at a high enough resolution/density that the switch made little to no visual difference, but extracting the embedded images directly did get rid of the Bates numbering/identification at the bottom of each page, and it was definitely the approach I should have used from the beginning. Funnily enough, when I started typing out pdfima— in my shell, a history completion from the last time I used it (admittedly long ago) popped up, so I can’t even claim to have been unaware of its existence.
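For reference, the difference boils down to the following (a minimal sketch; the file name and DPI are illustrative, not the exact invocations I used):

# Minimal sketch of the two poppler-utils approaches (file name and DPI are illustrative).
import subprocess

# What I had been doing: re-rasterizing every page at a fixed density.
subprocess.run(["pdftoppm", "-r", "300", "-png", "EFTA00400459.pdf", "page"], check=True)

# What pdfimages does instead: losslessly extract the images exactly as embedded in the PDF.
subprocess.run(["pdfimages", "-all", "EFTA00400459.pdf", "img"], check=True)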
I re-uploaded the images (of the redacted emails) from EFTA00400459 to a new archive and you can download it here.
OCR is a no-go for base64 text
I learned much more than I ever wanted to1 about OCR. Some takeaways that I plan on turning into a proper post sometime soon (famous last words, I know):
- I had assumed OCR would be less prone to “hallucinations” than an LLM because the traditional OCR tools are all heuristic/algorithm-driven. This is true, and they do not hallucinate in the same sense that an LLM does, but they are not meant to be used for faithful reconstructions where byte-for-byte accuracy is desired. It turns out they all (more or less) work by trying to make sensible words out of the characters they recognize at a lower level (this is why OCR needs to understand your language, not just its script/alphabet, for best results). This obviously doesn’t work for base64, or anything else that requires context-free reconstruction of the original.
- Tesseract supposedly has a way to turn off at least some of these heuristics, but doing so did not improve its performance in any measurable way, though there’s always the chance that I was holding it wrong (a sketch of the kind of configuration I mean follows this list).
- The vast, vast majority of OCR use cases and training corpora involve proportional fonts (the opposite of a monospace font). In my experience, all the engines ironically performed better at recognizing images set in the much less regularly spaced (and, therefore, harder to algorithmically separate) proportional fonts than they did with monospace fonts.
- Adobe Acrobat OCR is terrible. Just terrible. Don’t pay for it.
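For the curious, the Tesseract tweaks I experimented with looked roughly like the following (a sketch via pytesseract; the input file name is a placeholder, and the specific variables and values are from memory and may not behave identically across Tesseract versions):

# Rough sketch of coaxing Tesseract into treating the page as context-free base64.
import pytesseract
from PIL import Image

BASE64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/="
)

config = (
    "--psm 6 "                                    # assume a single uniform block of text
    "-c load_system_dawg=0 -c load_freq_dawg=0 "  # disable the dictionary/word-frequency heuristics
    f"-c tessedit_char_whitelist={BASE64_ALPHABET}"
)

print(pytesseract.image_to_string(Image.open("page-2.png"), config=config))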
Trying a data science approach
My first idea for tackling this without OCR was to identify the bounding boxes for each character on the page (in order, of course) and cluster them using kmeans with k set to 64 (or 65 if you want to consider the = padding). Theoretically, there’s no reason this shouldn’t work. In practice, it just didn’t.
I’m sure the approach itself is sound, but the problem was getting regular enough inputs. However hard I tried, I was unable to slice the characters regularly and cleanly enough for scikit-learn’s KMeans module to give me even remotely sane results. I tried feeding it the images thresholded to black-and-white, greyscaled, and even as three-dimensional color inputs (essentially the lossless data). Every time, it would split some letters across repeated buckets while lumping genuinely different characters together into others.
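The core of the idea was nothing more than this (a simplified sketch with scikit-learn; the cell slicing is hand-waved here, and that was exactly the part that gave me grief):

# Simplified sketch of the clustering idea, assuming the glyph cells have already
# been sliced out of the page as identically-sized crops.
import numpy as np
from sklearn.cluster import KMeans

def cluster_glyphs(cells: list[np.ndarray], n_clusters: int = 65) -> np.ndarray:
    """Cluster glyph crops into one bucket per base64 symbol (plus '=' padding)."""
    X = np.stack([c.astype(np.float32).ravel() / 255.0 for c in cells])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    # Returns one cluster label per glyph; mapping labels back to characters is a separate step.
    return km.fit_predict(X)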
Instead of letting k-means decide the buckets for me, I decided to try seeding the buckets myself: rendering each letter of the base64 alphabet in Courier New at the right scale/point size to match the scans, then using different methods to sort the scanned letters into the right bins. But despite these being monospaced fonts, and perhaps as an artifact of the resolution the images were captured at, I couldn’t stop some glyph crops from including a hint of the letter that came before/after, and tweaking the slicing one way (adding a border on one side) would solve it for some inputs but break others (e.g. if you clip even one pixel from the right-hand side of the box surrounding the letter O at the font size and DPI the images were captured/rendered in, it turns into a C).
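In code, the seeded-bucket approach looked something like this (a sketch using Pillow; the font path, point size, and cell dimensions are placeholders – matching the scans’ actual metrics was the hard part):

# Sketch of seeding one reference bucket per base64 character by rendering Courier New,
# then assigning each scanned cell to the nearest reference glyph.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
CELL = (14, 24)  # placeholder (width, height); in reality this had to match the scans exactly

def render_reference_glyphs(font_path: str = "cour.ttf", point_size: int = 20) -> dict:
    font = ImageFont.truetype(font_path, point_size)
    refs = {}
    for ch in ALPHABET:
        img = Image.new("L", CELL, color=255)                     # white cell
        ImageDraw.Draw(img).text((0, 0), ch, font=font, fill=0)   # black glyph
        refs[ch] = np.asarray(img, dtype=np.float32) / 255.0
    return refs

def classify_cell(cell: np.ndarray, refs: dict) -> str:
    """Pick the reference glyph with the smallest mean squared difference from the crop."""
    cell = cell.astype(np.float32) / 255.0
    return min(refs, key=lambda ch: np.mean((refs[ch] - cell) ** 2))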
I turned to LLMs for help, but they are really bad at deriving things from first principles even when you give them all the information and constraints beforehand. Gemini 3 kept insisting I try HOG (histogram of oriented gradients) features on the inputs and would repeatedly push a connected-components approach to weed out noise (some on r/netsec claimed these images were 100% noise-free and without any encoding artifacts – this is patently not true), despite the fact that it absolutely destroys the recognition even if you apply the same algorithm to your seed buckets.
This approach with the seeded buckets yielded ~fair results, but it was completely stymied by the l vs 1 conundrum discussed in the previous post. Subpixel hinting differences between the ClearType algorithm on the PCs the DoJ used to originally render these emails and the font rendering engine used by OpenCV might look like nothing, but when the difference between an l and a 1 is 3 pixels at most, it adds up. It also didn’t help that my bounding-box algorithm for drawing a grid around the text and isolating each letter wasn’t perfect – I think there was a fractional-pixel mismatch everywhere in the 2x (losslessly) upscaled sources I was using, and that threw everything off (while using the 1x sources made it hard to slice individual characters, though I know the algorithm could have been tweaked to fix this).
Image kernels can’t solve bad typography
A number of people in the comments and on social media kept suggesting applying various kernels to the input in order to get better OCR results. But OCR wasn’t just getting 1 vs l wrong; it was making up or omitting letters altogether, giving me 75-81 characters per line instead of the very-much-fixed 78 monospaced characters I was feeding it. Using kernels to improve the situation with the classifier I wrote turned out to be just as futile: if you darken pixels above a certain threshold, you make the difference between the 1 and the l more noticeable, but at that threshold you also darken the subpixel hinting around the lowercase w, turning it into a capital N.2 The kernels that make one pair of letters distinguishable ultimately hurt the recognition of another pair, unless you know in advance which characters to apply them to and which not to… which would imply you had already figured out which letter was which!
Except that’s not true: of course you could make a multi-level classifier that first identifies just the ambiguous pairs like l and 1, then feeds those into a secondary classifier that applies the kernels and buckets them separately. But you know what? I didn’t think of that at the time. Mea culpa!
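For reference, the kind of kernel preprocessing being suggested amounted to something like this (an OpenCV sketch; the kernel, threshold value, and file name are arbitrary examples, not anything prescribed by the commenters):

# Sketch of the suggested kernel-plus-threshold preprocessing.
import cv2
import numpy as np

img = cv2.imread("cell.png", cv2.IMREAD_GRAYSCALE)

# A generic 3x3 sharpening kernel to exaggerate stroke edges.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=np.float32)
sharpened = cv2.filter2D(img, -1, sharpen)

# Thresholding makes the serif that separates a 1 from an l more pronounced...
_, binarized = cv2.threshold(sharpened, 160, 255, cv2.THRESH_BINARY)
# ...but the same cutoff also amplifies the subpixel-hinting fringes that push
# other glyphs (like that lowercase w) toward the wrong shape.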
CNNs really are powerful dark magic
With the traditional image processing/data science classification approach not working too well, I decided to try another tack and see if I could train a CNN to simultaneously handle the discrepancies from the imperfect slicing around each character/cell, the noise from adjacent cells abutting into the cell being analyzed, and the difference between 1 vs l. And you know what? I think I’m going to have to start using CNNs more!
They really are a powerful answer to a lot of this stuff. After typing out just two lines of base64 data from the first page of the PDF3 and training the CNN on those as ground truth, it was able to correctly identify the vast majority of characters in the input corpus – or at least the ones I could visually confirm it had gotten right, meaning I couldn’t be sure whether it had handled l vs 1 correctly or not.
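The network itself was nothing exotic; think something along these lines (a PyTorch sketch of a small conv/pool stack feeding a classifier over the 65-symbol alphabet, with placeholder layer sizes and cell dimensions rather than my actual code):

# Rough sketch of a small per-glyph classifier over 65 classes (the base64 alphabet
# plus '=' padding). Layer widths and the input cell size are placeholders.
import torch
import torch.nn as nn

class GlyphCNN(nn.Module):
    def __init__(self, n_classes: int = 65):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),   # lazy layer avoids hard-coding the cell size
            nn.Linear(128, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, cell_height, cell_width) grayscale glyph crops
        return self.classifier(self.features(x))

# Training is the usual CrossEntropyLoss + Adam loop over the hand-typed ground-truth lines.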
It turns out the alignment grid was off by a vertical pixel or two by the time it reached the end of the page, so tweaking the training algorithm to train against the top x lines and the bottom x lines of the page separately got rid of the remaining recognition errors… or at least, it did for the first page.

It’s really hard to see, but while the grid is almost identically placed around the characters at the top-left vs bottom-right, the subtle difference is there.
But alignment issues with how the bounding box was drawn on subsequent pages pushed the error rate well above the necessary zero, and it wasn’t until I realized that, since these pages weren’t scanned but rather rendered into images digitally, I could just have the training/inference script memorize the grid from the first page and reuse it on subsequent pages that I was able to get rid of all the recognition errors on all pages. (As an interesting sidebar, even augmenting the training data with artificially generated inputs featuring subtle randomized (-2px to +2px) vertical/horizontal shifts wasn’t enough to address the grid drift.)
Well, all except the same pesky 1 vs l which still plagued the outputs and led to decode errors.
I admit I wasted a lot of time here barking up the wrong tree: I spent hours tweaking the CNN’s layout, increasing the kernel size of the first layer, getting rid of the second MaxPool, further augmenting the input samples, trying denoising and alignment techniques, and much more, trying to induce the CNN into correctly recognizing the two-pixel difference between the ones and the ells – all to no avail. I kept adding training data, meticulously typing out line after line of base64 (becoming thoroughly sick and tired of zooming and panning) to see if that’s what it would take to nudge the error rate down further.

A debug view I added to the training script after I mistyped a character one time too many. For each training-provided “ground truth” bucket, it displays the crop that deviates the most from the average shape of all the characters in that bucket.
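(The math behind that view is trivial – average the crops labeled with a given character, then surface the one furthest from that average; a sketch:)

# Sketch of the debug view's core: for each ground-truth bucket, find the training crop
# that deviates most from the bucket's average shape (i.e. a likely mislabeled character).
import numpy as np

def worst_outlier_per_bucket(buckets: dict[str, list[np.ndarray]]) -> dict[str, tuple[int, float]]:
    report = {}
    for ch, crops in buckets.items():
        stack = np.stack([c.astype(np.float32) for c in crops])
        mean = stack.mean(axis=0)
        deviations = np.abs(stack - mean).sum(axis=(1, 2))
        worst = int(np.argmax(deviations))
        report[ch] = (worst, float(deviations[worst]))
    return report  # per character: index of the most suspicious crop and how far off it is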
It was only after I took a break for my sanity and came back to doggedly tackle it again that I noticed one of the errors qpdf reported for the recovered PDF never changed, no matter how much I tweaked the CNN or how much additional training data I supplied – and that’s when I realized the problem: despite zooming in and spending a good 5 or 10 seconds on each 1 vs l I entered into the training data, I had still gotten some wrong! A second email forward/reply chain with the same base64 content was included in the DoJ archives (EFTA02154109), and while it wasn’t at a much better resolution or quality than the first, it was still different, and its 1s and ls were distinguished by different quirks. After I spotted one mistake I had made, I quickly found a few more (and this was even with zooming in on the training corpus debug view pictured above and verifying that the ones and ells had the expected differences!), and lo and behold, the recovered PDF validated and opened!
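(For completeness, the final assembly and validation step boils down to this – a sketch with placeholder file names, using qpdf’s --check mode as the pass/fail signal:)

# Sketch of the final step: join the recognized base64 lines, decode them, and let qpdf
# report whether the reconstructed PDF is structurally sound. File names are placeholders.
import base64
import subprocess

with open("recognized_lines.txt") as f:
    b64 = "".join(line.strip() for line in f)

with open("recovered.pdf", "wb") as out:
    out.write(base64.b64decode(b64))

# qpdf --check prints structural errors and exits non-zero if the PDF is damaged;
# one persistent error here was the clue that some training labels were wrong.
subprocess.run(["qpdf", "--check", "recovered.pdf"])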
I posted the code that solved it to GitHub, but apologies in advance: I didn’t take the time to make the harness scripts properly portable. Everything works and should run on your machine as well as it runs on mine, but the tooling is a very idiosyncratic mix: the scripts are written in fish instead of sh or even bash, they use parallel (which I actually hate), and they have some WSL-isms baked into them for (my) practicality. The README was rushed, and some of the (relative) paths to input files are baked in instead of being configurable. Maybe I’ll get to it. We’ll see.
As to what’s next, well, there are more base64-encoded attachments where this one came from. But unfortunately I cannot reuse this approach, as the interesting-looking attachments I’ve found are all in proportional fonts this time around!
Follow me at @mqudsi or @neosmart for more updates or to have a look at the progress posts I’ve shared over the past 48 hours while trying to get to the bottom of this.
Thanks for tuning in!
A note from the author: I do this for the love of the game and in the hope of exposing some bad people. But I do have a wife and small children, and this takes me away from the work that pays the bills. If you would like to support me on this mission, you can make a contribution here.
Okay, that’s not true. I love learning about random stuff and I don’t know why people say this about anything. ↩
If I remember correctly! I wasn’t taking notes (I wish I had, and more screenshots, too) and it could have been two different characters. ↩
Well, second page, actually. The first page is mostly just the correspondence with one line of base64, but from the second page onwards it’s 100% base64 content. ↩

EFTA00382108 is easily decoded as well. Air Travel Details.