The Case for a Git-Powered Project Gutenberg…

Project Gutenberg, for those of you who are not already familiar with it, is one of the single most important community projects of the century: an attempt to create a digital library of free books in a variety of formats, preserving classics and other works of literature from all ages. At the time of this post, the project boasts an impressive 38,000 works whose copyrights have expired and which have been released into the public domain.

Project Gutenberg (PG from here on out) indexes not only the text of these titles, but also their original illustrations and metadata (author(s), publisher(s), date(s), illustration(s), etc., and, most importantly, bookmarks and tables of contents). The process of “creating” a book comprises many steps: scanning the original book, using OCR to convert the scanned images to text, manually reviewing the converted text for OCR errors, fixing formatting (footnotes, endnotes, spacing, etc.), marking bookmarks and jump locations, creating a table of contents, and finally, to borrow a software term, “building” the files into many different formats to cover the highly fragmented spectrum of eBook file types.
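To make the shape of that process concrete, here is a minimal Python sketch that treats the steps above as an ordered pipeline a book moves through. The stage names and the `Book` class are my own shorthand for this post, not PG's actual tooling or terminology.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stage names paraphrasing the steps described above;
# they are illustrative and not taken from Project Gutenberg's tooling.
STAGES = [
    "scan",
    "ocr",
    "proofread",
    "fix_formatting",
    "mark_bookmarks",
    "create_toc",
    "build_formats",
]


@dataclass
class Book:
    """A book and the pipeline stages it has completed so far."""

    title: str
    completed: List[str] = field(default_factory=list)

    def next_stage(self) -> Optional[str]:
        """Return the first stage not yet completed, or None when done."""
        for stage in STAGES:
            if stage not in self.completed:
                return stage
        return None

    def advance(self) -> str:
        """Mark the next pending stage as completed and return its name."""
        stage = self.next_stage()
        if stage is None:
            raise ValueError(f"{self.title!r} has already been built")
        self.completed.append(stage)
        return stage


# Walk a placeholder book through every stage, printing each one in order.
book = Book("Example Title")
while book.next_stage() is not None:
    print(book.advance())
```

Even in this toy form, every stage is a place where a mistake can creep in, and nothing here records who changed what or why.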

The reason for this primer on how PG works is to give a sense of how complex the entire endeavor is, and of all the steps and components involved in the process. There are probably more steps that are not immediately apparent, and most of the steps listed above can probably be broken up into several smaller steps each. The point is, it’s an incredibly complicated and error-prone process. And even when it’s done without errors or mistakes, there’s always room for improvement. This is where the need for version control comes in.
