Project Gutenberg, for those of you who are not already familiar with it, is one of the single most important community projects of the century: an attempt at creating a digital library of free books in a variety of formats, preserving classics and other works of literature from all ages. At the time of this post, the project boasts an impressive 38,000 works whose copyrights have expired and which have been released into the public domain.
Project Gutenberg (PG from here on out) indexes not only the text of these titles, but also original illustrations, metadata (author(s), publisher(s), date(s), illustration(s), etc.), and, most importantly, bookmarks/tables of contents. The process of “creating” a book comprises many steps: scanning the original books, using OCR to convert the scanned images to text, manually reviewing the scanned contents for OCR conversion errors, fixing formatting (footnotes, endnotes, spacing, etc.), marking bookmarks and jump locations, creating tables of contents, and finally, to borrow a software term, “building” the files into many different formats to cover the very fragmented spectrum of eBook file types.
The reason for this primer on how PG works is to give a sense of how complex the entire endeavor is and of all the steps and components involved in the process. There are probably more steps that are not immediately apparent, and most of the steps listed above can probably be broken up into several more steps each. The point is, it’s an incredibly complicated and error-prone process. And even when it’s done without errors or mistakes, there’s always room for improvement. This is where the need for version control comes in.
At the start of any Project Gutenberg title, you’ll find something similar to this (between the PG listing info, some disclaimers, and volunteer acknowledgements), as taken from one of my favorite books of all time, Five Children and It, written by E. Nesbit, published in 1902, and now in the public domain, at least in the USA and the EU (date of death + 70 years).
*Project Gutenberg Etext of Five Children and It, by E. Nesbit*
*****This file should be named fivit10.txt or fivit10.zip******
Corrected EDITIONS of our etexts get a new NUMBER, fivit11.txt.
VERSIONS based on separate sources get new LETTER, fivit10a.txt.
Project Gutenberg books are textbook-perfect examples of non-source-code projects that would benefit immensely from version control in general, and from a distributed version control system (DVCS), such as Git, in particular.
As you can see from the PG excerpt above, each book can have “editions” and “versions.” While I think the definitions of “edition” and “version” have been confused (well, to be honest, they are plain wrong), the important thing is that there are separate publications of each eBook (editions, or what PG is calling “versions”), each with its own sequence of edits and fixes (versions, or what PG is calling “editions”). Books can have typos, formatting errors, OCR mistakes, incorrect metadata, and so on, all of which need fixing. They can also have features improved or added, such as a linked table of contents or annotations. Any of these changes warrants a new version of a particular edition/publication.
At the same time, books can be available in multiple editions (say, different languages, illustrated vs. non-illustrated, different publishing houses, reprints, improved texts, etc.) which are all equally correct and usable (i.e. one edition does not supersede the other).
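To make the naming scheme from the header above concrete, here is a minimal Python sketch that splits a PG-style filename into its parts and maps them onto VCS terms. The regular expression is my own guess based solely on the three filenames quoted above (fivit10.txt, fivit11.txt, fivit10a.txt), and the names EtextName, parse_etext_name, and to_branch_and_tag, as well as the branch/tag layout, are hypothetical. None of this is anything PG actually publishes.

```python
import re
from typing import NamedTuple, Tuple


class EtextName(NamedTuple):
    stem: str     # short title code, e.g. "fivit"
    number: int   # bumped for each corrected text (PG's "edition")
    letter: str   # marks a separate source text (PG's "version"); "" if none


# Assumed pattern, inferred only from the filenames quoted in the PG header.
_NAME = re.compile(r"(?P<stem>[a-z]+)(?P<number>\d+)(?P<letter>[a-z]?)\.(?:txt|zip)")


def parse_etext_name(filename: str) -> EtextName:
    """Split a PG-style filename into stem, number, and source letter."""
    match = _NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a recognised etext filename: {filename!r}")
    return EtextName(match["stem"], int(match["number"]), match["letter"])


def to_branch_and_tag(name: EtextName) -> Tuple[str, str]:
    """Map a parsed name onto a hypothetical branch/tag layout.

    Each separate source text becomes a branch; each corrected
    text becomes a tag on that branch.
    """
    branch = f"{name.stem}/{name.letter or 'main'}"
    tag = f"v{name.number}"
    return branch, tag


if __name__ == "__main__":
    for filename in ("fivit10.txt", "fivit11.txt", "fivit10a.txt"):
        parsed = parse_etext_name(filename)
        print(filename, "->", to_branch_and_tag(parsed))
```

The point of the sketch is only that the information PG currently packs into a filename is exactly what a VCS tracks natively, so the filename convention would become unnecessary.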
These attributes of PG titles scream the need for a versioning system. I don’t know whether PG uses a versioning system internally/privately, but no such repository is publicly available (to the best of my knowledge). It’s downright foolish not to take advantage of the wonders a good VCS can work with this sort of content: versions are revisions, editions are branches, commit logs preserve integrity for posterity, and an index of all changes is kept forever. Nothing is ever lost or overwritten, and the changes over time can be analyzed, indexed, and reviewed.
Now for the Distributed VCS part: PG accepts both new titles and revisions to existing titles from everyone. With 38,000 titles, that’s nothing to laugh about. Perhaps you spotted a typo in your copy of Five Children and It introduced during the OCR process, or maybe you’re adding a translation no one has entered before, or you’ve found a missing footnote – with a DVCS solution like Git or Mercurial, nothing could be easier than forking the original, making your changes, then opening a ticket to propose that PG merge your changes back into their “official” distribution!
Or perhaps it’s time to make things even more interesting: these are all public domain titles. That doesn’t just mean that you’re free to read, copy, and distribute them all you like – you can even change what they say! No one can actually stop you from going around and changing your favorite classics. Maybe you want to rename a character, or change the way a particular story ends? It may be a sacrilege, but it’s your right, and with a DVCS, it becomes downright easy to make your changes while keeping them linked to the original, so you can pull any other corrections or fixes made to your “source” branch!
Or perhaps a less sacrilegious/offensive course of action would be to change a particular grammatical oddity. For instance, Edith Nesbit is famous for using “it” as a gender-neutral pronoun for situations where the person(s) being referred to can be male or female (i.e. not an object):
Everyone got its legs kicked or its feet trodden on in the scramble to get out of the carriage.
You can easily enough go through her book (though I’d shoot you if I ever met you afterwards!) and change that to something more conventional (the removal of “on” from “trodden on” is an OCD reaction to the poor parallelism in the original text and not associated with gender-neutral pronouns):
Everyone got his or her legs kicked or feet trodden in the scramble to get out of the carriage.
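To show what a version control system would actually record for an edit like this, here is a small Python sketch that computes a word-level diff between the original sentence and the revised one using the standard library’s difflib. The function name word_diff and the “-word”/“+word” output format are my own choices for illustration; a real VCS such as Git stores and displays changes in its own way.

```python
import difflib
from typing import List

ORIGINAL = (
    "Everyone got its legs kicked or its feet trodden on "
    "in the scramble to get out of the carriage."
)
REVISED = (
    "Everyone got his or her legs kicked or feet trodden "
    "in the scramble to get out of the carriage."
)


def word_diff(before: str, after: str) -> List[str]:
    """List the words removed ("-word") and added ("+word") between two texts."""
    old_words, new_words = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes: List[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" contributes both removals and additions.
        if tag in ("delete", "replace"):
            changes.extend(f"-{word}" for word in old_words[i1:i2])
        if tag in ("insert", "replace"):
            changes.extend(f"+{word}" for word in new_words[j1:j2])
    return changes


if __name__ == "__main__":
    print(" ".join(word_diff(ORIGINAL, REVISED)))
```

Run against the two sentences above, the diff reports both instances of “its” and the stray “on” as removals and “his or her” as the addition, which is precisely the kind of record that lets a later reader see what was changed and decide whether to accept it.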
In short, Project Gutenberg cries out for a DVCS interface. Not only does it make linking editions and versions a lot easier and more manageable (imagine if that were 38 million titles instead of 38 thousand, each being actively maintained and fixed!), it also preserves changes and makes it easy to find who’s responsible for what. It makes it easy for anyone to contribute fixes and changes back (à la Wikipedia), and introduces some very interesting possibilities into the mix (a GitHub-like fan-fiction site!). And, as a freebie, it makes something like integrating nightly build systems to generate the various output formats a breeze!
We’ll be contacting Project Gutenberg with a link to this article and will even offer our services in helping set something like this up, to figuratively put our money where our mouth is. In the meantime, just think about how DVCS could revolutionize the world!