The Case for a Git-Powered Project Gutenberg…

Project Gutenberg, for those of you who are not already familiar with it, is one of the single most important community projects of the century: an attempt at creating a digital library of free books in a variety of formats, preserving classics and other works of literature from all ages. At the time of this post, the project boasts an impressive 38,000 works whose copyrights have expired and which are now in the public domain.

Project Gutenberg (PG from here on out) indexes not only the text of these titles but also the original illustrations, the metadata (author(s), publisher(s), date(s), illustration(s), etc.), and, most importantly, bookmarks and tables of contents. The process of “creating” a book comprises many steps: scanning the original books, using OCR to convert the scanned images to text, manually reviewing the scanned contents for OCR conversion errors, fixing formatting (footnotes, endnotes, spacing, etc.), marking bookmarks and jump locations, creating tables of contents, and finally, to borrow a software term, “building” the files into many different formats to cover the very fragmented spectrum of eBook file types.

The reason for this primer on how PG works is to give a sense of how complex the entire endeavor is and how many steps and components are involved. There are probably more steps that are not immediately apparent, and most of the steps listed above can probably be broken up into several more steps each. The point is, it’s an incredibly complicated and error-prone process. And even when it’s done without errors or mistakes, there’s always room for improvement. This is where the need for version control comes in.

At the start of any Project Gutenberg title, you’ll find something similar to this (between the PG listing info, some disclaimers, and volunteer acknowledgements), as taken from one of my favorite books of all time, Five Children and It, written by E. Nesbit, published in 1902, and now in the public domain, at least in the USA and the EU (date of death + 70 years):

*Project Gutenberg Etext of Five Children and It, by E. Nesbit*

*****This file should be named fivit10.txt or fivit10.zip******

Corrected EDITIONS of our etexts get a new NUMBER, fivit11.txt.
VERSIONS based on separate sources get new LETTER, fivit10a.txt.

Project Gutenberg books are textbook-perfect examples of non-source-code projects that would benefit immensely from version control in general, and from distributed version control (DVCS), such as Git, in particular.

As you can see from the PG excerpt above, each book can have “editions” and “versions.” While I think the definitions of “edition” and “version” have been confused (well, to be honest, plain wrong), the important thing is that there are separate publications of each eBook (editions, or what PG is calling “versions”) each with their own sequence of edits and fixes (versions, or what PG is calling “editions”). Books can have typos, formatting errors, OCR mistakes, incorrect metadata, etc. which need fixing or can have features improved or added, such as introducing a linked table of contents, adding annotations, and so on and so forth, all of which warrant the introduction of a new version of a particular edition/publication.

At the same time, books can be available in multiple editions (say, different languages, illustrated vs. non-illustrated, different publishing houses, reprints, improved texts, etc.) which are all equally correct and usable (i.e. one edition does not supersede the other).

These attributes of PG titles scream the need for a versioning system. I don’t know whether PG uses a versioning system internally/privately, but no such repository is publicly available (to the best of my knowledge). It’s downright foolish not to take advantage of the wonders a good VCS can work with this sort of content: versions are revisions, editions are branches, commit logs preserve integrity and posterity, and an index of all changes is forever kept. Nothing is ever lost or overwritten, and the changes over time can be analyzed, indexed, and reviewed.
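To make the “versions are revisions, editions are branches” mapping concrete, here is a minimal Python sketch. It is not anything PG publishes: the function name `plan_history`, the branch naming scheme, and the file name pattern (letters, two digits, an optional source letter) are all assumptions, inferred only from the three example names in the excerpt above.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed old-style PG name: letters, two digits, optional source letter.
_NAME = re.compile(r"^([a-z]+)(\d{2})([a-z]?)\.txt$")


def plan_history(filenames: List[str]) -> Dict[str, List[str]]:
    """Map each separate source to a branch and order its files as revisions."""
    branches: Dict[str, List[tuple]] = defaultdict(list)
    for name in filenames:
        match = _NAME.match(name)
        if match is None:
            continue  # not an old-style PG name; skipped in this sketch
        base, number, letter = match.groups()
        # Hypothetical branch naming: "<base>/main" for the unlettered source,
        # "<base>/source-<letter>" for each alternative source.
        branch = f"{base}/source-{letter}" if letter else f"{base}/main"
        branches[branch].append((int(number), name))
    # Within a branch, a higher number is a later correction, so sort ascending.
    return {
        branch: [name for _, name in sorted(entries)]
        for branch, entries in branches.items()
    }
```

Each list in the result could then be committed in order onto its own branch, so the numbered corrections become ordinary history rather than separate files.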

Now for the Distributed VCS part: PG accepts both new titles and revisions to existing titles from everyone. With 38,000 titles, that’s nothing to laugh about. Perhaps you spotted a typo in your copy of Five Children and It introduced during the OCR process, or maybe you’re adding a translation no one has entered before, or you’ve found a missing footnote – with a DVCS solution like Git or Mercurial, nothing could be easier than forking the original, making your changes, then opening a ticket to propose that PG merge your changes back into their “official” distribution!
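The fork, fix, and propose cycle described above might look roughly like this with Git. This is a sketch under stated assumptions: no public PG repository exists as far as this post knows, so both URLs are placeholders supplied by the caller, and the `book` directory name and the function name `contribution_commands` are made up for illustration. The function only builds the commands; it does not run them.

```python
from typing import List


def contribution_commands(
    fork_url: str, upstream_url: str, branch: str, message: str
) -> List[List[str]]:
    """Return the git invocations for one fix, in order, without running them."""
    return [
        # Clone your own fork of the title into a local "book" directory.
        ["git", "clone", fork_url, "book"],
        # Remember where the "official" copy lives so later fixes can be pulled.
        ["git", "-C", "book", "remote", "add", "upstream", upstream_url],
        # Keep the fix on its own branch.
        ["git", "-C", "book", "checkout", "-b", branch],
        # (Edit the text between the previous step and this one.)
        ["git", "-C", "book", "commit", "-a", "-m", message],
        # Publish the branch to your fork, then ask for it to be merged.
        ["git", "-C", "book", "push", "origin", branch],
    ]
```

To execute the plan, each entry could be passed to `subprocess.run(command, check=True)`, pausing after the `checkout` step to make the actual edit.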

Or perhaps it’s time to make things even more interesting: these are all public domain titles. That doesn’t just mean that you’re free to read, copy, and distribute them all you like – you can even change what they say! No one can actually stop you from going around and changing your favorite classics. Maybe you want to rename a character, or change the way a particular story ends? It may be a sacrilege, but it’s your right, and with a DVCS, it becomes downright easy to make your changes while keeping them linked to the original, so you can pull any other corrections or fixes made to your “source” branch!
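What “keeping them linked to the original” buys you is a three-way merge: your text and the updated source are each compared against the version you started from. The sketch below is deliberately much simpler than what Git actually does. It assumes every edit changes a line in place, with no lines added or removed, and the name `merge_lines` is made up for illustration.

```python
from typing import List, Tuple


def merge_lines(
    base: List[str], ours: List[str], theirs: List[str]
) -> Tuple[List[str], List[int]]:
    """Line-aligned three-way merge.

    Returns the merged lines plus the indexes of lines both sides changed
    differently (conflicts), where our text is kept.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged: List[str] = []
    conflicts: List[int] = []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:
            merged.append(o)  # identical on both sides, changed or not
        elif o == b:
            merged.append(t)  # only the source changed this line
        elif t == b:
            merged.append(o)  # only we changed this line
        else:
            merged.append(o)  # both changed it differently: keep ours, flag it
            conflicts.append(index)
    return merged, conflicts
```

So a renamed character in your copy and an OCR fix in the source would both survive the merge, as long as they touch different lines.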

Or perhaps a less sacrilegious/offensive course of action would be to change a particular grammatical oddity. For instance, Edith Nesbit is famous for using “it” as a gender-neutral pronoun for situations where the person(s) being referred to can be male or female (i.e. not an object):

Everyone got its legs kicked or its feet trodden on in the scramble to get out of the carriage.

You can easily enough go through her book (though I’d shoot you if I ever met you afterwards!) and change that to something more conventional (the removal of “on” from “trodden on” is an OCD reaction to the poor parallelism in the original text and not associated with gender-neutral pronouns):

Everyone got his or her legs kicked or feet trodden in the scramble to get out of the carriage.
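A change like this is exactly what a version control system records. As a rough illustration, Python’s standard `difflib` module can produce the same kind of unified diff a VCS would show. The original sentence is copied from the passage above and the modified one follows the description there (pronoun swapped, “on” dropped); the helper name `sentence_diff` is made up.

```python
import difflib

original = (
    "Everyone got its legs kicked or its feet trodden on "
    "in the scramble to get out of the carriage."
)
modified = (
    "Everyone got his or her legs kicked or feet trodden "
    "in the scramble to get out of the carriage."
)


def sentence_diff(before: str, after: str) -> list:
    """Return a unified diff of two one-line texts as a list of lines."""
    return list(
        difflib.unified_diff(
            [before],
            [after],
            fromfile="original",
            tofile="modified",
            lineterm="",
        )
    )


# Removed lines are prefixed with "-", added lines with "+".
for line in sentence_diff(original, modified):
    print(line)
```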

In short, Project Gutenberg cries out for a DVCS interface. Not only does it make linking editions and versions a lot easier and more manageable (imagine if there were 38 million titles instead of 38 thousand, each being actively maintained and fixed), it also preserves changes and makes it easy to find who’s responsible for what. It makes it easy for anyone to contribute fixes and changes back (à la Wikipedia), and introduces some very interesting possibilities into the mix (a GitHub-like fan-fiction site!). And, as a freebie, it makes something like integrating nightly build systems to generate the various output formats a breeze!
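A nightly build could be as simple as turning each source text into one conversion command per output format. The sketch below assumes pandoc as the converter and EPUB and HTML as the targets; neither choice comes from PG, whose actual toolchain isn’t described here, and `build_commands` is a made-up name. It only builds the commands and leaves running them to the caller.

```python
from pathlib import Path
from typing import List

# Assumed target formats; the post only says "many different formats".
FORMATS = ("epub", "html")


def build_commands(source: Path, out_dir: Path) -> List[List[str]]:
    """Return one pandoc invocation per target format, without running them."""
    return [
        # pandoc picks the output format from the file extension.
        ["pandoc", str(source), "-o", str(out_dir / f"{source.stem}.{fmt}")]
        for fmt in FORMATS
    ]
```

A scheduled job could call this for every text that changed since the last run and pass each command to `subprocess.run(command, check=True)`.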

We’ll be contacting Project Gutenberg with a link to this article and will even offer our services in helping set something like this up, to put our money where our mouth is. In the meantime, just think about how DVCS could revolutionize the world!

12 thoughts on “The Case for a Git-Powered Project Gutenberg…”

  1. This is a great idea! Especially as people would fork and reuse / reannotate. It could be that a book could be split into chapters – annotated with metadata / markup and recompiled into text / pdf versions too.

    Let’s do it! :)

  2. I have had the same idea many times, especially whilst my wife was writing a book. As a developer I would love to be able to go through the change history to see how these things evolve. It would make for some *really* interesting figures.

  3. John, you’re discussing something I hadn’t even considered and I must admit, I find it very appealing! This post was mainly about applying VCS to digitized versions of the final text, but I’m sure it would be very insightful to see the text evolve as it was being first written by the author.

    Imagine seeing all the “alternative endings” the author considered before picking the right one, all the different names the author once gave the main character before doing a “Replace All” with their final choice…

    Did your wife end up publishing?

  4. The whole Project Gutenberg library is available to torrent. I would encourage you to get something set up and working. No need to have someone tell you it’s okay for you to do; just do it.

    It will likely make the Project Gutenberg folks see the value in it rather than just showing them this post.

  5. Put it up on Github, and I’ll fork! If the Project Gutenberg people want to sanction an “official” branch, great, but I’m always put off by their interface– as a programmer, this’d actually be way easier for reading alone.

  6. A great idea! You have lots of good points in this article, but there’s another good reason for using a DVCS: backups. PG is a cultural treasure, and by using a DVCS there will be thousands of backups around the world containing the whole history. If the PG servers go down for some reason or another, there will always be some repositories available.

    I’ve been a big fan of PG since I discovered it in the mid-nineties, and I’d sure like to help out with this. How about starting a project on GitHub/Gitorious/etc dedicated to planning the setup?

  7. This would be awesome, if for nothing else than being able to manage community-contributed OCR corrections, punctuation fixes, and collaboratively create more advanced source formats than simply plain text.

    Great idea!

  8. @john nicholas and @Mahmoud: There’s a fantastic script for making automatic commits to a git repository as you are writing a book or other long form text.

    Check out Flashbake at https://github.com/commandline/flashbake/wiki

    I used it for my MA thesis a couple of years ago and it was awesome. In addition to making periodic commits (which let you go back in time and observe the progress of the book), you can have it add extra metadata to your commit messages, like current location, weather, most recent tweet, or the song you’re listening to.

  9. Nice idea — and you’re not the first to have it ;-)

    We started putting Project Gutenberg texts in version control (originally svn, then mercurial and then git) back in 2005/2006 for http://openshakespeare.org/. Current texts and material are here: https://github.com/okfn/shakespeare-material (there are also scripts for “cleaning” PG texts that could be useful).

    We also did an Open Milton (again using PG texts) and have been actively working on an Open Philosophy and Open Literature project as part of the Open Knowledge Foundation’s Open Humanities working group.

    As I said we think this is a great idea and would definitely be interested in collaborating :-)

  10. As a long-term Project Gutenberg volunteer (having added over 500 books, most using Distributed Proofreaders), I’ve looked into this a few times. For my own texts, I’ve been using bazaar since 2007 or so, and that repository has already grown to about 4 gigabytes. Here you are looking at a repository that is about 100 times the size, which would be a real challenge to pull the first time (although it would be great to have from that point onward).

    Size apart, I think the biggest challenge will be showing the benefits and teaching the mechanics to the people who run PG. It will be quite a change from the current methods.
