Getting Spammed? Help Scan a Book!

Humans are apparently much better than OCR scanners at decoding words, so Carnegie Mellon University is putting the unreadable words online for the world to decipher, all in the interest of enhancing their digitizing efforts for the Internet Archive.

They’ve set up ReCAPTCHA, a free CAPTCHA service that gives webmasters the opportunity to add spam-defeating interfaces to websites. What’s the connection? Well, you’ve seen those small forms that force you to type in a word in order to successfully submit? On a ReCAPTCHA form, there is a second word in the CAPTCHA image that an OCR scanner couldn’t read well enough to decipher while scanning a book for the Archive.

If a website user decodes the first word successfully, the system assumes that they also decoded the second word, which becomes a candidate for being marked as deciphered. The system sends the second word to a second tier of CAPTCHAs, and if all of the second set of CAPTCHAs come up with the same reading, it is considered decoded and sent back to the database.
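For the curious, here is a minimal Python sketch of that two-word check. It is a toy model of the process as described above, not ReCAPTCHA's actual code: the class name, the method names, the case-insensitive comparison, the threshold of three matching readings, and the choice to discard readings that disagree are all my own assumptions for illustration.

```python
from collections import defaultdict


class UnknownWordPool:
    """Toy model of the two-word check (hypothetical, not the real service)."""

    def __init__(self, required_agreement=3):
        # How many matching readings are needed is an assumed parameter.
        self.required_agreement = required_agreement
        self.readings = defaultdict(list)  # unknown word id -> readings so far
        self.decoded = {}                  # unknown word id -> accepted reading

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown):
        """Return True if the user passes the CAPTCHA, False otherwise."""
        # The control word decides whether the user is trusted at all.
        if typed_known.strip().lower() != known_answer.lower():
            return False

        # Already deciphered: nothing more to record.
        if unknown_id in self.decoded:
            return True

        # A trusted user's reading of the unknown word counts as one vote.
        votes = self.readings[unknown_id]
        votes.append(typed_unknown.strip().lower())

        if len(votes) >= self.required_agreement:
            if len(set(votes)) == 1:
                # Every reading agrees, so mark the word as decoded.
                self.decoded[unknown_id] = votes[0]
            else:
                # Assumed behavior: on disagreement, start collecting again.
                votes.clear()
        return True


pool = UnknownWordPool(required_agreement=3)
for _ in range(3):
    pool.submit("morning", "morning", "word-1", "upon")
print(pool.decoded)
```

Note that a user who gets the control word wrong never contributes a reading, which is what keeps spambots from polluting the results.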

Their tagline? STOP SPAM. READ BOOKS.


Worldwide Digital Libraries

Along with Google Books and the Internet Archive/Open Content Alliance, there is a lesser-known collaboration between Carnegie Mellon University Libraries and three international institutions (Zhejiang University 浙江大学, the Indian Institute of Science, and the Bibliotheca Alexandrina) to put “a collection the size of a large university library” on the Web for free.

A million and a half books – for now, mostly in Chinese and English – have already been scanned and are (well, 15% of them anyway) accessible through the website at http://www.ulib.org/.

A few months ago, I decided to take a look at the UDL. A quick search for “pickwick” brought up 19 different records of various versions of Dickens’ The Pickwick Papers, under many different titles and authors, including one “Charles Dicknes”, one “CHARLES DICKENS”, and one “Scott Russell”, with the title for the last of these rendered as “Dickenss Posthumous Papers of the Pickwick Club (1912)”. Multi-volume copies had their titles rendered variously as “Vol I”, “Vol. Ii”, “Volume Ii”, etc. For those books that actually showed a detailed record when the title was clicked (many simply said “Book currently unavailable”), the tables of contents had been typed in haphazardly as well. One record had every chapter title entered, many with misspellings, all in lowercase letters.

Some versions rendered the Subject as “Unknown”, while others had Subjects ranging from “Fiction” to “Language, Linguistics, Literature” to “Social Science” to (curiously enough) “Biology”. The last of these appears to be a 1936 book by Logan Clendening that shows up in WorldCat as “A handbook to Pickwick Papers”; here the title is rendered as “A Hand Book Of Pickwick Papers” and the Publisher as “Alfred .A. Knopf”.

We visited the site again at the ASIS&T meeting on Monday the 28th. There was a prominent notice that metadata errors had come to their attention and were being cleaned up. Sure enough, the Pickwick Papers seem to be in a lot better shape now. (And to be fair, Google Books lists one “Posthumous Papers OR The Pickwick Club” in its results.)

If you visit some of these, take a moment to compare interfaces, and also look at some digitization projects hosted by individual libraries (particularly Special Collections and Rare Books). A contrast that quickly becomes apparent is the book-as-object metaphor (the British Library’s incredible Turning The Pages, for example) vs. the book-as-searchable-content metaphor (Google’s scans, which eliminate covers, endpapers, etc. in favor of focusing strictly on the page). Post your feedback here in the Comments — I’d love to know what you think.

Happy digital book-hunting,