reCAPTCHA
The digital shift
Because of poor type, weathering from age or just plain dirtiness, scanned books are often misread by computers during digitization. This is especially true of older, out-of-copyright books. Humans, however, have far less difficulty reading a scanned page in image form. In other words, if you could get a group of humans to just sit and transcribe old books, the result would contain far fewer errors than current automated processes produce. The only problem is that transcribing a book is supremely boring and tedious.
Blogs get a lot of spam. One way to stop spam is to use a CAPTCHA: an image of a string of messy or difficult-to-read random letters and numbers that a user must type correctly before the blog system accepts a comment.
reCAPTCHA combines these two problems. It takes words from scanned book pages that OCR software can't decipher, pairs each indecipherable word with a known word, and presents the pair as a CAPTCHA on a blog. The blog gets to avoid spam, and the book-scanning project gets a (presumably) correct transcription of the otherwise un-computer-readable word. I suspect the same unreadable word must be presented several times, with the responses aggregated and the most common answer taken as correct.
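To make that guess concrete, here is a minimal Python sketch of how such aggregation might work. It is not reCAPTCHA's actual implementation, which I haven't seen: the function names, the `min_votes` threshold and the tie handling are all assumptions made up for illustration.

```python
from collections import Counter


def accept_response(known_word, typed_known):
    """Trust a response only if the user typed the known (control) word correctly."""
    return typed_known.strip().lower() == known_word.strip().lower()


def aggregate_transcriptions(responses, min_votes=3):
    """Return the most common transcription, or None if there is no clear winner yet.

    min_votes is a hypothetical threshold, not a documented reCAPTCHA setting.
    """
    counts = Counter(r.strip().lower() for r in responses)
    if not counts:
        return None
    ranked = counts.most_common(2)
    best, votes = ranked[0]
    if votes < min_votes:
        return None  # not enough agreement yet; keep showing the word
    if len(ranked) > 1 and ranked[1][1] == votes:
        return None  # tie between the top two answers
    return best


def transcribe(known_word, submissions, min_votes=3):
    """submissions: list of (typed_known, typed_unknown) pairs from different users."""
    trusted = [unknown for known, unknown in submissions
               if accept_response(known_word, known)]
    return aggregate_transcriptions(trusted, min_votes)


if __name__ == "__main__":
    submissions = [
        ("upon", "morning"),
        ("upon", "morning"),
        ("upom", "mourning"),  # failed the control word, so it is ignored
        ("upon", "morning"),
        ("upon", "moming"),
    ]
    print(transcribe("upon", submissions))
```

The known word acts as the spam check, and only people who pass it get a vote on the unknown word. Returning `None` when agreement is weak would let the system keep presenting the word until a clear answer emerges.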
From the reCAPTCHA site:
About 60 million CAPTCHAs are solved by humans around the world every day. In each case, roughly ten seconds of human time are being spent. Individually, that's not a lot of time, but in aggregate these little puzzles consume more than 150,000 hours of work each day. What if we could make positive use of this human effort? reCAPTCHA does exactly that by channeling the effort spent solving CAPTCHAs online into "reading" books.
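A quick back-of-the-envelope check of those figures, using only the two numbers quoted above (the variable names are mine):

```python
# Figures quoted from the reCAPTCHA site
captchas_per_day = 60_000_000
seconds_per_captcha = 10  # "roughly ten seconds"

# Convert total seconds per day into hours
hours_per_day = captchas_per_day * seconds_per_captcha / 3600
print(f"{hours_per_day:,.0f} hours per day")  # prints "166,667 hours per day"
```

That is consistent with the "more than 150,000 hours" in the quote.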
I love projects like this. The system is such a simple, elegant and beautiful way to take something annoying, time-consuming and loaded with negative connotations (CAPTCHAs in general, and blog spam) and make it productive. I'm going to look into setting it up on our blog in the next few days.
Brilliant.
I really like the idea that someone was able to take the old Turing test (is it a man or a machine?) one step further and make it a useful tool for digitizing books.
YES. The human eye-brain does contain more "fuzzy logic" than machines will be able to master for some decades to come. Glad to know that because the guys who scripted Terminator were - as good science fiction writers always are - closer to the truth than the newspaper or the scientific journals. Putting bits and pieces of current technology and extending the line of plausibility only a few decades into the future can really pay off.
But that is another digression.
Best wishes, Peter
Peter at June 8, 2007 03:55 AM






