reCAPTCHA

The digital shift

This is ingenious.

Because of poor type, weathering from age or just plain dirtiness, scanned books are often misread by computers during digitization efforts. This is especially true of older, out-of-copyright books. Humans, however, have much less difficulty reading a scanned page in image form. In other words, if you could get a group of humans to just sit and transcribe old books, the process would produce far fewer errors than current automated processes. The only problem is that transcribing a book is supremely boring and tedious.

Blogs get a lot of spam. One way to stop it is to use a CAPTCHA: an image of a string of messy or difficult-to-read random letters and numbers that a user must type correctly before the blog system accepts a comment.

reCAPTCHA combines these two problems. It takes words from scanned pages of books that OCR software can't decipher, pairs each indecipherable word with a known word, and presents the pair as a CAPTCHA on a blog. The blog gets to avoid spam, and the book scanning project gets a (presumably) correct transcription of the otherwise un-computer-readable word. I suspect they must present the same unreadable word several times, aggregating responses and taking the most common response as correct.
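To make that guess concrete, here is a minimal Python sketch of the pairing-and-voting idea. It is only an illustration of the scheme suspected above, not reCAPTCHA's actual code: the class name `WordPairChallenge`, the `min_votes` threshold and the word IDs are all hypothetical.

```python
from collections import Counter


class WordPairChallenge:
    """Hypothetical sketch of the guessed scheme; not reCAPTCHA's real implementation."""

    def __init__(self, min_votes=3):
        # Assumed threshold: how many answers to collect before trusting a word.
        self.min_votes = min_votes
        self.votes = {}  # unknown word id -> Counter of typed answers

    def submit(self, known_answer, unknown_id, typed_known, typed_unknown):
        """Return True if the commenter passes the CAPTCHA."""
        # Only the known word can actually be checked.
        if typed_known.strip().lower() != known_answer.lower():
            return False
        # The commenter passed, so count their reading of the unknown word as a vote.
        answer = typed_unknown.strip().lower()
        self.votes.setdefault(unknown_id, Counter())[answer] += 1
        return True

    def transcription(self, unknown_id):
        """Return the most common answer, or None until enough votes are in."""
        counts = self.votes.get(unknown_id)
        if not counts or sum(counts.values()) < self.min_votes:
            return None
        word, _ = counts.most_common(1)[0]
        return word


# Three commenters see the same unreadable word next to the known word "morning".
pool = WordPairChallenge(min_votes=3)
pool.submit("morning", "scan-17", "morning", "upon")
pool.submit("morning", "scan-17", "morning", "upon")
pool.submit("morning", "scan-17", "morning", "up0n")
print(pool.transcription("scan-17"))
```

The known word does the spam filtering; the unknown word gets its transcription for free from everyone who passes.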

From the reCAPTCHA site:

About 60 million CAPTCHAs are solved by humans around the world every day. In each case, roughly ten seconds of human time are being spent. Individually, that's not a lot of time, but in aggregate these little puzzles consume more than 150,000 hours of work each day. What if we could make positive use of this human effort? reCAPTCHA does exactly that by channeling the effort spent solving CAPTCHAs online into "reading" books.
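The quoted numbers hold up. A quick check in Python, using only the round figures from the quote (the variable names are mine):

```python
# Figures taken from the reCAPTCHA quote above; both are rough estimates.
captchas_per_day = 60_000_000
seconds_per_captcha = 10

# 3600 seconds in an hour.
hours_per_day = captchas_per_day * seconds_per_captcha / 3600
print(f"{hours_per_day:,.0f} hours per day")
```

That comes to roughly 166,667 hours a day, consistent with the "more than 150,000 hours" claim.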

I love projects like this. The system is such a simple, elegant and beautiful way to take something annoying, time-consuming and riddled with negative connotations (CAPTCHAs in general and blog spam) and make it productive. I'm going to look into setting it up on our blog in the next few days.

Craig Mod >> May 25, 2007
Comments

Brilliant.

I really like the idea that someone was able to take the old Turing test (is it a man or a machine?) one step further and make it a useful tool for digitizing books.

YES. The human eye-brain does contain more "fuzzy logic" than machines will be able to master for some decades to come. Glad to know that, because the guys who scripted Terminator were - as good science fiction writers always are - closer to the truth than the newspapers or the scientific journals. Putting together bits and pieces of current technology and extending the line of plausibility only a few decades into the future can really pay off.

But that is another digression.

Best wishes, Peter


Peter at June 8, 2007 03:55 AM

