Data CAPTCHA

Rrishi Raote New Delhi
Last Updated : Jan 20 2013 | 2:49 AM IST

Have you ever wondered how it is that vast numbers of books are scanned and digitised in order to be served up on the Internet? I’ve always assumed it is a low-tech process that begins with low-paid humans slamming books onto scanner beds, hitting the button, waiting, turning the page, hitting the button, waiting... There were pages visible on Google Books that proved this to be so. One famous image showed a woman’s hand captured in front of a page. The hand had red-painted fingernails.

The scanning process is now mostly automated. But digitisation — turning image into text — is still only part-automated. It could be done by a human sitting in front of a screen typing out the words she sees in the scanned image of a page. But that would take too long and cost too much. Instead, optical character recognition (OCR) software “reads” the images to turn them into text. Then humans step in to correct the inevitable misreadings.

OCR works all right with good-quality printed pages. But old books and newspapers, printed when the surface of paper was less well-finished and printing technology was less advanced, are much harder for a computer to read. OCR results for old pages are sometimes total gibberish. Give that same computer-unreadable page to a human and it would be decoded in a jiffy.

But you would need hundreds of millions of humans working together to make a real dent in the vast number of pages in old books and documents still waiting to be digitised. So what to do? Well, start by watching the video of a TED talk delivered by Luis von Ahn in April (but posted online a few days ago). Von Ahn is the fast-speaking clever mind behind the CAPTCHA, or “Completely Automated Public Turing test to tell Computers and Humans Apart”.

CAPTCHAs are those oddly distorted words that you come across on many websites, before posting a comment, say, or buying a ticket. They are meant to separate humans from “bots”: programs designed to deliver spam or block-book tickets. A “bot” cannot read distorted words. You can. When you type the words into the box, this proves that you are a human. Pass.

In the TED talk, von Ahn says he was “sad” that millions of people were wasting a few seconds each every day filling out a CAPTCHA. So he and his graduate students — he teaches at Carnegie Mellon — decided to put users’ time to better use. Many sites that use CAPTCHAs now use two-word puzzles. These are called reCAPTCHAs. One word of the pair is the one that a human must type correctly to pass. The other is a word picked from a document waiting to be digitised that the OCR software cannot read. The user will not know which is which.
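The two-word check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not von Ahn's actual code: the names `Challenge`, `control_word`, `unknown_image_id` and `check_response` are hypothetical, and the case-insensitive comparison is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Challenge:
    control_word: str      # word whose correct reading is already known
    unknown_image_id: str  # hypothetical ID of the scanned word OCR could not read


def check_response(challenge: Challenge, typed_control: str,
                   typed_unknown: str) -> Tuple[bool, Optional[str]]:
    """Return (passed, guess_to_record).

    The user passes only if the known word is typed correctly. The guess
    for the unknown word is kept only when the user passed, on the
    assumption that someone who read one word correctly probably read
    the other correctly too.
    """
    passed = typed_control.strip().lower() == challenge.control_word.lower()
    guess = typed_unknown.strip().lower() if passed else None
    return passed, guess
```

In this sketch the user never learns which word was the control, so there is no incentive to type the unknown word carelessly.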

If enough users type the machine-unreadable word the same way, von Ahn’s software will decide that this must be the right word, and one more word will have been rescued. Because the number of contributors is so large, digitisation is speeding up. Reportedly 100 million words a day are being recovered this way. And von Ahn and his team have long since moved on to another task: using “crowdsourcing” to translate all the Web’s content into different languages while helping users learn those languages.
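The "enough users agree" rule can be pictured as a simple vote count. The sketch below is a guess at the general shape of such a rule, not the real system: the function name `consensus` and the thresholds `min_votes` and `min_share` are assumptions chosen for illustration, since the talk does not give the actual numbers.

```python
from collections import Counter
from typing import Iterable, Optional


def consensus(guesses: Iterable[str], min_votes: int = 3,
              min_share: float = 0.75) -> Optional[str]:
    """Return the agreed reading of an unknown word, or None if undecided.

    A reading is accepted only when it has at least `min_votes` votes and
    at least `min_share` of all non-empty guesses (both assumed values).
    """
    normalized = [g.strip().lower() for g in guesses if g.strip()]
    if not normalized:
        return None
    word, votes = Counter(normalized).most_common(1)[0]
    if votes >= min_votes and votes / len(normalized) >= min_share:
        return word
    return None
```

For example, `consensus(["morning", "Morning", "morning "])` accepts "morning", while `consensus(["morning", "mourning"])` returns `None` and the word would go back into circulation for more users to read.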

I should say at this point that Google bought von Ahn’s reCAPTCHA (and no doubt wants to buy Duolingo, von Ahn’s forthcoming language product). Google Books is using the technology to digitise millions of books. The New York Times has used it to digitise its 150-year archives. Its archive search, available to subscribers, is good.

In India, the right to information (RTI) system is in big trouble. Too many requests come in for information officers to handle. One obvious solution is for the government to voluntarily put all material covered by the RTI Act online. If it’s out there, a citizen won’t file a request. Surely reCAPTCHA technology can make the job of digitisation much faster and easier?

And when that is shown to be working, why not make a start on digitising the state and National Archives? What about newspaper archives? This will be cheaper for the government, and potentially lucrative for newspapers. Worth it, in either case.


First Published: Dec 24 2011 | 12:19 AM IST
