Have you ever wondered how vast numbers of books are scanned and digitised in order to be served up on the Internet? I had always assumed it was a low-tech process that began with low-paid humans slamming books onto scanner beds, hitting the button, waiting, turning the page, hitting the button, waiting... Stray pages visible on Google Books seemed to prove it. One famous image showed a woman's hand caught in front of a page, its fingernails painted red.
The scanning process is now mostly automated. But digitisation — turning image into text — is still only part-automated. It could be done by a human sitting in front of a screen typing out the words she sees in the scanned image of a page. But that would take too long and cost too much. Instead, optical character recognition (OCR) software “reads” the images to turn them into text. Then humans step in to correct the inevitable misreadings.
It works all right with good-quality printed pages. But old books and newspapers, printed when the surface of paper was less well-finished and printing technology was less advanced, are much harder for a computer to read. OCR results for old pages are sometimes total gibberish. Give that same computer-unreadable page to a human and it would be decoded in a jiffy.
But you would need hundreds of millions of humans working together to make a real dent in the vast number of pages in old books and documents still waiting to be digitised. So what to do? Well, start by watching the video of a TED talk delivered by Luis von Ahn in April (but posted online a few days ago). Von Ahn is the fast-speaking clever mind behind the CAPTCHA, or “Completely Automated Public Turing test to tell Computers and Humans Apart”.
CAPTCHAs are those oddly distorted words that you come across on many websites, before posting a comment, say, or buying a ticket. They are meant to separate humans from “bots” — programmes designed to deliver spam or block-book tickets. A “bot” cannot read distorted words. You can. When you type the words into the box, this proves that you are a human. Pass.
In the TED talk, von Ahn says he was “sad” that millions of people were wasting a few seconds each every day filling out a CAPTCHA. So he and his graduate students — he teaches at Carnegie Mellon — decided to put users’ time to better use. Many sites now serve two-word puzzles called reCAPTCHAs. One word of the pair is the control: the one a human must type correctly to pass. The other is a word, picked from a document waiting to be digitised, that the OCR software cannot read. The user does not know which is which.
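The server-side logic of such a two-word puzzle can be sketched in a few lines of Python. This is an illustration, not reCAPTCHA's actual API: the function and parameter names are invented, and the real system adds distortion, sessions and anti-bot defences. The point is simply that only the known word decides pass or fail, while the user's answer for the unknown word is quietly recorded.

```python
def check_two_word_puzzle(control_word, user_control, user_unknown, guesses):
    """Hypothetical sketch of a two-word CAPTCHA check.

    control_word -- the word the server already knows the answer to
    user_control -- what the user typed for the control word
    user_unknown -- what the user typed for the OCR-defeating word
    guesses      -- a shared list collecting human readings of the unknown word
    """
    # Only the control word is actually verified...
    if user_control.strip().lower() != control_word.lower():
        return False
    # ...the other answer is stored as one human's reading of the scanned word.
    guesses.append(user_unknown.strip().lower())
    return True
```

Because the user cannot tell which word is the control, guessing at random on the unknown word would mean guessing at random on the control word too, and failing the test.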
If enough users type the machine-unreadable word the same way, von Ahn’s software decides that this must be the right word — and one more word has been rescued. Because the number of contributors is so large, digitisation is speeding up: reportedly 100 million words a day are being recovered this way. And von Ahn and his team have long since moved on to another task: using “crowdsourcing” to translate the Web’s content into other languages — while helping users learn those languages.
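The “enough users agree” step is essentially a vote. A minimal sketch, again with an invented threshold (the real system's acceptance rules are more elaborate and weight OCR output as well):

```python
from collections import Counter

def consensus(guesses, threshold=3):
    """Return the winning transcription once enough humans agree,
    or None if no reading has reached the (illustrative) threshold."""
    if not guesses:
        return None
    # Tally case-folded readings and take the most common one.
    word, votes = Counter(g.lower() for g in guesses).most_common(1)[0]
    return word if votes >= threshold else None
```

Once `consensus` returns a word rather than `None`, that word can be written into the digitised text and removed from the pool of puzzles.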
I should say at this point that Google bought reCAPTCHA, von Ahn’s company (and no doubt wants to buy Duolingo, von Ahn’s forthcoming language product). Google Books is using the technology to digitise millions of books. The New York Times has used it to digitise its 150-year archive. Its archive search, available to subscribers, is good.
In India, the Right to Information (RTI) system is in big trouble. Too many requests come in for information officers to handle. One obvious solution is for the government to voluntarily put all material covered by the RTI Act online: if it is already out there, a citizen need not file a request. Surely reCAPTCHA-style technology could make that job of digitisation far faster and cheaper?
And when that is shown to be working, why not make a start on digitising the state archives and the National Archives? What about newspaper archives? This would be cheaper for the government, and potentially lucrative for newspapers. Worth it, in either case.