Wherever we go on the internet, we encounter CAPTCHAs, those twisted words that block or grant entry to websites. Want to buy tickets online? There’s a CAPTCHA. Want to comment on an article or blog post? There’s a CAPTCHA. So why do we have them? They were invented to stop spamming machines from posting wherever they want. To keep out spammers, a CAPTCHA has to reliably test whether you are human or machine, and computer scientists worked out that one of the easiest ways to do that is to use images of language.
You know the things: You are presented with a few wavy, blurry words and you have to dutifully type them into a box, muttering through clenched teeth, “I can’t read that mess.”
Well, next time you are confronted by one of these, you can take pride in the fact that you are doing your part to preserve history.
Yes, those annoying CAPTCHAs are actually being used to help digitise decades of old texts — books, magazines and newspapers — that scanning programs struggle to decipher.
The reason the words are blurry or warped isn’t to test your patience; they are taken from scanned texts, which are often misread by auto-digitising programs — or optical character recognition (OCR) software, if you wish to get technical. That’s where we step in.
Through the use of CAPTCHAs, humans around the world digitised 20 years’ worth of New York Times back issues in mere months. Within the first year, 440 million words had been deciphered: the equivalent of 17,600 books.
Google bought the technology in 2009, and is using it as the cornerstone of its ambitious Google Books project, which digitises ancient, rare, and out-of-print works and offers them for free.
The technology came to be used in this way after the inventor of the CAPTCHA, Luis von Ahn, realised that while each test takes only a few seconds to solve, collectively humans were wasting hundreds of thousands of hours each day typing the letters. He set about finding the best way to harness that effort.
“Human computation” is the less-than-charming term von Ahn uses to describe the process he arrived at. The updated software was dubbed the reCAPTCHA.
Initially, CAPTCHAs worked by offering up a series of jumbled letters, intentionally warped just enough that humans could read them easily but robots could not.
In the case of a ticketing company, this stops scalpers from using software to buy up multiple tickets automatically.
But the same inherent flaw that allowed CAPTCHAs to trip up robots also meant that OCR programs often failed to accurately decipher scanned text with any imperfections.
Fading, damage to the paper, and printing flaws mean that OCR software incorrectly reads around 20 per cent of words — an unacceptable rate by any standard.
The program corrects this by pairing a word the OCR software could not decipher with a control word whose answer is already known. If enough people who type the control word correctly also agree on the unknown word, the program can assume both are correct.
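The pairing-and-voting scheme described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not reCAPTCHA’s actual code: the function name, the threshold of three agreeing users, and the sample answers are all assumptions made for the example.

```python
from collections import Counter

def tally_unknown_word(responses, control_answer, threshold=3):
    """Return the accepted transcription of the unknown word, or None.

    responses: list of (control_guess, unknown_guess) pairs from users.
    threshold: how many agreeing users are needed before trusting an answer
               (the value 3 here is an illustrative assumption).
    """
    # Only count a user's guess for the unknown word if they also
    # typed the known control word correctly — this filters out bots
    # and careless answers.
    votes = Counter(
        unknown_guess
        for control_guess, unknown_guess in responses
        if control_guess == control_answer
    )
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= threshold else None

# Example: five users solve the same challenge.
responses = [
    ("morning", "Tuesday"),
    ("morning", "Tuesday"),
    ("mornlng", "Tuesbay"),  # failed the control word; vote discarded
    ("morning", "Tuesday"),
    ("morning", "Tuesdav"),
]
print(tally_unknown_word(responses, "morning"))  # Tuesday
```

Tying acceptance of the unknown word to success on the control word is what lets the system trust strangers’ transcriptions without ever knowing the right answer in advance.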
Note that both words are warped further by the program, to reduce the chances that another OCR program — one being used by attackers — can also read the text.
Otherwise, this would defeat the CAPTCHA’s initial purpose — to halt such automated attacks.
The CyLab institute at Carnegie Mellon University, which developed the software, reports a 99.1 per cent accuracy rate with the program, a success it claims is “comparable to the best human professional transcription services”.
You may also have noticed photos of numbers appearing in your CAPTCHAs.
This is an even more ambitious plan: to digitise street numbers photographed by Google Street View.
Of course, this is a less altruistic undertaking than preserving important literature from the past, and this use of the reCAPTCHA system lends weight to the criticism that Google is simply harvesting free labour for its own commercial ends.