Free Republic
Browse · Search
General/Chat
Topics · Post Article


ReCAPTCHA: The job you didn’t even know you had
The Walrus ^ | March 4, 2009 | Alex Hutchinson

Posted on 03/04/2009 9:00:26 PM PST by Lorianne

With the help of a MacArthur “genius” grant, von Ahn set out to make amends. Now a growing number of websites, from e-commerce (Ticketmaster) to social networking (Facebook) to blogging (WordPress), have implemented the precocious professor’s new tool, dubbed reCAPTCHA. If you’ve visited those sites, your squiggly-letter-reading ability has been harnessed for a massive project that aims to scan and make freely available every out-of-copyright book in the world, by deciphering words from old texts that have stumped scanning software.

The largest scanning centre in this project, directed by a San Francisco–based non-profit called the Open Content Alliance, occupies a dimly lit corner of the seventh floor of Robarts Library, on the University of Toronto’s downtown campus. The space is filled with twenty-three cubicle-like scanning stations draped on all sides with light-proof black cloth, like rows of coin-operated peep shows.

When the centre opened in 2004, its single robotic scanner used a vacuum suction arm to turn pages automatically. “We ran it into the ground,” recalls coordinator Gabe Juszel, a cheerfully earnest former filmmaker who sports a soul patch. “It was literally smoking by the time we were done with it.” But with the wide variations in book sizes, binding, and condition, they consistently found that they could achieve a higher scanning rate by simply turning off the robotic arm and flipping the pages manually — something else, it seems, that humans are still better at.

Two shifts of dedicated employees keep pages turning from 8:30 in the morning to 11 at night, leavening the monotony by listening to music on their iPods, reading, or (in one particularly talented case) knitting as they go. Two Canon digital SLR cameras mounted in opposing corners of each booth click at an adjustable pre-set interval. Rookies opt for seven seconds, the slowest possible; veterans can scan a page per second.

Juszel had just returned from the OCA’s annual meeting, where it was announced that the number of books available on the group’s Internet Archive had broken the one million mark. U of T is currently adding about 1,500 books a week, and at that rate there’s no need to be choosy about which ones to scan. “It’s a real beast to feed, actually,” says Jonathan Bengtson, the librarian who oversees the university’s role. Entire subject areas are scanned by sorting for pre-1923 works (in accordance with US copyright laws), eliminating duplicates, and taking everything that’s left. Scholars from around the world can also request books for ten cents a page, and typically see them online in less than twenty-four hours.

The most popular Toronto contribution, Juszel reports, is a 1475 edition of St. Augustine’s De civitate Dei, downloaded a baffling 75,911 times (at press time). “Who knew people liked Latin so much?” he says. Toward the other end of the spectrum is a book pulled from the stacks around the same time: the Montreal Philatelist, a monthly journal that ran from 1898 to 1902, which features lurid tales of stamp counterfeiting in Newfoundland.

... excerpt


TOPICS: Books/Literature; Science
KEYWORDS: technology


Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

