


2007-02-08

Google's progress on book digitization

I had not thought about this project or the CCEL project for quite a while. I looked at the Current Cites newsletter this morning and read two articles on digitization. It is interesting to see that CCEL has moved their site to the Drupal content management system and that they are using a load-balanced two-server setup to handle traffic.

There are major differences between the Google and CCEL approaches, however. Google is simply digitally photographing (scanning, in a new way?) page after page and then OCRing the text to create a digital archive. A concise account, with a timeline, is in the New Yorker:

GOOGLE’S MOON SHOT - The quest for the universal library. by JEFFREY TOOBIN "Every weekday, a truck pulls up to the Cecil H. Green Library, on the campus of Stanford University, and collects at least a thousand books, which are taken to an undisclosed location and scanned, page by page, into an enormous database being created by Google. The company is also retrieving books from libraries at several other leading universities, including Harvard and Oxford, as well as the New York Public Library. At the University of Michigan, Google’s original partner in Google Book Search, tens of thousands of books are processed each week on the company’s custom-made scanning equipment.
Google intends to scan every book ever published, and to make the full texts searchable, in the same way that Web sites can be searched on the company’s engine at google.com. At the books site, which is up and running in a beta (or testing) version, at books.google.com, ..."

(Sidenote: what does Google do for backups? Disaster recovery must be an interesting process. At the same time, the content is static once created, so a mirror site is easily feasible.)

CCEL, on the other hand, starts where Google finishes. ThML (CCEL's Theological Markup Language) is applied to the documents while the OCR text is hand-edited and corrected. One hopes the OCR is accurate, but I can attest from the CCEL book that I completed between 1999 and 2001 (St. Francis de Sales, Treatise on the Love of God) that numerous corrections are required. The end result is a fully tagged document that can be searched (intelligently, using the tags), linked, and converted into multiple formats while retaining the meta structure of the document, and that serves as digital content. The scanned image is also available, which is particularly interesting for very old documents that contain artwork or unusual title pages, or just for looking at the fonts and layout used.
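To make the difference between tagged text and raw OCR text concrete, here is a rough sketch in Python. The markup is a made-up, simplified stand-in (the element names `book`, `chapter`, `p` and `note` are my own, not actual ThML), the sample sentences are invented for illustration, and the function names `search_tagged` and `search_plain` are hypothetical. It only shows the idea: with tags, a search can be limited to body paragraphs and report which chapter a hit came from, while flattened text can only say that the word appears somewhere.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The element names are illustrative
# only (not actual ThML), and the sentences are invented sample text.
SAMPLE = """
<book>
  <chapter title="Of Divine Love">
    <p>Love is the first movement of the will.</p>
    <note>The editor compares this passage with an earlier chapter on the will.</note>
  </chapter>
  <chapter title="Of Prayer">
    <p>Prayer lifts the heart.</p>
  </chapter>
</book>
"""


def search_tagged(xml_text, tag, word):
    """Return (chapter title, text) for each <tag> element containing word."""
    root = ET.fromstring(xml_text)
    hits = []
    for chapter in root.iter("chapter"):
        for elem in chapter.iter(tag):
            text = "".join(elem.itertext()).strip()
            if word.lower() in text.lower():
                hits.append((chapter.get("title"), text))
    return hits


def search_plain(xml_text, word):
    """Search the flattened text, as with raw OCR output: no structure to filter on."""
    root = ET.fromstring(xml_text)
    lines = [t.strip() for t in root.itertext() if t.strip()]
    return [line for line in lines if word.lower() in line.lower()]


if __name__ == "__main__":
    # Tag-aware: only body paragraphs, with the chapter each hit belongs to.
    print(search_tagged(SAMPLE, "p", "will"))
    # Flattened: the editor's note matches too, and the chapter is unknown.
    print(search_plain(SAMPLE, "will"))
```

The same tree that drives the search could also be walked to emit other formats, which is the "convert while retaining structure" point above.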

Tags: technology - books





Page last modified on March 01, 2007, at 02:39 PM