2007-02-08
Google's progress on book digitization
I had not thought about this project or the CCEL project for quite a while. I looked at the Current Cites newsletter this morning and read 2 articles on digitization. Interesting to see that CCEL has revamped their site using the Drupal content management system and that they are using a load-balanced 2-server setup to handle traffic.
There are major differences between the Google and CCEL approaches, however. Note that Google is simply digitally photographing (scanning??? in a new way!) pages and pages and then OCRing the text to create a digital archive. A concise timeline story is at the New Yorker:
(Sidenote: what does Google do for backup purposes? Disaster recovery must be an interesting process. At the same time, the content is static once created, so a mirror site should be quite feasible.)
CCEL, on the other hand, starts where Google finishes. ThML is applied to the documents while the OCR text is hand edited and corrected. One would hope the OCR is accurate, but I can attest that numerous editing changes were required on the CCEL book that I completed from 1999 to 2001 (St. Francis de Sales - Treatise on the Love of God). The end result is a fully tagged document that can be searched (intelligently, using the tags), linked, and converted into multiple formats while retaining the meta structure of the document, and that serves as digital content. The scanned image is also available (this is particularly interesting with very old documents that contain artwork or unusual title pages, or even just to look at the fonts and layout used).
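To make the "searched intelligently with tags" point a bit more concrete, here is a rough sketch in Python. The markup is a made-up, simplified stand-in (the element names `book`, `chapter`, `p`, and `note` are my own assumptions for illustration, not actual ThML), and the two functions are hypothetical helpers, not anything CCEL uses. It only shows the general idea: plain OCR text can tell you that a word occurs, while tagged text can tell you where and in what kind of element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. These element names are illustrative
# assumptions and are NOT actual ThML.
SAMPLE = """
<book title="Treatise on the Love of God">
  <chapter n="1">
    <p>Love is the first movement of the will. <note>Editor's remark on love.</note></p>
  </chapter>
  <chapter n="2">
    <p>The will governs all the powers of the soul.</p>
  </chapter>
</book>
"""


def search_plain(xml_text, word):
    """Plain-text search: report only whether the word occurs anywhere."""
    text = "".join(ET.fromstring(xml_text).itertext())
    return word.lower() in text.lower()


def search_tagged(xml_text, word, tag):
    """Tag-aware search: list chapter numbers where the word occurs
    inside an element named `tag` (including its nested text)."""
    root = ET.fromstring(xml_text)
    hits = []
    for chapter in root.iter("chapter"):
        for elem in chapter.iter(tag):
            if word.lower() in "".join(elem.itertext()).lower():
                hits.append(chapter.get("n"))
                break  # one hit per chapter is enough
    return hits


if __name__ == "__main__":
    print(search_plain(SAMPLE, "soul"))
    print(search_tagged(SAMPLE, "love", "note"))
    print(search_tagged(SAMPLE, "will", "p"))
```

The same structure is what makes conversion to other formats possible without losing chapters, notes, and so on, since a converter can walk the elements rather than guess at the layout from raw text.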
Tags: technology - books

