Replies

The Library of Congress has been digitizing old books in its collection for quite some time. The building I worked in had some of Schwarzkopf's intelligence folks "doing stuff". The father of one of those guys was deeply involved in that digitizing effort.

He was always good for a tale or two about the Library of Congress. He related that the major problem was figuring out how to READ the old stuff, so that scanned text in ancient fonts or lead type of all sorts could be recognized.

I am not an expert in this, but the basic idea is that you "read" the text in a number of places until you can identify the complete set of characters that will be found in the document.

That data set is then matched up with standard OCR programming.
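Here is a toy sketch of that matching step as I understand it, not the LOC's actual software. The tiny 3x3 bitmaps, the function names, and the pixel-difference comparison are all my own assumptions for illustration; real OCR is far more involved.

```python
# Toy glyph matcher. All names and data here are hypothetical.

def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))

def build_glyph_set(samples):
    """samples: (bitmap, character) pairs hand-labelled from a few places in the book."""
    glyphs = {}
    for bitmap, char in samples:
        glyphs.setdefault(char, bitmap)  # keep the first example of each character
    return glyphs

def recognize(bitmap, glyphs):
    """Return the character whose sampled glyph differs least from the bitmap."""
    return min(glyphs, key=lambda ch: hamming(bitmap, glyphs[ch]))

# 3x3 "scans" in an invented old typeface
LONG_S = ("##.", "#..", "#..")
LETTER_I = (".#.", ".#.", ".#.")
glyphs = build_glyph_set([(LONG_S, "s"), (LETTER_I, "i")])

noisy = ("##.", "#..", "##.")  # the first glyph again, with one speck of dirt
print(recognize(noisy, glyphs))
```

The point is only that once you have one labelled example of every character in the book, each new scanned shape can be mapped to the nearest known one.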

You call up the ancient book, the OCR process kicks in, and it reads the text in the native fonts just as if it were all in modern standard fonts.

They then associate the full OCR'd text with the visible text, which lets you quickly find material in the book while what you see on screen is still the native printing.
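A rough sketch of what that association could look like, assuming the OCR pass hands back each word along with its page and a bounding box on the scanned image. The record layout and function names are my own invention, not anything the LOC described.

```python
from collections import defaultdict

def build_index(ocr_words):
    """ocr_words: (page, word, (x, y, width, height)) records from an OCR pass."""
    index = defaultdict(list)
    for page, word, box in ocr_words:
        index[word.lower()].append((page, box))  # case-insensitive lookup
    return index

def find(index, word):
    """Return every (page, box) where the word appears on the scanned pages."""
    return index.get(word.lower(), [])

# Made-up sample data
ocr_words = [
    (1, "Whereas", (40, 60, 120, 22)),
    (1, "liberty", (170, 60, 90, 22)),
    (7, "Liberty", (55, 310, 95, 24)),
]
index = build_index(ocr_words)
print(find(index, "liberty"))
```

A viewer would then jump to the page and highlight the box, so the reader only ever sees the original printing.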

That let the LOC do a quick and dirty pass on a vast number of books: a complete "read" would be processed in advance only for commonly accessed works. If someone wanted to look at an infrequently read document, the OCR software would do the job on the spot.
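In programming terms that is an OCR-on-demand arrangement with a cache. A minimal sketch, assuming a stand-in function in place of the real OCR engine; the class and the book names are hypothetical.

```python
class LazyOcrStore:
    """Runs OCR ahead of time for popular works and on first request for the rest."""

    def __init__(self, ocr_func, preprocess=()):
        self._ocr = ocr_func
        self._text = {}
        for book_id in preprocess:  # commonly accessed works get a complete read up front
            self._text[book_id] = ocr_func(book_id)

    def text_for(self, book_id):
        if book_id not in self._text:  # rarely read work: OCR it now, keep the result
            self._text[book_id] = self._ocr(book_id)
        return self._text[book_id]

calls = []

def fake_ocr(book_id):
    """Stand-in for the expensive OCR step; records each time it is invoked."""
    calls.append(book_id)
    return f"text of {book_id}"

store = LazyOcrStore(fake_ocr, preprocess=["popular-atlas"])
store.text_for("popular-atlas")     # already done, no new OCR run
store.text_for("obscure-pamphlet")  # first request triggers OCR
store.text_for("obscure-pamphlet")  # second request is served from the cache
print(calls)
```

Each book is only ever OCR'd once, and the expensive work is spent first on what people actually read.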

The idea was that eventually everything would be OCR'd, but all in due course, and as cheaply as possible within the framework of the LOC budget.

Indiana University was in the business of digitizing a vast number of pictures ~ they had one of the world's premier collections. Yale is doing that with their unique sets. Presumably other universities are doing the same thing.

There the problem is that you need a deck of supercomputers on hand to handle the data stream for compression.