Free Republic
Browse · Search
Bloggers & Personal
Topics · Post Article

To: Ha Ha Thats Very Logical
The point of the document archiving process is to be able to search for a document by the text within it but also retrieve an exact copy/image of the document. That's why the OCR'd text is placed in an invisible layer on the PDF--the search function can operate on the text without the text getting in the way of the exact copy of the image.

But it ISN'T an "exact copy of the image." It is a "meddled with" copy. OCR doesn't explain the "meddling."

So the image would still have all the letters in their original positions--it didn't "put" the 'R' in the middle of a string of text, that's just where it was on the original form. But because it was fainter than the other letters, it was left as part of the background rather than being extracted with the rest of the text.

What possible reason would the "background" have for being a different pixel resolution and bit depth from the text? Should we believe the programmer was a moron? Your argument that a "fainter" object should have its pixel resolution decreased while at the same time its dynamic range INCREASED, makes no sense.

That means it was downsampled along with the rest of the background when the PDF was optimized.

Yeah, about that "optimized" stuff. Your argument previously was that "Optimizing" decreases file size and memory requirements. (As if that is of any concern nowadays.) It just now occurred to me that you can get a 4 X REDUCTION in memory size by using the larger pixels, but you get a 7 X INCREASE in memory requirements by switching from a binary bitmap to 8-bit grayscale, and a 15 X increase by switching to 16-bit color! Here is the difference.

Binary bitmap: 1 bit per pixel.
Grayscale: 11111111 bits per pixel.
16-bit color: 1111111111111111 bits per pixel.

How is this supposed to be a benefit?
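
The bit-depth arithmetic above can be sketched in a few lines of Python (an illustrative toy with assumed region sizes, not measurements from the actual document):

```python
# Toy arithmetic for the trade-off described above: uncompressed storage
# for one rectangular scanned region at different bit depths.

def raw_bits(width_px, height_px, bits_per_pixel):
    """Uncompressed size in bits for a rectangular pixel region."""
    return width_px * height_px * bits_per_pixel

# A 600x600 region stored two ways:
bilevel   = raw_bits(600, 600, 1)   # binary bitmap: 1 bit per pixel
grayscale = raw_bits(600, 600, 8)   # 8-bit grayscale: 8 bits per pixel

# Downsampling 2x in each dimension cuts the pixel count 4x:
gray_down = raw_bits(300, 300, 8)

print(grayscale // bilevel)   # 8 -- 8x the bits at the same resolution
print(gray_down / bilevel)    # 2.0 -- still twice the bilevel original
```

So the numbers bear out the complaint as far as raw storage goes: a 2x downsample does not recover what the deeper pixels cost. (The reply below argues the layers were never "switched" in the first place.)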

(It would also mean that if the father's name wasn't there, someone searching for a document containing 'Barack' wouldn't find this, while someone searching for 'ack' would.)

Yeah, I get that. So for the benefit of some dubious secondary feature, they seriously degrade the necessary primary purpose of making an exact copy? And this theory makes more sense to you than different source image formats?

Occam's razor, dude.

285 posted on 07/21/2011 8:11:57 AM PDT by DiogenesLamp (The TAIL of Hawaiian Bureaucracy WAGS the DOG of Constitutional Law.)
[ Post Reply | Private Reply | To 273 | View Replies ]


To: DiogenesLamp
Yeah, about that "optimized" stuff. Your argument previously was that "Optimizing" decreases file size and memory requirements. (As if that is of any concern nowadays.)

Well, Adobe builds the capability in for some reason, as do other document management system vendors. You don't think that file size matters for people storing thousands of pages, as the users of a document archiving system would?
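
The scale of the problem is easy to put numbers on (assumed scan parameters here, not figures from the thread):

```python
# Rough arithmetic: why file size matters to an archive holding
# thousands of scanned pages. Parameters are typical, not documented.

def page_bytes(dpi, width_in=8.5, height_in=11, bits_per_pixel=24):
    """Uncompressed size in bytes of one scanned letter-size page."""
    pixels = int(dpi * width_in) * int(dpi * height_in)
    return pixels * bits_per_pixel // 8

one_page = page_bytes(300)       # a 300 dpi, 24-bit color scan
archive  = 10_000 * one_page     # ten thousand such pages

print(one_page // 2**20)  # 24  -- roughly 24 MiB per uncompressed page
print(archive // 2**30)   # 235 -- roughly 235 GiB before any compression
```

Hundreds of gigabytes for a modest archive is exactly the situation compression features are built for, which is why every document management vendor ships them.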

It just now occurred to me that you can get a 4 X REDUCTION in memory size by using the larger pixels, but you get a 7 X INCREASE in memory requirements by switching from a Binary bit map to 8 bit gray-scale, and a 15 X increase by switching to 16 bit Color!...How is this supposed to be a benefit?

I'll take one more shot at this. There is no switching from a binary bitmap to grayscale. The scanning and processing software recognizes most of the letters either as text (if it's doing OCR) or at least as pure black. It handles those in one way. It recognizes the background as a color image and handles it in another way.

Because the 'R' is faint compared to the other letters, it's treated as part of the background and processed as just a gray area of the background image. The whole background image is stored as a color image, and the 'R' is part of it--it's not "switched" to being in color, nor does it have its dynamic range increased. (As I'm sure you realize, if you scan a black-and-white photo at the same settings as you would use for a color photo, you get a file the same size as if it were a color photo. The computer doesn't "understand" that gray isn't really a color--unless you tell it so.)

If the background is downsampled, the 'R' is downsampled along with it. The important thing is that the computer doesn't know it's an 'R'. We can recognize it, but the software just thinks it's a gray smudge.
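
The separation being described can be sketched as a toy program (my own simplification, not Adobe's actual algorithm; the threshold value and tiny test image are assumptions):

```python
# A toy sketch of foreground/background separation: pixels darker than a
# threshold go to a 1-bit text layer; faint marks stay in the background
# image, which is then downsampled as a whole.

def separate(page, threshold=100):
    """page: 2D list of 0-255 gray values. Returns (text_mask, background)."""
    text_mask  = [[1 if px < threshold else 0 for px in row] for row in page]
    # Where text was lifted out, the background is whited over (255).
    background = [[255 if px < threshold else px for px in row] for row in page]
    return text_mask, background

def downsample2x(img):
    """Keep every other pixel in each dimension (crude 2x downsampling)."""
    return [row[::2] for row in img[::2]]

# A dark stroke (gray 20) clears the threshold and lands in the mask;
# a faint mark (gray 180) is left behind in the background layer.
page = [[255, 20, 180, 255]] * 4
mask, bg = separate(page)
print(mask[0])              # [0, 1, 0, 0] -- only the dark stroke extracted
print(downsample2x(bg)[0])  # [255, 180]   -- faint mark rides along, smaller
```

The point of the sketch is that nothing decides to "upgrade" the faint mark: it simply never makes it into the bilevel layer, so whatever happens to the background (color storage, downsampling) happens to it too.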

Here's something from a vendor of PDF compression software:

High accuracy recognition rates are achieved by leveraging advanced image processing techniques including: re-sampling, foreground and background separation, auto-rotation, and font learning. [Emphasis mine.]
And this theory makes more sense to you than different source image formats?

What I see is a choice between believing that the BC anomalies are the result of some combination of automatic PDF processing functions; or they're the result of someone sitting down with multiple source files and (using Adobe Illustrator, mind you, not Photoshop or some other tool much better suited to the task) copying and pasting letter by letter--a 'B' from this file, an 'R' from another, a different 'R' from yet another; a box from here, another box from there--to assemble a forgery. Occam's razor only cuts one way for me on that question.

291 posted on 07/21/2011 10:25:00 AM PDT by Ha Ha Thats Very Logical
[ Post Reply | Private Reply | To 285 | View Replies ]



FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson