Replies

The trouble with this theory is that it ended up in a string of text where a letter is supposed to be. If the program couldn't recognize it as a letter, why did it put it in the middle of a string of text?

The point of the document archiving process is to be able to search for a document by the text within it but also retrieve an exact copy/image of the document. That's why the OCR'd text is placed in an invisible layer on the PDF--the search function can operate on the text without the text getting in the way of the exact copy of the image.

So the image would still have all the letters in their original positions--it didn't "put" the 'R' in the middle of a string of text, that's just where it was on the original form. But because it was fainter than the other letters, it was left as part of the background rather than being extracted with the rest of the text. That means it was downsampled along with the rest of the background when the PDF was optimized. (It would also mean that if the father's name wasn't there, someone searching for a document containing 'Barack' wouldn't find this, while someone searching for 'ack' would.)
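
To make the search point concrete, here is a minimal sketch of how a plain substring search over a hidden text layer would behave if one letter were missing. The text layer string below is hypothetical, invented only for illustration; it is not taken from any actual file, and real PDF viewers may match text differently.

```python
def found_in_text_layer(text_layer: str, query: str) -> bool:
    """Case-insensitive substring search, as a stand-in for a viewer's find function."""
    return query.lower() in text_layer.lower()

# Hypothetical OCR output where the faint 'R' was never extracted as text
# and so left a gap in the name (assumption, for illustration only).
hypothetical_text_layer = "BA ACK"

full_name_hit = found_in_text_layer(hypothetical_text_layer, "Barack")
fragment_hit = found_in_text_layer(hypothetical_text_layer, "ack")

print("Search for 'Barack':", full_name_hit)
print("Search for 'ack':", fragment_hit)
```

Under that assumption, the whole name is not found because the extracted string has a gap where the 'R' should be, while the fragment still matches.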

“someone searching for a document containing ‘Barack’ wouldn’t find this, while someone searching for ‘ack’ would.” - HHTVL

Silliness - the WH_LFCOLB.pdf has no text at all - it is all images. OCR was not done on the WH_LFCOLB.pdf.

Why do you spread such patently false information?

HMMMMM...

The point of the document archiving process is to be able to search for a document by the text within it but also retrieve an exact copy/image of the document. That's why the OCR'd text is placed in an invisible layer on the PDF--the search function can operate on the text without the text getting in the way of the exact copy of the image.

But it ISN'T an "exact copy of the image." It is a "meddled with" copy. OCR doesn't explain the "meddling."

So the image would still have all the letters in their original positions--it didn't "put" the 'R' in the middle of a string of text, that's just where it was on the original form. But because it was fainter than the other letters, it was left as part of the background rather than being extracted with the rest of the text.

What possible reason would the "background" have for being a different pixel resolution and bit depth from the text? Should we believe the programmer was a moron? Your argument that a "fainter" object should have its pixel resolution decreased while at the same time its dynamic range is INCREASED makes no sense.

That means it was downsampled along with the rest of the background when the PDF was optimized.

Yeah, about that "optimized" stuff. Your argument previously was that "optimizing" decreases file size and memory requirements. (As if that is of any concern nowadays.) It just now occurred to me that you can get a 4 X REDUCTION in memory size by using the larger pixels, but you get a 7 X INCREASE in memory requirements by switching from a binary bitmap to 8-bit gray-scale, and a 15 X increase by switching to 16-bit color! Here is the difference.

Binary bitmap: 1 bit per pixel (1).
8-bit gray-scale: 8 bits per pixel (11111111).
16-bit color: 16 bits per pixel (1111111111111111).

How is this supposed to be a benefit?
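
As a rough check of that arithmetic, the sketch below compares raw, uncompressed storage for a 1-bit layer at full resolution against 8-bit and 16-bit layers at half the resolution in each dimension (the 4 X pixel reduction). The page dimensions are hypothetical, and the calculation deliberately ignores whatever compression a real PDF applies to each layer, so it only illustrates the bits-per-pixel trade-off, not actual file sizes.

```python
def raw_size_bits(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Uncompressed storage for an image, in bits (no compression modeled)."""
    return width_px * height_px * bits_per_pixel

# Hypothetical page dimensions in pixels (assumption, for illustration only).
full_w, full_h = 2400, 3200
# Halving the resolution in each dimension gives one quarter of the pixels.
half_w, half_h = full_w // 2, full_h // 2

binary_full = raw_size_bits(full_w, full_h, 1)    # 1-bit bitmap, full resolution
gray_half = raw_size_bits(half_w, half_h, 8)      # 8-bit gray-scale, quarter the pixels
color_half = raw_size_bits(half_w, half_h, 16)    # 16-bit color, quarter the pixels

gray_ratio = gray_half / binary_full
color_ratio = color_half / binary_full

print(f"8-bit at quarter pixel count vs 1-bit at full: {gray_ratio:.1f}x")
print(f"16-bit at quarter pixel count vs 1-bit at full: {color_ratio:.1f}x")
```

On raw numbers alone, 8 bits per pixel is 8 times the 1-bit depth (an increase of 7 times the original), so a quarter of the pixels still leaves the gray-scale layer at twice the raw size of the bitmap, and the 16-bit layer at four times.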

(It would also mean that if the father's name wasn't there, someone searching for a document containing 'Barack' wouldn't find this, while someone searching for 'ack' would.)

Yeah, I get that. So for the benefit of some dubious secondary feature, they seriously degrade the necessary primary purpose of making an exact copy? And this theory makes more sense to you than different source image formats?

Occam's razor, dude.