Replies

There are certain things machines do well. Performing many calculations very quickly is one, and is why computers are so good at playing chess. Distinguishing text from non-text, in context, is not. OCR technology is pretty good, and getting better, but it’s not perfect.

I deal with OCR’d documents every day (I’m an attorney, and a number of my clients have scanned many of their contracts and saved them as PDFs). While the technology works pretty well, if I need to find particular language in a contract, a search of OCR’d text is no substitute for a pair of human eyes on the document.

Of course, the purpose of the pdf was to be a portable document format suitable for transfer to a printer. The OCR thing is a plus that came later. And just as early chess programs couldn't beat an average player, OCR isn't perfect.

In the case of the White House pdf, I have focused now on the signatures which are clearly non-text; and obviously (to me anyway) have different sources, pdf software or no.

I would be curious, though, about the pdf documents that you see in your legal work. I assume you produce some and you receive others. And here I'm only concerned with the ones that shouldn't be composites. Do you ever see any where even the text and the signatures have different pixelation? (Let alone two signatures next to each other!) And do you know anything about how the pdfs you produce become pdfs; or do you just scan and a pdf shows up on your computer? (As opposed, say, to obtaining a jpg from a scan and then utilizing Acrobat to include the jpg image in a pdf?)

ML/NJ