Replies

People who were interested in this issue conducted software experiments with the Obama long-form birth certificate (LFBC) PDF image. They posted their findings and the steps they took so that ANYONE could try to duplicate what they did and observe the results for themselves.
Replication of an experiment is a key component of the scientific method.
One of the people who evaluated the findings is a Brazilian computer scientist named Ricardo de Queiroz. The 2nd, 4th, 5th, 7th, 8th, and 13th U.S. patents ever issued on Mixed Raster Content (MRC) compression were granted to Ricardo de Queiroz and his team.
It was suspected that MRC compression accounted for the artifacts observed in the Obama LFBC image.
Evaluation of Obama PDF File by Professor Ricardo de Queiroz

There is no possible way I can tell if the PDF of President Obama’s birth certificate (POBC) made available by the White House is a “forgery” or not. A forgery could happen before the file was processed, not to mention that the paper document itself could have been forged before the scanning. Thus, this is not the point.

The question is whether all these artifacts we see after rendering the PDF of POBC are signs of forgery. I do not see that. I see them more likely as a result of inadequate processing.

The document has poor quality and it has been aggressively processed, no question about it. The question is whether the corrupting processing was done by an individual with the intent of forging it, or whether it was automated within regular MRC segmentation.

If it was a forgery it was a very sloppy job. Any garden-variety Photoshop-knowledgeable person can do a much better job than that. If it is automated, it is a lousy job too, but bear in mind that algorithms for these jobs are not trained on specific documents. They were more likely developed, trained and tested on magazine pages and books. A US birth certificate is unlikely to give good results because it may be an outlier in the big picture of all documents they had in mind when they developed their MRC tool.

MRC is about separating the single-image document into multiple layers, hopefully each one with a given characteristic. This has to be done automatically, in what we call segmentation. What I see in the document are signs of MRC segmentation consistent with strategies in line with the techniques pioneered by DjVu. I (and my students) do not advocate doing the segmentation that way, but that is not the point either. In fact, I would not be surprised if the software which segmented the WH document was derived from some DjVu tool.
[Continued in next post]

Part Two: Evaluation of Obama PDF File by Professor Ricardo de Queiroz

They first try to “lift” the text to another layer. They can find more than one type of text and place them in different layers. The rest is background, and they compress it with standard image compression methods. In the POBC [President Obama Birth Certificate] I see lots of signs of that. It missed a lot of text, like the R in BARACK and in many other places. The missed text is aggressively compressed, with JPEG for example, which explains the damage to those text parts.
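
The following sketch is not part of Professor de Queiroz's evaluation. It is a toy illustration of the layer separation he describes, assuming a grayscale image and a fixed darkness threshold as a stand-in for the real segmenter, which is unknown. The function names and the threshold value are hypothetical. It shows how a faint letter that the segmenter misses stays in the background layer, where a real tool would then compress it with a lossy image codec.

```python
# Toy MRC decomposition of a grayscale image (0 = black, 255 = white).
# Assumption: a fixed threshold stands in for the real, unknown segmenter.

def segment(image, threshold=100):
    """Return a binary mask: 1 where a pixel is dark enough to be lifted as text."""
    return [[1 if px < threshold else 0 for px in row] for row in image]

def split_layers(image, mask, fill=255):
    """Split the image into a foreground layer (lifted pixels only)
    and a background layer (lifted pixels replaced by a fill value)."""
    foreground = [[px if m else None for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    background = [[fill if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    return foreground, background

def recompose(mask, foreground, background):
    """Rebuild the page: take the foreground where the mask is 1, else the background."""
    return [[f if m else b for m, f, b in zip(mrow, frow, brow)]
            for mrow, frow, brow in zip(mask, foreground, background)]

# 200 = patterned paper, 20 = crisp dark text, 120 = faint text stroke.
page = [
    [200, 200, 200, 200],
    [200,  20, 120, 200],
    [200, 200, 200, 200],
]
page_mask = segment(page)
page_fg, page_bg = split_layers(page, page_mask)
# The faint stroke (120) is not in the mask, so it remains in page_bg.
```

In this toy case the layers are stored losslessly, so recomposing them returns the original page. In a real MRC pipeline the background would be compressed lossily, which is where missed text would be damaged.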

About the halos around some text: I am not sure why they do it, but it may be trying to suppress another halo problem caused by “lifting” scanned text, which leaves some of the foreground in the background and vice versa, making the layers harder to compress. We wrote some papers about it. You can still see the background through the inside of some “O” letters and inside the check boxes.

There might be morphological dilation around the text mask, or the segmentation may be block-based. The halo could be caused by the foreground in a dilated mask, or by processing the background. One plausible alternative is that the algorithm finds text as the letters with a bit of the surrounding background for safety. Some Adobe tools do that.
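
As an illustration only (not part of the evaluation, and not a claim about what any particular tool does), morphological dilation of a binary text mask can be sketched as below. The 3x3 neighborhood and the function name are assumptions chosen for simplicity. The pixels added by dilation form a ring around the original text, and that ring is the region where a halo could appear.

```python
# Toy binary dilation with a 3x3 square neighborhood (an assumed choice).

def dilate(mask):
    """Set a pixel to 1 if it or any of its 8 neighbors is 1 in the input mask."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and mask[ny][nx]:
                        out[y][x] = 1
    return out

# A single text pixel in the middle of a 5x5 mask.
text_mask = [[0] * 5 for _ in range(5)]
text_mask[2][2] = 1

dilated = dilate(text_mask)

# Ring of pixels added by dilation: in the dilated mask but not in the original.
ring = [[d - m for d, m in zip(drow, mrow)]
        for drow, mrow in zip(dilated, text_mask)]
```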

Furthermore, the text is lifted to the foreground and sharpened (nearly binarized), making the background surroundings disappear. When the text layer is pushed back onto the background plane, the letter surroundings become a halo. There is also some grayish lifted text, which was perhaps found to have different statistics and was then treated differently. The mask is binary; the foreground (text) can have any color or texture, or even include parts of the background around the text. All these are conjectures; different algorithmic choices might produce similar results.
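
The conjecture above can be mimicked with a toy example, which is not part of the evaluation and does not reproduce any real tool. It assumes the lifted region is slightly wider than the letter itself and that sharpening is a hard threshold; both the threshold value and the function names are hypothetical.

```python
# Toy halo demonstration on one row of grayscale pixels (0 = black, 255 = white).

def binarize(px, threshold=128):
    """Hard threshold used here as a stand-in for 'sharpened, nearly binarized'."""
    return 0 if px < threshold else 255

def lift_sharpen_recompose(row, mask_row, threshold=128):
    """Sharpen the pixels under the mask and paste them back over the untouched rest."""
    return [binarize(px, threshold) if m else px
            for px, m in zip(row, mask_row)]

# 180 = patterned background, 30 = one dark text pixel.
scan_row = [180, 180, 30, 180, 180]
# Assumed mask that covers the text pixel plus one background pixel on each side.
wide_mask = [0, 1, 1, 1, 0]

result_row = lift_sharpen_recompose(scan_row, wide_mask)
# result_row is [180, 255, 0, 255, 180]: the background pixels caught inside
# the mask turn white next to the letter, while the pattern survives outside.
```

The white pixels flanking the dark one are the halo: background that was lifted along with the letter and wiped out by the sharpening step.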

I took a birth certificate which has a similar background pattern, scanned it, and compressed it using an older DjVu tool. It showed the same problems as the POBC, like text letters that were missed and sent to the background, and multiple text styles. It didn’t have a halo, though, because its algorithm decided to obliterate the whole background pattern. Perhaps if I had time to toy around with packages and parameters I might find something very close to what was used to generate the document shown by the WH, but I unfortunately do not have the time right now.

In summary I can only say I see much stronger signs of common MRC algorithmic processing of the image rather than some intentional manipulation.

Ricardo L. de Queiroz, Professor of Computer Science, University of Brasilia, Brazil
http://signal.ece.utexas.edu/~queiroz/resume/resume-consult.pdf