Free Republic
Browse · Search
General/Chat
Topics · Post Article

OCR + Searchability of Legal Documents.
Agere_Contra | Me

Posted on 11/26/2020 10:08:42 AM PST by agere_contra

Fellow FReepers!

Sidney Powell can spell just fine!

You are probably getting the idea by now, but it seems that ALL documents submitted to court are put through an OCR process.

OCR stands for 'Optical Character Recognition'. This is the conversion of different types of documents - scanned paper documents, digital images, PDFs and so forth - into plain, machine-readable text.

The result is a searchable file.

This is to ensure that the text in those documents is searchable to (e.g.) research attorneys, court officials and other legal professionals.

Moreover: the data can *then* be moved to a common database and cross-referenced with other data. Anyone who has ever used a keyword search in FR will know how useful this is.

__________

I don't know for sure but I would expect that Sidney submits her various Krakens & affidavits in the form of write-protected PDFs & scans of signed documents.

Some of these file formats (PDFs for instance) could already be searchable - but the important thing for the court is to get ALL text into a common database. Searchability of individual files is not helpful - it *all* has to be machine-read.

__________

I suspect that each court Sidney submits files to carries out the following steps:

* Print out any electronic files to paper - OR alternatively they expect her to duplicate all submissions in the form of hardcopy.

* OCR all received hardcopy, and so expose all included text to the court library system.
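
To illustrate why a court would want ALL text in one place, here is a minimal sketch of a keyword index built over OCR output. It is not how any court's library system actually works; the file names and sample text are made up, and a real system would be far more elaborate.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the names of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical OCR output, keyed by made-up file names.
documents = {
    "complaint.txt": "United States District Court, Northern District of Georgia",
    "affidavit.txt": "Affidavit regarding a work order for a repair",
}
index = build_index(documents)
print(sorted(search(index, "district court")))
```

Once every document feeds the same index, one query covers the whole collection, which is the point of machine-reading everything rather than relying on per-file search.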

__________

'District' vs 'Districct'

Large font titles near the beginning of documents seem to get particularly hacked about by OCR. This may be because they are in a cursive font? I don't know for sure.
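
A search over OCR-ed text can be made tolerant of slips like 'Districct' by matching on similarity rather than exact spelling. The sketch below uses Python's standard difflib module; the 0.85 threshold and the sample word list are my own assumptions, not anything a court system is known to use.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity ratio between two words, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_find(term, words, threshold=0.85):
    """Return the words whose similarity to `term` meets the threshold."""
    return [w for w in words if similarity(term, w) >= threshold]

# Hypothetical words as they might come out of an OCR pass.
ocr_words = ["United", "States", "Districct", "Court"]
print(fuzzy_find("District", ocr_words))
```

A lower threshold catches more OCR damage but also lets in more false matches, so the value would need tuning against real documents.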

Hope this proves helpful.


TOPICS: Miscellaneous
KEYWORDS: court; filings; lawsuits; ocr; powell
To: Fido969

The version we used was a pain and required manual intervention, both to guide it when it was confused and then to correct the errors. Accuracy was extremely important to us. The tool that we used was commercially available.

Sometimes the source was so rough that the OCR would guess wrong and do a terrible job.

Interesting... the human brain can interpolate from sentence grammar and from concepts where a software program cannot. A computer cannot deal in concepts and ideas. This is a matter of AI and pattern recognition, both at the shape-matching level and conceptually.

We just ended up giving it to our documentation person who transcribed it and she was good.


21 posted on 11/26/2020 12:57:37 PM PST by dhs12345
[ Post Reply | Private Reply | To 16 | View Replies]

To: E. Pluribus Unum

In our situation, the documents were so old that they were not in PDF. They were scanned to PDF. Not to be confused with native PDF documents, of course.


22 posted on 11/26/2020 1:01:03 PM PST by dhs12345
[ Post Reply | Private Reply | To 20 | View Replies]

To: dhs12345

Could you have printed them to PDF instead?

I don’t expect an answer.

Just seems like a possibility.


23 posted on 11/26/2020 1:05:19 PM PST by E. Pluribus Unum (You are in far more danger from an authoritarian government than you are from a seasonal virus.)
[ Post Reply | Private Reply | To 22 | View Replies]

To: E. Pluribus Unum

Is the source a graphic image, a photo, or, for example, MS Word converted to PDF?

Two completely different beasts. If you open a native pdf you’ll find complete text and you could almost extract the text using a simple text editor. Cut and paste works too. I do it all of the time.


24 posted on 11/26/2020 1:10:47 PM PST by dhs12345
[ Post Reply | Private Reply | To 23 | View Replies]

To: Fido969

Well, the MI filing got a case number. I think the GA filing will too, with time.


25 posted on 11/26/2020 1:15:23 PM PST by dynoman (Objectivity is the essence of intelligence. - Marilyn vos Savant)
[ Post Reply | Private Reply | To 19 | View Replies]

To: E. Pluribus Unum

Hi Pluribus - yes you’re correct: DOCX or PDF files can be imported directly into text. And both are usually searchable as individual files (PDFs might not be).

But sometimes those docs may contain true handwriting.

Imagine for instance an affidavit containing the scan of a hand-marked-up work order for the repair of a Georgia toilet. That might need to be scanned for text, or it might be OK to leave it as a scan, depending on an organisation's workflow.

Case in point. I once had the job of creating a database using text that I OCR-ed from hand-written work-orders scanned to doc files.

These work-orders detailed railway incidents and remedial work on those incidents. The orders obviously had massive potential significance due to legal liability. My work made them searchable.

Data collection tools are far more digitised these days, but not everybody can wave a tablet or a phone and gather all they need from a crash, a bridge-strike location, the site of a leak etc.

Handwritten forms are still a big deal. I can certainly imagine a large organisation bound by Government regulations retaining a safe catch-all way of doing things. At least as a fall-back.

Schools, Hospitals, Nuclear facilities, Courts - the risk of missing a reference by not OCR-ing everything might be enormous.

And you always have the clean PDFs to work from if you need to. I guess that OCR-ed text is for searching on, not for presenting to a Judge.


26 posted on 11/26/2020 3:08:29 PM PST by agere_contra (Please pray for Pope Benedict XVI)
[ Post Reply | Private Reply | To 20 | View Replies]

To: dhs12345

Good point.

Human image recognition is so vastly superior to anything a computer can do that we routinely use jumbled up text on a patterned background - or picking out images with cars or boats in - as an infallible test for humanity.


27 posted on 11/26/2020 3:12:43 PM PST by agere_contra (Please pray for Pope Benedict XVI)
[ Post Reply | Private Reply | To 21 | View Replies]

To: Fido969

“Why paper?”

So Sandy Burglar can stuff them in his shorts, of course.

As I said, the mechanics of the legal profession change slowly, but there are arguments for having a physical copy. It can’t be hacked, for one thing. It creates a paper trail for documents and their history.

Most likely the documents were prepared electronically, printed, and then faxed somewhere, the faxes were scanned into a PDF and filed electronically, with the paper to follow during normal business hours. The people that are prepping docs are not in their home offices and the courts are not open 24 hours. Some of those steps are unnecessary - the print-scan step seems superfluous - but on the other hand, look what happened to the documents relating to Hunter’s laptop that were shipped with (allegedly) no backup. They were swiped en route. Electronic and paper redundancy can matter. And maybe there are security issues involved in the electronic filing, like a VPN connection that is only allowed from certain locations. Most lawyers are not particularly computer savvy, no more than most, and their paralegals are also not typically computer science majors or IT experts.

Anyway, no matter if the documents are prepared electronically, there is ALWAYS a paper backup.

And you are correct, there are numerous PDF engines. Adobe no longer has a monopoly on reading or writing that format. That’s been true for a good, long while, now.

It absolutely was a group project, directed by the lawyers, but the construction of the final filing was done by paras. After that there would have been several iterations of proofs and corrections. And you can see that some portions were boilerplate with square brackets around fields to be populated. (That’s what we were doing 20+ years ago, creating templates with fields to be populated. Square brackets and all. Then stringing them together according to whatever legal/business rules applied and populating the fields.) Most of the document is original work, but the initial parts are standard legal templates. I would bet a decent sum that the initial work was done in Word and it didn’t end up as a PDF until somewhere close to the point where it was filed. Those square brackets are artifacts of Word mail-merge fields (which don’t have to be used for just mail merges).


28 posted on 11/26/2020 4:19:53 PM PST by calenel (Tree of Liberty is thirsty.)
[ Post Reply | Private Reply | To 7 | View Replies]

To: frog in a pot

Immediate production of 36 hours of security camera recording of all rooms used in the voting process at State Farm Arena in Fulton County, GA from 12:00am to 3:00am until 6:00pm on November 3.


Typos abound in this thing


29 posted on 11/26/2020 6:29:32 PM PST by lepton ("It is useless to attempt to reason a man out of a thing he was never reasoned into"--Jonathan Swift)
[ Post Reply | Private Reply | To 6 | View Replies]

To: dynoman
Can you answer post 14?

The Michigan case appears in the Electronic Case File (ECF) system, but the Georgia case does not yet appear in the system.

https://defendingtherepublic.org/wp-content/uploads/2020/11/COMPLAINT-CJ-PEARSON-V.-KEMP-11.25.2020.pdf

A copy of the draft Complaint is posted on the website of Defending the Republic which is affiliated with Sidney Powell. It is probably just waiting for the court clerk to accept it and assign a case file number.

30 posted on 11/26/2020 7:14:17 PM PST by woodpusher
[ Post Reply | Private Reply | To 15 | View Replies]

To: frog in a pot
See footnote, page I. (It was before my first cup of coffee.)

I guess I missed the error. Can you point it out?:

1 The same pattern of election fraud and voter fraud writ large occurred in all the swing states with only minor variations, see expert reports, regarding Michigan, Pennsylvania, Arizona and Wisconsin. (See William M. Briggs Decl., attached here to as Exh. 1, Report with Attachment). Indeed, we believe that in Arizona at least 35,000 votes were illegally added to Mr. Biden’s vote count.

31 posted on 11/26/2020 7:24:11 PM PST by Yo-Yo (is the /sarc tag really necessary?)
[ Post Reply | Private Reply | To 6 | View Replies]

To: calenel; agere_contra
The initial filing was likely scanned from paper into an image, then processed via OCR back into a searchable electronic format (PDF) for transmission to the court.

No, it was converted to PDF electronically with no scanner.

The metadata shows: "xmp:CreatorTool: Acrobat PDFMaker 20 for Word"

https://helpx.adobe.com/acrobat/using/creating-pdfs-pdfmaker-windows.html

The whole source file is converted to a PDF file very quickly by pressing a button.
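
For anyone who wants to check that metadata field themselves, here is a rough sketch that looks for the xmp:CreatorTool value in the raw bytes of a PDF. It assumes the XMP packet is stored uncompressed, which is common but not guaranteed; if the packet is compressed or missing, the function simply returns None. The sample bytes are a made-up fragment, not the actual filing.

```python
import re

# XMP stores the property either as an element or as an attribute.
ELEMENT = re.compile(rb"<xmp:CreatorTool>(.*?)</xmp:CreatorTool>", re.DOTALL)
ATTRIBUTE = re.compile(rb'xmp:CreatorTool="([^"]*)"')

def creator_tool(pdf_bytes):
    """Return the xmp:CreatorTool value found in raw PDF bytes, or None."""
    for pattern in (ELEMENT, ATTRIBUTE):
        match = pattern.search(pdf_bytes)
        if match:
            return match.group(1).decode("utf-8", errors="replace").strip()
    return None

# Hypothetical fragment standing in for the bytes of a real file.
sample = (
    b"%PDF-1.6\n"
    b"<x:xmpmeta><rdf:Description>"
    b"<xmp:CreatorTool>Acrobat PDFMaker 20 for Word</xmp:CreatorTool>"
    b"</rdf:Description></x:xmpmeta>\n"
)
print(creator_tool(sample))
```

With a real file you would read the bytes in binary mode and pass them to creator_tool. A value naming a Word add-in points to direct conversion, although metadata can be edited, so it is a hint and not proof.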

32 posted on 11/26/2020 7:30:10 PM PST by woodpusher
[ Post Reply | Private Reply | To 5 | View Replies]

To: agere_contra

I amm not whir-eed about it.


33 posted on 11/26/2020 7:35:15 PM PST by linMcHlp
[ Post Reply | Private Reply | To 1 | View Replies]

To: Yo-Yo
I guess I missed the error. Can you point it out?

See
"...attached here to as Exh. 1"

The Complaint is signed off by four legal entities and undoubtedly was reviewed by any number of attorneys and paralegals. Like others on this thread, IMO the final draft of the filing was free of typos. What we are seeing is a mechanically reproduced form intended for after hours electronic transmission.

The Left and its MSM propaganda machine delight in criticizing the filing as nonprofessional, indeed amateurish.

34 posted on 11/27/2020 2:05:11 PM PST by frog in a pot (The American voter should realize there is nothing democratic about the current Democrat Party.)
[ Post Reply | Private Reply | To 31 | View Replies]



Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson