Free Republic
Browse · Search
News/Activism
Topics · Post Article


Tech Question: Best OCR software for (ubuntu) Linux?

Posted on 09/13/2011 12:08:14 PM PDT by cc2k

OK. I've answered my share of tech questions over the years. Now I'm faced with a project and looking for words of experience from others.

Any recommendations (either positive, or to avoid) for OCR software for Linux? I run Ubuntu 10.04 LTS on my desktop and laptop. I have 100+ pages, some typewritten and some from word processing where the electronic versions are no longer available, which I need to convert to something that can be published on the web (probably on a WordPress site).

Other than typing from the source pages, what are good options for OCR software for Linux? Are there any really good open source solutions?


TOPICS: Technical; Your Opinion/Questions
KEYWORDS: linux; ocr
Sorry for the Vanity. I have a project I'm doing for a church, and I need to get 100+ pages of "history" into something that can be put on their website.
1 posted on 09/13/2011 12:08:22 PM PDT by cc2k
[ Post Reply | Private Reply | View Replies]

To: cc2k; rdb3; Calvinist_Dark_Lord; GodGunsandGuts; CyberCowboy777; Salo; Bobsat; JosephW; ...

2 posted on 09/13/2011 12:10:03 PM PDT by ShadowAce (Linux -- The Ultimate Windows Service Pack)
[ Post Reply | Private Reply | To 1 | View Replies]

Comment #3 Removed by Moderator

To: cc2k

Look into tesseract or gocr. I’ve heard good things about tesseract.


4 posted on 09/13/2011 12:12:15 PM PDT by ShadowAce (Linux -- The Ultimate Windows Service Pack)
[ Post Reply | Private Reply | To 1 | View Replies]

To: cc2k

http://jocr.sourceforge.net/


5 posted on 09/13/2011 12:14:58 PM PDT by libh8er
[ Post Reply | Private Reply | To 1 | View Replies]

To: ShadowAce

I always use one of the web services for that kind of stuff.

Sure you can install a local version, but you can’t beat the accuracy of some of these web services.

I always look for the one where I can just pay per page no contract.

Even if the price is higher, with the guarantee, it is awesome.


6 posted on 09/13/2011 12:33:22 PM PDT by dila813
[ Post Reply | Private Reply | To 2 | View Replies]

To: ShadowAce
Thanks, ShadowAce.

Just checked the Ubuntu Software Center, and the tesseract-ocr package is available right there. One-click installation, no fuss. I like that. Here's the description:


$ apt-cache show tesseract-ocr
Package: tesseract-ocr
Priority: optional
Section: universe/graphics
Installed-Size: 3216
Maintainer: Ubuntu Developers 
Original-Maintainer: Jeffrey Ratcliffe 
Architecture: amd64
Source: tesseract
Version: 2.04-2
Replaces: tesseract-ocr-data
Depends: libc6 (>= 2.4), libgcc1 (>= 1:4.1.1), libjpeg62, libstdc++6 (>= 4.1.1), libtiff4, zlib1g (>= 1:1.1.4), tesseract-ocr-eng | tesseract-ocr-language
Filename: pool/universe/t/tesseract/tesseract-ocr_2.04-2_amd64.deb
Size: 1034984
MD5sum: 459d4786fcc418b7e06b4f24a9633211
SHA1: 30e93552a4dd8fee4149d5c00296af6f09e9c1b3
SHA256: 70ac09ad1ec89e29943a714089fedbf4946eaac3ff240361ef1a0f86fb36cb76
Description: Command line OCR tool
 The Tesseract OCR engine was originally developed at HP between 1985 and 1995.
 It was open-sourced by HP and UNLV in 2005 and Google has lead further
 development.
 .
 The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV
 Accuracy test.  Between 1995 and 2006 it had little work done on it, but it
 is probably one of the most accurate open source OCR engines available.  It
 will read a binary, grey or color image and output text.
Homepage: http://code.google.com/p/tesseract-ocr/
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu

Not sure how current the version is, but I'll give the version in the repository a spin before I start doing anything more exotic.
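
For anyone who wants to try the same thing, here is a rough sketch of batch-running the command-line tool over a folder of scans. It assumes the pages are already saved as TIFF files and that the English language data is installed; the `ocr_dir` function name and the `OCR_CMD` override are made up for illustration and are not part of tesseract itself.

```shell
#!/bin/sh
# Sketch: OCR every TIFF in a directory with the tesseract command-line tool.
# Assumption: scans are stored as *.tif files in one directory.

# OCR_CMD is a hypothetical override (handy for a dry run); default is the real binary.
OCR_CMD="${OCR_CMD:-tesseract}"

ocr_dir() {
    dir="$1"
    for img in "$dir"/*.tif; do
        [ -e "$img" ] || continue      # the glob matched nothing, so skip it
        base="${img%.tif}"             # tesseract appends .txt to this output base
        # Usage pattern: tesseract <image> <outputbase> -l <language>
        $OCR_CMD "$img" "$base" -l eng || echo "failed: $img" >&2
    done
}
```

Calling `ocr_dir ~/scans` should leave one `.txt` file next to each `.tif`. Poor-quality copies will probably still need proofreading by hand afterwards.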

That package has an interesting pedigree and history. I'll report back if anyone is interested.

Thanks for the tip.

Oh, and these documents I'll be scanning will be a challenge for any OCR. Some are 2nd or 3rd generation copies, and some are quite old (originally from typewriters).

Thanks to all who replied, and any further replies.

7 posted on 09/13/2011 12:37:05 PM PDT by cc2k ( If having an "R" makes you conservative, does walking into a barn make you a horse's (_*_)?)
[ Post Reply | Private Reply | To 4 | View Replies]

To: cc2k

I’ve used PaperPort for OCR on Windows for years. Now that I rarely use Windows natively, I use PaperPort on Windows in a VM. However, a friend of mine used a Linux app, GNU Ocrad, and said it suffices. If you use an Ubuntu-based distro, it and others are in the repos, available through Synaptic or the Software Center.

If you d/l and try it (or find something else that works), please let us know how it works out, ok?


8 posted on 09/13/2011 12:45:56 PM PDT by papasmurf (I support Palin & Perry, singular or plural & I pledge to vote (R), regardless.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: cc2k
I also would have to recommend gocr.

I have actually used GOCR on scanned printed pages and it worked quite well. I used it with a USB scanner running SANE. I have heard that Cuneiform for Ubuntu 10.10 also works quite well.

If you are using a USB scanner, you might want to check that your scanner is supported by XSane in Linux.
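
A minimal sketch of that scanner-plus-GOCR workflow from the command line, for anyone who prefers it to the XSane GUI. It assumes SANE's `scanimage` and `gocr` are both installed; the `scan_and_ocr` function, the `SCAN_CMD`/`GOCR_CMD` overrides and the file names are invented for this example, and the `--resolution` and `--mode` values depend on what your scanner's backend actually offers.

```shell
#!/bin/sh
# Sketch: scan one page with SANE's scanimage, then OCR it with gocr.
# Assumption: the scanner backend accepts --resolution 300 and --mode Gray.

# Hypothetical overrides so the commands can be swapped out for a dry run.
SCAN_CMD="${SCAN_CMD:-scanimage}"
GOCR_CMD="${GOCR_CMD:-gocr}"

scan_and_ocr() {
    name="$1"                          # output base name, e.g. page-001
    # scanimage writes the image to stdout; PNM is a format gocr reads directly.
    $SCAN_CMD --format=pnm --resolution 300 --mode Gray > "$name.pnm" || return 1
    # gocr: -i names the input image, -o names the output text file.
    $GOCR_CMD -i "$name.pnm" -o "$name.txt"
}

# Before the first scan, `scanimage -L` lists the devices SANE can see.
```

Running `scan_and_ocr page-001` should leave `page-001.pnm` and `page-001.txt` in the current directory. If `scanimage -L` shows no device, the scanner is probably not supported and XSane will not see it either.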

9 posted on 09/13/2011 12:50:34 PM PDT by pyx (Rule#1.The LEFT lies.Rule#2.See Rule#1. IF THE LEFT CONTROLS THE LANGUAGE, IT CONTROLS THE ARGUMENT.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: cc2k

Thanks for asking....I need something too.


10 posted on 09/13/2011 1:55:40 PM PDT by Ernest_at_the_Beach ( Support Geert Wilders)
[ Post Reply | Private Reply | To 7 | View Replies]

To: cc2k; All

VelOCRaptor: Pretty good free OCR software for the Mac.

11 posted on 09/13/2011 2:41:20 PM PDT by martin_fierro (< |:)~)
[ Post Reply | Private Reply | To 1 | View Replies]

To: cc2k

I have been working with OCR for 11 years now, started with Linux back in Russia, and your question has always bugged me.

It is no secret that Windows-based software has always had the best OCR quality on the market. Linux and Mac had few choices, which means low competition, which means no aggressive chase after quality. For example, one of my previous companies developed for Windows and then ported some subset to Linux. With recent cloud developments this question now has a convenient solution: a cloud-based, OS-independent OCR service.

OCR-IT OCR Cloud 2.0 gives you access to the best web-based OCR on the market today from any computer or mobile platform, as long as there is an internet connection (yeah, tough requirement!). You can sign up for a free Testing account and process your documents here: http://www.ocr-it.com/ocr-cloud-2-0-api


12 posted on 09/13/2011 6:55:34 PM PDT by ilyae (OCR Expert for 11 years on all platforms)
[ Post Reply | Private Reply | To 1 | View Replies]

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.


FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson