Posted on 09/13/2011 12:08:14 PM PDT by cc2k
OK. I've answered my share of tech questions over the years. Now I'm faced with a project and looking for words of experience from others.
Any recommendations (either positive, or to avoid) for OCR software for Linux? I run Ubuntu 10.04 LTS on my desktop and laptop. I have 100+ pages, some typewritten, some from word processing where the electronic versions are no longer available, which I need to convert to something that can be published on the web (probably on a WordPress site).
Other than typing from the source pages, what are good options for OCR software for Linux? Are there any really good open source solutions?
Look into tesseract or gocr. I’ve heard good things about tesseract.
I always use one of the web services for that kind of stuff.
Sure you can install a local version, but you can’t beat the accuracy of some of these web services.
I always look for one where I can just pay per page, with no contract.
Even if the price is higher, with the guarantee, it is awesome.
Just checked the Ubuntu Software Center, and the tesseract-ocr package is available right there. One click installation, no fuss. I like that. Here's the description:
Not sure how current the version is, but I'll give the version in the repository a spin before I start doing anything more exotic.
$ apt-cache show tesseract-ocr
Package: tesseract-ocr
Priority: optional
Section: universe/graphics
Installed-Size: 3216
Maintainer: Ubuntu Developers
Original-Maintainer: Jeffrey Ratcliffe
Architecture: amd64
Source: tesseract
Version: 2.04-2
Replaces: tesseract-ocr-data
Depends: libc6 (>= 2.4), libgcc1 (>= 1:4.1.1), libjpeg62, libstdc++6 (>= 4.1.1), libtiff4, zlib1g (>= 1:1.1.4), tesseract-ocr-eng | tesseract-ocr-language
Filename: pool/universe/t/tesseract/tesseract-ocr_2.04-2_amd64.deb
Size: 1034984
MD5sum: 459d4786fcc418b7e06b4f24a9633211
SHA1: 30e93552a4dd8fee4149d5c00296af6f09e9c1b3
SHA256: 70ac09ad1ec89e29943a714089fedbf4946eaac3ff240361ef1a0f86fb36cb76
Description: Command line OCR tool
 The Tesseract OCR engine was originally developed at HP between 1985 and 1995. It was open-sourced by HP and UNLV in 2005 and Google has lead further development.
 .
 The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. It will read a binary, grey or color image and output text.
Homepage: http://code.google.com/p/tesseract-ocr/
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Origin: Ubuntu
That package has an interesting pedigree and history. I'll report back if anyone is interested.
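For a stack of 100+ pages, the command-line tool can be driven from a small shell loop. What follows is only a rough sketch, not a tested workflow: the function name ocr_pages, the scans directory, and the output file name are all made up for illustration. It assumes each page has already been scanned to its own .tif file (older tesseract releases may accept only TIFF input) and that the English language data is installed.

```shell
#!/bin/sh
# Hypothetical helper (name and layout are assumptions, not part of tesseract):
# OCR every .tif file in a directory and join the text into one file.
ocr_pages() {
    scan_dir=$1
    out_file=$2
    : > "$out_file"                     # start with an empty output file
    for img in "$scan_dir"/*.tif; do
        [ -e "$img" ] || continue       # nothing matched the glob
        base=${img%.tif}
        # "tesseract IMAGE OUTPUTBASE -l LANG" writes OUTPUTBASE.txt
        if ! tesseract "$img" "$base" -l eng; then
            echo "OCR failed for $img" >&2
            continue
        fi
        cat "$base.txt" >> "$out_file"
        printf '\n' >> "$out_file"      # blank line between pages
    done
}

# Example call (paths are placeholders):
#   ocr_pages ./scans all-pages.txt
```

Pages are processed in the order the shell expands the glob, so zero-padded file names (page-001.tif, page-002.tif, ...) keep them in sequence. With 2nd or 3rd generation copies, expect to proofread the combined text by hand before publishing it.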
Thanks for the tip.
Oh, and these documents I'll be scanning will be a challenge for any OCR. Some are 2nd or 3rd generation copies, and some are quite old (originally from typewriters).
Thanks to all who replied, and any further replies.
I’ve used Paper Port for OCR on Windows for years. Now that I rarely use Windows natively, I use Paper Port on Windows in a VM. However, a friend of mine used a Linux app, GNU Ocrad, and said it suffices. If you use an Ubuntu-based distro, it and others are in the repos, available through Synaptic or Software Center.
If you d/l and try it (or find something else that works), please let us know how it works out, ok?
Thanks for asking....I need something too.
I have been working with OCR for 11 years now, started with Linux back in Russia, and your question always bugged me.
It is no secret that Windows-based software has always had the best OCR quality on the market. Linux and Mac had few choices, which means little competition, which means no aggressive push for quality. For example, one of my previous companies developed for Windows and then ported a subset to Linux. With recent cloud developments this question now has a convenient solution: a cloud-based, OS-independent OCR service.
OCR-IT OCR Cloud 2.0 gives you access to the best web-based OCR on the market today from any computer or mobile platform, as long as there is an internet connection (yeah, tough requirement!). You can sign up for a free Testing account and process your documents here: http://www.ocr-it.com/ocr-cloud-2-0-api