Optical Character Recognition
During your foray into the world of document scanning, you’ve likely encountered the term “OCR” and may even know that it stands for “Optical Character Recognition”. But what exactly is OCR and how can you make the best use of this sophisticated and valuable tool?
We’re here to give you a run-down of what you need to know about Optical Character Recognition, answer any questions you might have, and recommend the best OCR software solution for your scanning project.
Table of Contents:
- What is OCR?
- OCR Output Types
- Limitations of OCR
- OCR Solutions for Business
- Levels of OCR Software
- Improving OCR Accuracy
- Desktop OCR Software Comparison
- Feature Comparison Chart
- OCR Apps and Descriptions
What is OCR?
The primary purpose of Optical Character Recognition is to quickly and automatically convert scanned or photographed document images into machine-readable text that can be searched for keywords or edited in a word processor.
In general, an OCR engine analyzes the pixel data of scanned images and searches for patterns resembling letters, numbers, and other symbols to create a digitized record of characters.
The biggest OCR engines employ huge Artificial Intelligence (AI) and Machine Learning (ML) models that have been trained on billions of documents collected over decades of development.
While the exact mechanics of this process can be complicated, OCR engines are a key automation tool for the digital age. They bridge the gap between knowledge stored on physical documents and digital data that can be edited, searched or parsed into structured data to automate data entry tasks.
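To make the idea of "searching pixel data for patterns resembling letters" more concrete, here is a deliberately tiny sketch in Python. It is a toy under stated assumptions, not how any commercial engine works: the 3x5 bitmaps in `TEMPLATES` and the helper names `similarity` and `recognize` are invented for this illustration, and real engines use far more robust models.

```python
# Toy illustration only: real OCR engines use far more robust methods.
# Each glyph is a 3x5 bitmap; '#' marks an ink pixel, '.' marks paper.
# TEMPLATES, similarity() and recognize() are hypothetical names.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Return the fraction of pixels that agree between two same-size bitmaps."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    )
    return same / total


def recognize(glyph):
    """Return the best-matching character and its similarity score."""
    best = max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))
    return best, similarity(glyph, TEMPLATES[best])


# A slightly noisy "7": one ink pixel in the top row is missing.
noisy_seven = ["##.", "..#", ".#.", ".#.", ".#."]
char, score = recognize(noisy_seven)
print(char, round(score, 2))
```

Even in this toy, the match is a score rather than a yes/no answer, which is one reason poor scans lead to recognition errors.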
OCR Output Types
Full Page OCR converts the entire document into one of the following formats:
- Plain Text – Only the text in the document is retained.
- Formatted Text – Text information is retained in consecutive paragraphs while saving font size and style.
- Exact Copy – All information on the page is retained, including graphics, and placed on the page in the manner that most closely recreates the original document.
- Spreadsheet – Documents with tables can be converted automatically to Excel, CSV and other spreadsheet formats.
- Searchable PDF File – Text information is retained on a hidden layer behind the scanned image, allowing the file’s contents to be searched while retaining the appearance of the original.
- E-Book – Convert paper books to popular e-book formats for use in digital readers.
Limitations of OCR
OCR software is limited in what it can recognize. Most OCR software is designed to recognize only machine-printed text, not handwriting. While ICR (Intelligent Character Recognition) software can recognize handwritten information, it tends to be an enterprise-level solution for forms processing work rather than full-page recognition.
Similarly, most OCR software can convert only traditional machine fonts, not cursive scripts or calligraphy. OCR engines depend on common, separated letter shapes to recognize text, so fonts that are unusual or whose letters flow together will not be recognized reliably.
OCR Solutions for Business
OCR can do a lot more than convert scanned documents to Word and PDF files. Businesses can use OCR to automate a wide variety of document workflows and data entry tasks.
Business OCR data capture solutions include OCR servers for high-volume conversions, document scanning and archiving systems, forms processing software with handprint recognition to capture surveys and applications, invoice processing for accounts payable automation, and document management systems that create secure repositories for searching, security and regulatory compliance.
Robotic Process Automation is becoming one of the most popular applications of OCR by making it possible for IT and knowledge workers to integrate OCR data capture into business workflows without having to write code or interface with APIs.
Integration services are available from our expert staff, each of whom has at least 10 years of experience implementing OCR data capture solutions for businesses.
Levels of OCR Software
OCR Software for full-text conversion comes in many different types, which vary in price range based on their features, speed, and accuracy.
For instance, you can get OCR freeware such as SimpleOCR or Tesseract that will serve in a pinch, but it will not provide acceptable accuracy unless the document images are pristine, and it may have other limitations, such as language support and the number of pages that can be processed at once.
One step up from freeware is Desktop OCR software. This is the best option if you need to convert several documents to Word or PDF and can spend $50-$200 to ensure that you get quality results with minimal need for corrections and reformatting.
If you need to convert hundreds or thousands of documents, you can invest in Batch OCR software designed for scanning and converting large volumes of documents, or Server OCR software that watches “hot” folders for incoming documents in a variety of formats and languages and converts them to Word, PDF, eBook and other formats automatically.
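The “hot” folder idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `convert_document` is a hypothetical placeholder for whatever OCR engine a server product actually calls, and the folder layout (`inbox`, `outbox`, `done`) and the `IMAGE_EXTENSIONS` set are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed set of file types to pick up; a real product defines its own list.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".pdf"}


def convert_document(path):
    """Hypothetical placeholder: a real server would call its OCR engine here."""
    return f"(recognized text of {path.name})"


def process_hot_folder(inbox, outbox, done):
    """Run one polling pass: convert each new image, then move it out of the inbox."""
    processed = []
    for path in sorted(inbox.iterdir()):
        if path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # ignore files that are not scanned images
        text = convert_document(path)
        (outbox / (path.stem + ".txt")).write_text(text, encoding="utf-8")
        shutil.move(str(path), str(done / path.name))  # prevents reprocessing
        processed.append(path.name)
    return processed


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    inbox, outbox, done = (Path(root) / name for name in ("inbox", "outbox", "done"))
    for folder in (inbox, outbox, done):
        folder.mkdir()
    (inbox / "scan_001.tif").write_bytes(b"placeholder image bytes")
    print(process_hot_folder(inbox, outbox, done))
```

A long-running watcher would simply call `process_hot_folder` in a loop with a short sleep between passes. Moving each finished file out of the inbox is one simple way to make sure nothing is converted twice.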
Improving OCR Accuracy
Although some OCR engines are better than others, no software can guarantee 100% accuracy. This is because there are other factors in play, including scan quality. Recognition software will not be able to do its work if the scanner is not properly digitizing the page.
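If you want to put a number on accuracy for your own documents, one common approach is to compare the OCR output against a hand-typed reference and count character errors. The sketch below uses the standard Levenshtein edit distance; the function names and the sample strings are our own illustration, not part of any OCR product.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(truth, ocr_output):
    """Return 1 - (character errors / reference length), floored at 0."""
    if not truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(truth, ocr_output)
    return max(0.0, 1.0 - errors / len(truth))


# Made-up example with typical confusions: I/l, l/1 and 0/O.
truth = "Invoice total: 1,250.00"
ocr = "lnvoice tota1: 1,250.O0"
print(edit_distance(truth, ocr), round(character_accuracy(truth, ocr), 3))
```

Running the same test pages through two engines, or through one engine at two scan settings, gives a like-for-like comparison on your particular documents.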
For best results, scan at a resolution of 300 dpi. Black & White (Bitonal) is preferred over Greyscale or Color modes, and although most modern scanners are fairly well configured out of the box, you may want to adjust the Brightness and Contrast settings for your particular documents.
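To see what these settings mean in practice, the short Python sketch below works out the pixel size of a US Letter page at 300 dpi, compares raw image sizes for the three color modes, and shows a simple fixed-threshold conversion to bitonal. The threshold of 128 and the function names are assumptions for illustration; scanners and OCR packages use their own, usually smarter, thresholding.

```python
def pixel_dimensions(width_in, height_in, dpi=300):
    """Pixel size of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Raw image size, ignoring row padding, headers and compression."""
    return width_px * height_px * bits_per_pixel // 8


def to_bitonal(gray_rows, threshold=128):
    """Map 0-255 grey values to 1 (ink) or 0 (paper) with a fixed threshold."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


width, height = pixel_dimensions(8.5, 11)  # US Letter at 300 dpi
print(width, height)
for mode, bits in (("bitonal", 1), ("greyscale", 8), ("color", 24)):
    print(mode, uncompressed_bytes(width, height, bits))

# A tiny 2x4 greyscale patch: dark strokes become 1, light paper becomes 0.
print(to_bitonal([[12, 40, 230, 250], [200, 35, 20, 245]]))
```

Adjusting Brightness and Contrast on the scanner effectively moves where that ink/paper cutoff falls, which is why those settings matter for faint or dark originals.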
If you do not have a scanner that has the necessary speed, quality, or other features that you require to scan your documents, you can always find a large selection of document scanners at ScanStore! ScanStore even has a handy scanners guide to help you find the perfect scanner for your specific requirements and price range.
OCR Software Guide
There are several OCR (Optical Character Recognition) software solutions available to convert scanned images to text, Word, Excel, HTML or searchable PDF. The differences between them can often be obscure, leaving many to wonder why some OCR packages cost under $100 while others cost $500 or more.
The main features that differentiate OCR software are:
- Character recognition accuracy
- Page layout reconstruction accuracy
- Support for languages
- User interface design
- Output file formats (Word, Excel, PDF, eBook, etc.)
- OCR speed and support for multi-core CPUs
- Batch processing modes
- Advanced PDF encryption or compression
- Special features for niche projects
Because of the countless combinations of document types, OCR engines, project requirements and special features, one engine may perform better than another with your particular documents. Use our handy OCR feature comparison chart to determine which OCR program best meets your requirements. You can also ask an expert for a recommendation anytime!
And The Winner Is…
Our OCR experts have tested the latest versions of FineReader, Kofax OmniPage, and ReadIRIS, and we consider ABBYY FineReader the best overall value for business users, while ReadIRIS is the best OCR software for under $100.
The key deciding factors were:
- User interface design
- Page layout reconstruction capabilities
- Extensive language support
- Engine stability when processing large files
- Availability and quality of technical support
Though other testing labs have ranked OmniPage’s overall accuracy slightly higher, we find the difference is nearly negligible. All modern OCR software has very good accuracy, so we recommend going with the one that has the special features you need, such as ReadIRIS Corporate’s CardIRIS, FineReader’s camera OCR and screenshot reader, or OmniPage Ultimate’s form data collection, auto-redaction and barcode filing capabilities.
If you would like to try them out yourself, ScanStore offers free demo downloads for ReadIRIS and FineReader. Kofax does not provide demos for its OCR products.
Businesses with many documents to process should use our SimpleIndex batch document scanning software with the FineReader OCR engine to scan and OCR large batches of documents. Barcode and OCR can also be used to sort and file documents into folders, databases or SharePoint.
OCR Data Capture
OCR can also be used to automate data entry from forms, surveys, invoices and other documents. Handwriting recognition (ICR) solutions are also available. For more information, check out these links:
- SimpleIndex batch scanning, zone OCR & barcode recognition software
- Forms processing & data capture software guides
- Invoice OCR software guide
- Compare batch scanning & forms processing solutions
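As a rough picture of what automated data entry involves once a page has been recognized, the Python sketch below pulls two fields out of OCR text with regular expressions. The patterns, the field names and the sample invoice are all made up for this illustration; commercial data capture products rely on zones, templates and validation rules rather than two hand-written patterns.

```python
import re

# Hypothetical patterns for a simple invoice layout (assumptions, not a standard).
INVOICE_NUMBER = re.compile(
    r"\bInvoice\s*(?:No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
)
# The leading \b keeps "Subtotal" from being mistaken for "Total".
TOTAL = re.compile(
    r"\bTotal\b\s*(?:Due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
)


def extract_invoice_fields(ocr_text):
    """Return whichever of the two fields can be found in the OCR text."""
    fields = {}
    match = INVOICE_NUMBER.search(ocr_text)
    if match:
        fields["invoice_number"] = match.group(1)
    match = TOTAL.search(ocr_text)
    if match:
        fields["total"] = float(match.group(1).replace(",", ""))
    return fields


# Made-up OCR output for demonstration purposes.
sample = """ACME Supply Co.
Invoice No: INV-20417
Subtotal: $1,180.00
Total Due: $1,250.00
"""
print(extract_invoice_fields(sample))
```

A field that cannot be found is simply left out of the result, so a workflow can route that document to a person for manual keying instead of guessing.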
The SimpleOCR freeware does NOT have any handprint OCR capabilities. It will not be able to recognize handwritten text. ICR (Intelligent Character Recognition) software is needed to read handwriting. There is a demo of SoftWriting included in the download, but this is not part of the freeware and is no longer supported by the vendor.
Feature | SimpleOCR Freeware | ABBYY FineReader 15 | ABBYY FineReader Corporate 15 | IRIS ReadIRIS Pro 17 | IRIS ReadIRIS Corporate 17 | Kofax OmniPage Ultimate | SimpleIndex Desktop 9 |
---|---|---|---|---|---|---|---|
Scanner Drivers Supported | TWAIN | TWAIN | TWAIN | TWAIN | TWAIN | TWAIN / ISIS | TWAIN / ISIS |
Table/Spreadsheet Recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ||
PDF Password Support | ✓ | ✓ | ✓ | ✓ | |||
Searchable PDF Output | ✓ | ✓ | ✓ | ✓ | ✓ | ||
Highly Compressed PDF Output | MRC | MRC | iHQC | iHQC | MRC | ||
Vertical Text Recognition | ✓ | ✓ | ✓ | ||||
Barcode Recognition | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Image Pre-processing | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Watched / Hot Folder | ✓ | ✓ | ✓ | ✓ | |||
Batch processing | ✓ | ✓ | ✓ | ✓ | ✓ | ||
Managed server processing | |||||||
Indexing | ✓ | ✓ | |||||
Business Card Recognition | Box version only | ||||||
Screenshot reader | ✓ | ✓ | |||||
Zone Templates | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
Proofing & Training | ✓ | ✓ | ✓ | ✓ | |||
Languages Supported | 3 | 193 | 193 | 128 | 137 | 137 | 179 |
Arabic/Farsi/Hebrew | Hebrew & Arabic Version | Hebrew & Arabic Version | ✓ | ✓ | |||
Chinese/Japanese/Korean | ✓ | ✓ | ✓ | ✓ | ✓ | ||
User Dictionaries | ✓ | ✓ | ✓ | ✓ | |||
Page Limit per Document | 1 | none | none | 50 | none | none | none |
Multi-Core Support | ✓ | ✓ | ✓ | ✓ | ✓ | ||
Installation | Desktop | Desktop | Desktop/Server | Desktop | Desktop | Desktop | Desktop |
License | Freeware | Standalone | Per Seat / Concurrent | Standalone | Standalone | Standalone 3 Licenses | Standalone |
Product Information | More Info | More Info | More Info | More Info | More Info | More Info | More Info |
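One way to use a chart like this is to turn your hard requirements into a filter. The Python sketch below does that with only two rows, "Languages Supported" and "Page Limit per Document", with values copied from the desktop chart above. The `shortlist` function and its parameters are our own illustration; a real selection would also weigh the other features and, ideally, a trial on your own documents.

```python
# Values copied from the "Languages Supported" and "Page Limit per Document"
# rows of the desktop chart; None means the chart lists no page limit.
PRODUCTS = {
    "SimpleOCR Freeware": {"languages": 3, "page_limit": 1},
    "ABBYY FineReader 15": {"languages": 193, "page_limit": None},
    "ABBYY FineReader Corporate 15": {"languages": 193, "page_limit": None},
    "IRIS ReadIRIS Pro 17": {"languages": 128, "page_limit": 50},
    "IRIS ReadIRIS Corporate 17": {"languages": 137, "page_limit": None},
    "Kofax OmniPage Ultimate": {"languages": 137, "page_limit": None},
    "SimpleIndex Desktop 9": {"languages": 179, "page_limit": None},
}


def shortlist(min_languages, pages_per_document):
    """Return products that meet both requirements (hypothetical helper)."""
    result = []
    for name, spec in PRODUCTS.items():
        if spec["languages"] < min_languages:
            continue  # not enough recognition languages
        limit = spec["page_limit"]
        if limit is not None and limit < pages_per_document:
            continue  # documents would exceed the page limit
        result.append(name)
    return result


# Example: at least 130 languages and 200-page documents.
for name in shortlist(min_languages=130, pages_per_document=200):
    print(name)
```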
Feature | ABBYY FineReader Server | IRIS Powerscan Server | PaperVision Capture OCR Server | SimpleIndex 9 |
---|---|---|---|---|
Scanner Drivers Supported | TWAIN / ISIS | N/A | TWAIN / ISIS | TWAIN / ISIS |
Table/Spreadsheet Recognition | ✓ | ✓ | ✓ | |
PDF Password Support | ✓ | ✓ | ✓ | |
Searchable PDF Output | ✓ | ✓ | ✓ | ✓ |
Highly Compressed PDF Output | MRC | iHQC | ||
Vertical Text Recognition | ✓ | ✓ | ||
Barcode Recognition | ✓ | ✓ | ✓ | ✓ |
Image Pre-processing | ✓ | ✓ | ✓ | ✓ |
Watched / Hot Folder | ✓ | ✓ | ✓ | |
Batch processing | ✓ | ✓ | ✓ | ✓ |
Managed server processing | ✓ | ✓ | ✓ | ✓ |
Indexing | ✓ | ✓ | ✓ | ✓ |
Business Card Recognition | ||||
Screenshot reader | ||||
Zone Templates | ✓ | ✓ | ||
Proofing & Training | ✓ | ✓ | ||
Languages Supported | 191 | 137 | 121 | 179 |
Arabic/Farsi/Hebrew | Hebrew & Arabic Version | 2 Add-ons | ||
Chinese/Japanese/Korean | ✓ | Asian Add-on | ||
User Dictionaries | ✓ | ✓ | ✓ | |
Page Limit per Document | none | none | none | none |
Multi-Core Support | ✓ | Multi-CPU Add-on | ✓ | |
Installation | Server | Desktop / Server | Desktop / Server | Desktop / Server |
License | Volume-based | Core-based | Named / Concurrent | Standalone |
Product Information | More Info | More Info | More Info | More Info |