Posts and articles addressing OCR accuracy and how to improve OCR results with optimal scanner settings, recognition parameters, and OCR-friendly document design.

OCR Solutions for Email Archiving and Data Retention Compliance

Email Archiving Solutions

Email archiving solutions are essential for organizations to manage, store, and retrieve email communications efficiently. These solutions ensure that emails are retained in a secure, tamper-proof format, which is crucial for compliance with various regulatory requirements. For instance, the Sarbanes-Oxley Act (SOX) mandates that businesses retain certain types of records, including emails, for up to seven years. Non-compliance can result in severe penalties, including fines and imprisonment.

Advanced email archiving solutions offer more than just backup and retention; they provide data loss prevention (DLP) capabilities. DLP policies help prevent sensitive data from leaving the corporate network, adding an extra layer of security. These solutions also generate audit trails, which are vital for proving compliance during regulatory inspections. Audit trails document all user actions and system activities, ensuring that email retention policies are enforced and that security measures are maintained.

Email Retention Compliance

Email retention compliance is a critical aspect of email management, ensuring that organizations adhere to legal and regulatory requirements. The SEC, for example, requires that certain business records, including emails, be retained for specific periods. Section 802 of the Sarbanes-Oxley Act outlines these retention requirements and the penalties for non-compliance, which can include fines and imprisonment.
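As a rough illustration of how a retention rule can be automated, the sketch below checks whether an archived email is still inside its retention window. The seven-year period is the SOX figure cited above; the function names and dates are hypothetical, and real retention periods depend on the record type and the regulation that applies.

```python
from datetime import date

RETENTION_YEARS = 7  # assumption: the SOX figure cited above; adjust per regulation


def retention_expiry(received: date, years: int = RETENTION_YEARS) -> date:
    """Return the first date on which the email is outside its retention window."""
    try:
        return received.replace(year=received.year + years)
    except ValueError:
        # Received on Feb 29 and the target year is not a leap year.
        return received.replace(year=received.year + years, month=3, day=1)


def must_retain(received: date, today: date, years: int = RETENTION_YEARS) -> bool:
    """True while the email is still inside its retention window."""
    return today < retention_expiry(received, years)


if __name__ == "__main__":
    print(must_retain(date(2020, 1, 15), today=date(2026, 6, 1)))
```

An archiving job could run a check like this before purging anything, so that deletion never happens by hand.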

To achieve compliance, organizations must implement email archiving solutions that offer fast search and retrieval capabilities. This is particularly important during litigation, where eDiscovery processes can be costly and time-consuming. Efficient email archiving solutions can reduce these costs by providing quick and accurate retrieval of relevant emails.

Best practices for creating an email archiving policy include setting clear retention rules, automating the archiving process, and ensuring that the policy is consistently enforced. Organizations should also consider the specific regulations that apply to their industry, such as HIPAA for healthcare or SEC rules for […]

Do you need AI for OCR?

AI is not needed for most document capture automation applications because many of these tasks involve structured, repetitive, and rules-based processes that can be handled with traditional software techniques. Here’s why:

1. Rule-Based Systems Handle Structured Data Well.

Many document capture tasks involve standardized forms like invoices, purchase orders, or tax forms. These documents follow consistent layouts, so information can be extracted with template matching, OCR, and regular expressions, without needing AI. For example, a system that always extracts the invoice number from the top-right corner of an invoice doesn’t need AI, just zone OCR.
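To make the rule-based approach concrete, here is a minimal sketch of the regular-expression step, assuming the OCR engine has already returned the text of the top-right zone as a plain string. The pattern and the function name are hypothetical and deliberately simplistic; a production rule would be tuned to the actual invoice layout.

```python
import re
from typing import Optional

# Hypothetical pattern: "Invoice No", "Invoice Number" or "Invoice #",
# an optional colon or dash, then an identifier of letters, digits and dashes.
INVOICE_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)",
    re.IGNORECASE,
)


def extract_invoice_number(zone_text: str) -> Optional[str]:
    """Return the invoice number found in the OCR text of a zone, or None."""
    match = INVOICE_RE.search(zone_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "ACME Corp\nInvoice No: INV-2024-0042\nDate: 01/02/2024"
    print(extract_invoice_number(sample))
```

No training data is involved: the rule either matches or it does not, which is what makes this kind of extraction fast and predictable on fixed layouts.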

2. High Structure = Low Ambiguity

When the document layout is predictable and the fields to extract are well-defined, rule-based engines can be more accurate and significantly faster than AI models for these predictable cases.

3. Cost and Complexity

AI systems, especially machine learning or deep learning ones, require large volumes of labeled training data, high computational resources, and ongoing maintenance and tuning. For many companies, such investments aren’t worth it when a simpler solution works as well or even better.

4. Document processing needs to work offline or in a strict security environment.

There can be plenty of reasons why your documents need to be processed offline or with severely limited connectivity. Most of these reasons have to do with security, and they are not to be treated lightly. When it comes to processing sensitive data (financial or medical), sending documents to an online AI service may simply be impossible. It is possible to host an AI locally, but it is rather complicated. Training a locally hosted AI is even more burdensome, while the advantages of using AI are not that significant.

5. […]

Kodak Alaris OCR

Kodak is a very well-known brand for anything related to photography, film, and imaging. It also has a long line of decent-quality scanners that many companies keep using. Surprisingly, Kodak did not have OCR software to ship with its hardware. That all changed when Kodak went bankrupt and eventually merged with Alaris, a hardware and software development company that already had the Alaris Capture Pro OCR software.

Now, Kodak Alaris offers two lines of OCR software for many document automation needs: Kodak Alaris Capture Pro and Kodak Alaris Info Input Solution, the latter being more of an enterprise solution that offers many more options for much larger document volumes.

Kodak Alaris Capture Pro is scalable and flexible to grow along with your business. Find the right edition of the software for your business, from desktop scanning to dedicated, high-volume operations.

Extract and Index with Confidence

Prevent post-scan rework with technology that automatically validates accurate capture. Intelligent Exception Processing immediately identifies missing information, such as a required signature. Intelligent Quality Control automatically flags questionable information for review.

Ensure Process Integrity

Securely capture and share information across your business. High-quality imaging delivers accurate data for your business applications. Share job setups across the organization to maintain standard indexing and routing rules. You can install Kodak Alaris Capture Pro software on local workstations and use it without an internet connection.

KODAK Alaris Info Input Solution is an intelligent document processing software that automates and simplifies the journey from document arrival to usage in business processes – quickly, accurately, and reliably. Simplify document onboarding processes, harmonize data across departments, and scale your business faster with IDP automation that makes sense.

Unique Open Intelligence™ architecture for optimization from every angle

Right out of the box,

Total Cost of Ownership of an OCR Software

What is TCO (total cost of ownership)?

The Total Cost of Ownership (TCO) is a financial estimate intended to help buyers and owners determine the direct and indirect costs of a product or service. It is a management accounting concept that can be used in full cost accounting. (Wikipedia).

TCO is a popular concept when it comes to comparing software. It allows you to estimate the cost of using the OCR software and plan your automation strategy according to your possible workload and budget over time.

There are several aspects that are too circumstantial to be able to include in general cost estimates, but there are some that we know for a fact, and we can combine them in a total cost of ownership for each OCR scenario and compare them to the possible solutions on the market.

Front-End Cost

The first and most obvious element is front-end cost. This is the price that will be presented with the product, and the marketing team will be pointing towards special offers and discounts to cut it down. It is an important factor, but it is far from everything you would need to take into account. And yet, even here, there are different options. Some of the OCR solutions will offer you the “pay and forget” option of making one payment right here, right now. However, lately, the annual cost or subscription model has become more and more popular, allowing customers to spread costs over time and OCR creators to organize a steady flow of income.
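The trade-off between a one-time payment and a subscription can be worked through with simple arithmetic. The sketch below is only an illustration: the prices are made up, the function names are hypothetical, and it looks at front-end cost alone.

```python
def cumulative_costs(one_time_price, annual_fee, years):
    """Return (year, one-time total, subscription total) for each year."""
    return [
        (year, one_time_price, annual_fee * year)
        for year in range(1, years + 1)
    ]


def first_year_subscription_costs_more(one_time_price, annual_fee, max_years=20):
    """Return the first year in which the subscription total exceeds the
    one-time price, or None if that does not happen within max_years."""
    for year, one_time, subscription in cumulative_costs(
        one_time_price, annual_fee, max_years
    ):
        if subscription > one_time:
            return year
    return None


if __name__ == "__main__":
    # Hypothetical prices: a 500 one-time license versus 150 per year.
    for row in cumulative_costs(500, 150, 5):
        print(row)
    print(first_year_subscription_costs_more(500, 150))
```

With these made-up numbers the subscription is cheaper for the first three years and more expensive from the fourth year on, so the expected lifetime of the deployment matters as much as the sticker price.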

Cost of Support and Maintenance

The second element of the total cost of ownership would be the cost of support and maintenance. Theoretically, it is possible to avoid […]

OCR Guide

What is OCR?

OCR stands for Optical Character Recognition and is the technology that allows software to interpret text on scanned images. When this technology is applied to automating business data entry processes it’s referred to as OCR Data Capture.

Many are familiar with popular desktop OCR applications designed to convert scanned images to editable documents. When this process is applied to specific areas of the document containing data fields it’s called zone OCR. But OCR data capture software is more than just simple zone OCR. Modern applications use some or all of these technologies:

Using OCR software enables enterprises to reduce document processing time by as much as 80%.

Benefits of using OCR

If not for the trees, then do it for the savings on paper, toner, copiers and their service contracts, etc.

How much time is wasted searching for paper files? Digital documents can be searched and viewed instantly from anywhere.

Paper is much harder to back up and restore than digital data.

Office square footage and off-site records storage add to the cost of keeping paper documents.

Government mandates for records retention […]

Grooper Document Processing

Grooper was built from the ground up by BIS, a company with 35 years of continuous experience developing and delivering new technology. Grooper is an intelligent document processing and digital data integration solution that empowers organizations to extract meaningful information from paper/electronic documents and other forms of unstructured data.

The platform combines patented and sophisticated image processing, capture technology, machine learning, natural language processing, and optical character recognition to enrich and embed human comprehension into data. By tackling tough challenges that other systems cannot resolve, Grooper has become the foundation for many industry-first solutions in healthcare, financial services, oil and gas, education, and government.

  • Single platform
  • Patented OCR
  • Image processing
  • Machine learning
  • Natural language processing
  • Zero code
  • Zero templates
  • Open architecture

Amazon Textract API

Automatically extract handwriting, plain text or form data from any document using the world’s largest OCR machine learning model based on billions of sample documents.

Amazon Textract is a cloud OCR service that automatically detects and extracts text and data from scanned documents and PDF files. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

Amazon Textract API also lets you implement OCR in your RPA workflows. UiPath and other bots offer connectors that let you include Textract OCR into your RPA process.

Textract is not a “ready-to-use” product. It requires programming skills, experience with AWS systems and a decent amount of coding to implement it into your systems, especially once you add user interfaces for scanning and data validation.
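To give a sense of the coding involved, here is a minimal sketch of a Textract call from Python. It assumes the third-party boto3 AWS SDK is installed and AWS credentials are already configured; the region and the function names are placeholders. It covers only plain text detection on a single-page image and leaves out forms, tables, scanning, and validation.

```python
from typing import List


def lines_from_textract_response(response: dict) -> List[str]:
    """Collect the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def ocr_image_with_textract(image_path: str, region_name: str = "us-east-1") -> List[str]:
    """Send one image to Textract and return the recognized lines."""
    # Imported here so the parser above can be used without the AWS SDK.
    import boto3

    client = boto3.client("textract", region_name=region_name)
    with open(image_path, "rb") as image_file:
        response = client.detect_document_text(
            Document={"Bytes": image_file.read()}
        )
    return lines_from_textract_response(response)
```

Even this small example needs an AWS account, credentials, and error handling before it is production-ready, which is the gap a packaged application fills.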

Simple Software developers have the necessary skills and experience to integrate Textract into your custom applications. Contact us or click the Request a Quote button to get a proposal for your custom application development project.

Simple Software also offers the ready-to-use SimpleIndex application that incorporates Textract into a fully-featured scanning, indexing and document processing application.

You can learn more about integrating Amazon Textract into SimpleIndex here.

Google Cloud Vision API

Vision API offers powerful pre-trained machine learning models through REST and RPC APIs. Assign labels to images and quickly classify them into millions of predefined categories. Detect objects and faces, read printed and handwritten text, and build valuable metadata into your image catalog.

Automatically extract handwriting, plain text or form data from any document using a huge machine learning model based on billions of sample documents.

Google Vision is a cloud OCR service that automatically detects and extracts text and data from scanned documents and PDF files. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

Google Vision API also lets you implement OCR in your RPA workflows. UiPath and other bots offer connectors that let you include Vision OCR into your RPA process.

Google Vision is not a “ready-to-use” product. It requires programming skills, experience with Google cloud services, and a decent amount of coding to implement it into your systems, especially once you add user interfaces for scanning and data validation.

Simple Software developers have the necessary skills and experience to integrate Google Vision into your custom applications. Contact us or click the Request a Quote button to get a proposal for your custom application development project.

ABBYY Vantage

ABBYY Vantage leverages AI machine learning and a huge library of document “skills” to provide out-of-the-box data capture for all kinds of documents.

Vantage provides a simple way to implement new data capture processes without the need for programmers.

It takes the FlexiCapture platform, hosts it in the cloud, and dramatically simplifies the interface. The thousands of settings you can use with FlexiCapture to build templates are managed by the AI, giving you a simple point and click interface to create new document capture workflows.

The “Skills” library gives you pre-configured capture workflows for hundreds of the most common documents. Simply connect them to your import and export destinations and you are ready to go, saving you hours or even days of development time.

Why are the prices of OCR applications so different?

OCR software ranges in price from freeware all the way up to tens of thousands of dollars. What explains the difference between these applications? Here’s the breakdown:

  • OCR Freeware uses the SimpleOCR or Tesseract engines and provides limited scanning and output format capabilities. Recognition quality is generally poor except for the highest quality document images.
  • PDF OCR Converters provide good quality OCR engines like ABBYY, IRIS and OmniPage, but limit the output to searchable PDF files. These cost less than $100.
  • Standard OCR applications range from $100-$200 and provide full OCR capabilities including converting scans to Word, Excel, HTML and other editable formats.
  • Corporate OCR applications add advanced features like automated hotfolder processing, concurrent licensing and other features useful for business applications. Pricing for these is $200-$500.
  • OCR Servers provide scalable, enterprise OCR services for processing very high volumes of documents or providing OCR capabilities to users throughout the organization. Prices start around $1,500 and go up based on processing volume.
  • Enterprise Data Capture and Forms Processing applications are used to capture structured data from complex documents like healthcare claim forms and invoices that include things like tables, handwriting, checkboxes, and movable zones. These solutions can cost anywhere from around $1,000 to hundreds of thousands of dollars depending on the document volume and complexity of the project.

Creating forms optimized for handprint recognition

Handprint recognition applications can provide dramatically different results in terms of accuracy depending on whether the form is designed with intelligent character recognition (ICR) in mind.

Forms Processing applications like ABBYY FlexiCapture have a built-in form design tool with ICR-optimized field layout elements and rules that validate whether your form uses best practices for recognition. These forms can be automatically converted to recognition templates for scanning for data capture. This saves you dozens of hours of trial and error during the design process and even more in data entry once the filled in forms are collected.

Best practice recommendations for ICR and OCR forms include:

  • Plenty of space between form elements and labels, at least 0.5 cm (about 0.2 in)
  • Use drop out colors for form backgrounds when possible
  • Hand printed characters should be constrained with boxes or combs to force filler to write legible, separated, printed characters
  • Use check boxes instead of handprint when possible since these are nearly 100% accurate
  • Use numeric codes instead of alphanumeric text when possible to reduce the number of possible characters and increase accuracy
  • Use validation rules to check against possible values and flag data with incorrect values
  • Check box fields can be used to verify the presence of signatures
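The validation-rule recommendations above can be sketched in a few lines. This is an illustration only: the field names, the allowed-values list, and the five-digit rule are hypothetical, and a forms processing application would normally let you configure such rules rather than code them.

```python
import re

ALLOWED_STATES = {"TX", "OK", "LA"}  # hypothetical list of valid values


def validate_form(fields: dict) -> list:
    """Return (field, problem) pairs that should be flagged for review."""
    problems = []

    # Numeric code: exactly five digits, so a misread letter is caught.
    if not re.fullmatch(r"\d{5}", fields.get("zip_code", "")):
        problems.append(("zip_code", "expected 5 digits"))

    # Check the recognized value against the list of possible values.
    if fields.get("state", "").upper() not in ALLOWED_STATES:
        problems.append(("state", "not in allowed list"))

    # Check box style field that records whether a signature is present.
    if not fields.get("signature_present", False):
        problems.append(("signature_present", "signature missing"))

    return problems


if __name__ == "__main__":
    # The letter O was recognized instead of the digit 0.
    print(validate_form({"zip_code": "73O19", "state": "ok", "signature_present": True}))
```

Only the flagged fields need an operator's attention, which is where most of the data entry savings come from.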

Can OCR be trained for specific fonts?

OCR training was once a critical part of the conversion process. After a document was read, the operator would review the results to correct mistaken characters and these corrections would be used to train the engine so the next time you read a similar document the results are improved.

Modern OCR applications no longer rely on user training for accuracy unless you have very non-standard fonts. These engines have had decades of development and billions of samples used to train their algorithms. In most cases, the introduction of user training will only diminish the results for any documents that differ from the ones used for training.

The training functions still exist for these edge cases, but they are no longer an integral part of the OCR process.

Training in modern OCR is more likely to refer to enterprise data capture applications that use AI-based learning algorithms to find the locations of data points on documents with various different formats, such as invoices.

What are the best scanner settings for OCR?

Most OCR applications are optimized for 300 dots per inch resolution images.

While color is supported and most often performs better than black & white images, OCR algorithms will generally convert the color to B&W automatically as part of the OCR process. With color input, the dynamic conversion usually produces the best result, but not always.

Especially when an image contains stray markings, stamps, notes, colored paper or other elements that can throw off the binarization process, OCR results can be improved by paying careful attention to image processing settings and using a pristine black & white image for OCR instead of a color scan.
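For readers curious what binarization involves, the sketch below implements Otsu's method, one widely used way to pick a global black & white threshold. It is shown purely as an illustration on a flat list of grayscale values from 0 to 255. Commercial OCR engines use their own, often adaptive, conversions, so this is not a description of any particular product.

```python
def otsu_threshold(pixels):
    """Return the gray level that best separates dark from light pixels."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: larger means a cleaner separation.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels at or below the threshold to black (0), the rest to white (255)."""
    return [0 if value <= threshold else 255 for value in pixels]


if __name__ == "__main__":
    scan = [20, 25, 30, 200, 210, 220]  # dark ink and light paper
    print(binarize(scan, otsu_threshold(scan)))
```

A stamp or colored background adds a third cluster of gray levels, which is exactly the situation where a single automatic threshold struggles and manual image processing settings pay off.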

In forms processing and handprint recognition applications, guide marks in the form can often be removed during the scanning process, improving the OCR results when the software doesn’t have to distinguish between the form background and the words being recognized.

Using drop-out forms, traditionally printed in red or green and then scanned with a corresponding red or green light, automatically removes the form background during scanning and leaves only the text to be recognized. This can dramatically improve recognition results, especially for handprinted data.

Older, black & white scanners would require you to change out the lamps in order to perform color drop-out. All but the least expensive modern color scanners have the ability to enable drop-out colors in the scanner driver.

Advanced forms processing applications can perform color drop-out on-the-fly with scanned color images. Though this is generally not quite as accurate as scanning with a drop-out lamp enabled, it has the advantage of retaining a full-color original copy of the image with the form element and labels visible.
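One simple way to approximate drop-out in software is to look only at the color channel that matches the form color: red form lines are bright in the red channel, just like white paper, while dark ink stays dark. The sketch below assumes pixels are given as (red, green, blue) tuples and uses a fixed, hypothetical threshold. Real applications use more refined color filtering.

```python
CHANNEL_INDEX = {"red": 0, "green": 1, "blue": 2}


def drop_out(rgb_pixels, color="red", threshold=128):
    """Return a black & white image (0 or 255 per pixel) in which anything
    bright in the chosen channel, including the form color, becomes white."""
    channel = CHANNEL_INDEX[color]
    return [0 if pixel[channel] < threshold else 255 for pixel in rgb_pixels]


if __name__ == "__main__":
    pixels = [
        (255, 255, 255),  # white paper
        (220, 40, 40),    # red form line
        (20, 20, 20),     # black handprint
        (30, 30, 200),    # blue ink
    ]
    print(drop_out(pixels, "red"))
```

The limitation is the same as with a drop-out lamp: anything written in the form's own color disappears along with the form.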

Tungsten Automation (formerly Kofax) OCR

Tungsten Automation (formerly Kofax) already had a large variety of products for business automation, like Tungsten Capture for high-volume document scanning and data capture, or Tungsten VRS Elite to deal with less than perfect images and to capture even the toughest-to-recognize documents.

More recently, Tungsten Automation (formerly Kofax) acquired Nuance’s Document Imaging Division and thus created one of the most powerful families of products for business automation. OmniPage Ultimate and OmniPage Standard are good, versatile OCR packages for small and mid-sized businesses. There is also an OmniPage Server option for much larger document volumes.

Tungsten OmniPage converts paper, PDF files and forms into documents you can share, edit on your PC, listen to with natural speech, or archive in a document repository. Amazing accuracy, support for virtually any scanner, the best tools to customize your process, and automatic document routing make it the perfect choice to maximize productivity. Improved OCR engines deliver amazing accuracy for document conversion and archiving business critical documents.

Tungsten OmniPage Server is a cost-effective and reliable solution for business process owners to easily deploy a highly scalable, always-available OCR server for large-volume document processing.

Tungsten Power PDF is the smart replacement for Adobe Acrobat for maximum savings without compromise. Power PDF allows you to make changes to PDF files with the fluidity, flexibility and interactivity of real word processing. In addition you can share, edit and discuss document changes using text or voice chat in real-time with multiple people. Plus you can have anywhere, anytime access to your documents using popular Cloud […]

Handprint Recognition Guide

What is ICR, Handprint Recognition?

ICR stands for Intelligent Character Recognition and is the technology that allows software to interpret hand printed text on scanned images.

Forms Processing Software uses ICR technology to automate data entry tasks involving hand-filled surveys, applications and forms. It provides interfaces for scanning, recognition, data verification and export, as well as management and monitoring tools to track large volumes of documents and data through the workflow.

Forms Processing also includes OCR (Optical Character Recognition) technology to recognize machine printed text, and OMR (Optical Mark Recognition) for check boxes and multiple choice bubbles.

Traditional forms processing relies on constrained handwriting, where boxes on the form force the filler to write with separated, printed block characters. Modern AI technology has dramatically improved the ability to recognize unconstrained handwriting and cursive script. Hand printed notes, free-form comment blocks, non-segmented fields, historic documents, and more can now be converted to text with acceptable accuracy where these were impossible just a few years ago.

Who can benefit from handwritten recognition software?

Any organization that collects data on paper-based forms, surveys or applications on a regular basis can get a very high return on investment by automating the data entry with forms processing software.

You do need to have a significant number of forms to justify the expense, at least a hundred forms per month or more depending on how much data is being captured. If the data entry task can be done in under 25 working hours then it is probably not a good candidate for automation with ICR software.
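A quick back-of-the-envelope estimate shows where a project falls relative to the 25-hour guideline above. The keying speed and form sizes below are made-up numbers and the function names are hypothetical; substitute your own measurements.

```python
AUTOMATION_THRESHOLD_HOURS = 25  # guideline from the text above


def manual_entry_hours(forms, fields_per_form, seconds_per_field):
    """Estimate the working hours needed to key the data by hand."""
    return forms * fields_per_form * seconds_per_field / 3600


def is_automation_candidate(forms, fields_per_form, seconds_per_field):
    """True when manual entry would take at least the threshold number of hours."""
    hours = manual_entry_hours(forms, fields_per_form, seconds_per_field)
    return hours >= AUTOMATION_THRESHOLD_HOURS


if __name__ == "__main__":
    # Hypothetical workloads: 20 fields per form, 6 seconds to key each field.
    print(manual_entry_hours(1000, 20, 6), is_automation_candidate(1000, 20, 6))
    print(manual_entry_hours(100, 20, 6), is_automation_candidate(100, 20, 6))
```

With these assumed figures, 1,000 forms come to roughly 33 hours of keying and clear the threshold, while 100 forms take a little over 3 hours and do not.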

Organizations that have many separate departments that collect data on forms can share the budget for forms processing software by re-using it for other projects. Your current project […]

IRIS OCR

IRIS is a Belgian company that is the developer of one of the world’s top OCR engines. While more popular in European markets, their OCR and data capture solutions offer great performance and features for the price.

IRIS offers very competitive pricing compared to OCR alternatives. Plus, for the month of March, SimpleOCR offers a great 50% discount on the new version of IRIS ReadIRIS PDF 22!

ReadIRIS will allow you to convert any paper document, image or PDF into editable and searchable digital files (Word, Excel, PDF, HTML, etc.) using Optical Character Recognition (OCR) technology. Simply scan your paper document using the built-in scanning wizard or import images from folders or a digital camera. ReadIRIS will instantly convert it to the format of your choice without altering the original layout. Your digital documents will now be easy to edit, archive and share!

IRISmart File is intelligent software for semi-automatic naming and classification of electronic and paper documents. Ideal for freelancers, microbusinesses and SMEs, IRISmart will help you carry out long, slow everyday administrative tasks quicker than ever before. Anyone who wants to file a large number of paper or electronic files and invoices into ordered folders quickly and efficiently will find this intelligent software to be a major ally.

IRIS Powerscan is a full-featured document scanning and data capture application designed for high-volume document processing. Please contact us for a quote or demo of IRIS Powerscan.

IRIS IRISXtract for Documents is THE software system for intelligent, automated document processing, for all types of documents. The product line of the IRISXtract system is designed to handle ALL your data capture needs, from the inbound mail, whether hardcopy paper or electronic, through […]
