Indian Languages OCR Applications

India is home to many languages (Hindi, Tamil, Telugu, Gujarati, Marathi, Urdu, Sanskrit, and many others) and to many scripts for writing them (Devanagari (Nagari), Bengali, Tamil, Perso-Arabic), each with regional variations. More than a billion people speak these languages, and the volume of documents produced in them is enormous. But the difficulty of recognizing loosely standardized text, together with the ease of falling back on English, has created an unfortunate situation: there is no good document management OCR for Indian languages.

Several OCR tools already exist. Tesseract OCR deserves the first mention. Tesseract is an open source OCR tool that works with Bengali, Hindi, Marathi, Nepali, Sanskrit, Tamil, Telugu and Urdu. However, it is better seen as a good base for your automation project: turning its rather raw OCR capabilities into a document management system for your company would take a lot of work.
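To give a sense of what "raw OCR capabilities" means in practice, here is a minimal sketch of calling Tesseract from Python. It assumes the Tesseract binary, the trained data files for the languages you need, and the pytesseract and Pillow packages are installed. The helper names (build_lang_argument, recognize) and the file name in the usage note are hypothetical and chosen only for illustration.

```python
# Tesseract language codes for the languages listed above.
TESSERACT_CODES = {
    "Bengali": "ben",
    "Hindi": "hin",
    "Marathi": "mar",
    "Nepali": "nep",
    "Sanskrit": "san",
    "Tamil": "tam",
    "Telugu": "tel",
    "Urdu": "urd",
}


def build_lang_argument(*languages, include_english=False):
    """Join language names into the 'hin+eng' style string Tesseract accepts."""
    codes = [TESSERACT_CODES[name] for name in languages]
    if include_english:
        # Useful for documents that mix an Indian language with English.
        codes.append("eng")
    return "+".join(codes)


def recognize(image_path, *languages, include_english=False):
    """Run Tesseract on one scanned page and return the recognized text."""
    # Imported lazily so the helper above works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    lang = build_lang_argument(*languages, include_english=include_english)
    with Image.open(image_path) as page:
        return pytesseract.image_to_string(page, lang=lang)
```

A call such as recognize("scan.png", "Hindi", include_english=True) returns plain text and nothing more. Batch processing, indexing, validation and routing into a document flow are the parts you would still have to build yourself.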

OCRConverter offers an online option for Hindi (Devanagari script). It is free and easy to use, but it lacks an automation component. Integrating it into your document flow would take a lot of manual labour, but at least it works.

Our German colleagues have created SanskritReader for Sanskrit. It is a very decent project, but it was created primarily for academic purposes and has no automation features.

There are several other online options that you can easily find, but most of them look abandoned, and we could not really test them.

The major document automation OCR families, such as ABBYY, IRIS and Kofax, do not support Indian languages and have not expressed any desire to include them in […]


When you scan a document that has text or numeric data on it, you are able to read and understand what is written in the scanned image. To a computer, however, the resulting image file is just as meaningless an assortment of pixels as a landscape photo. To transform this information into an editable format that you can search through, copy, and modify without retyping it manually, you will need Optical Character Recognition (OCR) software.
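The "assortment of pixels" point can be made concrete with a toy example. The sketch below is a made-up 5x5 black-and-white "scan" of the letter T, not the output of any real scanner. The computer can draw the grid or count the ink pixels, but nothing in the data says "T" until OCR software interprets it.

```python
# A hypothetical 5x5 scan of the letter T: 1 is ink, 0 is blank paper.
SCAN = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def render(bitmap):
    """Draw the bitmap as text so a human can see the shape."""
    return "\n".join(
        "".join("#" if pixel else "." for pixel in row) for row in bitmap
    )


def ink_ratio(bitmap):
    """Return the fraction of pixels that are ink."""
    total = sum(len(row) for row in bitmap)
    return sum(sum(row) for row in bitmap) / total


print(render(SCAN))
print(ink_ratio(SCAN))
```

Everything the program knows here is numeric: which cells are dark and how many there are. Mapping that pattern to the character "T" is exactly the job of OCR.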

There is a wide variety of OCR software available. While all of these programs can convert images of machine-printed (not handwritten) text or numbers into an editable format, they often differ in features, accuracy, price, and language options.

Our OCR Software Guide and Comparison Chart explain the differences between the available programs and offer our recommendation for the best overall software for converting English documents. However, the programs also differ in the number and selection of languages they can convert. Below, you will find a list of languages that our top three choices in Desktop OCR software are able to convert, with the languages that have dictionary support marked in italics.

Some language groups are more recent additions to the OCR scene. Among these are right-to-left scripts, such as Arabic and Hebrew, and Asian characters, such as Chinese. While not all software supports them out of the box, they are slowly being integrated, first as add-ons to the base software and eventually as part of the default language selection.

SimpleSoftware OCR engines use two different systems for language support. Ultimately, the languages supported by your OCR depend on the basic version of SimpleIndex you have installed, any add-ons (SimpleIndex Server, SimpleCoversheet, and so on) […]

