Automatic Text Recognition

What is Automatic Text Recognition?   

Automatic Text Recognition (ATR) transforms your piles of paper documents into a searchable, editable and analysable digital format.



What is Optical Character Recognition (OCR)?

OCR (Optical Character Recognition) is the traditional method for recognizing printed text: it analyses individual character shapes and matches them against trained fonts. It works well when characters are clearly separated, fonts are standard and image quality is good; otherwise, accuracy degrades quickly.



What is Handwritten Text Recognition (HTR)?

HTR (Handwritten Text Recognition) is designed for recognizing handwritten text (or mixed pen-written styles). Unlike OCR, HTR uses models that consider entire words or lines of text, combining optical information and linguistic context to produce more coherent transcriptions.




What about Automatic Text Recognition (ATR)?


ATR is the convergence of OCR and HTR technologies. Over the past decade, with the development of deep learning algorithms, the boundary between OCR for printed documents and HTR for handwritten documents has blurred: modern models recognise both printed and handwritten text and process whole lines or paragraphs.

Our ATR processing capabilities



Multi-line transcription


Mixed handwritten and printed text


All script types (Latin, Arabic, Cyrillic...) 


Historical and modern handwriting


Ancient languages  


Enrichment with metadata

TEKLIA's unique ATR methodology

Our advanced Automatic Text Recognition (ATR) services are powered by the latest advances in deep learning technologies. We train fully customised models designed to accurately understand and transcribe rare language scripts and complex handwriting. TEKLIA has developed two different approaches for historical document processing:

Sequential ATR method

A standard sequential approach that includes layout analysis, text line detection, and handwriting recognition using separate deep learning models. This approach is particularly effective in straightforward cases where the text follows a conventional structure and when the goal is to provide a full transcription of the documents for full text search.
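
As a rough illustration, the sequential approach can be pictured as three models chained together. The sketch below uses placeholder callables for layout analysis, line detection and recognition; it shows the shape of the pipeline, not TEKLIA's actual API.

```python
# A minimal sketch of a sequential ATR pipeline: three independently trained
# models are chained. The callables passed in are placeholders for whichever
# layout, line-detection and HTR models are used in practice.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Line:
    text: str
    confidence: float
    box: Tuple[int, int, int, int]  # x, y, width, height in page coordinates


def transcribe_page(
    page_image,
    analyse_layout: Callable,  # page image -> list of region images
    detect_lines: Callable,    # region image -> list of (line image, box)
    recognize: Callable,       # line image -> (text, confidence)
) -> List[Line]:
    """Chain layout analysis, line detection and text recognition."""
    lines: List[Line] = []
    for region in analyse_layout(page_image):
        for line_image, box in detect_lines(region):
            text, confidence = recognize(line_image)
            lines.append(Line(text, confidence, box))
    return lines
```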

Integrated ATR method

In more complex scenarios where the page layout does not conform to a standard structure, a canonical reading order for the entire page may not be feasible. In such cases, TEKLIA offers an integrated end-to-end approach using a single deep learning model. The model's learning is based on the patterns and structures present in the training data, enabling it to accurately determine the reading order of text zones even in complex and diverse layouts.

Q&A Document specificities

Text line detection in any orientation:  

Our text line detection models (Doc-UFCN, YOLO V8) are adept at detecting text lines in any orientation (0 to 360°), regardless of their rotational position on the page.
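
For illustration, the snippet below runs a YOLO-based line detector with the ultralytics package; the weights file text_lines.pt stands in for a model fine-tuned on annotated text lines and is not a published checkpoint.

```python
# A sketch of line detection with a fine-tuned YOLO model (ultralytics).
from ultralytics import YOLO

model = YOLO("text_lines.pt")  # assumed fine-tuned line-detection weights
results = model("page.jpg")    # run inference on a page image

for box, conf, cls in zip(
    results[0].boxes.xyxy,  # line bounding boxes (x1, y1, x2, y2)
    results[0].boxes.conf,  # detection confidence scores
    results[0].boxes.cls,   # predicted class, e.g. horizontal vs vertical line
):
    print(box.tolist(), float(conf), int(cls))
```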

Determine the reading direction: 

Once the text lines have been detected, the reading direction is determined by one of two methods. The first is to train a classifier specifically for this purpose, to predict whether the reading order is right-to-left or left-to-right. The second is to perform text recognition in both possible directions and select the result with the highest confidence. Note that both Doc-UFCN and YOLO can detect and classify horizontal and vertical lines.
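
A minimal sketch of the second method, assuming a recognizer that accepts a reading direction and returns a (text, confidence) pair; the interface is hypothetical.

```python
from typing import Callable, Tuple


def pick_reading_direction(
    line_image,
    recognize: Callable[..., Tuple[str, float]],  # placeholder HTR model
) -> Tuple[str, float, str]:
    """Transcribe in both directions and keep the most confident result."""
    candidates = {
        direction: recognize(line_image, direction=direction)
        for direction in ("left-to-right", "right-to-left")
    }
    best = max(candidates, key=lambda d: candidates[d][1])  # highest confidence
    text, confidence = candidates[best]
    return text, confidence, best
```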

Our models are fully capable of recognizing text written in pencil, provided they have been trained on samples that include this type of writing. Our training datasets encompass a variety of writing instruments, including pencil, to ensure the models can accurately recognize and transcribe text regardless of the writing medium used.

In some instances, image processing techniques such as contrast enhancement can be employed to improve the visibility and clarity of pencil-written text. However, these techniques are usually not necessary.
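
When such enhancement is useful, one common option is adaptive histogram equalisation, for example with OpenCV's CLAHE; the parameter values below are illustrative defaults, not tuned settings.

```python
# Optional pre-processing to make faint pencil strokes more visible.
import cv2

image = cv2.imread("pencil_page.jpg", cv2.IMREAD_GRAYSCALE)

# Equalise contrast locally so faint strokes stand out without
# blowing out the background.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(image)

cv2.imwrite("pencil_page_enhanced.jpg", enhanced)
```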

Our models can be trained to recognise multilingual documents. The training process involves exposing the models to the diverse range of languages present in the target corpus, ensuring that they can accurately transcribe text in each of these languages. We have already trained models to process corpora containing documents in different languages (e.g. Latin, German and Czech) and also documents containing different languages (e.g. a mixture of Latin and French).

In scenarios where language models are used to improve recognition accuracy, we incorporate statistical language detection as a preliminary step. This involves identifying the language of the text before applying the appropriate language model. 
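
As a sketch of that preliminary step, the snippet below applies the langdetect package to a first-pass transcription in order to select a per-language model; the model identifiers in the mapping are placeholders.

```python
from langdetect import detect

# Placeholder identifiers for per-language models.
LANGUAGE_MODELS = {"fr": "french_lm", "de": "german_lm", "cs": "czech_lm"}


def pick_language_model(first_pass_text: str, default: str = "french_lm") -> str:
    """Detect the language of a first-pass transcription, then pick a model."""
    code = detect(first_pass_text)  # e.g. "fr", "de", "cs"
    return LANGUAGE_MODELS.get(code, default)
```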

Additionally, integrated models such as the Document Attention Network (DAN) have the capability to predict the language concurrently with the text transcription, enhancing the efficiency and accuracy of processing multilingual documents. We are currently experimenting with DAN to recognise documents containing languages written in different directions (French and Arabic).

Q&A Quality Control

We perform both qualitative and quantitative evaluations to assess the quality of a model. Crucially, these evaluations are performed on samples that are representative of the target corpus and that were not used during the training phase of the model. This approach ensures that our evaluation is robust and reflects the real performance of the model.

For the quantitative aspect, we use metrics such as Character Error Rate (CER) and Word Error Rate (WER), which are computed on an annotated test sample. This provides a clear numerical indication of the performance of the model. In addition, we examine examples of the best and worst recognition results according to CER. This examination helps to identify specific areas where the model performs well and areas where improvement is needed, providing targeted insights for model refinement.
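
For reference, both metrics reduce to an edit distance; the standalone sketch below makes the definitions explicit (in practice a library such as jiwer serves the same purpose).

```python
def levenshtein(reference, hypothesis) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edit distance over reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: the same distance computed on word sequences."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)


print(cer("bonjour", "bonjoru"))  # 2 edits / 7 characters ≈ 0.286
```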

For qualitative evaluation, we analyse samples from the test set that have the highest and lowest confidence scores. This type of evaluation does not require manual transcription of the test samples. Instead, this approach focuses on understanding the performance of the model in terms of its confidence in its own output. This analysis helps to understand the nuances of the model's performance and to identify patterns or specific characteristics of the dataset that may affect the model's effectiveness.
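
A minimal sketch of that ranking step, assuming predictions are available as (page_id, text, confidence) tuples:

```python
def extremes_by_confidence(predictions, n=20):
    """Return the n least and n most confident predictions for manual review.

    `predictions` is a list of (page_id, text, confidence) tuples; no
    ground-truth transcription is needed for this step.
    """
    ranked = sorted(predictions, key=lambda item: item[2])
    return ranked[:n], ranked[-n:]
```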


All our models, including both text line detection and text recognition, are designed to produce a confidence score along with their predictions. This confidence score is a critical component as it provides an estimate of the reliability of the transcription. 

To compute the confidence scores of the text recognition models, we have integrated various methods into our models. These range from simpler techniques, such as temperature scaling, to more complex approaches, such as test-time dropout. We have also explored developing dedicated scoring models trained for this purpose. This diversity of methods allows us to handle different types of text and levels of complexity within the target corpus. Temperature scaling is implemented by default in all our HTR models. We have also published a research paper specifically devoted to this topic for text line detection (Boillet et al., Confidence Estimation for Object Detection in Document Images, 2023).
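
As an illustration of the simplest of these methods, the sketch below applies temperature scaling to per-character logits and derives a line-level confidence; the temperature is hard-coded here, whereas in practice it is fitted on a held-out set.

```python
import numpy as np


def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Softmax over character logits; T > 1 flattens, T < 1 sharpens."""
    scaled = logits / temperature
    scaled -= scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


# Per-character logits for one predicted line (characters x alphabet size).
logits = np.random.randn(12, 80)

calibrated = softmax(logits, temperature=1.8)  # T fitted on a validation set in practice
line_confidence = float(calibrated.max(axis=-1).mean())  # mean best-character probability
print(line_confidence)
```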

Arkindex for Exploration: Arkindex is our first tool designed to visualise the results of automatic transcription and compare them with the corresponding images. It allows users to view transcriptions at different levels - word, line, paragraph or page. In this interface, the relevant element in the image is highlighted and the transcription is displayed in a detailed panel. This panel includes the confidence score, the source of the transcription (algorithm, model) and a link to the execution process that produced this transcription. Arkindex is particularly well suited to exploring the results of transcription, providing an intuitive and informative interface for users to delve into the details of the transcription process and its results.

Callico for evaluation and validation: Callico is designed for evaluation or validation campaigns where a team of annotators evaluates or corrects the results of automatic transcription. This tool provides a comprehensive workflow management system for handling validation campaigns involving a large number of documents and annotators. As well as facilitating the validation process, Callico provides an evaluation of the Character Error Rate (CER) and allows all transcriptions to be exported in CSV or XLSX format for further statistical analysis. This makes Callico an invaluable tool for teams undertaking systematic and large-scale evaluation or correction of automated transcriptions.