Optical Character Recognition (OCR) is a technology that enables the conversion of various types of documents, such as scanned paper documents, PDF files, or images taken by a camera, into editable and searchable text. One of the most popular and freely available OCR libraries is Tesseract. In this article, we will focus on the configuration and usage of Tesseract OCR on the CentOS operating system for converting scanned documents and images to text.
Installation of Tesseract OCR
To install Tesseract OCR on CentOS, you first need to add the EPEL (Extra Packages for Enterprise Linux) repository, as Tesseract is not available in the CentOS base repositories. This can be done using the following command:
sudo yum install epel-release
After adding the EPEL repository, you can install Tesseract by running the following command:
sudo yum install tesseract
Tesseract supports many languages, so if you need to recognize text in a language other than English, you should also install the corresponding language packages. For example, to install the Czech language package, use:
sudo yum install tesseract-langpack-ces
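After installation it can be useful to confirm that the language data was actually picked up. The following is a minimal sketch: the function name has_tesseract_lang is a hypothetical helper, not part of Tesseract, and it relies only on the tesseract --list-langs option, which prints the installed language codes one per line.

```shell
# has_tesseract_lang is a hypothetical helper (not part of Tesseract).
# It succeeds if the given language code is listed by `tesseract --list-langs`.
has_tesseract_lang() {
    # The listing starts with a header line followed by one code per line.
    # Depending on the version it may go to stderr, so both streams are read.
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}
```

For example, has_tesseract_lang ces && echo "Czech data found" prints the message only when the Czech pack is installed. Running tesseract --version beforehand shows which release the repository provided.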
Configuration of Tesseract OCR
After installing Tesseract, no special configuration is needed. Tesseract is ready to use with default settings, which work well for many scenarios. The Tesseract command-line interface offers options for selecting the recognition language, the page segmentation mode, and the output format.
Usage of Tesseract OCR
To convert an image to text using Tesseract, use the following syntax in the command line:
tesseract [input_file] [output_base] -l [lang_code]
- [input_file] is the path to the image you want to convert.
- [output_base] is the base name of the output file (without the extension). Tesseract automatically adds the .txt extension to the file name.
- -l [lang_code] specifies the language of the text in the image. For example, for Czech, use -l ces.
An example command to convert the image document.png to text in Czech:
tesseract document.png output -l ces
Running this command will generate a text file output.txt containing the recognized text from the image.
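When a whole folder of scans needs converting, the same command can be wrapped in a loop. This is only a sketch under stated assumptions: ocr_dir is a hypothetical function name, the images are assumed to be PNG files, and each text file is written next to its image.

```shell
# ocr_dir is a hypothetical helper: it runs Tesseract on every PNG file in a
# directory and writes <name>.txt next to each image.
ocr_dir() {
    local dir="$1" lang="${2:-eng}" img
    for img in "$dir"/*.png; do
        # If no PNG files exist, the pattern stays literal; skip it.
        [ -e "$img" ] || continue
        # Strip the .png suffix to form the output base; Tesseract adds .txt.
        tesseract "$img" "${img%.png}" -l "$lang"
    done
}
```

Calling ocr_dir scans ces would process every PNG in the scans directory using the Czech language data. The language defaults to eng when the second argument is omitted.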
Advanced Options
Tesseract offers a range of advanced options for improving the quality of text recognition or for working with PDF files. For example, you can use the --dpi option to explicitly specify the resolution of the scanned document, which can help if automatic detection fails.
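To create a PDF file with searchable text, append the pdf output configuration to the end of the command. (The -c textonly_pdf=1 option is only needed if you want a PDF that contains just the invisible text layer without the page image.)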
tesseract [input_file] [output_base] -l [lang_code] pdf
This generates a PDF file in which the page keeps its original visual appearance, while the recognized text can be searched and copied.
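The PDF output and the resolution hint can be combined in one call. The sketch below uses a hypothetical function name, make_searchable_pdf, and assumes a Tesseract release that accepts --dpi (version 4 or later); the option is passed only when a value is supplied.

```shell
# make_searchable_pdf is a hypothetical helper that writes <base>.pdf.
# Arguments: input image, output base, language (default eng), optional DPI.
make_searchable_pdf() {
    local input="$1" base="$2" lang="${3:-eng}" dpi="${4:-}"
    # Rebuild the argument list for the tesseract call.
    set -- "$input" "$base" -l "$lang"
    if [ -n "$dpi" ]; then
        # Assumption: --dpi is supported by the installed Tesseract version.
        set -- "$@" --dpi "$dpi"
    fi
    # The trailing "pdf" selects the PDF output configuration.
    tesseract "$@" pdf
}
```

For instance, make_searchable_pdf scan.png report ces 300 would ask Tesseract to treat scan.png as a 300 DPI image and write report.pdf.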
Optimization for Better Results
When using Tesseract OCR, you may encounter various challenges, such as recognizing text on images with low quality or in documents with unusual layouts. To improve recognition results, you can:
- Use image preprocessing tools such as ImageMagick or OpenCV to improve the quality of images before processing them with Tesseract. This may include contrast adjustments, noise removal, or image binarization.
- Experiment with different Tesseract settings, such as finer configuration options using configuration files.
- Use the Page Segmentation Mode (--psm) and OCR Engine Mode (--oem) options to improve recognition depending on the type of document.
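The tips above can be sketched as a small preprocessing pipeline. This is an illustration rather than a recommended recipe: preprocess_and_ocr is a hypothetical function name, the ImageMagick convert command is assumed to be installed, the 50% threshold and the --psm 6 and --oem 1 values are example choices that need tuning per document, and the flag spellings assume Tesseract 4 or later.

```shell
# preprocess_and_ocr is a hypothetical helper: it cleans an image with
# ImageMagick and then runs Tesseract on the cleaned copy.
preprocess_and_ocr() {
    local input="$1" base="$2" lang="${3:-eng}"
    local cleaned="${base}_clean.png"
    # Convert to grayscale, stretch the contrast, then binarize at 50%.
    convert "$input" -colorspace Gray -normalize -threshold 50% "$cleaned" || return 1
    # --psm 6 assumes a single uniform block of text;
    # --oem 1 selects the LSTM engine only.
    tesseract "$cleaned" "$base" -l "$lang" --psm 6 --oem 1
}
```

Calling preprocess_and_ocr scan.png result ces would leave the cleaned image in result_clean.png and the recognized text in result.txt. Comparing the output with and without the preprocessing step is a quick way to see whether it helps for a given kind of scan.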
Tesseract OCR is a powerful tool for converting scanned documents and images to text, available for free and easily installable and usable on CentOS. With its wide language support and advanced processing options, Tesseract can be a useful tool for automating document processing in various applications, from digital archiving to text recognition for data analysis. To achieve the best results, it is important to perform image preprocessing and optimize OCR settings according to the specific needs of your projects.