Install Tesseract 3.0.5 in Ubuntu 16.04

I recently got involved in a project that required an OCR (Optical Character Recognition) engine to extract text from images. After a bit of research, we decided to use Google’s Tesseract.

In particular, we chose version 3.0.5 because it can save its output as a nicely formatted TSV file containing, among other things, the blocks of text found in the image and the locations of the bounding boxes from which the text is extracted.

The Ubuntu 16.04 repositories contain version 3.0.4 of Tesseract. Installing version 3.0.5 was not hard, but it required a bit of reading from various sources (the Tesseract wiki and Leptonica’s documentation) and a bit of fiddling to put files in the right system locations, which is why I decided to collect all the steps I followed in this blog post.

Preliminary Step

If you already installed tesseract-ocr using apt-get, you need to uninstall it first.

In a terminal type:
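For example, assuming the package was installed under the name tesseract-ocr as mentioned above, the removal could look like this (a sketch, not necessarily the exact command used for this post):

```shell
# Remove the apt-installed Tesseract so it does not shadow the source build
sudo apt-get remove tesseract-ocr
```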

Installation steps

Install required libraries
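A sketch of this step, assuming the build dependencies listed in the Tesseract wiki's compiling instructions for Ubuntu 16.04. Treat the exact package names as assumptions and adjust them to your system:

```shell
# Compiler and autotools needed to build Leptonica and Tesseract from source
sudo apt-get install g++ autoconf automake libtool autoconf-archive pkg-config
# Image-format libraries used by Leptonica (package names as shipped with Ubuntu 16.04)
sudo apt-get install libpng12-dev libjpeg8-dev libtiff5-dev zlib1g-dev
```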

Install leptonica-1.74.1

  • Download the source code from this link

  • Unzip the archive and cd to the folder you extracted

  • Open a shell and execute

    ./configure
    make
    sudo make install

Install tesseract-3.0.5

Download the source code from this link

Extract it in a directory and go to that directory

Open a shell and run the following:
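A typical sequence, assuming the standard autotools build described in the Tesseract wiki (the exact commands used for this post may have differed slightly):

```shell
./autogen.sh        # generates ./configure; needed for a git or GitHub source archive
./configure
make
sudo make install
sudo ldconfig       # refresh the shared-library cache so the new libraries are found
```

If ./configure complains that it cannot find Leptonica, check that the Leptonica installation above completed without errors.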

Check installation

If everything worked fine, running tesseract --version should report 3.05.00.

Install languages

Get the Tesseract trained data from the tessdata repository (these files work with version 3.0.5 of Tesseract too)

Unzip the file. You should now have a directory called tessdata-3.04.00.
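A sketch of the download and extraction, assuming the archive is GitHub's zip of the 3.04.00 tag of the tesseract-ocr/tessdata repository. The URL and the local file name are assumptions; substitute the link you actually use. The archive is large, so the download may take a while:

```shell
# Assumed URL: GitHub archive of the 3.04.00 tag of tesseract-ocr/tessdata
TESSDATA_URL="https://github.com/tesseract-ocr/tessdata/archive/3.04.00.zip"
wget -O tessdata-3.04.00.zip "$TESSDATA_URL"
unzip tessdata-3.04.00.zip      # should produce the tessdata-3.04.00 directory
```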

I had residuals from previous Tesseract installations in both /usr/share/tesseract-ocr/tessdata and /usr/local/share/tessdata.

Some guides on the internet say you have to place the tessdata in the first folder; others say the second.

The way I managed to make it work is to move all the downloaded trained data into /usr/share/tesseract-ocr/tessdata.
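As a sketch, assuming the archive was unzipped in the current directory, the move could be done like this:

```shell
# Make sure the destination exists, then move the trained data into it
sudo mkdir -p /usr/share/tesseract-ocr/tessdata
sudo mv tessdata-3.04.00/* /usr/share/tesseract-ocr/tessdata/
```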

Next I deleted the pre-existing /usr/local/share/tessdata directory and created a symbolic link in /usr/local/share pointing to /usr/share/tesseract-ocr/tessdata with the command:
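A minimal sketch of these two operations, assuming the paths named above (double-check the path before running rm with sudo):

```shell
# Remove the stale directory left by earlier installations
sudo rm -r /usr/local/share/tessdata
# ln -s TARGET LINK_NAME: /usr/local/share/tessdata now points at the populated folder
sudo ln -s /usr/share/tesseract-ocr/tessdata /usr/local/share/tessdata
# Confirm where the link points
ls -l /usr/local/share/tessdata
```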

Probably only the second location is needed, but there is no harm in making the trained data available in both.

After these steps, running tesseract on an image (for example, tesseract image.png output) works flawlessly.
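Since TSV output was the reason for picking this version, here is a sketch of how it might be requested, assuming the tessedit_create_tsv parameter available in the 3.05 series. The names image.png and output are placeholders:

```shell
# Write recognised text plus block and bounding-box data to output.tsv
tesseract image.png output -l eng -c tessedit_create_tsv=1
# Peek at the first rows (columns include block_num, left, top, width, height, conf, text)
head -n 5 output.tsv
```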
