I recently got involved in a project that required OCR (Optical Character Recognition) to extract text from images. After a bit of research, we decided to use Google’s Tesseract.
In particular, we chose version 3.0.5 because it can save its output as a nicely formatted TSV file containing, among other things, information on the blocks of text found in the image and the locations of the bounding boxes from which the text is extracted.
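To give an idea of why this output is convenient, here is a minimal Python sketch that groups recognized words by block and keeps each block's bounding box. It assumes the usual column layout of Tesseract's TSV output (`level`, `page_num`, `block_num`, ..., `left`, `top`, `width`, `height`, `conf`, `text`), where level 2 rows describe blocks and level 5 rows describe words. The sample rows and the helper name `blocks_from_tsv` are made up for illustration and are not real OCR results.

```python
import csv
import io

# Hypothetical sample in the TSV column layout described above; all values are invented.
SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t",
    "2\t1\t1\t0\t0\t0\t36\t92\t200\t30\t-1\t",
    "5\t1\t1\t1\t1\t1\t36\t92\t90\t30\t95\tHello",
    "5\t1\t1\t1\t1\t2\t140\t92\t96\t30\t93\tworld",
    "2\t1\t2\t0\t0\t0\t36\t200\t120\t28\t-1\t",
    "5\t1\t2\t1\t1\t1\t36\t200\t120\t28\t90\tTesseract",
])


def blocks_from_tsv(tsv_text):
    """Return {(page_num, block_num): {"box": (left, top, width, height), "words": [...]}}."""
    # QUOTE_NONE keeps quote characters in the recognized text from confusing the parser.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    blocks = {}
    for row in reader:
        level = int(row["level"])
        key = (int(row["page_num"]), int(row["block_num"]))
        if level == 2:
            # Block-level row: carries the bounding box of the whole block.
            box = tuple(int(row[name]) for name in ("left", "top", "width", "height"))
            blocks.setdefault(key, {"box": None, "words": []})["box"] = box
        elif level == 5 and (row["text"] or "").strip():
            # Word-level row: carries the recognized text.
            blocks.setdefault(key, {"box": None, "words": []})["words"].append(row["text"])
    return blocks


if __name__ == "__main__":
    for key, block in sorted(blocks_from_tsv(SAMPLE_TSV).items()):
        print(key, block["box"], " ".join(block["words"]))
```

With a real file, the same function could be fed the contents of the `.tsv` file that Tesseract writes, read as a string.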
The Ubuntu 16.04 repositories contain version 3.0.4 of Tesseract. Installing version 3.0.5 was not hard, but it required some reading from various sources (the Tesseract wiki and Leptonica’s documentation) and some fiddling to put files in the right system locations, which is why I decided to collect all the steps I followed in this blog post.