Tesseract OCR tips — custom dictionary to improve OCR
Here is my first post on OCR using Tesseract. Tesseract is a very popular open-source OCR engine, developed with Google's sponsorship, which achieves high accuracy and supports more than 100 languages.
Tesseract supports dictionaries. You can give Tesseract a list of your own words so that it is more likely to recognize them.
You can either extend the standard dictionary of a language model with your own words or rebuild the model so that your words completely replace the standard dictionary. Whichever way you choose, you should know that Tesseract stores each category of dictionary words in a special DAWG (directed acyclic word graph) file. For instance, the <lng>.word-dawg file holds the main dictionary words and <lng>.freq-dawg holds the most frequent words. More information can be found here.
This tutorial works with both Tesseract 3.05 and 4.0.0; I didn't try it on other versions. The only difference is that Tesseract 4 uses an LSTM model, so its dictionary dawg files are named lstm-<type>-dawg (just <type>-dawg in v3.05), e.g. lstm-word-dawg vs word-dawg, and the unicharset file is named lstm-unicharset (unicharset in the older version).
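To make the naming rule concrete, here is a tiny sketch. The helper component_name is my own illustration, not part of Tesseract, and it only encodes the prefix rule described above for the components used in this post:

```shell
# Hypothetical helper (not a Tesseract tool): build the component file name
# for a language, a major version (3 or 4) and a component type.
component_name() {
    case "$2" in
        4) printf '%s.lstm-%s\n' "$1" "$3" ;;  # v4 LSTM components get the lstm- prefix
        *) printf '%s.%s\n' "$1" "$3" ;;       # v3.05 uses the bare type name
    esac
}

component_name eng 3 word-dawg     # eng.word-dawg
component_name eng 4 word-dawg     # eng.lstm-word-dawg
component_name eng 4 unicharset    # eng.lstm-unicharset
```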
If you haven't done so yet, install Tesseract OCR. In this tutorial we will use Ubuntu (I tested it on Ubuntu 18.04) and Tesseract v4. Simply install Tesseract from the apt packages:
sudo apt update && sudo apt install tesseract-ocr
All the required training tools will be installed with this command.
First, augment the model with user words. Let's say we want to augment the English language model with our own words. Create a plain text file called wordlistfile containing the words you want to add to the dictionary, one per line. Then go to the tessdata directory. You will need root rights to operate in the system tessdata directory:
sudo su
Then unpack eng.traineddata into its component files in a traineddat_backup directory (create the directory first if it does not exist; you'll need the unicharset later):
combine_tessdata -u eng.traineddata traineddat_backup/eng.
Create an eng.lstm-word-dawg file from the wordlistfile:
wordlist2dawg wordlistfile eng.lstm-word-dawg traineddat_backup/eng.lstm-unicharset
Replace the lstm-word-dawg component inside eng.traineddata:
combine_tessdata -o eng.traineddata eng.lstm-word-dawg
You now have an eng.traineddata whose word dawg is built from your custom words.
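The wordlistfile used in the steps above has to contain one word per line. Here is a minimal sketch of how I would clean up a raw list before passing it to wordlist2dawg. The sample words and the raw_words.txt name are placeholders, and it runs in a scratch directory so nothing in tessdata is touched:

```shell
# Work in a scratch directory, not in the system tessdata directory.
workdir="$(mktemp -d)"
cd "$workdir"

# Placeholder vocabulary with a blank line and a duplicate; use your own words.
printf '%s\n' 'Tesseract' 'dawg' '' 'unicharset' 'dawg' > raw_words.txt

# Strip carriage returns, drop blank lines, then sort and remove duplicates.
tr -d '\r' < raw_words.txt | grep -v '^[[:space:]]*$' | LC_ALL=C sort -u > wordlistfile

cat wordlistfile
```

The resulting wordlistfile can then be copied next to eng.traineddata and used as shown above.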
Now let's rebuild the eng model, completely replacing the standard dictionary words with our own. First, we need to remove or move away all the dawg files (.lstm-word-dawg, .lstm-punc-dawg, etc.) from the traineddat_backup directory. Just create a tmp folder and move all the dawg files there.
Once we have got rid of the standard dawg files, copy the eng.lstm-word-dawg file created before into the traineddat_backup directory. Go to this directory and combine everything into a traineddata file:
combine_tessdata <lng>.
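The file shuffling before the combine step can be sketched like this. To keep it safe to run anywhere, the sketch uses a scratch directory with placeholder files standing in for a real unpacked eng.traineddata; the component list is only an illustrative assumption, and on a real system you would run just the mv and cp lines inside tessdata:

```shell
# Scratch layout that mimics tessdata after "combine_tessdata -u".
scratch="$(mktemp -d)"
mkdir -p "$scratch/traineddat_backup" "$scratch/tmp"
cd "$scratch"

# Empty placeholder components (assumed names, for illustration only).
for part in lstm lstm-unicharset lstm-recoder lstm-word-dawg lstm-punc-dawg lstm-number-dawg; do
    : > "traineddat_backup/eng.$part"
done

# Placeholder for the custom dawg built earlier with wordlist2dawg.
echo custom > eng.lstm-word-dawg

# Move every standard dawg out of the way, then drop in the custom one.
mv traineddat_backup/*-dawg tmp/
cp eng.lstm-word-dawg traineddat_backup/

ls traineddat_backup
```

After this, traineddat_backup holds the non-dictionary components plus the custom word dawg, which is the state the combine command expects.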
You can rename your new model and put it in the tessdata directory so Tesseract can find and use it. Test the new model with the command
tesseract <image> -l <your_model> <output>
where <output> is the base name of the text file to write the result to (Tesseract appends .txt), or 'stdout' to print to standard output.
Tesseract uses configs (plain text files containing variables and their values as space-delimited key/value pairs) that let you control the OCR output. You can create your own config (like myconf), put it in the folder path/to/Tesseract/tessdata/configs and specify the config name when running Tesseract:
tesseract <image> <options> myconf
where the options are: the output file name (or 'stdout' to print to standard output), -l <language>, and --psm <psm>.
Tesseract provides a large set of control parameters to tune the output and improve its accuracy. Some of them control the use of dictionaries, e.g. by penalizing words that are not in the word_dawg / user_words wordlists: for instance, language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word. More on them is here. I used these values in my config, and that allowed Tesseract to recognize some words from my dictionary:
language_model_penalty_non_freq_dict_word 1
language_model_penalty_non_dict_word 1
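As a small sketch, this is one way to write that config file and check that every line is a "name value" pair before copying it into tessdata/configs. The scratch directory and the awk check are my own additions, not something Tesseract requires:

```shell
# Write the config into a scratch directory first.
confdir="$(mktemp -d)"
cat > "$confdir/myconf" <<'EOF'
language_model_penalty_non_freq_dict_word 1
language_model_penalty_non_dict_word 1
EOF

# Sanity check: every non-empty line must have exactly two fields.
awk 'NF && NF != 2 { bad = 1 } END { exit bad }' "$confdir/myconf" \
    && echo "config looks well-formed"
```

Once the check passes, copy myconf into the configs folder mentioned above and pass its name as the last argument to tesseract.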
That’s it. Enjoy the OCR using Tesseract and see you soon.