The Long, Arduous Journey To Realize Optical Character Recognition (OCR) For Yorùbá And Ìgbò

Technology for People of African Descent

By Victor Williamson Aug. 31, 2015

I have been trying to use the open source optical character recognition (OCR) program Tesseract to train for the Yorùbá and Ìgbò languages of West Africa. So far, I have not been successful. This is my second time trying. This time I've used the automated procedure and I'm getting failures during the training phases. The first time, I followed the manual procedure, but Tesseract output a bunch of hyphens rather than any text. Either way, I'm mostly stuck. I shared my challenges on the Tesseract Google forum here and did obtain some helpful tips from other developers.

I recently attempted to build a simple OCR in Java in hopes that I could customize a quick and dirty solution to a specific source text. Initially this idea showed promise, and I was able to subdivide the image into one or more characters, but as it turns out the matrix matching approach which compares characters on pixel by pixel basis fails to be a good classifier even for a single source text. Furthermore, the characters that stick together are difficult to divide into separate characters without the software having afore knowledge of the characters to begin with.

Primitive Java OCR on a page from A Dictionary of the Yoruba Language

So, the last approach is to simply run the available English OCR available on Yorùbá and Ìgbò texts. I have run OCR on Yorùbá digitized scans of the three Yorùbá dictionaries R. C. Abraham's Dictionary of Modern Yoruba, J. A. Fashagba's The First Illustrated Yoruba Dictionary and S. A. Crowther's A Dictionary of the Yoruba Language. The good news is that the English portions of the text OCR very well. The bad news is that diacriticized characters that use subdots, accents, graves, macrons and combinations of them are incorrect. I've run OCR using Tesseract and ABBYY FineReader. ABBYY is an industry standard, and worked best on the English portions, and it also tends to show English characters with diacritics removed, which is somewhat easier to work with than the random gibberish produced from diacritized characters that Tesseract produces. After customizing ABBBY for the Yorùbá and Ìgbò languages, i.e. by adding in the special characters manually, I am able to get most characters with a single diacritic, s.a. subdots or accents/graves, OCR'ed very well with ABBYY, but as of yet not for characters with both subdots and a combining accent, grave or macron.

So as of now the best approach to OCR Yoruba texts is to use a standard English OCR and then to modify the special characters by hand. This, however, is increasingly inconvenient for texts that are properly diacritized, since almost all Yoruba words have diacritics. No doubt research should continue with Tesseract and ABBYY FineReader to provide full support for the Yorùbá and Ìgbò languages. This will involve preparing vast amounts of corpus across various fonts, and training the target tools, preferably working hand in hand with the actual developers since characters that include multiple diacritics such as subdots combined with acutes, graves and macrons (ẹ̀, ọ́, ụ̄, etc.) tend to produce problems with existing OCR software.

The academics from Obafemi Awolowo's Computing and Intelligent Systems Research Group built an OCR for handwritten Yoruba texts from scratch under the leadership of Prof. Odetunji Odejobi which is claimed to recognize 90+% of Yoruba characters. This needs to be adapted to printed text because the handwritten samples used to test the engine include substantial spacing between characters which is usually not the case with printed text where individual characters often stick together horizontally.

Web Site names:

yorubabook.com, readyoruba.com, yorubaocr.com