Statistical Unicodification of African Languages

Published in Language Resources and Evaluation, 2011

Recommended citation: Kevin P. Scannell. Statistical Unicodification of African Languages. Language Resources and Evaluation, 45(3):375–386, 2011.

DOI: 10.1007/s10579-011-9150-3

Abstract: Many languages in Africa are written using Latin-based scripts, but often with extra diacritics (e.g. dots below in Igbo: ị, ọ, ụ) or modifications to the letters themselves (e.g. open vowels “e” and “o” in Lingala: ɛ, ɔ). While it is possible to render these characters accurately in Unicode, oftentimes keyboard input methods are not easily accessible or are cumbersome to use, and so the vast majority of electronic texts in many African languages are written in plain ASCII. We call the process of converting an ASCII text to its proper Unicode form unicodification. This paper describes an open source package which performs automatic unicodification, implementing a variant of an algorithm described in previous work of De Pauw, Wagacha, and de Schryver. We have trained models for more than 100 languages using web data, and have evaluated each language using a range of feature sets.
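
To make the idea of unicodification concrete, here is a minimal Python sketch. It is not the algorithm implemented in the package, nor the one from De Pauw, Wagacha, and de Schryver. It is a much simpler word-level lookup, shown only for illustration: each ASCII word is replaced by the Unicode form seen most often for it in training text. The names asciify, train, unicodify, and EXTRA_ASCII are hypothetical, and the tiny mapping for ɛ and ɔ is an assumption covering only the Lingala example mentioned in the abstract.

```python
import unicodedata
from collections import Counter, defaultdict

# Hypothetical mapping for modified letters that Unicode normalization
# cannot decompose into a base letter plus a combining mark.
EXTRA_ASCII = {"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"}


def asciify(text):
    """Return a plain-ASCII approximation of Unicode text."""
    # NFD splits a letter such as "ị" into "i" plus a combining dot below.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the combining mark, keep the base letter
        out.append(EXTRA_ASCII.get(ch, ch))
    return "".join(out)


def train(corpus_words):
    """Map each ASCII form to its most frequent Unicode spelling."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        counts[asciify(word)][word] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def unicodify(text, model):
    """Restore Unicode forms word by word; unknown words pass through."""
    return " ".join(model.get(token, token) for token in text.split())
```

A model would be built with train() from a list of correctly spelled words and then applied with unicodify() to whitespace-separated ASCII text. Words never seen in training are left unchanged in this sketch, which is exactly the gap that statistical, character-level methods are meant to address.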