Statistical models for text normalization and machine translation

Published in Proceedings of the 1st Celtic Language Technology Workshop at COLING 2014, 2014

Recommended citation: Kevin Scannell. Statistical models for text normalization and machine translation. In Proceedings of the First Celtic Language Technology Workshop, pages 33–40, Dublin, Ireland, 2014. Association for Computational Linguistics and Dublin City University.

DOI: 10.3115/v1/W14-4605

Abstract: Irish and Scottish Gaelic are closely related languages that, together with Manx Gaelic, make up the Goidelic branch of the Celtic family. We present a statistical model for translation from Scottish Gaelic to Irish that we hope will facilitate communication between the two language communities, especially in social media. An important aspect of this work is to overcome the orthographical differences between the languages, many of which were introduced in a major spelling reform of Irish in the 1940s and 1950s. Prior to that date, the orthographies of the two languages were quite similar, thanks in part to a shared literary tradition. As a consequence of this, machine translation from Scottish Gaelic to Irish has a great deal in common with the problem of normalizing pre-standard Irish texts, a problem with applications to lexicography and information retrieval. We show how a single statistical model can be used effectively in both contexts.