Corpas na Gaeilge (1882–1926): Integrating Historical and Modern Irish Texts

Published in Proceedings of the Workshop “Language resources and technologies for processing and linking historical documents and archives” at LREC 2014, 2014

Recommended citation: Elaine Uí Dhonnchadha, Kevin Scannell, Ruairí Ó hUiginn, Eilís Ní Mhearraí, Máire Nic Mhaoláin, Brian Ó Raghallaigh, Gregory Toner, Séamus Mac Mathúna, Déirdre D’Auria, Eithne Ní Ghallchobhair, and Niall O’Leary. Corpas na Gaeilge (1882–1926): Integrating Historical and Modern Irish Texts. In Kristín Bjarnadóttir, Mathew Driscoll, et al., editors, Language Resources and Technologies for Processing and Linking Historical Documents and Archives – Deploying Linked Open Data in Cultural Heritage, pages 12–18, 2014.

Abstract: This paper describes the processing of a corpus of seven million words of Irish texts from the period 1882–1926. The texts, which have been captured by typing or optical character recognition, are processed for the purpose of lexicography. First, all historical and dialectal word forms are annotated with their modern standard equivalents using software developed for this purpose. Then, using the modern standard annotations, the texts are processed with an existing finite-state morphological analyser and part-of-speech tagger. This method retains the original historical text while providing full corpus-searching capabilities using modern lemmas and inflected forms (the historical forms can also be searched). It also makes use of existing NLP tools for modern Irish and enables the integration of historical and modern Irish corpora.
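The two-stage approach described in the abstract can be illustrated with a minimal Python sketch. It is not the project's software. The names `STANDARD_FORMS`, `MODERN_ANALYSES`, `Token`, `annotate`, and `search_by_lemma` are hypothetical, and the two small dictionaries are toy stand-ins for the standardisation software and the finite-state analyser and tagger. The spelling pairs are illustrative examples only.

```python
from dataclasses import dataclass

# Hypothetical toy lexicon: historical spelling -> modern standard spelling.
# A stand-in for the standardisation step; the entries are illustrative only.
STANDARD_FORMS = {
    "oidhche": "oíche",
    "bliadhain": "bliain",
    "bliadhna": "bliana",
}

# Hypothetical stand-in for the morphological analyser and tagger:
# modern standard form -> (lemma, part of speech).
MODERN_ANALYSES = {
    "oíche": ("oíche", "Noun"),
    "bliain": ("bliain", "Noun"),
    "bliana": ("bliain", "Noun"),
}


@dataclass
class Token:
    original: str  # the form as it appears in the historical text
    standard: str  # the modern standard equivalent
    lemma: str     # lemma derived from the standard form
    pos: str       # part of speech derived from the standard form


def annotate(words):
    """Stage 1: attach a standard form. Stage 2: analyse the standard form."""
    tokens = []
    for word in words:
        # Words missing from the lexicon are assumed to be standard already.
        standard = STANDARD_FORMS.get(word.lower(), word.lower())
        lemma, pos = MODERN_ANALYSES.get(standard, (standard, "Unknown"))
        tokens.append(Token(word, standard, lemma, pos))
    return tokens


def search_by_lemma(tokens, lemma):
    """Return the original forms of all tokens that share a modern lemma."""
    return [t.original for t in tokens if t.lemma == lemma]


corpus = annotate(["bliadhain", "oidhche", "bliadhna", "bliain"])
hits = search_by_lemma(corpus, "bliain")
print(hits)
```

Because each token keeps its original form alongside the annotations, a query on a modern lemma returns historical and modern spellings together, and the source text is never overwritten.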