Giorraíonn BERT Bóthar: Benchmark datasets and implementations for Irish NLP


Invited talk, via Zoom.

Abstract: I have been working on developing Irish language technology for almost 25 years, and over that time I have accumulated a large number of datasets for training models to perform various important NLP tasks: machine translation, language modeling, spelling and grammar correction, etc. Some, but not all, of these datasets are publicly available. I will begin by surveying some of this work and laying out what I view as key priorities for language communities seeking to develop advanced language technologies. I will then introduce a new resource that brings together all of these datasets in one place, along with baseline implementations of the various tasks that others can build on. This is similar in spirit to efforts like Papers with Code, although we have taken steps to try to mitigate some of the negative influence that so-called “leaderboard culture” has had on NLP research for English and other major languages.