Scaling up language technology for the next 1000 languages
Invited talk.

Abstract: There have been tremendous gains in the performance of Natural Language Processing systems over the last several years, but the majority of research and development has focused on English and a handful of other “major” languages. I will discuss a long-running academic project that involves the development of basic language technologies for many indigenous and minority languages around the world. There are two principal challenges to overcome in this work. The first is purely linguistic: many of these languages have incredibly complex morphological structures, and even models that achieve state-of-the-art results for English can fail badly on them. The second is the lack of training data. Most of the language communities I work with are very small, some having only a few dozen speakers, so text corpora with millions of words are virtually impossible to come by. I will discuss some attempts to overcome the lack of data through web-crawling and the mining of social media sites, as well as the development of deep learning models that can perform reasonably well with very little training data.