Abstract: We all recognize the amazing success of large language models in developing advanced AI/NLP technologies like ChatGPT for English. But very few other languages have access to the billions of words of text needed to train high-quality LLMs; this is what I call “the new digital divide”. I’d like to talk in broad terms about the prospects for AI and NLP in the other 7000 or so languages spoken in the world today.
A few potential discussion points:
- How are language communities trying to bridge this digital divide?
- What about the majority of the world’s languages that are spoken only (rarely if ever written by their speakers)?
- Are there novel architectures or training methods that could help overcome the lack of training data?
- How can we do this work while respecting the priorities of the language communities themselves?