There are currently 7,000 languages actively spoken in the world, and about 40% are endangered, at risk of disappearing forever (see map below). Can generative AI systems help us with preservation and education about these languages via translation into English or other high-resource languages? Not today.
Current state-of-the-art, off-the-shelf large language models like OpenAI’s GPT-4, Anthropic’s Claude Opus, or Google’s Gemini can translate easily between high-resource languages, say Spanish to English. But training data for low-resource and endangered languages is sparse or entirely absent from the pre-training data sets used by language models, like the Common Crawl, which we discussed in last week’s episode.
But a team of researchers at Carnegie Mellon University and UC Santa Barbara is trying to solve this problem. They’ve developed LingoLLM, a workflow and pipeline for improving the translation capabilities of large language models for low-resource and endangered languages that don't have much digitized content. Importantly, the workflow doesn’t require any additional training of the language model or special fine-tuning.
This week I spoke to Kexun Zhang, a PhD student in computer science at Carnegie Mellon University, who helped lead the first phase of LingoLLM’s development.
The LingoLLM workflow automates the creation of a package of linguistic artifacts — like grammar books and a gloss — both of which we talk about during our conversation. This package can then be passed to an off-the-shelf language model as part of a structured prompt, along with the passage in the low-resource language that needs to be translated. LingoLLM takes off-the-shelf language models from essentially useless at translating low-resource languages to a translation tool that, while not perfect, is still pretty good.
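To make the idea of a structured prompt concrete, here is a minimal sketch of how such a package might be assembled into one prompt. This is an illustrative assumption, not LingoLLM's actual implementation: the function name, field layout, and the sample gloss entries are all invented for the example.

```python
# Illustrative sketch only: LingoLLM's real prompt format may differ.
# The idea is to bundle linguistic artifacts (a word-by-word gloss and
# grammar notes) together with the passage, so the model can reason
# from the artifacts rather than from (nonexistent) training data.

def build_translation_prompt(passage, gloss, grammar_notes, source_language):
    """Assemble a structured prompt packaging linguistic artifacts
    alongside the passage to be translated."""
    sections = [
        f"Source language: {source_language}",
        "Gloss (word-by-word annotations):\n"
        + "\n".join(f"  {word}: {annotation}" for word, annotation in gloss),
        "Grammar notes:\n"
        + "\n".join(f"  - {note}" for note in grammar_notes),
        f"Passage to translate into English:\n  {passage}",
        "Using the gloss and grammar notes above, translate the passage.",
    ]
    return "\n\n".join(sections)

# Hypothetical example entries, just to show the shape of the inputs.
prompt = build_translation_prompt(
    passage="mi moku",
    gloss=[("mi", "first-person pronoun 'I'"), ("moku", "verb 'to eat'")],
    grammar_notes=["Word order is subject-verb."],
    source_language="(example language)",
)
print(prompt)
```

The resulting string would then be sent to the language model as-is; no fine-tuning or additional training is involved, which is the key property of the workflow described above.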
Kexun and I talked about how he got interested in linguistics, covered some background on low-resource and endangered languages, and discussed in detail the workflow behind LingoLLM and the challenges that remain. I had a great time talking to Kexun, and I think you'll enjoy the conversation.