“Hallucinations” aside, today’s sophisticated chatbots can sometimes seem like magic — passing standardized tests with flying colors, or conjuring up multilingual poetry in the blink of an eye. Well… depending on what language you speak.

A recent paper awaiting peer review, from a group of researchers at Amazon and the University of California, Santa Barbara, found that chatbots’ linguistic skills might be threatened by ghosts from a past era of AI, raising significant questions about their ability to communicate effectively in lesser-used languages on the web (think regional dialects from Africa or Oceania).

Analyzing a database of billions of sentences, the researchers found that a huge chunk of the digital text likely to be hoovered into LLMs from those languages wasn’t written by native speakers, but was instead crudely machine-translated by older AIs. That means today’s cutting-edge multilingual models are training on very low-quality data, leading to lower-quality output in some languages, more hallucination, and potentially amplifying the web’s existing shortcomings and biases.

That’s obviously bad in its own right, but it raises a larger question about the future of generative AI: Is it doomed, as some have predicted, by the “garbage in, garbage out” principle?

I spoke today with Ethan Mollick, an AI researcher and professor at the University of Pennsylvania’s Wharton School, and asked what he thought of the findings, given his work on how people actually interact with AI models in professional and classroom settings. He was skeptical that messy, photocopy-of-a-photocopy results like those the Amazon and UC Santa Barbara researchers found could lead to the “model collapse” some researchers fear, but said he could see a need for AI companies to tackle language issues head-on.

“There are worlds where this is a big problem, and data quality and data quantity both matter,” Mollick said. “The real question is whether there’s going to be a deliberate effort, like I think Google has done with Bard, to try and train these models for other languages.”

Usually, large language models are trained with extra weight given to heavily edited, high-quality sources like Wikipedia or officially published books and news media (a rough sketch of that kind of source weighting follows below). In the case of lesser-used languages, there’s simply less native, high-quality content of that kind on the web. The researchers found that AI models then disproportionately train on machine-translated articles they describe as “low quality,” about “topics like being taken more seriously at work, being careful about your choices, six tips for new boat owners, deciding to be happy, etc.”

All it takes to see what the “garbage in” might look like, then, is a quick web search. The “garbage out” is, of course, apparent from one’s interactions with the model, but exactly how it got made is less clear — and researchers like Mollick say the sheer size and sophistication of current AI models means their inner workings remain opaque for the moment.

“Even with open-source models, we just fundamentally don’t know” how, or why, certain AI models operate better or worse in any given language, Mollick said. “There are dueling papers about how much the quality versus quantity of data matters and how you train better in foreign languages.”

So, for those keeping score: Old, low-quality, machine-translated foreign-language content does predominate in more obscure languages, reducing AI models’ fluency in them.
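To make that “extra weight” idea concrete, here is a minimal sketch of how a training pipeline can oversample curated sources relative to raw web crawl. The source names and numbers are purely illustrative assumptions; no lab’s actual data mixture is public.

```python
import random

# Hypothetical per-source sampling weights. Real labs don't publish their
# exact mixtures; these names and values are illustrative only.
SOURCE_WEIGHTS = {
    "wikipedia": 3.0,          # heavily edited, so oversampled
    "published_books": 2.5,    # professionally edited
    "news_media": 2.0,
    "general_web_crawl": 1.0,  # uncurated; where MT-spam articles tend to live
}

def sample_source(weights: dict[str, float]) -> str:
    """Pick the corpus to draw the next training document from,
    in proportion to its weight."""
    sources = list(weights)
    return random.choices(sources, weights=[weights[s] for s in sources])[0]

# Over many draws, Wikipedia text shows up ~3x as often as raw crawl text,
# even if the crawl is far larger in absolute terms.
counts = {source: 0 for source in SOURCE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(SOURCE_WEIGHTS)] += 1
print(counts)
```

The catch the researchers identify sits in that dictionary: for low-resource languages, the curated buckets are nearly empty, so in practice most draws come from the uncurated crawl.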
But we don’t know exactly how that degradation happens within any given AI model, and we also still don’t know the extent to which AI development is threatened by training on AI-generated content.

Mehak Dhaliwal, a former AWS intern and current PhD student at UC Santa Barbara, told Vice’s Motherboard that the team initiated the study because they had seen the lack of quality firsthand.

“We actually got interested in this topic because several colleagues who work in MT [machine translation] and are native speakers of low-resource languages noted that much of the internet in their native language appeared to be MT generated,” he said.

So what can actually be done about it?

Brian Thompson, a senior scientist at Amazon AWS AI who is one of the paper’s authors and its listed contact, told DFD via email that he couldn’t comment. But he pointed to his fellow researchers’ conclusion that model trainers could use tools to identify and eliminate machine-translated content before it gums up the model’s works.

Both researchers and the data analysts fine-tuning these models can already flag and classify data at an almost psychedelically minute level, meaning it should be no problem to at least attempt a prophylactic against bad translated content (a rough sketch of such a filter follows below). Still, with the most sophisticated AI models, like GPT-4, rumored to have roughly 1.8 trillion parameters, those scientists could have their work cut out for them.
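The paper’s conclusion doesn’t prescribe a specific tool, so the sketch below is just one plausible shape for such a filter, not the researchers’ actual method. The `mt_likelihood` scorer and its threshold are hypothetical stand-ins; the signal it keys on, multi-way parallelism (the same passage appearing translated into many languages at once), is the kind of fingerprint the researchers associate with mass machine translation.

```python
# A hypothetical pre-training filter. `parallel_copies` (how many
# languages an identical passage appears in) stands in for whatever
# machine-translation signal a real pipeline would compute; the
# threshold is an illustrative assumption, not a published number.

def mt_likelihood(parallel_copies: int) -> float:
    """Toy score: content cloned into many languages at once is more
    likely to have been machine-translated en masse."""
    return min(1.0, parallel_copies / 10)

def filter_corpus(docs: list[tuple[str, int]], threshold: float = 0.5) -> list[str]:
    """Drop documents that look machine-translated before training."""
    return [text for text, copies in docs if mt_likelihood(copies) < threshold]

corpus = [
    ("Six tips for new boat owners", 27),           # in 27 languages: suspect
    ("Local reporting on a city council vote", 1),  # appears once: likely native
]
print(filter_corpus(corpus))  # -> ['Local reporting on a city council vote']
```

In a real pipeline, the scorer would be a trained classifier or an index of parallel text rather than a one-liner, but the intervention keeps the same shape: score each document, set a threshold, and drop the suspect ones before the model ever trains on them.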