Imagine waking up in the morning. You read your emails over your morning coffee and use Gmail’s autocomplete feature to compose your replies. Before leaving the house, you ask Siri for the weather forecast to decide whether you need to bring a jacket. Later in the day, you interact with a customer service chatbot about an order you didn’t receive. In the evening, while scrolling social media, you come across a post from a friend written in a foreign language and click “translate to English” out of curiosity. Language technologies have become pervasive in our daily lives.
I’ve spent the last decade researching AI and language technologies. In parallel with my attempts to “teach the computer to speak English”, my own English has been improving, especially after relocating to English-speaking countries. This parallel process of acquiring everything from vocabulary to figurative expressions to cultural references inspired me to write “Lost in Automatic Translation”. Here I share several highlights about language technologies, focusing on large language models (LLMs) such as ChatGPT and Gemini, and on automatic translation tools such as Google Translate.
Take Google Translate, for example. If you’ve been using it for many years, you may have noticed a substantial improvement around 2016, when Google replaced its previous translation models, which were based on corpus statistics, with neural network-based models. Automatic translation has improved immensely in the last decade, and it is incredibly useful, but translation quality is not equal across language pairs. One of the key enablers of progress in language technologies in general, and in automatic translation in particular, was the availability of massive amounts of data to train these models. When this data is unavailable, the results are far less impressive. For example, translation works reasonably well for language pairs with a lot of online data, such as English and French, but it is far worse when translating from and into “low-resource” languages such as Igbo.
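To make this concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the publicly available Helsinki-NLP OPUS-MT checkpoints (illustrative choices of mine, not tools discussed above). It shows how little code it takes to try a neural translation model for a high-resource pair; comparable models for low-resource pairs are trained on far smaller parallel corpora and their output is noticeably worse.

```python
# Sketch: trying a neural translation model for a high-resource language pair.
# Assumes: pip install transformers sentencepiece
from transformers import pipeline

# English -> French: the underlying model was trained on a very large parallel corpus.
en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print(en_fr("Preheat the oven to 180 degrees.")[0]["translation_text"])

# For a low-resource pair such as English -> Igbo, far less parallel text exists online,
# so any comparable model is trained on much less data and translates far less reliably.
```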
Similarly, LLMs were first developed for English and only later applied to other languages. The design choices, such as how to split sentences into words, were informed by English and are often suboptimal for other languages. While most LLMs today are multilingual, they still support only a few dozen languages, a small fraction of the world’s roughly 7,000 languages. The quality of outputs is also not uniform across the supported languages, and the models typically perform best when prompted in English. This is unsurprising given that English dominates the internet, which serves as the training data for LLMs. While the most popular LLMs, such as GPT-4, are proprietary, as is their training data, an open-source model from Meta, Llama 2, was reportedly trained on a corpus composed of roughly 90% English text.
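As an illustration of one such design choice, here is a small sketch, assuming the Hugging Face transformers library and the publicly available GPT-2 tokenizer (chosen only because its vocabulary was learned largely from English text). It compares how many subword tokens the same short sentence takes in English and in Hebrew.

```python
# Subword tokenizers whose vocabularies were learned mostly from English text
# split non-English sentences into many more (often byte-level) pieces.
# Assumes: pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # BPE vocabulary learned largely from English web text

sentences = {
    "English": "The weather is nice today.",
    "Hebrew": "מזג האוויר נעים היום.",
}

for language, sentence in sentences.items():
    tokens = tokenizer.tokenize(sentence)
    print(f"{language}: {len(tokens)} tokens")

# The Hebrew sentence typically ends up as far more tokens than the English one,
# even though the two say roughly the same thing: one symptom of English-centric design.
```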
Human conversations have a situational context: where and when they take place, the relationship between the participants, the cultural background of each participant, previous interactions, and more. Human communication is efficient, so we leave a lot of implied meaning unsaid. Language technologies for the most part lack this context, so things often get “lost in translation”. My favorite example is a cake recipe I translated from Hebrew to English that called for “preheating the oven to 180 degrees”. The recipe omitted the implied Celsius unit, which its intended audience in Israel could infer. When I used Google Translate to translate the recipe for my Canadian partner, it faithfully rendered the instruction as “preheat the oven to 180 degrees”. The tool worked exactly as expected, but had my partner been less experienced with baking, he would have underbaked the cake at 180 degrees Fahrenheit, the unit implied in the recipes he is used to.
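The arithmetic behind that risk is simple. Here is the conversion the human reader is implicitly expected to do, sketched in Python; the formulas are the standard Celsius/Fahrenheit conversions, and the values come from the recipe example above.

```python
# Standard Celsius/Fahrenheit conversions, applied to the recipe example.
def celsius_to_fahrenheit(c: float) -> float:
    return c * 9 / 5 + 32

def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32) * 5 / 9

print(celsius_to_fahrenheit(180))  # 356.0: the setting the recipe actually intends, in Fahrenheit
print(fahrenheit_to_celsius(180))  # ~82.2: what 180 degrees Fahrenheit really is in Celsius, far too cool to bake a cake
```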
One such aspect of situational context is the speakers’ cultural background. Culture shapes our worldviews and communication styles. Just as cross-cultural human interactions can lead to miscommunication, so can interactions with language technologies. We already established that LLMs learn about the world mostly by reading English web text. Statistically, this text is written predominantly by US-based users. As a result, LLMs develop a North American, or more broadly Western, lens. So much so that if you ask them why someone might have tipped 5% at a restaurant, they go out of their way to justify the behavior (bad service? budgetary constraints? a miscalculation?), completely missing the possibility that this person lives outside North America, in a country where tipping less than 20% is entirely acceptable.
One risk of this cultural bias is that users may adapt their communication styles in order to be better understood by language technologies. As more people use these technologies, over time we run the risk of erasing individual and cultural differences. Crucially, now that LLMs are used to automate processes that affect people’s lives, such as healthcare and recruiting, this North American bias may result in discrimination against people from other cultures.
It’s hard to imagine a world today without automatic translation, and this has been the case for many years. LLMs, by contrast, were popularized only in the last three years, yet they quickly gained millions of users around the globe. We love them. We use them to brainstorm, answer complex questions, get advice, and much more. The progress made in the last few years has been incredible, but some technical limitations linger. LLMs “hallucinate”, i.e., generate factually incorrect, made-up content; they also suffer from limited reasoning abilities, overconfidence, and societal biases. While many in industry and academia are working to address these issues, I do not expect dramatic breakthroughs anytime soon. Data was the main enabler of the recent progress, but even OpenAI admits that “we are running out of data”. To make real progress, I predict that we will have to go back to the drawing board and come up with new ideas to supplement our current data-driven approaches.