"Everyone knows what a horse is": What we got wrong about the Universal Translator
Linguistic Ruminations and a dream of a Universal Translator
I’ve been into linguistics for as long as I can remember. I was captivated by ideas like the claim that the Hopi language has no concept of time1, or that Chinese has no way to think about hypotheticals2. It turns out both of these claims, along with many similar linguistic myths, are complete BS, but they were so fun to ruminate on that they survived longer than they should have.
True or not, these ideas got me thinking, and I became fascinated with how much of our thoughts and perceptions feel framed by language3. It also didn’t help that I was heavily into sci-fi. In those imaginative worlds, you would sometimes run into languages where offering a glass of water meant you were asking to form a lifetime bond4, or where merely understanding the language allowed you to see through time5.
And why shouldn’t this be possible? The sky’s the limit in terms of thought, and if we assume language is primarily encoded thought, anything should be possible in language (or so one might reason).
Even when people speak the same language, personal experience and linguistic ambiguity will cloud communication, because our definitions of common words aren’t exactly the same. You could be living in Antarctica, so when you say or hear “bird” you picture a penguin, while when I say it, I picture a magpie. With this view, it’s a wonder we can communicate anything at all.
Now, imagine the near-impossibility of a Universal Translator. This is a common trope in sci-fi: basically, some sort of device that lets the protagonists and alien cultures speak to each other right away, without years of linguists and anthropologists (xeno-pologists?) studying each other’s cultures and languages. It was understood to be a plot device, not a real, feasible piece of technology. How could it be? Again, we barely understand each other when speaking the same language… different languages involve different worldviews, baseline understandings, semantics and syntax… even beginning to propose a hypothesis of how one could theoretically be built… surely that would take centuries.
Except that all of a sudden, with LLMs, it just kind of happened6, almost as a side-thought. It’s not even the aspect of LLMs people talk about the most. So what did the philosophy get wrong?
Computers can do it, right?
Computers are, first and foremost, better at math than we are. So it stood to reason that getting a computer to understand language first meant turning language into some absolute, precise, mathematical representational form that a computer could understand.
This was still the primary way of thinking back in the mid-2000s when I was getting my degree in Computational Linguistics. We had a lot more data about the diversity of language than generations past, and rather than thinking “absolutely anything goes in language”, it felt like we were on the brink of precise linguistic principles, waiting somewhere to be articulated. Sure, there were exceptions to every rule and exceptions on top of those exceptions, but the tower of exceptions finally felt like it wasn’t infinite. Somewhere in there, in some convoluted way, we could describe language… we hoped.
But the deeper down this route we went, the harder it was to generalize. When we did actual tests of these systems (actually trying to get them to translate real sentences they’d never seen) they would do very very poorly, unless the examples were really contrived.
Statistics: silly, but less wrong
Experimentally, the best computational approach we had at the time was (mostly) throwing linguistics away and using pure statistics: find a bunch of pre-translated text (say, parliamentary speeches from multilingual countries) and infer translations based on when words occur together. So if you have thousands of English speeches hand-translated to French by workers in parliament, and every time the English transcript has the word “cat”, the French one says “chat”, you can infer these are translations of each other. There are thousands of ways that this approach doesn’t work, but in terms of actual results, it outperformed any other approach we had by a long shot. Still, it felt like cheating.
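To make that concrete, here’s a minimal sketch of the co-occurrence idea in Python, with a made-up four-sentence “parliament”. Everything about it is a toy (real systems of that era used EM-based alignment models, not raw counts), but it shows how translations can fall out of nothing more than “these words keep showing up together”:

```python
from collections import Counter, defaultdict

# Toy parallel corpus: (English, French) sentence pairs, the kind you'd
# pull out of parliamentary proceedings in a bilingual country.
parallel = [
    ("the cat sleeps", "le chat dort"),
    ("the cat eats", "le chat mange"),
    ("the dog sleeps", "le chien dort"),
    ("the dog eats", "le chien mange"),
]

cooc = defaultdict(Counter)   # cooc[english_word][french_word] = count
fr_freq = Counter()           # how often each French word appears overall

for en, fr in parallel:
    fr_words = fr.split()
    fr_freq.update(fr_words)
    for e in en.split():
        for f in fr_words:
            cooc[e][f] += 1

# Crude association score: co-occurrence relative to how common the French
# word is overall (otherwise "le" would win for every English word). With a
# corpus this tiny the results are fragile, but the idea scales.
for e, counts in cooc.items():
    best = max(counts, key=lambda f: counts[f] / fr_freq[f])
    print(f"{e} -> {best}")
```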
Is self-reference a bad thing?
With this word-for-word approach, all you’ve really done is create a circular definition. All your system can say is “I’m pretty sure ‘cat’ is ‘chat’ in French”. The limitations of this approach showed up in the ‘epic’ Google Translate fails of the 2010s.7
It still felt like language translation needed to start with exact definitions. If you wanted to say “chat” is “cat” with some confidence, you need to know what exactly a “cat” is in the first place, right?
Well, what does a dictionary say?
“Everyone can see what a horse is”
This is the beginning of the entry for “horse” in the first Polish encyclopedia ever written8. It’s funny, but it does illustrate a paradox of dictionaries and encyclopedias: they can only define things in terms of other things (usually with a few steps in between). A horse is the thing commonly understood to be a horse. That’s the best, most accurate definition possible.
From a mathematical-rigor and philosophy perspective, this seems useless and immediately dismissible, because it’s self-referential and you’re looking for some kind of first principles to build up from. But what if there are no first principles? What if it’s only referential? Can we model that?
King - man = queen
An amazing paper9 came out in 2013 that really should have upended all of linguistic philosophy (but sadly didn’t). In it, some computer scientists again ignored linguistics10 and thought “what if every word is just dependent on the words around it and nothing else?” So if you see phrases like “the cat is soft” and “the cat is furry”, you know “the”, “is”, “furry”, and “soft” all predict “cat”. In fact, you could say “cat” is defined by the words it shows up around. We could define “cat” as “a word that’s likely to show up when you say a bunch of these related words”, with each related word being “embedded” in a similar relationship with all the other words it’s associated with.
This gives words an “address” relative to other words in the language. If you imagine turning the whole language into a multi-dimensional street map11 where every word is a street, then “cat” will show up at the intersection of “the”, “furry”, “is”, and “soft” etc. streets.
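If you want to see this “map” get built, here’s a small counting sketch. The corpus, the two-word window, and the similarity measure are all just illustrative choices I’m making here (word2vec itself learns denser vectors by prediction rather than counting), but the principle is the same: a word’s coordinates are nothing but the company it keeps:

```python
import numpy as np

# Toy corpus. Each word's "address" is built purely from which other words
# appear near it (here, within a +/- 2 word window -- an arbitrary choice
# made just for this sketch).
corpus = [
    "the cat is soft", "the cat is furry", "the dog is furry",
    "the king is royal", "the queen is royal",
]

tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count how often every pair of words shows up in the same window.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

# A word's row of counts IS its location on the "map"; closeness is just the
# angle between two rows.
def closeness(a, b):
    va, vb = counts[idx[a]], counts[idx[b]]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

print(closeness("cat", "dog"))    # ~0.91: they share most of their contexts
print(closeness("cat", "royal"))  # ~0.52: far fewer shared contexts
```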
Note how insane this sounds, especially since we have such a mix of words in there. But then it starts making sense, precisely because the words in the mix are so different. Take “the” and “furry”. One of them tells you the grammatical class of the word (it’s a noun because it takes ‘the’) and the other tells you a semantic aspect of the cat (it’s furry as opposed to hard). You’ll end up with ambiguities in there (like “scan” from “CAT scan”), but what if that ambiguity is also just part of the word ‘cat’? When someone says “cat”, there’s a chance they’re talking about a CAT scan; shouldn’t any model of the word include that possibility too?
It feels like a fun thought experiment, except for the fact that it actually produces surprising results in real life. We’ve taken all the words and given them addresses, right? So what if you tried to travel around this “city”? Say you started at the word “king” (the intersection of “man” and “royal” and “the” etc.) and walked one house away from “man” street: what’s the next word you would expect to see? Something like a king, but minus the “man” part? You would expect “queen”… and shockingly, that’s exactly what you find! Start at “king” and go towards “young” street, and you find “prince”!12
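You don’t have to take my word for it; with a set of pretrained vectors you can run the famous version of this query yourself. The library and model name below are just one convenient choice:

```python
# A sketch assuming gensim is installed; api.load() downloads a smallish set
# of pretrained GloVe vectors the first time it runs.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# The textbook formulation: start at "king", step away from "man" and toward
# "woman", then ask which word lives nearest to where you land. With these
# vectors "queen" typically shows up at or near the top of the list, though
# (as the footnote says) the results aren't always this tidy.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```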
Again, this isn’t something we imposed or even taught it; it turns out that if you take this view, words automatically arrange themselves this way!
I don’t know how to emphasize how weird and unexpected this is, so I’ll say it again… if you just take words and define them as “the words that surround them”, all of a sudden grammatical rules, semantic classes, analogies, ambiguities, humor… all of it starts structuring itself in front of you.
Everyone does know what a horse is
In this view, “horse” is just “the word that’s most likely to show up if you see a bunch of other horse-related words around it”.
Just based on that, suppose you saw a bumper sticker that said “home is where the horse is”. What could you infer?
This is referencing the phrase “home is where the heart is”, since this is the most likely similar phrase given context
It could be an accidental typo; they really meant “heart”
They could be using the word “horse” in their own personal way of speaking, the way other people would use “heart”: when they say “horse” they mean “heart”, i.e., they “heart horses”
They mean it literally, that they have a horse at home,
But are also deliberately evoking the more common phrase for some rhetorical effect
All of these ambiguities are immediately derivable from a contextual definition of language.
In a more ‘first principles’ version of language (the way we imagined we had to teach it to computers before), the best case would be the literal interpretation. Again, in sci-fi, we have tropes of machines not getting humor that comes from the ambiguity of language because we assume they can’t handle ambiguity. When language is defined in context instead, then humor from ambiguity (which we see in real life) becomes an obvious feature of the model as well.
To clarify: I’m not claiming the way word embeddings are made is the same as the way we think about words (we know that’s not the case)13, but the idea that words and language might mostly just be about context… that part seems to have a lot of merit to it.
The Universal Translator
Somewhat disappointing, but it turns out that translation is more about context and statistics than philosophy.14
If your goal is communicating an idea, the best way to do it is to discuss the ideas surrounding it and let the context evoke the idea. Learning or teaching a new language can work the same way: start from a few ‘neighbor’ concepts you already share and let the contextual nature of language itself ‘map’ out the rest.
You can find words that don’t have exact translations, but you’ll know (quantifiably!) how far off these concepts are from each other, what contextual markers are important for clarity… all of these things just live within these contextual maps automatically.
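Here’s a toy sketch of what that looks like mechanically. All the data in it is synthetic, and the alignment trick (orthogonal Procrustes over a small seed dictionary) is just one well-known way of lining up two embedding spaces, but it shows how ‘translation’ reduces to a nearest-neighbor lookup with a distance number attached:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (every word and number here is made up for illustration):
# pretend en_vecs and fr_vecs are embeddings learned independently for
# English and French. The "French" space is literally a rotated copy of the
# "English" one -- a cartoon of the structural similarity that real
# cross-lingual alignment relies on.
en_words = ["cat", "dog", "king", "queen", "water"]
fr_words = ["chat", "chien", "roi", "reine", "eau"]

dim = 4
en_vecs = {w: rng.normal(size=dim) for w in en_words}
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
fr_vecs = {f: en_vecs[e] @ rotation for e, f in zip(en_words, fr_words)}

# Learn a mapping from a small seed dictionary of known pairs, deliberately
# leaving "cat"/"chat" out so we can test on it.
seed = [("dog", "chien"), ("king", "roi"), ("queen", "reine"), ("water", "eau")]
X = np.stack([en_vecs[e] for e, _ in seed])
Y = np.stack([fr_vecs[f] for _, f in seed])
U, _, Vt = np.linalg.svd(X.T @ Y)   # orthogonal Procrustes solution
W = U @ Vt

# "Translate" the held-out word: map it across and take the nearest neighbor.
# The cosine score is the quantifiable "how far apart are these concepts"
# number mentioned above.
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = en_vecs["cat"] @ W
best = max(fr_vecs, key=lambda f: cos(query, fr_vecs[f]))
print(best, round(cos(query, fr_vecs[best]), 3))   # -> chat 1.0
```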
The output is only as good as the input
A word of caution though: it’s important to remember these maps aren’t objective either… they’re specifically a map of what has been given to them, and they reflect the cultural values of the input. If you fed one xenophobic training text, you might find the words “criminal” and “foreigner” are close neighbors, whereas a more multicultural text might put “foreigner” closer to “traveler”. The flip side is that now these kinds of hidden intents and subtexts in language are easy to spot and traverse.
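Spotting that kind of drift is itself just another neighborhood query. A sketch, assuming the same pretrained vectors as before (to audit your own model, you’d swap in vectors trained on your own text):

```python
# Same pretrained GloVe vectors as earlier (an assumption for this sketch);
# not a rigorous bias test, just the neighborhood check described above.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

for a, b in [("foreigner", "criminal"), ("foreigner", "traveler")]:
    if a in vectors and b in vectors:
        print(a, b, round(float(vectors.similarity(a, b)), 3))
```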
https://en.wikipedia.org/wiki/Hopi_time_controversy - Basically some combination of linguistic misunderstanding, imprecise definitions, and intentional exotification made it seem like the Hopi people just lived in this ever-present timeless state of only knowing ‘now’. That’s not the case; they’re humans, not magical sprites.
The Linguistic Shaping of Thought: A Study in the Impact of Language on Thinking in China and the West, AH Bloom 1981. Again, poor understanding and intentional exotification made it seem that Chinese speakers had no way to express hypotheticals, which (if true) would limit their creativity and imagination. It’s absolutely not even a tiny bit true, just to be clear.
https://en.wikipedia.org/wiki/Sapir-Whorf This has been a source of controversy as to how much a language frames the world and your thoughts.
Stranger in a Strange Land, Heinlein 1961. Also where we get the word “grok” from.
Story of Your Life, Chiang 1998. The movie “Arrival” is based on it.
Kind of happened, almost. Sure LLMs aren’t perfect translation machines but they’re way better than anything else we’ve ever come up with
It was, and still is, pretty impressive for the time; I just mean the seams and limitations of this process became pretty well known
Kind of… they cleaned things up a bit so that words like “think” and “thinking” end up being the same, “The” (capital T) and “the” (lowercase T) end up being the same, etc., but you get what I mean
This note is to say hello to the “um, actually…” people. I know it’s not a great analogy, but I don’t know a better way to talk about multi-dimensional vectors
Yes there are limitations to this approach too, and it doesn’t always have results this neat and interpretable.
Based on the amount of “training data” needed for us to pick up language as children vs the amount of data needed to build a ‘reasonable’ version of word embeddings or LLMs. If we needed to be exposed to that much data before learning language, most of us would barely be able to speak. It seems to indicate we’re wired for it.
Unless you’re trying to translate poetry or something that is “playing with language” in some way, but I’m talking about 99% of the cases where you’re just trying to convey “wash hands before going to work” or similar