Major Breakthrough: Google Cracked the Code for Building AI in 400+ Languages
Ever wonder why ChatGPT speaks English better than, say, Swahili or Arabic?
It’s not an accident, or some special bias toward English in the training data; it’s math. AI companies have been flying blind when building models for non-English languages, guessing at how much data to use and which languages to train together.
Google’s research team just published ATLAS (paper), the largest public study on multilingual AI training. They ran 774 experiments across 400+ languages to answer questions that have stumped developers: How much bigger should your model be if you want to support 50 languages instead of 10? Which languages actually help each other during training?
The key breakthrough
ATLAS creates a “transfer matrix” showing which languages boost each other’s performance. Norwegian improves when you train it alongside Swedish and German. Malay benefits from Indonesian. Arabic gets better with Hebrew. The pattern? Languages that share the same alphabet and language family help each other most.
Three practical tools ATLAS provides:
- Scaling calculator: If you want to double your language support (from K to 2K languages), increase model size by about 1.18x and total training data by about 1.66x (see the quick sketch after this list).
- Language pairing guide: A heat map showing which languages work best together; English, French, and Spanish help the most languages overall.
- Pre-train vs. fine-tune decision: A formula showing when to start from scratch versus building on an existing multilingual model (usually between 144 billion and 283 billion tokens for 2-billion-parameter models).
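To make the scaling rule concrete, here’s a quick back-of-envelope sketch in Python showing how those multipliers compound across repeated doublings of language coverage. The function name, starting model size, and token counts are illustrative assumptions, not figures from the paper; only the 1.18x and 1.66x multipliers come from the rule quoted above.

```python
import math

# Back-of-envelope helper for the ATLAS rule of thumb quoted above:
# each doubling of language count multiplies model size by ~1.18x
# and total training data by ~1.66x. Starting figures below are
# made up for illustration.
def scale_for_languages(base_params: float, base_tokens: float,
                        base_langs: int, target_langs: int) -> tuple[float, float]:
    """Estimate model size and data needed when growing language coverage."""
    doublings = math.log2(target_langs / base_langs)
    return (base_params * 1.18 ** doublings,
            base_tokens * 1.66 ** doublings)

# Example: a hypothetical 2B-parameter model trained on 300B tokens for
# 10 languages, scaled up to 40 languages (two doublings).
params, tokens = scale_for_languages(2e9, 3e11, 10, 40)
print(f"~{params / 1e9:.2f}B parameters, ~{tokens / 1e9:.0f}B tokens")
# -> roughly 2.78B parameters and ~827B tokens
```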
They also tackled the “curse of multilinguality”, the well-documented tendency for per-language performance to drop as you add more languages to a model. Good news: the curse is real, but mild. Languages that share scripts create enough positive synergy to offset most of the capacity hit.
Why this matters
Over 50% of AI users speak non-English languages, but scaling laws have been overwhelmingly English-focused. Developers building multilingual AI have been making expensive guesses about model size and training data.
ATLAS gives them a data-driven playbook. Expect the next wave of multilingual models to actually work well in languages beyond English, because companies now know how to allocate compute budgets across languages efficiently.
What’s next: Model developers at companies like Anthropic, OpenAI, and Google will likely adopt these scaling principles over the next 6-12 months (maybe the Chinese labs will as well!). If you’re building or evaluating multilingual AI products, check which languages they prioritized in training; ATLAS shows those choices have measurable impact.
Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.