In the frantic, eye-wateringly expensive race to build thinking machines that can navigate the physical world, a philosophical rift is widening into a canyon. On one side stand the pragmatists, intent on harnessing the colossal power of existing Large Language Models. On the other are the purists, who argue that true physical intelligence cannot be bolted on—it must be forged from the ground up. This week, the humanoid robotics firm 1X Technologies planted its flag firmly in the second camp, launching the 1X World Model Lab with a declaration that felt less like a press release and more like a manifesto.
“You can’t fine-tune your way to AGI,” declared 1X CEO Bernt Børnich in a pointed announcement. “And you definitely can’t fine-tune your way to robots that can operate in the physical world.” It was a direct shot across the bow of competitors who are enthusiastically adopting Vision-Language-Action (VLA) models: AI systems that essentially “wrap” a powerful pretrained model like GPT-4 with motor control capabilities. 1X is staking its future on a different, far more arduous path: embodied world models.
The Great Divide: Fine-Tuning vs. First Principles
To understand the gravity of 1X’s move, one has to appreciate the two competing doctrines currently battling for the soul of robotics.
The Vision-Language-Action (VLA) approach, championed by the likes of Figure AI, is very much the path of least resistance. The logic is seductive: take a multi-billion-pound foundation model that already “understands” language and vision, fine-tune it on a dataset of robot actions, and—Bob’s your uncle—you have a robot that can follow instructions. It’s a strategy that piggybacks on the immense progress (and staggering investment) in LLMs. The snag, critics argue, is that these models lack a genuine grasp of physics. They are sophisticated pattern-matchers, not physics engines. They might know from their training data not to drop a pint glass, but they don’t intrinsically understand that gravity will inevitably smash it to bits.
Then there is the World Model approach. This is the hard road. The goal is to build a foundation model that learns an internal, predictive simulation of reality. Before it ever learns a specific task like “pick up the apple,” it must first master concepts like space, motion, object permanence, causality, and the unforgiving laws of physics. Proponents believe this is the only route to true generalisation—the ability for a robot to act intelligently in novel situations it has never encountered in its training data.
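To make the distinction concrete, here is a toy sketch in Python of what "predict before you act" means. It is emphatically not 1X's system: the `predict` function is a hand-written stand-in for a learned dynamics model, and every name in it (`State`, `rollout`, `choose_action`) is hypothetical. The point is only that the agent imagines the consequences of each action and picks the one whose predicted future looks best, rather than recalling that dropping a glass is frowned upon.

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass(frozen=True)
class State:
    height: float    # metres above the floor
    velocity: float  # m/s, positive is upward
    held: bool       # is the gripper still holding the glass?


def predict(state: State, action: str, dt: float = 0.01) -> State:
    """One step of a stand-in 'world model': given a state and an action,
    return the predicted next state. A real system would learn this."""
    held = state.held and action != "release"
    if held:
        return State(state.height, 0.0, True)
    velocity = state.velocity - G * dt
    height = max(0.0, state.height + velocity * dt)
    return State(height, velocity, False)


def rollout(state: State, first_action: str, steps: int = 200) -> State:
    """Imagine the future: take first_action once, then do nothing further."""
    state = predict(state, first_action)
    for _ in range(steps - 1):
        if not state.held and state.height == 0.0:
            break  # the glass has reached the floor
        state = predict(state, "hold")
    return state


def impact_speed(state: State) -> float:
    """Speed at which the glass hit the floor, or 0.0 if it never did."""
    if not state.held and state.height == 0.0:
        return abs(state.velocity)
    return 0.0


def choose_action(state: State, actions=("hold", "release")) -> str:
    """Pick the action whose imagined outcome has the gentlest impact."""
    return min(actions, key=lambda a: impact_speed(rollout(state, a)))
```

Calling `choose_action(State(height=1.0, velocity=0.0, held=True))` returns `"hold"`, because the imagined rollout for `"release"` ends with the glass meeting the floor at well over 4 m/s. A VLA-style policy would instead map the observation straight to an action, with no explicit imagined future in between.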
Børnich’s stance is unequivocal. “The frontier is not better VLA wrappers,” he stated. “The frontier is embodied world models.”
1X’s All-In Bet and a Strategic Coup
The new 1X World Model Lab is the company’s answer to this challenge. Its mission is to build the most generalisable foundation model for humanoids from scratch. To lead this ambitious effort, 1X has poached Sam Sinha, a founding research scientist from the generative video AI darling Luma AI.
The hire is a masterstroke. Luma AI specialises in creating spookily realistic video models, a technology that is conceptually adjacent to building a world model that predicts future physical states. Sinha’s entire career has been spent at the frontier of scaling multimodal generative video. As he put it, for too long robotics has been treated as a “second-class citizen” in AI, with robot data being a “thin fine-tuning layer bolted onto a model.” The new lab aims to flip the script, treating embodied data as a core ingredient from the very start of training.
1X’s strategy relies on a “data flywheel”—a virtuous cycle of learning:
- The Foundation: Web-scale media, egocentric human videos, and simulation data.
- The Nuance: Dexterous data harvested from remote-operated robots.
- The Deployment: A fleet of NEO humanoids collecting real-world data on the fly.
- The Result: The robot collects data, the model evolves, and the robot becomes more capable. Rinse and repeat.
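The loop described above can be caricatured in a few lines of Python. This is a cartoon under heavy assumptions, not a description of 1X's pipeline: the "world" is a single function, the "model" is a nearest-neighbour lookup, and `collect`, `run_flywheel` and the other names are invented for illustration. Each round stands in for one deployment cycle in which the fleet gathers a denser set of observations and the model is retrained on everything seen so far.

```python
def true_physics(x: float) -> float:
    """The ground truth the robot is trying to learn (hidden from the model)."""
    return x * x


class NearestNeighbourModel:
    """A deliberately crude model: answer with the closest remembered example."""

    def __init__(self) -> None:
        self.memory: dict[float, float] = {}  # observed input -> observed outcome

    def train(self, samples: dict[float, float]) -> None:
        self.memory.update(samples)

    def predict(self, x: float) -> float:
        nearest = min(self.memory, key=lambda seen: abs(seen - x))
        return self.memory[nearest]


def collect(round_index: int) -> dict[float, float]:
    """Hypothetical fleet data: each round samples the world on a finer grid."""
    n = 2 ** round_index
    return {i / n: true_physics(i / n) for i in range(n + 1)}


def worst_error(model: NearestNeighbourModel) -> float:
    """Largest prediction error over a fixed evaluation grid on [0, 1]."""
    grid = [k / 100 for k in range(101)]
    return max(abs(model.predict(x) - true_physics(x)) for x in grid)


def run_flywheel(rounds: int = 4) -> list[float]:
    """Collect, retrain, evaluate, repeat; return the error after each round."""
    model = NearestNeighbourModel()
    errors = []
    for r in range(1, rounds + 1):
        model.train(collect(r))           # deployment produces new data
        errors.append(worst_error(model)) # the retrained model is re-evaluated
    return errors
```

Running `run_flywheel()` yields a list of errors that shrinks with every round. What the toy leaves out is the part that makes a real flywheel spin: a more capable robot can be deployed in more places, which in turn changes what data it is able to collect.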
An Alliance of World-Builders
1X is not alone in its philosophical conviction. The world model camp has some heavy hitters, even if they aren’t all building bipedal robots.
Tesla’s Full Self-Driving (FSD) system is perhaps the most famous real-world application of this concept. FSD relies on a “World Model” to predict the likely future actions of every car, cyclist, and pedestrian in its vicinity, running an internal simulation of plausible futures to inform its driving decisions. It isn’t just reacting; it’s anticipating.
AI luminary Yann LeCun, now leading AMI Labs after a storied career at Meta, has been a vocal proponent of world models for years. He argues that LLMs are “fundamentally incomplete” because they lack an internal model of how the world actually works. His work on Joint Embedding Predictive Architectures (JEPA) aims to build models that learn common sense by observing and predicting video—a core tenet of the world model philosophy.
The Road Ahead is Paved with Petabytes
1X’s move is a high-risk, high-reward gambit. Building a foundational world model from the ground up is an astronomically expensive and data-hungry endeavour. While the VLA camp gets a massive head start by standing on the shoulders of giants like Google and OpenAI, 1X is choosing to dig its own foundations.
The success of the 1X World Model Lab hinges on its ability to execute its data flywheel strategy at a massive scale. If it succeeds, it could create a formidable data moat and a generation of robots with a far more robust and generalisable intelligence than their VLA-powered counterparts. If it fails, it will serve as a cautionary tale of eschewing a pragmatic shortcut for an elegant but impossibly difficult ideal.
The battle lines have been drawn. Is the future of robotics a clever extension of the LLM revolution, or does it require a completely new beginning? The industry is now watching to see whether 1X’s bold bet on building the world from scratch will pay off, or whether the company will find itself stuck fine-tuning its balance sheet.
