AGI Is Not Multimodal
Tasks like repairing a car or untying a knot cannot be solved with symbols alone. Gluing modalities together creates a patchwork, not a coherent mind, so this strategy will not achieve human-level AGI.
We must rethink our path. Intelligence should treat embodied understanding and environmental interaction as primary, and specific skills such as language or vision should emerge from this core experience.
| Aspect | Multimodal / Scale Maximalist Approach | Embodied / Structuralist Approach |
|---|---|---|
| World Understanding | LLMs learn bags of syntax heuristics and superficial token-prediction rules — not a grounded model of physical reality. Semantic and pragmatic reasoning is approximated through brute-force memorization of symbol behavior. | Intelligence is situated in a physical world model enabling sensorimotor reasoning, motion planning, and social coordination — capacities that cannot be reduced to symbol manipulation. |
| Modality Integration | Separate neural modules are pre-trained per modality (text, vision, action) and stitched into a shared latent space — severing deep cross-modal connections and decentralizing “meaning” across inconsistent decoders. | Modalities naturally fuse through an interactive, embodied cognitive process; modality-specific processing emerges rather than being architecturally prescribed, blurring lines between perception streams. |
| Learning Paradigm | Optimizes for the end products of human intelligence (text, images, video) by scaling compute and data — copying human conceptual structures rather than learning to form novel concepts independently. | Learns through interaction with the environment, forming durable concepts from few examples, enabling analogical reasoning and the invention of new abstractions — a foundational attribute of general intelligence. |
| Relation to Sutton’s Bitter Lesson | Interprets the lesson as “make no structural assumptions,” yet ironically encodes implicit per-modality assumptions about how modalities should be processed and joined — contradicting its own principle. | Heeds the lesson correctly: invest in deep, human-intuited structural inductive biases (like CNNs for vision, attention for sequences) that accelerate discovery for the specific domains an AGI must master. |
| AGI Outlook | Produces impressive narrow benchmarks and Turing-test-passing chatbots, but a “Frankenstein AGI” glued from general narrow models will lack coherent, complete intelligence — especially for physical-world tasks. | Unifies perception and action under one cognitive umbrella, yielding flexible general ability at the cost of short-term efficiency — the more promising path to human-level AGI that truly feels general. |
AGI Beyond Multimodal
The article challenges the idea that multimodal scaling alone can achieve AGI. It argues that embodied understanding and physical world models are essential for true intelligence: large language models likely learn syntactic heuristics rather than genuine semantic comprehension, and gluing modalities together severs the deep cognitive connections between them. We should therefore reconsider whether scale maximalism addresses the core problem. Treating interaction and embodiment as primary may offer a more complete path toward general intelligence.
Intelligence Requires Embodiment
AGI requires more than multimodal integration, so we must prioritize embodied understanding over symbolic manipulation. LLMs learn syntax, not world models, and scaling alone cannot yield general intelligence. We should instead pursue approaches where modalities emerge from embodied interaction with the world.
“In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence.”
Scaling multimodal models will not produce true AGI, because it mistakes fluent output for genuine understanding. Such models learn narrow skills and lack the embodied reasoning needed to grasp the physical world, and general intelligence cannot be assembled from narrow modules.
A fundamental shift in research focus is required. Future systems should treat embodiment as primary and learn from situated experience, so that cognition emerges from direct interaction with the world. The path forward lies in interactive, world-engaging AI: systems that learn by doing.



