
Experts Have World Models, LLMs Have Word Models: The Simulation Gap

5 min
2/9/2026
artificial-intelligence · llms · world-models · machine-learning

The Artifact vs. The Adversary

Ask a trial lawyer if AI could replace her, and she'll scoff. Ask a startup founder, and he'll say it's already happening. They're looking at the same AI-generated legal brief. The founder sees a coherent document; the lawyer sees a document riddled with exploitable vulnerabilities. This disconnect defines a critical frontier in artificial intelligence: the chasm between producing plausible artifacts and executing robust, strategic action in a world where other agents fight back.

This is the core argument of a new analysis by Ankit Maloo, featured on Latent.Space. It posits that while experts operate with sophisticated "world models"—mental simulations of other agents' motives, hidden information, and likely reactions—today's Large Language Models (LLMs) are fundamentally "word models." They are optimized to generate the next probable token, judged in isolation, not to survive and thrive in adversarial, multi-agent environments.

The Slack Message Test: A Microcosm of the Gap

Consider a simple task: drafting a Slack message to a busy colleague for feedback. An LLM might produce a polite, deferential request: "Hi Priya, when you have a moment, could you please take a look?... No rush at all." To an outsider, this sounds perfect.

But an experienced coworker runs a simulation. They model Priya's triage heuristics under pressure. "No rush" signals low priority. A vague "take a look" feels risky and is avoided. The expert rewrites: "Hey Priya, could I grab 15 mins before Friday? Blocked on the onboarding mockups. I'm stuck on the nav pattern." This version specifies a bounded time, a concrete problem, and clear stakes. It is a move designed for the real-world environment it will enter.

The LLM, like the outsider, evaluated the text statically. The expert evaluated it as a move landing in an environment full of agents with their own models and incentives.

Chess vs. Poker: The Perfect vs. Imperfect Information Divide

This distinction maps cleanly onto game theory. Chess is a game of perfect information. All pieces are visible and the rules are symmetric. AlphaZero didn't need a theory of mind; it needed superior calculation from a known board state. LLMs excel in similar "chess-like" domains: code generation (deterministic, verifiable), math proofs, translation, and factual research.

Poker, however, is a game of imperfect information. You don't know your opponent's cards. Success requires modeling their likely hand, their perception of your hand, and their strategy based on that asymmetry. This is the realm of experts in law, negotiation, geopolitics, and medicine. As Maloo notes, "The hidden state is what turns a problem from 'just compute the best move' into 'manage beliefs and avoid being exploitable.'"
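
To see why the hidden state changes the computation, here is a minimal sketch in Python; the hands, payoffs, and probabilities are invented for illustration and are not drawn from the article. In a chess-like game you score a move against the one known state; in a poker-like game the same move's value is an expectation over a belief about what the opponent is holding.

    # Minimal, hypothetical sketch; hands, payoffs, and probabilities are invented.

    # Chess-like (perfect information): the state is fully known, so picking a move
    # is just scoring each candidate against that one state and taking the best.
    def best_move(moves, evaluate_against_known_state):
        return max(moves, key=evaluate_against_known_state)

    # Poker-like (imperfect information): the opponent's holding is hidden, so a
    # move's value is an expectation over a belief about that hidden state.
    def action_value(action, belief, payoff):
        # belief: {hidden_hand: probability}; payoff(action, hidden_hand) -> float
        return sum(p * payoff(action, hand) for hand, p in belief.items())

    payoff = lambda action, hand: {
        ("bet", "strong"): -2.0, ("bet", "weak"): +1.0,
        ("check", "strong"): -1.0, ("check", "weak"): 0.0,
    }[(action, hand)]

    for belief in ({"strong": 0.3, "weak": 0.7}, {"strong": 0.7, "weak": 0.3}):
        ranked = sorted(("bet", "check"), key=lambda a: -action_value(a, belief, payoff))
        print(belief, "->", ranked[0])
    # The best move flips when the belief shifts: the hidden state, not the surface
    # form of the action, determines its value.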

AI benchmarking research is now confronting this. Google DeepMind recently announced it is expanding its AI benchmarks beyond chess to include poker and social deduction games like Werewolf, explicitly to test "social deduction and calculated risk."

Why LLMs Are Inherently Exploitable

The fundamental mismatch is in the training signal. LLMs are refined via Reinforcement Learning from Human Feedback (RLHF) to be helpful, harmless, and honest—traits that score well in one-shot, cooperative evaluations. Domain experts, however, are trained by the environment itself: a weak argument gets countered; a poorly framed concession is exploited; a vague request is deprioritized.

This creates a fatal asymmetry. An LLM prompted to be an "aggressive negotiator" will execute that strategy consistently. A human counterparty can probe, detect that pattern, and exploit its predictability. The LLM doesn't know it's being modeled. It lacks the recursive loop: "I think they think I'm weak, so they'll bet, so I should trap."

Contrast this with Meta's Pluribus poker AI. As Noam Brown explained, Pluribus would "calculate how it would act with every possible hand, being careful to balance its strategy across all the hands so as to remain unpredictable." Its moves were designed to be unexploitable, not just reasonable-sounding. LLMs, optimized for agreeable output, are the opposite: highly readable and consistently exploitable.
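
Brown's point about balance can be reproduced in a toy half-street betting game. This is a standard textbook construction, not anything Pluribus actually runs, and the pot and bet sizes below are arbitrary.

    # Toy half-street game (illustrative, not Pluribus): pot = 1, bet = 1.
    # The bettor holds a strong or weak hand with equal probability, always bets
    # strong hands, and bluffs weak hands with probability b. The caller holds a
    # bluff-catcher and calls a bet with probability c.

    def bettor_ev(b, c):
        ev_strong = (1 - c) * 1 + c * 2        # steal the pot, or win pot plus the call
        ev_weak = b * ((1 - c) * 1 - c * 1)    # bluff: steal the pot, or lose the bet
        return 0.5 * ev_strong + 0.5 * ev_weak

    def ev_vs_best_response(b):
        # A human-like opponent probes, detects the pattern, and best-responds.
        return min(bettor_ev(b, c) for c in (0.0, 0.25, 0.5, 0.75, 1.0))

    print(ev_vs_best_response(1.0))   # "always aggressive" prompt-style policy -> 0.5
    print(ev_vs_best_response(0.0))   # never bluffs -> 0.5
    print(ev_vs_best_response(0.5))   # balanced bluffing frequency -> 0.75, for any c

The always-aggressive policy is readable and gets ground down by a best-responding opponent; the balanced one gives up nothing however the caller adapts, which is what "unexploitable" means here.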

Real-World Stakes: Medicine and Autonomous Driving

The consequences of this gap are moving from the theoretical to the acutely practical, as recent studies and industry shifts show.

A randomized, preregistered study published in Nature Medicine tested LLMs (GPT-4o, Llama 3, Command R+) as medical assistants for the general public. When given full clinical scenario text, the models correctly identified conditions 94.9% of the time. However, when interacting with real human participants who didn't know what details to provide, that performance plummeted to below 34.5%.

The researchers concluded: "None of the tested language models were ready for deployment in direct patient care." The problem wasn't raw medical knowledge but the inability to navigate the hidden state of a patient's unspoken symptoms, ask the right clarifying questions, and convey appropriate uncertainty—a quintessential "poker" problem with life-and-death stakes.

Separately, a study in npj Digital Medicine found that LLMs across generations and sizes are poorly calibrated, often presenting incorrect information with high, unwarranted confidence. This lack of reliable self-assessment makes them dangerous in clinical contexts.
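
"Poorly calibrated" has a standard operationalization: stated confidence should track empirical accuracy. A minimal sketch of expected calibration error, using invented predictions rather than data from the study, looks like this:

    # Minimal ECE sketch with invented data (not from the npj Digital Medicine study).
    # Each prediction carries a stated confidence and a flag for whether it was correct.
    preds = [(0.95, False), (0.90, True), (0.85, False), (0.99, True),
             (0.70, True), (0.65, False), (0.97, False), (0.92, True)]

    def expected_calibration_error(preds, n_bins=4):
        bins = [[] for _ in range(n_bins)]
        for conf, correct in preds:
            idx = min(int(conf * n_bins), n_bins - 1)   # bucket by stated confidence
            bins[idx].append((conf, correct))
        ece = 0.0
        for bucket in bins:
            if not bucket:
                continue
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            ece += (len(bucket) / len(preds)) * abs(avg_conf - accuracy)
        return ece

    print(expected_calibration_error(preds))  # high value: confidence far from accuracy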

Meanwhile, companies are investing heavily in "world models" to bridge simulation gaps in other fields. Waymo, for autonomous driving, is leveraging Google's Genie 3 model to create photorealistic, interactive simulations. The goal is to train vehicles on "rare, unpredictable events" beyond their logged camera and lidar data. This is a spatial and physical world model, distinct from the multi-agent social kind, but driven by the same core idea: training on realistic dynamics, not just static patterns.

Closing the Loop: The Path Forward for AI

The solution is not simply more scale or smarter models. As Maloo argues, more raw "IQ" doesn't solve a missing training loop. The fix requires a paradigm shift in how we train AI.

  • Multi-Agent Adversarial Training: Models must be trained in environments where other self-interested agents react, probe, and adapt. The grading must shift from "does this output sound good?" to "did this action achieve the objective without being exploited?" A schematic sketch of such a loop follows this list.
  • Outcome-Based Rewards: Instead of judging text artifacts, systems need feedback based on real-world outcomes: Did you get the review? Did you concede leverage? Did the patient get correct, actionable advice?
  • Recursive Modeling: AI agents must develop the ability to model that they are being modeled by others and adjust their strategies accordingly, moving beyond consistent, prompt-driven behavior.
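
None of these are spelled out as concrete recipes in the source, but the shape of the loop they imply can be sketched schematically. Everything below, the negotiation environment, the adapting counterparty, and the update rule, is a placeholder for illustration, not a proposed algorithm.

    # Schematic sketch only: toy environment, placeholder policy, and a naive
    # update rule. The point is the shape of the loop: the agent is graded by
    # the outcome against an adapting counterparty, not by how its output reads.
    import random

    def negotiate(offer, adversary):
        # Toy environment with hidden state: the counterparty accepts only if the
        # offer clears its private reservation value, which the agent never sees.
        return (1.0 - offer) if offer >= adversary["reservation"] else 0.0

    def adversary_adapt(adversary, recent_offers):
        # The counterparty probes and exploits predictability: if recent offers
        # comfortably clear its demand, it holds out for more next time.
        if len(recent_offers) == 5 and min(recent_offers) > adversary["reservation"] + 0.05:
            adversary["reservation"] = min(0.9, adversary["reservation"] + 0.02)

    policy = {"mean": 0.6, "spread": 0.1}        # stand-in for model parameters
    adversary = {"reservation": 0.4}
    history, best = [], (0.0, 0.6)               # best (reward, offer) seen so far

    for step in range(500):
        offer = min(1.0, max(0.0, random.gauss(policy["mean"], policy["spread"])))
        reward = negotiate(offer, adversary)     # outcome-based reward, not a text judge
        history.append(offer)
        adversary_adapt(adversary, history[-5:])
        if reward > best[0]:
            best = (reward, offer)
        policy["mean"] += 0.05 * (best[1] - policy["mean"])   # naive placeholder update

    print(policy["mean"], adversary["reservation"])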

This represents a flip from the "age of scaling" back to an "age of research" focused on novel architectures and training regimes. The frontier is no longer just bigger models, but models that understand the world as a game of hidden information and adaptive opponents.

The Bottom Line

The debate over AI replacing expert jobs often confuses artifact quality with strategic competence. LLMs can produce outputs that look expert to outsiders who judge coherence and tone. Experts judge robustness in adversarial environments where every move is met with a countermove.

LLMs produce artifacts that look expert. They do not yet produce moves that survive experts. Until they can simulate the multi-agent world with its hidden states and recursive reasoning, their application in high-stakes domains like law, medicine, negotiation, and strategy will remain limited—and dangerously exploitable. The race is now on to build AI that doesn't just know words, but understands worlds.