A systematic examination of why large language models fail at population inference and what the architecture requires.
There is a category error embedded in how large language models are currently used for audience prediction. The error is not a matter of prompt engineering or model scale. It is architectural, and understanding it is a prerequisite for evaluating any AI-based approach to predicting how populations respond to messages, brands, or policies.
The error can be stated precisely: response imitation and population inference are different computational tasks, and optimizing for the former does not produce competence at the latter. Language models are trained to predict the most probable continuation of a text sequence given a training corpus. When asked to simulate how a target audience will respond to a message, they apply this same mechanism: generating the text most likely to follow the prompt, conditioned on whatever surface-level representation of the target group the prompt contains. The output sounds plausible because plausibility is what the model is optimized for.
Plausibility and accuracy are not the same thing. This distinction matters most when the target audience is meaningfully different from the distribution of voices in the training data, or when the question requires modeling cognitive dimensions (beliefs, values, trust heuristics) that are not recoverable from surface-level text patterns.
An Empirical Illustration
The failure mode has a characteristic signature: overconfident predictions that are systematically wrong in predictable directions. Consider a representative example. Asked what percentage of UK adults perceive Apple as innovative, GPT 5.2 produced an estimate of approximately 65%. Independent survey data from a sample of 517 UK respondents put the actual figure at 29%. The error was 36 percentage points, with no accompanying uncertainty estimate.
This is not an isolated failure. Across a controlled benchmark involving four global brands and 48 perceptual attributes each (192 measurements total), GPT 5.2 produced a mean absolute error of 35.2 percentage points against survey ground truth. The errors were not distributed randomly: they were systematically biased toward what internet-scale training data would suggest about these brands, rather than what a representative sample of UK adults actually believes. The model was not predicting population responses; it was predicting the aggregate sentiment of its training corpus.
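To make the error metric concrete, here is a minimal sketch of how a mean absolute error of this kind is computed against survey ground truth. The attribute names and figures are illustrative placeholders, not the benchmark data.

```python
# Minimal sketch of the benchmark arithmetic: scoring model point estimates
# against survey ground truth, attribute by attribute. Figures are illustrative.

model_estimates = {"innovative": 65.0, "trustworthy": 58.0, "good_value": 47.0}
survey_ground_truth = {"innovative": 29.0, "trustworthy": 41.0, "good_value": 33.0}

errors = [
    abs(model_estimates[attr] - survey_ground_truth[attr])
    for attr in survey_ground_truth
]
mean_absolute_error = sum(errors) / len(errors)  # in percentage points

print(f"MAE: {mean_absolute_error:.1f} percentage points")
```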
The Mechanism
The failure has a clear mechanistic explanation. When a model generates a response “as” a target audience, it is performing next-token prediction conditioned on a prompt that describes that audience. It has no structured internal representation of that audience’s demographic composition, belief distribution, or trust heuristics. It cannot weight its response by population proportions because population structure is not a variable in the generation process. And it has no mechanism for distinguishing between the signal available about a given population in its training data and the actual characteristics of that population.
This produces the characteristic overconfidence pattern: because the model is generating text that sounds like what a knowledgeable person might say about the audience, it produces a single point estimate rather than a distribution, and it anchors to whatever the training data most strongly associates with the target group. In data-rich contexts (major US brands, English-speaking markets), the training data may be sufficiently representative to produce estimates that are directionally correct. In data-sparse contexts (smaller markets, underrepresented demographic groups, non-English media ecosystems), the errors are both larger and less predictable.
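The contrast is easiest to see schematically. In a prompt-based simulation, the population never appears as a variable; in a population-weighted estimate, it is the central input. The segment shares and per-segment rates below are hypothetical:

```python
# Schematic contrast, not a working system. The prompt-based path has no
# population variable at all; the structured path makes population weights
# the central input. Segment shares and per-segment rates are hypothetical.

# Path 1: prompt-based "simulation" -- the audience exists only as text.
prompt = "You are a UK adult. Is Apple innovative? Answer yes or no."
# next_token_prediction(prompt) -> whatever continuation the corpus makes most likely

# Path 2: population inference -- the audience is a structured input.
segments = {
    # segment: (share of population, P(perceives brand as innovative))
    "18-34": (0.27, 0.40),
    "35-54": (0.33, 0.28),
    "55+":   (0.40, 0.22),
}
population_estimate = sum(share * rate for share, rate in segments.values())
print(f"Weighted estimate: {population_estimate:.0%}")
```

The specific numbers are beside the point; what matters is that the second path cannot even be expressed inside a next-token prediction call, because the generation process has no slot for population structure.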
Independent Corroboration
This analysis is corroborated by independent academic research. In February 2026, researchers at Stanford University published HumanLM, a framework for evaluating audience simulation approaches across 23,000 real users and 227,000 responses. Their central finding: models trained on response imitation (supervised fine-tuning to predict the text of human responses) performed worst among all approaches evaluated. The authors attribute this to the same mechanism described above: training on surface-level language patterns causes models to miss the higher-order cognitive dimensions that actually determine how people respond.
The Stanford research also provides a positive finding: models that explicitly generate latent cognitive dimensions (stated beliefs, values, stances, emotional orientations) before producing responses achieved 16.3% higher alignment with ground-truth human responses than the best response-imitation alternatives. This suggests that the relevant architectural requirement is not more data or larger models, but a fundamentally different approach to representing the audience as a structured input.
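The shape of that approach can be sketched in outline. This is not the Stanford implementation, only an illustration of the two-stage idea: make the latent cognitive dimensions explicit first, then condition the response on them. The function and field names below are hypothetical stand-ins for what would be learned models.

```python
# Two-stage sketch of the idea, not the HumanLM implementation. The profile
# fields and both functions are hypothetical stand-ins for learned components.

def infer_cognitive_profile(audience_profile: dict) -> dict:
    """Stage 1: make latent dimensions explicit before any response is generated."""
    # A real system would infer these from data; here they are placeholders.
    return {
        "beliefs": audience_profile.get("beliefs", []),
        "values": audience_profile.get("values", []),
        "stance": "skeptical",
        "emotional_orientation": "neutral",
    }

def predict_response(message: str, cognitive_state: dict) -> str:
    """Stage 2: condition the response on the structured state, not on prose."""
    stance = cognitive_state["stance"]
    return f"[{stance} reading of: {message!r}]"

profile = {"beliefs": ["price matters more than brand"], "values": ["practicality"]}
print(predict_response("New flagship phone launch", infer_cognitive_profile(profile)))
```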
What the Architecture Requires
Accurate population inference requires that an audience be represented as a structured input to the prediction process, not as a natural language description in a prompt. That structured representation needs to encode, at minimum, the demographic composition of the target population, the distribution of beliefs and values relevant to the content domain, the cultural and media context that shapes how information is processed, and the trust relationships that determine which sources are weighted as credible.
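One possible shape for such a representation, sketched minimally. The fields and values are illustrative assumptions, not a prescribed schema.

```python
# A minimal sketch of an audience as a structured input rather than a prompt
# description. Field names and values are illustrative, not a prescribed schema.

from dataclasses import dataclass, field

@dataclass
class AudienceModel:
    # Demographic composition: proportions over segments, summing to 1.0
    demographics: dict[str, float] = field(default_factory=dict)
    # Distribution of domain-relevant beliefs: proportion agreeing with each
    beliefs: dict[str, float] = field(default_factory=dict)
    # Cultural and media context shaping how information is processed
    media_context: list[str] = field(default_factory=list)
    # Trust weights: how credible each source type is to this audience
    trust: dict[str, float] = field(default_factory=dict)

uk_adults = AudienceModel(
    demographics={"18-34": 0.27, "35-54": 0.33, "55+": 0.40},
    beliefs={"tech brands overcharge": 0.6},
    media_context=["public-broadcaster-led news diet", "high smartphone penetration"],
    trust={"public broadcaster": 0.7, "brand advertising": 0.3},
)
```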
It also requires calibration against real human data. A model that has never been tested against population ground truth has no mechanism for correcting the systematic biases that emerge from training data distributions. The validation loop is not optional: it is what distinguishes a measurement instrument from a generative text system being asked to do something it was not designed for.
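A sketch of what that calibration step can look like in its simplest form: fit a correction that maps raw model estimates onto matched survey measurements, then apply it to new estimates. The paired figures are placeholders, and a real loop would use the full benchmark rather than four points.

```python
# Sketch of the calibration step: fit a simple correction mapping raw model
# estimates onto survey ground truth, then apply it to new estimates.
# All figures are placeholders, not real survey data.

model_est = [65.0, 52.0, 71.0, 38.0]   # raw model estimates (percentage points)
survey    = [29.0, 41.0, 44.0, 31.0]   # matched survey measurements (percentage points)

# Ordinary least squares fit of survey = slope * model_est + intercept
n = len(model_est)
mx = sum(model_est) / n
my = sum(survey) / n
slope = (
    sum((x - mx) * (y - my) for x, y in zip(model_est, survey))
    / sum((x - mx) ** 2 for x in model_est)
)
intercept = my - slope * mx

def calibrated(raw_estimate: float) -> float:
    """Apply the survey-anchored correction to a raw model estimate."""
    return slope * raw_estimate + intercept

print(f"raw 65 -> calibrated {calibrated(65.0):.0f}")
```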