From Zealots to Poisoned Data: What Social Learning Can Teach Us About Securing LLMs

4th November 2025 | Research blog

The Alan Turing Institute’s recent blog post, ‘LLMs may be more vulnerable to data poisoning than we thought’, highlights a troubling vulnerability in large language models (LLMs): even a tiny fraction of poisoned data, as few as 250 documents, can reliably implant backdoors across models of vastly different sizes. This finding challenges the assumption that large-scale models are inherently robust to small-scale manipulation and raises urgent questions about how we assess reliability in AI systems.

Work from the INFORMED AI Hub by Jonathan Lawry (University of Bristol), Zealot Detection in Probabilistic Social Learning Using Bounded Confidence, offers a compelling parallel from the world of agent-based modelling. In Lawry’s framework, agents learn by pooling their beliefs with those of others, but only when those beliefs are sufficiently similar to their own. This “bounded confidence” approach helps agents detect and isolate zealots: individuals who stubbornly promote fixed, often false, beliefs regardless of evidence or peer input.
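
To give a feel for the mechanism, here is a minimal illustrative sketch in Python of bounded-confidence belief pooling with zealots. It is not Lawry’s actual model: the pooling rule (simple averaging with one random peer), the Bayesian evidence step, and every parameter below (EPSILON, EVIDENCE_ACC, the agent counts) are assumptions made purely for illustration.

```python
import random

# A minimal, self-contained sketch of bounded-confidence belief pooling
# with zealots. The pooling rule and all parameters are assumptions made
# for illustration; this is not Lawry's exact model.

N_AGENTS = 100       # ordinary learning agents
N_ZEALOTS = 20       # zealots: never update, always report a fixed belief
TRUE_STATE = True    # ground truth of the binary hypothesis
EVIDENCE_ACC = 0.6   # probability a noisy observation matches the truth
EPSILON = 0.2        # bounded-confidence threshold on belief distance
STEPS = 200

def bayes_update(p, observation_says_true):
    """Update belief p = P(hypothesis is true) on one noisy observation."""
    like_true = EVIDENCE_ACC if observation_says_true else 1 - EVIDENCE_ACC
    like_false = 1 - like_true
    return p * like_true / (p * like_true + (1 - p) * like_false)

beliefs = [0.5] * N_AGENTS   # ordinary agents start undecided
zealot_belief = 0.35         # fixed belief leaning towards the false state

for _ in range(STEPS):
    for i in range(N_AGENTS):
        # 1. Private evidence: a noisy observation, handled by Bayes' rule.
        obs_says_true = (random.random() < EVIDENCE_ACC) == TRUE_STATE
        beliefs[i] = bayes_update(beliefs[i], obs_says_true)

        # 2. Social step: pool with a random peer (possibly a zealot), but
        #    only if the peer's belief lies within the confidence window.
        j = random.randrange(N_AGENTS + N_ZEALOTS)
        peer_belief = beliefs[j] if j < N_AGENTS else zealot_belief
        if abs(beliefs[i] - peer_belief) < EPSILON:
            beliefs[i] = 0.5 * (beliefs[i] + peer_belief)  # linear pooling
        # Otherwise the peer is ignored: zealots influence agents only while
        # their fixed belief sits inside the confidence window, and are
        # screened out as evidence pulls ordinary agents towards the truth.

print(f"mean belief of ordinary agents: {sum(beliefs) / N_AGENTS:.2f}")
print(f"zealot belief (never updated):  {zealot_belief:.2f}")
```

In a typical run, the ordinary agents’ mean belief climbs towards the true state while the zealots’ fixed belief is left behind and ignored, which is exactly the filtering effect the bounded-confidence rule is designed to produce.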

Lawry’s work reveals both the strengths and the weaknesses of collectively applying simple heuristics that help agents filter out unreliable sources, even in noisy environments or when those sources are numerous. The Turing study, by contrast, suggests that LLMs lack sufficient internal mechanisms to detect and reject poisoned inputs, making them vulnerable to subtle, persistent manipulation. By treating training data as a social network of beliefs, there may therefore be scope to develop tailored decentralised rules that detect poisoned inputs before they corrupt the model. As AI systems become more embedded in decision-making, learning from social models like Lawry’s may be key to building resilient, trustworthy LLMs.
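
As a purely speculative illustration of that idea (not something proposed in Lawry’s work or in the Turing post), the sketch below treats training documents as “agents” whose embedding vectors stand in for beliefs, and applies a bounded-confidence-style filter before training: documents that sit persistently far from the pooled consensus of the corpus are excluded. The synthetic embeddings, the bounded_confidence_filter helper, and the radius are all hypothetical.

```python
import numpy as np

# Purely speculative sketch: training documents as "agents" whose embedding
# vectors stand in for beliefs, with a bounded-confidence filter applied
# before training. The synthetic embeddings, the radius and the poisoning
# pattern are all assumptions; nothing here is a tested defence.

rng = np.random.default_rng(0)
DIM = 32
N_CLEAN, N_POISONED = 1000, 25   # a tiny poisoned fraction, echoing the
                                 # scale highlighted in the Turing post

clean = rng.normal(0.0, 1.0, size=(N_CLEAN, DIM))
# Poisoned documents stubbornly push the same fixed pattern, zealot-style.
poisoned = rng.normal(0.0, 0.1, size=(N_POISONED, DIM)) + 6.0
docs = np.vstack([clean, poisoned])

RADIUS = 2.0 * np.sqrt(DIM)      # bounded-confidence radius (assumed)

def bounded_confidence_filter(X, radius, iterations=5):
    """Keep only documents that stay within `radius` of the pooled consensus."""
    keep = np.ones(len(X), dtype=bool)
    for _ in range(iterations):
        consensus = X[keep].mean(axis=0)              # pool accepted documents
        dist = np.linalg.norm(X - consensus, axis=1)  # distance to consensus
        keep = dist < radius                          # re-admit or exclude
    return keep

keep = bounded_confidence_filter(docs, RADIUS)
print("clean documents kept:   ", int(keep[:N_CLEAN].sum()), "/", N_CLEAN)
print("poisoned documents kept:", int(keep[N_CLEAN:].sum()), "/", N_POISONED)
```

In this toy setting the fixed, out-of-distribution pattern is excluded within a couple of iterations while almost all clean documents are retained; whether anything like this scales to real pre-training corpora is precisely the open question the parallel raises.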

*The Hub acknowledges the use of Microsoft Copilot to assist with the drafting of this blog.