LLMs Killed Practical Obscurity: Online Anonymity Was Always a Lie
A new paper shows LLM agents deanonymize pseudonymous users at 67% recall with 90% precision, for $1–4 per identification. The real story is that online anonymity was never structurally guaranteed — just expensive to break.
A paper published last week demonstrates that LLM (large language model) agents can link pseudonymous Hacker News accounts to real LinkedIn identities at 67% recall with 90% precision, for a cost of $1–4 per identification. The technical press has covered it as a privacy alarm bell. The framing is understandable but slightly off.
This is not a new vulnerability. It is the end of a 35-year bet that exploiting an old vulnerability would always cost more than it was worth.
The Paper
Lermen, Paleka, Swanson, Aerni, Carlini, and Tramèr (arXiv:2602.16800, February 2026) built what they call the ESRC pipeline — Extract, Search, Reason, Calibrate — to deanonymize pseudonymous online users at scale. The pipeline works on raw unstructured text: forum posts, comment threads, interview transcripts. No predefined schemas, no domain-specific feature engineering.
The four stages:
- Extract — an LLM reads a user's post history and extracts structured biographical attributes: location, profession, interests, specific events mentioned, writing patterns
- Search — dense embeddings match against a large candidate pool (in the Hacker News experiment, the full LinkedIn public graph)
- Reason — extended chain-of-thought reasoning verifies top candidates against extracted features
- Calibrate — confidence scoring lets you set a precision target and accept the corresponding recall
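The four stages can be sketched end to end with toy stand-ins for the LLM calls. Everything below is illustrative, not the paper's implementation: a keyword counter plays Extract, bag-of-words cosine similarity plays the embedding Search, and a confidence threshold plays Calibrate (the Reason stage, an LLM verifying top candidates, is folded into that threshold here).

```python
# Minimal sketch of the ESRC (Extract, Search, Reason, Calibrate) flow,
# with toy stand-ins replacing every LLM call. Names are illustrative.
from collections import Counter
import math

def extract(posts):
    """Extract: reduce a post history to a bag of biographical tokens."""
    tokens = Counter()
    for post in posts:
        tokens.update(post.lower().split())
    return tokens

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(profile, candidates, k=3):
    """Search: rank the candidate pool by embedding similarity."""
    scored = [(cosine(profile, extract(texts)), name)
              for name, texts in candidates.items()]
    return sorted(scored, reverse=True)[:k]

def calibrate(ranked, threshold):
    """Calibrate: emit a match only above the confidence threshold,
    trading recall for precision. (Reason — LLM verification of the
    top candidates — is elided into this step.)"""
    score, name = ranked[0]
    return name if score >= threshold else None

pseudonymous_posts = ["shipping rust tooling from my office in zurich",
                      "zurich winters make remote work easy"]
candidate_pool = {
    "Alice (Zurich, Rust dev)": ["rust developer based in zurich"],
    "Bob (Austin, JS dev)": ["javascript engineer in austin texas"],
}
ranked = search(extract(pseudonymous_posts), candidate_pool)
match = calibrate(ranked, threshold=0.2)
```

Raising the threshold yields fewer, surer matches; that is the precision-for-recall dial the paper's Calibrate stage exposes.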
Against the Hacker News → LinkedIn dataset, this identified 226 of 338 targets correctly at 90% precision. For comparison, the best classical baseline — the kind of approach used for the 2008 Netflix Prize attack — achieved near 0% recall on the same data.
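The headline figures reduce to simple arithmetic, worth making explicit: recall is correct identifications over total targets, and 90% precision means roughly one in ten emitted identifications is wrong.

```python
# The paper's headline numbers, unpacked. Recall = correct IDs / targets;
# at 90% precision, ~10% of the identifications the pipeline emits are wrong.
correct = 226
targets = 338
precision = 0.90

recall = correct / targets            # ~0.67, the "67%" figure
emitted = correct / precision         # ~251 total identifications made
false_positives = emitted - correct   # ~25 wrong matches shipped alongside
```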
A companion paper by Tianshi Li (arXiv:2601.05918) independently attacked Anthropic's published interview dataset using off-the-shelf LLMs with web search — no specialized pipeline, a few natural-language prompts — and linked 6 of 24 anonymized scientist interviews to specific published works, in some cases uniquely identifying the interviewees. The Lermen et al. paper separately estimates that its pipeline re-identifies at least 9 of the 125 scientist participants from the same dataset.
The Register's coverage anchored this accurately to Latanya Sweeney's 2000 work: the structural problem is not new. What LLMs changed is the cost of exploitation.
What "Practical Obscurity" Actually Meant
The phrase "practical obscurity no longer holds" in the Lermen et al. paper is not casual language. It is a reference to a specific legal doctrine with a specific origin.
The term enters US law in U.S. Department of Justice v. Reporters Committee for Freedom of the Press, 489 U.S. 749 (1989). The Supreme Court was asked whether FOIA (the Freedom of Information Act) required the FBI to release compiled criminal rap sheets. The Court held no — and the reasoning is instructive. The underlying records were individually public, accessible at courthouses around the country. But the Court found a privacy interest anyway, because actually compiling them required prohibitive effort. Justice Stevens wrote that "substantial privacy interests can exist in personal information... even though the information has been made available to the general public at some place and point in time."
The doctrine, stated plainly: technically public + prohibitively costly to access = effectively private.
This was always a bet on friction. The Court was not saying the data was private in any structural sense. It was saying the cost of aggregation was high enough to constitute a meaningful barrier. That bet held as long as the cost held.
Legal scholars were already arguing the doctrine was "misguided in principle and unworkable in practice" in 2015 — before LLMs. The internet made aggregation cheap for structured data. LLMs made it cheap for unstructured text. The erosion was a two-step process; the Lermen et al. paper documents the second step.
The Cost-Reduction Timeline
The trajectory of deanonymization research follows a consistent pattern over 25 years: each advance reduces cost and loosens requirements on data structure.
2000 — Sweeney's quasi-identifiers. Latanya Sweeney demonstrated that an estimated 87% of Americans are uniquely identifiable by ZIP code, birthdate, and gender alone — three fields present in virtually every public dataset. She proved the point by purchasing Massachusetts voter rolls for $20 and cross-referencing them with an "anonymized" state employee health insurance dataset: Governor William Weld's medical records emerged from the intersection. The problem was never that names were retained. The problem was that removing names while keeping quasi-identifiers provides no structural protection.
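The mechanism is easy to demonstrate: count how many records in a "de-identified" dataset are unique on the quasi-identifier triple, because each unique triple is a join key into any other dataset that carries the same three fields. The records below are invented; the mechanism is Sweeney's.

```python
# Toy illustration of quasi-identifier uniqueness. A record that is the
# only one with its (ZIP, birthdate, gender) triple can be re-identified
# by joining against any dataset that names the person and shares the triple.
from collections import Counter

records = [  # (zip, birthdate, gender, name) — all invented
    ("02138", "1945-07-31", "M", "Weld"),
    ("02138", "1962-03-14", "F", "Smith"),
    ("02139", "1962-03-14", "F", "Jones"),
    ("02139", "1962-03-14", "F", "Brown"),  # shares a triple with Jones
]

triples = Counter((z, d, g) for z, d, g, _ in records)
unique = [name for z, d, g, name in records if triples[(z, d, g)] == 1]
fraction_unique = len(unique) / len(records)
```

In this toy set half the records are unique on the triple; Sweeney's estimate was that for the real US population the figure is about 87%.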
2006 — AOL's search logs. AOL released 20 million "anonymized" search queries from 650,000 users, replacing usernames with numerical IDs. Within days, New York Times reporters identified "user 4417749" as Thelma Arnold, a 62-year-old widow in Lilburn, Georgia, from queries about landscapers in her county and properties in her neighborhood. The lesson: unstructured text is a biography. You cannot separate the content from the identity it reveals.
2008 — Netflix Prize. Narayanan and Shmatikov demonstrated that the Netflix Prize dataset could be deanonymized with surprisingly sparse auxiliary information: just two ratings with approximate dates were sufficient to uniquely identify 68% of the 500,000-user dataset by matching against IMDb reviews, and 84% could be identified using six movie ratings outside the top 500. The insight: a small number of data points that are individually non-identifying become collectively equivalent to a unique identifier. The Lermen et al. paper explicitly compares LLM methods against Netflix-Prize-style classical approaches on their own datasets — the classical approach achieves near 0%.
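The core of the Narayanan–Shmatikov attack is that a handful of (movie, approximate date) pairs act as a fingerprint: score each anonymized record by how many of the attacker's auxiliary ratings it matches within a date window, and the true record dominates. A sketch under invented data:

```python
# Sketch of Netflix-Prize-style sparse matching: the attacker knows a few
# (movie, date) pairs from public IMDb reviews and scores each anonymized
# record by matches within a date window. Data is invented for illustration.
from datetime import date

def matches(aux, record, window_days=3):
    score = 0
    for movie, d in aux:
        for rec_movie, rec_d in record:
            if movie == rec_movie and abs((d - rec_d).days) <= window_days:
                score += 1
    return score

# Attacker's auxiliary knowledge: two dated public reviews.
aux = [("obscure_film_a", date(2005, 6, 1)),
       ("obscure_film_b", date(2005, 9, 10))]

anonymized = {
    "user_123": [("obscure_film_a", date(2005, 6, 2)),
                 ("obscure_film_b", date(2005, 9, 9))],
    "user_456": [("blockbuster", date(2005, 6, 1))],
}
best = max(anonymized, key=lambda u: matches(aux, anonymized[u]))
```

Two obscure ratings beat one popular one because rare items carry far more identifying information, which is why the original attack emphasized movies outside the top 500.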
2013 — Bulk metadata. Snowden disclosures revealed NSA (National Security Agency) collection of telephone metadata from major telcos covering millions of Americans with no individualized suspicion. General Hayden, former NSA and CIA (Central Intelligence Agency) director, later said: "We kill people based on metadata." The government had already concluded that call records — no content, only metadata — suffice to identify behavioral patterns, relationships, and location. The gap between protecting content while allowing metadata collection became operationally significant.
2026 — Unstructured text at API (application programming interface) cost. The Lermen et al. paper closes the gap that all previous attacks left open: unstructured, cross-platform, public text. Prior attacks worked on structured data (ratings, query logs) or required human investigators to read and reason over text. LLMs do the reading and reasoning automatically. The remaining constraint — that deanonymization required specialized expertise and significant effort — is gone.
Each step in this timeline reduced cost and eliminated a prerequisite. The 2026 paper is not an anomaly. It is the endpoint of a trend that was visible by 2008.
The Shadow Profile Problem
Every deanonymization attack requires two things: the pseudonymous data to be re-identified, and a reference database of real identities to match against. Most coverage focuses on the first half. The second half has its own disturbing history.
The reference database — LinkedIn profiles, Facebook accounts, the broader social graph — was not built entirely from data individuals voluntarily provided. A significant portion of it was assembled without knowledge or consent.
Facebook's shadow profiles are the clearest example. When users upload their contact books, Facebook extracts phone numbers and emails for every contact, including people who have never created an account. Multiple users uploading the same non-member's contact information creates an aggregate dossier. In 2013, a Facebook bug exposed this: the breach revealed contact details for six million accounts, including data for people who had never signed up — extracted from friends' phones. Facebook had denied shadow profiles existed. The bug made denial impossible.
The non-user cannot opt out of a profile they were never told exists. They cannot delete data they cannot see. As one description put it: "a zombie you — sewn together from scattered bits of your personal data — is still sitting there in sort-of-stasis on its servers."
Facebook is not unique. The data broker industry is the industrial version of the same practice. Brokers collect from public records, loyalty programs, mobile apps, location tracking, credit data, and other brokers — building profiles on people they have no direct relationship with. In the US, there is no federal law requiring consent for data collection and sale. Brokers legally sell to law enforcement without warrants. The FTC has taken enforcement action against brokers for tracking individuals visiting medical clinics and religious institutions — after the fact.
The full attack pipeline is: LLMs aggregate pseudonymous online activity → match against real-identity reference database → the real-identity database was itself built from data individuals never consented to provide. You are being deanonymized against a profile you were enrolled in without asking.
This is the structural failure that stays below the surface of most coverage. The capability problem — LLMs reading and reasoning over text — gets all the attention. The infrastructure problem — the reference database that makes matching possible — has existed for decades and is largely legal.
What Actually Changed
It is worth being precise about what LLMs changed, because "LLMs changed everything" is imprecise.
Deanonymization was always theoretically possible. The constraint was cost: human investigators reading text, forming hypotheses, searching for matches, verifying candidates. That process was expensive in time and expertise. It did not scale. A skilled investigator could identify a specific target given enough time; that same investigator could not process 338 targets in parallel.
LLMs changed the unit economics of reading and reasoning over text. The ESRC pipeline runs the equivalent of investigator work at API pricing. The process that required hours per target now costs $1–4 and runs in parallel across an arbitrarily large list.
There is a secondary change that receives less attention: the attack no longer requires structured data. Every previous automated deanonymization method required some predefined structure — explicit fields, schemas, ratings matrices. Unstructured text (posts, comments, forum threads) was the remaining safe harbor, because humans had to read it. That safe harbor is gone.
The defenses the paper proposes are honest about their limits:
- Platform rate limiting raises the cost per identification but does not eliminate the attack; the paper notes that all such mitigations can be circumvented given sufficient motivation.
- LLM provider refusal guardrails are ineffective because the pipeline decomposes into individually benign tasks — summarize this profile, embed this text, rank these candidates. No single step looks like a deanonymization attack.
- Open-source models render all API-level controls irrelevant, per the paper's own defense analysis. Guardrails can be removed. There is no usage monitoring. The pipeline runs locally at near-zero marginal cost.
- Individual pseudonym hygiene requires truly isolated identities with no overlapping biographical signals across platforms, maintained consistently over years. In practice, every specific detail shared online — a city neighborhood, a conference attended, an employer hinted at, a specialized hobby — compounds into a unique fingerprint. Substantive online participation over time makes hygiene extremely difficult.
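The compounding argument in the last point is back-of-envelope arithmetic: under a (strong, and in practice optimistic) independence assumption, each biographical detail multiplies down the candidate pool. The frequencies below are invented for illustration.

```python
# Why biographical details compound: assuming independence, each detail
# multiplies the candidate pool by its population frequency. The rates
# here are invented; the multiplication is the point.
population = 300_000_000
detail_rates = {
    "lives in a named city neighborhood": 1 / 5_000,
    "attended a specific conference":     1 / 2_000,
    "works at a hinted-at employer":      1 / 10_000,
}

pool = population
for rate in detail_rates.values():
    pool *= rate
# pool falls far below 1: the combination is effectively unique.
```

Three casual details, none individually identifying, shrink a 300-million-person pool to a fraction of one candidate. Real details are correlated, so the true pool shrinks more slowly, but the direction is the same.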
The paper's authors state the conclusion plainly: assume that pseudonymous accounts can be linked to real identities. Design your threat model accordingly.
The Threat Model Problem
The threat model most individuals applied to their online pseudonymity was calibrated for a pre-2026 world. The implicit assumption was that deanonymization required nation-state capabilities or a dedicated, skilled investigator with a specific target and significant motivation. Under that assumption, a pseudonym provided meaningful protection for most people against most adversaries.
That assumption is now wrong. The cost threshold that excluded low-motivation adversaries has dropped below the price of a coffee. The threat model must now include: employers vetting candidates who post in professional communities, journalists researching sources, stalkers, political opponents, curious ex-partners — anyone with a credit card and an API key.
This is a categorical expansion in adversary scope, not a marginal one. Most people's online behavior was not designed for this threat model. A Hacker News commenter who hints at their employer and city while discussing their work is providing enough signal for the Lermen et al. attack. They designed that behavior for the old threat model — the one where reading and reasoning over posts required a motivated human.
The adjustment is not "stop posting online." It is "treat your pseudonymous online activity as potentially linkable to your real identity, and design accordingly." Be less specific about biographical details that compound. Assume cross-platform linkage is possible. Do not rely on a pseudonym to protect information you would not want attributed to your real name.
What This Is Not
It is worth saying what this paper does not mean.
It does not mean anonymity provides zero protection. For low-value targets against low-capability adversaries — casual curiosity, unsophisticated actors — obscurity still raises friction. The attack costs $4, but it requires motivation to spend that $4. Not all adversaries are motivated.
It does not mean differential privacy is useless. DP provides formal mathematical guarantees for specific data release scenarios. Those guarantees hold. The problem is that DP cannot easily be applied to public posts retroactively, and the practical utility tradeoff makes it unsuitable for most real-world data releases. It is a tool for specific contexts, not a general solution to pseudonymity.
It does not mean the situation is novel. The structural problem — that privacy through obscurity was a bet on friction — has been visible since Sweeney's 2000 work. The novelty is that the last expensive step (reading unstructured text) has been automated.
What it does mean is that the doctrine of practical obscurity, which was already intellectually vulnerable, is now operationally dead. If your privacy model depends on the cost of aggregation exceeding the motivation of adversaries, and the cost just dropped by orders of magnitude, the model needs revision.
The LessWrong community discussion of this paper notes that the authors published despite misuse risks because transparency is necessary for defense development. That judgment is defensible. The attack capability exists whether or not it is published. Defenses cannot be designed without understanding the attack.
The question now is whether the institutions that built the reference databases — the platforms, the data brokers, the telcos — will face meaningful accountability for the infrastructure that makes this attack viable. The technical capability is new. The dossiers it searches are decades old.
Found This Helpful?
This newsletter covers security research, operational tooling, and practitioner-focused frameworks. Subscribers get methodology posts, implementation deep dives, and analysis like this when it drops.
Subscribe at Intelligence Adjacent
Sources
Primary Research
- Large-scale online deanonymization with LLMs (arXiv:2602.16800) — Lermen, Paleka, Swanson, Aerni, Carlini, Tramèr, February 2026
- Large-scale online deanonymization with LLMs — full text — arXiv HTML version
- Agentic LLMs as Powerful Deanonymizers (arXiv:2601.05918) — Tianshi Li, companion paper on Anthropic Interviewer dataset
- Simon Lermen's Substack post on the paper — author's framing and context
- Large-Scale Online Deanonymization with LLMs (LessWrong) — community discussion including author responses
Classical Deanonymization Research
- Simple Demographics Often Identify People Uniquely (Sweeney, 2002) — 87.1% of Americans uniquely identifiable from 3 fields
- Robust De-anonymization of Large Sparse Datasets (Narayanan & Shmatikov, 2008) — Netflix Prize attack; 2 dated ratings identify 68% of records, 6 obscure ratings identify 84%
- AOL Proudly Releases Massive Amounts of Private Data (TechCrunch, 2006) — contemporary reporting on the release
- AOL search log release (Wikipedia) — comprehensive incident overview
- You Are What You Write: Author re-identification attacks in the era of pre-trained LLMs (ScienceDirect, 2025) — differential privacy limitations against LLM-based attribution attacks
Legal and Historical Context
- DOJ v. Reporters Committee for Freedom of the Press, 489 U.S. 749 (1989) (LII) — origin of the practical obscurity doctrine
- Society of American Archivists: practical obscurity citation — doctrine origin and gloss
- "Misguided in Principle and Unworkable in Practice": Discarding the Reporters Committee Doctrine (Tandfonline) — 2015 legal critique
- The Case for Online Obscurity (Hartzog & Stutzman, FPF) — the pre-LLM academic argument for obscurity as privacy protection
- Me, My Metadata, and the NSA (ResearchGate, 2014) — post-Snowden legal analysis of bulk metadata collection
Shadow Profiles and Data Brokers
- Facebook shines a little light on "shadow profiles" (Sophos, 2018) — 2013 bug disclosure and shadow profile confirmation
- Facebook's "shadow profiles": the involuntary dossiers you can't opt out of (Boing Boing, 2017) — non-user data collection
- Facebook, Shadow Profiles, & Data Brokers (markn.ca, 2018) — connection between platform shadow profiles and broker industry
- What Are Data Brokers? How They Put Your Privacy at Risk (Aura) — industry overview
- Closing the Data Broker Loophole (Brennan Center for Justice) — legal landscape, absence of federal consent requirements
- The shadow data market: Privacy risks lurking in forgotten information (IAPP) — professional privacy community analysis
Contemporary Coverage
- AI takes a swing at online anonymity (The Register, 2026-02-26) — tech press coverage anchoring to Sweeney's work