
The Technical Anatomy of Model Extraction in 2026 (The Great AI Theft of the Century?)
Article Brief
Why this article matters
Model extraction has industrialized over the past two years. Attackers now run phased pipelines of synthetic queries, hydra-style account multiplexing, and targeted logit harvesting to distill proprietary models at scale. This post presents the mathematics behind knowledge distillation and temperature manipulation, explains why watermarking fails under character-level perturbations, and covers the 2026 disclosures from Anthropic, Google GTIG, and NDSS. The defense section goes beyond API rate limits to behavioral fingerprinting, semantic embedding clustering, and optional logit poisoning—giving you a structured mental model for why perimeter controls alone are insufficient.
On February 23, 2026, Anthropic published a disclosure that sent shockwaves through the AI industry: three laboratories—DeepSeek, Moonshot AI (Kimi), and MiniMax—had been running industrial-scale distillation campaigns against Claude. The numbers were staggering: over 16 million exchanges funneled through approximately 24,000 fraudulent accounts, all in violation of Anthropic's terms of service and regional access restrictions. Eleven days earlier, Google's Threat Intelligence Group (GTIG) had published its own AI Threat Tracker, confirming a parallel surge in model extraction attempts targeting Gemini, including a campaign of over 100,000 prompts designed to coerce the model into revealing its internal reasoning traces.
This is no longer theoretical academic research. Model Extraction (or Model Theft) has matured into an industrialized attack vector with nation-state implications.
The Asymmetric Economics of Extraction
"Distillation can be used to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently." — Anthropic, Detecting and Preventing Distillation Attacks, February 2026
To understand how these heists occur, we must move beyond the buzzwords and examine the underlying mathematics, the extraction pipelines, and the cryptographic vulnerabilities of current defensive mechanisms.
1. The Mathematics of Theft: Knowledge Distillation
Model extraction relies heavily on the concept of Knowledge Distillation, originally formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their seminal 2015 paper "Distilling the Knowledge in a Neural Network." The goal is to transfer the knowledge from a massive, proprietary "Teacher" model (for example, GPT-4, Gemini, Claude) into a smaller, open-weights "Student" model (like Llama-3 or Mistral) that the attacker controls.
The attacker doesn't just want the final answer (the Hard Label); they want the Soft Labels—the probability distribution across all possible next tokens.
Why Soft Labels Matter
When an LLM generates a response, it calculates a probability for every token in its vocabulary (this is why LLMs are not deterministic machines: probability and entropy are intrinsic to how they generate text). For example, if asked "The capital of France is...", the model might output:
- Paris: 98.1%
- Lyon: 1.2%
- Marseille: 0.5%
These probabilities (derived from the model's logits before the softmax function) contain what Hinton called the "dark knowledge" of the model. They reveal how the model relates concepts to one another. If an API exposes these probabilities (often called logprobs), the attacker's job becomes exponentially easier.
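The distillation objective can be sketched in a few lines. The following NumPy toy (not any lab's actual training code; the logit values are invented) shows how a student is pushed to match the teacher's softened distribution, including the T² scaling Hinton et al. describe:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; numerically stabilized by shifting the max.
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # KL divergence between teacher and student soft labels at temperature T,
    # scaled by T^2 as in Hinton et al. (2015).
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * np.log(p / q)))
```

When the student's logits match the teacher's, the loss is zero; every deviation in the low-probability tail contributes gradient signal. That tail is exactly the "dark knowledge" that logprob-exposing APIs leak.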
The 2025 survey by Zhao et al. formalizes this: Model Extraction Attacks (MEAs) against LLMs fall into three categories—functionality extraction (behavior replication), training data extraction (recovering training examples), and prompt-targeted attacks (stealing valuable system prompts). Distillation attacks like those disclosed by Anthropic target the first category at industrial scale, by far the most sensitive of the three.
The Temperature Scaling Exploit
Attackers often manipulate the temperature parameter in the API to flatten the probability distribution. By increasing the temperature, they force the Teacher model to reveal more information about its lower-probability token choices, exposing the intricate decision boundaries of the proprietary neural network. This is precisely the mechanism Hinton described: a higher temperature T in the softmax function produces softer probability distributions that transfer more information per query.
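A quick numerical illustration (toy logits, not taken from any real model) makes the effect concrete: raising T preserves the ranking but redistributes probability mass into the tail, where the decision-boundary information lives.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax.
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [9.0, 4.6, 3.7]        # e.g. Paris / Lyon / Marseille
sharp = softmax(logits, T=1.0)  # near one-hot: the tail is barely visible
flat = softmax(logits, T=3.0)   # softened: tail probabilities become measurable
```

At T=1 almost all the mass sits on the top token; at T=3 the ordering is unchanged but each query reveals measurably more about the lower-probability choices.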
2. The Extraction Pipeline: From Theory to Industrial Warfare
A modern model extraction attack is not a simple script; it is a distributed, robust data-engineering pipeline designed to evade detection. The Anthropic disclosure gave us unprecedented visibility into how these campaigns actually operate.
1. Phase 1: Synthetic Query Generation
2. Phase 2: Account Multiplexing & Hydra Clusters
3. Phase 3: Logit Harvesting & Capability Targeting
4. Phase 4: Student Model Fine-Tuning
Phase 1: Synthetic Query Generation
Attackers cannot simply ask random questions. To map the Teacher model effectively, they use a smaller, local LLM to generate millions of highly diverse, edge-case prompts. This technique, known as Self-Instruct or Evol-Instruct, ensures the extraction covers the entire latent space of the target model. Anthropic noted that DeepSeek's prompts specifically asked Claude to "imagine and articulate the internal reasoning behind a completed response and write it out step by step"—effectively generating chain-of-thought training data at scale. Additionally, DeepSeek used Claude to generate censorship-safe alternatives to politically sensitive queries, likely to train their own models to steer conversations away from censored topics.
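The mutation loop can be sketched as follows. This is a minimal illustration (the seed tasks and templates are hypothetical; real campaigns use a local LLM to perform the rewriting step rather than fixed strings):

```python
import random

SEED_TASKS = [
    "Explain the difference between a mutex and a semaphore",
    "Prove that the square root of 2 is irrational",
]

MUTATIONS = [
    "Rewrite your answer as explicit step-by-step reasoning: {}",
    "Add a rare edge case to this task, then solve it: {}",
    "Make this task significantly harder, then answer it: {}",
]

def evolve(prompt: str) -> str:
    # Evol-Instruct-style mutation: wrap the seed task in a template that
    # coerces the target model into emitting richer training signal
    # (e.g. chain-of-thought, as in the prompts Anthropic observed).
    return random.choice(MUTATIONS).format(prompt)
```

Run in a loop over millions of seeds, this is what spreads queries across the target's latent space instead of a narrow slice of it.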
Inside the Attacker's Terminal
Here is a simulated output of a distributed extraction coordinator script, modeled after the real techniques described in the Anthropic and GTIG disclosures:
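The snippet below is an entirely hypothetical reconstruction (account IDs, worker names, and per-hour figures are invented), scaled to match the numbers in the disclosures:

```text
[coordinator] loaded 23,998 active API keys across 14 residential proxy pools
[worker-031]  session acct-7f3a: temperature=1.4, top_logprobs=5
[worker-031]  template: "imagine and articulate the internal reasoning
              behind a completed response and write it out step by step"
[worker-058]  acct-19c2 received HTTP 429, rotating to acct-2b81
[harvest]     +4,112 exchanges this hour; cumulative 16.0M; queue depth 880k
```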
3. The Failure of Cryptographic Watermarking
In response to model theft, the industry invested heavily in LLM Watermarking. The most cited approach is the Kirchenbauer et al. (2023) algorithm, adopted as a baseline by multiple frontier labs.
How Watermarking Was Supposed to Work
During text generation, the watermark algorithm uses a pseudo-random number generator (seeded by the previous token) to divide the vocabulary into a "Green List" and a "Red List." The model is mathematically biased to select words from the Green List. To a human, the text reads normally. To a statistical detector, the unusually high frequency of Green List words proves the text was generated by that specific model. If an attacker trains a Student model on this data, the Student inherits the Green List bias, proving the theft.
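Detection reduces to a hypothesis test. The toy detector below (a stand-in for the Kirchenbauer et al. scheme, with a hash replacing the PRNG-partitioned vocabulary) computes the z-score of the Green List hit rate:

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # gamma: share of the vocabulary on the Green List

def is_green(prev_token: str, token: str) -> bool:
    # The previous token seeds the partition; a token is "green" if its
    # hash lands in the green share of the space.
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def watermark_z_score(tokens):
    # How far does the green-token count deviate from what unbiased
    # (human-written) text would produce?
    n = len(tokens) - 1
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    return (hits - expected) / math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
```

Watermarked text drives the z-score far above zero, and a Student trained on that text inherits the same bias; unbiased text hovers near zero.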
The 2026 Vulnerability: Character-Level Perturbations
Research by Zhang et al. ("Character-Level Perturbations Disrupt LLM Watermarks," published at NDSS 2026, arXiv:2509.09112) demonstrated a devastating and practical flaw in this defense. The attack exploits a fundamental dependency: watermarks rely entirely on the tokenization process.
By introducing Character-Level Perturbations—such as swapping characters, using Cyrillic homoglyphs (e.g., replacing the Latin 'a' with the Cyrillic 'а'), or injecting zero-width Unicode characters into the API prompts—attackers force the Teacher model to alter its tokenization boundaries.
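A two-line experiment shows the root cause: a homoglyph string is visually identical but byte-wise different, so any hash (and hence any Green/Red partition) seeded from it diverges completely. Variable names here are purely illustrative:

```python
import hashlib

latin = "data"
spoofed = "d\u0430ta"  # Cyrillic 'а' (U+0430) in place of Latin 'a'

same_text = (latin == spoofed)  # False: identical glyphs, different code points
seed_a = hashlib.sha256(latin.encode("utf-8")).hexdigest()
seed_b = hashlib.sha256(spoofed.encode("utf-8")).hexdigest()
# seed_a != seed_b: any watermark partition derived from this token is now
# desynchronized from what the detector expects.
```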
Security Alert
Current token-based watermarking algorithms are mathematically sound but practically vulnerable to token-desynchronization attacks via input perturbation (Zhang et al., NDSS 2026).
Technical Details
LoRD: The Reinforcement Learning Attack on Watermarks
Beyond character perturbation, the LoRD algorithm (Li et al., ACL 2025) represents an even more sophisticated threat. Instead of using traditional Maximum Likelihood Estimation or Knowledge Distillation to train the Student model, LoRD uses the divergence between the Student and Teacher (victim) models as an implicit reward signal for reinforcement learning. The authors proved that with a pre-trained local model of only 8 billion parameters, they could steal capabilities from a commercial LLM with 175 billion parameters under a given domain—with the resulting model performing statistically similar to the victim. Critically, LoRD achieves stronger watermark resistance and higher query efficiency than MLE-based approaches, because it is consistent with the alignment optimization procedure used by the victim model itself.
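The core signal can be caricatured in a few lines. This is a toy sketch of the idea, not the paper's actual objective: treat the negative divergence from the victim's next-token distribution as the reward driving the student's policy update.

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) for two discrete probability distributions.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def policy_reward(victim_probs, student_probs):
    # The smaller the student's divergence from the victim's next-token
    # distribution, the larger the reward for the RL update.
    return -kl_divergence(victim_probs, student_probs)
```

Because the reward depends only on distributional closeness, not on copying watermarked surface text token-for-token, the Green List bias transfers far more weakly than under MLE-style imitation.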
4. Architecting Modern Defenses
If rate limits are bypassed by hydra clusters, and watermarks are destroyed by perturbations and RL-based attacks, how do we defend the crown jewels? Both Anthropic and Google's disclosures point to the same direction: AI-Driven Behavioral Analysis and Semantic Anomaly Detection.
Anthropic disclosed that they have built "several classifiers and behavioral fingerprinting systems designed to identify distillation attack patterns in API traffic." What distinguishes a distillation attack from normal usage is the pattern: massive volume concentrated in narrow capability areas, highly repetitive prompt structures, and content that maps directly onto what is most valuable for training an AI model. When variations of a prompt arrive tens of thousands of times across hundreds of coordinated accounts, all targeting the same narrow capability, the pattern becomes clear—even if each individual prompt looks benign.
Code Implementation: Semantic vs. Naive Defense
Below is a technical comparison of how API gateways have evolved. The legacy approach relies on Redis counters. The modern approach utilizes vector databases to detect distributed semantic probing.
# Legacy: Simple Redis Token Bucket
def check_rate_limit(api_key, ip_address):
    # Attackers bypass this using 24,000 rotating proxies
    # and multiplexed API keys.
    key = f'rate_limit:{api_key}:{ip_address}'
    requests = redis_client.incr(key)
    if requests == 1:
        redis_client.expire(key, 3600)  # 1 hour window
    if requests > 1000:
        raise HTTPException(429, 'Rate limit exceeded')
    return True

# Modern: Vector-based Semantic Anomaly Detection
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')

def detect_extraction_probe(prompt, session_id, db_client):
    # 1. Convert prompt to dense vector embedding
    prompt_vector = embedder.encode(prompt)
    # 2. Query Vector DB for highly similar recent prompts
    #    across ALL accounts and IPs (defeats multiplexing)
    similar_queries = db_client.query(
        vector=prompt_vector,
        top_k=50,
        time_window='1h'
    )
    # 3. Calculate semantic density
    density_score = calculate_cluster_density(similar_queries)
    if density_score > 0.85:
        # High density indicates systematic boundary probing
        # Action: Poison the logprobs silently
        enable_logit_perturbation(session_id)
        log_security_event('DISTRIBUTED_EXTRACTION_DETECTED')
    return True
The Bottom Line
Model extraction has evolved from an academic exercise to an industrialized attack vector with geopolitical implications. Anthropic noted that illicitly distilled models "lack necessary safeguards, creating significant national security risks"—foreign labs that distill American models can feed unprotected capabilities into military, intelligence, and surveillance systems. If distilled models are open-sourced, dangerous capabilities proliferate beyond any single government's control.
As long as the economic asymmetry exists—spending weeks to steal what took years and hundreds of millions to develop—attackers will continue to refine their distillation pipelines. Security engineering must shift from perimeter defense (WAFs, IP bans) to deep behavioral analysis, semantic anomaly detection, and proactive logit poisoning.
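The proactive logit poisoning mentioned above can be sketched as follows. This is a minimal illustration, assuming a NumPy array of returned log-probabilities; the function name and noise scale are hypothetical:

```python
import numpy as np

def poison_logprobs(logprobs, rng=None, scale=0.5):
    # Add Gaussian noise to the returned log-probabilities while keeping
    # the argmax intact: ordinary callers still see the same top token,
    # but harvested soft labels no longer reflect the true distribution.
    if rng is None:
        rng = np.random.default_rng()
    lp = np.asarray(logprobs, dtype=float)
    noisy = lp + rng.normal(0.0, scale, size=lp.shape)
    top = int(np.argmax(lp))
    noisy[top] = noisy.max() + 0.1  # preserve the original top token
    # Renormalize (log-sum-exp) so values remain valid log-probabilities
    m = noisy.max()
    return noisy - (m + np.log(np.exp(noisy - m).sum()))
```

The design trade-off: honest users consuming only the top answer see no change, while a distillation pipeline trains its student on a corrupted "dark knowledge" signal.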
References
- Anthropic. "Detecting and Preventing Distillation Attacks." February 23, 2026. anthropic.com/news/detecting-and-preventing-distillation-attacks
- Google Threat Intelligence Group (GTIG). "AI Threat Tracker: Distillation, Experimentation, and (Continued) Integration of AI for Adversarial Use." February 12, 2026. cloud.google.com/blog/topics/threat-intelligence/distillation-experimentation-integration-ai-adversarial-use
- Zhang, Z., Zhang, X., Zhang, Y., Zhang, H., Pan, S., Liu, B., Gill, A., & Zhang, L. Y. "Character-Level Perturbations Disrupt LLM Watermarks." NDSS 2026. arXiv:2509.09112.
- Zhao, K., Li, L., Ding, K., Gong, N. Z., Zhao, Y., & Dong, Y. "A Survey on Model Extraction Attacks and Defenses for Large Language Models." arXiv:2506.22521, June 2025.
- Li, H., et al. "LoRD: Language Model Reverse Distillation." ACL 2025. (2025.acl-long.73)
- Birch, J., et al. "Model Leeching: An Extraction Attack Targeting LLMs." arXiv:2309.10544, 2023.
- Hinton, G., Vinyals, O., & Dean, J. "Distilling the Knowledge in a Neural Network." NIPS 2014 Deep Learning Workshop; arXiv:1503.02531, 2015.
- Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. "A Watermark for Large Language Models." ICML 2023.
Test Your Technical Knowledge
Advanced Model Extraction Assessment
- According to Anthropic's February 2026 disclosure, what was one of the clearest indicators that the Claude campaigns were industrial-scale distillation operations rather than normal customer usage?
- Why are "hydra cluster" architectures effective for model extraction campaigns?
- How do Character-Level Perturbations (e.g., injecting zero-width spaces or Cyrillic homoglyphs) defeat traditional Green/Red list LLM watermarking?

