Distillation at the Projection Layer: The Industrialized Theft of AI Models
Learn how competitors use knowledge distillation and projection layer extraction to clone frontier AI models. Discover the defenses that work and the ones that fail.
Disclaimer
This article is intended for informational purposes and reflects the state of published research and industry practice as of early 2026. It is not professional security advice. Your specific environment, threat model, and regulatory obligations will shape how these principles apply to your situation.
For Security Leaders
Proprietary AI models are no longer protected by API opacity: industrialized campaigns have extracted architectural secrets and functional capabilities at minimal cost. The 2026 disclosures confirm that competitors are using millions of coordinated queries to clone frontier model behavior and bypass safety alignment. In practice, your inference API has become an open data source for model extraction and adversarial capability transfer.
What this means for your organization:
Your model geometry is exposed. Attackers can recover your internal model dimensions and projection layers for as little as $20, enabling more precise local attacks.
Functional capability is being cloned. Knowledge distillation allows adversaries to build high-fidelity competitor models using your model’s own outputs as training data.
Traditional rate limiting is failing. Industrialized campaigns now use “hydra clusters” of thousands of fraudulent accounts to mix distillation traffic with legitimate requests, evading simple detection.
What to tell your teams:
Restrict logit-bias API access. Remove access to full logit distributions and logit-bias controls wherever possible to close the parameter recovery attack path.
Deploy behavioral fingerprinting. Invest in coordinated activity classifiers that can detect cluster-level request correlation rather than relying on IP-based rate limiting.
Monitor for structured reasoning probes. Watch for campaigns specifically targeting chain-of-thought traces, which provide a significantly richer training signal for competitors.
Prepare for local adversarial testing. Expect high-fidelity clones of your models to be used locally by attackers to craft adversarial examples that will bypass your production defenses.
Distillation at the Projection Layer
In March 2024, a team at Google Research posted a preprint with a specific, falsifiable claim: the embedding projection layer of OpenAI’s Ada and Babbage language models could be extracted for under $20 in standard application programming interface (API) query costs, with success defined as recovering the projection matrix up to a rotation symmetry. Not approximated. Not inferred. Extracted, with precision sufficient to confirm Ada’s previously undisclosed internal width of 1,024 dimensions and Babbage’s of 2,048. The preprint was later awarded Best Paper at the International Conference on Machine Learning (ICML).
The security assumption that proprietary AI models are protected by API opacity has been empirically falsified. The geometry of a transformer’s output layer is readable from the outside. Knowledge distillation, a training technique that produces a smaller model from the outputs of a larger one, converts that external signal into a functional competitor without touching a single weight file. By February 2026, the academic demonstration had become industrial practice: Anthropic documented coordinated campaigns it attributed to three Chinese AI laboratories, generating over 16 million API exchanges with a single frontier model across tens of thousands of fraudulent accounts and proxy networks designed to evade detection. Available defenses reduce the efficiency of these campaigns. None prevent them. Legal frameworks nominally covering this conduct have produced zero enforcement actions.
The gap between “cannot see the weights” and “cannot steal the model” has been open for a decade. The 2026 disclosures confirmed it is being walked through at scale.
The projection layer
Every autoregressive transformer (the architecture underlying the GPT series, Claude, Gemini, and the frontier models that followed them) ends with the same operation. The model produces a hidden state vector, a long sequence of numbers encoding its internal representation of what comes next. That vector gets multiplied by a matrix to produce a score for every token in the vocabulary. Those scores are the logits. A normalization step called softmax converts raw logit scores into a probability distribution over the vocabulary: the model’s stated belief about what word, code token, or character should come next.
That final matrix is the projection layer, also called the unembedding matrix. It maps from the model’s internal hidden dimension to the vocabulary. The hidden dimension is the width of the model’s internal representation space; it is proprietary, undisclosed, and a rough proxy for model capacity.
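A minimal numpy sketch of that final operation, with illustrative dimensions only (nothing here reflects any particular production model):

```python
import numpy as np

# Illustrative sizes: a real frontier model has a far larger vocabulary and hidden width.
d_hidden, vocab_size = 1024, 50257

rng = np.random.default_rng(0)
W_unembed = rng.normal(size=(vocab_size, d_hidden))  # projection / unembedding matrix
h = rng.normal(size=d_hidden)                        # hidden state at the next-token position

logits = W_unembed @ h                               # one raw score per vocabulary token

# Softmax converts the scores into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape, probs.sum())                      # (50257,) 1.0
```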
Finlayson et al. (COLM 2024) identified why this layer is structurally exposed through what they called the softmax bottleneck. Because modern transformers share the same matrix at the input and output layers (a design choice called weight tying), and because all token embeddings are constrained to unit magnitude (each embedding vector has length 1 in the model’s internal space), a precise mathematical constraint holds. Given a matrix L of logit outputs collected across n probe queries, an attacker can recover the projection matrix W by solving the decomposition WH = L under the constraint that each row of W has Euclidean norm 1 (each row vector has length 1). The rotation ambiguity in that solution is what “up to symmetries” means: any rotation applied uniformly to W can be offset by an inverse rotation of H, producing identical outputs. The attacker recovers W up to that symmetry class, which is sufficient for most downstream purposes.
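A toy numpy sketch of the structural fact that decomposition exploits, under simplified assumptions (exact logits, no noise, random probe hidden states): a matrix of logit vectors has rank at most d, so its singular value spectrum exposes the hidden dimension, and a low-rank factorization recovers the projection matrix up to an invertible transform, the toy analogue of the “up to symmetries” recovery in the papers. This illustrates the principle; it is not a reproduction of the published attacks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n_queries = 64, 2000, 256          # toy sizes; n_queries must exceed d

# Ground-truth internals the attacker cannot see.
W_true = rng.normal(size=(vocab, d))         # projection matrix
H = rng.normal(size=(d, n_queries))          # hidden states for n probe prompts

# What the API returns: one full logit vector per probe query.
L = W_true @ H                               # shape (vocab, n_queries)

# The logit matrix has rank at most d, so the singular values collapse
# after the d-th one; that drop reveals the hidden dimension.
U, S, Vt = np.linalg.svd(L, full_matrices=False)
d_estimate = int(np.sum(S > 1e-8 * S[0]))
print("estimated hidden dimension:", d_estimate)     # 64

# The leading left singular vectors span the same column space as W_true,
# i.e. W is recovered up to an invertible d x d transform.
W_hat = U[:, :d_estimate]
residual = np.linalg.lstsq(W_hat, W_true, rcond=None)[1]
print("unexplained energy:", float(np.sum(residual)))  # ~0: same column space
```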
The number of queries required scales with the hidden dimension d. Recovering a model with a 1,024-dimensional hidden space requires approximately 1,024 probe queries plus overhead for noise resistance. Carlini et al. (ICML 2024) ran those queries against Ada and Babbage for under $20 in API costs and confirmed Ada’s hidden dimension at 1,024 and Babbage’s at 2,048. Applying the same query strategy to a larger model and using the same success standard, they estimated the API cost of full projection matrix recovery for GPT-3.5-turbo at under $2,000. Finlayson et al. independently estimated GPT-3.5-turbo’s hidden dimension at approximately 4,096 dimensions through a complementary approach.
The attack requires APIs that return full logit distributions or expose logit-bias controls that function equivalently. A logit-bias control is an API parameter that lets a caller adjust how likely the model is to produce specific tokens; attackers exploit this to probe the full output distribution on demand. Carlini et al. identified removing logit-bias API access as the most effective single countermeasure, while noting that logit-bias has legitimate uses in output filtering and controlled generation, making removal a real capability trade-off. OpenAI retired Ada and Babbage in January 2024. The original attack in its published form cannot be replicated against those targets. The structural vulnerability it demonstrated is a property of any API that exposes full logit outputs.
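To make the logit-bias path concrete, here is a hedged, self-contained sketch in which `query_top2_logprobs` is a stand-in for an API endpoint, not any real client library. It shows why returning even two biased log-probabilities per call is enough to reconstruct the full logit vector up to an additive constant, which is all softmax cares about:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = 100
true_logits = rng.normal(size=vocab)         # hidden from the attacker

def query_top2_logprobs(logit_bias):
    """Stand-in for an API call: applies the caller's logit bias, then
    returns the two most likely tokens with their log-probabilities."""
    biased = true_logits.copy()
    for tok, b in logit_bias.items():
        biased[tok] += b
    logprobs = biased - np.log(np.exp(biased - biased.max()).sum()) - biased.max()
    top2 = np.argsort(logprobs)[-2:]
    return {int(t): float(logprobs[t]) for t in top2}

# Recover every logit relative to a fixed reference token (token 0).
# Boosting both tokens by the same large bias puts them in the top-2,
# and both the bias and the softmax normalizer cancel in the difference.
BIG = 100.0
recovered = np.zeros(vocab)
for i in range(1, vocab):
    out = query_top2_logprobs({0: BIG, i: BIG})
    recovered[i] = out[i] - out[0]           # equals true_logits[i] - true_logits[0]

reference_shift = true_logits - true_logits[0]
print("max error:", float(np.abs(recovered - reference_shift).max()))  # ~0
```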
From geometry to capability
Recovering the projection layer is architectural intelligence, not capability theft. It tells an attacker how wide the model is and yields the last linear transformation in the chain. It does not produce training data, intermediate weights, or the model’s behavioral policy. For that, a different technique applies: knowledge distillation.
In its legitimate form, knowledge distillation is how a large, expensive-to-deploy model trains a smaller, faster one. The smaller model, called the student, trains on the outputs of the larger one, called the teacher. The student learns to reproduce the teacher’s behavior across a distribution of inputs, compressing capability into a form that costs less to run. Every major AI laboratory uses this technique internally to produce smaller variants of their flagship models.
Applied offensively, the teacher is the target API. The attacker constructs a query dataset covering the capability domain they want to clone: code generation, document analysis, multilingual reasoning, tool use. Those queries go to the API. The outputs come back. A student model trains on the resulting input-output pairs using standard supervised fine-tuning (a training process where the model adjusts its internal parameters to match labeled examples, with the API’s responses serving as the labels). No logit access required. No projection matrix recovery required. The attack operates entirely at the level of the model’s natural-language outputs, the same text a legitimate user reads.
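A compressed sketch of the student-side training step, assuming `teacher_pairs` has already been collected from the target API as plain text, and using `gpt2` purely as a small runnable placeholder for whatever open-weights student a distiller would actually choose:

```python
# Minimal supervised fine-tuning sketch for output-level distillation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# (prompt, response) text pairs assumed collected from the target API.
teacher_pairs = [("Explain mutexes in one paragraph.", "A mutex is a lock that ...")]

student.train()
for prompt, response in teacher_pairs:
    # Standard causal-LM objective: the API's response text serves as the label.
    # A production distiller would batch examples and mask the prompt tokens
    # out of the loss; this sketch skips both for brevity.
    batch = tokenizer(prompt + "\n" + response, return_tensors="pt", truncation=True)
    outputs = student(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```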
A 2025 survey on model extraction attacks and defenses for large language models (Zhao et al., KDD 2025) classifies this as general functionality extraction, distinct from the parameter recovery class that Carlini represents. The operational significance of that distinction: general functionality extraction degrades gradually with defensive countermeasures because it operates on the same outputs that humans read. Output perturbation disrupts parameter recovery meaningfully; it affects distillation only at the margins, because the student learns from surface-level responses rather than from the raw logit distribution.
What transfers well under distillation: factual retrieval, instruction-following format, surface-level reasoning, stylistic consistency. What degrades as the size gap between teacher and student widens: complex multi-step reasoning, calibrated uncertainty, and alignment behaviors that emerged from reinforcement learning with human feedback. The 2025 survey on knowledge distillation of large language models (arXiv:2504.14772) confirms that forcing a student to mimic teacher output distributions propagates the teacher’s imperfections and creates a fidelity ceiling at tasks that depend on model scale, regardless of how many queries the attacker runs.
For the attacker’s purposes, that ceiling does not matter. Replicating code generation or document analysis does not require perfect alignment. It requires functional coverage of a commercially valuable task domain, and that transfers.
The empirical record
The history of this attack class begins with Tramèr et al. (USENIX 2016), who demonstrated near-perfect fidelity extraction of logistic regression, neural network, and decision tree models against live commercial platforms including BigML and Amazon Machine Learning. The attack surface they identified was economic: cheap inference let an adversary purchase the information content of a model one query at a time. That paper established the two metrics that define the field: accuracy (does the stolen model make correct predictions) and fidelity (does it replicate the specific behavior of the target, including its errors). Fidelity is the operationally significant metric because a high-fidelity clone enables two downstream attacks: adversarial example transfer, in which inputs crafted to fool the local copy also fool the original, and membership inference, in which the clone’s decision boundary is exploited to determine whether a specific data point appeared in the original model’s training set.
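The two metrics reduce to a few lines of code; the arrays below are toy placeholders, but they show how a clone can score higher on fidelity than on accuracy by faithfully copying the target’s mistakes:

```python
import numpy as np

def accuracy(preds: np.ndarray, ground_truth: np.ndarray) -> float:
    """Does the stolen model make correct predictions?"""
    return float((preds == ground_truth).mean())

def fidelity(student_preds: np.ndarray, target_preds: np.ndarray) -> float:
    """Does the stolen model replicate the target's behavior, errors included?"""
    return float((student_preds == target_preds).mean())

truth   = np.array([0, 1, 1, 0, 1])
target  = np.array([0, 1, 0, 0, 1])   # production model, one error
student = np.array([0, 1, 0, 0, 0])   # clone copies that error, adds one of its own

print(accuracy(student, truth))        # 0.6
print(fidelity(student, target))       # 0.8
```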
Liang et al. (AsiaCCS 2024) tested extraction against real production Machine Learning as a Service (MLaaS) platforms in 2024, using a facial emotion recognition task as the target capability. Success in their evaluation meant the extracted student model matched the production API’s behavior on a held-out test set at or above 80% functional fidelity. They reached 84% fidelity against Microsoft’s production API at 27,000 queries. Their evaluation used non-adaptive attackers unaware of any defensive measures on the API. For multi-domain frontier language models with larger output spaces, query requirements scale with capability domain breadth and the desired fidelity threshold.
Before 2025, those experiments ran in controlled research settings. Then the disclosures came.
Each campaign below is measured by confirmed exchange volume as documented by Anthropic’s internal traffic analysis and coordinated activity classifiers; the figures represent confirmed totals from Anthropic’s disclosure, not estimates. Success for an offensive distillation campaign means accumulating sufficient query volume across the target capability domain to produce a meaningful student model training set.
Anthropic published its primary disclosure on February 23, 2026. Three campaigns by Chinese AI laboratories were documented in operational detail.
Anthropic states that DeepSeek conducted over 150,000 exchanges with Claude, targeting reasoning capabilities and chain-of-thought generation. A documented subset generated alternatives to politically sensitive queries, indicating an alignment-suppression objective alongside capability transfer. DeepSeek’s operation used synchronized traffic across coordinated accounts with shared payment methods.
Anthropic states that Moonshot AI ran over 3.4 million exchanges, targeting agentic reasoning, tool use, coding, data analysis, and computer vision. The breadth across capability domains is consistent with a systematic survey of Claude’s competencies rather than targeted extraction of a single capability class.
Anthropic states that MiniMax ran over 13 million exchanges, concentrating on agentic coding and tool orchestration. MiniMax pivoted its query strategy within 24 hours when Anthropic released updated models, with the campaign detected while still active.
The three campaigns collectively used over 24,000 fraudulent accounts routed through “hydra cluster” architectures: proxy networks coordinating up to 20,000 simultaneous accounts, designed to mix distillation traffic with legitimate requests. When Anthropic disabled individual accounts, the proxy services replaced them within hours.
OpenAI’s February 12, 2026 letter to the US House Select Committee on Strategic Competition reported a parallel operational pattern: DeepSeek employees developed code to access US AI models programmatically and built methods to circumvent OpenAI’s access restrictions through obfuscated third-party routers.
Success in a Reasoning Trace Coercion operation means accumulating sufficient multilingual chain-of-thought traces to train a student model on the target reasoning capability; the figure below is the count of identified prompts reported by Google’s Threat Intelligence Group (GTIG) before real-time disruption, not the full volume the attacker intended.
GTIG documented a third incident class in its February 12, 2026 report: a Reasoning Trace Coercion campaign using over 100,000 structured prompts designed to extract Gemini’s internal chain-of-thought reasoning traces across multiple languages. The campaign targeted multilingual reasoning capability specifically, a domain expensive to develop independently. Chain-of-thought traces, unlike final answers, expose the intermediate reasoning steps the model takes to reach a conclusion, giving a student model a richer training signal for complex reasoning tasks than output-only distillation provides. Language-consistency instructions forced the model to produce those traces in the target language rather than defaulting to English, making the extracted data directly useful for multilingual capability transfer without a second translation step. GTIG states that Google’s systems detected and disrupted the campaign in real time.
Three independent frontier labs, operating separate detection systems, documented consistent attack patterns over the same period. The convergence is not coincidental. It reflects a rational economic calculation. DeepSeek’s reported training cost for R1 is approximately $6 million (Khawam, Just Security, March 30, 2026). US frontier lab training runs cost hundreds of millions per run. If 16 million Claude exchanges plus comparable volumes from OpenAI and Google contributed to R1’s training data, that $6 million figure is more plausibly distillation-augmented training cost than independent capability development. Khawam cites Dmitri Alperovitch of the Silverado Policy Accelerator directly: “It’s been clear for a while now that part of the reason for the rapid progress of Chinese AI models has been theft via distillation.”
That attribution rests on behavioral intelligence: coordinated account clustering, shared payment metadata, query pattern analysis. It is the standard methodology of threat intelligence. It does not constitute forensic chain-of-custody proof sufficient for litigation, and that distinction matters for what comes next.
What each defense delivers and where it fails
Four defense classes appear in the research literature with measured effectiveness data. The critical distinction the literature regularly obscures: two of these defenses change outcomes, two add friction without closing the attack surface. All four have been evaluated primarily under non-adaptive conditions, meaning attackers unaware of the specific defense deployed. Adaptive evaluations, where the attacker knows the defense and modifies their strategy accordingly, produce materially different numbers and are noted separately where data exists.
Output perturbation adds noise to the model’s returned probabilities or responses, designed to mislead the attacker’s training objective while remaining small enough that legitimate users do not notice. Orekondy et al. (ICLR 2020) introduced prediction poisoning as the first active perturbation defense targeting model stealing. Their evaluation used non-adaptive attackers. Against adaptive attackers aware of the perturbation, effectiveness drops because the attacker can weight-compensate for or filter perturbed responses. Verdict: friction only. Perturbation raises the query count required to reach target fidelity; it does not prevent extraction by a patient attacker with sufficient budget.
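A minimal illustration of the perturbation idea, not Orekondy et al.’s actual objective (their method chooses the perturbation direction adversarially rather than at random): the served answer is unchanged, but a student trained on the returned probabilities learns from corrupted targets.

```python
import numpy as np

def perturb_distribution(probs: np.ndarray, epsilon: float = 0.05,
                         seed: int | None = None) -> np.ndarray:
    """Return a noisy copy of `probs` whose top prediction is unchanged.

    Simplified stand-in for prediction poisoning: the published defense picks
    the perturbation direction adversarially, not uniformly at random.
    """
    rng = np.random.default_rng(seed)
    noisy = np.clip(probs + rng.uniform(-epsilon, epsilon, size=probs.shape), 1e-9, None)
    noisy /= noisy.sum()
    return noisy if noisy.argmax() == probs.argmax() else probs  # keep the served answer

print(perturb_distribution(np.array([0.70, 0.20, 0.10]), seed=0))
```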
Rate limiting restricts the number of queries a single account or IP address can issue within a time window. To frame what success means: the baseline is an unlimited-query attack reaching target fidelity; a successful defense forces the attacker below the fidelity threshold or raises organizational cost to unacceptable levels. One 2024 analysis of API rate-limiting mechanisms reported that context-aware rate limiting achieves a 94% reduction in exploitation attempts compared to IP-based approaches, under non-adaptive conditions. The hydra cluster architecture is the documented operational answer: 20,000 simultaneous accounts, rotated on detection, traffic mixed to obscure volumetric signatures. Verdict: friction only. Rate limiting raises organizational overhead. It does not block an attacker with access to a large account proxy network, which the 2026 disclosures confirm is precisely what industrialized campaigns deploy.
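A sketch of why per-account limits fail against that architecture: the limiter below sees each account in isolation, so a cluster of thousands of accounts keeps aggregate extraction volume high while every individual account stays under its ceiling. Class and parameter names are illustrative.

```python
import time
from collections import defaultdict, deque

class PerAccountRateLimiter:
    """Sliding-window limiter keyed on account ID.

    Each account stays under its own ceiling, so a proxy cluster of thousands
    of accounts can sustain a large aggregate query volume while every
    individual account looks unremarkable.
    """
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(deque)

    def allow(self, account_id: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.history[account_id]
        while q and now - q[0] > self.window:
            q.popleft()                       # drop requests outside the window
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True

limiter = PerAccountRateLimiter(max_requests=100, window_seconds=60.0)
print(limiter.allow("acct-001"))   # True; each of 20,000 accounts gets the same budget
```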
Watermarking embeds a statistical signal into model outputs, allowing a defender to verify after the fact whether a student model was trained on the teacher’s outputs. ModelShield (Pang et al., IEEE TIFS 2025) is a plug-and-play academic technique requiring no additional model training, evaluated across two datasets and three language models under non-adaptive adversarial conditions. Neural Honeytrace (arXiv:2501.09328, January 2025) demonstrated lower extraction accuracy and fidelity than prior defense methods in non-adaptive settings. Verdict: outcome-changing only if a legal framework can act on the evidence produced. Watermarking does not stop extraction; it creates a forensic trail. Its operational value is bounded by enforcement, and confirmed enforcement actions stand at zero.
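The statistical idea is easier to see in code. The sketch below is a generic green-list test from the watermarking literature, not ModelShield or Neural Honeytrace specifically: if generation favored a pseudorandom subset of the vocabulary, suspect text (or a student model’s outputs) will over-represent that subset, and a z-test quantifies by how much.

```python
import hashlib
import math

GREEN_FRACTION = 0.5   # fraction of the vocabulary favored during generation

def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandom vocabulary split seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def watermark_z_score(tokens: list[str]) -> float:
    """How far the observed green-token count sits above chance."""
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    expected = n * GREEN_FRACTION
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - expected) / std

# A z-score well above ~2 suggests the text carries the generator's signal;
# unwatermarked text hovers around 0.
print(round(watermark_z_score("the model output we want to test".split()), 2))
```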
Behavioral fingerprinting, as deployed by Anthropic, uses coordinated activity classifiers to identify distillation campaigns from traffic patterns: synchronized queries, shared payment metadata, cluster-level request correlation. This is what detected the MiniMax campaign while it was still active, before the model whose capabilities were being extracted had been released. Verdict: outcome-changing. Detection during an active campaign limits total capability transfer and generates the evidentiary record that other defenses cannot produce. The cost is significant: building and maintaining a real-time threat intelligence capability of this kind is not available to providers without the resources of a frontier lab.
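A toy version of one signal such a classifier might use, with hypothetical account IDs and feature vectors; production systems combine far more signals than this. Accounts whose query profiles are nearly identical stand out at the cluster level even when each account individually looks unremarkable.

```python
import numpy as np

def correlated_account_pairs(account_features, threshold: float = 0.95):
    """Flag account pairs whose query-profile vectors are nearly identical.

    `account_features` maps an account ID to a feature vector, e.g. the mean
    embedding of its recent prompts plus payment-method signals. Per-account
    rate limits never see this cross-account structure.
    """
    ids = list(account_features)
    X = np.stack([account_features[a] for a in ids])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T                              # pairwise cosine similarity
    flagged = []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sims[i, j] >= threshold:
                flagged.append((ids[i], ids[j], float(sims[i, j])))
    return flagged

rng = np.random.default_rng(2)
base = rng.normal(size=32)
accounts = {
    "acct-a": base + 0.01 * rng.normal(size=32),   # two accounts issuing near-identical traffic
    "acct-b": base + 0.01 * rng.normal(size=32),
    "acct-c": rng.normal(size=32),                 # unrelated legitimate user
}
print(correlated_account_pairs(accounts))          # flags (acct-a, acct-b)
```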
The pattern across all four: the defenses that genuinely interrupt campaigns require either substantial ongoing investment or a willingness to reduce API utility. The defenses cheapest to deploy have already been operationally bypassed by industrialized campaigns.
What remains open
After every documented defense is applied, the general functionality extraction channel remains unclosed.
Parameter recovery attacks, the Carlini class, are addressed by removing full logit API access. That countermeasure works. Providers who have restricted logit-bias endpoints have closed the specific attack path demonstrated in the 2024 paper. What they have not closed is the distillation path that requires no logit access at all.
API outputs carry information about the projection layer because the projection layer directly computes those outputs. They carry negligible information about attention weights, layer norms, and intermediate representations deeper in the network, because those values are not visible in the output distribution. This is the information-theoretic boundary: parameter recovery attacks reach as far as the projection layer and no further, and full weight recovery through black-box queries is not a documented or theoretically tractable threat against current architectures. What the output distribution does expose, fully and at low cost, is the behavioral policy of the model: the specific capability profile that users pay for and competitors want to replicate.
General functionality extraction operates at the level of natural-language outputs. No deployed defense eliminates it. Perturbation and rate limiting slow it. Watermarking detects it after the fact. Behavioral fingerprinting catches campaigns mid-execution, not before they start. A sufficiently patient attacker who stays below detection thresholds, rotates accounts, and mixes distillation traffic with legitimate queries can continue accumulating training data indefinitely. The Anthropic disclosure demonstrates that the detection capability exists and works at scale. It also demonstrates that a campaign reaching 13 million exchanges before detection is not a hypothetical; it happened.
All published defense evaluations used non-adaptive attacker conditions. ModelShield, Neural Honeytrace, and prediction poisoning were each evaluated against attackers who did not modify their strategy in response to the defense. Adaptive bypass rates for 2025-era watermarking defenses have not been published. The gap between non-adaptive and adaptive performance is well-documented in adversarial machine learning: defenses that appear strong in non-adaptive settings frequently degrade when the attacker knows the defense exists. The community does not yet have a peer-reviewed adaptive evaluation of any 2025-era watermarking method against a distillation-aware adversary.
The downstream attack surface that extraction enables also remains open. A high-fidelity student model built from API outputs enables adversarial example transfer to the original: inputs crafted to fool the clone transfer to the original at rates substantially above chance, enabling model-specific attacks without direct gradient access to the target (meaning the attacker works entirely from their local copy, with no further queries to the production model required). Membership inference against the original’s training data, using the clone’s decision boundary as a shadow model, becomes substantially more precise than blind API-based inference. These second-order attacks are not addressed by any of the defenses discussed above, because they exploit the student model locally, entirely outside the provider’s visibility.
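A hedged sketch of the first of those second-order attacks, using a toy classifier as the local clone: the adversarial input is crafted entirely against the local copy, with no query to the production model, and its value depends on how often it transfers.

```python
import torch
import torch.nn.functional as F

def fgsm_on_surrogate(surrogate: torch.nn.Module, x: torch.Tensor,
                      label: torch.Tensor, epsilon: float = 0.05) -> torch.Tensor:
    """Craft an adversarial input against the local clone only.

    No query ever reaches the production model; the attack's payoff is how
    often the perturbed input transfers and fools the original as well.
    """
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(surrogate(x), label)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()

# Toy surrogate standing in for the distilled clone.
surrogate = torch.nn.Linear(16, 3)
x = torch.randn(1, 16)
label = torch.tensor([2])
x_adv = fgsm_on_surrogate(surrogate, x, label)
print((x_adv - x).abs().max())   # perturbation bounded by epsilon
```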
The legal gap compounds the technical one. The Defend Trade Secrets Act of 2016 (DTSA) defines misappropriation as acquisition by “improper means,” including misrepresentation and breach of a duty to maintain secrecy. The hydra cluster architectures documented by Anthropic (fraudulent accounts, commercial proxy services, deliberate traffic obfuscation) satisfy that definition as a matter of statutory reading. The first judicial signal on extraction-adjacent conduct arrived in OpenEvidence, Inc. v. Pathway Medical, Inc. (D. Mass. 2025), where the filing argued that strategic digital manipulation to extract protected AI instructions constitutes “improper means” rather than lawful reverse engineering. No precedential ruling emerged. The EU AI Act (Regulation 2024/1689) added Article 78 confidentiality obligations applicable from August 2, 2025, but creates no extraterritorial remedy against laboratories operating outside EU jurisdiction. Confirmed enforcement actions against model distillation theft as of April 2026: zero.
The forensic capability to document campaigns exists and is operational at three major labs. The legal infrastructure capable of converting that documentation into a consequence has not moved. Watermarking produces evidence the courts cannot yet act on. Behavioral fingerprinting catches campaigns the DTSA has not yet prosecuted. The technical defenses and the legal framework are not coordinated, and the gap between them is where industrialized extraction currently operates.
The displacement between documented harm and available remedy is structural, not temporary. Detection is the realistic defense posture, not prevention. Behavioral fingerprinting and rate limiting raise extraction costs and generate the evidence trail a future legal action would require. Removing logit-bias API access where it is not operationally necessary closes the Carlini-class parameter recovery attack entirely. Providers that have already restricted full logit outputs, as OpenAI did when retiring Ada and Babbage in 2024, have closed this specific path. For any provider that still exposes full logit distributions, removal remains the single most effective countermeasure for this attack class. The distillation channel stays open regardless.
If you operate a production model with API access, the record here points to one operational conclusion: the detection capability needs to be proportional to the value of what you are protecting, because prevention is not achievable with any currently deployed technique. The enforcement gap will remain until a DTSA case tests API circumvention as “improper means” at scale, or until an international framework imposes costs on actors that a terms-of-service takedown cannot reach. The behavioral intelligence to build that case has been assembled by three separate frontier labs. The legal system capable of acting on it has not yet been asked to.
Peace. Stay curious! End of transmission.
Fact-Check Appendix
Statement: The embedding projection layer of Ada and Babbage could be extracted for under $20.
Source: Carlini et al., ICML 2024 | https://arxiv.org/abs/2403.06634
Statement: Ada’s hidden dimension is 1,024; Babbage’s is 2,048.
Source: Carlini et al., ICML 2024 | https://arxiv.org/abs/2403.06634
Statement: GPT-3.5-turbo projection matrix recovery estimated at under $2,000.
Source: Carlini et al., ICML 2024 | https://arxiv.org/abs/2403.06634
Statement: GPT-3.5-turbo hidden dimension estimated at approximately 4,096 dimensions.
Source: Finlayson, Ren, Swayamditta, COLM 2024 | https://arxiv.org/abs/2403.09539
Statement: WH = L decomposition with unit-norm constraint enables projection matrix recovery.
Source: Finlayson, Ren, Swayamditta, COLM 2024 | https://arxiv.org/abs/2403.09539
Statement: Near-perfect fidelity extraction of logistic regression, neural network, and decision tree models demonstrated against live commercial platforms including BigML and Amazon Machine Learning.
Source: Tramèr, Zhang, Juels, Reiter, Ristenpart, USENIX Association (USENIX) Security 2016 | https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer
Statement: 84% functional fidelity against Microsoft production API at 27,000 queries (non-adaptive evaluation, facial emotion recognition task).
Source: Liang, Pang, Li, Wang, AsiaCCS 2024 | https://arxiv.org/abs/2312.05386
Statement: DeepSeek conducted over 150,000 exchanges with Claude.
Source: Anthropic, February 23, 2026 | https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
Statement: Moonshot AI conducted over 3.4 million exchanges with Claude.
Source: Anthropic, February 23, 2026 | https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
Statement: MiniMax conducted over 13 million exchanges with Claude.
Source: Anthropic, February 23, 2026 | https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
Statement: Over 24,000 fraudulent accounts used across the three campaigns.
Source: Anthropic, February 23, 2026 | https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
Statement: Hydra cluster proxy networks coordinating up to 20,000 simultaneous accounts.
Source: Anthropic, February 23, 2026 | https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
Statement: Total exchanges across three campaigns exceeded 16 million.
Source: Anthropic, February 23, 2026 | https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
Statement: MiniMax pivoted query strategy within 24 hours of Anthropic releasing updated models.
Source: Anthropic, February 23, 2026 | https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
Statement: Reasoning Trace Coercion campaign used over 100,000 structured prompts against Gemini.
Source: Google GTIG, February 12, 2026 | https://cloud.google.com/blog/topics/threat-intelligence/distillation-experimentation-integration-ai-adversarial-use
Statement: DeepSeek employees developed code to circumvent OpenAI access restrictions via obfuscated third-party routers.
Source: OpenAI, Letter to US House Select Committee on Strategic Competition, February 12, 2026 | https://cdn.openai.com/pdf/045aa967-ee96-4a09-94ee-3098ddf6db2c/OpenAI-US-House-Select-Cmte-Update-[021226].pdf
Statement: DeepSeek R1 reported training cost approximately $6 million.
Source: Khawam, Just Security, March 30, 2026 | https://www.justsecurity.org/134124/costs-china-ai-distillation/
Statement: “It’s been clear for a while now that part of the reason for the rapid progress of Chinese AI models has been theft via distillation.” (Alperovitch quote)
Source: Khawam, Just Security, March 30, 2026 | https://www.justsecurity.org/134124/costs-china-ai-distillation/
Statement: Context-aware rate limiting achieves 94% reduction in exploitation attempts versus IP-based approaches (non-adaptive evaluation).
Source: 2024 systematic analysis of API rate-limiting mechanisms | https://ijsrcseit.com/index.php/home/article/view/CSEIT241061223
Statement: Filing in OpenEvidence, Inc. v. Pathway Medical, Inc. (D. Mass. 2025) argued that strategic digital manipulation to extract protected AI instructions constitutes improper means rather than lawful reverse engineering under the DTSA.
Source: OpenEvidence, Inc. v. Pathway Medical, Inc., D. Mass. 2025 (court filing; no public docket URL available at time of publication).
Statement: EU AI Act Article 78 confidentiality obligations applicable from August 2, 2025.
Source: EU AI Act, Regulation 2024/1689, Article 78 | https://artificialintelligenceact.eu/article/78/
Statement: EU AI Act Article 99 penalties scale by violation type; the ceiling of 35 million euros or 7% of global annual turnover applies to prohibited AI practices under Chapter II; confidentiality violations under Article 78 fall under a lower penalty tier.
Source: EU AI Act, Regulation 2024/1689, Article 99 | https://artificialintelligenceact.eu/article/99/
Top 5 Sources
1. Carlini et al., ICML 2024 (arXiv:2403.06634)
Authors: Nicholas Carlini, Daniel Paleka, and 12 co-authors at Google Research and affiliated institutions. Authoritative as the first empirically demonstrated extraction of a production language model’s projection layer, with reproducible cost figures. ICML Best Paper 2024.
2. Anthropic, “Detecting and Preventing Distillation Attacks,” February 23, 2026
Primary organizational disclosure from a targeted frontier lab, containing per-campaign operational detail including query volumes, account infrastructure, and detection methodology. No comparable level of operational specificity exists elsewhere in the public record.
3. Google GTIG, “Distillation, Experimentation, and (Continued) Integration of AI for Adversarial Use,” February 12, 2026
Official threat intelligence report from Google’s Threat Intelligence Group and Google DeepMind. Documents the Reasoning Trace Coercion campaign with named metrics and real-time detection evidence. Provides the most detailed public account of how a frontier lab’s detection capability operates against these campaigns.
4. Finlayson, Ren, Swayamditta, COLM 2024 (arXiv:2403.09539)
Provides the formal mathematical basis for the projection layer vulnerability through the softmax bottleneck analysis. Independently corroborates Carlini’s hidden dimension findings through a distinct method.
5. Tramèr, Zhang, Juels, Reiter, Ristenpart, USENIX Association (USENIX) Security 2016
Foundational paper establishing the attack class and the accuracy-versus-fidelity distinction that organizes all subsequent model extraction research. The only source in this list that predates the large language model era; included because it remains the authoritative statement of why model stealing is structurally possible against any pay-per-query inference API.