The Two-Brain Security Model: LLM security through tiered inference to stop prompt injection
Route known threats to cheap models and ambiguous calls to expensive ones. The two-tier security architecture cuts LLM inference costs by 76% without sacrificing attack coverage.
TL;DR
Your LLM pipeline has a security problem you haven’t priced yet.
Right now, you’re probably doing one of two things: routing every request through an expensive frontier model (which works, but burns budget at a rate that doesn’t survive contact with scale) or trusting a cheap classifier to catch everything (which works, until a researcher publishes the attack that proves it doesn’t).
The empirical attack literature is unambiguous. Zou et al. demonstrated adversarial suffixes that transfer across model families at an 87.9% success rate. Greshake et al. showed that your RAG pipeline is an injection surface you probably haven’t fully secured. Perez and Ribeiro found that more capable models are more susceptible to injection, not less.
There’s a third path. It’s a 30-year-old idea from network security, now applied to LLM inference: a cheap, fast classifier at the door and an expensive reasoning model handling only the cases the first brain can’t resolve.
The cost math works out to a 76% reduction. The security math works out in the attack research.
Both are inside.
The Itch: Why This Matters Right Now
Here’s a confession most security engineers won’t make out loud: the LLM pipeline you built six months ago is probably screening inputs the wrong way.
Maybe you routed everything to your most capable frontier model, reasoning that if a threat slips through, you’d rather pay more than explain a breach. Totally rational. Also financially unsustainable. At a million requests per month, full-frontier routing to a top-tier model runs around $5,000 monthly for security checks alone, before a single token of actual application inference.
So maybe you went the other direction. You put a cheap classifier at the front door, called it a guardrail, and moved on. Also rational. Also dangerous in ways you won’t discover until a researcher publishes a paper about an attack that worked on your system.
The real pressure isn’t a budget problem or a security problem. It’s both, colliding at the exact moment your pipeline hits production scale. You need to screen millions of inputs per month without going broke, and you need to catch adversarial inputs specifically engineered to look innocent.
I’ve been digging into the empirical attack research, verified pricing data, and 30 years of defense-in-depth doctrine to map out what actually works here. The answer isn’t one model or two models. It’s two models with the right job description for each.
This is the Two-Brain Security Model: a cheap, fast classifier at the door and an expensive reasoning model waiting in the back office. Let me walk you through why the architecture exists, what the attack research demands of it, and exactly when you should trust it, and when you shouldn’t.
The Deep Dive: The Struggle for a Solution
The first thing to understand is that this architecture isn’t new. It’s 30 years old.
A WAF in a new coat
Network security engineers solved this exact problem in the mid-1990s. The doctrine they landed on, layered defense, is documented in the U.S. National Security Agency’s Information Assurance Technical Framework from 1998. The principle: put cheap, fast filters at the perimeter. Reserve expensive, stateful inspection for traffic that makes it through the first screen.
Web Application Firewalls execute this principle to the letter. A modern WAF runs regex-based signature matching and IP reputation checks at sub-millisecond latency, near-zero marginal cost per request. Anything suspicious gets escalated to deeper behavioral analysis: session reconstruction, payload semantics, application-layer protocol inspection. The cost differential between those two tiers routinely exceeds an order of magnitude per request.
Security architects didn’t build this system because cheap detection was sufficient. They built it because routing 100% of traffic through the expensive tier was operationally impossible at scale.
LLM security is rediscovering the same constraint in a new medium. The two-tier pattern isn’t clever; it’s necessary.
What your cheap brain can actually handle
Your Tier 1 model is the reflex brain: fast, decisive, and opinionated about the things it already knows. Think of it as the bouncer who memorized a very specific list. It’s not there to think; it’s there to recognize.
Regex-based injection detection catches the syntactic fingerprints of known attacks: variations of “ignore previous instructions,” common jailbreak templates, explicit system prompt extraction requests. Embedding-based classifiers project input text into a vector space and measure distance from known-benign and known-malicious clusters, adding single-digit millisecond latency per request. PII scanners apply pattern matching for Social Security numbers, credit card formats, email addresses, and phone numbers, all without touching model inference. Topic boundary enforcement uses keyword and n-gram classifiers to reject inputs outside the application’s permitted domain. Output format validators confirm that model responses conform to expected schemas.
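The Tier 1 checks above can be sketched as a single screening function. This is a minimal illustration, not a production ruleset: the signature and PII patterns below are toy examples, and the block/escalate/pass verdict scheme is an assumed convention, not a named framework's API.

```python
import re

# Illustrative signature list for known injection templates (toy examples).
INJECTION_SIGNATURES = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+in\s+developer\s+mode", re.IGNORECASE),
    re.compile(r"(reveal|print|show)\s+(your\s+)?system\s+prompt", re.IGNORECASE),
]

# Illustrative PII patterns: SSN, credit card, email.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def tier1_screen(text: str) -> dict:
    """Return a verdict: 'block' (signature hit), 'escalate' (needs
    Tier 2 judgment), or 'pass' (no findings). No model inference."""
    findings = []
    for sig in INJECTION_SIGNATURES:
        if sig.search(text):
            findings.append(("injection", sig.pattern))
    for label, pat in PII_PATTERNS.items():
        if pat.search(text):
            findings.append(("pii", label))
    if any(kind == "injection" for kind, _ in findings):
        return {"verdict": "block", "findings": findings}
    if findings:
        return {"verdict": "escalate", "findings": findings}
    return {"verdict": "pass", "findings": []}
```

Everything here runs in microseconds per request, which is exactly why it belongs at the front door.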
NVIDIA’s NeMo Guardrails framework runs all of these as parallel rails simultaneously, adding roughly 0.5 seconds of latency for five concurrent checks, achieving a 1.4x improvement in detection rate over single-rail configurations. The parallel execution model mirrors the WAF pattern exactly: run multiple cheap checks concurrently, escalate the survivors.
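The parallel execution model can be sketched in a few lines. Note this is the pattern, not NeMo Guardrails' actual API: the three rails below are crude stand-ins, and the routing decision (signature hits block, topic misses escalate) is an assumed policy for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy rails standing in for real signature, jailbreak, and topic checks.
def rail_injection(text: str) -> bool:
    return "ignore previous instructions" in text.lower()

def rail_jailbreak(text: str) -> bool:
    return "developer mode" in text.lower()

def rail_topic(text: str) -> bool:
    # Flags inputs with no overlap with the app's permitted domain.
    allowed = {"billing", "invoice", "account", "payment"}
    return not any(word in text.lower() for word in allowed)

RAILS = [rail_injection, rail_jailbreak, rail_topic]

def run_rails(text: str) -> str:
    # Rails run concurrently, so latency is max(rail latencies), not the sum.
    with ThreadPoolExecutor(max_workers=len(RAILS)) as pool:
        flags = list(pool.map(lambda rail: rail(text), RAILS))
    if flags[0] or flags[1]:
        return "block"      # hard signature hit: no Tier 2 needed
    if flags[2]:
        return "escalate"   # off-topic but not obviously hostile: Tier 2 judges
    return "pass"
```

The asymmetric handling is the point: a signature hit is decisive on its own, while a softer signal like topic drift is a reason to escalate, not to reject.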
Here’s the catch. Your bouncer is working from a fixed list. That list gets stale.
What the attack research actually tells you
Perez and Ribeiro (2022) built the PROMPTINJECT framework and tested it against a production OpenAI model. They achieved a 58.6% success rate for goal hijacking and a 23.6% success rate for prompt leaking. Those aren’t low numbers. More unsettling was their finding about model capability: more capable models were more susceptible to injection, because superior instruction-following extends equally to malicious instructions. Your Tier 1 classifier can catch the syntactic patterns Perez and Ribeiro used. That’s the good news. The bad news starts with Greshake.
Greshake et al. (2023) introduced indirect prompt injection, and it changes the threat model entirely. The attacker doesn’t interact with your LLM directly. Instead, they plant malicious instructions inside documents, web pages, or emails that your system retrieves during inference. Your RAG pipeline fetches what looks like a legitimate document. Inside that document, invisible to human readers but perfectly parseable by your model, sits an instruction that redirects behavior. The researchers demonstrated live attacks against a GPT-4-powered production search interface and against synthetic LangChain applications. Their framing is precise: retrieved prompts function as arbitrary code inside LLM-integrated applications.
The implication is structural. Any external data that enters your pipeline must be treated as untrusted input, processed through a quarantined path that holds no tool access, before anything it contains reaches a privileged model. Tier 1 can scan retrieved content for known injection patterns. Detecting a novel indirect injection embedded in an otherwise legitimate document requires Tier 2 semantic reasoning. OWASP’s 2025 LLM Top 10 added a dedicated entry for Vector and Embedding Weaknesses (LLM08) to address exactly this surface, noting that 53% of companies using RAG are exposed to embedding poisoning and similarity attacks.
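The quarantined-path requirement can be made concrete. In this sketch, `quarantined_summarize` and `privileged_answer` are hypothetical stand-ins for model calls (here simulated with plain functions); the structural point is the data flow: the raw retrieved document never reaches the tool-holding model.

```python
def tier1_scan(text: str) -> bool:
    # Stand-in for the known-pattern rail on retrieved content.
    return "ignore previous instructions" in text.lower()

def quarantined_summarize(document: str) -> str:
    # Stand-in for a quarantined model call: reads untrusted text,
    # holds no tool access. Simulated here as a crude truncation.
    return document[:200]

def privileged_answer(question: str, context: str) -> str:
    # Stand-in for the privileged, tool-holding model call.
    return f"[answer to {question!r} using {len(context)} chars of vetted context]"

def handle_retrieval(document: str, question: str) -> str:
    """Untrusted retrievals never reach the privileged model raw."""
    if tier1_scan(document):
        return "retrieval rejected: matched known injection pattern"
    summary = quarantined_summarize(document)   # quarantine boundary
    return privileged_answer(question, context=summary)
```

The scan catches the known patterns; the quarantine hop limits the blast radius of the novel ones it misses.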
Then Zou et al. (2023) arrived with the finding that should give every Tier 1 engineer genuine pause. Their Greedy Coordinate Gradient algorithm, optimized on small open-source models, achieved an 87.9% attack success rate against GPT-3.5 and a 53.6% success rate against GPT-4 in transfer attacks. In the same study, even Claude, the most resistant model tested, showed a 2.1% success rate. Attacks generated to target a 7-billion-parameter open-source model still occasionally succeeded against production frontier models that were never part of the training target.
This is the core failure mode of a Tier 1-only approach. Your classifier is trained on a fixed corpus. Zou et al.’s result means the attack distribution is not fixed. New adversarial suffix families emerge continuously, each one a blind spot until you retrain. A Tier 1 system without Tier 2 escalation is a bouncer who only memorized the old fake IDs and doesn’t know anyone’s been printing new ones.
Toyer et al. (2023) quantified how creative human attackers actually are. Their TensorTrust benchmark collected 126,808 human-generated adversarial attacks and 46,457 defenses, the largest public dataset of its kind. Attacks designed for GPT-3.5 Turbo generalized to Claude Instant 1.2 and PaLM 2. Human creativity transfers across model families even without algorithmic optimization.
What your expensive brain is actually for
Tier 2 is the reasoning model waiting in the back office: patient, contextual, and expensive enough that you only want to involve it when the bouncer genuinely can’t decide.
Multi-turn semantic analysis evaluates whether a sequence of individually benign messages constitutes a coordinated injection across conversation turns. Contextual intent disambiguation determines whether a request like “show me how the system processes passwords” is a legitimate developer query or a social engineering probe; that judgment requires understanding the user’s role, session history, and application context. Novel attack pattern detection addresses inputs that match no known signature but exhibit anomalous structure. Agentic action authorization evaluates whether a proposed tool invocation is consistent with user intent and within the agent’s permitted scope, corresponding to OWASP LLM06 (Excessive Agency).
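The routing decision between tiers can be expressed as a simple disjunction over the four responsibilities above. The signal names here are hypothetical; in practice they would be populated from Tier 1 outputs and session state.

```python
def should_escalate(signals: dict) -> bool:
    """Escalate to Tier 2 if any of the four Tier 2 responsibilities
    is implicated. Signal names are illustrative placeholders."""
    return any([
        signals.get("multi_turn_suspicion", False),  # cross-turn coordination
        signals.get("ambiguous_intent", False),      # intent unresolvable locally
        signals.get("anomalous_structure", False),   # no signature, odd shape
        signals.get("tool_call_pending", False),     # agentic action needs review
    ])
```

The `any()` is deliberate: a single uncertain signal is enough to justify the Tier 2 spend, for reasons the cost asymmetry below makes explicit.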
Simon Willison formalized this architecture in 2023 with his Dual LLM pattern: a Privileged LLM that accepts input only from trusted sources and holds tool access, and a Quarantined LLM that processes untrusted content but holds no tool access. The controller between them enforces symbolic variable references so the Privileged model never sees raw untrusted content, only abstracted placeholders. Google DeepMind’s CaMeL system (2025) extended this further with data flow analysis through a custom Python interpreter, tracking taint propagation across every variable and enabling capability-based access control rather than binary trust/distrust boundaries.
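The symbolic-reference mechanism at the heart of the Dual LLM pattern can be sketched with a minimal controller. This is a simplified illustration of the idea, not Willison's reference implementation: untrusted content is stored under an opaque handle, and the privileged model's prompt only ever contains that handle, with raw content substituted at the final tool boundary.

```python
class Controller:
    """Holds untrusted content; hands the privileged model only handles."""

    def __init__(self):
        self._store = {}
        self._counter = 0

    def quarantine(self, untrusted_text: str) -> str:
        """Store untrusted content; return an opaque handle like $VAR1."""
        self._counter += 1
        handle = f"$VAR{self._counter}"
        self._store[handle] = untrusted_text
        return handle

    def privileged_prompt(self, template: str) -> str:
        """Build the privileged model's prompt. Handles stay symbolic;
        the raw content never enters this prompt."""
        return template

    def resolve(self, handle: str) -> str:
        """Substitute real content only at the tool boundary, after the
        privileged model has committed to an action."""
        return self._store[handle]

ctrl = Controller()
h = ctrl.quarantine("Forwarded email body, possibly carrying an injection")
prompt = ctrl.privileged_prompt(f"Summarize the email stored in {h}.")
```

The privileged model can reason *about* `$VAR1` without ever being able to read, and therefore obey, what is inside it.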
The routing asymmetry between tiers is where the security logic lives. In this context, the cost of a missed attack vastly exceeds the cost of escalating a benign request unnecessarily. A single Tier 2 escalation on a 500-token input costs approximately $0.003. IBM’s annual Cost of a Data Breach Report consistently places average breach cost above $4 million. Even a conservative $10,000 estimate for a single successful injection incident represents the equivalent of over three million unnecessary Tier 2 escalations. The asymmetry is not marginal; it is six orders of magnitude. Your router should be calibrated to over-escalate rather than under-escalate, accepting higher Tier 2 invocation rates in exchange for lower false-negative rates.
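The asymmetry is worth working through explicitly. The token counts (500 in / 100 out) and the $10,000 incident figure are the article's stated assumptions, not measurements:

```python
# Sonnet pricing from the cited price list: $ per million tokens.
SONNET_IN, SONNET_OUT = 3.00, 15.00

# One Tier 2 escalation: 500 input tokens, ~100 output tokens (assumed).
escalation_cost = 500 / 1e6 * SONNET_IN + 100 / 1e6 * SONNET_OUT

# Deliberately conservative incident cost; IBM's breach average is far higher.
incident_cost = 10_000
escalations_per_incident = incident_cost / escalation_cost

print(f"one escalation: ${escalation_cost:.4f}")                     # $0.0030
print(f"escalations worth one incident: {escalations_per_incident:,.0f}")
```

At roughly 3.3 million escalations per avoided incident, over-escalation is not a cost problem; it is the cheapest insurance in the stack.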
The cost arithmetic
Anthropic’s official API pricing documents the economics precisely. Claude Haiku 4.5 runs at $1 per million input tokens and $5 per million output tokens, operating 4 to 5 times faster than Sonnet 4.5. Sonnet 4.5 runs at $3 per million input tokens and $15 per million output tokens. Opus 4.5 reaches $5 per million input tokens and $25 per million output tokens.
For a pipeline processing 1 million requests per month, routing everything to Opus 4.5 (assuming 500 input tokens and 100 output tokens per security check) runs to $5,000 monthly. A 90/10 split routing the bulk of traffic to Haiku and escalating a tenth to Sonnet reduces that to $1,200 monthly: a 76% cost reduction, saving $3,800 per month. That’s not rounding error. That’s a separate engineering hire.
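The arithmetic above can be reproduced directly. Prices are from the cited price list; the per-check token counts and the 90/10 split are the article's assumptions:

```python
PRICES = {                  # ($ per M input tokens, $ per M output tokens)
    "haiku":  (1.0, 5.0),
    "sonnet": (3.0, 15.0),
    "opus":   (5.0, 25.0),
}
REQUESTS = 1_000_000
IN_TOK, OUT_TOK = 500, 100  # assumed tokens per security check

def monthly_cost(model: str, requests: int) -> float:
    p_in, p_out = PRICES[model]
    return requests * (IN_TOK / 1e6 * p_in + OUT_TOK / 1e6 * p_out)

full_opus = monthly_cost("opus", REQUESTS)                    # $5,000/mo
tiered = (monthly_cost("haiku", int(REQUESTS * 0.9))
          + monthly_cost("sonnet", int(REQUESTS * 0.1)))      # $1,200/mo
savings_pct = (full_opus - tiered) / full_opus * 100          # 76%
```

Note the sensitivity: the savings figure is dominated by the escalation rate. If your router escalates 30% of traffic instead of 10%, rerun this before quoting a number.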
The Resolution: Your New Superpower
The Two-Brain Security Model gives you a defensible architecture, and that word, defensible, is doing real work.
The OWASP LLM Top 10 2025, while not a formal regulation, functions as a de facto compliance baseline for organizations deploying LLM applications. Mapping your tier coverage to its ten entries produces an audit artifact. For LLM01 (Prompt Injection), Tier 1 handles known patterns while Tier 2 covers novel and indirect variants. LLM02 (Sensitive Information Disclosure) splits similarly: PII regex scanning in Tier 1 for single-turn outputs, Tier 2 for multi-turn conversational contexts where disclosure emerges from conversational dynamics. LLM05 (Improper Output Handling) falls almost entirely to Tier 1 through deterministic format validation. LLM06 (Excessive Agency) belongs to Tier 2, because evaluating an agent’s proposed tool invocations requires contextual reasoning no classifier can replicate. LLM10 (Unbounded Consumption) falls back to Tier 1: rate limiting and token counting are exactly the deterministic checks a cheap model handles cleanly.
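The mapping described above can itself be the audit artifact. Entry names follow the OWASP LLM Top 10 2025; the coverage labels encode the article's assessment, in a shape review tooling can check:

```python
# Tier coverage per OWASP LLM Top 10 2025 entry (subset discussed above).
OWASP_TIER_MAP = {
    "LLM01 Prompt Injection": {
        "tier1": "known patterns", "tier2": "novel + indirect variants"},
    "LLM02 Sensitive Information Disclosure": {
        "tier1": "single-turn PII regex", "tier2": "multi-turn disclosure"},
    "LLM05 Improper Output Handling": {
        "tier1": "deterministic format validation", "tier2": None},
    "LLM06 Excessive Agency": {
        "tier1": None, "tier2": "tool-invocation authorization"},
    "LLM10 Unbounded Consumption": {
        "tier1": "rate limits + token counting", "tier2": None},
}

def uncovered(mapping: dict) -> list:
    """Entries with no coverage in either tier: the audit's red flags."""
    return [k for k, v in mapping.items() if not (v["tier1"] or v["tier2"])]
```

Extending the dict to all ten entries and asserting `uncovered(...) == []` in CI turns the compliance mapping from a document into a test.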
As EU AI Act requirements mature through 2026, this tier-to-OWASP mapping will increasingly function as a compliance prerequisite rather than a voluntary internal review.
The architecture’s maintenance obligation is real and non-negotiable. Your tier boundary degrades as adversarial techniques evolve. Four metrics tell you whether your boundary is healthy: your Tier 1 false-negative rate, your Tier 2 false-positive rate, escalation rate drift over time, and the adversarial perplexity of inputs arriving at Tier 2. A rising escalation rate may signal that your Tier 1 classifiers are losing discriminative power against evolving attack distributions. Review the tier boundary monthly for high-traffic pipelines, not in response to observed incidents. If you wait for an incident, the classifier degradation you could have caught analytically has already enabled an attack.
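The monthly review of those four metrics can be sketched as a single check. The thresholds below are illustrative placeholders; calibrate them against your own baselines, not these numbers:

```python
def boundary_health(metrics: dict) -> list:
    """Return warnings from a tier-boundary review. Thresholds are
    illustrative; tune against your pipeline's measured baselines."""
    warnings = []
    if metrics["tier1_false_negative_rate"] > 0.01:
        warnings.append("Tier 1 missing attacks: retrain signatures/classifiers")
    if metrics["tier2_false_positive_rate"] > 0.30:
        warnings.append("Tier 2 sees mostly benign traffic: tighten the router")
    if metrics["escalation_rate"] > 1.5 * metrics["baseline_escalation_rate"]:
        warnings.append("Escalation drift: Tier 1 losing discriminative power")
    if metrics["tier2_input_perplexity"] > 2 * metrics["perplexity_baseline"]:
        warnings.append("Anomalous inputs at Tier 2: possible new attack family")
    return warnings
```

Run it on a schedule, alert on a non-empty return, and the boundary review stops depending on someone remembering to do it.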
The architecture is also not universally correct to implement. For pipelines handling fewer than 100,000 requests per month without agentic capabilities or external data retrieval, a single frontier model may be equally effective and substantially simpler to operate and secure. Willison himself acknowledged that the quarantine boundary is not a complete isolation mechanism: data corruption in a Quarantined LLM can still enable exfiltration if the Privileged LLM acts on corrupted output. CaMeL addresses this through data flow analysis, but at significant added complexity, including a custom Python interpreter and explicitly defined user security policies. More components mean more configuration surfaces and more potential failure points.
The Two-Brain Security Model is conditionally optimal. The conditions: pipelines exceeding 100,000 requests per month, with agentic capabilities or RAG accessing external data, where cost reduction and OWASP coverage both matter. Under those conditions, the tiered architecture reduces inference costs by 60 to 80 percent relative to full-frontier routing while maintaining comparable security coverage, provided you continuously monitor Tier 1 false-negative rates and recalibrate the boundary on a regular schedule.
Your next step is an audit, not a build. Map your current pipeline’s request volume and risk profile against the four routing health metrics. If you’re above 100,000 requests per month and running monolithic, the math already told you the answer. The question is just when you start acting on it.
Peace. Stay curious! End of transmission.
References
Zou et al. (2023), “Universal and Transferable Adversarial Attacks on Aligned Language Models” — Carnegie Mellon University, Google DeepMind, Center for AI Safety; 2,230+ citations; responsible disclosure to OpenAI, Google, Meta, and Anthropic prior to publication. (https://arxiv.org/abs/2307.15043)
Greshake et al. (2023), “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection” — CISPA Helmholtz Center for Information Security and Saarland University; demonstrated live attacks on production systems. (https://arxiv.org/abs/2302.12173)
Perez and Ribeiro (2022), “Ignore Previous Prompt: Attack Techniques For Language Models” — NeurIPS ML Safety Workshop, Best Paper Award; introduced the PROMPTINJECT framework and the foundational attack taxonomy. (https://arxiv.org/abs/2211.09527)
Toyer et al. (2023), “Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game” — UC Berkeley, Georgia Tech, Harvard; publicly released dataset of 126,808 human-generated adversarial attacks. (https://arxiv.org/abs/2311.01011)
OWASP Top 10 for LLM Applications 2025 — OWASP Foundation; the primary industry standard taxonomy for LLM-specific security risks, used as the mapping framework for tier-boundary compliance auditing. (https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-v2025.pdf)