When the Research Tool and the Attack Tool Are the Same System
AI agents are now automating exploit chain construction at production scale. Learn how this shifts the economics of vulnerability triage and remediation.
A note on cadence: I am reducing the publishing frequency for this journal. Behind each article sits a research phase, a full monograph, source vetting, multiple editorial review passes, and the lab time that grounds the analysis in something real. Two articles per week at that depth is not sustainable without sacrificing the very work that makes the articles worth writing. Going forward, I am targeting one article per week, starting this week. Publication moves to Wednesdays from next week onward. Thank you for your understanding.
Disclaimer
This article is intended for informational purposes and reflects the state of published research and industry practice as of early 2026. It is not professional security advice. Your specific environment, threat model, and regulatory obligations will shape how these principles apply to your situation.
For Security Leaders
AI agents have moved from discovering isolated bugs to constructing working exploit chains at production scale. This shift automates the most time-intensive part of vulnerability research, providing attackers with actionable proofs while defenders struggle with collapsing exploitation windows. The structural risk is that your remediation capacity stays fixed while the volume and quality of actionable findings against your codebase keep rising.
What this means for your organization:
Vulnerability triage economics have shifted. Findings that arrive with automated proofs of exploitability bypass traditional triage bottlenecks and demand immediate prioritization.
The exploitation window is now negative. Attackers are routinely exploiting vulnerabilities before patches are available, making traditional patch speed an insufficient primary defense.
Dependency reachability is the new signal. At AI scale, scanning transitive dependencies without automated reachability analysis creates an unmanageable volume of noise for development teams.
What to tell your teams:
Deploy adversarial validation harnesses. Use review-only agent instances to filter hunting agent findings, reducing false positive rates before they reach human reviewers.
Prioritize reachability over discovery. Focus remediation efforts on vulnerabilities where tracing agents confirm that untrusted input can reach the flaw through your call graph.
Invest in exploitability reduction. Strengthen network-layer controls and isolation boundaries to block exploit primitives, rather than relying solely on regression-sensitive patch cycles.
Log and monitor model refusals. Treat silent model refusals in security harnesses as coverage gaps that must be logged and addressed through pre-request routing logic.
Exploit Chain Construction Has Crossed Into Production, and That Reframes What a Confirmed Finding Actually Costs to Validate
The capability trail starts in October 2024, when Google Project Zero’s Big Sleep project, extending an earlier research harness called Naptime, published the first publicly confirmed case of an AI agent discovering a previously unknown exploitable memory-safety vulnerability in widely used production software. The target was SQLite. Using Gemini 1.5 Pro running a hypothesis-driven research loop, the agent identified a stack buffer underflow in the seriesBestIndex function that American Fuzzy Lop (AFL), a widely used mutation fuzzer that shares its name with a breed of rabbit, had not found after 150 CPU-hours. The SQLite developers patched it the same day it was reported. Naptime’s published benchmark figures reflect the best result across 20 independent runs, not single-pass performance.
The shift I would mark in 2026 is not another benchmark result. It is that this class of capability moved into a production harness against real infrastructure, and the operational findings from that deployment are now documented: Cloudflare’s internal deployment produced 2,000 total bug reports across its production codebases, 400 of them high or critical severity, while Anthropic’s broader open-source scanning program reviewed more than 1,000 projects and identified an estimated 6,202 high and critical-severity vulnerabilities. The Project Glasswing findings I discuss are sourced from Cloudflare’s report and Anthropic’s initial project update, both published in May 2026. No independent testing of or access to the system was conducted.
The finding that matters most from Cloudflare’s Project Glasswing is not the bug count. Previous general-purpose frontier models found a substantial fraction of the same underlying bugs. The threshold crossed is exploit chain construction: taking multiple individually low-severity vulnerability primitives and reasoning through how to combine them into a single working proof of concept. A use-after-free becomes an arbitrary read/write primitive. That enables control-flow hijacking. From there, a return-oriented programming (ROP) chain can take full system control. Previous models identified the individual components and wrote accurate descriptions of why they mattered, then stopped before stitching the chain. Mythos Preview closes that gap and does so with a working proof: the model writes the triggering code, compiles it, executes it, reads the failure output if the first attempt fails, revises the hypothesis, and reruns.
This distinction has a direct operational consequence. A finding that arrives with a working proof is an actionable finding. A finding without one requires a researcher to independently verify exploitability before it can enter the remediation queue. At scale, the difference between these two states is the difference between a manageable queue and an unmanageable one. Triage economics shift the moment proof generation becomes automated. For a development team without a dedicated security engineering function, that shift is precisely what makes the build question worth asking.
Meaningful Coverage Requires a Harness, Not a Better Prompt, and the Triage Problem Is What That Harness Exists to Solve
A single agent session against a 100,000-line codebase covers roughly one-tenth of one percent of the attack surface before context window saturation discards earlier findings. This is a structural constraint, not a model quality issue. Vulnerability research is narrow and parallel by nature: a researcher picks one attack class in one code region, exhausts it, and moves to the next, running many such threads simultaneously across the codebase. A single agent operates sequentially, holds one hypothesis at a time, and hits the context ceiling long before it has covered the codebase.
AI-assisted vulnerability research does not displace the tools teams already run. Static analysis tools perform syntactic pattern matching against known vulnerability signatures without executing code. Coverage-guided fuzzing generates malformed inputs at machine speed but cannot reason about program semantics or variant conditions. Traditional penetration testing applies human hypothesis-driven investigation but does not parallelize at codebase scale. AI-assisted vulnerability research occupies the intersection of semantic reasoning, automated proof generation, and parallel execution. It does not substitute for the signal each prior approach produces; it addresses the coverage and proof-generation gaps they leave.
The harness model is the response to the coverage constraint, and I would separate two problems it has to solve in sequence: coverage scope first, then false positive volume. Getting coverage wrong makes the false positive problem moot.
Reachability Across the Dependency Chain Is What Separates a Signal From Noise
Cloudflare’s production deployment runs approximately fifty concurrent hunting agents during a scan. Each agent is scoped to one attack class paired with one code region and is supplied with an architecture document covering build commands, trust boundaries, entry points, and prior coverage. That document is generated first by a root reconnaissance agent, before any hunting begins. Every downstream agent starts with shared context rather than spending turns reconstructing what the codebase does.
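The scoping described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cloudflare’s implementation: the attack-class and region labels are hypothetical, and `hunt` is a placeholder standing in for a real agent invocation. What it shows is one task per (attack class, code region) pair, a shared architecture document handed to every task, and a cap of fifty concurrent hunters.

```python
import asyncio
from dataclasses import dataclass
from itertools import product

MAX_CONCURRENT_HUNTERS = 50  # "approximately fifty" in the text above


@dataclass(frozen=True)
class HuntTask:
    attack_class: str      # hypothetical label, e.g. "use-after-free"
    code_region: str       # hypothetical path prefix, e.g. "src/parser"
    architecture_doc: str  # shared context produced by the reconnaissance stage


def build_tasks(attack_classes, code_regions, architecture_doc):
    """Create one narrowly scoped task per (attack class, code region) pair."""
    return [
        HuntTask(attack_class, region, architecture_doc)
        for attack_class, region in product(attack_classes, code_regions)
    ]


async def hunt(task: HuntTask) -> dict:
    """Placeholder for a real agent call; it always returns an empty result."""
    await asyncio.sleep(0)
    return {"scope": (task.attack_class, task.code_region), "findings": []}


async def run_scan(tasks):
    """Run every task, with at most MAX_CONCURRENT_HUNTERS in flight at once."""
    limit = asyncio.Semaphore(MAX_CONCURRENT_HUNTERS)

    async def bounded(task):
        async with limit:
            return await hunt(task)

    return await asyncio.gather(*(bounded(task) for task in tasks))
```

Calling `asyncio.run(run_scan(build_tasks(classes, regions, doc)))` returns one result per task, in task order. The design point is that scope lives in the task definition rather than in the prompt, so coverage can be audited by listing tasks.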
The scope problem extends past the first-party boundary. An audit of more than 1,000 commercial codebases across 17 industries found that 84% contained at least one known open source vulnerability and 74% contained high-risk vulnerabilities, a 54% increase from the prior year. Transitive dependency vulnerabilities are a documented, measurable component of most production software, not an edge case.
The architectural response to this is a dedicated tracing stage. For each confirmed finding in a shared library, a tracer agent spawns one instance per consumer repository, uses a cross-repository symbol index, and determines whether attacker-controlled input actually reaches the flaw from outside the system. Without this, dependency scanning produces volume without clarity. A finding in a third-party library without a reachability verdict occupies the same queue position as a confirmed exploitable vulnerability in first-party code but carries a fundamentally different remediation priority. Deferring the tracing stage in an initial deployment is a reasonable decision under resource constraints, but it should be a deliberate one: a harness without it accepts a known coverage gap in the dependency surface of the majority of production codebases.
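To make the reachability verdict concrete, here is a simplified sketch. It assumes each consumer repository has already been reduced to a call graph (a dict of caller to callees) and a list of untrusted entry points; those structures, and every symbol name in the usage note, are hypothetical. It checks call-graph reachability only, which is weaker than proving that attacker-controlled data reaches the flaw, so treat a positive verdict as a reason to look closer rather than as a proof.

```python
from collections import deque


def is_reachable(call_graph, entry_points, vulnerable_symbol):
    """Breadth-first search from untrusted entry points to the flawed symbol."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        caller = queue.popleft()
        if caller == vulnerable_symbol:
            return True
        for callee in call_graph.get(caller, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False


def trace_consumers(consumers, vulnerable_symbol):
    """Produce one reachability verdict per consumer repository."""
    return {
        name: is_reachable(repo["call_graph"], repo["entry_points"], vulnerable_symbol)
        for name, repo in consumers.items()
    }
```

Given a consumer whose `http_handler` calls `parse_body`, which calls `libfoo.decode`, the verdict for `libfoo.decode` is reachable. A consumer where only an internal `admin_tool` function calls it, and `admin_tool` is not an entry point, gets a not-reachable verdict and a lower queue position.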
The False Positive Rate Is the Operational Constraint That Determines Whether a Development Team Can Run This Without a Security Function
Production deployments show what the feasibility argument requires. At Trail of Bits, AI-augmented auditors report finding 200 bugs per week on qualifying engagements, up from a baseline of approximately 15 per week before their AI-native transformation, with roughly 20% of reported findings initially discovered by AI systems. That figure is a ceiling under well-matched codebase conditions, not a typical result for an arbitrary codebase or a team without established AI tooling. That throughput is achievable because of the harness architecture, not despite it. Anthropic’s Project Glasswing open-source scanning provides a production-scale validation figure: of 1,752 high and critical-severity candidates independently assessed by six security firms, 90.6% were confirmed as valid true positives, a 9.4% false positive rate after structured triage. That figure reflects a triage process with independent verification, not raw model output. The Trail of Bits throughput ceiling also sets the scale of the triage problem: at that volume, a team without structured false positive reduction will spend most of its capacity on noise rather than on findings.
The false positive problem predates AI-assisted discovery. A union of four established static analysis tools against Java codebases produces a false positive detection rate exceeding 92%: more than 92 out of every 100 flagged items are not real vulnerabilities. Large language model (LLM)-assisted filtering in the best documented configuration reduces that figure to 6.3%. The cost of that filtering pass: $0.047 to $0.187 per finding in model spend. At 1,000 findings, the total is $47 to $187. A security analyst in the US market costs $75 to $200 per hour at published industry rates; even at the top rate of $0.187 per AI filtering pass, one analyst-hour of budget buys roughly 400 to 1,070 filtering passes. (The analyst rate comparison is a derived estimate from industry compensation benchmarks, not a cited statistic.) The economics of automated filtering are not controversial at any realistic cost ratio. The question is which vulnerability classes it can be trusted to filter.
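The cost comparison is simple enough to check directly. The sketch below uses only the ranges quoted above; the function names are mine, and the analyst rate remains a derived estimate. The passes-per-hour result depends on which ends of the two ranges are paired, so the code reports the least and most favorable pairings.

```python
COST_PER_FINDING = (0.047, 0.187)  # USD per filtering pass, range cited above
ANALYST_HOURLY = (75.0, 200.0)     # USD per hour, derived estimate cited above


def filtering_cost(findings: int, cost_per_finding: float) -> float:
    """Total model spend to filter a batch of findings."""
    return findings * cost_per_finding


def passes_per_analyst_hour(hourly_rate: float, cost_per_finding: float) -> float:
    """How many filtering passes one analyst-hour of budget buys."""
    return hourly_rate / cost_per_finding


low_total = filtering_cost(1000, COST_PER_FINDING[0])
high_total = filtering_cost(1000, COST_PER_FINDING[1])

# Least favorable pairing: cheapest analyst against most expensive filtering.
worst_case = passes_per_analyst_hour(ANALYST_HOURLY[0], COST_PER_FINDING[1])
# Most favorable pairing: most expensive analyst against cheapest filtering.
best_case = passes_per_analyst_hour(ANALYST_HOURLY[1], COST_PER_FINDING[0])

print(f"1,000 findings: ${low_total:.0f} to ${high_total:.0f}")
print(f"filtering passes per analyst-hour of budget: {worst_case:.0f} to {best_case:.0f}")
```

Even the least favorable pairing buys about 400 filtering passes for the cost of one analyst-hour, which is why the cost side of the argument is not where the risk sits.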
The figure that matters more than the headline reduction is the trade-off inside it. The best filtering configuration retained only 77.7% of true vulnerabilities, incorrectly suppressing 22.3% of genuine findings. That miss rate is not uniformly distributed. Command injection and SQL injection see near-ceiling filtering performance. Weak cryptography and domain-knowledge-dependent categories see miss rates above 77%. A team deploying LLM-based false positive filtering has to account for which vulnerability classes will be systematically discarded before findings reach human review, and build the risk acceptance posture accordingly.
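One way to act on that uneven miss rate is to decide per vulnerability class whether the filter is allowed to suppress anything at all. The sketch below is an assumption about how a team might encode that policy, not something the cited study prescribes. The recall numbers a team feeds in must come from its own measurements; the threshold and class labels here are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy threshold: below this measured recall, the filter
# is not trusted to suppress findings for that class.
MIN_TRUSTED_RECALL = 0.90


@dataclass(frozen=True)
class Finding:
    finding_id: str
    vuln_class: str  # hypothetical label, e.g. "sql-injection"


def route(finding: Finding, measured_recall: dict, min_recall: float = MIN_TRUSTED_RECALL) -> str:
    """Send a finding to the LLM filter only when the filter is trusted for its class."""
    recall = measured_recall.get(finding.vuln_class)
    if recall is None or recall < min_recall:
        # Unknown or weak class: never auto-suppress, go straight to a person.
        return "human_review"
    return "llm_filter"
```

With measured recall of 0.97 for injection findings and 0.20 for weak-cryptography findings, the first group goes through the filter and the second bypasses it. The risk acceptance decision then becomes an explicit table rather than a side effect of the filter’s blind spots.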
Memory-unsafe codebases compound the noise. C and C++ permit bug classes that memory-safe languages eliminate at compile time; a model scanning C or C++ code will correctly identify patterns that would be vulnerabilities given the language’s semantics, producing higher noise rates than an equivalent scan of Rust or Go. Cloudflare observed this pattern directly across the Glasswing repositories. The 92% and 6.3% false positive rate figures cited above apply specifically to Java codebases; no published peer-reviewed quantitative characterization of these rates for C or C++ codebases exists at time of writing, and the 6.3% floor should not be assumed to transfer to memory-unsafe scans.
A field observation from curl provides directional evidence. When Mythos Preview scanned curl in May 2026, lead maintainer Daniel Stenberg documented the outcome publicly: one confirmed vulnerability (low severity), three false positives from API limitations, and one genuine bug that was not a security issue, from five items reported as findings. That is four of five reports (80%) that were not security vulnerabilities, or three of five (60%) counting only the outright false positives, against a C codebase with decades of active security review. Stenberg’s broader observation cuts against dismissing the tool category outright: “Any project that has not scanned their source code with AI powered tooling will likely find huge number of flaws.” A codebase at the curl end of the maintenance spectrum will see a different result than the Trail of Bits ceiling case, and treating the ceiling figure as a deployment estimate for a well-maintained codebase would produce an incorrect budget and an incorrect coverage expectation.
The adversarial validation stage addresses the noise that language-level filtering leaves behind. A second independent agent reads the code identified by the hunter and attempts to disprove the finding. The validator operates on a different prompt, uses a different model instance, and has no ability to emit new findings of its own. Two agents in deliberate disagreement catch a meaningful fraction of the noise a self-reviewing agent would pass through, because the validator’s prior is configured toward dismissal rather than confirmation. This structural separation does not require sophisticated prompt engineering; it requires treating the finding agent and the review agent as adversaries by design. One distinction matters for evaluating the evidence: the quantitative false positive rate reduction data (Song et al., Java codebases, LLM agent frameworks) measures a filtering-agent approach, not an adversarial-pair design. The adversarial pair architecture is documented from Cloudflare’s Glasswing deployment as an operational observation without a published baseline comparison. Both approaches reduce noise; the evidence base for each is distinct.
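The structural separation can be expressed in the types alone. In this sketch, which is my own illustration and not the Glasswing code, the validator is any callable that takes a finding and returns a verdict; it has no channel through which to emit a new finding. The `Finding` fields and the stub validator in the usage note are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List


class Verdict(Enum):
    CONFIRMED = "confirmed"
    DISMISSED = "dismissed"


@dataclass(frozen=True)
class Finding:
    finding_id: str
    location: str
    claim: str


# The validator can only answer with a verdict; it cannot return findings.
Validator = Callable[[Finding], Verdict]


def adversarial_filter(findings: Iterable[Finding], validator: Validator) -> List[Finding]:
    """Keep only the findings the validator failed to disprove."""
    survivors = []
    for finding in findings:
        verdict = validator(finding)
        if not isinstance(verdict, Verdict):
            raise TypeError("validator must return a Verdict and nothing else")
        if verdict is Verdict.CONFIRMED:
            survivors.append(finding)
    return survivors
```

In a real harness the validator would wrap a separate model instance with its own dismissal-oriented prompt. For a dry run, a stub such as `lambda f: Verdict.DISMISSED` exercises the pipeline and confirms that nothing survives a validator that rejects everything.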
Model Refusals Create Coverage Gaps That Do Not Appear in Finding Counts
The harness introduces a constraint that does not show up in false positive rates or queue depth. The model will sometimes refuse tasks, and it will not do so consistently. Because model output is probabilistic, semantically equivalent requests, framed differently or presented after an unrelated environmental change, can produce opposite outcomes across independent runs.
The operational consequence is specific. A harness that relies on organic model refusals as a safety boundary will have unpredictable coverage gaps. Refusals fail in both directions: blocking legitimate research tasks and permitting tasks that should be blocked, depending on framing. Silent refusals that produce no output and no log entry create invisible gaps that will not appear in finding counts unless the harness explicitly logs refusals as a first-class metric.
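Logging refusals as a first-class metric can look like the sketch below. It assumes the harness sees the model response as plain text, or as nothing at all; the marker strings are hypothetical and deliberately crude. A real harness should rely on whatever refusal signal its model API actually exposes rather than on string matching.

```python
import logging
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Optional

log = logging.getLogger("harness.refusals")

# Hypothetical markers for illustration only.
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "unable to comply")


@dataclass
class TaskOutcome:
    task_id: str
    status: str  # "completed", "refused", or "empty"
    output: Optional[str]


def classify(task_id: str, output: Optional[str]) -> TaskOutcome:
    """Label a response so that silent and explicit refusals are both visible."""
    if output is None or not output.strip():
        return TaskOutcome(task_id, "empty", output)  # silent refusal candidate
    if any(marker in output.lower() for marker in REFUSAL_MARKERS):
        return TaskOutcome(task_id, "refused", output)
    return TaskOutcome(task_id, "completed", output)


def run_with_refusal_metrics(
    tasks: Dict[str, str], invoke: Callable[[str], Optional[str]]
) -> Counter:
    """Run each task and count outcomes, logging every non-completion as a gap."""
    counts: Counter = Counter()
    for task_id, prompt in tasks.items():
        outcome = classify(task_id, invoke(prompt))
        counts[outcome.status] += 1
        if outcome.status != "completed":
            log.warning("coverage gap: task=%s status=%s", task_id, outcome.status)
    return counts
```

The returned counter sits next to the finding count in a scan report, so a scan with zero findings and thirty empty responses reads as a coverage gap rather than a clean result.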
The architectural compensation is pre-request routing logic that validates each task against defined authorization parameters before model invocation, rather than delegating that determination to the model at runtime. Cloudflare’s Glasswing deployment found organic guardrails real but not reliable enough to serve as the primary safety boundary in a production pipeline. Anthropic’s initial project update states the systemic position directly: no company currently possesses safeguards sufficient to prevent misuse of models with Mythos-level capabilities. The implication for a production harness is that the safety boundary cannot be delegated to the model. It must be built into the routing layer before the model is invoked.
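A minimal sketch of that routing layer follows. The policy fields, repository names, and task types are hypothetical; the sources do not describe what the authorization parameters are in Cloudflare’s deployment. The point it illustrates is ordering: the authorization decision is made in ordinary code, and the model is only invoked if that check passes.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class AuthorizationPolicy:
    allowed_repos: FrozenSet[str]
    allowed_task_types: FrozenSet[str]


@dataclass(frozen=True)
class TaskRequest:
    repo: str
    task_type: str
    prompt: str


class UnauthorizedTask(Exception):
    """Raised when a request falls outside the defined authorization parameters."""


def authorize(request: TaskRequest, policy: AuthorizationPolicy) -> None:
    """Reject out-of-scope requests deterministically, without consulting the model."""
    if request.repo not in policy.allowed_repos:
        raise UnauthorizedTask(f"repo not in scope: {request.repo}")
    if request.task_type not in policy.allowed_task_types:
        raise UnauthorizedTask(f"task type not permitted: {request.task_type}")


def dispatch(
    request: TaskRequest,
    policy: AuthorizationPolicy,
    invoke_model: Callable[[str], str],
) -> str:
    """Authorize first; the model never sees a request that fails the check."""
    authorize(request, policy)
    return invoke_model(request.prompt)
```

Because `authorize` is plain code, it gives the same answer on every run for the same request, which is exactly the property organic model refusals lack.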
The Data Points to Exploitability Reduction as the Primary Lever, Because the Patch Window Has Already Closed for a Meaningful Fraction of Disclosures
Reading the 2026 threat intelligence reports in sequence, what I find most consequential is not any single statistic but the causal chain they form together. Mandiant M-Trends 2026, drawn from 500,000 hours of incident response, places the estimated mean time to exploit at negative seven days across actively exploited vulnerabilities: exploitation is routinely preceding patch availability. That negative gap is corroborated on the volume side: Rapid7’s 2026 report documents a 105% year-over-year increase in confirmed exploitation of high and critical-severity vulnerabilities, from 71 to 146, with the median time from vulnerability publication to inclusion in the Cybersecurity and Infrastructure Security Agency (CISA) Known Exploited Vulnerabilities catalog dropping from 8.5 days to 5 days. The forward indicator is the pre-disclosure rate: CrowdStrike documents a 42% year-over-year increase in zero-days exploited before public disclosure (the absolute counts behind this figure are not reported in the source, so the trend direction is reliable but the absolute scale is uncharacterized), meaning a growing fraction of the attacker timeline begins before any catalog signal exists for defenders.
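The year-over-year figures can be sanity-checked from the counts quoted above. This is plain arithmetic on the numbers in the paragraph, with a helper name of my own choosing; it adds no data beyond what the reports state.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Confirmed exploited high and critical-severity vulnerabilities (Rapid7 counts).
exploited_growth = pct_change(71, 146)
# Median days from publication to KEV inclusion (Rapid7 figures).
kev_lag_change = pct_change(8.5, 5.0)

print(f"exploited vulnerabilities: {exploited_growth:+.1f}%")
print(f"median days to KEV inclusion: {kev_lag_change:+.1f}%")
```

The first result lands close to the 105% the report states, and the second shows the catalog lag shrinking by roughly two-fifths over the same period.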
These figures predate widespread adversarial deployment of AI-assisted exploit generation at scale. The same capability Cloudflare deployed in a controlled research context is technically accessible to adversaries without authorization constraints, and the exploitation timelines above are already compressing without it.
The instinct in response to a compressed exploitation window is to compress the patch cycle. Security leaders are operating under two-hour service-level agreement (SLA) targets from vulnerability disclosure to patch in production. The structural problem is that the patch pipeline has a minimum duration determined by regression testing. A two-hour SLA can only be met by skipping that testing. Patches that bypass regression testing introduce new vulnerabilities at a rate that makes the intervention comparable in risk to the original finding. Cloudflare observed this directly when model-generated patches fixed the original vulnerability while breaking code dependencies. Automated patch generation introduces its own accuracy problem: in the Defense Advanced Research Projects Agency (DARPA) AI Cyber Challenge analysis, between 37% and 45% of patches from the two best-documented finalist systems passed automated validation but failed manual review under competition conditions.
The architectural alternative targets exploitability rather than patch timing. With mean time to exploit measuring in negative days for a meaningful fraction of disclosures, the vulnerability window cannot be closed by patch speed alone for any alert that arrives after exploitation has already begun. Network-layer controls that block exploit primitives before they reach application code reduce the window of exploitability independently of when the patch ships. Isolation boundaries within the application that prevent a flaw in one component from granting access to adjacent components limit blast radius when an exploit does land. Deployment infrastructure capable of pushing a fix to every running instance simultaneously, rather than waiting on per-team deployment cycles, determines how much of the window closes when the patch is ready.
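A toy model makes the argument about patch speed easier to see. It measures everything in days relative to patch availability (day 0) and assumes, optimistically, that a mitigation fully blocks exploitation once it is live. The function and parameter names are mine, and the deployment delays in the usage note are illustrative rather than sourced.

```python
def exposure_days(time_to_exploit, deploy_delay, mitigation_delay=None):
    """Days of exploitable exposure, relative to patch availability at day 0.

    time_to_exploit: day exploitation begins (negative means before the patch exists)
    deploy_delay: days from patch availability until the patch runs everywhere
    mitigation_delay: day a blocking control goes live (hypothetical, e.g. a network rule);
        may be negative if it lands before the patch exists
    """
    closed_at = deploy_delay if mitigation_delay is None else min(deploy_delay, mitigation_delay)
    return max(0.0, closed_at - time_to_exploit)
```

With exploitation starting at day -7, a two-hour patch deployment (`deploy_delay=2/24`) still leaves a little over seven days of exposure, and a fourteen-day deployment leaves twenty-one. Adding a blocking control at day -5 cuts either case to two days, which is the sense in which exploitability reduction works independently of when the patch ships.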
These are investments that compound with the harness rather than competing with it. A harness that surfaces vulnerabilities faster produces findings into a queue where the exploitation timeline may already have elapsed if the surrounding architecture does not reduce exploitability in the interim. The cost barrier on the discovery side has already fallen: Trail of Bits’ Buttercup found 28 vulnerabilities across 20 Common Weakness Enumeration (CWE) categories at greater than 90% accuracy at $181 per finding in AIxCC competition conditions, running on laptop-class hardware, and the DARPA AI Cyber Challenge baseline places autonomous discovery at $152 per task across 54 million lines of code. A discovery capability that until recently required a dedicated security engineering function is now within reach of a development team with API access and the harness architecture this article describes. The constraint has shifted from access to architecture. Building the conditions in which a finding can be acted on before the window closes is the work that remains, and it does not get easier as exploitation timelines continue to compress.
Peace. Stay curious! End of transmission.
Fact-Check Appendix
Source note: Project Glasswing findings cited in this article are sourced from Cloudflare’s published May 2026 reporting (Source [1]) and Anthropic’s initial project update (Source [15]). No independent testing of or access to the Glasswing system was conducted.
Statement: American Fuzzy Lop (AFL), a widely used mutation fuzzer, did not find the SQLite flaw that Big Sleep identified, even after 150 CPU-hours of fuzzing.
Source: [3] Glazunov & Brand, Google Project Zero, “From Naptime to Big Sleep” | https://projectzero.google/2024/10/from-naptime-to-big-sleep.html
Statement: Naptime’s published benchmark figures reflect the best result across 20 independent runs, not single-pass performance.
Source: [2] Glazunov & Brand, Google Project Zero, “Project Naptime: Evaluating the Capabilities of LLMs as Vulnerability Researchers” | https://projectzero.google/2024/06/project-naptime.html
Statement: Cloudflare’s internal Project Glasswing deployment produced 2,000 total bug reports across its production codebases, 400 of them high or critical severity, with false positive rates described as “better than human testers.”
Source: [15] Anthropic, “Project Glasswing: Initial Update” (May 2026) | https://www.anthropic.com/research/glasswing-initial-update
Statement: Anthropic’s Project Glasswing open-source scanning program reviewed more than 1,000 projects and identified an estimated 6,202 high and critical-severity vulnerabilities. Of 1,752 candidates independently assessed by six security firms, 90.6% were confirmed as valid true positives, a 9.4% false positive rate after structured triage.
Source: [15] Anthropic, “Project Glasswing: Initial Update” (May 2026) | https://www.anthropic.com/research/glasswing-initial-update
Statement: No company currently possesses safeguards sufficient to prevent misuse of models with Mythos-level capabilities.
Source: [15] Anthropic, “Project Glasswing: Initial Update” (May 2026) | https://www.anthropic.com/research/glasswing-initial-update
Statement: A single agent session against a 100,000-line codebase covers roughly one-tenth of one percent of the attack surface before context window saturation discards earlier findings.
Source: [1] Grant Bourzikas, Cloudflare Blog | https://blog.cloudflare.com/cyber-frontier-models/
Statement: Cloudflare’s production deployment runs approximately fifty concurrent hunting agents during a scan.
Source: [1] Grant Bourzikas, Cloudflare Blog | https://blog.cloudflare.com/cyber-frontier-models/
Statement: 84% of audited commercial codebases contained at least one known open source vulnerability; 74% contained high-risk vulnerabilities, a 54% increase from the prior year.
Source: [12] Synopsys / Black Duck, OSSRA 2024 | https://investor.synopsys.com/news/news-details/2024/New-Synopsys-Report-Finds-74-of-Codebases-Contained-High-Risk-Open-Source-Vulnerabilities-Surging-54-Since-Last-Year/default.aspx
Statement: At Trail of Bits, AI-augmented auditors report finding 200 bugs per week on qualifying engagements, up from a baseline of approximately 15 per week before their AI-native transformation, with roughly 20% of reported findings initially discovered by AI systems.
Source: [4] Dan Guido, Trail of Bits Blog, “How we made Trail of Bits AI-native (so far)” | https://blog.trailofbits.com/2026/03/31/how-we-made-trail-of-bits-ai-native-so-far/
Statement: A union of four established static analysis tools against Java codebases produces a false positive detection rate exceeding 92%; Large language model (LLM)-assisted filtering in the best documented configuration reduces that figure to 6.3%.
Source: [11] Song et al., arXiv 2601.22952, “Sifting the Noise” | https://arxiv.org/abs/2601.22952
Statement: The best filtering configuration retained only 77.7% of true vulnerabilities, incorrectly suppressing 22.3% of genuine findings.
Source: [11] Song et al., arXiv 2601.22952 | https://arxiv.org/abs/2601.22952
Statement: The cost of that filtering pass: $0.047 to $0.187 per finding in model spend.
Source: [11] Song et al., arXiv 2601.22952 | https://arxiv.org/abs/2601.22952
Statement: Mandiant M-Trends 2026, drawn from 500,000 hours of incident response, places the estimated mean time to exploit at negative seven days across actively exploited vulnerabilities.
Source: [9] Mandiant (Google Cloud), M-Trends 2026 | https://cloud.google.com/blog/topics/threat-intelligence/m-trends-2026
Statement: Rapid7’s 2026 report documents a 105% year-over-year increase in confirmed exploitation of high and critical-severity vulnerabilities, from 71 to 146.
Source: [8] Rapid7, 2026 Global Threat Landscape Report | https://www.rapid7.com/about/press-releases/rapid7-2026-global-threat-landscape-report-shows-exploited-high-and-critical-severity-vulnerabilities-surged-105-as-attack-timelines-collapsed/
Statement: The median time from vulnerability publication to inclusion in the CISA Known Exploited Vulnerabilities catalog dropped from 8.5 days to 5 days.
Source: [8] Rapid7, 2026 Global Threat Landscape Report | https://www.rapid7.com/about/press-releases/rapid7-2026-global-threat-landscape-report-shows-exploited-high-and-critical-severity-vulnerabilities-surged-105-as-attack-timelines-collapsed/
Statement: CrowdStrike documents a 42% year-over-year increase in zero-days exploited before public disclosure.
Source: [13] CrowdStrike, 2026 Global Threat Report | https://www.crowdstrike.com/en-us/global-threat-report/
Statement: Between 37% and 45% of patches from the two best-documented finalist systems passed automated validation but failed manual review under competition conditions.
Source: [6] Georgia Tech et al., SoK arXiv 2602.07666 | https://arxiv.org/abs/2602.07666
Statement: Trail of Bits’ Buttercup found 28 vulnerabilities across 20 Common Weakness Enumeration (CWE) categories at greater than 90% accuracy at $181 per finding in AIxCC competition conditions, running on laptop-class hardware.
Source: [7] Trail of Bits, “Buttercup wins 2nd place in AIxCC Challenge” | https://blog.trailofbits.com/2025/08/09/trail-of-bits-buttercup-wins-2nd-place-in-aixcc-challenge/
Statement: The DARPA competition baseline places autonomous vulnerability discovery at $152 per task across 54 million lines of code.
Source: [5] DARPA, “AI Cyber Challenge marks pivotal inflection point for cyber defense” | https://www.darpa.mil/news/2025/aixcc-results
Statement: When Mythos Preview scanned curl in May 2026, the documented outcome was one confirmed low-severity vulnerability, three false positives from API limitations, and one genuine bug that was not a security issue, from five reported findings. Lead maintainer Daniel Stenberg documented this publicly and noted that projects without prior AI scanning would likely find a large number of flaws.
Source: [14] Daniel Stenberg, “Mythos finds a curl vulnerability” (May 2026) | https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-vulnerability/
Top 6 Authoritative Sources
Anthropic, “Project Glasswing: Initial Update” (May 2026) - https://www.anthropic.com/research/glasswing-initial-update
Grant Bourzikas, Cloudflare Blog, “Project Glasswing: what Mythos showed us” (May 2026) - https://blog.cloudflare.com/cyber-frontier-models/
Glazunov & Brand, Google Project Zero, “From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code” (October 2024) - https://projectzero.google/2024/10/from-naptime-to-big-sleep.html
Mandiant (Google Cloud), “M-Trends 2026: Data, Insights, and Strategies From the Frontlines” (2026) - https://cloud.google.com/blog/topics/threat-intelligence/m-trends-2026
Song et al., “Sifting the Noise: A Comparative Study of LLM Agents in Vulnerability False Positive Filtering,” arXiv 2601.22952 (January 2026) - https://arxiv.org/abs/2601.22952
Georgia Tech et al., “SoK: DARPA’s AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned,” arXiv 2602.07666 (February 2026) - https://arxiv.org/abs/2602.07666




