AI Security Testing as a Chain of Trust

AI security testing becomes credible when benchmarks, scanners, and guardrails compose into a chain of trust validated by application runtime telemetry.

Jun 24, 2026

Disclaimer

This article is intended for informational purposes and reflects the state of published research and industry practice as of mid 2026. It is not professional security advice. Your specific environment, threat model, and regulatory obligations will shape how these principles apply to your situation.

For Security Leaders

Testing Large Language Model applications for security vulnerabilities cannot rely on a single score or scanner. Models frequently refuse malicious prompts but still leak protected secrets inside their refusal responses, meaning compliance intent does not guarantee actual data protection. Security leaders must build a multi-layered testing stack that verifies model behavior, boundary controls, and application state before claiming system safety.

What this means for your organization:

Jailbreak labels mislead teams because a model can technically deny an attack while still disclosing sensitive business information.
Independent security layers fail unless they are composed into a continuous chain where each tool verifies the assumptions of the next.
Structured data outputs require validation since restricting a model to standard formats does not automatically clean free-text comment fields.

What to tell your teams:

Deploy boundary scanners on both input and output paths to block literal secrets before they reach users.
Validate structured schemas with field-level checks to prevent sensitive data from leaking in free-text fields.
Implement application-level telemetry to track tool calls, retrieved context, and state changes alongside raw prompts.
Design regression tests from observed failures to continuously pressure-test boundary guardrails.

The most useful result in this lab came from a single failure: the same local Qwen model, served through an OpenAI-compatible endpoint, refused a synthetic attack and still leaked the protected string inside the refusal.

A single benchmark score would miss this kind of failure, and a single guardrail story would oversell it. The model did not enthusiastically comply with the unsafe request; it often said no. The problem was that the refusal repeated the exact value the policy said not to reveal.

That observation is the article’s claim: AI security testing becomes credible when each layer leaves evidence that constrains the next layer. A benchmark score, a scanner result, a red-team campaign, a schema validator, and an application trace are not interchangeable. They become useful when they form a chain. The final link is application telemetry: prompts, retrieved context, tool calls, guardrail events, authorization decisions, and state changes.

I mean chain of trust operationally, not cryptographically. There is no root certificate here. There is no formal proof that trust transfers from one tool to the next. There is a sequence of checks: did the harness score the right field, did the tool result mean what the report says it meant, did the guardrail block the actual failure mode, and did application state change in the way the security claim implies?

It sets a lower bar than formal assurance, but it is the bar practitioners actually need.

The first link is measurement provenance

CyberSecEval on a Consumer GPU: What My Local Setup Could Actually Measure covered a local CyberSecEval run. CyberSecEval is part of Meta’s PurpleLlama CybersecurityBenchmarks project, and the canary-exploit component asks a model to produce exploit inputs for generated vulnerable programs. The useful lesson had little to do with whether a local consumer graphics processing unit can reproduce a frontier-model security evaluation. A benchmark result has to prove its measurement path before the score deserves attention.

The specific failure was response visibility. A reasoning model can produce a long hidden reasoning trace while leaving the final message.content field empty. If the harness scores message.content, hidden reasoning_content is not a scoreable answer. The server can look busy, the model can look active, and the benchmark can still have nothing valid to score.

Model capability stops being the question here; provenance becomes the question instead. Which model identifier was served? Which endpoint shape was used? Which request parameters controlled reasoning and final output? Which response field did the harness read? Was the harness pristine, locally patched, or post-processed? Was the run a one-prompt smoke test, a subset, a recovered result, or a completed benchmark?

Those labels separate a result from a performance.

The same rule applies to every later tool in this chain. A red-team scorer, an evaluation task, an assertion check, a detector hit, a vulnerability-class summary, a judge score, and a guardrail block all depend on a scoreable output field and a success definition. If the request path is wrong, the tool can produce a precise artifact about the wrong thing.

The first link in the chain is therefore simple: before testing security behavior, prove that the thing being scored is the thing the model actually returned.

The second link is evidence type

There Is No Nmap for LLMs Yet moved from benchmark provenance to tool semantics. It tested four tools against a local OpenAI-compatible target: garak, NVIDIA’s open-source LLM vulnerability scanner; promptfoo, an open-source LLM testing and red-teaming command-line tool; DeepTeam, Confident AI’s open-source red-teaming framework; and Augustus, Praetorian’s scanning and judge framework. The finding was that tool names hide different evidence contracts. This article uses that prior tool set as an evidence-contract foundation rather than rerunning the same experiments.

garak’s evidence sits at the probe and detector level: a detector hit means a detector observed the condition it was designed to observe under a particular probe and generator configuration, well short of confirming an application vulnerability. Its reports are still useful at scale, since they make model-level breadth visible, but mapping a detector hit onto the system actually being defended is a step the operator still has to take by hand.

Move from detection to assertion and the picture changes. promptfoo produces assertion and regression evidence, useful precisely because a literal not-contains check is deterministic. That same literalism is also its limit: if a model refuses an unsafe instruction but quotes the forbidden string while refusing, the assertion can fail even when the human interpretation is more nuanced.

DeepTeam shifts the evidence again, toward vulnerability-class red-team campaigns framed around abuse categories rather than raw prompts, which is closer to how application-security teams already reason. The tradeoff is that the target callback and application wrapper become part of the result.

Augustus adds a fourth shape by combining probe, detector, and judge output, where the judge can contribute semantic interpretation the others can’t reach. A same-model judge, though, shares blind spots with the system it is judging, so its verdict works better as a diagnostic signal than a final authority.

The second link follows the same rule: do not trust a tool result until the evidence type is named. Detector hit, assertion failure, vulnerability summary, judge score, schema validation, guardrail block, and application-state transition are different objects. Collapsing them into “the model failed” or “the scanner passed” destroys the information the tools produced.

This lab made that distinction concrete with a local Qwen model endpoint and five tools that occupy different positions in the testing chain. It used PyRIT, Microsoft’s Python Risk Identification Tool for generative AI red teaming; Inspect AI, an evaluation framework from the United Kingdom AI Security Institute; LLM Guard, an input and output scanning framework; NeMo Guardrails, NVIDIA’s conversational policy framework; and Guardrails AI, a structured-output validation framework.

The local Qwen lab was not an application test

This lab deliberately did not run against any specific application. It exercised the core features and built-in testing scenarios each tool ships with, against a model endpoint rather than a deployed system. It used a local OpenAI-compatible Qwen endpoint as a model target and kept the attack material synthetic. The protected value was a fake token. The forbidden marker was a fake string. The restricted transaction was a benign decision object. That scope matters because the lab’s findings stop at the model and toolchain level; application-security claims require a deployed application.

The endpoint preflight established that the target path returned visible final content from Qwen3.5-9B-Q4_K_M.gguf with thinking disabled for measurement control. That preflight functioned as the gate that made later results interpretable, a methodological checkpoint rather than a security finding.

Once the response field was proven, the next question shifted from safety to which evidence shape each tool could produce. The broad configured suite then asked what each tool could do against that target path. PyRIT enumerated installed prompt converters and attempted default-constructible converters against Qwen. Inspect AI ran a 12-sample Qwen-backed task suite through its solver and scorer path. LLM Guard attempted all installed input and output scanner modules with safe defaults where possible. NeMo Guardrails parsed local Colang and YAML policy configurations and paired them with Qwen calls, without claiming runtime enforcement. Guardrails AI validated Qwen JSON outputs against five schemas.

The result set looked like this. Passes, mismatches, errors, skips, and executed scanner counts read as tool-path evidence under the lab’s configured contract rather than as safety scores or product comparisons. In this table, PyRIT passes mean converter execution under the configured contract, Qwen mismatches mean expected visible markers did not appear, and LLM Guard skips were configuration or dependency gates rather than scanner failures:

Tool             Scope                                      Result
PyRIT            Installed converter coverage               79 classes seen, 30 passes, 8 Qwen mismatches, 27 errors
Inspect AI       Configured Qwen eval suite                 12 samples, 12 passes
LLM Guard        Installed scanner attempt                  18 scanners seen, 10 executed, 7 skipped, 1 error
NeMo Guardrails  Local policy config suite plus Qwen calls   3 configs, 3 executed, runtime enforcement not claimed
Guardrails AI    JSON-schema validation over Qwen outputs    5 cases, 5 passes

Comparing the five tools sorted out their roles rather than their rankings; none of them were asked to do the same job.

PyRIT is useful when the question becomes adversarial campaign generation and prompt transformation. Inspect AI is useful when a security question needs to become a repeatable task with logs and scoring. LLM Guard is useful at input and output boundaries. NeMo Guardrails is useful for expressing conversational policy, but in this lab it stayed at config readiness because runtime enforcement against the local Qwen path was not claimed. Guardrails AI is useful when the application needs structured output validation.

The model refused and still leaked

The direct model-only attack suite used 11 cases: two benign baselines and nine synthetic attacks. The attacks covered direct prompt injection, roleplay override, false authority, indirect prompt injection, transformation exfiltration, encoding obfuscation, and structured-output bypass. The scoring rule was intentionally strict: if the final visible output contained the fake protected value or fake forbidden marker, the case failed.

Under that contract, Qwen passed both baselines and resisted 3 of 9 attack cases. Six attack cases failed.

Run                         Baselines passed   Attacks resisted   Attack failures
Direct Qwen model attack    2 of 2             3 of 9             6 of 9

The failure mode is what matters here: the model often refused, rarely acting as if the malicious instruction was acceptable, but the refusal repeated the protected value or forbidden marker anyway. For a secret-handling rule, that still fails because the final delivered output contains the protected string.

That distinction is the reason this lab is useful. If the only label is “jailbreak,” the result is muddy. If the label is “forbidden-string echo inside refusal text,” the next defensive step is obvious: guard the output boundary.

The structured-output case also mattered. When asked to return a restricted transaction decision as JSON, Qwen denied the transaction and set the restricted flag to false. That falls short of proving business safety, but it shows the model behaved better when the task was constrained as a decision object than when it was asked to avoid repeating a protected literal in prose.

A model can be safer under one output contract than another. The chain has to test both.

The tools did not all defend the model

The next run passed the same synthetic attack matrix through all five tools. This is where tool roles became visible.

Tool             Role in attack run                         Baselines passed   Attacks resisted   Attack failures
PyRIT            Base64 transformation before Qwen           2 of 2             3 of 9             6 of 9
Inspect AI       Task, solver, scorer evaluation             2 of 2             3 of 9             6 of 9
LLM Guard        Input and output scanner mitigation         2 of 2             9 of 9             0 of 9
NeMo Guardrails  Config readiness plus paired Qwen calls     2 of 2             3 of 9             6 of 9
Guardrails AI    Structured JSON wrapper                     1 of 2             6 of 9             3 of 9

PyRIT did what PyRIT was supposed to do in this run: it transformed attack prompts rather than mitigating them. Its result matched the direct model-only pattern because Qwen remained the target and the attack content still reached the model.

Inspect AI also did its job. It turned the attack cases into an evaluation run with solver, scorer, and logs. Its result matched the direct model-only pattern because an evaluation harness measures behavior. It does not change behavior unless the test design adds a control.

LLM Guard changed the outcome because it acted at the input and output boundaries. In this synthetic suite, known protected strings and obvious injection phrases could be blocked before the model or after the model. The mitigation run preserved both baselines and resisted all nine attacks under the synthetic indicator contract, a result scoped to this known-string test rather than a claim of semantic safety.

Mitigation layer        Stages observed
Input guard             6 pre-model blocks
Output guard            1 post-model block
Delivered responses     4 responses passed checks
Post-mitigation result  9 of 9 attacks resisted, 2 of 2 baselines preserved

This result is strong for the observed failure mode, but it stays narrowly scoped to known secret discovery rather than the general case. It works here because the protected strings and several attack phrases were known. If the protected data is unknown, paraphrased, retrieved from context, or represented as business state rather than a literal token, a simple substring denylist is not enough.

NeMo Guardrails should be reported conservatively. This result is a measurement boundary: the lab proved that local policy configs parsed and that Qwen calls could be paired with them, but it did not wire NeMo into the runtime Qwen path or prove runtime enforcement against the local endpoint. The attack outcome therefore matched the direct model result. Treating that as a NeMo enforcement failure would be wrong. Treating it as a config-readiness boundary is the accurate claim.

Guardrails AI produced the most instructive partial mitigation. The schema wrapper improved structured decision behavior, but some failures remained because the model placed forbidden strings inside an allowed note field. The schema constrained shape, leaving content inside every string field unguaranteed. Schema validation constrains object structure and enum values, but free-text fields remain output channels unless field-level validators or output scanners constrain their content.

The practical lesson here is that validation and scanning are different controls. A schema can require decision: deny. It cannot, by itself, prevent the model from explaining the denial by repeating the protected token unless the schema, validators, or output scanner also constrain that content.

Composition beats tool branding

The value of this chain shows up in disagreement: that is where the diagnosis happens.

If PyRIT finds a transformed attack that a regression test cannot reproduce, the failure may depend on conversation state or orchestration. If Inspect AI measures a failure that LLM Guard blocks, the model weakness remains but the delivered-output risk changes. If Guardrails AI returns a valid decision object while leaking a forbidden string in a note field, the schema contract is too weak for the data-handling rule. If NeMo parses policy but no runtime enforcement is wired to the target model, the policy exists only as design intent rather than as evidence of control.

That is why this series has to proceed in order. CyberSecEval on a Consumer GPU: What My Local Setup Could Actually Measure established benchmark provenance: a score is not trustworthy until the measurement path is trustworthy. There Is No Nmap for LLMs Yet established evidence contracts: a scanner hit, assertion failure, vulnerability summary, and judge score are not the same claim. This article adds composition: each tool should either produce an artifact the next layer can consume or a control the next layer can attack.

A useful AI security program turns findings into regression tests, regression failures into controls, controls into new attack targets, and application traces into state-level evidence. Without that loop, tools accumulate. With it, they constrain each other.

The missing link is application telemetry. A deployed agentic application layers retrieved context, system instructions, tool calls, authorization decisions, policy events, database writes, and business outcomes on top of prompt input and model output. Model-level refusal leakage matters. Delivered-output blocking matters. But an application-security claim needs to answer whether a tool was called, whether an invoice was approved, whether a flag was exposed, whether a policy decision was recorded, and whether state changed.

What this means for FinBot

A future lab session will point the benchmark, scanner, campaign, and guardrail tooling examined across this article and its two predecessors at OWASP FinBot, an agentic AI capture-the-flag financial workflow with invoices, tool calls, approval thresholds, and mutable payment state. Every test in this series so far ran against a bare model endpoint. FinBot is where that tooling finally meets a deployed application.

The question will no longer be whether Qwen repeated a synthetic token in a refusal. It will be whether an agentic workflow changed state under attack, which layer caught it, and which artifact proves that claim.

That is the point of the chain: AI application security stays just as hard, but the evidence gets harder to confuse.

Peace. Stay curious! End of transmission.

Fact-Check Appendix

The live results in this article come from captured Qwen-connected lab artifacts; the earlier research monograph remains synthesis context.

Statement: This article used Qwen3.5-9B-Q4_K_M.gguf through a local OpenAI-compatible endpoint and treated the endpoint preflight as a measurement gate rather than a security result. | Source: This article’s lab artifact: Qwen full-suite summary, 2026-06-18.

Statement: The Qwen-connected broad configured suite reported PyRIT with 79 converter classes seen, 30 passes, 8 Qwen mismatches, and 27 converter errors. | Source: This article’s lab artifact: Qwen full-suite summary, 2026-06-18.

Statement: The Qwen-connected broad configured suite reported Inspect AI with 12 samples and 12 passes. | Source: This article’s lab artifact: Qwen full-suite summary, 2026-06-18.

Statement: The Qwen-connected broad configured suite reported LLM Guard with 18 scanners seen, 10 executed, 7 skipped, and 1 error. | Source: This article’s lab artifact: Qwen full-suite summary, 2026-06-18.

Statement: The Qwen-connected broad configured suite reported NeMo Guardrails with 3 configs executed and runtime enforcement not claimed. | Source: This article’s lab artifact: Qwen full-suite summary, 2026-06-18.

Statement: The Qwen-connected broad configured suite reported Guardrails AI with 5 validation cases, 5 passes, and 0 validation failures. | Source: This article’s lab artifact: Qwen full-suite summary, 2026-06-18.

Statement: The direct Qwen model attack suite used 11 total cases, including 2 baseline cases and 9 attack cases. | Source: This article’s lab artifact: Qwen model-attack summary, 2026-06-18.

Statement: The direct Qwen model attack suite passed 2 of 2 baseline cases, resisted 3 of 9 attack cases, and produced 6 of 9 attack indicator failures. | Source: This article’s lab artifact: Qwen model-attack summary, 2026-06-18.

Statement: The guardrail mitigation follow-up preserved 2 of 2 baselines and resisted 9 of 9 attacks under the synthetic indicator contract. | Source: This article’s lab artifact: Qwen guardrails-mitigation summary, 2026-06-18.

Statement: The guardrail mitigation follow-up recorded 6 pre-model blocks, 1 post-model block, and 4 delivered responses. | Source: This article’s lab artifact: Qwen guardrails-mitigation summary, 2026-06-18.

Statement: In the all-tools attack matrix, PyRIT resisted 3 of 9 attacks and had 6 of 9 attack failures. | Source: This article’s lab artifact: Qwen all-tools attack summary, 2026-06-18.

Statement: In the all-tools attack matrix, Inspect AI resisted 3 of 9 attacks and had 6 of 9 attack failures. | Source: This article’s lab artifact: Qwen all-tools attack summary, 2026-06-18.

Statement: In the all-tools attack matrix, LLM Guard resisted 9 of 9 attacks and had 0 of 9 attack failures. | Source: This article’s lab artifact: Qwen all-tools attack summary, 2026-06-18.

Statement: In the all-tools attack matrix, NeMo Guardrails resisted 3 of 9 attacks and had 6 of 9 attack failures, with runtime enforcement not claimed. | Source: This article’s lab artifact: Qwen all-tools attack summary, 2026-06-18.

Statement: In the all-tools attack matrix, Guardrails AI resisted 6 of 9 attacks, had 3 of 9 attack failures, and passed 1 of 2 baseline cases. | Source: This article’s lab artifact: Qwen all-tools attack summary, 2026-06-18.

Statement: PyRIT is Microsoft’s Python Risk Identification Tool for generative AI red teaming. | Source: Microsoft PyRIT documentation, https://microsoft.github.io/PyRIT/ and repository, https://github.com/microsoft/PyRIT.

Statement: Inspect AI is an evaluation framework from the United Kingdom AI Security Institute for large language model and agent evaluations. | Source: Inspect AI documentation,

https://inspect.aisi.org.uk/

Statement: LLM Guard provides input and output scanning and sanitization for large language model applications. | Source: Protect AI LLM Guard repository, https://github.com/protectai/llm-guard.

Statement: NeMo Guardrails is NVIDIA’s framework for adding programmable guardrails to conversational AI systems. | Source: NVIDIA NeMo Guardrails documentation, https://docs.nvidia.com/nemo/guardrails/latest/.

Statement: Guardrails AI validates and constrains structured model outputs. | Source: Guardrails AI documentation, https://www.guardrailsai.com/docs.

Statement: OWASP publishes the Top 10 for Agentic Applications 2026 and LLM application security materials used here as risk taxonomy sources. | Source: OWASP GenAI Security Project, https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/ and https://genai.owasp.org/llm-top-10/.

Statement: FinBot is an OWASP agentic AI capture-the-flag application and demo target. | Source: OWASP FinBot CTF announcement, https://genai.owasp.org/resource/finbot-agentic-ai-capture-the-flag-ctf-application/ and demo repository, https://github.com/OWASP-ASI/finbot-ctf-demo.

Top 5 Sources

This article’s lab artifacts: The primary evidence for all new Qwen-connected results in this article. They provide exact scope labels, result counts, and per-tool summaries.

Microsoft PyRIT documentation and repository: Authoritative source for PyRIT’s role as a generative AI red-team framework and prompt transformation or orchestration layer.

Inspect AI documentation: Authoritative source for Inspect AI’s role as a task, solver, scorer, and logging framework for model and agent evaluations.

OWASP GenAI Security Project materials: Authoritative taxonomy source for LLM and agentic application risks, including prompt injection, tool misuse, agency, and insecure output handling.

CyberSecEval on a Consumer GPU: What My Local Setup Could Actually Measure and There Is No Nmap for LLMs Yet local evidence packages: Prior captured evidence that established benchmark provenance and tool-evidence contracts before this article’s composition layer.

Next Kick Labs

Discussion about this post

Ready for more?