Skip to main content

Adversarial Red Team Swarm (ARS)

Feature rows 20 + 21 — Sprints ARS / ARS.1

ARS is the internal lab that runs continuously on demo.behavry.ai. Its findings feed the Red Team → Policy Automation Loop.

What this is

Adversarial Red Team Swarm is Behavry's continuous internal red team. It runs 25 evasion techniques across 7 categories against a live instance of the product and scores how often each one slips through detection. The sprint ARS.1 extension adds an Adaptive Adversarial Evolution loop — a mutation engine that takes techniques ARS has failed on and generates variants until one works, then feeds the new variant back into the corpus.

The goal is to keep the product's detection surface ahead of what real attackers will try. Every week of ARS runs is a dataset of known-effective attacks we can use to test policy changes, regression-test detection, and publish into the community policy library as attack fingerprints.

The 25 techniques, grouped

CategoryTechniques
Imperative overrideDirect ignore, delayed ignore, role reset, step replacement
Context smugglingHidden tool envelopes, Markdown poisoning, code-block poisoning
Trust exploitationTrusted-source spoofing, domain-tag forgery
ExfiltrationURL param leak, image-src leak, DNS rebinding, Markdown-image leak
Tool hijackFake tool name, fake tool args, schema confusion
Behavioral driftGradual imperative, style conditioning, persona drift
Credential solicitationDirect key ask, debug ruse, environment dump

Each technique is a Python class under backend/behavry/adversarial/ with:

  • A probe() method that generates one attack payload
  • A success_criteria() method that decides, given the agent's response and the audit trail, whether the attack succeeded
  • Metadata: category, difficulty, references

How a run works

An APScheduler job (backend/behavry/adversarial/service.py) fires every 15 minutes:

  1. Pick targets — choose an agent persona from the demo fleet; pick one technique from the corpus
  2. Execute — run the probe against the live proxy as if it were a real request
  3. Score — compare the response against success_criteria; record the outcome
  4. Record — write ars.probe and ars.result events with full payloads
  5. Evolve (ARS.1) — on failure, pass the technique + the detection signature to the mutation engine, which returns a variant; queue the variant for the next run

Adaptive evolution loop

The evolution engine (sprint ARS.1) turns ARS from a static test suite into an autonomous adversary:

  • Input: a failed technique + the detection signature that caught it (rule IDs, severity, matched span)
  • Transformation: language model generates a semantically equivalent variant that avoids the flagged surface — synonym substitution, encoding changes, structural rearrangement, added noise
  • Output: a new probe with a parent pointer back to the original technique

Evolved probes are tagged evolution_generation > 0 so you can always tell a synthesized variant from a hand-written one. Successful evolved probes are promoted into the canonical corpus; unsuccessful ones are retained for trend analysis.

Dashboard

demo.behavry.ai exposes an ARS dashboard at Security → ARS with:

  • Per-technique success rate over time
  • "New effective techniques" feed — variants the evolution engine has found that now work
  • Coverage matrix — techniques × categories × agent personas
  • "Policies this would have caught" — a dry-run against the tenant's current policy set

Where results go

  • Audit log — every probe + result, normal audit_events schema
  • Community library — high-signal attacks publish as DLP patterns / injection patterns / policy templates
  • Red Team → Policy Automation Loop — auto-generated candidate policies that would have blocked the successful attacks, pending review by policy_author