
Intent Drift Detection

Feature row 16 — Sprint AOC-1.5

Intent Drift Detection is included on the Professional and Enterprise plans.

The threat

A single prompt injection attempt is loud. An attacker who knows Behavry is watching will instead spread the payload across many tool responses, each with a low-severity pattern that falls below the per-response threshold. Over hundreds of calls, the agent's behavior shifts a few degrees at a time until it's taking actions the operator never approved.

This is intent drift: not one injection event, but the accumulation of tiny biases across a session or workflow.

How it works

Intent Drift Detection sits one layer above the per-response inbound scanner. For every inbound tool response, the scanner already produces:

  • A per-pattern match list
  • A per-response severity
  • A per-pattern-class count

On top of that output, the drift detector keeps a rolling window, per agent and per workflow, that tracks:

  • How many low-severity matches have fired in the last N responses
  • Which pattern classes are recurring
  • Which content trust domains the matches came from

When the rolling count crosses a threshold, or when the class distribution shifts toward higher-risk classes, the detector raises an intent_drift.detected event even though no single response would have triggered one on its own.
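The rolling window described above can be sketched roughly as follows. This is an illustration only, not Behavry's implementation: the ResponseSummary and RollingWindow names, their fields, and the size and threshold values are all assumptions. Only the count-based trigger is shown, and the class-distribution trigger is left out.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseSummary:
    """Hypothetical summary of one scanned tool response."""
    low_severity_matches: int
    pattern_classes: tuple[str, ...]
    trust_domain: str


class RollingWindow:
    """Keeps the last `size` response summaries for one agent or workflow."""

    def __init__(self, size: int, threshold: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.responses: deque[ResponseSummary] = deque(maxlen=size)
        self.threshold = threshold  # assumed count threshold

    def add(self, summary: ResponseSummary) -> None:
        self.responses.append(summary)

    def low_severity_count(self) -> int:
        """Low-severity matches fired in the last N responses."""
        return sum(r.low_severity_matches for r in self.responses)

    def recurring_classes(self) -> dict[str, int]:
        """Pattern classes seen in more than one response in the window."""
        counts = Counter(c for r in self.responses for c in set(r.pattern_classes))
        return {c: n for c, n in counts.items() if n > 1}

    def match_domains(self) -> Counter:
        """Trust domains of the responses that produced matches."""
        return Counter(
            r.trust_domain for r in self.responses if r.low_severity_matches > 0
        )

    def drifted(self) -> bool:
        """True once the rolling count reaches the threshold."""
        return self.low_severity_count() >= self.threshold
```

With a window of three responses and a threshold of three, two responses carrying one low-severity match each stay quiet, and the third trips the detector even though each response looks harmless alone.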

Scoring

The drift score is a weighted sum over the rolling window:

drift = Σ (severity_weight × trust_domain_penalty × recency_weight)
  • severity_weight — low = 1, medium = 4, high = 12
  • trust_domain_penalty — trusted source = 0.5, untrusted = 1.0, blocked = 2.0
  • recency_weight — decays linearly over the window (newer matches count more)

Drift thresholds default to warn >= 6, alert >= 12, escalate >= 24 and are tunable per tenant.

Tie to content trust domains

Content Trust Domain Tagging (row 15) tags every inbound response with a trust tier. Intent drift weights those tags directly: the same pattern coming from a trusted internal server contributes less drift than one from an untrusted web source, even when the pattern match is identical.

This matters because the attack pattern of interest is exactly low-severity matches from untrusted sources, accumulating over time.

Response

When drift crosses the alert threshold, the detector:

  1. Writes an intent_drift.detected event with the full rolling window
  2. Pushes an alert to Behavior → Alerts with one-click pivot to the responsible workflow
  3. Bumps the agent's behavioral risk score
  4. Optionally (policy-controlled) puts the agent into Restricted Mode with the investigation profile

Crossing the escalate threshold additionally creates a HITL escalation item that blocks the next tool call until a human decides.
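The tiered response can be pictured as a simple mapping from score to actions. The sketch below is hypothetical: the action strings and the restrict_on_alert flag are placeholder names standing in for the policy setting, not real Behavry identifiers. The page does not describe what the warn tier does, so the sketch returns nothing below the alert threshold.

```python
def actions_for(
    score: float,
    restrict_on_alert: bool = False,
    alert: float = 12,
    escalate: float = 24,
) -> list[str]:
    """Return the (placeholder) actions triggered by a drift score."""
    actions: list[str] = []
    if score >= alert:
        actions += [
            "write_event:intent_drift.detected",  # includes the rolling window
            "push_alert",
            "bump_risk_score",
        ]
        if restrict_on_alert:
            # Policy-controlled, so off unless the tenant enables it.
            actions.append("enter_restricted_mode:investigation")
    if score >= escalate:
        # Escalation is additive: the alert actions above still apply.
        actions.append("hitl_block_next_tool_call")
    return actions
```

Because the escalate threshold sits above the alert threshold, an escalating score always carries the alert-tier actions as well.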

Resets

The rolling window resets on:

  • An explicit operator reset via Agents → (agent) → Reset Drift
  • A new session (drift is per-session by default; tenants can opt into workflow-scoped drift instead)
  • A successful HITL clear that an operator marks as "false positive"

Resets are audited.
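A minimal sketch of an audited reset, assuming an in-memory window and audit log. The ResetReason and DriftState names and the audit record fields are illustrative assumptions, not the product's schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class ResetReason(Enum):
    """The three reset triggers listed above (identifier names assumed)."""
    OPERATOR = "operator_reset"
    NEW_SESSION = "new_session"
    HITL_FALSE_POSITIVE = "hitl_false_positive"


@dataclass
class DriftState:
    window: list = field(default_factory=list)
    audit_log: list[dict] = field(default_factory=list)

    def reset(self, reason: ResetReason, actor: str) -> None:
        """Record who reset the window and why, then clear it."""
        self.audit_log.append(
            {
                "reason": reason.value,
                "actor": actor,
                "cleared_entries": len(self.window),
            }
        )
        self.window.clear()
```

Writing the audit record before clearing keeps a count of what was discarded, which makes it easier to review whether a reset hid a real accumulation.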