Intent Drift Detection
Intent Drift Detection is included on the Professional and Enterprise plans.
The threat
A single prompt injection attempt is loud. An attacker who knows Behavry is watching will instead spread the payload across many tool responses, each with a low-severity pattern that falls below the per-response threshold. Over hundreds of calls, the agent's behavior shifts a few degrees at a time until it's taking actions the operator never approved.
This is intent drift: not one injection event, but the accumulation of tiny biases across a session or workflow.
How it works
Intent Drift Detection sits one layer above the per-response inbound scanner. For every inbound tool response, the scanner already produces:
- A per-pattern match list
- A per-response severity
- A per-pattern-class count
Intent drift maintains a rolling window, per agent and per workflow, that tracks:
- How many low-severity matches have fired in the last N responses
- Which pattern classes are recurring
- Which content trust domains the matches came from
When the rolling count crosses a threshold — or when the class distribution shifts toward higher-risk classes — the detector raises an intent_drift.detected event even though no single response would have fired on its own.
Scoring
The drift score is a weighted sum over the rolling window:
drift = Σ (severity_weight × trust_domain_penalty × recency_weight)
- severity_weight — low = 1, medium = 4, high = 12
- trust_domain_penalty — trusted source = 0.5, untrusted = 1.0, blocked = 2.0
- recency_weight — decays linearly over the window (newer matches count more)
Drift thresholds default to warn >= 6, alert >= 12, escalate >= 24 and are tunable per tenant.
Tie to content trust domains
Content Trust Domain Tagging (row 15) tags every inbound response with a trust tier. Intent drift weights those tags directly: the same pattern coming from a trusted internal server contributes less drift than one from an untrusted web source, even when the pattern match is identical.
This matters because the attack pattern of interest is exactly that: low-severity matches from untrusted sources, accumulating over time.
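Because the trust-domain penalty p is a per-match multiplier, a run of low-severity matches (severity weight 1) from a single domain factors cleanly. Writing R for the sum of their recency weights, and taking R = 8 purely as an illustrative value, the default thresholds give:

```latex
\[
\text{drift} = \sum_i 1 \times p \times r_i = p\,R, \qquad R = \sum_i r_i
\]
\[
R = 8:\quad
\text{trusted: } 0.5 \times 8 = 4 \;(<6),\qquad
\text{untrusted: } 1.0 \times 8 = 8 \;(\geq 6,\ \text{warn}),\qquad
\text{blocked: } 2.0 \times 8 = 16 \;(\geq 12,\ \text{alert})
\]
```

The identical sequence of pattern matches stays silent from a trusted server, warns from an untrusted source, and alerts from a blocked one.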
Response
When drift crosses the alert threshold, the detector:
- Writes an intent_drift.detected event with the full rolling window
- Pushes an alert to Behavior → Alerts with a one-click pivot to the responsible workflow
- Bumps the agent's behavioral risk score
- Optionally (policy-controlled) puts the agent into Restricted Mode with the investigation profile
Crossing the escalate threshold also adds an HITL escalation item that blocks the next tool call until a human decides.
Resets
The rolling window resets on:
- An explicit operator reset via Agents → (agent) → Reset Drift
- A new session (drift is per-session by default; tenants can opt into workflow-scoped drift instead)
- A successful HITL clear that an operator marks as "false positive"
Resets are audited.
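The reset rules can be sketched as a hypothetical Python tracker that keys drift windows by session by default, or by workflow when a tenant opts in, and writes an audit record for every reset. The class, method, and reason names are illustrative assumptions, not Behavry's API.

```python
from dataclasses import dataclass, field

RESET_REASONS = {"operator_reset", "new_session", "hitl_false_positive"}

@dataclass
class DriftTracker:
    """Hypothetical per-agent tracker; windows keyed by session (default) or workflow."""
    workflow_scoped: bool = False
    windows: dict[str, list] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)

    def _key(self, session_id: str, workflow_id: str) -> str:
        return workflow_id if self.workflow_scoped else session_id

    def record(self, session_id: str, workflow_id: str, match) -> None:
        self.windows.setdefault(self._key(session_id, workflow_id), []).append(match)

    def window(self, session_id: str, workflow_id: str) -> list:
        return self.windows.get(self._key(session_id, workflow_id), [])

    def reset(self, session_id: str, workflow_id: str, reason: str, actor: str) -> None:
        if reason not in RESET_REASONS:
            raise ValueError(f"unknown reset reason: {reason}")
        key = self._key(session_id, workflow_id)
        discarded = len(self.windows.pop(key, []))
        # Every reset leaves an audit record, whoever triggered it.
        self.audit_log.append({"key": key, "reason": reason,
                               "actor": actor, "discarded": discarded})

    def on_new_session(self, previous_session_id: str, workflow_id: str) -> None:
        # Session-scoped drift starts over; workflow-scoped drift carries across sessions.
        if not self.workflow_scoped:
            self.reset(previous_session_id, workflow_id, "new_session", actor="system")

tracker = DriftTracker()
tracker.record("s1", "wf1", "m1")
tracker.record("s1", "wf1", "m2")
tracker.on_new_session("s1", "wf1")
print(tracker.audit_log[-1])
# {'key': 's1', 'reason': 'new_session', 'actor': 'system', 'discarded': 2}
```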
Related
- Inbound Injection Scanner — the per-response detector this sits above
- Content Trust Domain Tagging — what weights the drift score
- Behavioral Monitor — the companion detector that catches drift in actions rather than content