
Data Protection Pipeline

Feature row 50 — Sprint DP (Partial)

Data Protection is included on the Enterprise plan. Phase 1 (classify / redact) ships today; Phase 2 (full KMS-backed encryption, additional providers) is in progress.

What this is

By default, Behavry stores metadata about every tool call (who made it, which tool, against what target, and what policy decision was reached) but does not store the payload itself. That's the right default for most tenants: less data at rest, less to leak, and a faster audit log.

Some tenants want different defaults. The Data Protection Pipeline lets you choose, per tenant (or per agent class), how payloads are handled:

| Mode | Payload at rest |
|---|---|
| full | Complete payload stored in audit log |
| metadata_only | Only metadata, no payload (default) |
| redacted | Payload stored with DLP-matched fields replaced by [redacted] placeholders |
| encrypted | Full payload stored, encrypted at rest with a tenant-held key (KMS-backed) |

The four stages

The pipeline (backend/behavry/proxy/dp_pipeline.py) runs four stages in order for every in-flight payload:

1. Classify

Detect data categories in the payload:

  • Run the DLP scanner to tag segments (pii, pci:*, hipaa:phi, gdpr:*, etc.)
  • Run the injection scanner to tag adversarial content
  • Record the tag set in the audit event's dlp_findings
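
A minimal sketch of what the classify stage produces, assuming a regex-based scanner. The patterns, the `Finding` type, the `classify` function, and the `pci:pan` sub-tag are hypothetical illustrations; the real DLP and injection scanners are not described on this page.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration only; a production DLP scanner
# needs far more robust detection than two regular expressions.
PATTERNS = {
    "pii": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),    # email-like strings
    "pci:pan": re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # card-number-like digit runs
}


@dataclass(frozen=True)
class Finding:
    category: str
    start: int
    end: int


def classify(payload: str) -> list[Finding]:
    """Return every pattern match as a tagged span, ordered by position."""
    findings = [
        Finding(category, match.start(), match.end())
        for category, pattern in PATTERNS.items()
        for match in pattern.finditer(payload)
    ]
    return sorted(findings, key=lambda f: f.start)


findings = classify("mail alice@example.com, card 4111 1111 1111 1111")
# The deduplicated tag set is what would be recorded in dlp_findings.
dlp_findings = sorted({f.category for f in findings})
```

Keeping the span offsets alongside each tag lets a later stage redact exactly the matched segments without rescanning.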

2. Redact

If the mode is redacted:

  • Replace matched segments with [redacted:{category}] placeholders
  • Preserve length where possible so downstream consumers don't break
  • Keep a redaction map in memory for the encrypt stage
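
The redaction step can be sketched as a single pass over the matched spans. This is an illustration under stated assumptions: the `redact` function, the `(start, end, category)` span shape, and the map keys are hypothetical, spans are assumed not to overlap, and the sketch does not attempt length preservation.

```python
def redact(
    payload: str, spans: list[tuple[int, int, str]]
) -> tuple[str, dict[str, str]]:
    """Replace each (start, end, category) span with a placeholder.

    Returns the redacted text and a redaction map from a per-span key
    to the original text that was removed.
    """
    redaction_map: dict[str, str] = {}
    parts: list[str] = []
    cursor = 0
    for index, (start, end, category) in enumerate(sorted(spans)):
        parts.append(payload[cursor:start])        # untouched text before the span
        parts.append(f"[redacted:{category}]")     # placeholder format from the text above
        redaction_map[f"{index}:{category}"] = payload[start:end]
        cursor = end
    parts.append(payload[cursor:])                 # untouched tail
    return "".join(parts), redaction_map


text = "contact alice@example.com now"
redacted, mapping = redact(text, [(8, 25, "pii")])
```

Because the map holds the original values, it should live only in memory for the duration of the request and never be written alongside the redacted copy.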

3. Dispose

Decide what to do with the payload, redacted or not, based on the mode:

  • metadata_only → drop the payload entirely before write
  • full → keep it as-is
  • redacted → keep the redacted copy
  • encrypted → hand off to stage 4
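
The disposition rules above amount to a small dispatch on the mode. A minimal sketch follows; the `Mode` enum and the `dispose` signature are assumptions made for illustration, not the contents of dp_pipeline.py.

```python
from enum import Enum
from typing import Optional


class Mode(str, Enum):
    FULL = "full"
    METADATA_ONLY = "metadata_only"
    REDACTED = "redacted"
    ENCRYPTED = "encrypted"


def dispose(
    mode: Mode, original: str, redacted: Optional[str]
) -> tuple[Optional[str], bool]:
    """Return (payload_to_persist, needs_encryption) for the given mode."""
    if mode is Mode.METADATA_ONLY:
        return None, False          # drop the payload entirely before write
    if mode is Mode.FULL:
        return original, False      # keep it as-is
    if mode is Mode.REDACTED:
        if redacted is None:
            raise ValueError("redacted mode requires a redacted copy")
        return redacted, False      # keep the redacted copy
    if mode is Mode.ENCRYPTED:
        return original, True       # hand off to the encrypt stage
    raise ValueError(f"unknown mode: {mode!r}")
```

Returning `None` for `metadata_only` makes it hard for a later write path to persist a payload by accident.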

4. Encrypt

For encrypted mode:

  • Request an envelope-encrypted payload from the KMS client (backend/behavry/proxy/kms_client.py)
  • The key is tenant-scoped, and Behavry does not persist plaintext payloads for tenants in encrypted mode
  • Write the ciphertext + key reference to the audit row
  • Decryption requires the same KMS key and is audited
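
The handoff to the KMS client can be sketched as an interface that returns ciphertext, which is then stored next to a key reference. Everything named here (`KmsClient`, `EncryptedPayload`, `encrypt_stage`, `InMemoryFakeKms`) is hypothetical and does not describe the actual kms_client.py API. The fake client is a test double only; its XOR transform is not encryption.

```python
from dataclasses import dataclass
from typing import Protocol


class KmsClient(Protocol):
    """Assumed shape of a KMS client; the real interface may differ."""

    def encrypt(self, key_alias: str, plaintext: bytes) -> bytes: ...
    def decrypt(self, key_alias: str, ciphertext: bytes) -> bytes: ...


@dataclass(frozen=True)
class EncryptedPayload:
    ciphertext: bytes
    key_ref: str  # stored with the audit row so decryption can find the key


def encrypt_stage(payload: bytes, key_alias: str, kms: KmsClient) -> EncryptedPayload:
    """Encrypt via the KMS client and pair the result with its key reference."""
    return EncryptedPayload(ciphertext=kms.encrypt(key_alias, payload), key_ref=key_alias)


class InMemoryFakeKms:
    """Test double only: a fixed-byte XOR is NOT encryption."""

    def encrypt(self, key_alias: str, plaintext: bytes) -> bytes:
        return bytes(b ^ 0x5A for b in plaintext)

    def decrypt(self, key_alias: str, ciphertext: bytes) -> bytes:
        return bytes(b ^ 0x5A for b in ciphertext)


kms = InMemoryFakeKms()
row = encrypt_stage(b'{"ssn": "000-00-0000"}', "alias/behavry-acme", kms)
```

Hiding the provider behind a narrow interface like this is one way to let local and cloud-backed providers be swapped without touching the pipeline stages.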

KMS providers

Today the pipeline supports:

  • Local (dev only): symmetric AES-256 with a 32-byte key from BEHAVRY_LOCAL_ENCRYPTION_KEY
  • AWS KMS: production-ready; uses kms:Encrypt / kms:Decrypt with a KMS key (formerly called a Customer Master Key, or CMK, in AWS terminology)
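
For the local provider, a startup check on the key length helps catch misconfiguration early. The sketch below assumes the environment variable holds a base64-encoded key; this page does not state the encoding, so treat that, and the `load_local_key` helper itself, as hypothetical.

```python
import base64
import os
from collections.abc import Mapping


def load_local_key(env: Mapping[str, str] = os.environ) -> bytes:
    """Read and validate the dev-only key (base64 encoding is an assumption)."""
    raw = env.get("BEHAVRY_LOCAL_ENCRYPTION_KEY")
    if not raw:
        raise RuntimeError("BEHAVRY_LOCAL_ENCRYPTION_KEY is not set")
    try:
        key = base64.b64decode(raw, validate=True)
    except ValueError as exc:  # binascii.Error is a ValueError subclass
        raise RuntimeError("BEHAVRY_LOCAL_ENCRYPTION_KEY is not valid base64") from exc
    if len(key) != 32:  # AES-256 requires exactly 32 bytes of key material
        raise RuntimeError(f"expected a 32-byte key, got {len(key)} bytes")
    return key


# Example with an explicit mapping instead of the process environment.
example_env = {"BEHAVRY_LOCAL_ENCRYPTION_KEY": base64.b64encode(bytes(32)).decode()}
key = load_local_key(example_env)
```

An all-zero key like the one in the example is only suitable for demonstrating the check; generate real dev keys with `os.urandom(32)`.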

Azure Key Vault and GCP Cloud KMS are scoped for a follow-up phase — this is the "Partial" in the Sprint DP status.

Configuration

Configure the pipeline under Settings → Limits → Data protection:

  • Mode — the four options above
  • Override by agent class — e.g. healthcare-agents → redacted, research-agents → metadata_only
  • KMS provider — local (dev) / AWS KMS (production)
  • Key alias — the CMK alias to use (alias/behavry-{tenant-slug} by default)
  • Retention override — data-protection mode can set a shorter retention window than the tenant default
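
To show how a per-class override interacts with the tenant-wide mode, here is a minimal sketch of the resolution logic. The `DataProtectionConfig` fields and helper names are hypothetical and do not reflect the actual settings schema; only the mode names, the example agent classes, and the default alias pattern come from the list above.

```python
from dataclasses import dataclass, field
from typing import Optional

VALID_MODES = {"full", "metadata_only", "redacted", "encrypted"}


@dataclass
class DataProtectionConfig:
    mode: str = "metadata_only"                                   # tenant-wide default
    class_overrides: dict[str, str] = field(default_factory=dict)  # agent class -> mode
    kms_provider: str = "local"


def default_key_alias(tenant_slug: str) -> str:
    """Default alias pattern: alias/behavry-{tenant-slug}."""
    return f"alias/behavry-{tenant_slug}"


def effective_mode(config: DataProtectionConfig, agent_class: Optional[str]) -> str:
    """An agent-class override wins; otherwise fall back to the tenant mode."""
    mode = config.class_overrides.get(agent_class or "", config.mode)
    if mode not in VALID_MODES:
        raise ValueError(f"unknown data protection mode: {mode!r}")
    return mode


config = DataProtectionConfig(
    mode="encrypted",
    class_overrides={
        "healthcare-agents": "redacted",
        "research-agents": "metadata_only",
    },
)
```

With this configuration, `effective_mode(config, "healthcare-agents")` resolves to `redacted`, while any agent class without an override falls back to the tenant-wide `encrypted`.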

Costs

Encrypted mode carries a KMS request cost (one envelope-encryption call per audit write) and a small latency overhead (tens of milliseconds). Redacted mode has no runtime cost beyond the DLP scan that already happens. Full mode has no runtime cost but the largest at-rest data size.

Migration path

Tenants typically start with the default (metadata_only) and move to redacted or encrypted as part of a compliance roll-out (HIPAA, PCI, GDPR). Switching modes takes effect on the next audit write; existing events are left in their original form.