Data Protection

AI agents exchange sensitive data with the tools they use -- credentials in API calls, personal information in database queries, financial data in payment workflows. Behavry's data protection pipeline ensures that audit trails capture the operational context needed for compliance and forensics without unnecessarily retaining raw sensitive content.

The pipeline processes every tool call payload through four stages before the audit record is written, giving security teams granular control over what is stored, how it is protected, and when it expires.

Pipeline overview

Tool Call Payload (request + response)
|
Stage 1: CLASSIFY
DLP findings --> data class tags
(CREDENTIAL, FINANCIAL, PII, PHI, GOVERNMENT_ID)
|
Stage 2: REDACT
DLP span replacement, JSON field masking,
identifier pseudonymization (HMAC-SHA256)
|
Stage 3: DISPOSE
full --> store as-is
metadata_only --> strip all payloads
redacted --> store redacted version
encrypted --> pass to Stage 4
|
Stage 4: ENCRYPT
KMS envelope encryption (AES-256-GCM)
|
audit_events row written
|
Nightly: RETENTION PURGE
NULL out payloads older than retention_days
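
The control flow above can be sketched as a small dispatch function. This is an illustrative outline only, not Behavry's internal API: the helper names (`classify`, `redact`, `encrypt`, `protect_payload`) and the stub bodies are ours, and the real stages are far richer.

```python
import json

# Hypothetical sketch of the four-stage pipeline; names and stubs are ours.

def classify(findings):
    # Stage 1: map DLP pattern names to data class tags (abridged mapping).
    mapping = {"email_address": "PII", "aws_access_key": "CREDENTIAL",
               "credit_card": "FINANCIAL"}
    return sorted({mapping[f] for f in findings if f in mapping})

def redact(payload, findings):
    # Stage 2: placeholder; the real engine replaces only the matched spans.
    return {k: "[REDACTED]" for k in payload}

def encrypt(redacted):
    # Stage 4: placeholder for KMS envelope encryption of the JSON payload.
    return json.dumps(redacted).encode()

def protect_payload(payload, findings, mode):
    # Stage 3: disposition decides which columns of the audit row are filled.
    record = {"data_classes": classify(findings), "dp_mode": mode}
    if mode == "full":
        record["request_body"] = payload          # store as-is
    elif mode == "metadata_only":
        pass                                      # strip all payloads
    elif mode == "redacted":
        record["payload_redacted"] = redact(payload, findings)
    elif mode == "encrypted":
        record["payload_encrypted"] = encrypt(redact(payload, findings))
    return record
```

Note how `data_classes` and `dp_mode` are written in every mode; only the payload columns vary.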

Four protection modes

The payload_mode setting controls how much payload data is retained in the audit trail. Each mode builds on the previous stage's output.

| Mode | What is stored | Use case |
| --- | --- | --- |
| full | Raw request and response bodies | Development, non-regulated environments |
| metadata_only | No payload content; metadata, timestamps, and policy decisions only | Maximum privacy, high-sensitivity tenants |
| redacted | DLP-matched spans replaced with [REDACTED:pattern_name] tokens; configured fields masked | Compliance environments that need audit context without raw secrets |
| encrypted | Redacted payload encrypted with AES-256-GCM via KMS envelope encryption | Regulated environments requiring encryption at rest with key separation |

When encrypted mode is configured but KMS is unavailable, the pipeline falls back to metadata_only rather than storing unprotected content.

Stage 1: Classification

Classification maps DLP scan findings to data class tags. These tags are stored on every audit event regardless of the protection mode, enabling compliance reporting without accessing payload content.

The classification stage reuses the DLP findings already produced by the proxy engine -- it does not re-scan the payload.

Data class mapping

Behavry maps 24 DLP patterns to five data classes:

| Data class | DLP patterns |
| --- | --- |
| CREDENTIAL | aws_access_key, openai_api_key, anthropic_api_key, jwt_token, stripe_api_key, pgp_private_key, azure_sas_token, gitlab_token, twilio_api_key, gcp_service_account, sendgrid_api_key, slack_webhook, discord_webhook, docker_auth, credential_assignment |
| FINANCIAL | credit_card, bank_account |
| GOVERNMENT_ID | ssn, passport, drivers_license |
| PII | email_address, phone_number, ip_address |
| PHI | health_record |

The data_classes array is written to the audit_events table as a TEXT[] column, making it queryable for compliance dashboards and SIEM filters without payload access.
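
The mapping in the table above can be expressed as a lookup dict plus a deduplicating reducer. This is a sketch for illustration (the dict literal mirrors the documented table; the function name is ours):

```python
# Pattern-to-class mapping, transcribed from the table above.
DATA_CLASS_MAP = {
    **dict.fromkeys(
        ["aws_access_key", "openai_api_key", "anthropic_api_key", "jwt_token",
         "stripe_api_key", "pgp_private_key", "azure_sas_token", "gitlab_token",
         "twilio_api_key", "gcp_service_account", "sendgrid_api_key",
         "slack_webhook", "discord_webhook", "docker_auth",
         "credential_assignment"], "CREDENTIAL"),
    **dict.fromkeys(["credit_card", "bank_account"], "FINANCIAL"),
    **dict.fromkeys(["ssn", "passport", "drivers_license"], "GOVERNMENT_ID"),
    **dict.fromkeys(["email_address", "phone_number", "ip_address"], "PII"),
    "health_record": "PHI",
}

def to_data_classes(findings):
    # Deduplicate and sort, matching the TEXT[] column's set-like semantics.
    return sorted({DATA_CLASS_MAP[f] for f in findings if f in DATA_CLASS_MAP})
```

Multiple findings of the same class collapse to one tag, so a payload with three API keys still yields a single CREDENTIAL entry.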

Stage 2: Redaction

When the protection mode is redacted or encrypted, the redaction engine processes both request and response payloads through three mechanisms.

DLP span redaction

Matched DLP values in string fields are replaced with tokens:

"Please send to john@acme.com"
-->
"Please send to [REDACTED:email_address]"

This is controlled by the redact_dlp_matches policy flag (default: true).
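
A minimal sketch of span replacement for a single pattern (email) looks like the following; the production engine applies all 24 DLP patterns, and this regex is a simplified stand-in:

```python
import re

# Simplified email matcher; real DLP patterns are more precise.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_spans(text: str) -> str:
    # Replace each matched span with a named redaction token.
    return EMAIL_RE.sub("[REDACTED:email_address]", text)
```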

JSON field path redaction

Administrators can specify JSON paths to always redact, regardless of DLP matches:

{
  "redact_fields": [
    "$.messages[*].content",
    "$.auth.password",
    "$.response.body"
  ]
}

Paths support dot notation and array wildcards ([*]). Matched fields are replaced with [REDACTED:path].
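
The dot-notation and [*] semantics can be sketched with a short recursive walk. This is an illustrative implementation under our own assumptions (a full JSONPath engine handles far more syntax):

```python
# Sketch: walk a parsed path, replacing the terminal field with a token.
def redact_path(obj, parts, path):
    if not parts:
        return f"[REDACTED:{path}]"
    head, rest = parts[0], parts[1:]
    if head == "[*]" and isinstance(obj, list):
        # Wildcard: apply the remaining path to every array element.
        return [redact_path(item, rest, path) for item in obj]
    if isinstance(obj, dict) and head in obj:
        obj = dict(obj)  # copy so the original payload is untouched
        obj[head] = redact_path(obj[head], rest, path)
    return obj

def apply_redact_fields(payload, paths):
    for path in paths:
        # "$.messages[*].content" -> ["messages", "[*]", "content"]
        parts = [p for p in
                 path.lstrip("$.").replace("[*]", ".[*]").split(".") if p]
        payload = redact_path(payload, parts, path)
    return payload
```

Paths that do not match any field leave the payload unchanged, which keeps the operation safe to apply unconditionally.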

Identifier pseudonymization

When hash_identifiers is enabled with an identifier_salt, real identifiers (agent IDs, session IDs, requester IDs) are replaced with deterministic HMAC-SHA256 pseudonyms:

agent_id: "agent-abc-123"
-->
agent_id: "pseudo_7a3b9f2e1c4d8a6b"

Pseudonymization is deterministic -- the same input with the same salt always produces the same pseudonym, preserving correlation across events without exposing real identifiers.
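
The determinism property is easy to see in a sketch. We assume HMAC-SHA256 truncated to 16 hex characters to match the example above; the exact truncation length and prefix are our guesses, not a documented format:

```python
import hashlib
import hmac

def pseudonymize(value: str, salt: str) -> str:
    # HMAC-SHA256 keyed by the tenant's identifier_salt; truncation to
    # 16 hex chars is an assumption based on the example output.
    digest = hmac.new(salt.encode(), value.encode(), hashlib.sha256).hexdigest()
    return "pseudo_" + digest[:16]
```

Because HMAC is keyed, an attacker without the salt cannot brute-force short identifier spaces back to real IDs, while anyone holding the same salt can re-derive the pseudonym for correlation.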

Stage 3: Disposition

Disposition determines what payload content is persisted to the database:

  • full -- raw request and response bodies are stored in request_body and response_body JSONB columns.
  • metadata_only -- all payload columns are set to NULL. Only metadata (action, target, policy result, data classes, timestamps) is retained.
  • redacted -- the redacted payload is stored in the payload_redacted JSONB column. Original payloads are not stored.
  • encrypted -- the redacted payload is serialized to JSON, encrypted via KMS, and stored in the payload_encrypted BYTEA column.

Stage 4: Envelope encryption

When payload_mode is encrypted, the pipeline encrypts the redacted payload using envelope encryption before writing the audit record.

KMS providers

Behavry supports pluggable KMS providers via the KMSClient protocol:

from typing import Protocol

class KMSClient(Protocol):
    async def encrypt(self, plaintext: bytes, context: dict) -> bytes: ...
    async def decrypt(self, ciphertext: bytes, context: dict) -> bytes: ...
    async def health_check(self) -> bool: ...

LocalKMSClient

For self-hosted deployments, the local provider uses AES-256-GCM with a customer-provided key.

  • Algorithm: AES-256-GCM with 96-bit random nonce
  • Key source: BEHAVRY_LOCAL_ENCRYPTION_KEY environment variable (base64-encoded, 32 bytes)
  • Storage format: [12-byte nonce][ciphertext + GCM tag]

# Generate a key
openssl rand -base64 32

# Set in environment
export BEHAVRY_LOCAL_ENCRYPTION_KEY="<base64-encoded-32-byte-key>"

AWSKMSClient

For AWS-hosted deployments, the AWS provider uses envelope encryption with AWS KMS.

  • Algorithm: AWS KMS generates a unique AES-256 data key per encryption operation; the data key encrypts the payload with AES-256-GCM; the encrypted data key is stored alongside the ciphertext
  • Key source: AWS KMS key ARN configured via kms_key_id
  • Storage format: [4-byte key length][encrypted data key][12-byte nonce][ciphertext + GCM tag]
  • Dependency: boto3 (pip install boto3)

The envelope encryption model means the KMS master key never leaves AWS, and each audit event uses a unique data key.
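
The AWS storage layout above can be framed and parsed with a few lines of struct packing. This sketch covers the byte layout only, not the encryption itself, and the big-endian length prefix is our assumption:

```python
import struct

# Documented layout:
# [4-byte key length][encrypted data key][12-byte nonce][ciphertext + GCM tag]
# Big-endian length prefix is assumed for illustration.

def pack_envelope(encrypted_key: bytes, nonce: bytes, ciphertext: bytes) -> bytes:
    assert len(nonce) == 12, "GCM nonce is 96 bits"
    return struct.pack(">I", len(encrypted_key)) + encrypted_key + nonce + ciphertext

def unpack_envelope(blob: bytes):
    (key_len,) = struct.unpack_from(">I", blob, 0)
    encrypted_key = blob[4:4 + key_len]
    nonce = blob[4 + key_len:4 + key_len + 12]
    ciphertext = blob[4 + key_len + 12:]  # includes the 16-byte GCM tag
    return encrypted_key, nonce, ciphertext
```

The length prefix is what lets a decryptor recover the encrypted data key first, call KMS to unwrap it, and only then decrypt the payload locally.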

Retention and purge

The retention system enforces time-based payload expiration. When payload_retention_days is configured, a nightly purge task sets all payload columns to NULL for audit events older than the retention window.

The purge preserves:

  • All metadata (action, target, agent, session, timestamps)
  • Policy decisions and risk scores
  • Data class tags
  • The audit hash chain (integrity is unaffected)

The payload_purged_at timestamp records when the purge occurred, providing an auditable trail of data lifecycle management.
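
The purge predicate itself is simple; a sketch under our naming (the real task runs as a nightly database update, not per-row Python):

```python
from datetime import datetime, timedelta, timezone

def payload_purge_due(event_time, retention_days, now=None):
    # Null retention_days means payloads are retained indefinitely.
    if retention_days is None:
        return False
    now = now or datetime.now(timezone.utc)
    return event_time < now - timedelta(days=retention_days)
```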

Retention status API

GET /api/v1/audit/retention-status
Authorization: Bearer <admin-jwt>

Response:

{
  "events_with_payload": 14302,
  "purged_last_24h": 891
}

Payload decryption

Encrypted payloads can be decrypted on demand through a dedicated, audit-logged endpoint.

POST /api/v1/audit/events/{event_id}/decrypt
Authorization: Bearer <admin-jwt>

Response:

{
  "event_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "decrypted": {
    "request": { "tool": "read_file", "path": "/etc/config" },
    "response": { "content": "[REDACTED:credential_assignment]" }
  }
}

Every decryption attempt -- successful or failed -- writes an immutable payload_decrypt audit record. This record is always stored in metadata_only mode to prevent recursive payload capture. The audit log includes the admin username, target event ID, and success/failure status.

If the event is not encrypted, the endpoint returns 400; if the KMS provider is unavailable, it returns 500.

Configuration API

Get current policy

GET /api/v1/admin/data-protection
Authorization: Bearer <admin-jwt>

Response:

{
  "payload_mode": "redacted",
  "redact_dlp_matches": true,
  "redact_fields": ["$.messages[*].content"],
  "hash_identifiers": false,
  "encryption_enabled": false,
  "kms_provider": null,
  "kms_key_id_suffix": null,
  "payload_retention_days": 90,
  "strip_payload_from_stream": true
}

The kms_key_id_suffix field returns only the last 8 characters of the KMS key ID. The full key ID is never exposed via the API.

Update policy

PATCH /api/v1/admin/data-protection
Authorization: Bearer <admin-jwt>
Content-Type: application/json

{
  "payload_mode": "encrypted",
  "encryption_enabled": true,
  "kms_provider": "local",
  "payload_retention_days": 30,
  "test_kms_connectivity": true
}

Set test_kms_connectivity to true to verify KMS health before saving. The request fails with 400 if the KMS health check does not pass, preventing misconfiguration.

Validation rules:

  • encryption_enabled: true requires kms_provider to be set.
  • payload_mode must be one of: full, metadata_only, redacted, encrypted.
  • kms_provider must be one of: local, aws, azure, gcp (Azure and GCP are not yet implemented).
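
The validation rules above can be sketched as a checker that collects errors; the function shape and error strings are ours, not Behavry's API:

```python
VALID_MODES = {"full", "metadata_only", "redacted", "encrypted"}
VALID_KMS = {"local", "aws", "azure", "gcp"}

def validate_policy(update: dict) -> list:
    # Returns a list of rule violations; empty means the update is valid.
    errors = []
    if update.get("encryption_enabled") and not update.get("kms_provider"):
        errors.append("encryption_enabled requires kms_provider")
    if "payload_mode" in update and update["payload_mode"] not in VALID_MODES:
        errors.append("invalid payload_mode")
    if update.get("kms_provider") and update["kms_provider"] not in VALID_KMS:
        errors.append("invalid kms_provider")
    return errors
```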

Policy model reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| payload_mode | string | "full" | Protection mode: full, metadata_only, redacted, encrypted |
| redact_dlp_matches | boolean | true | Replace DLP-matched spans with redaction tokens |
| redact_fields | string[] | [] | JSON paths to always redact (e.g., $.auth.password) |
| hash_identifiers | boolean | false | Pseudonymize agent/session/requester IDs |
| identifier_salt | string | null | HMAC salt for pseudonymization (required when hash_identifiers is true) |
| encryption_enabled | boolean | false | Enable KMS envelope encryption |
| kms_provider | string | null | KMS provider: local, aws, azure, gcp |
| kms_key_id | string | null | KMS key identifier (ARN for AWS, ignored for local) |
| payload_retention_days | integer | null | Days before payload purge (null = retain indefinitely) |
| strip_payload_from_stream | boolean | true | Exclude payload fields from SSE dashboard stream |

Audit event columns

The data protection pipeline writes to eight dedicated columns on the audit_events table:

| Column | Type | Description |
| --- | --- | --- |
| request_body | JSONB | Raw or redacted request payload (full mode only) |
| response_body | JSONB | Raw or redacted response payload (full mode only) |
| payload_redacted | JSONB | Combined redacted request + response (redacted mode) |
| payload_encrypted | BYTEA | Encrypted payload blob (encrypted mode) |
| encryption_key_id | TEXT | KMS key identifier used for encryption |
| data_classes | TEXT[] | Detected data classes (e.g., ["CREDENTIAL", "PII"]) |
| payload_purged_at | TIMESTAMP | When the retention purge nulled the payload |
| dp_mode | TEXT | Protection mode applied: full, metadata_only, redacted, encrypted |

Isolation guarantees

The data protection pipeline enforces two isolation boundaries to prevent sensitive content from leaking through side channels:

  1. Event bus isolation -- The raw_payload field is removed from events published to the internal event bus. The behavioral monitor, drift detector, and all other subscribers never see payload content.

  2. SSE stream isolation -- When strip_payload_from_stream is true (the default), the to_sse() serializer excludes payload fields from the real-time dashboard stream. Administrators see metadata, policy decisions, and data class tags, but not payload content.

These boundaries ensure that even in full mode, payload data is only accessible through direct database queries or the decryption API -- never through real-time monitoring channels.

Compliance mapping

| Framework | Requirement | Behavry capability |
| --- | --- | --- |
| GDPR Art. 32 | Encryption of personal data | encrypted mode with KMS envelope encryption; redacted mode with DLP-based pseudonymization |
| HIPAA 164.312(a)(2)(iv) | Encryption and decryption of ePHI | AES-256-GCM encryption; audit-logged decryption with admin authentication |
| HIPAA 164.312(b) | Audit controls | Immutable audit trail with hash chain; decryption attempts logged |
| SOC 2 CC6.1 | Logical and physical access controls | Data classification tags on every event; role-based decryption access; KMS key separation |
| SOC 2 CC6.7 | Restriction of data in transmission | SSE stream excludes payload fields; event bus strips raw content |
| GDPR Art. 17 | Right to erasure | Configurable retention purge with auditable payload_purged_at timestamp |