Data Protection

AI agents exchange sensitive data with the tools they use -- credentials in API calls, personal information in database queries, financial data in payment workflows. Behavry's data protection pipeline ensures that audit trails capture the operational context needed for compliance and forensics without unnecessarily retaining raw sensitive content.

The pipeline processes every tool call payload through four stages before the audit record is written, giving security teams granular control over what is stored, how it is protected, and when it expires.

Pipeline overview

Tool Call Payload (request + response)
|
Stage 1: CLASSIFY
DLP findings --> data class tags
(CREDENTIAL, FINANCIAL, PII, PHI, GOVERNMENT_ID)
|
Stage 2: REDACT
DLP span replacement, JSON field masking,
identifier pseudonymization (HMAC-SHA256)
|
Stage 3: DISPOSE
full --> store as-is
metadata_only --> strip all payloads
redacted --> store redacted version
encrypted --> pass to Stage 4
|
Stage 4: ENCRYPT
KMS envelope encryption (AES-256-GCM)
|
audit_events row written
|
Nightly: RETENTION PURGE
NULL out payloads older than retention_days
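
The control flow above can be sketched as a small dispatch function. This is an illustrative outline only, not Behavry's internal API: the helper names (`classify`, `redact`, `encrypt`, `protect_payload`) and the stub bodies are ours, and the real stages are far richer.

```python
import json

# Hypothetical sketch of the four-stage pipeline; names and stubs are ours.

def classify(findings):
    # Stage 1: map DLP pattern names to data class tags (abridged mapping).
    mapping = {"email_address": "PII", "aws_access_key": "CREDENTIAL",
               "credit_card": "FINANCIAL"}
    return sorted({mapping[f] for f in findings if f in mapping})

def redact(payload, findings):
    # Stage 2: placeholder; the real engine replaces only the matched spans.
    return {k: "[REDACTED]" for k in payload}

def encrypt(redacted):
    # Stage 4: placeholder for KMS envelope encryption of the JSON payload.
    return json.dumps(redacted).encode()

def protect_payload(payload, findings, mode):
    # Stage 3: disposition decides which columns of the audit row are filled.
    record = {"data_classes": classify(findings), "dp_mode": mode}
    if mode == "full":
        record["request_body"] = payload          # store as-is
    elif mode == "metadata_only":
        pass                                      # strip all payloads
    elif mode == "redacted":
        record["payload_redacted"] = redact(payload, findings)
    elif mode == "encrypted":
        record["payload_encrypted"] = encrypt(redact(payload, findings))
    return record
```

Note how `data_classes` and `dp_mode` are written in every mode; only the payload columns vary.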

Four protection modes

The payload_mode setting controls how much payload data is retained in the audit trail. Each mode builds on the previous stage's output.

| Mode | What is stored | Use case |
| --- | --- | --- |
| full | Raw request and response bodies | Development, non-regulated environments |
| metadata_only | No payload content; metadata, timestamps, and policy decisions only | Maximum privacy, high-sensitivity tenants |
| redacted | DLP-matched spans replaced with [REDACTED:pattern_name] tokens; configured fields masked | Compliance environments that need audit context without raw secrets |
| encrypted | Redacted payload encrypted with AES-256-GCM via KMS envelope encryption | Regulated environments requiring encryption at rest with key separation |

When encrypted mode is configured but KMS is unavailable, the pipeline falls back to metadata_only rather than storing unprotected content.

Stage 1: Classification

Classification maps DLP scan findings to data class tags. These tags are stored on every audit event regardless of the protection mode, enabling compliance reporting without accessing payload content.

The classification stage reuses the DLP findings already produced by the proxy engine -- it does not re-scan the payload.

Data class mapping

Behavry maps 24 DLP patterns to five data classes:

| Data class | DLP patterns |
| --- | --- |
| CREDENTIAL | aws_access_key, openai_api_key, anthropic_api_key, jwt_token, stripe_api_key, pgp_private_key, azure_sas_token, gitlab_token, twilio_api_key, gcp_service_account, sendgrid_api_key, slack_webhook, discord_webhook, docker_auth, credential_assignment |
| FINANCIAL | credit_card, bank_account |
| GOVERNMENT_ID | ssn, passport, drivers_license |
| PII | email_address, phone_number, ip_address |
| PHI | health_record |

The data_classes array is written to the audit_events table as a TEXT[] column, making it queryable for compliance dashboards and SIEM filters without payload access.
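
The mapping in the table above can be expressed as a lookup dict plus a deduplicating reducer. This is a sketch for illustration (the dict literal mirrors the documented table; the function name is ours):

```python
# Pattern-to-class mapping, transcribed from the table above.
DATA_CLASS_MAP = {
    **dict.fromkeys(
        ["aws_access_key", "openai_api_key", "anthropic_api_key", "jwt_token",
         "stripe_api_key", "pgp_private_key", "azure_sas_token", "gitlab_token",
         "twilio_api_key", "gcp_service_account", "sendgrid_api_key",
         "slack_webhook", "discord_webhook", "docker_auth",
         "credential_assignment"], "CREDENTIAL"),
    **dict.fromkeys(["credit_card", "bank_account"], "FINANCIAL"),
    **dict.fromkeys(["ssn", "passport", "drivers_license"], "GOVERNMENT_ID"),
    **dict.fromkeys(["email_address", "phone_number", "ip_address"], "PII"),
    "health_record": "PHI",
}

def to_data_classes(findings):
    # Deduplicate and sort, matching the TEXT[] column's set-like semantics.
    return sorted({DATA_CLASS_MAP[f] for f in findings if f in DATA_CLASS_MAP})
```

Multiple findings of the same class collapse to one tag, so a payload with three API keys still yields a single CREDENTIAL entry.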

Stage 2: Redaction

When the protection mode is redacted or encrypted, the redaction engine processes both request and response payloads through three mechanisms.

DLP span redaction

Matched DLP values in string fields are replaced with tokens:

"Please send to john@acme.com"
-->
"Please send to [REDACTED:email_address]"

This is controlled by the redact_dlp_matches policy flag (default: true).
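
A minimal sketch of span replacement for a single pattern (email) looks like the following; the production engine applies all 24 DLP patterns, and this regex is a simplified stand-in:

```python
import re

# Simplified email matcher; real DLP patterns are more precise.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_spans(text: str) -> str:
    # Replace each matched span with a named redaction token.
    return EMAIL_RE.sub("[REDACTED:email_address]", text)
```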

JSON field path redaction

Administrators can specify JSON paths to always redact, regardless of DLP matches:

{
  "redact_fields": [
    "$.messages[*].content",
    "$.auth.password",
    "$.response.body"
  ]
}

Paths support dot notation and array wildcards ([*]). Matched fields are replaced with [REDACTED:path].
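
The dot-notation and [*] semantics can be sketched with a short recursive walk. This is an illustrative implementation under our own assumptions (a full JSONPath engine handles far more syntax):

```python
# Sketch: walk a parsed path, replacing the terminal field with a token.
def redact_path(obj, parts, path):
    if not parts:
        return f"[REDACTED:{path}]"
    head, rest = parts[0], parts[1:]
    if head == "[*]" and isinstance(obj, list):
        # Wildcard: apply the remaining path to every array element.
        return [redact_path(item, rest, path) for item in obj]
    if isinstance(obj, dict) and head in obj:
        obj = dict(obj)  # copy so the original payload is untouched
        obj[head] = redact_path(obj[head], rest, path)
    return obj

def apply_redact_fields(payload, paths):
    for path in paths:
        # "$.messages[*].content" -> ["messages", "[*]", "content"]
        parts = [p for p in
                 path.lstrip("$.").replace("[*]", ".[*]").split(".") if p]
        payload = redact_path(payload, parts, path)
    return payload
```

Paths that do not match any field leave the payload unchanged, which keeps the operation safe to apply unconditionally.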

Identifier pseudonymization

When hash_identifiers is enabled with an identifier_salt, real identifiers (agent IDs, session IDs, requester IDs) are replaced with deterministic HMAC-SHA256 pseudonyms:

agent_id: "agent-abc-123"
-->
agent_id: "pseudo_7a3b9f2e1c4d8a6b"

Pseudonymization is deterministic -- the same input with the same salt always produces the same pseudonym, preserving correlation across events without exposing real identifiers.
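
The determinism property is easy to see in a sketch. We assume HMAC-SHA256 truncated to 16 hex characters to match the example above; the exact truncation length and prefix are our guesses, not a documented format:

```python
import hashlib
import hmac

def pseudonymize(value: str, salt: str) -> str:
    # HMAC-SHA256 keyed by the tenant's identifier_salt; truncation to
    # 16 hex chars is an assumption based on the example output.
    digest = hmac.new(salt.encode(), value.encode(), hashlib.sha256).hexdigest()
    return "pseudo_" + digest[:16]
```

Because HMAC is keyed, an attacker without the salt cannot brute-force short identifier spaces back to real IDs, while anyone holding the same salt can re-derive the pseudonym for correlation.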

Stage 3: Disposition

Disposition determines what payload content is persisted to the database:

  • full -- raw request and response bodies are stored in request_body and response_body JSONB columns.
  • metadata_only -- all payload columns are set to NULL. Only metadata (action, target, policy result, data classes, timestamps) is retained.
  • redacted -- the redacted payload is stored in the payload_redacted JSONB column. Original payloads are not stored.
  • encrypted -- the redacted payload is serialized to JSON, encrypted via KMS, and stored in the payload_encrypted BYTEA column.

Stage 4: Envelope encryption

When payload_mode is encrypted, the pipeline encrypts the redacted payload using envelope encryption before writing the audit record.

KMS providers

Behavry supports pluggable KMS providers via the KMSClient protocol:

from typing import Protocol

class KMSClient(Protocol):
    async def encrypt(self, plaintext: bytes, context: dict) -> bytes: ...
    async def decrypt(self, ciphertext: bytes, context: dict) -> bytes: ...
    async def health_check(self) -> bool: ...

LocalKMSClient

For self-hosted deployments, the local provider uses AES-256-GCM with a customer-provided key.

  • Algorithm: AES-256-GCM with 96-bit random nonce
  • Key source: BEHAVRY_LOCAL_ENCRYPTION_KEY environment variable (base64-encoded, 32 bytes)
  • Storage format: [12-byte nonce][ciphertext + GCM tag]

# Generate a key
openssl rand -base64 32

# Set in environment
export BEHAVRY_LOCAL_ENCRYPTION_KEY="<base64-encoded-32-byte-key>"

AWSKMSClient

For AWS-hosted deployments, the AWS provider uses envelope encryption with AWS KMS.

  • Algorithm: AWS KMS generates a unique AES-256 data key per encryption operation; the data key encrypts the payload with AES-256-GCM; the encrypted data key is stored alongside the ciphertext
  • Key source: AWS KMS key ARN configured via kms_key_id
  • Storage format: [4-byte key length][encrypted data key][12-byte nonce][ciphertext + GCM tag]
  • Dependency: boto3 (pip install boto3)

The envelope encryption model means the KMS master key never leaves AWS, and each audit event uses a unique data key.
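
The AWS storage layout above can be framed and parsed with a few lines of struct packing. This sketch covers the byte layout only, not the encryption itself, and the big-endian length prefix is our assumption:

```python
import struct

# Documented layout:
# [4-byte key length][encrypted data key][12-byte nonce][ciphertext + GCM tag]
# Big-endian length prefix is assumed for illustration.

def pack_envelope(encrypted_key: bytes, nonce: bytes, ciphertext: bytes) -> bytes:
    assert len(nonce) == 12, "GCM nonce is 96 bits"
    return struct.pack(">I", len(encrypted_key)) + encrypted_key + nonce + ciphertext

def unpack_envelope(blob: bytes):
    (key_len,) = struct.unpack_from(">I", blob, 0)
    encrypted_key = blob[4:4 + key_len]
    nonce = blob[4 + key_len:4 + key_len + 12]
    ciphertext = blob[4 + key_len + 12:]  # includes the 16-byte GCM tag
    return encrypted_key, nonce, ciphertext
```

The length prefix is what lets a decryptor recover the encrypted data key first, call KMS to unwrap it, and only then decrypt the payload locally.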

Retention and purge

The retention system enforces time-based payload expiration. When payload_retention_days is configured, a nightly purge task sets all payload columns to NULL for audit events older than the retention window.

The purge preserves:

  • All metadata (action, target, agent, session, timestamps)
  • Policy decisions and risk scores
  • Data class tags
  • The audit hash chain (integrity is unaffected)

The payload_purged_at timestamp records when the purge occurred, providing an auditable trail of data lifecycle management.
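
The purge predicate itself is simple; a sketch under our naming (the real task runs as a nightly database update, not per-row Python):

```python
from datetime import datetime, timedelta, timezone

def payload_purge_due(event_time, retention_days, now=None):
    # Null retention_days means payloads are retained indefinitely.
    if retention_days is None:
        return False
    now = now or datetime.now(timezone.utc)
    return event_time < now - timedelta(days=retention_days)
```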

Retention status API

GET /api/v1/audit/retention-status
Authorization: Bearer <admin-jwt>

Response:

{
  "events_with_payload": 14302,
  "purged_last_24h": 891
}

Payload decryption

Encrypted payloads can be decrypted on demand through a dedicated, audit-logged endpoint.

POST /api/v1/audit/events/{event_id}/decrypt
Authorization: Bearer <admin-jwt>

Response:

{
  "event_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "decrypted": {
    "request": { "tool": "read_file", "path": "/etc/config" },
    "response": { "content": "[REDACTED:credential_assignment]" }
  }
}

Every decryption attempt -- successful or failed -- writes an immutable payload_decrypt audit record. This record is always stored in metadata_only mode to prevent recursive payload capture. The audit log includes the admin username, target event ID, and success/failure status.

If the event is not encrypted, the endpoint returns 400; if the KMS provider is unavailable, it returns 500.

Configuration API

Get current policy

GET /api/v1/admin/data-protection
Authorization: Bearer <admin-jwt>

Response:

{
  "payload_mode": "redacted",
  "redact_dlp_matches": true,
  "redact_fields": ["$.messages[*].content"],
  "hash_identifiers": false,
  "encryption_enabled": false,
  "kms_provider": null,
  "kms_key_id_suffix": null,
  "payload_retention_days": 90,
  "strip_payload_from_stream": true
}

The kms_key_id_suffix field returns only the last 8 characters of the KMS key ID. The full key ID is never exposed via the API.

Update policy

PATCH /api/v1/admin/data-protection
Authorization: Bearer <admin-jwt>
Content-Type: application/json

{
  "payload_mode": "encrypted",
  "encryption_enabled": true,
  "kms_provider": "local",
  "payload_retention_days": 30,
  "test_kms_connectivity": true
}

Set test_kms_connectivity to true to verify KMS health before saving. The request fails with 400 if the KMS health check does not pass, preventing misconfiguration.

Validation rules:

  • encryption_enabled: true requires kms_provider to be set.
  • payload_mode must be one of: full, metadata_only, redacted, encrypted.
  • kms_provider must be one of: local, aws, azure, gcp (Azure and GCP are not yet implemented).
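
The validation rules above can be sketched as a checker that collects errors; the function shape and error strings are ours, not Behavry's API:

```python
VALID_MODES = {"full", "metadata_only", "redacted", "encrypted"}
VALID_KMS = {"local", "aws", "azure", "gcp"}

def validate_policy(update: dict) -> list:
    # Returns a list of rule violations; empty means the update is valid.
    errors = []
    if update.get("encryption_enabled") and not update.get("kms_provider"):
        errors.append("encryption_enabled requires kms_provider")
    if "payload_mode" in update and update["payload_mode"] not in VALID_MODES:
        errors.append("invalid payload_mode")
    if update.get("kms_provider") and update["kms_provider"] not in VALID_KMS:
        errors.append("invalid kms_provider")
    return errors
```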

Policy model reference

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| payload_mode | string | "full" | Protection mode: full, metadata_only, redacted, encrypted |
| redact_dlp_matches | boolean | true | Replace DLP-matched spans with redaction tokens |
| redact_fields | string[] | [] | JSON paths to always redact (e.g., $.auth.password) |
| hash_identifiers | boolean | false | Pseudonymize agent/session/requester IDs |
| identifier_salt | string | null | HMAC salt for pseudonymization (required when hash_identifiers is true) |
| encryption_enabled | boolean | false | Enable KMS envelope encryption |
| kms_provider | string | null | KMS provider: local, aws, azure, gcp |
| kms_key_id | string | null | KMS key identifier (ARN for AWS, ignored for local) |
| payload_retention_days | integer | null | Days before payload purge (null = retain indefinitely) |
| strip_payload_from_stream | boolean | true | Exclude payload fields from SSE dashboard stream |

Audit event columns

The data protection pipeline writes to eight dedicated columns on the audit_events table:

| Column | Type | Description |
| --- | --- | --- |
| request_body | JSONB | Raw or redacted request payload (full mode only) |
| response_body | JSONB | Raw or redacted response payload (full mode only) |
| payload_redacted | JSONB | Combined redacted request + response (redacted mode) |
| payload_encrypted | BYTEA | Encrypted payload blob (encrypted mode) |
| encryption_key_id | TEXT | KMS key identifier used for encryption |
| data_classes | TEXT[] | Detected data classes (e.g., ["CREDENTIAL", "PII"]) |
| payload_purged_at | TIMESTAMP | When the retention purge nulled the payload |
| dp_mode | TEXT | Protection mode applied: full, metadata_only, redacted, encrypted |

Isolation guarantees

The data protection pipeline enforces two isolation boundaries to prevent sensitive content from leaking through side channels:

  1. Event bus isolation -- The raw_payload field is removed from events published to the internal event bus. The behavioral monitor, drift detector, and all other subscribers never see payload content.

  2. SSE stream isolation -- When strip_payload_from_stream is true (the default), the to_sse() serializer excludes payload fields from the real-time dashboard stream. Administrators see metadata, policy decisions, and data class tags, but not payload content.

These boundaries ensure that even in full mode, payload data is only accessible through direct database queries or the decryption API -- never through real-time monitoring channels.

Compliance mapping

| Framework | Requirement | Behavry capability |
| --- | --- | --- |
| GDPR Art. 32 | Encryption of personal data | encrypted mode with KMS envelope encryption; redacted mode with DLP-based pseudonymization |
| HIPAA 164.312(a)(2)(iv) | Encryption and decryption of ePHI | AES-256-GCM encryption; audit-logged decryption with admin authentication |
| HIPAA 164.312(b) | Audit controls | Immutable audit trail with hash chain; decryption attempts logged |
| SOC 2 CC6.1 | Logical and physical access controls | Data classification tags on every event; role-based decryption access; KMS key separation |
| SOC 2 CC6.7 | Restriction of data in transmission | SSE stream excludes payload fields; event bus strips raw content |
| GDPR Art. 17 | Right to erasure | Configurable retention purge with auditable payload_purged_at timestamp |