# Data Protection
AI agents exchange sensitive data with the tools they use -- credentials in API calls, personal information in database queries, financial data in payment workflows. Behavry's data protection pipeline ensures that audit trails capture the operational context needed for compliance and forensics without unnecessarily retaining raw sensitive content.
The pipeline processes every tool call payload through four stages before the audit record is written, giving security teams granular control over what is stored, how it is protected, and when it expires.
## Pipeline overview

```text
Tool Call Payload (request + response)
        |
        v
Stage 1: CLASSIFY
  DLP findings --> data class tags
  (CREDENTIAL, FINANCIAL, PII, PHI, GOVERNMENT_ID)
        |
        v
Stage 2: REDACT
  DLP span replacement, JSON field masking,
  identifier pseudonymization (HMAC-SHA256)
        |
        v
Stage 3: DISPOSE
  full      --> store as-is
  metadata  --> strip all payloads
  redacted  --> store redacted version
  encrypted --> pass to Stage 4
        |
        v
Stage 4: ENCRYPT
  KMS envelope encryption (AES-256-GCM)
        |
        v
audit_events row written
        |
        v
Nightly: RETENTION PURGE
  NULL out payloads older than retention_days
```
## Four protection modes

The `payload_mode` setting controls how much payload data is retained in the audit trail. Each successive mode applies more of the pipeline's protections to the stored payload.
| Mode | What is stored | Use case |
|---|---|---|
| `full` | Raw request and response bodies | Development, non-regulated environments |
| `metadata_only` | No payload content; metadata, timestamps, and policy decisions only | Maximum privacy, high-sensitivity tenants |
| `redacted` | DLP-matched spans replaced with `[REDACTED:pattern_name]` tokens; configured fields masked | Compliance environments that need audit context without raw secrets |
| `encrypted` | Redacted payload encrypted with AES-256-GCM via KMS envelope encryption | Regulated environments requiring encryption at rest with key separation |
When `encrypted` mode is configured but KMS is unavailable, the pipeline falls back to `metadata_only` rather than storing unprotected content.
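The mode dispatch and fail-closed KMS fallback can be sketched as follows. This is a minimal illustration, not Behavry's actual internals: the function name, signature, and returned column dictionary are assumptions; only the column names mirror the `audit_events` schema documented below.

```python
import json

def apply_disposition(mode: str, request: dict, response: dict,
                      redacted: dict, kms_available: bool) -> dict:
    """Return the payload columns to persist for a given payload_mode."""
    if mode == "encrypted" and not kms_available:
        mode = "metadata_only"  # fail closed: never store unprotected content
    if mode == "full":
        return {"request_body": request, "response_body": response}
    if mode == "redacted":
        return {"payload_redacted": redacted}
    if mode == "encrypted":
        # placeholder: the real pipeline hands these bytes to the KMS client
        return {"payload_encrypted": json.dumps(redacted).encode()}
    return {}  # metadata_only: all payload columns stay NULL
```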
## Stage 1: Classification
Classification maps DLP scan findings to data class tags. These tags are stored on every audit event regardless of the protection mode, enabling compliance reporting without accessing payload content.
The classification stage reuses the DLP findings already produced by the proxy engine -- it does not re-scan the payload.
### Data class mapping

Behavry maps 24 DLP patterns to five data classes:

| Data class | DLP patterns |
|---|---|
| `CREDENTIAL` | `aws_access_key`, `openai_api_key`, `anthropic_api_key`, `jwt_token`, `stripe_api_key`, `pgp_private_key`, `azure_sas_token`, `gitlab_token`, `twilio_api_key`, `gcp_service_account`, `sendgrid_api_key`, `slack_webhook`, `discord_webhook`, `docker_auth`, `credential_assignment` |
| `FINANCIAL` | `credit_card`, `bank_account` |
| `GOVERNMENT_ID` | `ssn`, `passport`, `drivers_license` |
| `PII` | `email_address`, `phone_number`, `ip_address` |
| `PHI` | `health_record` |
The `data_classes` array is written to the `audit_events` table as a `TEXT[]` column, making it queryable for compliance dashboards and SIEM filters without payload access.
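The mapping lends itself to a simple lookup. A sketch with a subset of the 24 patterns (the `classify` function name and sorted-deduplication behavior are illustrative assumptions):

```python
# Subset of the pattern --> data class mapping from the table above.
DATA_CLASS_MAP = {
    "aws_access_key": "CREDENTIAL", "jwt_token": "CREDENTIAL",
    "credit_card": "FINANCIAL", "bank_account": "FINANCIAL",
    "ssn": "GOVERNMENT_ID", "passport": "GOVERNMENT_ID",
    "email_address": "PII", "phone_number": "PII", "ip_address": "PII",
    "health_record": "PHI",
}

def classify(dlp_findings: list[str]) -> list[str]:
    """Map DLP pattern names to a deduplicated, sorted data_classes array."""
    return sorted({DATA_CLASS_MAP[p] for p in dlp_findings if p in DATA_CLASS_MAP})
```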
## Stage 2: Redaction

When the protection mode is `redacted` or `encrypted`, the redaction engine processes both request and response payloads through three mechanisms.
### DLP span redaction

Matched DLP values in string fields are replaced with tokens:

```text
"Please send to john@acme.com"
  -->
"Please send to [REDACTED:email_address]"
```

This is controlled by the `redact_dlp_matches` policy flag (default: `true`).
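In spirit, span redaction is a substitution over matched regions. A minimal sketch, with a stand-in regex for one pattern (`email_address`); the real engine reuses the proxy's DLP findings rather than re-matching:

```python
import re

# Stand-in pattern; Behavry's actual email_address pattern may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_spans(text: str) -> str:
    """Replace every matched span with its redaction token."""
    return EMAIL_RE.sub("[REDACTED:email_address]", text)
```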
### JSON field path redaction

Administrators can specify JSON paths to always redact, regardless of DLP matches:

```json
{
  "redact_fields": [
    "$.messages[*].content",
    "$.auth.password",
    "$.response.body"
  ]
}
```
Paths support dot notation and array wildcards (`[*]`). Matched fields are replaced with `[REDACTED:path]`.
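A recursive walk over dot-notation paths with `[*]` wildcards might look like this. This is a sketch under assumptions: Behavry's path grammar and the exact redaction token format (here, the full path string) may differ.

```python
def redact_path(obj, path: str) -> None:
    """Redact a dot-notation path with [*] array wildcards, in place."""
    parts = path.lstrip("$.").replace("[*]", ".[*]").split(".")
    _walk(obj, [p for p in parts if p], path)

def _walk(node, parts, path):
    if not parts:
        return
    head, rest = parts[0], parts[1:]
    if head == "[*]" and isinstance(node, list):
        for i in range(len(node)):
            if rest:
                _walk(node[i], rest, path)
            else:
                node[i] = f"[REDACTED:{path}]"
    elif isinstance(node, dict) and head in node:
        if rest:
            _walk(node[head], rest, path)
        else:
            node[head] = f"[REDACTED:{path}]"
```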
### Identifier pseudonymization

When `hash_identifiers` is enabled with an `identifier_salt`, real identifiers (agent IDs, session IDs, requester IDs) are replaced with deterministic HMAC-SHA256 pseudonyms:

```text
agent_id: "agent-abc-123"
  -->
agent_id: "pseudo_7a3b9f2e1c4d8a6b"
```
Pseudonymization is deterministic -- the same input with the same salt always produces the same pseudonym, preserving correlation across events without exposing real identifiers.
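The deterministic construction is standard HMAC-SHA256. A sketch; the `pseudo_` prefix and 16-hex-character truncation match the example above but are assumptions about the exact output format:

```python
import hashlib
import hmac

def pseudonymize(identifier: str, salt: str) -> str:
    """Deterministic pseudonym: HMAC-SHA256(salt, identifier), truncated."""
    digest = hmac.new(salt.encode(), identifier.encode(), hashlib.sha256).hexdigest()
    return "pseudo_" + digest[:16]
```

Because HMAC is keyed by the salt, pseudonyms cannot be reversed or precomputed without it, yet equal inputs always correlate across events.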
## Stage 3: Disposition

Disposition determines what payload content is persisted to the database:

- `full` -- raw request and response bodies are stored in `request_body` and `response_body` JSONB columns.
- `metadata_only` -- all payload columns are set to `NULL`. Only metadata (action, target, policy result, data classes, timestamps) is retained.
- `redacted` -- the redacted payload is stored in the `payload_redacted` JSONB column. Original payloads are not stored.
- `encrypted` -- the redacted payload is serialized to JSON, encrypted via KMS, and stored in the `payload_encrypted` BYTEA column.
## Stage 4: Envelope encryption

When `payload_mode` is `encrypted`, the pipeline encrypts the redacted payload using envelope encryption before writing the audit record.
### KMS providers

Behavry supports pluggable KMS providers via the `KMSClient` protocol:

```python
class KMSClient(Protocol):
    async def encrypt(self, plaintext: bytes, context: dict) -> bytes: ...
    async def decrypt(self, ciphertext: bytes, context: dict) -> bytes: ...
    async def health_check(self) -> bool: ...
```
#### LocalKMSClient

For self-hosted deployments, the local provider uses AES-256-GCM with a customer-provided key.

- Algorithm: AES-256-GCM with 96-bit random nonce
- Key source: `BEHAVRY_LOCAL_ENCRYPTION_KEY` environment variable (base64-encoded, 32 bytes)
- Storage format: `[12-byte nonce][ciphertext + GCM tag]`

```shell
# Generate a key
openssl rand -base64 32

# Set in environment
export BEHAVRY_LOCAL_ENCRYPTION_KEY="<base64-encoded-32-byte-key>"
```
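Key validation and the blob layout above can be sketched without a crypto library (the AES-GCM operation itself would come from a library such as `cryptography`, omitted here; function names are illustrative):

```python
import base64

def load_key(env_value: str) -> bytes:
    """Decode and validate the base64 key from BEHAVRY_LOCAL_ENCRYPTION_KEY."""
    key = base64.b64decode(env_value)
    if len(key) != 32:
        raise ValueError("key must decode to exactly 32 bytes (AES-256)")
    return key

def split_blob(blob: bytes) -> tuple[bytes, bytes]:
    """Split a stored blob into (12-byte nonce, ciphertext + GCM tag)."""
    return blob[:12], blob[12:]
```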
#### AWSKMSClient

For AWS-hosted deployments, the AWS provider uses envelope encryption with AWS KMS.

- Algorithm: AWS KMS generates a unique AES-256 data key per encryption operation; the data key encrypts the payload with AES-256-GCM; the encrypted data key is stored alongside the ciphertext
- Key source: AWS KMS key ARN configured via `kms_key_id`
- Storage format: `[4-byte key length][encrypted data key][12-byte nonce][ciphertext + GCM tag]`
- Dependency: `boto3` (`pip install boto3`)
The envelope encryption model means the KMS master key never leaves AWS, and each audit event uses a unique data key.
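Serializing and parsing the envelope format above is a fixed-layout pack/unpack. A sketch; the big-endian length prefix and function names are assumptions, not Behavry's confirmed wire format:

```python
import struct

def pack_envelope(encrypted_key: bytes, nonce: bytes, ciphertext: bytes) -> bytes:
    """[4-byte key length][encrypted data key][12-byte nonce][ciphertext + tag]"""
    return struct.pack(">I", len(encrypted_key)) + encrypted_key + nonce + ciphertext

def unpack_envelope(blob: bytes) -> tuple[bytes, bytes, bytes]:
    """Inverse of pack_envelope: returns (encrypted_key, nonce, ciphertext)."""
    (key_len,) = struct.unpack(">I", blob[:4])
    encrypted_key = blob[4:4 + key_len]
    nonce = blob[4 + key_len:16 + key_len]
    return encrypted_key, nonce, blob[16 + key_len:]
```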
## Retention and purge

The retention system enforces time-based payload expiration. When `payload_retention_days` is configured, a nightly purge task sets all payload columns to `NULL` for audit events older than the retention window.
The purge preserves:
- All metadata (action, target, agent, session, timestamps)
- Policy decisions and risk scores
- Data class tags
- The audit hash chain (integrity is unaffected)
The `payload_purged_at` timestamp records when the purge occurred, providing an auditable trail of data lifecycle management.
### Retention status API

```http
GET /api/v1/audit/retention-status
Authorization: Bearer <admin-jwt>
```

Response:

```json
{
  "events_with_payload": 14302,
  "purged_last_24h": 891
}
```
## Payload decryption

Encrypted payloads can be decrypted on demand through a dedicated, audit-logged endpoint.

```http
POST /api/v1/audit/events/{event_id}/decrypt
Authorization: Bearer <admin-jwt>
```

Response:

```json
{
  "event_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "decrypted": {
    "request": { "tool": "read_file", "path": "/etc/config" },
    "response": { "content": "[REDACTED:credential_assignment]" }
  }
}
```
Every decryption attempt -- successful or failed -- writes an immutable `payload_decrypt` audit record. This record is always stored in `metadata_only` mode to prevent recursive payload capture. The audit log includes the admin username, target event ID, and success/failure status.
If the event is not encrypted, the endpoint returns a 400 error; if the KMS provider is unavailable, it returns a 500 error.
## Configuration API

### Get current policy

```http
GET /api/v1/admin/data-protection
Authorization: Bearer <admin-jwt>
```

Response:

```json
{
  "payload_mode": "redacted",
  "redact_dlp_matches": true,
  "redact_fields": ["$.messages[*].content"],
  "hash_identifiers": false,
  "encryption_enabled": false,
  "kms_provider": null,
  "kms_key_id_suffix": null,
  "payload_retention_days": 90,
  "strip_payload_from_stream": true
}
```
The `kms_key_id_suffix` field returns only the last 8 characters of the KMS key ID. The full key ID is never exposed via the API.
### Update policy

```http
PATCH /api/v1/admin/data-protection
Authorization: Bearer <admin-jwt>
Content-Type: application/json
```

```json
{
  "payload_mode": "encrypted",
  "encryption_enabled": true,
  "kms_provider": "local",
  "payload_retention_days": 30,
  "test_kms_connectivity": true
}
```
Set `test_kms_connectivity` to `true` to verify KMS health before saving. The request fails with 400 if the KMS health check does not pass, preventing misconfiguration.
Validation rules:

- `encryption_enabled: true` requires `kms_provider` to be set.
- `payload_mode` must be one of: `full`, `metadata_only`, `redacted`, `encrypted`.
- `kms_provider` must be one of: `local`, `aws`, `azure`, `gcp` (Azure and GCP are not yet implemented).
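The validation rules above amount to a few simple checks. A sketch (the function name and error strings are illustrative, not Behavry's actual error responses):

```python
VALID_MODES = {"full", "metadata_only", "redacted", "encrypted"}
VALID_KMS_PROVIDERS = {"local", "aws", "azure", "gcp"}

def validate_policy(update: dict) -> list[str]:
    """Return the list of validation errors for a PATCH body."""
    errors = []
    if update.get("encryption_enabled") and not update.get("kms_provider"):
        errors.append("encryption_enabled: true requires kms_provider")
    if "payload_mode" in update and update["payload_mode"] not in VALID_MODES:
        errors.append("invalid payload_mode")
    if update.get("kms_provider") and update["kms_provider"] not in VALID_KMS_PROVIDERS:
        errors.append("invalid kms_provider")
    return errors
```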
## Policy model reference

| Field | Type | Default | Description |
|---|---|---|---|
| `payload_mode` | string | `"full"` | Protection mode: `full`, `metadata_only`, `redacted`, `encrypted` |
| `redact_dlp_matches` | boolean | `true` | Replace DLP-matched spans with redaction tokens |
| `redact_fields` | string[] | `[]` | JSON paths to always redact (e.g., `$.auth.password`) |
| `hash_identifiers` | boolean | `false` | Pseudonymize agent/session/requester IDs |
| `identifier_salt` | string | `null` | HMAC salt for pseudonymization (required when `hash_identifiers` is `true`) |
| `encryption_enabled` | boolean | `false` | Enable KMS envelope encryption |
| `kms_provider` | string | `null` | KMS provider: `local`, `aws`, `azure`, `gcp` |
| `kms_key_id` | string | `null` | KMS key identifier (ARN for AWS, ignored for `local`) |
| `payload_retention_days` | integer | `null` | Days before payload purge (`null` = retain indefinitely) |
| `strip_payload_from_stream` | boolean | `true` | Exclude payload fields from SSE dashboard stream |
## Audit event columns

The data protection pipeline writes to eight dedicated columns on the `audit_events` table:

| Column | Type | Description |
|---|---|---|
| `request_body` | JSONB | Raw or redacted request payload (`full` mode only) |
| `response_body` | JSONB | Raw or redacted response payload (`full` mode only) |
| `payload_redacted` | JSONB | Combined redacted request + response (`redacted` mode) |
| `payload_encrypted` | BYTEA | Encrypted payload blob (`encrypted` mode) |
| `encryption_key_id` | TEXT | KMS key identifier used for encryption |
| `data_classes` | TEXT[] | Detected data classes (e.g., `["CREDENTIAL", "PII"]`) |
| `payload_purged_at` | TIMESTAMP | When the retention purge nulled the payload |
| `dp_mode` | TEXT | Protection mode applied: `full`, `metadata_only`, `redacted`, `encrypted` |
## Isolation guarantees

The data protection pipeline enforces two isolation boundaries to prevent sensitive content from leaking through side channels:

1. Event bus isolation -- the `raw_payload` field is removed from events published to the internal event bus. The behavioral monitor, drift detector, and all other subscribers never see payload content.
2. SSE stream isolation -- when `strip_payload_from_stream` is `true` (the default), the `to_sse()` serializer excludes payload fields from the real-time dashboard stream. Administrators see metadata, policy decisions, and data class tags, but not payload content.

These boundaries ensure that even in `full` mode, payload data is only accessible through direct database queries or the decryption API -- never through real-time monitoring channels.
## Compliance mapping
| Framework | Requirement | Behavry capability |
|---|---|---|
| GDPR Art. 32 | Encryption of personal data | encrypted mode with KMS envelope encryption; redacted mode with DLP-based pseudonymization |
| HIPAA 164.312(a)(2)(iv) | Encryption and decryption of ePHI | AES-256-GCM encryption; audit-logged decryption with admin authentication |
| HIPAA 164.312(b) | Audit controls | Immutable audit trail with hash chain; decryption attempts logged |
| SOC 2 CC6.1 | Logical and physical access controls | Data classification tags on every event; role-based decryption access; KMS key separation |
| SOC 2 CC6.7 | Restriction of data in transmission | SSE stream excludes payload fields; event bus strips raw content |
| GDPR Art. 17 | Right to erasure | Configurable retention purge with auditable payload_purged_at timestamp |