Observability

Logging

Format

The backend emits structured JSON logs to stdout:

{"time": "2025-02-15 14:23:11,456", "level": "INFO", "name": "behavry.proxy.engine", "message": "OPA decision for agent=a1b2c3d4 tool=read_file action=read resource=/home/projects/report.pdf → allow (filesystem.read)"}
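
Because each line is a self-contained JSON object, decision logs can be filtered with a short script. The sketch below parses the sample line above. The regular expression is inferred from that single example, so the exact message layout and the `parse_decision` helper are assumptions, not a documented contract.

```python
import json
import re
from typing import Optional

# Layout inferred from the one sample line above (assumption, not a spec).
DECISION_RE = re.compile(
    r"OPA decision for agent=(?P<agent>\S+) tool=(?P<tool>\S+) "
    r"action=(?P<action>\S+) resource=(?P<resource>\S+) "
    r"→ (?P<result>\w+)(?: \((?P<rule>[^)]*)\))?"
)


def parse_decision(line: str) -> Optional[dict]:
    """Return the decision fields from one JSON log line, or None if it is not a decision."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # not a JSON log line
    if not isinstance(record, dict):
        return None
    match = DECISION_RE.search(str(record.get("message", "")))
    if match is None:
        return None  # some other log message
    return {"time": record.get("time"), "logger": record.get("name"), **match.groupdict()}


sample = (
    '{"time": "2025-02-15 14:23:11,456", "level": "INFO", '
    '"name": "behavry.proxy.engine", "message": "OPA decision for agent=a1b2c3d4 '
    'tool=read_file action=read resource=/home/projects/report.pdf → allow (filesystem.read)"}'
)
print(parse_decision(sample))
```

Note that this pattern assumes field values contain no whitespace; a resource path with spaces would not match and would need a looser expression.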

Log Levels by Logger

Logger	Level	What's Logged
`behavry`	DEBUG	All application logs
`behavry.proxy.engine`	INFO	Every OPA decision (agent, tool, action, resource, result)
`behavry.policy.opa_client`	INFO	OPA request/response timing
`behavry.monitor.service`	INFO	Anomaly detections, baseline updates
`behavry.audit.service`	WARNING (on error)	Audit write failures
`uvicorn`	INFO	HTTP request logs
`sqlalchemy.engine`	WARNING	Only SQL errors (not queries)

Key Log Events to Watch

Message Pattern	Meaning
`OPA decision ... → deny`	Tool call blocked
`OPA decision ... → escalate`	Tool call queued for human review
`DLP findings for agent=...`	Sensitive data detected in tool inputs
`Baseline cache updated for agent`	Agent baseline approved/refreshed
`Prompt drift detected for agent`	System prompt hash mismatch
`Failed to write audit event`	Audit pipeline error (DB issue)
`Base policy push failed`	OPA unreachable at startup
`Behavry ready`	App started successfully
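
The patterns in this table can be turned into a small classifier for log shipping or ad hoc triage. This is a minimal sketch: the regular expressions are copied from the table (with `...` read as "any text"), while the label names and the `classify` and `count_events` helpers are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# (pattern, label) pairs; the labels are illustrative names, not Behavry identifiers.
EVENT_PATTERNS = [
    (re.compile(r"OPA decision .*→ deny"), "tool_call_blocked"),
    (re.compile(r"OPA decision .*→ escalate"), "tool_call_escalated"),
    (re.compile(r"DLP findings for agent="), "dlp_finding"),
    (re.compile(r"Baseline cache updated for agent"), "baseline_updated"),
    (re.compile(r"Prompt drift detected for agent"), "prompt_drift"),
    (re.compile(r"Failed to write audit event"), "audit_write_failure"),
    (re.compile(r"Base policy push failed"), "opa_unreachable_at_startup"),
    (re.compile(r"Behavry ready"), "app_started"),
]


def classify(message: str) -> Optional[str]:
    """Return the label of the first matching pattern, or None for unlisted messages."""
    for pattern, label in EVENT_PATTERNS:
        if pattern.search(message):
            return label
    return None


def count_events(messages: Iterable[str]) -> Counter:
    """Count how often each watched event occurs in a batch of log messages."""
    return Counter(label for label in map(classify, messages) if label is not None)
```

Feeding `count_events` the `message` field of each JSON log line gives a quick tally of denials, escalations, and pipeline errors for a time window.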

Enabling Debug Logging

BEHAVRY_DEBUG=true

This sets the root logger to DEBUG. Warning: the output is very verbose and includes all SQL queries and HTTP request bodies.

SSE Live Feed

The audit stream (GET /api/v1/audit/stream) is the real-time observability surface for the dashboard. Event types:

`event_type`	Meaning
`tool_call`	tool call allowed through
`policy_deny`	tool call denied
`policy_escalate`	tool call held for escalation
`escalation`	escalation created
`alert`	behavioral alert raised
`system_notification`	system message (e.g., restart countdown)

Keep-alive comments (`:keepalive`) are sent every `BEHAVRY_SSE_KEEPALIVE_SECONDS` seconds (default: 15) so that intermediate proxies do not time out idle connections.
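
A client consuming this stream has to skip the keep-alive comment lines and assemble `data:` lines into events. The sketch below shows that parsing step on already-decoded text lines. It assumes each event's `data:` field carries a JSON object with an `event_type` key; the payload shape is not specified here, so treat that and the `iter_sse_events` helper as assumptions.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded payload per SSE event, ignoring comment lines such as ':keepalive'."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # SSE comment line (keep-alive); carries no event data
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # Per the SSE format, a single leading space after the colon is dropped.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event, id, retry) are ignored in this sketch.


stream = [
    ":keepalive\n",
    'data: {"event_type": "policy_deny"}\n',
    "\n",
    ":keepalive\n",
    'data: {"event_type": "alert"}\n',
    "\n",
]
for event in iter_sse_events(stream):
    print(event["event_type"])
```

In practice the lines would come from a streaming HTTP response against `GET /api/v1/audit/stream` instead of an in-memory list.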

Metrics (Not Yet Implemented)

The Prometheus metrics endpoint is not yet implemented. The following metrics are planned at /metrics:

Metric	Type	Labels
`behavry_tool_calls_total`	Counter	`agent_id`, `policy_result`, `mcp_server`
`behavry_opa_decision_duration_seconds`	Histogram	`mcp_server`
`behavry_audit_write_duration_seconds`	Histogram	—
`behavry_escalations_pending`	Gauge	—
`behavry_alerts_open_total`	Gauge	`severity`
`behavry_opa_errors_total`	Counter	—

In the interim, derive metrics from the audit log via SQL or the REST API.
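
As one way to do this, the planned `behavry_tool_calls_total` counter can be approximated by grouping audit rows on its labels. The sketch below works on rows already fetched as dictionaries. The `tool_calls_total` helper is hypothetical, and the presence of an `mcp_server` field on audit rows is an assumption (it is listed above only as a planned metric label), so missing values fall back to `"unknown"`.

```python
from collections import Counter
from typing import Iterable, Mapping


def tool_calls_total(rows: Iterable[Mapping]) -> Counter:
    """Count audit rows per (agent_id, policy_result, mcp_server) label combination."""
    counts = Counter()
    for row in rows:
        key = (
            row["agent_id"],
            row["policy_result"],
            row.get("mcp_server", "unknown"),  # assumed field; may not exist on audit rows
        )
        counts[key] += 1
    return counts


rows = [
    {"agent_id": "a1", "policy_result": "allow", "mcp_server": "filesystem"},
    {"agent_id": "a1", "policy_result": "allow", "mcp_server": "filesystem"},
    {"agent_id": "a1", "policy_result": "deny", "mcp_server": "filesystem"},
    {"agent_id": "b2", "policy_result": "escalate"},
]
for labels, value in sorted(tool_calls_total(rows).items()):
    print(labels, value)
```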

Key Queries for Operational Monitoring

Denial rate in last hour

SELECT
    COUNT(*) FILTER (WHERE policy_result = 'deny') AS denials,
    COUNT(*) AS total,
    ROUND(100.0 * COUNT(*) FILTER (WHERE policy_result = 'deny') / NULLIF(COUNT(*), 0), 1) AS deny_pct
FROM audit_events
WHERE timestamp > NOW() - INTERVAL '1 hour';

Top denied agents

SELECT agent_id, COUNT(*) AS deny_count
FROM audit_events
WHERE policy_result = 'deny'
  AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY agent_id
ORDER BY deny_count DESC
LIMIT 10;

OPA decision latency distribution

SELECT
    percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) AS p50,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
    percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM audit_events
WHERE timestamp > NOW() - INTERVAL '1 hour'
  AND policy_result IS NOT NULL;
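
PostgreSQL's `percentile_cont` interpolates linearly between the two nearest values, so the reported p95 or p99 may be a latency that never occurred exactly. The following sketch reproduces that interpolation in plain Python for reference; the local `percentile_cont` function is only an illustration and is unrelated to how Behavry computes anything.

```python
from typing import Optional, Sequence


def percentile_cont(values: Sequence[float], fraction: float) -> Optional[float]:
    """Continuous percentile with linear interpolation, like PostgreSQL's percentile_cont."""
    if not values:
        return None  # the SQL aggregate returns NULL for an empty input
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # fractional index into the sorted list
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


latencies_ms = [10, 20, 30, 40]
print(percentile_cont(latencies_ms, 0.5))  # → 25.0
```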

Escalations approaching timeout (10 minutes or less remaining)

SELECT id, agent_id, tool_name, target, timestamp, timeout_at
FROM escalations
WHERE status = 'pending'
  AND timeout_at < NOW() + INTERVAL '10 minutes'
ORDER BY timeout_at ASC;

DLP findings by severity (last 24h)

SELECT
    finding->>'severity' AS severity,
    COUNT(*) AS occurrences
FROM audit_events,
     jsonb_array_elements(dlp_findings) AS finding
WHERE dlp_findings IS NOT NULL
  AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY finding->>'severity'
ORDER BY occurrences DESC;
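
The query above expands each row's `dlp_findings` array into one row per finding before grouping. The same logic in Python, as a sketch over rows already loaded as dictionaries, looks like this; the `findings_by_severity` helper and the sample severities are illustrative only.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def findings_by_severity(rows: Iterable[Mapping]) -> List[Tuple[str, int]]:
    """Flatten each row's dlp_findings list and count findings per severity, most frequent first."""
    counts = Counter()
    for row in rows:
        # Rows without findings (None or missing) contribute nothing, like the IS NOT NULL filter.
        for finding in row.get("dlp_findings") or []:
            counts[finding.get("severity")] += 1
    return counts.most_common()


rows = [
    {"dlp_findings": [{"severity": "high"}, {"severity": "low"}]},
    {"dlp_findings": None},
    {"dlp_findings": [{"severity": "high"}]},
]
print(findings_by_severity(rows))  # → [('high', 2), ('low', 1)]
```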

TimescaleDB Performance

Chunk Status

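-- Show chunk sizes and compression state, newest chunk first.
-- chunk_compression_stats() has no range_start column, so join the chunks view for ordering.
SELECT s.chunk_name,
       pg_size_pretty(s.before_compression_total_bytes) AS before,
       pg_size_pretty(s.after_compression_total_bytes) AS after,
       s.compression_status
FROM chunk_compression_stats('audit_events') AS s
JOIN timescaledb_information.chunks AS c
  ON c.chunk_schema = s.chunk_schema AND c.chunk_name = s.chunk_name
ORDER BY c.range_start DESC;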

Compression Policy

-- Enable compression (recommended for production)
ALTER TABLE audit_events SET (timescaledb.compress, timescaledb.compress_segmentby = 'agent_id');
SELECT add_compression_policy('audit_events', INTERVAL '7 days');

-- Check policy jobs (compression policies run the policy_compression procedure)
SELECT * FROM timescaledb_information.jobs WHERE proc_name = 'policy_compression';

Alerting Recommendations (Production)

Configure these alerts in your monitoring system:

Alert	Condition	Severity
OPA unreachable	OPA health check failing	Critical
High denial rate	> 20% denials in 5 min	Warning
Audit write failures	Any `Failed to write audit event` logs	Critical
DB connection pool exhausted	Pool wait time > 1s	Warning
Pending escalations	Any escalation pending > 20 min	Warning
No audit events	0 events in 5 min during business hours	Warning
Disk space	DB volume > 80% capacity	Warning
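
Two of these conditions can be evaluated directly from audit and escalation data. The sketch below shows the threshold logic only, assuming the caller has already fetched the last five minutes of `policy_result` values and the escalation rows (with `status` and a timezone-aware `timestamp`). The function names are hypothetical and the defaults mirror the table above.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Mapping, Sequence


def high_denial_rate(results: Sequence[str], threshold_pct: float = 20.0) -> bool:
    """True when more than threshold_pct of the given policy results are 'deny'."""
    if not results:
        return False  # no traffic: covered by the separate "No audit events" alert
    denials = sum(1 for result in results if result == "deny")
    return 100.0 * denials / len(results) > threshold_pct


def stale_escalations(
    escalations: Iterable[Mapping],
    now: datetime,
    max_age: timedelta = timedelta(minutes=20),
) -> List[Mapping]:
    """Return pending escalations created more than max_age before now."""
    return [
        esc for esc in escalations
        if esc["status"] == "pending" and now - esc["timestamp"] > max_age
    ]


now = datetime(2025, 2, 15, 15, 0, tzinfo=timezone.utc)
escalations = [
    {"id": 1, "status": "pending", "timestamp": now - timedelta(minutes=25)},
    {"id": 2, "status": "pending", "timestamp": now - timedelta(minutes=5)},
    {"id": 3, "status": "approved", "timestamp": now - timedelta(minutes=60)},
]
print(high_denial_rate(["deny"] * 3 + ["allow"] * 7))
print([esc["id"] for esc in stale_escalations(escalations, now)])
```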

Tracing (Not Yet Implemented)

OpenTelemetry tracing is not yet implemented. When added, each proxy request will carry a trace_id through:

JWT validation
Session check
DLP scan
OPA evaluation
Backend forwarding
Audit write

This will enable end-to-end latency attribution per enforcement stage.
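
To illustrate what per-stage attribution means, the sketch below times each stage under a shared `trace_id` using only the Python standard library. It is not OpenTelemetry and not Behavry code: the `RequestTrace` class, the stage names, and the placeholder stage bodies are all hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

# Stage names derived from the list above (illustrative identifiers).
STAGES = [
    "jwt_validation",
    "session_check",
    "dlp_scan",
    "opa_evaluation",
    "backend_forwarding",
    "audit_write",
]


class RequestTrace:
    """Collects wall-clock duration per enforcement stage for one proxy request."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.durations = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failed requests are still attributed.
            self.durations[name] = time.perf_counter() - start


trace = RequestTrace()
for name in STAGES:
    with trace.stage(name):
        pass  # placeholder for the real stage work

slowest = max(trace.durations, key=trace.durations.get)
print(trace.trace_id, slowest)
```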
