Observability
Logging
Format
The backend emits structured JSON logs to stdout:
{"time": "2025-02-15 14:23:11,456", "level": "INFO", "name": "behavry.proxy.engine", "message": "OPA decision for agent=a1b2c3d4 tool=read_file action=read resource=/home/projects/report.pdf → allow (filesystem.read)"}
Log Levels by Logger
| Logger | Level | What's Logged |
|---|---|---|
| behavry | DEBUG | All application logs |
| behavry.proxy.engine | INFO | Every OPA decision (agent, tool, action, resource, result) |
| behavry.policy.opa_client | INFO | OPA request/response timing |
| behavry.monitor.service | INFO | Anomaly detections, baseline updates |
| behavry.audit.service | WARNING (on error) | Audit write failures |
| uvicorn | INFO | HTTP request logs |
| sqlalchemy.engine | WARNING | Only SQL errors (not queries) |
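The per-logger levels above correspond to a standard `logging` setup. A minimal sketch — the level assignments mirror the table, but the configuration code itself is illustrative, not the backend's actual startup code:

```python
import logging

# Mirror the table: application loggers at their documented levels,
# noisy third-party loggers clamped down.
LEVELS = {
    "behavry": logging.DEBUG,
    "behavry.proxy.engine": logging.INFO,
    "behavry.policy.opa_client": logging.INFO,
    "behavry.monitor.service": logging.WARNING,  # per table: WARNING only on error paths
    "uvicorn": logging.INFO,
    "sqlalchemy.engine": logging.WARNING,
}

for name, level in LEVELS.items():
    logging.getLogger(name).setLevel(level)
```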
Key Log Events to Watch
| Message Pattern | Meaning |
|---|---|
| OPA decision ... → deny | Tool call blocked |
| OPA decision ... → escalate | Tool call queued for human review |
| DLP findings for agent=... | Sensitive data detected in tool inputs |
| Baseline cache updated for agent | Agent baseline approved/refreshed |
| Prompt drift detected for agent | System prompt hash mismatch |
| Failed to write audit event | Audit pipeline error (DB issue) |
| Base policy push failed | OPA unreachable at startup |
| Behavry ready | App started successfully |
Enabling Debug Logging
BEHAVRY_DEBUG=true
This sets DEBUG level on the root logger. Warning: very verbose — includes all SQL queries and HTTP request bodies.
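A sketch of how such a flag typically maps onto the root logger — the environment-variable handling here is illustrative, not the backend's actual startup code:

```python
import logging
import os

def configure_logging() -> None:
    # BEHAVRY_DEBUG=true drops the root logger to DEBUG; anything else keeps INFO.
    debug = os.getenv("BEHAVRY_DEBUG", "").lower() == "true"
    logging.getLogger().setLevel(logging.DEBUG if debug else logging.INFO)

os.environ["BEHAVRY_DEBUG"] = "true"
configure_logging()
```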
SSE Live Feed
The audit stream (GET /api/v1/audit/stream) is the real-time observability surface for the dashboard. Event types:
| event_type | Payload |
|---|---|
| tool_call | Tool call allowed through |
| policy_deny | Tool call denied |
| policy_escalate | Tool call held for escalation |
| escalation | Escalation created |
| alert | Behavioral alert raised |
| system_notification | System message (e.g., restart countdown) |
Keep-alive comments (:keepalive) are sent every BEHAVRY_SSE_KEEPALIVE_SECONDS (default: 15s) to prevent proxy timeouts.
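A consumer must skip SSE comment lines (those starting with `:`), such as the keep-alives. A minimal parser sketch over a raw SSE chunk — the sample payloads are illustrative:

```python
def parse_sse(raw: str) -> list:
    """Return (event_type, data) pairs, ignoring comment/keep-alive lines."""
    events, event_type, data_lines = [], None, []
    for line in raw.splitlines():
        if line.startswith(":"):        # comment, e.g. ":keepalive"
            continue
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and (event_type or data_lines):
            # Blank line terminates an event.
            events.append((event_type, "\n".join(data_lines)))
            event_type, data_lines = None, []
    return events

stream = (
    "event: tool_call\n"
    'data: {"agent_id": "a1b2c3d4"}\n'
    "\n"
    ":keepalive\n"
    "\n"
    "event: policy_deny\n"
    'data: {"agent_id": "a1b2c3d4"}\n'
    "\n"
)
events = parse_sse(stream)
```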
Metrics (Not Yet Implemented)
A Prometheus metrics endpoint is not yet implemented. The plan is to expose the following at /metrics:
| Metric | Type | Labels |
|---|---|---|
| behavry_tool_calls_total | Counter | agent_id, policy_result, mcp_server |
| behavry_opa_decision_duration_seconds | Histogram | mcp_server |
| behavry_audit_write_duration_seconds | Histogram | — |
| behavry_escalations_pending | Gauge | — |
| behavry_alerts_open_total | Gauge | severity |
| behavry_opa_errors_total | Counter | — |
In the interim, derive metrics from the audit log via SQL or the REST API.
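Until the /metrics endpoint exists, the planned behavry_tool_calls_total counter can be approximated from audit events fetched over the REST API. A sketch — the event field names are assumptions based on the planned label names above:

```python
from collections import Counter

# Sample audit events as they might come back from the REST API
# (field names assumed to match the planned metric labels).
events = [
    {"agent_id": "a1b2c3d4", "policy_result": "allow", "mcp_server": "filesystem"},
    {"agent_id": "a1b2c3d4", "policy_result": "deny", "mcp_server": "filesystem"},
    {"agent_id": "e5f6a7b8", "policy_result": "allow", "mcp_server": "github"},
]

# One counter bucket per (agent_id, policy_result, mcp_server) label set.
tool_calls_total = Counter(
    (e["agent_id"], e["policy_result"], e["mcp_server"]) for e in events
)
```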
Key Queries for Operational Monitoring
Denial rate in last hour
SELECT
COUNT(*) FILTER (WHERE policy_result = 'deny') AS denials,
COUNT(*) AS total,
ROUND(100.0 * COUNT(*) FILTER (WHERE policy_result = 'deny') / NULLIF(COUNT(*), 0), 1) AS deny_pct
FROM audit_events
WHERE timestamp > NOW() - INTERVAL '1 hour';
Top denied agents
SELECT agent_id, COUNT(*) AS deny_count
FROM audit_events
WHERE policy_result = 'deny'
AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY agent_id
ORDER BY deny_count DESC
LIMIT 10;
OPA decision latency distribution
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) AS p50,
percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM audit_events
WHERE timestamp > NOW() - INTERVAL '1 hour'
AND policy_result IS NOT NULL;
Escalations older than 20 minutes (approaching timeout)
SELECT id, agent_id, tool_name, target, timestamp, timeout_at
FROM escalations
WHERE status = 'pending'
AND timeout_at < NOW() + INTERVAL '10 minutes'
ORDER BY timeout_at ASC;
DLP findings by severity (last 24h)
SELECT
finding->>'severity' AS severity,
COUNT(*) AS occurrences
FROM audit_events,
jsonb_array_elements(dlp_findings) AS finding
WHERE dlp_findings IS NOT NULL
AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY finding->>'severity'
ORDER BY occurrences DESC;
TimescaleDB Performance
Chunk Status
-- Show chunk sizes and compression state
SELECT chunk_name,
pg_size_pretty(before_compression_total_bytes) AS before,
pg_size_pretty(after_compression_total_bytes) AS after,
compression_status
FROM chunk_compression_stats('audit_events')
ORDER BY range_start DESC;
Compression Policy
-- Enable compression (recommended for production)
ALTER TABLE audit_events SET (timescaledb.compress, timescaledb.compress_segmentby = 'agent_id');
SELECT add_compression_policy('audit_events', INTERVAL '7 days');
-- Check policy jobs
SELECT * FROM timescaledb_information.jobs WHERE application_name LIKE '%compress%';
Alerting Recommendations (Production)
Configure these alerts in your monitoring system:
| Alert | Condition | Severity |
|---|---|---|
| OPA unreachable | OPA health check failing | Critical |
| High denial rate | > 20% denials in 5 min | Warning |
| Audit write failures | Any "Failed to write audit event" log lines | Critical |
| DB connection pool exhausted | Pool wait time > 1s | Warning |
| Pending escalations | Any escalation pending > 20 min | Warning |
| No audit events | 0 events in 5 min during business hours | Warning |
| Disk space | DB volume > 80% capacity | Warning |
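The "High denial rate" row, for example, reduces to a simple threshold check once you have counts for the window. An illustrative evaluator — the threshold and window come from the table, but the function itself is not part of the backend:

```python
def high_denial_rate(denials: int, total: int, threshold_pct: float = 20.0) -> bool:
    """Fire when denials exceed threshold_pct of tool calls in the 5-minute window."""
    if total == 0:
        # No traffic at all is covered by the separate "No audit events" alert.
        return False
    return 100.0 * denials / total > threshold_pct
```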
Tracing (Not Yet Implemented)
OpenTelemetry tracing is not yet implemented. When added, each proxy request will carry a trace_id through:
- JWT validation
- Session check
- DLP scan
- OPA evaluation
- Backend forwarding
- Audit write
This will enable end-to-end latency attribution per enforcement stage.
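Until tracing lands, per-stage latency can be approximated with manual timing around each enforcement stage. A stdlib-only sketch — the stage names come from the list above, but the timing helper is illustrative, not OpenTelemetry:

```python
import time
from contextlib import contextmanager

stage_durations = {}

@contextmanager
def stage(name: str):
    # Record wall-clock duration per enforcement stage, keyed by stage name.
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_durations[name] = time.perf_counter() - start

with stage("dlp_scan"):
    time.sleep(0.01)   # stand-in for the real DLP scan
with stage("opa_evaluation"):
    time.sleep(0.01)   # stand-in for the real OPA call
```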