Observability

Logging

Format

The backend emits structured JSON logs to stdout:

{"time": "2025-02-15 14:23:11,456", "level": "INFO", "name": "behavry.proxy.engine", "message": "OPA decision for agent=a1b2c3d4 tool=read_file action=read resource=/home/projects/report.pdf → allow (filesystem.read)"}

Log Levels by Logger

| Logger | Level | What's Logged |
| --- | --- | --- |
| behavry | DEBUG | All application logs |
| behavry.proxy.engine | INFO | Every OPA decision (agent, tool, action, resource, result) |
| behavry.policy.opa_client | INFO | OPA request/response timing |
| behavry.monitor.service | INFO | Anomaly detections, baseline updates |
| behavry.audit.service | WARNING (on error) | Audit write failures |
| uvicorn | INFO | HTTP request logs |
| sqlalchemy.engine | WARNING | Only SQL errors (not queries) |

Key Log Events to Watch

| Message Pattern | Meaning |
| --- | --- |
| OPA decision ... → deny | Tool call blocked |
| OPA decision ... → escalate | Tool call queued for human review |
| DLP findings for agent=... | Sensitive data detected in tool inputs |
| Baseline cache updated for agent | Agent baseline approved/refreshed |
| Prompt drift detected for agent | System prompt hash mismatch |
| Failed to write audit event | Audit pipeline error (DB issue) |
| Base policy push failed | OPA unreachable at startup |
| Behavry ready | App started successfully |
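One way to watch for these events is to classify each log message against the patterns and count the hits. This is a rough sketch: the regular expressions are derived from the message patterns listed above, and the labels and the function names `classify` and `count_events` are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# (label, pattern) pairs mirroring the message patterns in this section.
# The labels are illustrative names, not identifiers used by the backend.
PATTERNS = [
    ("tool_call_blocked", re.compile(r"OPA decision .*→ deny\b")),
    ("tool_call_escalated", re.compile(r"OPA decision .*→ escalate\b")),
    ("dlp_finding", re.compile(r"DLP findings for agent=")),
    ("baseline_updated", re.compile(r"Baseline cache updated for agent")),
    ("prompt_drift", re.compile(r"Prompt drift detected for agent")),
    ("audit_write_failed", re.compile(r"Failed to write audit event")),
    ("opa_unreachable", re.compile(r"Base policy push failed")),
    ("app_ready", re.compile(r"Behavry ready")),
]


def classify(message: str) -> Optional[str]:
    """Return the label of the first matching pattern, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(message):
            return label
    return None


def count_events(messages: Iterable[str]) -> Counter:
    """Count how many messages fall under each label; unmatched ones are skipped."""
    counts: Counter = Counter()
    for message in messages:
        label = classify(message)
        if label is not None:
            counts[label] += 1
    return counts
```

Feeding the `message` field of each JSON log line into `count_events` gives a quick per-event tally, for example to check whether `audit_write_failed` is non-zero.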

Enabling Debug Logging

BEHAVRY_DEBUG=true

This sets the root logger to DEBUG. Warning: the output is very verbose and includes all SQL queries and HTTP request bodies.
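For illustration, an environment flag like this is typically read once at startup and mapped to a root logger level. The sketch below is not the backend's actual startup code: the function name `configure_logging`, the accepted truthy spellings, and the INFO fallback are all assumptions.

```python
import logging
import os
from typing import Mapping, Optional


def configure_logging(env: Optional[Mapping[str, str]] = None) -> int:
    """Set the root logger level from BEHAVRY_DEBUG and return that level."""
    env = os.environ if env is None else env
    # Assumption: "1", "true" and "yes" (any case) all enable debug logging.
    flag = env.get("BEHAVRY_DEBUG", "").strip().lower()
    debug = flag in {"1", "true", "yes"}
    # Assumption: INFO is the level used when the flag is absent or false.
    level = logging.DEBUG if debug else logging.INFO
    logging.getLogger().setLevel(level)
    return level
```

Child loggers that have no explicit level inherit the root level, which is why a single root-level switch can make every logger verbose at once.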


SSE Live Feed

The audit stream (GET /api/v1/audit/stream) is the real-time observability surface for the dashboard. Event types:

| event_type | Payload |
| --- | --- |
| tool_call | tool call allowed through |
| policy_deny | tool call denied |
| policy_escalate | tool call held for escalation |
| escalation | escalation created |
| alert | behavioral alert raised |
| system_notification | system message (e.g., restart countdown) |

Keep-alive comments (:keepalive) are sent every BEHAVRY_SSE_KEEPALIVE_SECONDS (default: 15s) to prevent proxy timeouts.
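A client consuming the stream has to skip the keep-alive comment lines and assemble the remaining fields into events. The sketch below parses an iterable of already-decoded text lines according to the general Server-Sent Events framing. It assumes the event type travels in the SSE `event:` field and that each `data:` payload is JSON; neither detail is confirmed by this page, and `iter_sse_events` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {"event": ..., "data": ...} dicts from raw SSE text lines."""
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment line such as ":keepalive", carries no event
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                payload = json.loads("\n".join(data_lines))  # assumes JSON data
                yield {"event": event_type, "data": payload}
            event_type, data_lines = "message", []
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # SSE strips one leading space after the colon
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
```

In practice the lines would come from a streaming HTTP response to GET /api/v1/audit/stream; the parser itself is independent of the HTTP client used.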


Metrics (Not Yet Implemented)

A Prometheus metrics endpoint is not yet implemented. The following metrics are planned at /metrics:

| Metric | Type | Labels |
| --- | --- | --- |
| behavry_tool_calls_total | Counter | agent_id, policy_result, mcp_server |
| behavry_opa_decision_duration_seconds | Histogram | mcp_server |
| behavry_audit_write_duration_seconds | Histogram | |
| behavry_escalations_pending | Gauge | |
| behavry_alerts_open_total | Gauge | severity |
| behavry_opa_errors_total | Counter | |

In the interim, derive metrics from the audit log via SQL or the REST API.
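As a rough illustration of deriving a metric from audit data, the sketch below aggregates audit rows (for example, rows fetched via SQL or the REST API) into per-label counts and renders them in the Prometheus text exposition format. The row shape is an assumption: `agent_id` and `policy_result` appear in the queries on this page, but an `mcp_server` field on audit rows is hypothetical, as are the function names.

```python
from collections import Counter
from typing import Iterable, Mapping


def tool_call_counts(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count rows per (agent_id, policy_result, mcp_server) label set."""
    counts: Counter = Counter()
    for row in rows:
        # Assumption: rows may lack mcp_server, so fall back to "unknown".
        key = (row["agent_id"], row["policy_result"], row.get("mcp_server", "unknown"))
        counts[key] += 1
    return counts


def to_exposition(counts: Counter) -> str:
    """Render the counts as Prometheus text-format sample lines."""
    lines = ["# TYPE behavry_tool_calls_total counter"]
    for (agent_id, result, server), value in sorted(counts.items()):
        # Label values are not escaped here; real values containing quotes
        # or backslashes would need escaping.
        lines.append(
            f'behavry_tool_calls_total{{agent_id="{agent_id}",'
            f'policy_result="{result}",mcp_server="{server}"}} {value}'
        )
    return "\n".join(lines)
```

Because the counts are recomputed from whatever rows are passed in, this gives totals over a query window, not a monotonically increasing counter in the strict Prometheus sense.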


Key Queries for Operational Monitoring

Denial rate in last hour

SELECT
  COUNT(*) FILTER (WHERE policy_result = 'deny') AS denials,
  COUNT(*) AS total,
  ROUND(100.0 * COUNT(*) FILTER (WHERE policy_result = 'deny') / NULLIF(COUNT(*), 0), 1) AS deny_pct
FROM audit_events
WHERE timestamp > NOW() - INTERVAL '1 hour';

Top denied agents

SELECT agent_id, COUNT(*) AS deny_count
FROM audit_events
WHERE policy_result = 'deny'
  AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY agent_id
ORDER BY deny_count DESC
LIMIT 10;

OPA decision latency distribution

SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) AS p50,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95,
  percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99
FROM audit_events
WHERE timestamp > NOW() - INTERVAL '1 hour'
  AND policy_result IS NOT NULL;
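When reading these percentiles, note that percentile_cont interpolates linearly between adjacent ordered values, so the result may be a latency that no single request actually had. The following sketch reproduces that interpolation in Python for small samples; the function name simply mirrors the SQL aggregate, and the sample latencies are made up for illustration.

```python
import math
from typing import Optional, Sequence


def percentile_cont(values: Sequence[float], fraction: float) -> Optional[float]:
    """Continuous percentile with linear interpolation between neighbours."""
    if not values:
        return None  # the SQL aggregate returns NULL for an empty input
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional index
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])  # position lands exactly on one value
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


latencies_ms = [10, 20, 30, 40]  # illustrative values only
p50 = percentile_cont(latencies_ms, 0.5)  # halfway between 20 and 30
```

With only a handful of rows in the window, p95 and p99 sit close to the maximum, so high percentiles are only meaningful once the hour contains enough decisions.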

Pending escalations due to time out within 10 minutes

SELECT id, agent_id, tool_name, target, timestamp, timeout_at
FROM escalations
WHERE status = 'pending'
  AND timeout_at < NOW() + INTERVAL '10 minutes'
ORDER BY timeout_at ASC;

DLP findings by severity (last 24h)

SELECT
  finding->>'severity' AS severity,
  COUNT(*) AS occurrences
FROM audit_events,
  jsonb_array_elements(dlp_findings) AS finding
WHERE dlp_findings IS NOT NULL
  AND timestamp > NOW() - INTERVAL '24 hours'
GROUP BY finding->>'severity'
ORDER BY occurrences DESC;

TimescaleDB Performance

Chunk Status

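-- Show chunk sizes and compression state, newest chunk first.
-- chunk_compression_stats() has no range_start column, so join the chunks view for ordering.
SELECT s.chunk_name,
  pg_size_pretty(s.before_compression_total_bytes) AS before,
  pg_size_pretty(s.after_compression_total_bytes) AS after,
  s.compression_status
FROM chunk_compression_stats('audit_events') AS s
JOIN timescaledb_information.chunks AS c
  ON c.chunk_schema = s.chunk_schema AND c.chunk_name = s.chunk_name
ORDER BY c.range_start DESC;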

Compression Policy

-- Enable compression (recommended for production)
ALTER TABLE audit_events SET (timescaledb.compress, timescaledb.compress_segmentby = 'agent_id');
SELECT add_compression_policy('audit_events', INTERVAL '7 days');

-- Check policy jobs (match on proc_name; LIKE is case-sensitive and job names are capitalized)
SELECT * FROM timescaledb_information.jobs WHERE proc_name = 'policy_compression';

Alerting Recommendations (Production)

Configure these alerts in your monitoring system:

| Alert | Condition | Severity |
| --- | --- | --- |
| OPA unreachable | OPA health check failing | Critical |
| High denial rate | > 20% denials in 5 min | Warning |
| Audit write failures | Any "Failed to write audit event" logs | Critical |
| DB connection pool exhausted | Pool wait time > 1s | Warning |
| Pending escalations | Any escalation pending > 20 min | Warning |
| No audit events | 0 events in 5 min during business hours | Warning |
| Disk space | DB volume > 80% capacity | Warning |
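Until a metrics endpoint exists, two of these conditions can be checked directly against recent audit events. The sketch below evaluates the "High denial rate" and "No audit events" conditions over a 5-minute window. It is only an illustration: the function name and the `(timestamp, policy_result)` input shape are assumptions, and the business-hours restriction is left to the caller.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def evaluate_alerts(
    events: Iterable[Tuple[datetime, str]],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
    deny_threshold: float = 0.20,
) -> List[str]:
    """Return the names of the alert conditions that currently hold."""
    # Keep only the policy results of events inside the look-back window.
    recent = [result for ts, result in events if now - window < ts <= now]
    if not recent:
        return ["No audit events"]  # caller decides whether it is business hours
    deny_rate = sum(1 for result in recent if result == "deny") / len(recent)
    if deny_rate > deny_threshold:
        return ["High denial rate"]
    return []
```

A scheduled job could run this every minute against the last few minutes of audit rows and forward any returned names to the monitoring system.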

Tracing (Not Yet Implemented)

OpenTelemetry tracing is not yet implemented. When added, each proxy request will carry a trace_id through:

  • JWT validation
  • Session check
  • DLP scan
  • OPA evaluation
  • Backend forwarding
  • Audit write

This will enable end-to-end latency attribution per enforcement stage.
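To make the idea of per-stage latency attribution concrete, here is a small stand-in written with the standard library only. It does not use the OpenTelemetry API and is not the planned implementation; the class `RequestTrace`, the stage names, and `handle_request` are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

# Stage names follow the enforcement stages listed above (identifiers are illustrative).
STAGES = [
    "jwt_validation",
    "session_check",
    "dlp_scan",
    "opa_evaluation",
    "backend_forwarding",
    "audit_write",
]


class RequestTrace:
    """Carries one trace_id and records how long each stage took."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.stages = []  # (stage name, elapsed seconds) in execution order

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            self.stages.append((name, time.perf_counter() - start))


def handle_request(trace: RequestTrace) -> None:
    for name in STAGES:
        with trace.stage(name):
            pass  # placeholder for the real work of this stage


trace = RequestTrace()
handle_request(trace)
```

After a request, `trace.stages` holds one timing per stage under a shared `trace_id`, which is the kind of breakdown that spans in a real tracing backend would provide.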