Cost Attribution & Token Economics
Cost Attribution is included on the Professional and Enterprise plans.
What this is
Cost Attribution turns proxy traffic into dollar figures. Behavry extracts token usage from every LLM call (OpenAI, Anthropic, Google, Ollama, NemoClaw) that flows through the proxy, multiplies by the current model pricing, and rolls the result up per agent, per model, per provider, and per tenant.
You get two things out of it:
- Attribution — who spent what. "Agent X cost $14.20 yesterday; $11.80 of that was Claude Opus; here are the workflows that used it."
- Optimization signal — which agents are running on models that are too expensive for what they do, and which tool-heavy agents would benefit from Context Window Governance.
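As a rough illustration of the arithmetic behind attribution, the sketch below prices a few calls and rolls them up per agent and per model. It assumes rates are expressed per 1,000 tokens, as the default pricing table labels them. The `Call` record, the `call_cost` and `roll_up` helpers, and the sample agents are hypothetical and are not Behavry's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Call:
    """One proxied LLM call (hypothetical shape, for illustration only)."""
    agent: str
    model: str
    input_tokens: int
    output_tokens: int


# Illustrative (input, output) rates per 1k tokens, copied from the default table.
RATES = {
    "gpt-4o-mini": (Decimal("0.15"), Decimal("0.60")),
    "gpt-4o": (Decimal("2.50"), Decimal("10.00")),
}


def call_cost(call: Call) -> Decimal:
    """Cost of one call: tokens times rate, divided by 1000, input plus output."""
    in_rate, out_rate = RATES[call.model]
    return (call.input_tokens * in_rate + call.output_tokens * out_rate) / 1000


def roll_up(calls, key):
    """Sum call costs grouped by an attribute name such as 'agent' or 'model'."""
    totals = defaultdict(Decimal)
    for call in calls:
        totals[getattr(call, key)] += call_cost(call)
    return dict(totals)


calls = [
    Call("support-bot", "gpt-4o-mini", 2000, 1000),
    Call("support-bot", "gpt-4o", 1000, 500),
    Call("report-writer", "gpt-4o", 4000, 2000),
]
by_agent = roll_up(calls, "agent")
by_model = roll_up(calls, "model")
```

`Decimal` is used here so that many small per-call amounts add up without floating-point drift; whether the product does the same is not stated in this page.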
Pricing
Default pricing for 14 models is seeded at startup from backend/behavry/analytics/cost_attribution.py:
| Provider | Model | Input $/1k | Output $/1k |
|---|---|---|---|
| OpenAI | gpt-4o | 2.50 | 10.00 |
| OpenAI | gpt-4o-mini | 0.15 | 0.60 |
| OpenAI | gpt-4.1 | 2.00 | 8.00 |
| OpenAI | gpt-4.1-mini | 0.40 | 1.60 |
| OpenAI | gpt-4.1-nano | 0.10 | 0.40 |
| OpenAI | o3 | 2.00 | 8.00 |
| OpenAI | o3-mini | 1.10 | 4.40 |
| OpenAI | o4-mini | 1.10 | 4.40 |
| Anthropic | claude-sonnet-4-* | 3.00 | 15.00 |
| Anthropic | claude-haiku-3-5-* | 0.80 | 4.00 |
| Anthropic | claude-opus-4-* | 15.00 | 75.00 |
| Google | gemini-2.5-pro | 1.25 | 10.00 |
| Google | gemini-2.5-flash | 0.15 | 0.60 |
| Google | gemini-2.0-flash | 0.10 | 0.40 |
Tenants can override any row, or add custom models, through the pricing API. Ollama is treated as free by default (local inference); tenants can set a synthetic per-1k rate to capture compute cost.
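One way to picture the lookup order is sketched below: tenant overrides win over defaults, wildcard rows such as claude-sonnet-4-* match dated model names, and Ollama falls back to a zero rate. This is a sketch under assumptions: `resolve_rates` is a hypothetical helper (not the real `lookup_pricing()`), glob-style matching is only inferred from the `*` in the table, and returning `None` for an unknown model is a guess.

```python
from fnmatch import fnmatchcase

# (provider, model pattern) -> (input rate, output rate); a subset of the defaults.
DEFAULT_PRICING = {
    ("anthropic", "claude-sonnet-4-*"): (3.00, 15.00),
    ("anthropic", "claude-opus-4-*"): (15.00, 75.00),
    ("openai", "gpt-4o"): (2.50, 10.00),
}

FREE = (0.0, 0.0)  # local inference is treated as free by default


def resolve_rates(provider, model, overrides=None):
    """Return (input_rate, output_rate), checking tenant overrides first."""
    for table in (overrides or {}, DEFAULT_PRICING):
        # An exact row beats a wildcard row within the same table.
        if (provider, model) in table:
            return table[(provider, model)]
        for (row_provider, pattern), rates in table.items():
            if row_provider == provider and fnmatchcase(model, pattern):
                return rates
    if provider == "ollama":
        return FREE
    return None  # unknown model: assumed to be left unpriced


# A tenant assigning a synthetic rate to a local model (hypothetical values).
tenant_overrides = {("ollama", "llama3"): (0.02, 0.02)}
sonnet = resolve_rates("anthropic", "claude-sonnet-4-20250514")
local_default = resolve_rates("ollama", "llama3")
local_priced = resolve_rates("ollama", "llama3", tenant_overrides)
```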
How tokens get counted
Two integration paths write to the same audit_events table:
- AI proxies (anthropic_proxy.py, openai_proxy.py, gemini_proxy.py, ollama_proxy.py, nemoclaw_proxy.py) publish events with input_tokens and output_tokens parsed from the upstream response body.
- MCP proxy writes its own audit rows directly, and the CostAuditWriter subscriber enriches them with token counts extracted from the tool response.
Both paths run the same pipeline: extract_token_usage() reads the token counts, lookup_pricing() finds the matching rate, and the result is stored as estimated_cost on the event row.
When the upstream response does not include token counts (some streaming paths), a fallback tokenizer estimates them and tags the row with token_source=estimated.
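The extraction and fallback steps might look roughly like the sketch below. The usage field names are the ones the public OpenAI Chat Completions, Anthropic Messages, Gemini, and Ollama responses carry; everything else is an assumption. `token_usage` and `estimate_tokens` are hypothetical names, the four-characters-per-token heuristic stands in for whatever fallback tokenizer is actually used, and `"upstream"` is an invented label for the non-estimated case. NemoClaw is left out because its response shape is not described here.

```python
# provider -> (container key or None for top level, input key, output key)
USAGE_FIELDS = {
    "openai": ("usage", "prompt_tokens", "completion_tokens"),
    "anthropic": ("usage", "input_tokens", "output_tokens"),
    "google": ("usageMetadata", "promptTokenCount", "candidatesTokenCount"),
    "ollama": (None, "prompt_eval_count", "eval_count"),
}


def estimate_tokens(text: str) -> int:
    """Crude stand-in tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4) if text else 0


def token_usage(provider, body, prompt_text="", completion_text=""):
    """Read token counts from the response body, or estimate them if absent."""
    container, in_key, out_key = USAGE_FIELDS[provider]
    usage = body if container is None else (body.get(container) or {})
    if in_key in usage and out_key in usage:
        return {
            "input_tokens": usage[in_key],
            "output_tokens": usage[out_key],
            "token_source": "upstream",  # hypothetical label
        }
    return {
        "input_tokens": estimate_tokens(prompt_text),
        "output_tokens": estimate_tokens(completion_text),
        "token_source": "estimated",
    }


reported = token_usage("anthropic", {"usage": {"input_tokens": 12, "output_tokens": 34}})
fallback = token_usage("openai", {}, prompt_text="a" * 40, completion_text="b" * 8)
```

Keeping `token_source` on the row lets later queries separate exact figures from estimates when accuracy matters.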
Queries
All cost aggregation queries use TimescaleDB's time_bucket function, which keeps daily and hourly rollups cheap even as the table grows.
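A rollup of this kind could be expressed as in the sketch below, which only builds the SQL and its parameters. `time_bucket` is the real TimescaleDB function and `audit_events` and `estimated_cost` come from this page; the `created_at` and `provider` column names, the `cost_rollup_query` helper, and the allowed bucket list are assumptions. Placeholders use the pyformat style that psycopg accepts. No tenant filter appears because isolation is assumed to be enforced by RLS.

```python
from datetime import datetime, timedelta, timezone

ALLOWED_BUCKETS = {"1 hour", "1 day"}  # hypothetical whitelist


def cost_rollup_query(bucket, since):
    """Return (sql, params) for a per-provider cost rollup."""
    if bucket not in ALLOWED_BUCKETS:
        raise ValueError(f"unsupported bucket: {bucket!r}")
    sql = (
        "SELECT time_bucket(%(bucket)s::interval, created_at) AS bucket_start, "
        "provider, SUM(estimated_cost) AS cost "
        "FROM audit_events "
        "WHERE created_at >= %(since)s "
        "GROUP BY bucket_start, provider "
        "ORDER BY bucket_start"
    )
    return sql, {"bucket": bucket, "since": since}


# Daily totals per provider for the last seven days.
sql, params = cost_rollup_query("1 day", datetime.now(timezone.utc) - timedelta(days=7))
```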
API
Routes: backend/behavry/analytics/cost_api.py, pricing_routes.py.
| Method | Path | Purpose |
|---|---|---|
| GET | /api/v1/cost/summary?period=7d | Tenant total, breakdown by provider |
| GET | /api/v1/cost/by-agent?period=30d | Per-agent spend with rank |
| GET | /api/v1/cost/by-model?period=30d | Per-model spend with token volume |
| GET | /api/v1/cost/timeseries?bucket=1h | Time-series for charting |
| GET | /api/v1/cost/pricing | Current pricing table |
| POST | /api/v1/cost/pricing | Add or override a pricing row |
All endpoints respect tenant isolation via RLS.
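Calling these routes from a script might look like the sketch below, which builds requests with the standard library and does not send them. The paths and query parameters come from the table above; the host name, the bearer-token header, the `cost_request` helper, and every field name in the POST body are placeholders, since this page does not document authentication or the pricing payload.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://behavry.example.com"  # hypothetical host


def cost_request(path, token, params=None, body=None):
    """Build a GET request, or a JSON POST when a body is given."""
    url = f"{BASE_URL}/api/v1/cost/{path}"
    if params:
        url += "?" + urlencode(params)
    headers = {"Authorization": f"Bearer {token}"}  # auth scheme is assumed
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "POST" if body is not None else "GET"
    return Request(url, data=data, headers=headers, method=method)


# Per-agent spend for the last 30 days.
by_agent_req = cost_request("by-agent", "TOKEN", params={"period": "30d"})

# Override a pricing row; the field names here are illustrative only.
override_req = cost_request(
    "pricing",
    "TOKEN",
    body={"provider": "ollama", "model": "llama3", "input_rate": 0.02, "output_rate": 0.02},
)
```

Either request can then be sent with `urllib.request.urlopen(req)` and the response decoded with `json.load`.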
Dashboard
Observability → Cost shows:
- Headline: yesterday / last 7d / last 30d spend
- Stacked area chart by provider
- Top 10 agents by spend
- Top 10 tool patterns by token burn
- Projected monthly spend based on the current 7-day rate
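The projection in the last item can be read as a simple run-rate calculation, sketched below. This is an assumption about the formula: the average of the last seven daily totals is multiplied by a 30-day month. The dashboard may instead use calendar-month length or a weighted rate, and `projected_monthly_spend` is a hypothetical name.

```python
def projected_monthly_spend(last_7_days, days_in_month=30):
    """Extrapolate monthly spend from the average of seven daily totals."""
    if len(last_7_days) != 7:
        raise ValueError("expected exactly seven daily totals")
    daily_rate = sum(last_7_days) / 7
    return daily_rate * days_in_month


# Seven daily totals in dollars (sample figures, not real data).
projection = projected_monthly_spend([10, 12, 14, 16, 14, 12, 20])
```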
Related
- Context Window Governance — the lever most teams use to cut cost on token-heavy MCP sessions
- Observability Stack — Prometheus metrics export for cost
- SIEM Connectors — forward cost events alongside policy events