Cost Attribution & Token Economics

Feature row 38 — Sprint CA

Cost Attribution is included on the Professional and Enterprise plans.

What this is

Cost Attribution turns proxy traffic into dollar figures. Behavry extracts token usage from every LLM call (OpenAI, Anthropic, Google, Ollama, NemoClaw) that flows through the proxy, multiplies the input and output token counts by the model's current rates, and rolls the result up per agent, per model, per provider, and per tenant.

You get two things out of it:

Attribution — who spent what. "Agent X cost $14.20 yesterday; $11.80 of that was Claude Opus; here are the workflows that used it."
Optimization signal — which agents are running on models that are too expensive for what they do, and which tool-heavy agents would benefit from Context Window Governance.
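As a rough illustration of the attribution roll-up, the sketch below sums already-priced events along one dimension at a time. The event dictionaries and their field names are hypothetical and are not Behavry's actual schema; the figures mirror the Agent X example above.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical, already-priced events; field names are illustrative only.
events = [
    {"tenant": "acme", "agent": "agent-x", "provider": "anthropic",
     "model": "claude-opus-4", "cost": Decimal("11.80")},
    {"tenant": "acme", "agent": "agent-x", "provider": "openai",
     "model": "gpt-4o-mini", "cost": Decimal("2.40")},
    {"tenant": "acme", "agent": "agent-y", "provider": "openai",
     "model": "gpt-4o-mini", "cost": Decimal("0.75")},
]


def roll_up(events, dimension):
    """Sum cost per distinct value of one dimension (agent, model, provider, tenant)."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for event in events:
        totals[event[dimension]] += event["cost"]
    return dict(totals)


by_agent = roll_up(events, "agent")
by_model = roll_up(events, "model")
```

Decimal is used here only to keep dollar sums exact; whether the backend stores costs as decimals or floats is not stated in this page.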

Pricing

Default pricing for 14 models is seeded at startup from backend/behavry/analytics/cost_attribution.py:

Provider	Model	Input $/1k	Output $/1k
OpenAI	`gpt-4o`	2.50	10.00
OpenAI	`gpt-4o-mini`	0.15	0.60
OpenAI	`gpt-4.1`	2.00	8.00
OpenAI	`gpt-4.1-mini`	0.40	1.60
OpenAI	`gpt-4.1-nano`	0.10	0.40
OpenAI	`o3`	2.00	8.00
OpenAI	`o3-mini`	1.10	4.40
OpenAI	`o4-mini`	1.10	4.40
Anthropic	`claude-sonnet-4-*`	3.00	15.00
Anthropic	`claude-haiku-3-5-*`	0.80	4.00
Anthropic	`claude-opus-4-*`	15.00	75.00
Google	`gemini-2.5-pro`	1.25	10.00
Google	`gemini-2.5-flash`	0.15	0.60
Google	`gemini-2.0-flash`	0.10	0.40

Tenants can override any row, or add custom models, through the pricing API. Ollama is treated as free by default (local inference); tenants can set a synthetic per-1k rate to capture compute cost.
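One way to picture how defaults, wildcard model names, and tenant overrides could interact is sketched below. It is not the real lookup_pricing() implementation: the function name resolve_pricing, the dictionary layout, and the use of shell-style wildcard matching are all assumptions made for illustration. Rates are copied from the table above in the units it lists.

```python
from fnmatch import fnmatchcase

# A few default rows from the table: (provider, model pattern) -> (input, output) rate.
DEFAULT_PRICING = {
    ("openai", "gpt-4o"): (2.50, 10.00),
    ("anthropic", "claude-sonnet-4-*"): (3.00, 15.00),
    ("anthropic", "claude-opus-4-*"): (15.00, 75.00),
}


def resolve_pricing(provider, model, overrides=None):
    """Return (input_rate, output_rate), or None if the model is unknown.

    Hypothetical helper: tenant overrides are checked before the defaults.
    """
    for table in (overrides or {}, DEFAULT_PRICING):
        # An exact model name wins over a wildcard pattern in the same table.
        if (provider, model) in table:
            return table[(provider, model)]
        for (row_provider, pattern), rates in table.items():
            if row_provider == provider and fnmatchcase(model, pattern):
                return rates
    if provider == "ollama":
        return (0.0, 0.0)  # local inference is treated as free by default
    return None


opus_rates = resolve_pricing("anthropic", "claude-opus-4-20250514")
# A tenant assigning a synthetic rate to all of its Ollama models:
ollama_rates = resolve_pricing("ollama", "llama3", {("ollama", "*"): (0.01, 0.01)})
```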

How tokens get counted

Two integration paths write to the same audit_events table:

AI proxies (anthropic_proxy.py, openai_proxy.py, gemini_proxy.py, ollama_proxy.py, nemoclaw_proxy.py) publish events with input_tokens and output_tokens parsed from the upstream response body.
The MCP proxy writes its own audit rows directly, and the CostAuditWriter subscriber enriches them with token counts extracted from the tool response.
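For the AI proxy path, the sketch below shows where token counts typically sit in a parsed (non-streaming) response body for four of the providers. The function name extract_usage and the provider labels are hypothetical, not the project's extract_token_usage(); the JSON field names are the ones the public OpenAI Chat Completions, Anthropic Messages, Gemini, and Ollama APIs return. NemoClaw is omitted because its response format is not described here.

```python
def extract_usage(provider, body):
    """Return (input_tokens, output_tokens), or None when the body carries no counts."""
    try:
        if provider == "openai":
            usage = body["usage"]
            return usage["prompt_tokens"], usage["completion_tokens"]
        if provider == "anthropic":
            usage = body["usage"]
            return usage["input_tokens"], usage["output_tokens"]
        if provider == "google":
            usage = body["usageMetadata"]
            return usage["promptTokenCount"], usage["candidatesTokenCount"]
        if provider == "ollama":
            return body["prompt_eval_count"], body["eval_count"]
    except KeyError:
        # Counts are missing, for example on some streaming responses.
        return None
    return None  # unknown provider


anthropic_body = {"usage": {"input_tokens": 12, "output_tokens": 5}}
counts = extract_usage("anthropic", anthropic_body)
```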

Both paths then run the same pipeline: extract_token_usage() reads the token counts, lookup_pricing() finds the matching rate, and the result is stored as estimated_cost on the event row.
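The arithmetic behind estimated_cost can be sketched as follows. The function name estimate_cost and its parameters are hypothetical; the rates are read in the units the pricing table lists (dollars per 1k tokens), and the example values come from the gpt-4o-mini row.

```python
from decimal import Decimal


def estimate_cost(input_tokens, output_tokens, input_rate_per_1k, output_rate_per_1k):
    """Dollar cost of one call, given token counts and per-1k-token rates."""
    input_cost = Decimal(input_tokens) / 1000 * Decimal(str(input_rate_per_1k))
    output_cost = Decimal(output_tokens) / 1000 * Decimal(str(output_rate_per_1k))
    return input_cost + output_cost


# 1,200 input tokens and 300 output tokens at the gpt-4o-mini rates (0.15 / 0.60).
cost = estimate_cost(1200, 300, "0.15", "0.60")
```

Input and output tokens are priced separately because every row in the table carries two different rates.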

When the upstream response does not include token counts (some streaming paths), a fallback tokenizer estimates them and tags the row with token_source=estimated.
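A minimal sketch of that fallback decision is shown below. The four-characters-per-token heuristic is a crude stand-in and is not the fallback tokenizer Behavry ships; the function name count_tokens and the "reported" label are also assumptions. Only the "estimated" value of token_source comes from this page.

```python
def count_tokens(reported, text):
    """Return (token_count, token_source) for one side of a call.

    `reported` is the count from the upstream response, or None if absent.
    """
    if reported is not None:
        return reported, "reported"  # hypothetical label for upstream counts
    # Stand-in estimate: roughly four characters per token for English text.
    return max(1, len(text) // 4), "estimated"


tokens, source = count_tokens(None, "a" * 40)
```

Tagging the row lets downstream reports separate exact figures from approximations.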

Queries

All cost aggregation uses the TimescaleDB time_bucket function, so daily and hourly rollups stay cheap as the table grows.
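For illustration, a rollup query of this kind could be assembled as below. time_bucket is TimescaleDB's bucketing function, and audit_events and estimated_cost are named on this page; the created_at and agent_id columns, the function name build_rollup_query, and the %(since)s placeholder style (as used by psycopg) are assumptions.

```python
ALLOWED_BUCKETS = {"1 hour", "1 day"}


def build_rollup_query(bucket="1 hour"):
    """Build a per-agent cost rollup query for the given bucket width."""
    # The bucket width is interpolated into the SQL, so accept known values only.
    if bucket not in ALLOWED_BUCKETS:
        raise ValueError(f"unsupported bucket: {bucket!r}")
    return f"""
        SELECT time_bucket(INTERVAL '{bucket}', created_at) AS bucket,
               agent_id,
               SUM(estimated_cost) AS cost
        FROM audit_events
        WHERE created_at >= %(since)s
        GROUP BY bucket, agent_id
        ORDER BY bucket
    """


daily_sql = build_rollup_query("1 day")
```

The start of the reporting window is left as a bound parameter so that the database driver handles quoting.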

API

Routes: backend/behavry/analytics/cost_api.py, pricing_routes.py.

Method	Path	Purpose
`GET`	`/api/v1/cost/summary?period=7d`	Tenant total, breakdown by provider
`GET`	`/api/v1/cost/by-agent?period=30d`	Per-agent spend with rank
`GET`	`/api/v1/cost/by-model?period=30d`	Per-model spend with token volume
`GET`	`/api/v1/cost/timeseries?bucket=1h`	Time-series for charting
`GET`	`/api/v1/cost/pricing`	Current pricing table
`POST`	`/api/v1/cost/pricing`	Add or override a pricing row

All endpoints enforce tenant isolation through row-level security (RLS).
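A call to one of the read endpoints might be prepared as in the sketch below. The paths and the period parameter come from the table above; the host name, the bearer-token header, and the helper name cost_request are assumptions, since authentication is not described in this section.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://behavry.example.com"  # hypothetical host


def cost_request(path, token, **params):
    """Build a GET request for a cost endpoint such as 'summary' or 'by-agent'."""
    query = f"?{urlencode(params)}" if params else ""
    return Request(
        f"{BASE_URL}/api/v1/cost/{path}{query}",
        headers={"Authorization": f"Bearer {token}"},  # auth scheme is assumed
        method="GET",
    )


req = cost_request("by-agent", "YOUR_API_TOKEN", period="30d")
# Passing req to urllib.request.urlopen() would send it.
```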

Dashboard

Observability → Cost shows:

Headline: yesterday / last 7d / last 30d spend
Stacked area chart by provider
Top 10 agents by spend
Top 10 tool patterns by token burn
Projected monthly spend based on the current 7-day rate
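The projection in the last item can be read as simple run-rate arithmetic. The sketch below takes the average daily spend over the last seven days and scales it to the length of the current calendar month; the exact formula the dashboard uses is not documented here, so this is only one plausible interpretation, and project_monthly_spend is a hypothetical name.

```python
import calendar
from datetime import date


def project_monthly_spend(last_7_days, today):
    """Scale the average daily spend to the number of days in today's month."""
    daily_rate = sum(last_7_days) / len(last_7_days)
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return daily_rate * days_in_month


# Seven days at $10 per day, projected over June (30 days).
projection = project_monthly_spend([10.0] * 7, date(2025, 6, 15))
```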

Related

Context Window Governance: the lever most teams use to cut cost on token-heavy MCP sessions
Observability Stack: Prometheus metrics export for cost
SIEM Connectors: forward cost events alongside policy events
