# Behavry Integration -- Ollama API Proxy
For teams running local models with Ollama, Behavry can proxy all inference calls for identity verification, policy enforcement, and audit logging -- without requiring any upstream API key.
This is particularly useful for on-premises and air-gapped deployments where models run entirely on local hardware.
## How It Works

```text
Your Code (ollama SDK or OpenAI-compatible client)
        |  base_url = http://localhost:8000/api/v1/ollama
        v
Behavry Proxy
        |  validates JWT | audits metadata | checks OPA policy
        v
Ollama (localhost:11434)
        ^  response streamed back
```
The proxy:

- Validates your Behavry agent JWT (`Authorization: Bearer <behavry-jwt>`)
- Audits request metadata: model, message/prompt count, system prompt presence (boolean only) -- not message content
- Forwards the request to `BEHAVRY_OLLAMA_URL/{path}` (default `http://localhost:11434`)
- Streams the response back transparently
- Audits response metadata: token counts, done reason
## Prerequisites

- Behavry stack running (`make dev` or `docker compose up`)
- A Behavry agent with `web:read` and `web:write` permissions
- Ollama installed and running (ollama.com)
## No API Key Required

Unlike the Anthropic, OpenAI, and Gemini proxies, Ollama does not require an upstream API key. No `X-*-Key` header is needed. The only authentication is the Behavry agent JWT in the `Authorization` header.
This means setup is simpler: point your client at the Behavry proxy endpoint and provide your Behavry JWT.
## Step 1 -- Get a Behavry JWT

```bash
curl -s -X POST http://localhost:8000/api/v1/auth/token \
  -H "Content-Type: application/json" \
  -d '{"client_id": "YOUR_CLIENT_ID", "client_secret": "YOUR_SECRET", "grant_type": "client_credentials"}' \
  | jq -r .access_token
```
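The same token exchange can be done from Python with only the standard library. The endpoint and payload match the curl call above; error handling is omitted for brevity.

```python
import json
import urllib.request

TOKEN_URL = "http://localhost:8000/api/v1/auth/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the POST request that exchanges client credentials for a Behavry JWT."""
    body = json.dumps({
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "client_credentials",
    }).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(client_id: str, client_secret: str) -> str:
    """Perform the exchange and return the access token."""
    req = build_token_request(client_id, client_secret)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```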
## Step 2 -- Configure Your Code

### Python (ollama SDK)

```python
from ollama import Client

BEHAVRY_JWT = "eyJhbGci..."  # Behavry agent token

client = Client(
    host="http://localhost:8000/api/v1/ollama",
    headers={"Authorization": f"Bearer {BEHAVRY_JWT}"},
)

response = client.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])
```
### OpenAI-Compatible Client

Ollama exposes an OpenAI-compatible endpoint at `/v1/chat/completions`. Through Behavry:

```python
from openai import OpenAI

BEHAVRY_JWT = "eyJhbGci..."

client = OpenAI(
    base_url="http://localhost:8000/api/v1/ollama/v1",
    api_key=BEHAVRY_JWT,  # sent as the Bearer token
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
### Direct HTTP (curl)

Native Ollama API:

```bash
curl -X POST "http://localhost:8000/api/v1/ollama/api/chat" \
  -H "Authorization: Bearer $BEHAVRY_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```

OpenAI-compatible API:

```bash
curl -X POST "http://localhost:8000/api/v1/ollama/v1/chat/completions" \
  -H "Authorization: Bearer $BEHAVRY_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
## Step 3 -- Verify in Dashboard

Make a request, then open http://localhost:5173 and check Live Activity.

You should see an event with:

- tool_name: `ollama-api`
- mcp_server: `ollama-proxy`
- action: `POST`
- policy_result: `allow`
## Endpoint

```
POST /api/v1/ollama/{path}
```

The `{path}` parameter captures the full Ollama API path. Both native and OpenAI-compatible endpoints are supported:

| Ollama Endpoint | Behavry Path | Description |
|---|---|---|
| `/api/chat` | `/api/v1/ollama/api/chat` | Native chat completion |
| `/api/generate` | `/api/v1/ollama/api/generate` | Native text generation |
| `/api/tags` | `/api/v1/ollama/api/tags` | List local models |
| `/v1/chat/completions` | `/api/v1/ollama/v1/chat/completions` | OpenAI-compatible chat |
| `/v1/models` | `/api/v1/ollama/v1/models` | OpenAI-compatible model list |

Query parameters are forwarded as-is.
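The mapping in the table is purely a prefix. A small helper (illustrative only, not part of any SDK) makes that concrete:

```python
BEHAVRY_BASE = "http://localhost:8000/api/v1/ollama"


def proxy_url(ollama_path: str, query: str = "") -> str:
    """Map a native or OpenAI-style Ollama path onto the Behavry proxy.
    Query strings are forwarded unchanged, matching the proxy's behavior."""
    url = f"{BEHAVRY_BASE}{ollama_path}"
    return f"{url}?{query}" if query else url
```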
## Two API Styles

Ollama supports two API formats, and the Behavry proxy handles both transparently:

- **Native Ollama API** (`/api/chat`, `/api/generate`) -- uses Ollama's own request/response format. Token counts come from the `prompt_eval_count` and `eval_count` fields; the finish reason from `done_reason`.
- **OpenAI-compatible API** (`/v1/chat/completions`) -- uses the OpenAI request/response format. Token counts come from `usage.prompt_tokens` and `usage.completion_tokens`; the finish reason from `choices[0].finish_reason`.

Both styles produce identical audit events in Behavry.
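A sketch of how the two response shapes can be normalized into one set of audit fields, using the field names listed above (the output key names here are illustrative, not Behavry's actual schema):

```python
def extract_usage(resp: dict) -> dict:
    """Pull token counts and finish reason out of either response style."""
    if "usage" in resp:
        # OpenAI-compatible shape
        return {
            "input_tokens": resp["usage"].get("prompt_tokens"),
            "output_tokens": resp["usage"].get("completion_tokens"),
            "finish_reason": resp["choices"][0].get("finish_reason"),
        }
    # Native Ollama shape
    return {
        "input_tokens": resp.get("prompt_eval_count"),
        "output_tokens": resp.get("eval_count"),
        "finish_reason": resp.get("done_reason"),
    }
```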
## Upstream URL Configuration

The proxy forwards requests to the Ollama instance specified by the `BEHAVRY_OLLAMA_URL` environment variable:

| Variable | Default | Description |
|---|---|---|
| `BEHAVRY_OLLAMA_URL` | `http://localhost:11434` | URL of the Ollama instance |

For remote Ollama instances or custom ports:

```bash
export BEHAVRY_OLLAMA_URL=http://gpu-server.internal:11434
```
## Timeout
The proxy uses a 300-second timeout (5 minutes) for upstream requests. This is significantly longer than the other API proxies (120 seconds) to accommodate local model inference on consumer hardware, where large models may take several minutes to generate responses.
If you consistently hit timeouts, consider using a smaller model or hardware with more VRAM.
## Audited Metadata

The proxy logs the following -- message content and prompt text are never stored:

| Field | Source (Native) | Source (OpenAI-compat) |
|---|---|---|
| Model | `model` field in request body | `model` field in request body |
| Has system prompt | true if any message has `role: "system"` | Same |
| Has tools | true if `tools` array present | Same |
| Message count | Length of `messages` array (or 1 for `prompt`) | Length of `messages` array |
| Input tokens | `prompt_eval_count` | `usage.prompt_tokens` |
| Output tokens | `eval_count` | `usage.completion_tokens` |
| Finish reason | `done_reason` | `choices[0].finish_reason` |
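The request-side fields can be derived from the request body alone. A sketch, assuming the table's rules (the `system` key check for `/api/generate` relies on the native generate API's optional `system` field):

```python
def request_metadata(body: dict) -> dict:
    """Derive the audited request fields from a request body.
    Only booleans and counts are kept -- never message content."""
    messages = body.get("messages")
    if messages is not None:  # /api/chat or /v1/chat/completions
        message_count = len(messages)
        has_system = any(m.get("role") == "system" for m in messages)
    else:  # /api/generate takes a single prompt (plus an optional system field)
        message_count = 1
        has_system = "system" in body
    return {
        "model": body.get("model"),
        "message_count": message_count,
        "has_system_prompt": has_system,
        "has_tools": bool(body.get("tools")),
    }
```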
## Streaming

Streaming is fully supported. The proxy detects streaming responses by the `text/event-stream` or `application/x-ndjson` content type and passes chunks through without buffering. An audit event is published at the start of the stream with the upstream status code.

Ollama's native API streams newline-delimited JSON (`application/x-ndjson`) by default. Set `"stream": false` in the request body to receive a single buffered response with full token counts.
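The NDJSON framing is simple enough to parse by hand on the client side. A sketch that reassembles stream chunks into events, tolerating chunk boundaries that fall mid-line (in Ollama's native stream, the final event with `"done": true` carries the token counts):

```python
import json


def iter_ndjson(chunks):
    """Yield one dict per newline-delimited JSON line from a chunked stream."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while "\n" in buf:
            line, buf = buf.split("\n", 1)
            if line.strip():
                yield json.loads(line)
```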
## Error Handling

### 502 -- "Cannot reach Ollama ... Is it running?"

The proxy returns this specific error when it cannot connect to the Ollama instance (`httpx.ConnectError`). This typically means:

- Ollama is not running. Start it with `ollama serve`.
- `BEHAVRY_OLLAMA_URL` points to an incorrect host or port.
- A firewall is blocking the connection.
### 504 -- Gateway Timeout
The upstream request exceeded the 300-second timeout. Consider using a smaller model or adding more compute resources.
## Policy Control

Example OPA policy to restrict which local models can be used (the `if` rule syntax requires `import rego.v1`):

```rego
package behavry.authz

import rego.v1

# Only allow approved local models
deny if {
    input.mcp_server == "ollama-proxy"
    approved_models := {"llama3.2", "codellama", "mistral"}
    not approved_models[input.model]
}

# Block local model usage for external-facing agents
deny if {
    input.mcp_server == "ollama-proxy"
    input.agent_role == "external"
}
```
## Troubleshooting

### 401 from Behavry

The JWT is expired or missing. Re-fetch one using Step 1.

### 502 "Cannot reach Ollama"

Ollama is not running or is on a different host/port. Verify with:

```bash
curl http://localhost:11434/api/tags
```

If that works but the proxy still fails, check `BEHAVRY_OLLAMA_URL`.

### Model not found

Ollama returns 404 if the requested model has not been pulled locally. Pull it first:

```bash
ollama pull llama3.2
```

### Slow responses

Local inference speed depends on available hardware. The proxy allows up to 300 seconds per request. If responses are consistently slow, consider quantized models (e.g., `llama3.2:q4_0`) or offloading to a GPU-equipped machine and pointing `BEHAVRY_OLLAMA_URL` at it.