Behavry Integration -- Ollama API Proxy

For teams running local models with Ollama, Behavry can proxy all inference calls for identity verification, policy enforcement, and audit logging -- without requiring any upstream API key.

This is particularly useful for on-premises and air-gapped deployments where models run entirely on local hardware.


How It Works

Your Code (ollama SDK or OpenAI-compatible client)
        |  base_url=http://localhost:8000/api/v1/ollama
        v
Behavry Proxy
        |  validates JWT | audits metadata | checks OPA policy
        v
Ollama (localhost:11434)
        ^  response streamed back

The proxy:

  1. Validates your Behavry agent JWT (Authorization: Bearer <behavry-jwt>)
  2. Audits request metadata: model, message/prompt count, system prompt presence (boolean only) -- not message content
  3. Forwards request to BEHAVRY_OLLAMA_URL/{path} (default http://localhost:11434)
  4. Streams response back transparently
  5. Audits response metadata: token counts, done reason

Prerequisites

  • Behavry stack running (make dev or docker compose up)
  • A Behavry agent with web:read and web:write permissions
  • Ollama installed and running (ollama.com)

No API Key Required

Unlike the Anthropic, OpenAI, and Gemini proxies, Ollama does not require an upstream API key. There is no X-*-Key header needed. The only authentication is the Behavry agent JWT in the Authorization header.

This means setup is simpler: point your client at the Behavry proxy endpoint and provide your Behavry JWT.


Step 1 -- Get a Behavry JWT

curl -s -X POST http://localhost:8000/api/v1/auth/token \
  -H "Content-Type: application/json" \
  -d '{"client_id": "YOUR_CLIENT_ID", "client_secret": "YOUR_SECRET", "grant_type": "client_credentials"}' \
  | jq -r .access_token
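The same token request can be made from Python. A minimal stdlib-only sketch, mirroring the curl command above (the credential values are placeholders):

```python
import json
import urllib.request

TOKEN_URL = "http://localhost:8000/api/v1/auth/token"

def build_token_payload(client_id: str, client_secret: str) -> dict:
    """Assemble the client-credentials request body."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "grant_type": "client_credentials",
    }

def fetch_behavry_jwt(client_id: str, client_secret: str) -> str:
    """POST the payload and return the access_token field."""
    req = urllib.request.Request(
        TOKEN_URL,
        data=json.dumps(build_token_payload(client_id, client_secret)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```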

Step 2 -- Configure Your Code

Python (ollama SDK)

from ollama import Client

BEHAVRY_JWT = "eyJhbGci..." # Behavry agent token

client = Client(
    host="http://localhost:8000/api/v1/ollama",
    headers={"Authorization": f"Bearer {BEHAVRY_JWT}"},
)

response = client.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])
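The ollama SDK can also stream through the proxy by passing stream=True, which yields chat chunks incrementally. A minimal sketch of folding those chunks into the full reply; the collector is a plain function over any iterable of native-format chunks:

```python
def collect_stream(chunks):
    """Concatenate streamed chat chunks into one reply string.

    Each chunk follows the native Ollama chat shape:
    {"message": {"role": "assistant", "content": "..."}, "done": bool}
    """
    parts = []
    for chunk in chunks:
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# With the proxied client above, streaming would look like:
#   stream = client.chat(model="llama3.2", messages=[...], stream=True)
#   print(collect_stream(stream))
```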

OpenAI-Compatible Client

Ollama exposes an OpenAI-compatible endpoint at /v1/chat/completions. Through Behavry:

from openai import OpenAI

BEHAVRY_JWT = "eyJhbGci..."

client = OpenAI(
    base_url="http://localhost:8000/api/v1/ollama/v1",
    api_key=BEHAVRY_JWT, # used as Bearer token
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Direct HTTP (curl)

Native Ollama API:

curl -X POST "http://localhost:8000/api/v1/ollama/api/chat" \
  -H "Authorization: Bearer $BEHAVRY_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

OpenAI-compatible API:

curl -X POST "http://localhost:8000/api/v1/ollama/v1/chat/completions" \
  -H "Authorization: Bearer $BEHAVRY_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Step 3 -- Verify in Dashboard

Make a request, then open the dashboard at http://localhost:5173 and check Live Activity.

You should see an event with:

  • tool_name: ollama-api
  • mcp_server: ollama-proxy
  • action: POST
  • policy_result: allow

Endpoint

POST /api/v1/ollama/{path}

The {path} parameter captures the full Ollama API path. Both native and OpenAI-compatible endpoints are supported:

Ollama Endpoint         Behavry Path                          Description
/api/chat               /api/v1/ollama/api/chat               Native chat completion
/api/generate           /api/v1/ollama/api/generate           Native text generation
/api/tags               /api/v1/ollama/api/tags               List local models
/v1/chat/completions    /api/v1/ollama/v1/chat/completions    OpenAI-compatible chat
/v1/models              /api/v1/ollama/v1/models              OpenAI-compatible model list

Query parameters are forwarded as-is.


Two API Styles

Ollama supports two API formats, and the Behavry proxy handles both transparently:

Native Ollama API (/api/chat, /api/generate) -- Uses Ollama's own request/response format. Token counts come from prompt_eval_count and eval_count fields. Finish reason from done_reason.

OpenAI-compatible API (/v1/chat/completions) -- Uses the OpenAI request/response format. Token counts come from usage.prompt_tokens and usage.completion_tokens. Finish reason from choices[0].finish_reason.

Both styles produce identical audit events in Behavry.
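The mapping between the two response shapes can be sketched as a small normalizer. This is illustrative only (not the proxy's actual code); the field names come from the descriptions above:

```python
def extract_usage(body: dict) -> dict:
    """Normalize token counts and finish reason across both API styles."""
    if "usage" in body:
        # OpenAI-compatible shape
        choices = body.get("choices") or [{}]
        return {
            "input_tokens": body["usage"].get("prompt_tokens"),
            "output_tokens": body["usage"].get("completion_tokens"),
            "finish_reason": choices[0].get("finish_reason"),
        }
    # Native Ollama shape
    return {
        "input_tokens": body.get("prompt_eval_count"),
        "output_tokens": body.get("eval_count"),
        "finish_reason": body.get("done_reason"),
    }
```

Feeding a native response and an OpenAI-compatible response for the same exchange through this function should yield the same normalized record, which is why both styles can produce identical audit events.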


Upstream URL Configuration

The proxy forwards requests to the Ollama instance specified by the BEHAVRY_OLLAMA_URL environment variable:

Variable              Default                   Description
BEHAVRY_OLLAMA_URL    http://localhost:11434    URL of the Ollama instance

For remote Ollama instances or custom ports:

export BEHAVRY_OLLAMA_URL=http://gpu-server.internal:11434

Timeout

The proxy uses a 300-second timeout (5 minutes) for upstream requests. This is significantly longer than the other API proxies (120 seconds) to accommodate local model inference on consumer hardware, where large models may take several minutes to generate responses.

If you consistently hit timeouts, consider using a smaller model or hardware with more VRAM.


Audited Metadata

The proxy logs the following -- message content and prompt text are never stored:

Field                Source (Native)                               Source (OpenAI-compat)
Model                model field in request body                   model field in request body
Has system prompt    true if any message has role: "system"        Same
Has tools            true if tools array present                   Same
Message count        Length of messages array (or 1 for prompt)    Length of messages array
Input tokens         prompt_eval_count                             usage.prompt_tokens
Output tokens        eval_count                                    usage.completion_tokens
Finish reason        done_reason                                   choices[0].finish_reason
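On the request side, the audited fields can be derived from the body alone, without ever reading message content. A hypothetical sketch of that extraction (not the proxy's source):

```python
def request_metadata(body: dict) -> dict:
    """Derive audited request fields without touching message content."""
    messages = body.get("messages")
    return {
        "model": body.get("model"),
        "has_system_prompt": any(
            m.get("role") == "system" for m in (messages or [])
        ),
        "has_tools": bool(body.get("tools")),
        # /api/generate sends a single "prompt" instead of "messages"
        "message_count": len(messages) if messages is not None else 1,
    }
```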

Streaming

Streaming is fully supported. The proxy detects streaming responses by the text/event-stream or application/x-ndjson content type and passes chunks through without buffering. An audit event is published at the start of the stream with the upstream status code.

Ollama's native API streams newline-delimited JSON (application/x-ndjson) by default. Set "stream": false in the request body to receive a single buffered response with full token counts.
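Consuming the native NDJSON stream over plain HTTP amounts to parsing one JSON object per line. A minimal sketch of that loop; the parser works on any iterable of raw lines, so it applies to an HTTP client's line iterator as well:

```python
import json

def parse_ndjson_stream(lines):
    """Yield parsed chunks from an application/x-ndjson response body."""
    for raw in lines:
        raw = raw.strip()
        if raw:
            yield json.loads(raw)

def final_chunk(lines):
    """Return the last chunk; when done, it carries the token counts."""
    last = None
    for chunk in parse_ndjson_stream(lines):
        last = chunk
    return last
```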


Error Handling

502 -- "Cannot reach Ollama ... Is it running?"

The proxy returns a specific error when it cannot connect to the Ollama instance (httpx.ConnectError). This typically means:

  • Ollama is not running. Start it with ollama serve.
  • The BEHAVRY_OLLAMA_URL points to an incorrect host or port.
  • A firewall is blocking the connection.

504 -- Gateway Timeout

The upstream request exceeded the 300-second timeout. Consider using a smaller model or adding more compute resources.


Policy Control

Example OPA policy to restrict which local models can be used:

package behavry.authz

# Only allow approved local models
deny if {
    input.mcp_server == "ollama-proxy"
    approved_models := {"llama3.2", "codellama", "mistral"}
    not approved_models[input.model]
}

# Block local model usage for external-facing agents
deny if {
    input.mcp_server == "ollama-proxy"
    input.agent_role == "external"
}
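For a quick sanity check of the deny logic before deploying to OPA, the two rules can be mirrored in Python. This is illustrative only; the Rego policy is what the policy engine actually evaluates:

```python
# Mirror of the two Rego deny rules for ollama-proxy events.
APPROVED_MODELS = {"llama3.2", "codellama", "mistral"}

def is_denied(event: dict) -> bool:
    """Return True when either deny rule would fire for this event."""
    if event.get("mcp_server") != "ollama-proxy":
        return False
    if event.get("model") not in APPROVED_MODELS:
        return True  # model not in the approved set
    if event.get("agent_role") == "external":
        return True  # external-facing agents may not use local models
    return False
```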

Troubleshooting

401 from Behavry

JWT expired or missing. Re-fetch using Step 1.

502 "Cannot reach Ollama"

Ollama is not running or is on a different host/port. Verify with:

curl http://localhost:11434/api/tags

If that works but the proxy still fails, check BEHAVRY_OLLAMA_URL.

Model not found

Ollama returns 404 if the requested model is not pulled locally. Pull it first:

ollama pull llama3.2

Slow responses

Local inference speed depends on available hardware. The proxy allows up to 300 seconds per request. If responses are consistently slow, consider quantized models (e.g., llama3.2:q4_0) or offloading to a GPU-equipped machine and setting BEHAVRY_OLLAMA_URL accordingly.