Does an LLM proxy slow down inference?

Not meaningfully. The dominant latency in any inference call is the model's generation time, which a proxy does not change. A local proxy adds a localhost-to-localhost hop and the time to write a record. A cloud proxy adds a network hop to the gateway, but the gateway forwards to the provider from a well-connected data center, so the total round trip is often comparable to a direct call. Streaming responses pass through chunk-for-chunk in both deployment models, so user-visible behavior should be practically indistinguishable from a direct connection.

Can an LLM proxy work with self-hosted models?

Yes. The proxy is wire-format-aware, not provider-aware. A self-hosted Ollama or vLLM server speaking an OpenAI-compatible endpoint is captured the same way as the OpenAI cloud endpoint. The destination URL is configurable; the capture logic is unchanged.
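
As a rough sketch of what "the destination URL is configurable" can look like, the snippet below joins a configured base URL with the path the tool requested. The `UPSTREAMS` table, the `upstream_url` helper, and the localhost port are illustrative assumptions, not part of any specific proxy.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical upstream table: names and URLs are illustrative assumptions.
UPSTREAMS = {
    "openai": "https://api.openai.com",
    "local-vllm": "http://localhost:8000",
}

def upstream_url(upstream: str, request_path: str) -> str:
    """Join the configured base URL with the path the tool requested."""
    base = urlsplit(UPSTREAMS[upstream])
    return urlunsplit((base.scheme, base.netloc, request_path, "", ""))

# The request path is identical; only the destination changes.
print(upstream_url("openai", "/v1/chat/completions"))
print(upstream_url("local-vllm", "/v1/chat/completions"))
```

Because the path and body follow the same wire format in both cases, the capture code never needs to know which upstream was chosen.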

How does an LLM proxy handle TLS?

How the proxy handles TLS depends on deployment. In a local setup, the tool may talk to a localhost endpoint and the proxy opens the TLS connection upstream. In managed or network-level deployments, the proxy may terminate TLS and establish a separate TLS session to the provider. The important point is that full request-body capture requires the proxy to see plaintext at some point in the request path; a pure pass-through TCP proxy can only log metadata.

Is a proxy required for every machine?

Not necessarily. With a cloud proxy, a single managed gateway can serve the whole team — each developer points their tools at the gateway URL and the proxy captures centrally. With a local proxy, each machine runs its own instance. Many teams combine both: a local daemon authenticates to a cloud gateway, giving the developer a localhost endpoint while the org gets a central archive. The right model depends on the coverage and policy requirements.

How does a proxy differ from a gateway?

In platform-engineering vocabulary, an LLM proxy is the primitive and an enterprise inference gateway is the deployment: the same proxy primitive plus policy, multi-tenancy, a central archive, and the operational shape required to run it for an org. A proxy on a laptop and the same code clustered with central capture at platform scale are two points on the same spectrum.

LLM Proxy: Network-Level AI Request Capture and Policy Enforcement

An LLM proxy is a network process that sits between an AI tool and a model provider. Instead of sending requests directly to Anthropic, OpenAI, Google, or a self-hosted endpoint, the tool routes those requests through the proxy first. A basic proxy forwards traffic. A capture-oriented LLM proxy also records requests and responses, preserves streaming behavior, and can optionally apply policy before forwarding the call to the destination.

What an LLM proxy does and why it matters

What an LLM proxy does on each request
Intercept	Receive the request the tool sent, in the provider wire format.
Record	Write prompt metadata, model, timestamp, and configured request fields to a durable archive.
Policy	Apply allowed-models, redaction, rate-limit, routing, or cache-safety rules.
Forward	Send the request to the real provider — preserving streaming.
Record response	Capture configured response fields, reassemble streams when needed, and return output to the tool.
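
The five steps above can be sketched as a single function. This is a minimal, non-streaming illustration under stated assumptions: the `handle_request` name, the `ALLOWED_MODELS` allowlist, the record fields, and the `fake_provider` stand-in are all hypothetical, and a real proxy would forward over HTTP rather than call a Python function.

```python
import json
import time
from typing import Callable

ALLOWED_MODELS = {"model-a", "model-b"}  # hypothetical allowlist

class PolicyError(Exception):
    """Raised by the proxy itself, never by the provider."""

def handle_request(body: bytes, forward: Callable[[bytes], bytes], archive: list) -> bytes:
    # Intercept: parse the provider-format JSON body the tool sent.
    request = json.loads(body)
    # Record: keep the model, a timestamp, and the request fields.
    record = {"ts": time.time(), "model": request.get("model"), "request": request}
    archive.append(record)
    # Policy: block models outside the allowlist before anything is forwarded.
    if request.get("model") not in ALLOWED_MODELS:
        record["blocked"] = True
        raise PolicyError(f"model not allowed: {request.get('model')}")
    # Forward: send the original bytes upstream unchanged.
    response = forward(body)
    # Record response: attach the provider's reply to the same record.
    record["response"] = json.loads(response)
    return response

def fake_provider(body: bytes) -> bytes:
    """Stand-in for the real provider, used only for this example."""
    return json.dumps({"content": "ok"}).encode()

archive = []
reply = handle_request(
    json.dumps({"model": "model-a", "messages": []}).encode(), fake_provider, archive
)
```

Recording before the policy check means blocked requests still leave a trace in the archive, which is one possible design choice rather than a requirement.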

Why an LLM proxy works better than per-tool SDK integration

The first question every team asks: why a proxy, when an SDK could record the same data? Two reasons.

The first is coverage. SDK integration is per-tool. A proxy integration is per-network. Five AI tools in the org means five SDK integrations or one proxy. The next tool a team adopts means another SDK integration or, with a proxy, only pointing that tool at the existing endpoint.

The second is independence. SDKs ride along with the tool. When the tool updates, the SDK can break. When the tool’s vendor changes the wire format, the SDK lags. A proxy speaks the provider’s wire format directly, so it depends on the same network surface the tool already uses. That surface is usually more stable than a per-tool SDK integration, though provider API changes can still require proxy updates.

A single captured proxy session can contain hundreds of messages across multiple model tiers. Capturing model switches and fallback events requires no SDK; the proxy sees the model field on every routed request.
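
To illustrate, here is a small sketch that finds model switches in a captured session by comparing the model field of consecutive records. The record shape and the `model_switches` helper are assumptions made for the example.

```python
def model_switches(records: list[dict]) -> list[tuple[int, str, str]]:
    """Return (index, previous_model, new_model) for each change in the model field."""
    switches = []
    for i in range(1, len(records)):
        prev, cur = records[i - 1]["model"], records[i]["model"]
        if prev != cur:
            switches.append((i, prev, cur))
    return switches

# Hypothetical session: four routed requests across two model tiers.
session = [{"model": "large"}, {"model": "large"}, {"model": "small"}, {"model": "large"}]
print(model_switches(session))  # [(2, 'large', 'small'), (3, 'small', 'large')]
```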

How an LLM proxy intercepts and records inference traffic

The mechanics depend on the tool, but only a few patterns recur.

Three ways a tool ends up talking to the proxy
Mechanism	How it works	Where it's used
Environment variable	Tool reads ANTHROPIC_BASE_URL or an equivalent base-URL variable; the proxy sets it to its localhost address on install	Any tool that reads a base URL env var (e.g., Claude Code, Cursor, custom agents)
Config file	Tool reads a config that points at the proxy	IDE extensions, internal services
HTTP CONNECT	OS-level proxy setting that the HTTP client honors	Browsers, generic clients

When the proxy preserves streaming, error behavior, and provider-compatible request formats, the user-visible behavior is unchanged. The tool sends a request, gets a response, and streams output if it’s a streaming endpoint. The only difference is that the proxy was in the middle.
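
The environment-variable mechanism can be sketched in a few lines. The variable name defaults to `ANTHROPIC_BASE_URL` on the assumption that the tool reads it; substitute whatever base-URL variable your tool actually honors. The proxy address and the `proxied_env` helper are hypothetical.

```python
import os

def proxied_env(proxy_url: str, var_name: str = "ANTHROPIC_BASE_URL") -> dict[str, str]:
    """Copy the current environment and point one base-URL variable at the proxy."""
    env = dict(os.environ)
    env[var_name] = proxy_url
    return env

# Assumed local proxy address, for illustration only.
env = proxied_env("http://127.0.0.1:8080")
# Pass env to subprocess.run([...], env=env) to launch the tool through the proxy.
```

Copying the environment rather than mutating `os.environ` keeps the change scoped to the one child process being launched.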

Figure: LLM proxy architecture. AI tools send traffic through a proxy that captures requests, applies policy, and forwards to model providers (Anthropic, OpenAI, Bedrock, Vertex AI, self-hosted models).

What data an LLM proxy must preserve to stay invisible

A correctly written LLM proxy preserves three properties that make it invisible to the user:

Streaming. Forward chunks as they arrive. User-visible latency should remain dominated by provider latency, with only a small amount of proxy overhead when the proxy is local and writing efficiently.
Wire fidelity. By default, the provider receives the same request the tool sent. When policy transformations are configured — redaction, routing, header changes — those changes should be explicit and traceable.
Failure passthrough. By default, provider errors should pass through in the shape the tool expects. If the proxy introduces its own errors — policy blocks, rate limits, upstream failures — those should be explicit and easy to distinguish from provider failures.

Lose any of those and the proxy starts breaking the tools that depend on it. The discipline of a proxy is “be neutral by default, opinionated only when configured to be.”
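
The streaming property in particular can be sketched as a generator that forwards each chunk the moment it arrives while keeping a copy for the archive. This assumes the upstream speaks server-sent events with `data:` lines; the `tee_stream` and `sse_data_lines` helpers and the sample payloads are illustrative, not taken from any provider.

```python
from typing import Iterable, Iterator

def tee_stream(chunks: Iterable[bytes], sink: list) -> Iterator[bytes]:
    """Yield each chunk as soon as it arrives, keeping a copy for the archive."""
    for chunk in chunks:
        sink.append(chunk)  # copy for later reassembly
        yield chunk         # forwarded untouched, before the stream finishes

def sse_data_lines(raw: bytes) -> list[str]:
    """Extract the payload of every `data:` line from a reassembled SSE body."""
    lines = raw.decode("utf-8").splitlines()
    return [line[5:].strip() for line in lines if line.startswith("data:")]

# Hypothetical upstream stream split into two chunks.
upstream = [b'data: {"text": "Hel"}\n\n', b'data: {"text": "lo"}\n\n']
captured = []
delivered = list(tee_stream(upstream, captured))
```

The tool receives exactly the bytes the provider sent, and reassembly happens afterwards on the captured copy, so recording never delays delivery.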

What an LLM proxy can and cannot see

An LLM proxy only captures traffic routed through it. It can see provider API calls from configured tools, local agents, services, or build jobs. It cannot see unmanaged browser sessions, personal accounts, tools that ignore proxy settings, or traffic sent directly to a provider endpoint outside the configured path. This is why enterprise deployments usually pair the proxy with policy, device management, egress controls, and approved-tool rollout.

What capabilities an LLM proxy enables for platform teams

The proxy is rarely interesting on its own. Its value is the substrate it produces.

Specific things you can build on a proxy archive:

Session capture — each routed call as a queryable record.
Session replay — step through a past session in the original sequence.
Tool governance — enforce allowed models, redactions, egress rules.
Telemetry — cross-tool metrics from one source.
Skills — extracted from recurring patterns in the archive.
Cost attribution by team, by tool, by model.
Cache analysis — identify repeated prompt shapes and decide where caching is safe.

Each of those is a feature; the proxy is the primitive.
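
Two of these, cost attribution and cache analysis, reduce to simple aggregations over the archive. The sketch below uses token counts as a stand-in for cost and exact-match hashing as a stand-in for "repeated prompt shapes"; the record fields and helper names are assumptions for illustration.

```python
import hashlib
from collections import Counter, defaultdict

# Hypothetical archive records; field names are assumptions for illustration.
records = [
    {"team": "search", "tool": "agent", "model": "large", "tokens": 1200, "prompt": "Summarize: A"},
    {"team": "search", "tool": "agent", "model": "small", "tokens": 300, "prompt": "Summarize: A"},
    {"team": "infra", "tool": "ide", "model": "large", "tokens": 500, "prompt": "Explain: B"},
]

def tokens_by(records: list[dict], key: str) -> dict[str, int]:
    """Sum token usage grouped by one record field (team, tool, or model)."""
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += r["tokens"]
    return dict(totals)

def repeated_prompts(records: list[dict]) -> dict[str, int]:
    """Count identical prompts by hash, keeping only those seen more than once."""
    counts = Counter(hashlib.sha256(r["prompt"].encode()).hexdigest() for r in records)
    return {h: n for h, n in counts.items() if n > 1}

print(tokens_by(records, "team"))   # {'search': 1500, 'infra': 500}
print(tokens_by(records, "model"))  # {'large': 1700, 'small': 300}
```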

What an LLM proxy is not

A few common confusions worth resolving.

Not the same as a traditional API gateway. The shape is similar, but the payload and purpose are different: an LLM proxy is specialized for AI inference traffic, prompt/response capture, streaming, model metadata, and provider-compatible request formats.
Not a model router by default. A proxy can route, but the simplest configuration forwards to one provider. Routing is a feature on top.
Not a model wrapper. The proxy doesn’t replace the model or modify the response content (unless explicitly configured to redact). It records and forwards.
Not a security boundary. The proxy is on the egress path; it can enforce policy but it doesn’t isolate tools from each other. Sandboxing is a separate problem (see stereOS for the sandboxing primitive).

How a capture-oriented LLM proxy works in practice

There are two deployment models for a capture-oriented proxy, and most teams end up using both.

Local proxy

A local proxy runs as a background process on the developer’s machine, listening on a localhost port. On install, it sets the relevant environment variables so the tool’s requests go to localhost first. The proxy writes each request and response to a structured archive and forwards the call to the configured upstream provider.

Request lifecycle through a local capture proxy

tool                      proxy                        provider
|                            |                             |
| POST /v1/messages          |                             |
|--------------------------->|                             |
|                            | record request              |
|                            | apply policy (if any)       |
|                            | forward request             |
|                            |---------------------------->|
|                            |                             |
|                            |    streaming response       |
|                            |<----------------------------|
|                            | record response             |
|    streamed back           |                             |
|<---------------------------|                             |
|                            |                             |

The whole loop should add only a small amount of overhead relative to the provider's generation time: one localhost hop and one archive write. The session archive grows by one record per request.
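
One simple way to get "one record per request" is an append-only JSON Lines file. This is a sketch under that assumption; the source does not specify the archive format, and the `append_record` and `read_records` helpers and the record fields are hypothetical.

```python
import json
import os
import tempfile
import time

def append_record(path: str, request: dict, response: dict) -> None:
    """Append one request/response pair as a single JSON line."""
    record = {"ts": time.time(), "model": request.get("model"),
              "request": request, "response": response}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def read_records(path: str) -> list[dict]:
    """Load every record back in the order it was written."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Temporary location so the example leaves no files in the working directory.
archive_path = os.path.join(tempfile.mkdtemp(), "session.jsonl")
append_record(archive_path, {"model": "large", "messages": []}, {"content": "ok"})
append_record(archive_path, {"model": "small", "messages": []}, {"content": "ok"})
```

Append-only writes keep the per-request cost low and preserve the original sequence, which is what replay needs.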

Cloud proxy

A cloud proxy is a managed gateway that multiple developers or services share. Instead of each machine running its own proxy and storing records locally, the team provisions a centralized gateway — typically backed by a proxy like Envoy and a durable store like Postgres — and points their tools at it.

The request lifecycle is the same: intercept, record, apply policy, forward. The difference is where the proxy runs and where the records live. A cloud proxy gives the platform team a single capture archive for the whole org, model allowlisting at the edge, and centralized auth — without requiring anything on each developer’s machine beyond a pointer to the gateway URL.
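
Model allowlisting at the edge can be pictured as a lookup from the requested model to a backend that is permitted to serve it. The `Backend` type, the backend names, and the credential variable names below are hypothetical and do not describe any particular gateway's configuration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Backend:
    """Hypothetical gateway backend: the models it serves and the credential it uses."""
    name: str
    models: frozenset
    credential_env: str  # name of the env var holding the provider key

BACKENDS = [
    Backend("provider-prod", frozenset({"model-a"}), "PROVIDER_KEY_PROD"),
    Backend("self-hosted", frozenset({"model-b", "model-c"}), "LOCAL_KEY"),
]

def select_backend(model: str) -> Optional[Backend]:
    """Return the first backend that serves the model, or None if it is not allowlisted."""
    for backend in BACKENDS:
        if model in backend.models:
            return backend
    return None
```

A request whose model matches no backend is rejected at the gateway, so provider credentials stay on the gateway and never need to live on a developer's machine.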

Most teams start with one or the other and add the second when the need arrives. A local proxy gives individual developers capture and replay immediately. A cloud proxy gives the platform team visibility, policy, and cost attribution across teams. The two compose: a local daemon can authenticate to a cloud gateway, so the developer gets a localhost endpoint while the org gets a central archive.

Where the LLM proxy fits in the Paper Compute stack

Paper Cloud provisions and manages tapes AI gateways — the cloud proxy model described above. Each gateway runs an Envoy AI Gateway instance backed by a tapes store, so every prompt, response, and tool call is captured automatically. Backends scope which models are reachable and which provider credentials to use, giving platform teams model allowlisting and flexible auth at the edge. A local daemon (paperd) authenticates to the cloud gateway and exposes a localhost proxy, so developers get a local endpoint while the org gets a central archive.

At platform scale, the same proxy-and-capture pattern becomes the foundation for an enterprise inference gateway. stereOS addresses the separate runtime-isolation problem for high-risk agents.

paper is currently in development. Sign up for the waitlist to get early access.
