Is an inference gateway the same as an LLM proxy?

An LLM proxy is the smallest unit of an inference gateway — it intercepts requests and forwards them. An enterprise inference gateway is the full system around the proxy: capture archive, policy, egress, telemetry, replay, and audit. Every gateway is built on a proxy; not every proxy is a gateway.

How is an inference gateway different from API gateways like Kong or Apigee?

The shape is similar — a single network chokepoint where governance and telemetry happen — but the payload is different. Traditional API gateways operate on REST or GraphQL traffic where the payload is structured. An inference gateway operates on prompt/response traffic where the payload is unstructured natural language and the value is in capturing the full context. Some teams build inference gateways on top of Envoy or Kong; the API gateway is the routing primitive, and the AI-specific capture and policy logic sits on top of it.

Do we need an inference gateway if we only use one model provider?

Yes, and the reasons are independent of provider count. The gateway exists to give the platform team visibility, policy, and audit over AI traffic — not to abstract over multiple providers. Even with a single provider, the gateway answers 'which team spent what,' 'did this prompt include data we don't allow upstream,' and 'what did the agent do during that incident.' Multi-provider routing is a feature gateways can add; it is not the reason they exist.

Can we just deploy this in shadow mode first?

That's the recommended phase 1 pattern. Shadow mode means the gateway captures and observes but doesn't enforce policy or block traffic. The shadow-to-enforced rollout is a much easier conversation than a flag-day cutover.
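As a rough illustration of the difference between shadow and enforced operation, the sketch below evaluates one request against a per-team model allow-list. The team names, model names, and the `evaluate` function are all hypothetical; the only point is that the same rule produces a recorded violation in shadow mode and a blocked request in enforce mode.

```python
from dataclasses import dataclass

# Hypothetical per-team allow-list; team and model names are placeholders.
ALLOWED_MODELS = {
    "payments": {"model-a"},
    "search": {"model-a", "model-b"},
}


@dataclass
class Decision:
    allowed: bool    # whether the request is forwarded upstream
    violation: bool  # whether a policy rule was broken
    reason: str


def evaluate(team: str, model: str, mode: str = "shadow") -> Decision:
    """Check one request against the allow-list.

    In "shadow" mode a violation is recorded but the request still goes through.
    In "enforce" mode the same violation blocks the request.
    """
    if mode not in ("shadow", "enforce"):
        raise ValueError(f"unknown mode: {mode}")
    violation = model not in ALLOWED_MODELS.get(team, set())
    if not violation:
        return Decision(allowed=True, violation=False, reason="ok")
    reason = f"{model} is not on the allow-list for team {team}"
    return Decision(allowed=(mode == "shadow"), violation=True, reason=reason)


shadow = evaluate("payments", "model-b", mode="shadow")
enforced = evaluate("payments", "model-b", mode="enforce")
```

Because the violation is recorded in both modes, a shadow period produces the list of requests that would have been blocked, which is the evidence a team needs before switching to enforcement.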

What does retention policy look like for a gateway archive?

Retention is the one part of the gateway that is genuinely organization-specific. The primitive can capture full request/response records; retention, redaction, and purge behavior should be configured according to the organization's legal and security requirements. Common patterns: full retention for 90 days, redaction of payload content past 90 days, retention of metadata (timestamps, models, token counts) for two to seven years for cost and audit purposes. The gateway needs to support all three modes; the policy itself is yours.
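The three modes above can be sketched as a single function applied to each archived record. This is a minimal sketch under stated assumptions: the record fields (`captured_at`, `prompt`, `response`) are hypothetical, and the two-year metadata window is just the low end of the range mentioned above, not a recommendation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FULL_RETENTION = timedelta(days=90)            # full payload kept for 90 days
METADATA_RETENTION = timedelta(days=365 * 2)   # assumption: low end of the 2-7 year range


def apply_retention(record: dict, now: datetime) -> Optional[dict]:
    """Return the record as it should be stored at time `now`.

    - within 90 days: unchanged
    - past 90 days, within the metadata window: payload content redacted
    - past the metadata window: None, meaning the record should be purged
    """
    age = now - record["captured_at"]
    if age <= FULL_RETENTION:
        return record
    if age <= METADATA_RETENTION:
        redacted = dict(record)       # copy so the caller's record is untouched
        redacted["prompt"] = None
        redacted["response"] = None
        redacted["redacted"] = True
        return redacted
    return None


now = datetime(2026, 1, 1, tzinfo=timezone.utc)


def make_record(age_days: int) -> dict:
    # Hypothetical record shape used only for this example.
    return {
        "captured_at": now - timedelta(days=age_days),
        "model": "model-a",
        "prompt_tokens": 1200,
        "prompt": "example prompt",
        "response": "example response",
    }


fresh = apply_retention(make_record(10), now)
aged = apply_retention(make_record(200), now)
expired = apply_retention(make_record(1000), now)
```

Keeping the thresholds as named constants is one way to make the organization-specific part of the policy a configuration change rather than a code change.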

Enterprise Inference Gateway: What It Is, Why You Need One, How to Build (or Buy) One

An enterprise inference gateway is a centralized control point for AI inference traffic. In its strongest form, AI tools, agents, IDE extensions, internal apps, and model SDKs route requests through the gateway, where traffic can be logged, governed, routed, redacted, measured, replayed, and audited.

It plays a role similar to an API gateway, an egress proxy, or an SSO layer: not because AI traffic is the same as traditional API traffic, but because enterprises eventually need one place where visibility, policy, cost accounting, and incident response can happen. Without a gateway or equivalent control plane, AI usage often fragments into personal API keys, vendor dashboards, unmanaged browser sessions, and team-specific scripts. With one, platform teams get a durable record of what was sent, where it went, who sent it, what it cost, and what happened next.

This page is the reference: what an inference gateway is, why larger and AI-heavy enterprises end up building one, what such a system has to do, and the honest version of build-vs-buy for 2026.

What an enterprise inference gateway owns

Capture	Each request, response, and tool call from gateway-connected tools — written to a durable archive before forwarding.
Policy	Per-team / per-model / per-data-class rules enforced before the prompt leaves the box.
Egress	Allow-listed destinations, logged egress, redaction of sensitive payloads.
Telemetry	Cost per team, per model, per project. Latency and token throughput in one place.
Replay & search	A queryable archive of every session so the platform team can answer "what did the agent do?"
Audit	Audit-ready records, with tamper-evidence, retention controls, and access logging added as part of the enterprise deployment.

Why enterprises build inference gateways

Most companies discover the need for an inference gateway the same way they discovered the need for SSO, an API gateway, and a service mesh: by the time anyone names the missing piece, three teams have already built something that almost works. The trigger is usually governance, not curiosity. Finance wants to know what the company spent on AI last quarter, broken down by team. InfoSec wants to know which prompts left the network and which tools were on the receiving end. Legal wants to know whether confidential customer data was sent outside the company’s approved trust boundary, under which vendor terms, retention settings, and data-processing controls. A single IDE plugin or vendor dashboard may answer part of the question, but it rarely gives platform teams a complete cross-tool, cross-provider, cross-team record.

A team of 50 engineers, each doing 5 hours of AI work a day at roughly one session per hour, could generate 250 sessions a day across 5+ different tools and 3+ models. Without a gateway, those sessions are invisible to the platform team.

The historical analogy is familiar. In the early 2010s, every web app handled authentication on its own until SSO made it cheaper to centralize than to keep building per-app login. Containers became orchestration platforms. Microservices became service meshes. Every stack repeats the same arc: a new primitive shows up; teams adopt it ad hoc; governance pressure forces consolidation; a platform layer is born.

AI tools are running that same arc on a faster clock. The difference is that the side door — an engineer pasting code into ChatGPT, an agent calling api.anthropic.com from a laptop — is open at every workstation, not at a server in a controlled environment. The data leaving the box is more sensitive than typical egress traffic, and the volume is high enough that no human can review it after the fact. The gateway is one of the few places a platform team can intervene consistently without sitting inside every engineer’s workflow.

The failure modes when you don’t have one are predictable. A senior engineer pastes a snippet from a private repo into a public model and the company learns about it from a security report two months later. Finance asks for AI spend by team and the only answer is the corporate card statement. An incident response team needs to know what a runaway agent did at 3 AM and finds a chat tab that has already been closed. Each of these is solvable with capture at the network layer; none of them are solvable from inside a single AI tool.

Six capabilities every inference gateway must have

“Inference gateway” is both an emerging product category and a set of responsibilities. The product names vary — AI gateway, LLM gateway, inference gateway, model gateway — but the responsibilities are converging. Different organizations realize different pieces first, but a complete gateway covers six functions: capture, policy, egress, telemetry, replay, and audit. The order matters. Visibility is the precondition for meaningful governance. Full payload capture enables replay and audit, while lighter metadata capture may be enough for routing, budgets, and some policy enforcement.

The six responsibilities of an enterprise inference gateway
Capability	What it answers
Capture	"What prompt was sent, by which tool, on whose behalf, with what context?"
Policy	"Is this team / model / data class allowed to make this request right now?"
Egress	"Which destinations did traffic actually go to, and what was redacted before it left?"
Telemetry	"What did this team / model / project cost this week, and how does that compare to last week?"
Replay & search	"What did the agent actually do during that session, and can we find others like it?"
Audit	"Can we prove these records have not been altered, and produce them in an investigation?"

Figure: Enterprise inference gateway architecture. AI tools (Claude Code, chat apps, internal bots, agent SDKs) route through a central gateway for capture, policy, and routing to multiple LLM providers (Anthropic, OpenAI, Bedrock, Vertex AI, self-hosted models).

The interesting consequence of this list is that most teams underestimate it. The first version is almost always a one-line proxy that captures requests and forwards them. That handles capture in the simplest case. It does not handle policy (no per-team rules), egress (no destination allow-list), telemetry beyond a flat log, replay (no structured store), or audit (no tamper-evidence). Each of those becomes a project in its own right, which is why the second-system version of an enterprise inference gateway is so common.
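To make "tamper-evidence" concrete, here is a minimal sketch of one common technique, a hash chain, in which each archive entry stores a SHA-256 digest covering its own content and the previous entry's digest. The source does not say which technique a given gateway uses; the record fields and function names here are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the same record
    # always serializes to the same bytes.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list, record: dict) -> None:
    """Append a record, linking it to the digest of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"record": record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)}
    )


def verify(chain: list) -> bool:
    """Recompute every digest; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append(chain, {"team": "payments", "model": "model-a", "prompt": "first"})
append(chain, {"team": "search", "model": "model-b", "prompt": "second"})
intact = verify(chain)
```

A chain like this only detects edits if the latest digest is also stored somewhere the archive's writers cannot modify; that anchoring step is outside this sketch.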

A properly scoped gateway treats each function as a layer that composes. Capture is foundational. Policy and egress sit on top of capture as enforcement. Telemetry, replay, and audit are read-side consumers of the capture archive. If the layers stay clean, replacing any one of them — swapping a policy engine, changing the storage backend, adding a new egress destination — is a contained change.

For a deeper look at what the capture layer specifically records and the replay surface, see the companion concept pages on AI session capture and agent session replay (coming soon).

How to decide whether to build or buy an inference gateway

The build-vs-buy decision for an inference gateway is a real decision in 2026, not a foregone conclusion. The category is young enough that several plausible paths exist, and which one fits depends on the company’s existing platform investments, the size of the engineering org, and how much governance pressure has already arrived. The honest framing is that many companies will go through both phases: a v1 hack to prove the value, and a v2 dedicated gateway to handle the responsibilities the v1 cannot. Knowing that up front lets you decide how much of the second-system work to skip.

The v1 hack is well understood. A platform engineer sets up an HTTP proxy on a single laptop or a shared bastion, points one team’s tools at it via environment variables, and starts logging requests to a flat file or a single SQLite database. This works, and it works fast. Within a week, the team can answer questions like “how many requests went to which provider” and “which engineers used which models.” The v1 is enough to convince finance and security that the approach is real.
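The core of the v1 hack can be sketched in a few lines: write the request to SQLite, forward it, then record the response. This is a sketch under assumptions, not a working network proxy: the table layout, the `capture_and_forward` function, and the `fake_upstream` stand-in (used here in place of a real HTTPS call to a provider) are all hypothetical.

```python
import json
import sqlite3
import time
from typing import Callable


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the single SQLite database the v1 logs into."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS requests (
               id INTEGER PRIMARY KEY,
               ts REAL, engineer TEXT, provider TEXT, model TEXT,
               request_body TEXT, response_body TEXT)"""
    )
    return db


def capture_and_forward(
    db: sqlite3.Connection,
    engineer: str,
    provider: str,
    body: dict,
    forward: Callable[[dict], dict],
) -> dict:
    # Write the request row first so it survives even if the upstream call fails.
    cur = db.execute(
        "INSERT INTO requests (ts, engineer, provider, model, request_body) "
        "VALUES (?, ?, ?, ?, ?)",
        (time.time(), engineer, provider, body.get("model"), json.dumps(body)),
    )
    db.commit()
    response = forward(body)
    db.execute(
        "UPDATE requests SET response_body = ? WHERE id = ?",
        (json.dumps(response), cur.lastrowid),
    )
    db.commit()
    return response


def fake_upstream(body: dict) -> dict:
    # Stand-in for the real provider call; returns a canned response.
    return {"model": body["model"], "output": "ok"}


db = open_archive()
capture_and_forward(db, "alice", "provider-x", {"model": "model-a", "prompt": "hi"}, fake_upstream)
capture_and_forward(db, "bob", "provider-x", {"model": "model-b", "prompt": "hi"}, fake_upstream)
capture_and_forward(db, "alice", "provider-y", {"model": "model-c", "prompt": "hi"}, fake_upstream)

# "How many requests went to which provider"
by_provider = db.execute(
    "SELECT provider, COUNT(*) FROM requests GROUP BY provider ORDER BY provider"
).fetchall()

# "Which engineers used which models"
engineer_models = db.execute(
    "SELECT DISTINCT engineer, model FROM requests ORDER BY engineer, model"
).fetchall()
```

The two queries at the end correspond to the questions the v1 is expected to answer within its first week.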

The v1 also collapses predictably. The flat log isn’t queryable past a certain volume. There’s no per-team policy. There’s no replay surface — only raw logs. There’s no tamper-evidence, which means audit doesn’t trust the records. There’s no UI for non-engineers, which means finance and security still depend on the platform team for every report. By the time the v1 has been running for three months and a second team wants to onboard, the v2 conversation starts.

In a typical proxy deployment, the ratio of prompt tokens to completion tokens can exceed 100:1. That token shape, heavily skewed toward input, is invisible to any per-tool dashboard but immediately visible from the gateway archive.
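Computing that ratio from an archive is simple arithmetic once token counts are captured per request. The rows below are made-up illustrative numbers, not measurements, and the field names are hypothetical.

```python
from collections import defaultdict

# Illustrative rows only; a real archive would supply these from captured requests.
records = [
    {"tool": "ide-agent", "prompt_tokens": 120_000, "completion_tokens": 900},
    {"tool": "ide-agent", "prompt_tokens": 95_000, "completion_tokens": 1_100},
    {"tool": "chat-app", "prompt_tokens": 2_000, "completion_tokens": 500},
]


def prompt_to_completion_ratio(rows) -> float:
    """Total prompt tokens divided by total completion tokens."""
    prompt = sum(r["prompt_tokens"] for r in rows)
    completion = sum(r["completion_tokens"] for r in rows)
    if completion == 0:
        return float("inf")  # no output tokens at all
    return prompt / completion


def ratio_by_tool(rows) -> dict:
    """Group rows by tool and compute the ratio for each group."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["tool"]].append(r)
    return {tool: prompt_to_completion_ratio(group) for tool, group in grouped.items()}


overall = prompt_to_completion_ratio(records)
per_tool = ratio_by_tool(records)
```

Grouping by tool shows why a single per-tool dashboard can miss the pattern: the skew only stands out when the tools are compared side by side.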

The v2 conversation is about which pieces to build, which to buy, and which to glue together from open source. Three patterns are common in 2026:

Build the whole thing. Some teams have the headcount and the existing platform tooling to do it: a custom proxy on top of Envoy or Kong, a custom storage layer, a custom policy engine, a custom UI. This works. It is also a multi-quarter project, and it ages — every new model provider, every new framework, every new agent SDK is a maintenance task.
Buy a turnkey commercial gateway. A few vendors are marketing exactly this. The trade-off is the usual one: faster to deploy, harder to extend, and the records live on someone else’s infrastructure. For regulated industries this is often a non-starter on egress alone.
Adopt an open-source proxy primitive and build the platform layer on top. This path is increasingly common. The basic proxy pattern is well understood; the hard part is making it reliable, useful, privacy-aware, and integrated enough for enterprise operations. The team’s engineering effort goes into the parts that are organization-specific: policy, cost allocation, the UI for finance and security, and the integrations with the rest of the platform stack.

The third path is what open-source proxy primitives are built for. The next section covers what that primitive needs to do — and where it stops.

What a proxy primitive does and does not cover

An open-source proxy-and-capture primitive is one foundation for an enterprise inference gateway. It is a single process that runs on a laptop or a server, intercepts AI-API calls between the tool and the provider, writes a structured row for each request, and forwards the request upstream. There is no SDK to adopt. The tool thinks it’s talking to the provider; it’s actually talking to a local listener that records the request first.

What the proxy primitive covers:

A network-layer proxy that captures the full prompt, response, model, tool calls, and token counts for each request that flows through it.
A structured archive — flat-file JSON exports or a local database. Both formats use the same schema, so the same queries run against either.
An open-source foundation. The capture layer is meant to be what a platform team builds on, not the whole platform.

What the proxy primitive does not cover:

A turnkey enterprise governance product. Per-team policy enforcement, cost-allocation reporting, and finance-facing UI are platform-team responsibilities, layered on top of the capture archive.
A full audit-and-compliance system on its own. The archive is structured for audit, but the workflow integrations (ticketing, retention policies, redaction rules specific to your data classes) are work the platform team owns.
A replacement for InfoSec, FinOps, or compliance teams. The gateway gives those teams the data they need to do their jobs; it does not do their jobs for them.

This scope is intentional. The gateway primitive is small enough to deploy on one laptop in an afternoon and load-bearing enough to support a company-wide rollout, but it stops at the boundary where every organization’s policy diverges.

For a concept-level definition of the primitive separately from this pillar, see LLM proxy.

How to roll out an enterprise inference gateway in three phases

The path from “no gateway” to “company-wide gateway” runs through three predictable phases. Each phase has a concrete deliverable, and each phase is short enough to ship inside a quarter. The phases compose: phase 2 is built on phase 1’s data, and phase 3 is built on phase 2’s tooling. Trying to skip phases is the most common cause of stalled gateway projects.

Three phases of an enterprise gateway rollout

Phase 1 — Single-laptop pilot
├── Install a capture proxy on one platform engineer's laptop
├── Point one AI tool at the proxy
├── Let it run for one week
└── Export the captured sessions to JSON or a local database

Phase 2 — Team rollout
├── Deploy the proxy on every laptop in one volunteer team
├── Centralize exports nightly to a shared store
├── Build the first dashboard (cost per team, per model)
└── Add the first policy (allowed-models list)

Phase 3 — Company-wide
├── Make the gateway the default network path for AI traffic
├── Add per-team policy, egress allow-list, redaction rules
├── Wire telemetry into the existing FinOps and observability stack
└── Establish retention and audit workflow with InfoSec / Legal

Phase 1 — Single-laptop pilot. One platform engineer installs a capture proxy on their own laptop and runs it for a week. The deliverable is a session export that demonstrates what the gateway captures: how many requests, against which models, with what cost shape. The point of phase 1 is to put real numbers in front of finance and security inside two weeks. One laptop, one week, one export file is enough to start the conversation.

Phase 2 — Team rollout. Once one engineer’s data is convincing, deploy the proxy on every laptop in one volunteer team. The deliverable is the first dashboard a non-engineer can read: cost per team, per model, per project. This is also the phase where the first policy gets written — usually an allowed-models list — and the first cost-allocation report goes to finance. The team is small enough to handle exceptions manually and large enough to surface real cross-tool patterns.
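The first dashboard is mostly one aggregation over the centralized exports. The sketch below assumes each exported row carries a team, a model, and token counts; the field names and the per-million-token prices are hypothetical placeholders, not real provider pricing.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens: (input, output).
PRICE_PER_MILLION = {
    "model-a": (3.00, 15.00),
    "model-b": (0.50, 1.50),
}


def cost_usd(row: dict) -> float:
    """Cost of a single request from its token counts."""
    input_price, output_price = PRICE_PER_MILLION[row["model"]]
    return (
        row["prompt_tokens"] * input_price
        + row["completion_tokens"] * output_price
    ) / 1_000_000


def cost_by_team_and_model(rows) -> dict:
    """Sum request costs into (team, model) buckets."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["team"], row["model"])] += cost_usd(row)
    return dict(totals)


rows = [
    {"team": "payments", "model": "model-a", "prompt_tokens": 1_000_000, "completion_tokens": 100_000},
    {"team": "payments", "model": "model-a", "prompt_tokens": 1_000_000, "completion_tokens": 100_000},
    {"team": "search", "model": "model-b", "prompt_tokens": 2_000_000, "completion_tokens": 1_000_000},
]

report = cost_by_team_and_model(rows)
```

Adding a project dimension would be a matter of extending the grouping key, assuming the capture records carry a project identifier.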

Phase 3 — Company-wide. The gateway becomes the default network path for AI traffic. Per-team policy is in place. Egress is allow-listed. Redaction rules cover the data classes legal cares about. Telemetry is wired into the existing FinOps and observability stack. Retention and audit workflows are agreed with InfoSec and legal. At this point, the platform team operates the gateway the way they operate any other shared platform — with a runbook, an on-call rotation, and a quarterly capacity review.
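The egress allow-list and redaction rules from phase 3 can be sketched as two small checks applied before a request leaves the gateway. The hosts and patterns below are examples only; real redaction rules depend on the data classes an organization's legal team defines, and simple regular expressions like these will miss many cases.

```python
import re
from urllib.parse import urlsplit

# Example allow-list; a real deployment would load this from configuration.
ALLOWED_HOSTS = {"api.anthropic.com", "api.openai.com"}

# Example redaction rules as (pattern, replacement label) pairs.
REDACTION_RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def check_egress(url: str) -> bool:
    """Allow the request only if the destination host is on the allow-list."""
    host = urlsplit(url).hostname
    return host in ALLOWED_HOSTS


def redact(text: str) -> str:
    """Replace each match of each rule with its label before forwarding."""
    for pattern, label in REDACTION_RULES:
        text = pattern.sub(label, text)
    return text


allowed = check_egress("https://api.anthropic.com/v1/messages")
blocked = check_egress("https://api.anthropic.com.evil.example/v1/messages")
clean = redact("Contact jane.doe@example.com, SSN 123-45-6789")
```

Matching on the exact parsed hostname, rather than on a substring of the URL, is what keeps a lookalike domain such as the second example from passing the check.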

Cross-tool capture is the property that makes this rollout cheap. Any AI tool that can be configured to use the gateway’s base URL, proxy settings, or provider-compatible endpoint can be captured without a custom SDK integration.

The gateway also enables the downstream layers that everyone wants to build but can’t without capture in place: skills, runbooks, evals, shared knowledge. All of these depend on a capture archive existing first. The gateway is the substrate; everything else is a downstream consumer.

For the team that runs the gateway day-to-day — the platform engineering function whose mandate this falls under — see the companion pillar on AI platform engineering. The two pillars are designed to be read together: this one is the artifact, that one is the team.

How Paper Compute approaches the inference gateway

paper console provisions and manages tapes AI gateways — shared inference endpoints with durable session capture built in. Each gateway runs an Envoy AI Gateway instance backed by a tapes store, so every prompt, response, and tool call is captured automatically. Backends scope which models are reachable and which provider credentials to use, giving teams model allowlisting and flexible auth at the edge. stereOS covers the narrow case where the gateway needs a hardened runtime for sensitive workloads.

In the context of this page, paper is one implementation path for the inference gateway primitive — specifically the capture, policy, and egress layers. The pillar this page describes is the category; Paper is one way to build it.

paper is currently in development. Sign up for the waitlist to get early access.
