Token-Based Pricing: How It Works + Implementation Guide

A complete guide to token-based pricing, including architecture, entitlements, credits, and real-time enforcement in the request path.

Sara Nelissen
  |  
Jun 11, 2026

Token-based pricing starts simply enough with a credits column, a decrement on each request, and a check in middleware. It holds until an AI agent runs autonomously on an enterprise account, fires thousands of requests against the same credit pool, and two sessions pass the same balance check before either debit settles. 

The account goes negative, support gets the call, and engineering gets the ticket.

The root cause is that the balance check and the debit ran as separate steps, so enforcement effectively happened after the fact instead of before execution. Here is how token-based pricing actually works and how to implement it without building toward that failure.

How token-based pricing works in real-time systems

Token-based pricing works by metering usage events, converting that usage into credit or cost units, enforcing limits before execution, and recording usage in a ledger for reconciliation.

The distinction that matters for implementation:

  • Metering tells you what happened
  • Enforcement decides what is allowed, in real time

Most systems get metering right and skip enforcement. Usage gets recorded, but nothing stops a request from proceeding when a balance is exhausted. This leads to overages, credit drift, and billing disputes.

Core components of a token pricing system

Component | What it does | Failure mode if missing
Metering layer | Tracks token usage events | Usage mismatch, billing drift
Credit system | Stores balances and burn logic | Incorrect depletion order
Entitlements layer | Defines limits and access rules | Inconsistent feature access
Enforcement layer | Stops or allows requests in real time | Over-consumption under load
Ledger | Records all debits and credits | No auditability

This stack has to run directly in the request path rather than after billing. When any component sits outside that path, enforcement can break under load.

Each component also needs to fail safely. If the entitlement service degrades, enforcement must not silently pass requests through. Failing open is how accounts go negative at scale.
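As a rough illustration of failing safely, the sketch below serves decisions from a local cache, refetches when the cached entry is older than a TTL, and denies the request when the state cannot be loaded. The class name, the `fetch_limit` callable, and the reason strings are all hypothetical, and denying on outage is one possible policy rather than the only valid one.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


class FailClosedChecker:
    """Checks usage against a cached limit and denies when state is unavailable."""

    def __init__(self, fetch_limit: Callable[[str], int], ttl_seconds: float):
        self._fetch_limit = fetch_limit  # hypothetical call to the entitlement service
        self._ttl = ttl_seconds
        self._cache: Dict[str, Tuple[int, float]] = {}  # account -> (limit, fetched_at)

    def check(self, account_id: str, current_usage: int, requested: int) -> Decision:
        limit = self._fresh_limit(account_id)
        if limit is None:
            # Fail closed: an outage is never treated as "allowed".
            return Decision(False, "entitlement_state_unavailable")
        if current_usage + requested > limit:
            return Decision(False, "limit_exceeded")
        return Decision(True, "ok")

    def _fresh_limit(self, account_id: str) -> Optional[int]:
        now = time.monotonic()
        cached = self._cache.get(account_id)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]
        try:
            limit = self._fetch_limit(account_id)
        except Exception:
            return None
        self._cache[account_id] = (limit, now)
        return limit
```

Some teams may prefer to keep serving a recently expired cache entry for a short grace window instead of denying outright. Either way, the behavior under failure is an explicit decision rather than an accident.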

Why token-based pricing becomes an infrastructure problem

Token-based pricing becomes an infrastructure problem as soon as usage is checked and enforced in real time across systems, where consistency, concurrency, and state accuracy directly affect access.

Most teams start with a credits table, a decrement function, and a middleware check. That works at low scale, but it begins to strain when requests overlap, plans change mid-cycle, or multiple services rely on the same entitlement state.

Where systems break first

Look out for these common breaking points:

  • Concurrency: Two requests pass the same balance check before either debit completes. Both proceed, and the account drops below zero.
  • Cache drift: A plan change updates entitlements, but cached data continues to serve the previous state until the cache TTL expires. The customer upgrades, yet still sees old limits.
  • Burn order bugs: Promotional credits remain unused while paid credits are consumed first, which leads to incorrect depletion of balances.
  • Partial failures: The debit succeeds, but the request fails. The customer is charged without receiving the service, and the ledger lacks a clear trace of the event.

These issues point to a reliability problem at the infrastructure level. The credit layer needs to behave with the same correctness guarantees as any other stateful system in the stack: atomic writes, consistent reads, and defined behavior under failure.

Token-based pricing architecture: What you actually need

Token-based pricing architecture requires real-time metering, entitlement resolution, credit tracking, and enforcement that runs in the request path.

Each component must work together synchronously to ensure usage is measured, validated, and controlled before a request is allowed to execute. 

Here are the different components and what they do:

Component | How it operates at scale | Why it matters in production
Metering layer | Ingests events at high throughput, idempotent so retries don't double-count | Makes sure usage is measured accurately and tied to the right entity
Entitlements layer | Resolves plan, add-ons, trials, and grants into one effective state; the most generous value wins | Keeps access consistent and prevents conflicting rules across services
Credit system | Tracks balances as blocks with their own expiry, cost basis, and burn priority | Prevents incorrect credit usage and supports expiry, promos, and auditability
Enforcement layer | Decides access locally against a cached entitlement state, with API fallback on cache miss | Stops over-consumption and keeps usage within defined limits in real time
Ledger | Appends every grant, debit, and expiration as immutable events, reconcilable back to source | Provides a reliable audit trail for finance and debugging
Request path execution | Resolves entitlement, credit, and enforcement in one round trip before compute is consumed | Makes sure decisions are made on current state instead of delayed or stale data

Request flow for token enforcement

  1. Request arrives (API call, agent execution, feature access)
  2. The application passes the requested token amount to the entitlement check
  3. The entitlement check runs, evaluating plan, add-ons, credits, and active overrides
  4. The credit system verifies the available balance against the requirement
  5. Enforcement decision is returned: allow, block, or allow with overage
  6. Debit happens atomically at the storage level
  7. The event is written to the ledger

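Enforcement resolves synchronously before execution. The balance verification in step 4 and the debit in step 6 need to behave as one atomic operation at the storage level. If they run as a separate read and write, two requests can both pass verification before either debit lands. The debit and the ledger write then commit together once the decision is made.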
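The flow above can be sketched in a few lines. This is a minimal, single-process illustration: the `Account` shape, the `Outcome` names, and the in-memory ledger are assumptions, and the `threading.Lock` only stands in for the atomicity a real storage layer would have to provide.

```python
import threading
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Outcome(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ALLOW_WITH_OVERAGE = "allow_with_overage"


@dataclass
class Account:
    has_access: bool
    balance: int
    overage_allowed: bool
    ledger: List[dict] = field(default_factory=list)
    lock: threading.Lock = field(default_factory=threading.Lock)


def enforce(account: Account, requested: int, request_id: str) -> Outcome:
    """Return a decision, and debit plus record only when the request may run."""
    with account.lock:  # check and debit happen as one unit
        # Entitlement check, given the requested token amount.
        if not account.has_access:
            return Outcome.BLOCK
        # Balance verification and enforcement decision.
        if account.balance >= requested:
            outcome = Outcome.ALLOW
        elif account.overage_allowed:
            outcome = Outcome.ALLOW_WITH_OVERAGE
        else:
            return Outcome.BLOCK
        # Debit, then ledger write.
        account.balance -= requested
        account.ledger.append(
            {"request_id": request_id, "amount": -requested, "outcome": outcome.value}
        )
        return outcome
```

Blocked requests leave both the balance and the ledger untouched, and a soft limit shows up as a negative balance that was explicitly allowed rather than one that slipped through.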

Entitlements

Entitlements are the rules that determine access and usage limits for a given plan. They reflect what a customer paid for, unlike RBAC, which focuses on roles and permissions within a team.

They resolve across multiple sources simultaneously:

  • Base plan
  • Parent plan
  • Add-ons
  • Active trials
  • Promotional grants

Resolution rule: The most permissive valid value wins across those sources.

The output of an entitlement check must include:

  • Access status
  • Usage limit
  • Current usage against that limit
  • Whether the unlimited flag is set

That is what your system needs to return synchronously before each request proceeds. Returning only a boolean loses the context that the application needs to decide what to show the user next.
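To make the resolution rule concrete, here is a small sketch that merges several sources into one structured result. The `Grant` and `EntitlementResult` types are hypothetical, `None` is used to mean unlimited, and the caller is assumed to pass only grants that are currently valid. Treating an exhausted limit as "no access" is also an assumption; some systems report access and remaining usage separately.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Grant:
    source: str                 # e.g. "base_plan", "add_on", "trial", "promo"
    has_access: bool
    usage_limit: Optional[int]  # None means unlimited


@dataclass(frozen=True)
class EntitlementResult:
    has_access: bool
    usage_limit: Optional[int]
    current_usage: int
    is_unlimited: bool


def resolve(grants: Iterable[Grant], current_usage: int) -> EntitlementResult:
    """Merge all sources; the most permissive value wins."""
    active = [g for g in grants if g.has_access]
    if not active:
        return EntitlementResult(False, 0, current_usage, False)
    if any(g.usage_limit is None for g in active):
        return EntitlementResult(True, None, current_usage, True)
    limit = max(g.usage_limit for g in active)
    return EntitlementResult(current_usage < limit, limit, current_usage, False)
```

Because the result carries the limit and current usage, the application can show "4,200 of 5,000 tokens used" or an upgrade prompt instead of a bare denial.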

Credit system requirements

Token pricing depends on a real credit system, not a counter. A counter handles simple depletion and nothing else. It breaks under concurrent load, provides no audit trail, and cannot enforce burn order or block-level expiry.

A production credit system requires:

  • Block-level credit issuance
  • Expiry dates per block
  • Cost basis tracking (paid versus promotional)
  • Burn order: promotional before paid, expiring before non-expiring
  • Configurable hard versus soft depletion behavior
  • Append-only ledger for auditability and revenue recognition

Without burn order logic, customers drain paid credits while promotional balances accumulate.

Without block-level expiry, you have no way to handle promotional credits that should lapse at the end of a campaign.

Without an append-only ledger, finance has no reliable record to reconcile against for ASC 606 or IFRS.

Prepaid credit balances are deferred revenue until consumed. Each debit event is the moment of recognition. If the ledger cannot map debits to specific credit blocks, revenue recognition becomes a manual reconciliation every close cycle.
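A minimal sketch of block-level credits with burn order follows. The `CreditBlock` fields and the ledger row shape are assumptions for illustration. The text does not say which rule takes precedence when the two burn rules conflict, so this sketch assumes promotional status is compared first and expiry second.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CreditBlock:
    block_id: str
    remaining: int
    is_promotional: bool
    expires_on: Optional[date]  # None means the block never expires


def burn_key(block: CreditBlock):
    # Promotional before paid, then expiring before non-expiring,
    # then earliest expiry first.
    return (
        not block.is_promotional,
        block.expires_on is None,
        block.expires_on or date.max,
    )


def debit(blocks: List[CreditBlock], amount: int, today: date, ledger: List[dict]) -> bool:
    """Hard-depletion debit: draws from blocks in burn order or refuses entirely."""
    usable = [
        b for b in blocks
        if b.remaining > 0 and (b.expires_on is None or b.expires_on >= today)
    ]
    if sum(b.remaining for b in usable) < amount:
        return False  # nothing is debited or recorded on refusal
    left = amount
    for block in sorted(usable, key=burn_key):
        if left == 0:
            break
        take = min(block.remaining, left)
        block.remaining -= take
        left -= take
        # One ledger row per block touched, so every debit maps to its source block.
        ledger.append({
            "type": "debit",
            "block_id": block.block_id,
            "amount": take,
            "promotional": block.is_promotional,
        })
    return True
```

Each ledger row names the block it drew from and whether that block was promotional, which is the mapping finance needs to avoid reconstructing consumption by hand.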

Enforcement in the request path

Enforcement needs to happen before execution, synchronously, against the current state of usage and entitlements. If limits are evaluated after usage is recorded or during a billing cycle, the system can observe overages but cannot prevent them.

This is what separates usage tracking from usage control. Control determines whether a request should proceed in the first place and stops issues before they scale.

In practice, the challenge is keeping enforcement accurate once requests overlap under load. Credit state and entitlements have to resolve inside the execution flow, fast enough to stay invisible to the user experience.

At enterprise scale, where you have thousands of concurrent sessions, multiple billing providers, and complex plan hierarchies, enforcement decisions need to resolve in single-digit milliseconds. That's the case for keeping the entitlement check local rather than making a round trip on every request.

Architectures built for this keep enforcement local to the application, with cache-hit checks resolving at P95 under 100ms. Stigg is one example, using a BYOC Sidecar to keep decisions close to the customer's infrastructure.

How to implement token-based pricing

Implementing token-based pricing requires defining how usage is measured, separating pricing logic from application code, and enforcing limits in real time under production conditions.

Each step ensures that usage is tracked accurately and controlled consistently before it turns into a billing problem.

1. Define your usage model first

Start by defining what actually counts as usage. That includes how tokens are measured, how retries behave, and how streaming responses are accounted for. These decisions form your metering contract, and changing them later usually means a migration across your data and billing layers.

Retries are where this tends to break down. If a request fails after a debit, you need a clear policy:

  • If the debit is not reversed, users are charged for failed operations
  • If the debit is reversed, the reversal must be idempotent under retries
  • If neither is handled cleanly, balances drift over time

This goes beyond edge-case handling and directly shapes how reliable and trustworthy your pricing system remains under failure conditions.
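One way to make both the debit and its reversal safe under retries is to key every ledger entry on the request that caused it. The sketch below is an in-memory illustration with hypothetical names; it deliberately skips balance checks to keep the focus on idempotency.

```python
from typing import Dict, List


class IdempotentLedger:
    """Append-only ledger where each debit and reversal is recorded at most once."""

    def __init__(self, opening_balance: int):
        first = {"key": "opening", "amount": opening_balance}
        self.entries: List[dict] = [first]
        self._seen: Dict[str, dict] = {"opening": first}

    @property
    def balance(self) -> int:
        # The balance is derived from the entries, never stored separately.
        return sum(entry["amount"] for entry in self.entries)

    def _append_once(self, key: str, amount: int) -> bool:
        if key in self._seen:
            return False  # a retry of something already recorded
        entry = {"key": key, "amount": amount}
        self.entries.append(entry)
        self._seen[key] = entry
        return True

    def debit(self, request_id: str, amount: int) -> bool:
        return self._append_once(f"debit:{request_id}", -amount)

    def reverse(self, request_id: str) -> bool:
        original = self._seen.get(f"debit:{request_id}")
        if original is None:
            return False  # nothing to reverse
        return self._append_once(f"reversal:{request_id}", -original["amount"])
```

Retrying `debit("r1", 40)` or `reverse("r1")` any number of times leaves exactly one entry of each kind, so the balance cannot drift from duplicate delivery.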

2. Separate pricing from application logic

Pricing logic should sit outside the application. Plans, limits, and credit rules belong in a product catalog that the application reads at runtime.

When pricing lives in code, every change becomes a deploy. When it lives in configuration, changes can be tested and shipped without touching the application layer.

This separation makes simulation viable by letting you run a new pricing model against real usage in staging and validate behavior before release, so issues surface early instead of after usage has already been charged.
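As an illustration of pricing as data, the sketch below loads a catalog from JSON and runs recorded usage against it to see which accounts would exceed their limit. The plan names, the `monthly_tokens` field, and the event shape are hypothetical.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List

# In practice this would be read from a catalog service or config store at runtime.
CURRENT_CATALOG = json.loads("""
{
  "starter": {"monthly_tokens": 100000},
  "pro": {"monthly_tokens": 1000000}
}
""")


def accounts_over_limit(
    catalog: Dict[str, dict],
    usage_events: Iterable[dict],
    plan_by_account: Dict[str, str],
) -> List[str]:
    """Return the accounts whose recorded usage exceeds the catalog limit."""
    totals: Counter = Counter()
    for event in usage_events:
        totals[event["account_id"]] += event["tokens"]
    return sorted(
        account
        for account, used in totals.items()
        if used > catalog[plan_by_account[account]]["monthly_tokens"]
    )
```

Running the same events through the current catalog and a candidate catalog shows who would be affected by a change before it ships, without a code deploy on either side.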

3. Implement real-time entitlement checks

Access decisions should come from a dedicated runtime rather than logic scattered across services. Each request calls into that system and receives a structured response with access status, usage limits, and current consumption.

When entitlement logic is distributed, it diverges over time. One service enforces limits strictly, another allows overage, and a third may still reflect an outdated plan. A centralized runtime removes that drift and makes sure every request is evaluated against the same state.

4. Build atomic credit enforcement

Credit enforcement has to hold under concurrency as well as normal load, which means debits need to be atomic at the storage layer, retries need to be idempotent, and balance state needs to stay consistent across services.

A simple way to validate this is to simulate parallel requests against the same account:

  • Introduce slight delays between read and write operations
  • Run concurrent debits against the same balance
  • Check whether any sequence results in a negative balance

If it does, the system is not enforcing limits correctly under load.
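A minimal version of that test can be written against SQLite from the Python standard library. The table layout and function names are assumptions. The key idea is that the balance check lives in the `WHERE` clause of the `UPDATE`, so the check and the write are one statement rather than a read followed by a write.

```python
import os
import sqlite3
import tempfile
import threading


def open_db(path: str) -> sqlite3.Connection:
    # isolation_level=None means autocommit: each statement is its own transaction.
    return sqlite3.connect(path, timeout=30, isolation_level=None)


def try_debit(conn: sqlite3.Connection, account_id: str, amount: int) -> bool:
    """Debit only if the balance covers the amount; returns whether it applied."""
    cur = conn.execute(
        "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
        (amount, account_id, amount),
    )
    return cur.rowcount == 1


def run_concurrent_debits(start_balance: int, amount: int, workers: int):
    """Fire `workers` parallel debits at one account; return (successes, final balance)."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        setup = open_db(path)
        setup.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
        setup.execute("INSERT INTO accounts VALUES (?, ?)", ("acct_1", start_balance))
        setup.close()

        results = []
        results_lock = threading.Lock()

        def worker():
            conn = open_db(path)  # one connection per thread
            try:
                ok = try_debit(conn, "acct_1", amount)
            finally:
                conn.close()
            with results_lock:
                results.append(ok)

        threads = [threading.Thread(target=worker) for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        check = open_db(path)
        final = check.execute(
            "SELECT balance FROM accounts WHERE id = ?", ("acct_1",)
        ).fetchone()[0]
        check.close()
        return sum(results), final
    finally:
        os.remove(path)
```

With a starting balance of 100 and ten parallel debits of 30, exactly three succeed and the balance ends at 10. A read-then-write implementation put through the same harness, with a short sleep between the read and the write, is the version that tends to go negative.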

5. Replay real usage before launch

Synthetic data rarely captures how users actually behave. Real usage includes burst traffic, retries, long-running sessions, and uneven consumption patterns that expose weaknesses in pricing logic.

A more reliable approach is to replay production traffic in staging and observe how the system behaves over time. Focus on:

  • Whether credits deplete in the intended order
  • Whether limits trigger at the right thresholds
  • Whether enforcement stays consistent across sessions

This is where issues show up that are otherwise invisible in controlled tests.

6. Test under concurrency

Systems that pass tests under average load can still fail at peak conditions if testing doesn't reflect production traffic. Concurrency introduces race conditions, delayed writes, and cache inconsistencies that only appear when multiple requests interact with shared state.

Testing should reflect how the system is actually used:

  • Simulate P90 or P95 concurrent sessions, not averages
  • Run overlapping requests against the same accounts or credit pools
  • Introduce variability in request timing to surface race conditions

This is often the difference between a model that looks correct in staging and one that holds up in production.

Build vs. buy: When token-based pricing systems break

Most teams build their own token or credit system because the initial version looks simple and ships quickly. A credits table and a decrement function are enough at low scale, especially when usage is predictable and concurrency is limited.

The challenges show up as soon as token-based pricing is exposed to real usage patterns. Systems start to fail when they need to support:

  • Multiple pricing models running at once, such as tiers combined with credits or usage caps
  • Team-level budgets and org-level balances across shared usage
  • Real-time enforcement across services with consistent state
  • Ledger accuracy for auditability and revenue recognition
  • Mid-cycle plan changes with overlapping entitlements

At that point, what started as a simple counter turns into a distributed state problem with strict consistency requirements. Maintaining atomicity, burn order, and entitlement resolution under concurrency becomes an ongoing engineering effort rather than a one-time build.

One of our customers, Miro, reported saving 5,000 engineering hours after adopting Stigg, which included moving AI credit handling into a dedicated system. The main impact was less maintenance overhead as pricing logic became easier to change without touching application code.

What changes when you add a control layer

Adding a control layer moves access decisions out of the application and into a runtime that evaluates usage and enforces limits on every request.

Instead of each service deciding independently, this layer sits between the product and billing, applies entitlement and credit logic in real time, and returns a consistent decision before execution.

Billing still records what happened, while the control layer determines what is allowed based on current state.

Common mistakes when implementing token-based pricing

Token-based pricing tends to break in production for predictable reasons. Most of them come from systems that worked fine early on, then started to drift once usage became concurrent, uneven, and harder to control.

Treating credits as counters instead of ledgers

This usually starts with a simple integer column that gets decremented on each request. It works until you introduce promotional credits, expirations, or multiple balances, and suddenly, you cannot explain why a user still has credits left but cannot use them, or why paid credits were consumed before free ones.

How to fix it

Move to an append-only ledger model. Track credits at the block level with:

  • Expiry per block
  • Cost basis (paid vs promo)
  • Explicit burn order

This gives you predictable behavior at runtime, and something finance can actually reconcile later.

Enforcing limits after execution

A common pattern is logging usage first and enforcing limits later, often during billing or aggregation. This shows up when a customer blows past their limit, and support has to explain why they were allowed to keep going.

How to fix it

Move enforcement into the request path. Every request should:

  • Check entitlements
  • Verify available balance
  • Return a decision before execution

If the check happens after the request runs, the system is observing usage rather than controlling it.

Overlooking concurrency behavior

Everything looks correct in testing until two requests hit the same balance at the same time. Both pass the check, both proceed, and now the account is overdrawn even though each request looked valid on its own.

How to fix it

Enforce atomicity at the storage layer and test it explicitly:

  • Run parallel requests against the same account
  • Introduce delays between read and write
  • Verify that no sequence produces a negative balance

Also, make retries idempotent so failures do not result in duplicate charges.

Embedding pricing logic in application code

Pricing logic often starts in one service, then gets copied into others. A few months later, different parts of the product enforce different rules, and fixing it means coordinating multiple deploys.

How to fix it

Move pricing logic into a centralized runtime or catalog:

  • Application calls a single system for entitlement checks
  • Rules are defined once and applied everywhere
  • Pricing changes happen without touching application code

This keeps enforcement consistent and reduces the surface area for bugs.

Relying only on synthetic test data

Synthetic tests tend to be clean and predictable, but real usage is often very different. You start seeing burst traffic, retries, long sessions, and patterns that were never modeled in staging.

How to fix it

Replay real usage before launch:

  • Take a slice of production logs
  • Run it against the new pricing model in staging
  • Inspect burn order, limit triggers, and concurrency behavior

This is where most issues show up, especially ones that only appear over time or across sessions.

Token-based pricing infrastructure: Where the control layer lives

The control layer in token-based pricing lives inside the runtime enforcement system that evaluates credits, entitlements, and access decisions in the request path, before a request executes.

That layer centralizes:

  • Low-latency entitlement checks in the request path
  • Append-only credit state with consistent balances under concurrent load
  • Request-level enforcement before compute is consumed
  • Shared metering, entitlement resolution, and credit logic across services

Without this layer, every service ends up maintaining its own version of pricing and access logic, which becomes difficult to keep consistent at scale.

Stigg is the usage runtime for AI products. Entitlements, credits, usage limits, and spend governance are enforced synchronously in the request path instead of after the fact in billing. 

This is how it works:

  • Runs through a BYOC Sidecar deployed inside your own cloud
  • Resolves most entitlement checks from local cache at P95 under 100ms
  • Falls back to API retrieval when cache state is unavailable
  • Works alongside existing billing infrastructure rather than replacing it

The enforcement problem in your infrastructure

Token-based pricing breaks when enforcement doesn’t hold under real usage, especially once concurrency, shared state, and multiple pricing rules interact.

You start to see it in a few consistent ways:

  • Negative balances when concurrent requests hit the same credits
  • Credits depleting in the wrong order across services
  • Limits enforced differently depending on where the request is handled
  • Ledger data that does not match actual usage

These issues tend to show up after launch, when usage patterns become uneven and harder to control. They usually point to the same underlying problem: enforcement is happening too late or too inconsistently to keep state aligned.

If your pricing changes still require deploys or enforcement only happens after usage is recorded, see how Stigg structures that control layer in the request path for AI products that need runtime enforcement at scale.

FAQs

1. What is token-based pricing?

Token-based pricing is a usage model where customers pay based on tokens consumed per request. Tokens represent the work performed, while entitlements and credit systems enforce limits in real time.

2. How does token-based pricing enforce token limits in real time?

Token-based pricing enforces limits by checking entitlements and available credits before execution. The system returns an access decision, then debits usage atomically in the request path.

3. Can billing systems handle token-based pricing alone?

No, billing systems cannot handle token-based pricing alone because they record usage after it happens. Token-based pricing requires enforcement before execution, which billing systems are not designed to do.

4. What breaks first in token-based pricing systems?

Token-based pricing systems break under concurrency first, where multiple requests access the same balance at the same time. Cache drift during plan changes is another common issue that causes inconsistent enforcement.

5. Why is token-based pricing hard to implement?

Token-based pricing is hard to implement because it requires real-time enforcement, atomic credit handling, and consistent state across services. Most systems break when usage becomes concurrent and unpredictable.
