Token-based pricing starts simply enough with a credits column, a decrement on each request, and a check in middleware. It holds until an AI agent runs autonomously on an enterprise account, fires thousands of requests against the same credit pool, and two sessions pass the same balance check before either debit settles.
The account goes negative, support gets the call, and engineering gets the ticket.
The problem is that enforcement ran after the fact instead of before execution. Here is how token-based pricing actually works and how to implement it without building toward that failure.
How token-based pricing works in real-time systems
Token-based pricing works by metering usage events, converting that usage into credit or cost units, enforcing limits before execution, and recording usage in a ledger for reconciliation.
The distinction that matters for implementation:
- Metering tells you what happened
- Enforcement decides what is allowed, in real time
Most systems get metering right and skip enforcement. Usage gets recorded, but nothing stops a request from proceeding when a balance is exhausted. This leads to overages, credit drift, and billing disputes.
Core components of a token pricing system
These components (metering, conversion into credit units, enforcement, and the ledger) run directly in the request path rather than after billing. When any of them sits outside that path, enforcement can break under load.
Each component also needs to fail safely. If the entitlement service degrades, enforcement can't silently pass requests through, which is how accounts go negative at scale.
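To make "fail safely" concrete, here is a minimal sketch of a fail-closed wrapper. It assumes a hypothetical `check` callable and a hypothetical `EntitlementUnavailable` error; neither is a real library API. The point it illustrates is that a degraded dependency should produce an explicit denial rather than a silent pass.

```python
from dataclasses import dataclass
from typing import Callable


class EntitlementUnavailable(Exception):
    """Hypothetical error raised when the entitlement service cannot answer."""


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def enforce_fail_closed(check: Callable[[str, int], bool], account_id: str, tokens: int) -> Decision:
    """Run the check; treat an unavailable service as a denial, never as a pass."""
    try:
        allowed = check(account_id, tokens)
    except EntitlementUnavailable:
        # Fail closed: a degraded dependency must not silently allow usage.
        return Decision(allowed=False, reason="entitlement_service_unavailable")
    return Decision(allowed=allowed, reason="ok" if allowed else "limit_reached")


def healthy_check(account_id: str, tokens: int) -> bool:
    return tokens <= 100  # stand-in rule for the example


def degraded_check(account_id: str, tokens: int) -> bool:
    raise EntitlementUnavailable("timeout")
```

Whether to fail closed for every feature is a product decision. Some teams may prefer a bounded grace allowance for low-cost actions, but the default should be explicit rather than accidental.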
Why token-based pricing becomes an infrastructure problem
Token-based pricing becomes an infrastructure problem as soon as usage is checked and enforced in real time across systems, where consistency, concurrency, and state accuracy directly affect access.
Most teams start with a credits table, a decrement function, and a middleware check. That works at low scale, but it begins to strain when requests overlap, plans change mid-cycle, or multiple services rely on the same entitlement state.
Where systems break first
Look out for these common breaking points:
- Concurrency: Two requests pass the same balance check before either debit completes. Both proceed, and the account drops below zero.
- Cache drift: A plan change updates entitlements, but cached data continues to serve the previous state during the TTL. The customer upgrades, yet still sees old limits.
- Burn order bugs: Promotional credits remain unused while paid credits are consumed first, which leads to incorrect depletion of balances.
- Partial failures: The debit succeeds, but the request fails. The customer is charged without receiving the service, and the ledger lacks a clear trace of the event.
These issues point to a reliability problem at the infrastructure level. The credit layer needs to behave with the same correctness guarantees as any other stateful system in the stack: atomic writes, consistent reads, and defined behavior under failure.
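As an illustration of an atomic write, the sketch below folds the balance check and the debit into a single conditional `UPDATE` using Python's built-in `sqlite3` module. The table and column names are assumptions made for the example, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('acct_1', 100)")
conn.commit()


def try_debit(conn: sqlite3.Connection, account_id: str, amount: int) -> bool:
    """Debit only if the balance covers it, in a single UPDATE statement."""
    cur = conn.execute(
        "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
        (amount, account_id, amount),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means insufficient balance


def get_balance(conn: sqlite3.Connection, account_id: str) -> int:
    row = conn.execute("SELECT balance FROM accounts WHERE id = ?", (account_id,)).fetchone()
    return row[0]


first = try_debit(conn, "acct_1", 60)   # balance covers it
second = try_debit(conn, "acct_1", 60)  # only 40 left, so no row matches
```

Because the check lives in the `WHERE` clause, there is no window between reading the balance and writing the new value for a second request to slip through. The `CHECK` constraint is a second line of defense if another code path forgets the condition.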
Token-based pricing architecture: What you actually need
Token-based pricing architecture requires real-time metering, entitlement resolution, credit tracking, and enforcement that runs in the request path.
Each component must work together synchronously to ensure usage is measured, validated, and controlled before a request is allowed to execute.
The sections below cover each component and what it does, starting with the request flow.
Request flow for token enforcement
- Request arrives (API call, agent execution, feature access)
- The application passes the requested token amount to the entitlement check
- Entitlement check runs, evaluating plan, add-ons, credits, and active overrides
- The credit system verifies the available balance against the requirement
- Enforcement decision is returned: allow, block, or allow with overage
- Debit happens atomically at the storage level
- The event is written to the ledger
Enforcement resolves synchronously before execution. The debit and ledger write follow atomically once the decision is made.
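The flow above can be sketched in a few lines. This is a single-threaded illustration only; `Account`, `Decision`, and `handle_request` are hypothetical names, and a real system would perform the debit and ledger write in one storage transaction.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ALLOW_WITH_OVERAGE = "allow_with_overage"


@dataclass
class Account:
    balance: int
    has_access: bool = True
    overage_allowed: bool = False
    ledger: list = field(default_factory=list)  # append-only event log


def handle_request(account: Account, tokens_requested: int) -> Decision:
    # Entitlement check: is the feature available to this account at all?
    if not account.has_access:
        return Decision.BLOCK
    # Balance check against the amount the application asked for.
    if account.balance >= tokens_requested:
        decision = Decision.ALLOW
    elif account.overage_allowed:
        decision = Decision.ALLOW_WITH_OVERAGE
    else:
        return Decision.BLOCK
    # Debit, then ledger write. A negative balance here represents overage owed.
    account.balance -= tokens_requested
    account.ledger.append(
        {"type": "debit", "tokens": tokens_requested, "decision": decision.value}
    )
    return decision
```

Note that a blocked request writes nothing and changes nothing, which is what lets the caller run the actual work only after an allow decision comes back.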
Entitlements
Entitlements are the rules that determine access and usage limits for a given plan. They reflect what a customer paid for, unlike RBAC, which focuses on roles and permissions within a team.
They resolve across multiple sources simultaneously:
- Base plan
- Parent plan
- Add-ons
- Active trials
- Promotional grants
Resolution rule: The most permissive valid value wins across those sources.
The output of an entitlement check must include:
- Access status
- Usage limit
- Current usage against that limit
- Whether the unlimited flag is set
That is what your system needs to return synchronously before each request proceeds. Returning only a boolean loses the context that the application needs to decide what to show the user next.
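A rough sketch of that resolution and its structured result might look like this. The `Grant` and `EntitlementResult` types are illustrative, and two choices are assumptions of the example: a limit of `None` means unlimited, and access is reported as false once usage reaches the limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    source: str                 # e.g. "base_plan", "add_on", "trial"
    has_access: bool
    usage_limit: Optional[int]  # None means unlimited in this sketch


@dataclass(frozen=True)
class EntitlementResult:
    has_access: bool
    usage_limit: Optional[int]
    current_usage: int
    is_unlimited: bool


def resolve(grants: list[Grant], current_usage: int) -> EntitlementResult:
    """Most permissive valid value wins across all sources."""
    active = [g for g in grants if g.has_access]
    if not active:
        return EntitlementResult(False, 0, current_usage, False)
    if any(g.usage_limit is None for g in active):
        return EntitlementResult(True, None, current_usage, True)
    limit = max(g.usage_limit for g in active)
    # Assumption: access is denied once usage reaches the resolved limit.
    return EntitlementResult(current_usage < limit, limit, current_usage, False)
```

With the limit and current usage in the response, the application can show "80% used" or an upgrade prompt instead of a bare denial.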
Credit system requirements
Token pricing depends on a real credit system, not a counter. A counter handles simple depletion and nothing else. It breaks under concurrent load, provides no audit trail, and cannot enforce burn order or block-level expiry.
A production credit system requires:
- Block-level credit issuance
- Expiry dates per block
- Cost basis tracking (paid versus promotional)
- Burn order: promotional before paid, expiring before non-expiring
- Configurable hard versus soft depletion behavior
- Append-only ledger for auditability and revenue recognition
Without burn order logic, customers drain paid credits while promotional balances accumulate.
Without block-level expiry, you have no way to handle promotional credits that should lapse at the end of a campaign.
Without an append-only ledger, finance has no reliable record to reconcile against for ASC 606 or IFRS 15.
Prepaid credit balances are deferred revenue until consumed. Each debit event is the moment of recognition. If the ledger cannot map debits to specific credit blocks, revenue recognition becomes a manual reconciliation every close cycle.
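Here is a minimal sketch of block-level credits with burn order and per-block ledger entries. The `CreditBlock` shape is hypothetical, and the precedence used here (promotional first, then expiring before non-expiring, then soonest expiry) is one reasonable reading of the burn order above, not the only valid one.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CreditBlock:
    block_id: str
    remaining: int
    is_promotional: bool
    expires_on: Optional[date]  # None means the block never expires


def burn_order_key(block: CreditBlock):
    # Promotional before paid, expiring before non-expiring, soonest expiry first.
    return (
        not block.is_promotional,
        block.expires_on is None,
        block.expires_on or date.max,
    )


def debit(blocks: list[CreditBlock], amount: int, today: date, ledger: list) -> bool:
    """Hard-depletion debit: consume blocks in burn order or reject the whole request."""
    usable = [
        b for b in blocks
        if b.remaining > 0 and (b.expires_on is None or b.expires_on >= today)
    ]
    if sum(b.remaining for b in usable) < amount:
        return False
    for block in sorted(usable, key=burn_order_key):
        take = min(block.remaining, amount)
        block.remaining -= take
        amount -= take
        # One ledger entry per block touched, so each debit maps to a cost basis.
        ledger.append(
            {"block_id": block.block_id, "amount": take, "promotional": block.is_promotional}
        )
        if amount == 0:
            break
    return True
```

Writing one ledger entry per block touched is what lets finance map each debit back to a specific block and its cost basis without manual reconciliation.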
Enforcement in the request path
Enforcement needs to happen before execution, synchronously, against the current state of usage and entitlements. If limits are evaluated after usage is recorded or during a billing cycle, the system can observe overages but cannot prevent them.
This is what separates usage tracking from usage control. Control determines whether a request should proceed in the first place and stops issues before they scale.
In practice, the challenge is keeping enforcement accurate once requests overlap under load. Credit state and entitlements have to resolve inside the execution flow, fast enough to stay invisible to the user experience.
At enterprise scale, where you have thousands of concurrent sessions, multiple billing providers, and complex plan hierarchies, enforcement decisions need to resolve in single-digit milliseconds. That's the case for keeping the entitlement check local rather than making a round trip on every request.
Architectures built for this keep enforcement local to the application, with cache-hit checks resolving at P95 under 100ms. Stigg is one example, using a BYOC Sidecar to keep decisions close to the customer's infrastructure.
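To show the general shape of a local check with remote fallback, here is a small TTL cache sketch. It is not Stigg's API; `EntitlementCache` and the `fetch` callable are hypothetical, and the injectable `clock` exists only to make the behavior testable.

```python
import time
from typing import Callable


class EntitlementCache:
    """Local cache in front of a slower remote lookup (hypothetical `fetch` callable)."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, account_id: str) -> dict:
        entry = self._entries.get(account_id)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            return entry[1]                  # cache hit: no network round trip
        value = self._fetch(account_id)      # miss or stale: fall back to the remote source
        self._entries[account_id] = (self._clock(), value)
        return value

    def invalidate(self, account_id: str) -> None:
        """Call on plan changes so upgrades do not wait out the TTL."""
        self._entries.pop(account_id, None)
```

The explicit `invalidate` call addresses the cache drift case described earlier: without it, an upgraded customer keeps seeing old limits until the TTL expires.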
How to implement token-based pricing
Implementing token-based pricing requires defining how usage is measured, separating pricing logic from application code, and enforcing limits in real time under production conditions.
Each step ensures that usage is tracked accurately and controlled consistently before it turns into a billing problem.
1. Define your usage model first
Start by defining what actually counts as usage. That includes how tokens are measured, how retries behave, and how streaming responses are accounted for. These decisions form your metering contract, and changing them later usually means a migration across your data and billing layers.
Retries are where this tends to break down. If a request fails after a debit, you need a clear policy:
- If the debit is not reversed, users are charged for failed operations
- If the debit is reversed, the reversal must be idempotent under retries
- If neither is handled cleanly, balances drift over time
This goes beyond edge-case handling and directly shapes how reliable and trustworthy your pricing system remains under failure conditions.
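One way to make the reversal policy idempotent is to key both debits and reversals on a request ID. The in-memory `CreditAccount` below is a hypothetical sketch of that idea; a production version would keep the same bookkeeping in durable storage.

```python
class CreditAccount:
    def __init__(self, balance: int):
        self.balance = balance
        self.ledger = []        # append-only list of (kind, request_id, amount)
        self._debits = {}       # request_id -> amount debited
        self._reversed = set()  # request_ids already refunded

    def debit(self, request_id: str, amount: int) -> bool:
        if request_id in self._debits:
            # A retried debit is a no-op, not a second charge. A request that was
            # already refunded must come back under a new request_id.
            return request_id not in self._reversed
        if self.balance < amount:
            return False
        self.balance -= amount
        self._debits[request_id] = amount
        self.ledger.append(("debit", request_id, amount))
        return True

    def reverse(self, request_id: str) -> bool:
        """Refund a failed request exactly once, however many times this is retried."""
        if request_id not in self._debits or request_id in self._reversed:
            return False
        amount = self._debits[request_id]
        self.balance += amount
        self._reversed.add(request_id)
        self.ledger.append(("reversal", request_id, amount))
        return True
```

The reversal is recorded as a new ledger entry rather than by deleting the debit, so the ledger stays append-only and the failed request leaves a clear trace.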
2. Separate pricing from application logic
Pricing logic should sit outside the application. Plans, limits, and credit rules belong in a product catalog that the application reads at runtime.
When pricing lives in code, every change becomes a deploy. When it lives in configuration, changes can be tested and shipped without touching the application layer.
This separation makes simulation viable by letting you run a new pricing model against real usage in staging and validate behavior before release, so issues surface early instead of after usage has already been charged.
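As a rough illustration, the catalog can be plain data that the application loads at runtime. The plan names, rates, and JSON shape below are invented for the example, and `simulate` is a deliberately simple stand-in for running a pricing model against recorded usage.

```python
import json

# Hypothetical catalog; in practice this would be loaded from a config store, not inlined.
CATALOG_JSON = """
{
  "plans": {
    "starter": {"monthly_tokens": 100000, "overage_allowed": false},
    "scale":   {"monthly_tokens": 5000000, "overage_allowed": true}
  },
  "token_costs": {"chat_completion": 1, "image_generation": 50}
}
"""


def load_catalog(raw: str) -> dict:
    return json.loads(raw)


def credits_for(catalog: dict, feature: str, units: int) -> int:
    """Convert raw usage units into credits using the catalog, not hard-coded rates."""
    return catalog["token_costs"][feature] * units


def simulate(catalog: dict, plan: str, events: list[tuple[str, int]]) -> int:
    """Return credits left after the events (negative means the plan would be exceeded)."""
    remaining = catalog["plans"][plan]["monthly_tokens"]
    for feature, units in events:
        remaining -= credits_for(catalog, feature, units)
    return remaining
```

Changing the rate for `image_generation` is then a catalog edit, and the same `simulate` call shows how existing usage would have fared under the new rate.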
3. Implement real-time entitlement checks
Access decisions should come from a dedicated runtime rather than logic scattered across services. Each request calls into that system and receives a structured response with access status, usage limits, and current consumption.
When entitlement logic is distributed, it diverges over time. One service enforces limits strictly, another allows overage, and a third may still reflect an outdated plan. A centralized runtime removes that drift and makes sure every request is evaluated against the same state.
4. Build atomic credit enforcement
Credit enforcement has to hold under concurrency as well as normal load, which means debits need to be atomic at the storage layer, retries need to be idempotent, and balance state needs to stay consistent across services.
A simple way to validate this is to simulate parallel requests against the same account:
- Introduce slight delays between read and write operations
- Run concurrent debits against the same balance
- Check whether any sequence results in a negative balance
If it does, the system is not enforcing limits correctly under load.
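The validation steps above can be sketched with threads from the standard library. In this example the `threading.Lock` stands in for a storage-level atomic operation, and the `Account` class is a hypothetical in-memory substitute for the real credit store.

```python
import threading
import time


class Account:
    def __init__(self, balance: int):
        self.balance = balance
        self._lock = threading.Lock()

    def debit(self, amount: int) -> bool:
        # The lock stands in for a storage-level atomic operation.
        with self._lock:
            current = self.balance           # read
            time.sleep(0.001)                # deliberate delay between read and write
            if current < amount:
                return False
            self.balance = current - amount  # write
            return True


def run_concurrent_debits(account: Account, amount: int, workers: int) -> int:
    """Fire `workers` debits at once and return how many succeeded."""
    results = []
    barrier = threading.Barrier(workers)

    def worker():
        barrier.wait()  # release all threads together
        results.append(account.debit(amount))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```

Running 25 workers at 10 credits each against a balance of 100 should yield exactly 10 successes and a final balance of 0. Removing the lock from `debit` would likely let more than 10 through, which is the failure this test is meant to catch.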
5. Replay real usage before launch
Synthetic data rarely captures how users actually behave. Real usage includes burst traffic, retries, long-running sessions, and uneven consumption patterns that expose weaknesses in pricing logic.
A more reliable approach is to replay production traffic in staging and observe how the system behaves over time. Focus on:
- Whether credits deplete in the intended order
- Whether limits trigger at the right thresholds
- Whether enforcement stays consistent across sessions
This is where issues show up that are otherwise invisible in controlled tests.
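A replay harness can be quite small. The sketch below assumes recorded events have already been reduced to an account ID and a token count (the `UsageEvent` shape is hypothetical) and reports where each account would first have been blocked under a candidate set of limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    account_id: str
    tokens: int


def replay(events: list[UsageEvent], limits: dict[str, int]) -> dict[str, dict]:
    """Replay recorded events against candidate limits and report what would have happened."""
    report = {
        acct: {"used": 0, "blocked": 0, "first_block_index": None} for acct in limits
    }
    for index, event in enumerate(events):
        stats = report[event.account_id]
        if stats["used"] + event.tokens > limits[event.account_id]:
            # This request would have been rejected under the candidate limit.
            stats["blocked"] += 1
            if stats["first_block_index"] is None:
                stats["first_block_index"] = index
        else:
            stats["used"] += event.tokens
    return report
```

Comparing `first_block_index` against the point where you expected the limit to trigger is a quick way to spot thresholds that fire too early or too late.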
6. Test under concurrency
Systems that pass tests under average load can still fail at peak conditions if testing doesn't reflect production traffic. Concurrency introduces race conditions, delayed writes, and cache inconsistencies that only appear when multiple requests interact with shared state.
Testing should reflect how the system is actually used:
- Simulate P90 or P95 concurrent sessions, not averages
- Run overlapping requests against the same accounts or credit pools
- Introduce variability in request timing to surface race conditions
This is often the difference between a model that looks correct in staging and one that holds up in production.
Build vs. buy: When token-based pricing systems break
Most teams build their own token or credit system because the initial version looks simple and ships quickly. A credits table and a decrement function are enough at low scale, especially when usage is predictable and concurrency is limited.
The challenges show up as soon as token-based pricing is exposed to real usage patterns. Systems start to fail when they need to support:
- Multiple pricing models running at once, such as tiers combined with credits or usage caps
- Team-level budgets and org-level balances across shared usage
- Real-time enforcement across services with consistent state
- Ledger accuracy for auditability and revenue recognition
- Mid-cycle plan changes with overlapping entitlements
At that point, what started as a simple counter turns into a distributed state problem with strict consistency requirements. Maintaining atomicity, burn order, and entitlement resolution under concurrency becomes an ongoing engineering effort rather than a one-time build.
One of our customers, Miro, reported saving 5,000 engineering hours after adopting Stigg, which included moving AI credit handling into a dedicated system. The main impact was less maintenance overhead as pricing logic became easier to change without touching application code.
What changes when you add a control layer
Adding a control layer moves access decisions out of the application and into a runtime that evaluates usage and enforces limits on every request.
Instead of each service deciding independently, this layer sits between the product and billing, applies entitlement and credit logic in real time, and returns a consistent decision before execution.
Billing still records what happened, while the control layer determines what is allowed based on current state.
Common mistakes when implementing token-based pricing
Token-based pricing tends to break in production for predictable reasons. Most of them come from systems that worked fine early on, then started to drift once usage became concurrent, uneven, and harder to control.
Treating credits as counters instead of ledgers
This usually starts with a simple integer column that gets decremented on each request. It works until you introduce promotional credits, expirations, or multiple balances, and suddenly, you cannot explain why a user still has credits left but cannot use them, or why paid credits were consumed before free ones.
How to fix it
Move to an append-only ledger model. Track credits at the block level with:
- Expiry per block
- Cost basis (paid vs promo)
- Explicit burn order
This gives you predictable behavior at runtime, and something finance can actually reconcile later.
Enforcing limits after execution
A common pattern is logging usage first and enforcing limits later, often during billing or aggregation. This shows up when a customer blows past their limit, and support has to explain why they were allowed to keep going.
How to fix it
Move enforcement into the request path. Every request should:
- Check entitlements
- Verify available balance
- Return a decision before execution
If the check happens after the request runs, the system is observing usage rather than controlling it.
Overlooking concurrency behavior
Everything looks correct in testing until two requests hit the same balance at the same time. Both pass the check, both proceed, and now the account is overdrawn even though each request looked valid on its own.
How to fix it
Enforce atomicity at the storage layer and test it explicitly:
- Run parallel requests against the same account
- Introduce delays between read and write
- Verify that no sequence produces a negative balance
Also, make retries idempotent so failures do not result in duplicate charges.
Embedding pricing logic in application code
Pricing logic often starts in one service, then gets copied into others. A few months later, different parts of the product enforce different rules, and fixing it means coordinating multiple deploys.
How to fix it
Move pricing logic into a centralized runtime or catalog:
- Application calls a single system for entitlement checks
- Rules are defined once and applied everywhere
- Pricing changes happen without touching application code
This keeps enforcement consistent and reduces the surface area for bugs.
Relying only on synthetic test data
Synthetic tests tend to be clean and predictable, but real usage is often very different. You start seeing burst traffic, retries, long sessions, and patterns that were never modeled in staging.
How to fix it
Replay real usage before launch:
- Take a slice of production logs
- Run it against the new pricing model in staging
- Inspect burn order, limit triggers, and concurrency behavior
This is where most issues show up, especially ones that only appear over time or across sessions.
Token-based pricing infrastructure: Where the control layer lives
The control layer in token-based pricing lives inside the runtime enforcement system that evaluates credits, entitlements, and access decisions before requests execute.
That layer centralizes:
- Low-latency entitlement checks during execution
- Append-only credit state with consistent balances under concurrent load
- Request-level enforcement before compute is consumed
- Shared metering, entitlement resolution, and credit logic across services
Without this layer, every service ends up maintaining its own version of pricing and access logic, which becomes difficult to keep consistent at scale.
Stigg is the usage runtime for AI products. Entitlements, credits, usage limits, and spend governance are enforced synchronously in the request path instead of after the fact in billing.
This is how it works:
- Runs through a BYOC Sidecar deployed inside your own cloud
- Resolves most entitlement checks from local cache at P95 under 100ms
- Falls back to API retrieval when cache state is unavailable
- Works alongside existing billing infrastructure rather than replacing it
The enforcement problem in your infrastructure
Token-based pricing breaks when enforcement doesn’t hold under real usage, especially once concurrency, shared state, and multiple pricing rules interact.
You start to see it in a few consistent ways:
- Negative balances when concurrent requests hit the same credits
- Credits depleting in the wrong order across services
- Limits enforced differently depending on where the request is handled
- Ledger data that does not match actual usage
These issues tend to show up after launch, when usage patterns become uneven and harder to control. They usually point to the same underlying problem. Enforcement is happening too late or too inconsistently to keep state aligned.
If your pricing changes still require deploys or enforcement only happens after usage is recorded, see how Stigg structures that control layer in the request path for AI products that need runtime enforcement at scale.
FAQs
1. What is token-based pricing?
Token-based pricing is a usage model where customers pay based on tokens consumed per request. Tokens represent the work performed, while entitlements and credit systems enforce limits in real time.
2. How does token-based pricing enforce token limits in real time?
Token-based pricing enforces limits by checking entitlements and available credits before execution. The system returns an access decision, then debits usage atomically in the request path.
3. Can billing systems handle token-based pricing alone?
No, billing systems cannot handle token-based pricing alone because they record usage after it happens. Token-based pricing requires enforcement before execution, which billing systems are not designed to do.
4. What breaks first in token-based pricing systems?
Token-based pricing systems break under concurrency first, where multiple requests access the same balance at the same time. Cache drift during plan changes is another common issue that causes inconsistent enforcement.
5. Why is token-based pricing hard to implement?
Token-based pricing is hard to implement because it requires real-time enforcement, atomic credit handling, and consistent state across services. Most systems break when usage becomes concurrent and unpredictable.
