What is a Token in AI? Pricing and Usage Explained

Understand what a token in AI is, and how token usage impacts pricing, credits, and enforcement in production systems.

Sara Nelissen | May 20, 2026 | 8 min read

A token in AI is the unit that drives credits, limits, and usage tracking across every request. When pricing lives in your codebase, every AI request depends on token accounting.

If you started with a simple counter tied to usage, it will start failing once you deal with concurrent requests, shared team budgets, and mid-cycle plan changes.

This guide explains what a token in AI is, how it shapes pricing, and how to structure the system that enforces it in production.

What is a token in AI?

A token in AI is the smallest unit of data a language model processes. It is a fragment of text that a tokenizer converts into a numeric ID before the model runs any computation.

A token does not map cleanly to a word. Tokenization depends on how the model splits text into smaller units, which affects how usage is counted.

  • Short, common words often map to a single token
  • Longer or less common words split into multiple tokens
  • The word "encoding" can split into "encod" and "ing" because BPE tokenizers favor common subwords

This variation matters because token count is not predictable from text length alone. Your system needs to measure actual tokens, not assume based on words.
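To make the gap between words and tokens concrete, here is a toy sketch. It uses greedy longest-match against a tiny, made-up vocabulary (`VOCAB` and its IDs are hypothetical). Real BPE tokenizers learn their merges from data and have far larger vocabularies, so treat this only as an illustration of why word count does not predict token count.

```python
# Hypothetical toy vocabulary mapping text pieces to integer IDs.
VOCAB = {"the": 0, "cat": 1, "encod": 2, "ing": 3, " ": 4}


def tokenize(text, vocab):
    """Split text into vocabulary pieces using greedy longest-match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a piece matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


pieces = tokenize("the encoding", VOCAB)
ids = [VOCAB[p] for p in pieces]
print(pieces)  # ['the', ' ', 'encod', 'ing']
print(ids)     # [0, 4, 2, 3]
```

Two words become four tokens here. With a different vocabulary the same text would split differently, which is why a metering system should record the count reported for the actual model rather than estimate it from words.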

Tokens are the unit your system tracks for usage and pricing. Each request consumes different types of tokens that need to be counted correctly.

Token type | Where it comes from | What it affects
--- | --- | ---
Input tokens | The prompt sent to the model (user prompt plus any system prompt and context) | Initial usage count
Output tokens | Model response | Response size and cost
Reasoning tokens | Internal model processing | Hidden usage that increases total cost

Reasoning tokens often remain invisible in the response but still increase total usage on complex requests.

If your system does not track these counts accurately, limits fail, quotas drift, and pricing becomes inconsistent.
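A minimal sketch of per-request accounting for the three token types might look like the following. The `TokenUsage` class and its field names are hypothetical, and the sketch assumes the provider reports reasoning tokens separately from output tokens. Some APIs include reasoning tokens inside the output count, in which case adding them again would double count, so check how your provider reports them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one request (hypothetical structure)."""
    input_tokens: int
    output_tokens: int
    reasoning_tokens: int = 0  # assumed to be reported separately from output

    @property
    def billable_total(self) -> int:
        # Everything the request consumed, visible in the response or not.
        return self.input_tokens + self.output_tokens + self.reasoning_tokens


usage = TokenUsage(input_tokens=1200, output_tokens=300, reasoning_tokens=850)
visible = usage.input_tokens + usage.output_tokens
print(visible)               # 1500
print(usage.billable_total)  # 2350
```

In this example, a counter that only looks at the visible prompt and response would undercount the request by 850 tokens.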

How tokenization works

Tokenization works by converting raw input into a sequence of integer IDs that the model can process. The tokenizer maps text into these IDs based on a fixed vocabulary tied to the model.

Each model uses its own tokenizer. The same prompt can produce different token counts depending on the model you use.

This affects how your system behaves:

  • Usage tracking against a customer’s credit balance
  • Enforcement of per-feature limits across model versions
  • Entitlement updates when customers switch between model versions

Tokenization also applies to non-text inputs. Multimodal models convert images into patch tokens and audio into spectrogram or semantic tokens.

Each modality follows different counting rules. Your metering pipeline needs to handle these differences when a single request includes text, images, and tool outputs.

How tokens drive AI pricing models

Tokens drive AI pricing models by defining how usage gets measured, limited, and charged. Every pricing model maps back to how tokens are counted and enforced in your system.

Most pricing models fall into four patterns. Each one introduces different infrastructure requirements:

Model | How it works | Infrastructure requirement
--- | --- | ---
1. Pay-per-token | Charge per input and output token | Real-time metering and enforcement
2. Prepaid credits | Customer buys credits that tokens draw down | Credit ledger, balance checks, depletion logic
3. Subscription with limits | Fixed token cap per plan | Entitlement enforcement and limit handling
4. Hybrid | Subscription-based with token-based overages | Combined enforcement and overage detection

The complexity shows up in enforcement. Your system needs to apply the right limits at the right time across users, teams, and features.
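As a worked example of the hybrid pattern, the sketch below computes a period charge from a base subscription fee plus overage on tokens beyond the included amount. The function name, prices, and included quota are all made up for illustration. `Decimal` is used so that currency arithmetic does not pick up floating-point rounding errors.

```python
from decimal import Decimal


def hybrid_charge(tokens_used, included_tokens, base_fee, overage_per_1k):
    """Base subscription fee plus per-1,000-token overage (hypothetical rates)."""
    overage_tokens = max(0, tokens_used - included_tokens)
    overage_fee = (Decimal(overage_tokens) / Decimal(1000)) * overage_per_1k
    return base_fee + overage_fee


# 1,250,000 tokens used against a plan that includes 1,000,000.
charge = hybrid_charge(
    tokens_used=1_250_000,
    included_tokens=1_000_000,
    base_fee=Decimal("50.00"),
    overage_per_1k=Decimal("0.02"),
)
print(charge)  # 55.00
```

The arithmetic is simple. The harder part, as described above, is knowing `tokens_used` accurately at the moment a limit needs to be applied.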

Most teams move toward a hybrid model after their first iteration. Miro, for example, launched a hybrid seat and credit-based model in under 6 weeks and saved an estimated 5,000 engineering hours by not building and maintaining its own in-house monetization infrastructure.

Token metering: What the infrastructure actually needs

Token metering infrastructure needs to capture, attribute, and enforce usage in real time. The system must track every token and check limits before the inference call runs, not after the request completes.

Metering for pricing requires exact counts. Small errors in measurement lead to incorrect limits, missed overages, and reconciliation issues.

A production-grade token metering layer needs to handle several things:

  • Event attribution: Each token must link to a customer, feature, session, and timestamp. Systems use JSON events emitted at inference time and routed through a metering pipeline.
  • Real-time enforcement: The check has to happen before the inference call. Delayed enforcement can document the overage, but it cannot prevent it.
  • Concurrent session handling: Shared credit pools break under concurrent load when two requests read the same balance before either updates it. Both pass, both consume credits, and the balance goes negative. The billing system records the overage accurately, but nothing prevents it. Atomic debit operations solve this. Simple counter decrements do not.
  • Cache behavior: Enforcement depends on low-latency access to entitlement state. On a cache hit, reads are immediate, and entitlement data resolves from local cache without an upstream call. On a cache miss, the Sidecar deployed in your own cloud (BYOC) fetches from Stigg's Edge API at around 100ms, with a configurable timeout to prevent upstream latency from affecting application performance. (See persistent caching for the full architecture.)
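The event attribution point above can be sketched as a small function that builds one JSON usage event per inference call. The field names are hypothetical, and the `event_id` idempotency key is an added assumption (it lets a pipeline discard duplicates when an event is retried). It does not reflect any specific vendor's event schema.

```python
import json
import uuid
from datetime import datetime, timezone


def build_usage_event(customer_id, feature_id, session_id,
                      input_tokens, output_tokens):
    """Build one usage event linking tokens to customer, feature, and session."""
    return {
        "event_id": str(uuid.uuid4()),  # assumed idempotency key for retries
        "customer_id": customer_id,
        "feature_id": feature_id,
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }


event = build_usage_event("cust_123", "chat-completion", "sess_9", 1200, 300)
payload = json.dumps(event)  # what would be sent to the metering pipeline
print(payload)
```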

Errors in attribution, enforcement, or concurrency handling show up as inconsistent quotas, broken limits, and unpredictable customer experience under load.
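The concurrency failure described above comes from separating the balance read from the balance write. A minimal sketch of an atomic debit, using an in-process lock, is shown below. The `CreditPool` class is hypothetical, and a lock only protects a single process. With several application instances, the same check-and-decrement would need to be atomic in the shared datastore instead.

```python
import threading


class CreditPool:
    """Shared credit balance with an atomic check-and-decrement."""

    def __init__(self, balance):
        self._balance = balance
        self._lock = threading.Lock()

    def try_debit(self, amount):
        """Debit only if enough balance remains; return whether it succeeded."""
        with self._lock:
            # The check and the update happen inside one critical section,
            # so two requests cannot both pass on the same stale balance.
            if self._balance < amount:
                return False
            self._balance -= amount
            return True

    @property
    def balance(self):
        with self._lock:
            return self._balance


pool = CreditPool(balance=100)
results = []

def worker():
    results.append(pool.try_debit(60))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [False, True]
print(pool.balance)     # 40
```

Two concurrent requests for 60 credits against a balance of 100 result in exactly one success, and the balance never goes negative.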

What are entitlements in AI pricing?

Entitlements define what a customer can consume based on their plan. In AI products, that means how many tokens they can use, where limits apply, and what happens when those limits are hit.

RBAC answers who can access a feature, whereas entitlements answer how much they can use it.

Concept | What it controls | How it behaves
--- | --- | ---
RBAC | Access to features | Allow or deny
Entitlements | Usage and consumption limits | Measured: boolean, usage-limited, or configured

A common pattern: teams choose a usage-based billing tool first, then discover that enforcement is a separate problem. The system can count usage for invoicing, but it cannot enforce limits inside the product. An entitlements layer handles that.

A token entitlement defines what your system evaluates at runtime, including access status, usage limits, current usage, unlimited flags, and whether enforcement stops usage or allows overage.

Every AI request hits this layer. Your system checks the entitlement and returns this data, and the application decides what to do next:

  • Show a paywall
  • Trigger an upgrade prompt
  • Allow the request with an overage warning
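The runtime check described above can be sketched as a small data structure plus a decision function. The `TokenEntitlement` fields mirror the items listed earlier (access status, usage limit, current usage, unlimited flag, and whether overage is allowed), but the names are hypothetical. Which outcome maps to which condition is also an assumption here, since the application decides that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenEntitlement:
    """What the entitlement layer returns for one customer and feature."""
    has_access: bool
    usage_limit: int             # ignored when is_unlimited is True
    current_usage: int
    is_unlimited: bool = False
    allow_overage: bool = False  # True = soft limit, False = hard limit


def decide(ent, requested_tokens):
    """Map entitlement data to an application-level outcome."""
    if not ent.has_access:
        return "paywall"
    if ent.is_unlimited:
        return "allow"
    if ent.current_usage + requested_tokens <= ent.usage_limit:
        return "allow"
    # Limit would be exceeded: a soft limit lets the request through.
    return "allow_with_overage_warning" if ent.allow_overage else "upgrade_prompt"


print(decide(TokenEntitlement(False, 0, 0), 50))                         # paywall
print(decide(TokenEntitlement(True, 1000, 900), 50))                     # allow
print(decide(TokenEntitlement(True, 1000, 990), 50))                     # upgrade_prompt
print(decide(TokenEntitlement(True, 1000, 990, allow_overage=True), 50)) # allow_with_overage_warning
```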

An approach that starts with a counter and bolts on logic later breaks once edge cases accumulate.

Features should stay visible across plans so users can see what unlocks at higher tiers. Entitlements also need to resolve across multiple sources:

  • Active plan
  • Parent plan
  • Trials
  • Promotions

When more than one source applies, the system selects the most generous value. This logic gets complex fast and often breaks as plans and overrides increase.
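A minimal sketch of the "most generous value" rule is shown below. It assumes each source contributes either an integer token limit or `None` for unlimited. That convention and the source names are hypothetical.

```python
from typing import Optional


def resolve_limit(limits_by_source: dict[str, Optional[int]]) -> Optional[int]:
    """Pick the most generous limit across sources; None means unlimited."""
    if not limits_by_source:
        raise ValueError("no entitlement source grants this feature")
    values = list(limits_by_source.values())
    if any(v is None for v in values):
        return None  # any unlimited source wins
    return max(values)


sources = {
    "active_plan": 100_000,
    "parent_plan": 50_000,
    "trial": 250_000,
    "promotion": 150_000,
}
print(resolve_limit(sources))  # 250000
```

Even this small version has to handle the unlimited case and the case where no source applies. Expiring trials, per-customer overrides, and add-ons each introduce further cases.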

How credits work in AI pricing infrastructure

Credits act as the layer that tracks and limits token usage in AI products. They represent a consumable balance that decreases as tokens are consumed or actions are executed.

A credit can map directly to tokens, such as 1 credit equals 1,000 tokens. It can also abstract usage, where 1 credit represents an action like a generated response or agent run.

The system behind credits is more than a balance field. It needs to handle how credits are issued, consumed, and tracked over time.

These rules define how token usage gets translated into enforceable limits:

  • Block-level expiry: Credits are issued in batches with different expiration dates
  • Cost basis and categories: Paid and promotional credits need to be tracked separately
  • Burn order: Configurable rules for which credits deplete first, typically promotional before paid, with earlier expiry before later
  • Depletion behavior: Hard limits stop usage, while soft limits allow usage and create overage
  • Append-only ledger: Every change is recorded as an immutable event for audit and reconciliation

The credit ledger also needs to support independent audit and reconciliation, with every debit and credit event traceable to a specific request, customer, and timestamp.
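The burn order and ledger rules above can be sketched as follows. This is an illustrative in-memory version with hypothetical names. It assumes promotional credits burn before paid ones, that earlier expiry burns first within a category, that a block is usable through its expiry date, and that insufficient balance is a hard stop.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CreditBlock:
    block_id: str
    category: str      # "promotional" or "paid"
    remaining: int
    expires_on: date


def burn(blocks, amount, today, request_id, ledger):
    """Debit credits in burn order and append one ledger entry per block."""
    usable = [b for b in blocks if b.remaining > 0 and b.expires_on >= today]
    # Promotional first (False sorts before True), then earliest expiry.
    usable.sort(key=lambda b: (b.category != "promotional", b.expires_on))
    if sum(b.remaining for b in usable) < amount:
        raise ValueError("insufficient credits")  # hard-limit behavior
    left = amount
    for block in usable:
        if left == 0:
            break
        take = min(block.remaining, left)
        block.remaining -= take
        left -= take
        # Entries are only ever appended, never edited or deleted.
        ledger.append({"request_id": request_id, "block_id": block.block_id,
                       "delta": -take, "date": today.isoformat()})


blocks = [
    CreditBlock("paid_1", "paid", 100, date(2026, 12, 31)),
    CreditBlock("promo_a", "promotional", 30, date(2026, 9, 30)),
    CreditBlock("promo_b", "promotional", 20, date(2026, 6, 30)),
]
ledger = []
burn(blocks, 60, date(2026, 6, 1), "req_42", ledger)
print([(e["block_id"], e["delta"]) for e in ledger])
# [('promo_b', -20), ('promo_a', -30), ('paid_1', -10)]
```

A single 60-credit debit touches three blocks and produces three ledger entries, each traceable to the request that caused it.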

If the credit system is not structured correctly, token-based pricing breaks across plans and billing periods.

Why AI usage governance matters for token-based pricing

AI usage governance controls how token consumption is allocated, limited, and enforced across users and teams. Account-level token limits are not enough once usage spans multiple users, workflows, and departments.

At a small scale, a single balance works. At enterprise scale, it breaks. One workflow can consume a large share of tokens in a short time, leaving no control over how usage is distributed.

Enterprise systems need controls at a finer level:

  • Per-user and per-team token allocations from a shared balance
  • Limits on specific workflows or agent actions
  • Real-time alerts as usage approaches a threshold
  • Visibility into token consumption across teams
  • Approval flows for requests that exceed defined limits
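The first and third controls in this list (per-team allocations from a shared balance, and alerts near a threshold) can be sketched as below. The `SharedBudget` class, team names, and the 80% alert ratio are hypothetical. The sketch is single-threaded, so real use would need the atomic debit handling discussed earlier.

```python
class SharedBudget:
    """Account-level token balance with per-team caps (hypothetical model)."""

    def __init__(self, total_tokens, team_caps):
        self.remaining = total_tokens
        self.team_caps = dict(team_caps)
        self.team_used = {team: 0 for team in team_caps}

    def try_consume(self, team, tokens):
        """Apply the team cap first, then the shared account balance."""
        if team not in self.team_caps:
            return False, "unknown_team"
        if self.team_used[team] + tokens > self.team_caps[team]:
            return False, "team_cap_exceeded"
        if tokens > self.remaining:
            return False, "account_balance_exhausted"
        self.team_used[team] += tokens
        self.remaining -= tokens
        return True, "ok"

    def near_threshold(self, team, ratio=0.8):
        """True once a team has used at least `ratio` of its cap."""
        return self.team_used[team] >= ratio * self.team_caps[team]


budget = SharedBudget(1000, {"support": 600, "research": 600})
print(budget.try_consume("support", 500))   # (True, 'ok')
print(budget.near_threshold("support"))     # True
print(budget.try_consume("research", 600))  # (False, 'account_balance_exhausted')
print(budget.try_consume("support", 200))   # (False, 'team_cap_exceeded')
```

The two rejections have different causes, which matters for what the product shows the user: one team hit its own cap, the other was blocked by the account as a whole.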

These controls sit in the enforcement layer of the system. They run at request time, using cached data for fast decisions and fallback checks when needed.
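One way to picture a request-time check that uses cached data with a fallback is the generic sketch below. It is not Stigg's Sidecar or SDK. The `EntitlementCache` class, the TTL, the timeout, and the choice to return a stale or default value on timeout are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class EntitlementCache:
    """TTL cache in front of an upstream fetch, with a bounded wait on misses."""

    def __init__(self, fetch, ttl_seconds=30.0, timeout_seconds=0.1, fallback=None):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._timeout = timeout_seconds
        self._fallback = fallback
        self._entries = {}  # key -> (value, expires_at)
        self._pool = ThreadPoolExecutor(max_workers=4)

    def get(self, key):
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry and entry[1] > now:
            return entry[0], "cache_hit"
        # Cache miss: ask upstream, but never wait longer than the timeout.
        future = self._pool.submit(self._fetch, key)
        try:
            value = future.result(timeout=self._timeout)
        except FutureTimeout:
            if entry:
                return entry[0], "stale"
            return self._fallback, "fallback"
        self._entries[key] = (value, now + self._ttl)
        return value, "fetched"


def fast_fetch(key):
    return {"usage_limit": 1000}

def slow_fetch(key):
    time.sleep(0.3)  # simulates a slow upstream
    return {"usage_limit": 1000}

cache = EntitlementCache(fast_fetch)
first = cache.get("cust_123:tokens")
second = cache.get("cust_123:tokens")
slow_cache = EntitlementCache(slow_fetch, timeout_seconds=0.05,
                              fallback={"usage_limit": 0})
degraded = slow_cache.get("cust_456:tokens")
print(first[1], second[1], degraded[1])  # fetched cache_hit fallback
```

Whether the fallback should deny (as with `usage_limit: 0` here) or allow is a product decision: failing closed protects cost, while failing open protects the customer experience.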

Without this layer, token-based pricing breaks across teams. Limits apply at the account level, but usage happens at the user and workflow level, which leaves no defined point of control below the account.

When to build or buy token and credit infrastructure

Teams often build a token or credit system early. A credits table, a decrement function, and periodic resets cover the initial version.

Complexity arrives when token-based pricing stretches across these scenarios:

  • Mid-cycle plan changes: Upgrades require prorated credits, limit updates, and consistent enforcement across active sessions
  • Grandfathered plans: Legacy entitlements need to persist alongside new pricing models
  • Team-level usage: Shared token pools require a different data model, not just added logic
  • Audit requirements: Finance needs an append-only ledger that aligns with how tokens map to credits and revenue
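As a worked example of the mid-cycle upgrade case, the sketch below grants the difference between two plans' monthly credits, prorated linearly by days remaining in the cycle and rounded down. This is one hypothetical policy among several. The function name, plan sizes, and dates are made up.

```python
from datetime import date


def prorated_credit_grant(old_monthly, new_monthly,
                          cycle_start, cycle_end, change_date):
    """Credits to grant on upgrade, prorated by days left in the cycle."""
    total_days = (cycle_end - cycle_start).days
    remaining_days = (cycle_end - change_date).days
    if total_days <= 0 or not 0 <= remaining_days <= total_days:
        raise ValueError("change_date must fall within the billing cycle")
    # Integer arithmetic: multiply first, then floor-divide.
    return (new_monthly - old_monthly) * remaining_days // total_days


grant = prorated_credit_grant(
    old_monthly=100_000,
    new_monthly=400_000,
    cycle_start=date(2026, 6, 1),
    cycle_end=date(2026, 7, 1),
    change_date=date(2026, 6, 16),
)
print(grant)  # 150000
```

The calculation itself is short. The surrounding work is what grows: writing the grant to the ledger, updating the limit, and making sure sessions already in flight see the new entitlement.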

These are enforcement problems, not counting problems. The system needs to handle changing limits, multiple scopes, and consistent state across requests.

Most teams can build an initial version quickly. The complexity shows up later in enforcement, shared usage, and plan changes.

Where this breaks and what comes next

A token system works until enforcement moves beyond a single balance. Shared usage, plan changes, and concurrent requests introduce conflicts that a simple counter cannot handle.

The real issue is timing. Billing systems are designed to process usage once execution is complete, aggregating events and generating accurate invoices for finance teams. But enforcement operates on a different timeline. 

By the time token usage reaches billing, the inference has already run, the compute has already been consumed, and the cost already exists.

That’s why AI products end up needing a separate runtime layer. Billing systems see usage records, while the runtime layer sees the request itself and determines whether it should be allowed before cost gets created.

Stigg provides the usage runtime layer between the product and billing stack, built for enforcement decisions that traditional billing systems were never meant to own:

  • Event-based token metering tied directly to requests
  • Credit systems with expiry rules, categories, and controlled burn order
  • Entitlements that evaluate limits across plans, trials, and overrides
  • Low-latency entitlement checks with resilient local enforcement architecture
  • Usage governance across users, teams, agents, and workflows

This layer determines whether a request is allowed before compute cost is incurred. Billing systems continue handling usage records and invoicing downstream.

Once access control and credit enforcement are distributed across the codebase, the operational overhead adds up quickly. That is where dedicated runtime infrastructure becomes necessary. See how Stigg structures that layer for AI usage governance.

FAQs

1. What is a token in AI?

A token in AI is the smallest unit of text a language model processes and generates. It represents a fragment of text mapped to a numeric ID by the tokenizer. Token count determines usage, limits, and cost in LLM-based systems.

2. How many tokens is a word?

The number of tokens per word depends on the model and content. As a rule of thumb, English text averages about 0.75 words per token, so a word is roughly 1.3 tokens and 1,000 tokens is roughly 750 words. Code, technical text, and non-English content tend to produce more tokens for the same length.

3. What is the difference between tokens and credits?

The main difference between tokens and credits is abstraction. Tokens are the raw unit counted by the model. Credits are the unit customers purchase and consume, often mapped to tokens or actions. Systems must track both and enforce limits at request time.

4. What happens when a customer runs out of tokens?

When a customer runs out of tokens, the system either blocks the request or allows it and records overage. Hard limits stop execution and trigger a paywall or upgrade prompt. Soft limits allow execution and route the excess usage to billing.

5. Why do different AI models return different token counts?

Different AI models return different token counts because each uses its own tokenizer and vocabulary. The same input can produce different token sequences across models. This affects limits, credit consumption, and cost after a model change.