Deployment

Token Budget

Definition

The total number of tokens allocated for a model request, encompassing both input (prompt + context) and output. Managing token budgets is central to controlling inference cost in production LLM applications, especially when processing long documents or maintaining conversational history.
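As a rough illustration of budgeting in practice, the sketch below reserves part of the budget for output and keeps only the most recent conversation messages that fit in what remains. The names `count_tokens` and `fit_history` are hypothetical, and the whitespace word count is only a stand-in for a real tokenizer, which would give different numbers.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def fit_history(
    system_prompt: str,
    history: List[str],
    total_budget: int,
    reserved_output: int,
) -> List[str]:
    """Return the most recent messages that fit in the input share of the budget."""
    # Input share = total budget minus the output reservation and the system prompt.
    input_budget = total_budget - reserved_output - count_tokens(system_prompt)
    if input_budget < 0:
        raise ValueError("system prompt and reserved output exceed the total budget")

    kept: List[str] = []
    used = 0
    # Walk from newest to oldest so the most recent turns are kept first.
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > input_budget:
            break
        kept.append(message)
        used += cost

    kept.reverse()  # restore chronological order
    return kept
```

Dropping the oldest turns first is only one possible policy; summarizing older history is a common alternative when earlier context still matters.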

Related Terms

Context Window

The maximum amount of text (measured in tokens) an LLM can process in a single request, encompassing both the prompt and the generated output. Larger context windows, which now exceed 1 million tokens in some models, enable processing of long documents, codebases, and meeting transcripts in one pass.

Inference Cost

The compute and financial cost of running a model to produce a single prediction or generated response. Inference cost is often the dominant AI operational expenditure at scale and is managed through caching, batching, and model compression techniques such as quantization.
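For hosted models that bill per token, the financial side of inference cost can be estimated with simple arithmetic. The sketch below assumes separate per-million-token prices for input and output. The function name `estimate_cost` and the prices in the usage note are hypothetical, not taken from any provider.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the cost of one request under per-token pricing (an assumption)."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost
```

With illustrative prices of 1.00 per million input tokens and 4.00 per million output tokens, a request with 2,000 input tokens and 500 output tokens comes to about 0.004 in the same currency unit. Multiplying by expected request volume gives a first-order view of spend at scale.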

Prompt Cache

A technique that caches the key-value (KV) attention states of a repeated prompt prefix, so subsequent requests reuse the precomputed states rather than recomputing them. Prompt caching can significantly reduce latency and cost for applications with long system prompts.
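The reuse idea can be shown with a toy cache keyed by a hash of the prompt prefix. This is a simplified sketch, not how any particular inference server implements it: the class `PrefixCache` is hypothetical, and the `compute_state` callable stands in for the expensive step that would produce real KV attention states.

```python
import hashlib
from typing import Callable, Dict


class PrefixCache:
    """Toy prefix cache: stores one computed state per distinct prompt prefix."""

    def __init__(self, compute_state: Callable[[str], object]) -> None:
        # compute_state is a placeholder for the expensive prefix computation.
        self._compute_state = compute_state
        self._states: Dict[str, object] = {}
        self.hits = 0
        self.misses = 0

    def get_state(self, prefix: str) -> object:
        """Return the cached state for this prefix, computing it only on a miss."""
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key in self._states:
            self.hits += 1
        else:
            self.misses += 1
            self._states[key] = self._compute_state(prefix)
        return self._states[key]
```

The sketch only matches identical prefixes and never evicts entries, so it illustrates the principle rather than a production design.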
