Batching Strategy Is What Sets Your SLO
In LLM serving architecture, nothing affects SLO attainment more directly than how you construct your batches. No matter how powerful the GPU, simultaneously meeting TTFT (Time To First Token), ITL (Inter-Token Latency), and aggregate throughput targets requires batching strategies designed to fit the actual distribution of request arrival patterns, input lengths, output lengths, and KV cache occupancy. This post covers the five techniques deployed in production in 2026 — continuous batching, chunked prefill, PagedAttention, RadixAttention, and sorted batching — their characteristics, their SLO impact, and how they compose with prompt caching.
Continuous Batching
Continuous batching (also called iteration-level scheduling) reconstructs the batch at every decode step. Where static batching holds a batch until all requests complete before accepting new ones, continuous batching immediately removes completed requests and fills their slots with new arrivals. The result is GPU utilization that is much less sensitive to variance in request length — particularly effective for chat workloads with high output-length variance, where effective throughput gains of 2–4× are common.
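The slot-refill behavior described above can be illustrated with a toy step counter. This is a minimal sketch, not how any serving framework implements its scheduler: the function names, the request lengths, and the batch size are all made up for illustration, and prefill cost is ignored entirely.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when finished requests are replaced at every iteration."""
    waiting = deque(output_lens)
    running = []  # remaining output tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Refill free slots before each decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token for every running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths with high variance, as in chat traffic.
lens = [10, 200, 15, 180, 12, 190, 8, 170]
print(static_batching_steps(lens, batch_size=2))
print(continuous_batching_steps(lens, batch_size=2))
```

On this toy input the static schedule needs 740 decode steps and the continuous one 395, because short requests no longer hold a slot idle while a long neighbor finishes.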
Continuous batching alone, however, leaves a problem: prefill and decode collide within the same step. A request with a long prompt (8K+ tokens) can take hundreds of milliseconds to prefill, stalling decode for every other in-flight request. This prefill stall is what creates the long tail in ITL distributions.
Chunked Prefill
Chunked prefill splits long prompts into small chunks (typically 512–2048 tokens) and processes "one prefill chunk + other requests' decode" simultaneously within each iteration. This spreads prefill load across multiple steps and dramatically reduces ITL jitter. For any system with a p99 ITL SLO of 50ms or tighter, chunked prefill is non-negotiable — and it is standard in vLLM, SGLang, and TensorRT-LLM.
Chunk size tuning matters: too small and kernel launch overhead dominates; too large and decodes get stalled again. KGA's rule of thumb, which sits above the common defaults, is 4096–8192 tokens on H200 hardware and 8192–16384 tokens on B200. A secondary benefit: mixing compute-bound prefill with memory-bound decode in the same step keeps both the compute units and memory bandwidth busy, improving MFU above what either workload achieves alone.
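One way to picture the per-iteration mixing is a token budget that decodes draw from first, with the remainder handed to pending prefills as a chunk. The sketch below assumes that decode-first policy; `Request`, `plan_iteration`, and the budget value are hypothetical names and numbers, not any framework's API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len


def plan_iteration(requests, token_budget):
    """Return {req_id: tokens scheduled} for one iteration.

    Decoding requests get one token each first; whatever budget remains
    is given to pending prefills as chunks.
    """
    plan = {}
    budget = token_budget
    for r in requests:
        if r.in_decode and budget > 0:
            plan[r.req_id] = 1
            budget -= 1
    for r in requests:
        if not r.in_decode and budget > 0:
            chunk = min(budget, r.prompt_len - r.prefilled)
            plan[r.req_id] = chunk
            r.prefilled += chunk
            budget -= chunk
    return plan


requests = [
    Request("chat-1", prompt_len=100, prefilled=100),  # already decoding
    Request("chat-2", prompt_len=50, prefilled=50),    # already decoding
    Request("long", prompt_len=5000),                  # needs prefill
]
for step in range(4):
    print(step, plan_iteration(requests, token_budget=2048))
```

With a 2048-token budget, the 5,000-token prompt is absorbed over three iterations while both chat requests keep receiving one token per step, which is the jitter reduction the text describes.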
PagedAttention
PagedAttention, the basis of vLLM, manages KV cache in fixed-size pages (typically 16 tokens each) to eliminate fragmentation. KV cache memory consumed for the same workload drops 2–4× compared to contiguous allocation. As of 2026, most serving frameworks, including TensorRT-LLM, SGLang, and DeepSpeed-Inference, have adopted similar paging mechanisms.
PagedAttention is also the foundation for prefix sharing: multiple requests using the same system prompt reference the same physical pages, holding only a single copy of the KV. However, vLLM's native prefix caching works only on exact-match prefixes. For partial matches and cross-tenant optimization, RadixAttention-style approaches win.
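The page-sharing idea can be sketched with a reference-counted allocator. This is a simplified illustration, not vLLM's block manager: `BlockAllocator`, `build_block_table`, and the token counts are hypothetical, and only full pages are shared (a partially filled last page would need copy-on-write, which is omitted).

```python
class BlockAllocator:
    """Toy allocator for fixed-size KV pages with reference counting."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}  # physical block id -> number of holders

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


def blocks_needed(num_tokens, block_size):
    return -(-num_tokens // block_size)  # ceiling division


def build_block_table(alloc, shared_prefix, own_tokens):
    """Logical-to-physical mapping: shared prefix pages, then private pages."""
    table = [alloc.share(b) for b in shared_prefix]
    n_own = blocks_needed(own_tokens, alloc.block_size)
    table += [alloc.allocate() for _ in range(n_own)]
    return table


alloc = BlockAllocator(num_blocks=64)
# A 48-token system prompt fills exactly three 16-token pages.
prefix = [alloc.allocate() for _ in range(blocks_needed(48, alloc.block_size))]
table_a = build_block_table(alloc, prefix, own_tokens=20)
table_b = build_block_table(alloc, prefix, own_tokens=20)
print(len(alloc.refcount))  # physical pages in use
```

Both requests point at the same three prefix pages, so seven physical pages are in use instead of the ten that two private copies would need.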
RadixAttention
SGLang's RadixAttention manages KV cache as a radix tree, enabling sharing of arbitrary prefix fragments. System prompts, few-shot examples, tool definitions, and tenant-common context are all automatically detected and deduplicated down to a single physical KV location.
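The lookup at the heart of this scheme is a longest-prefix match over token IDs. The sketch below uses a plain per-token trie rather than a compressed radix tree, and it tracks only which prefixes exist, not the KV pages themselves; `PrefixTree` and its methods are hypothetical names, not SGLang's API.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode


class PrefixTree:
    """Per-token trie recording which token prefixes already have cached KV."""

    def __init__(self):
        self.root = PrefixNode()

    def match_len(self, tokens):
        """Number of leading tokens whose KV could be reused."""
        node, matched = self.root, 0
        for t in tokens:
            node = node.children.get(t)
            if node is None:
                break
            matched += 1
        return matched

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())


tree = PrefixTree()
system_prompt = [1, 2, 3]              # hypothetical token ids
tree.insert(system_prompt + [4, 5, 6])  # first request

second = system_prompt + [9, 9]
reused = tree.match_len(second)
print(reused, len(second) - reused)     # tokens reused, tokens left to prefill
tree.insert(second)
```

A real radix tree collapses single-child chains into one edge, which keeps the structure compact for long prompts; the matching logic is otherwise the same.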
In benchmarks on multi-turn chat with a shared system prompt (average 8 turns, 256 concurrent sessions), RadixAttention achieves a 40–60% reduction in prefill cost and a 30–45% improvement in median TTFT compared to vLLM's prefix caching. For agent workloads with large tool definitions, the gap can reach 2–3×.
The hard part is eviction policy design. Naive LRU produces TTFT spikes when a hot tenant's cache is evicted by a cold one. KGA uses weighted LRU with per-tenant priority scores, or hard partitioning with fixed per-tenant allocations, depending on the use case.
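As a rough illustration of the weighted variant, the sketch below scores each cache entry by idle time divided by a per-tenant weight and evicts the highest score. The scoring formula, the weight scale, and all names are assumptions made for this example; the text does not specify how KGA computes its priority scores.

```python
def pick_victim(entries, tenant_weight, now):
    """Return the key to evict.

    entries: iterable of (key, tenant, last_used) tuples.
    tenant_weight: higher weight = more protected (hypothetical scale).
    The entry with the largest idle_time / weight is evicted.
    """
    def eviction_score(entry):
        _key, tenant, last_used = entry
        return (now - last_used) / tenant_weight.get(tenant, 1.0)

    return max(entries, key=eviction_score)[0]


entries = [
    ("hot-system-prompt", "tenant-hot", 70.0),  # idle for 30 s
    ("cold-session", "tenant-cold", 95.0),      # idle for 5 s
]
weights = {"tenant-hot": 10.0, "tenant-cold": 1.0}

print(pick_victim(entries, weights, now=100.0))  # weighted LRU
print(pick_victim(entries, {}, now=100.0))       # equal weights = plain LRU
```

With equal weights the hot tenant's system prompt is evicted because it has been idle longest; with the weights applied, the cold tenant's session goes first, which avoids the TTFT spike described above.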
Sorted Batching
Sorted batching groups requests with similar input or remaining output lengths together. Naive continuous batching with a mix of long and short requests incurs padding and padding mask inefficiency; sorted batching mitigates this. In 2026, this is typically implemented in the upper-layer router or scheduler rather than in the serving framework itself.
Separating short-response endpoints (classification, reranking, short summarization) from long-response endpoints (report generation, long-form code generation) and applying sorted batching within each is the simplest approach, and it yields a measured 20–35% improvement in effective throughput.
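A router-level version of this grouping can be as simple as bucketing requests by expected length. The sketch below is one possible shape, with hypothetical bucket boundaries and request names; how the expected length is estimated (endpoint type, `max_tokens`, a predictor) is left open.

```python
import bisect
from collections import defaultdict


def bucket_by_length(requests, boundaries):
    """Group request ids by expected length.

    requests: iterable of (req_id, expected_len).
    boundaries: sorted inclusive upper bounds; lengths above the last
    boundary fall into one extra overflow bucket.
    """
    buckets = defaultdict(list)
    for req_id, length in requests:
        idx = bisect.bisect_left(boundaries, length)
        buckets[idx].append(req_id)
    return dict(buckets)


incoming = [
    ("classify-1", 40),
    ("report-1", 4000),
    ("rerank-1", 120),
    ("summary-1", 900),
]
# Bucket 0: up to 256 tokens, bucket 1: up to 2048, bucket 2: everything longer.
print(bucket_by_length(incoming, boundaries=[256, 2048]))
```

Each bucket can then be routed to its own replica pool or scheduled as its own batch, so short requests are not held alongside long ones.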
SLO Impact Summary
- TTFT: Dominated by chunked prefill and prefix caching / RadixAttention. With both in place, p50 TTFT is typically around 100ms; p99 stays under 500ms.
- ITL: Continuous batching and chunked prefill both contribute. Without chunked prefill, p99 ITL easily exceeds 200ms.
- Throughput: Maximized by the combination of PagedAttention for memory efficiency, continuous batching for slot utilization, and RadixAttention for reducing wasted prefill.
- p99 tail: Protected by sorted batching and admission control. A queuing admission policy for low-priority requests is necessary during QPS spikes.
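The queuing admission policy mentioned in the last bullet might look like the following. This is a minimal sketch under assumed semantics (a fixed concurrency cap, lower number means higher priority, FIFO within a priority level); `AdmissionQueue` is a hypothetical class, not part of any serving framework.

```python
import heapq
import itertools


class AdmissionQueue:
    """Admit up to max_running requests; queue the rest by priority."""

    def __init__(self, max_running):
        self.max_running = max_running
        self.running = 0
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-break within a priority

    def submit(self, req_id, priority):
        """Lower priority value = more important. True if admitted now."""
        if self.running < self.max_running:
            self.running += 1
            return True
        heapq.heappush(self._heap, (priority, next(self._seq), req_id))
        return False

    def finish(self):
        """Call when a running request completes; returns the next admitted id."""
        self.running -= 1
        if self._heap:
            _, _, req_id = heapq.heappop(self._heap)
            self.running += 1
            return req_id
        return None


q = AdmissionQueue(max_running=2)
q.submit("chat-a", priority=0)
q.submit("chat-b", priority=0)
q.submit("batch-job", priority=5)  # queued: capacity reached
q.submit("chat-c", priority=0)     # queued, but ahead of batch-job
print(q.finish())                  # next request admitted when a slot frees
```

During a QPS spike, interactive requests jump ahead of queued low-priority work, so the tail latency cost lands on the traffic that can tolerate it.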
Prompt Cache Strategy
Four axes for designing prompt caching in a production stack:
System prompt sharing: Pre-warm common system prompts at startup and keep them resident. For RadixAttention, the standard practice is issuing warmup requests in the startup script to populate the KV tree.
Tenant partitioning: In a multi-tenant SaaS context, mixing tenant A and B context raises security and compliance concerns. RadixAttention provides no cryptographic isolation, so in regulated domains (finance, healthcare), the correct design separates tenants at the process or GPU group level, limiting the shared layer to non-sensitive system prompts only.
TTL and eviction: For long-session chat agents, caching intermediate multi-turn KV in a temporary cache and evicting it N minutes after the conversation pauses works well. KGA clients commonly set TTLs of 15–30 minutes to balance memory pressure against re-prefill cost.
Cross-request deduplication: Agent workloads frequently pass the same tool responses to multiple LLM calls. Fingerprinting those tool responses and loading them into the KV tree only once reduces TTFT by 30% or more.
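The fingerprinting and TTL ideas from the last two points can be combined in a small bookkeeping layer. The sketch below only tracks which tool responses are considered cached; it does not touch real KV memory. `ToolResponseCache` is a hypothetical helper, and including the tenant ID in the hash is an assumption made here to respect the partitioning concern above.

```python
import hashlib


class ToolResponseCache:
    """Tracks fingerprints of tool responses whose KV is assumed cached."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._last_used = {}  # fingerprint -> last access timestamp

    @staticmethod
    def fingerprint(tenant_id, text):
        # The tenant id is part of the hash so tenants never share entries.
        h = hashlib.sha256()
        h.update(tenant_id.encode("utf-8"))
        h.update(b"\x00")
        h.update(text.encode("utf-8"))
        return h.hexdigest()

    def evict_expired(self, now):
        expired = [fp for fp, ts in self._last_used.items() if now - ts > self.ttl]
        for fp in expired:
            del self._last_used[fp]

    def lookup_or_add(self, tenant_id, text, now):
        """Return True on a hit, meaning prefill for this text can be skipped."""
        self.evict_expired(now)
        fp = self.fingerprint(tenant_id, text)
        hit = fp in self._last_used
        self._last_used[fp] = now  # insert, or refresh the TTL on a hit
        return hit


cache = ToolResponseCache(ttl_seconds=20 * 60)  # within the 15–30 minute range
response = '{"rows": 128, "status": "ok"}'      # hypothetical tool output
print(cache.lookup_or_add("tenant-a", response, now=0))
print(cache.lookup_or_add("tenant-a", response, now=60))
print(cache.lookup_or_add("tenant-b", response, now=60))
```

The second call for the same tenant is a hit, while the identical text from another tenant is treated as new, keeping the shared layer tenant-scoped.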
Configuration Example: vLLM + Chunked Prefill + Prefix Caching
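```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-72B-Instruct",
    tensor_parallel_size=8,        # shard the model across 8 GPUs
    kv_cache_dtype="fp8",          # FP8 KV cache to fit more sequences
    enable_prefix_caching=True,    # reuse KV pages for shared prefixes
    enable_chunked_prefill=True,   # interleave prefill chunks with decode
    max_num_batched_tokens=8192,   # per-iteration token budget
    max_num_seqs=256,              # cap on concurrently running sequences
    gpu_memory_utilization=0.92,
    block_size=16,                 # tokens per KV cache page
    preemption_mode="recompute",   # recompute KV on preemption instead of swapping
)
```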
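Setting `max_num_seqs` too high causes frequent preemption under memory pressure (recomputation with the setting above, swapping otherwise). In practice, back-calculate from peak QPS and p99 output length: `QPS_peak × p99_output_length / expected_decode_rate`, with `expected_decode_rate` taken as the per-sequence decode rate in tokens per second, is roughly the upper bound in KGA's experience.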
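The formula is an application of Little's law: concurrent sequences equal the arrival rate times the time each request spends decoding. The worked example below assumes `expected_decode_rate` is the per-sequence rate in tokens per second; the function name and all input numbers are hypothetical, chosen only to show the arithmetic.

```python
import math


def max_num_seqs_upper_bound(qps_peak, p99_output_len, decode_rate_per_seq):
    """Rough upper bound on concurrent sequences.

    qps_peak: peak request arrival rate (requests per second).
    p99_output_len: p99 output length (tokens).
    decode_rate_per_seq: expected tokens per second for one sequence.
    """
    decode_seconds = p99_output_len / decode_rate_per_seq
    return math.ceil(qps_peak * decode_seconds)


# Hypothetical workload: 8 req/s peak, 800-token p99 output, 25 tok/s per sequence.
bound = max_num_seqs_upper_bound(qps_peak=8, p99_output_len=800, decode_rate_per_seq=25)
print(bound)
```

With these inputs each request decodes for 32 seconds, so about 256 sequences are in flight at peak. Because the calculation uses the p99 output length for every request, it overestimates typical concurrency, which is why it serves as an upper bound rather than a target.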
Conclusion
Batching strategy in 2026 is not a matter of choosing one of five techniques. The correct answer is to enable all of them and tune the parameters to your SLO. To protect TTFT: chunked prefill and RadixAttention. To protect ITL: chunked prefill and sorted batching. To maximize throughput: continuous batching and PagedAttention. To protect tenant isolation: hard partitioning and tenant-aware eviction. Designing around these four axes, an 8× H200 SXM system running a 70B model can realistically target aggregate 2,500 tok/s and p99 TTFT under 400ms.