Infrastructure · 15 min read

KV Cache Management 2026: FP8 KV, MoE Memory Profiles, CPU/NVMe Offload, Multi-Tenant Isolation

吉田 遼, Senior Systems Engineer, LLM Serving
2026-04-22
Tags: KV Cache, FP8, MoE, LMCache, SGLang, vLLM, Multi-Tenant

Why KV Cache Is the Serving Bottleneck

Running a 70B-class dense model at 128K sequence length across 256 concurrent sessions demands over 1 TB of HBM for KV cache alone. MoE architectures (Mixtral-8x22B, DeepSeek-V3 lineage) are sparse in computation thanks to expert routing, but KV cache is fully materialized, so memory pressure is equivalent to or greater than a dense model. In 2026, serving teams spend the bulk of their day on one problem: how to pack KV cache tightly enough to keep GPUs fed. HBM alone is clearly insufficient. This article brings together FP8 quantization, paging, CPU/NVMe offload, RadixAttention, and multi-tenant isolation into a single design framework.
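To make the sizing claim concrete, here is a back-of-the-envelope calculation. It is only a sketch: the layer count, KV head count, and head dimension are assumed to match a Llama-3-70B-style GQA configuration (80 layers, 8 KV heads, head dimension 128), and other 70B-class models will differ.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    # The factor 2 covers the separate key and value tensors kept per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed model shape (Llama-3-70B-style GQA); not taken from the article.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
SEQ_LEN = 128 * 1024      # 128K tokens per session
SESSIONS = 256            # concurrent sessions

GIB = 1024 ** 3
TIB = 1024 ** 4

per_token_fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
per_session_fp16 = per_token_fp16 * SEQ_LEN
total_fp16 = per_session_fp16 * SESSIONS
total_fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1) * SEQ_LEN * SESSIONS

print(f"FP16 per token:   {per_token_fp16 / 1024:.0f} KiB")
print(f"FP16 per session: {per_session_fp16 / GIB:.0f} GiB")
print(f"FP16 total:       {total_fp16 / TIB:.0f} TiB")
print(f"FP8 total:        {total_fp8 / TIB:.0f} TiB")
```

Under these assumptions the FP16 footprint is 320 KiB per token, 40 GiB per session, and 10 TiB across 256 sessions, so the "over 1 TB" figure is, if anything, conservative. Halving the element size with FP8 still leaves 5 TiB.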

FP8 KV Cache Quality Tradeoffs

FP8 KV (typically E5M2 or E4M3) cuts HBM usage in half compared to FP16. Concerns about output quality degradation were prominent in 2025, but by 2026 both the research literature and internal benchmarks have converged on a consensus: "with the right quantization scheme, quality loss stays within 0.5%." Recommendations:

  • E5M2 (5 exponent bits, 2 mantissa bits): wider dynamic range, stronger on long-context and multilingual workloads. Slightly larger precision loss than E4M3, but fewer hallucination-style failures.
  • E4M3 (4 exponent bits, 3 mantissa bits): one more mantissa bit, so finer precision. Well-suited for code generation and mathematical reasoning. Outlier activations beyond its narrower range get clipped.
  • Per-channel scale + per-token shift: activation-aware quantization that absorbs outliers. Implemented in both SGLang 0.4.x and vLLM 0.7.x.

On KGA's quality benchmarks (MT-Bench, HumanEval, JMMLU, and four in-house Japanese RAG benchmarks), moving Llama-3.3-70B from FP16 KV to FP8 E5M2 KV lowers the aggregate score by only 0.3%, and Qwen3-72B by 0.6%. Meanwhile HBM usage for KV halves and concurrent session capacity on the same GPU increases by 1.8x. The recommended production default is E5M2, with E4M3 considered for code- and math-specific endpoints.
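The range-versus-precision trade-off between the two formats can be illustrated with a small pure-Python rounding simulation. This is a simplified sketch, not how SGLang or vLLM quantize KV: it rounds a single value to the nearest point on the FP8 grid, saturates at the format maximum, and ignores NaN/infinity encodings and scaling factors. The maxima used (448 for E4M3, 57344 for E5M2) are those of the common OCP-style FP8 formats, and the helper names are made up for illustration.

```python
import math


def quantize_fp8(x: float, mant_bits: int, min_exp: int, max_value: float) -> float:
    """Round x to the nearest value on a simplified FP8 grid, saturating at max_value."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_value)        # saturate instead of overflowing
    _, e = math.frexp(a)              # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, min_exp)         # exponent of the leading bit, clamped for subnormals
    step = 2.0 ** (exp - mant_bits)   # spacing between neighbouring values in this binade
    return sign * round(a / step) * step


def e4m3(x: float) -> float:
    # 3 mantissa bits, smallest normal exponent -6, largest finite value 448.
    return quantize_fp8(x, mant_bits=3, min_exp=-6, max_value=448.0)


def e5m2(x: float) -> float:
    # 2 mantissa bits, smallest normal exponent -14, largest finite value 57344.
    return quantize_fp8(x, mant_bits=2, min_exp=-14, max_value=57344.0)


for value in (1.07, 500.0):
    print(value, "-> E4M3:", e4m3(value), " E5M2:", e5m2(value))
```

For a typical small value such as 1.07, E4M3 lands on 1.125 while E5M2 lands on 1.0, which shows the extra mantissa bit at work. For an outlier such as 500, E4M3 clips to 448 while E5M2 still represents it (as 512). Per-channel scaling exists largely to keep values inside the range where this clipping does not occur.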

MoE Memory Profile

MoE models are often misunderstood due to routing sparsity, but KV cache is fully materialized for every token. In DeepSeek-V3-class models (671B total, 37B activated), expert parameters dominate HBM at first, but under long-context operation KV overtakes them.

Three MoE-specific KV design points matter. First, MLA (Multi-head Latent Attention) models learn a compressed KV representation during training, shrinking the KV footprint by over 70% compared to equivalent dense models. This dramatically reduces serving cost for DeepSeek-V3 and Qwen3-MoE lineage models. Second, mixing expert parallelism (GPU-level expert placement) with tensor parallelism (KV cache placement) creates load imbalance in which some GPUs' KV overflows before the others'. EP and TP must be cleanly separated, with KV distributed uniformly across all GPUs. Third, expert activation has hot/cold patterns: certain experts are accessed much more frequently, so prefetch design should account for this locality.
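The "over 70%" MLA figure can be sanity-checked with per-token arithmetic. The sketch below uses DeepSeek-V3's published attention settings (a 512-dimensional compressed KV latent plus a 64-dimensional decoupled RoPE key per layer). The comparison baseline, a GQA layer with 8 KV heads of dimension 128, is an assumption chosen for illustration; against a full multi-head baseline the reduction would be far larger.

```python
def gqa_elems_per_token_layer(num_kv_heads: int, head_dim: int) -> int:
    # Separate key and value vectors for every KV head.
    return 2 * num_kv_heads * head_dim


def mla_elems_per_token_layer(kv_lora_rank: int, rope_head_dim: int) -> int:
    # One shared compressed latent plus one decoupled RoPE key per token.
    return kv_lora_rank + rope_head_dim


mla = mla_elems_per_token_layer(kv_lora_rank=512, rope_head_dim=64)
gqa = gqa_elems_per_token_layer(num_kv_heads=8, head_dim=128)  # assumed baseline
reduction = 1 - mla / gqa

print(f"MLA: {mla} elems, GQA baseline: {gqa} elems, reduction: {reduction:.1%}")
```

With these numbers MLA caches 576 elements per token per layer against 2,048 for the assumed GQA baseline, a reduction of about 72%.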

Paging and CPU/NVMe Offload

PagedAttention manages KV in 16-token page units without requiring contiguous HBM allocation. In 2026, extending these "pages" across three tiers — HBM, CPU memory, and NVMe — has entered production use.
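The core bookkeeping behind paging is a per-sequence block table that maps logical pages to arbitrary physical pages. The following is a toy allocator to show the idea; the class and method names are hypothetical and it is not vLLM's implementation.

```python
class PagedKVAllocator:
    """Toy page allocator: each sequence owns a list of physical page ids."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical pages not in use
        self.block_tables = {}                    # seq_id -> [physical page ids]
        self.seq_lens = {}                        # seq_id -> tokens stored

    def append_tokens(self, seq_id: str, num_tokens: int) -> None:
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        # Ceiling division: pages required for new_len tokens, minus pages already held.
        pages_needed = -(-new_len // self.page_size) - len(table)
        if pages_needed > len(self.free_pages):
            raise MemoryError("out of KV pages; caller should evict or offload")
        for _ in range(pages_needed):
            table.append(self.free_pages.pop())
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        # Return every page of the sequence to the free list.
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_pages=8)
alloc.append_tokens("a", 20)   # 20 tokens -> 2 pages
alloc.append_tokens("b", 16)   # 16 tokens -> 1 page
alloc.append_tokens("a", 20)   # 40 tokens total -> 1 more page
print(alloc.block_tables)
```

Sequence "a" ends up with three pages that are not physically adjacent, because "b" was allocated in between. Since the unit of ownership is a page rather than a contiguous range, the same table can later point at pages that live in CPU memory or on NVMe.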

CPU offload: evict KV for idle sessions (user wait, long agent think steps) to CPU memory. PCIe transfer bandwidth on the way back is around 40 GB/s, enough to restore a 128K sequence KV for a 70B model in hundreds of milliseconds. Use vLLM's swap, SGLang's offload backend, or LMCache.

NVMe offload: push infrequently accessed sessions (idle for minutes to hours) to NVMe. Gen5 NVMe delivers an effective ~12 GB/s; a 128K KV restores in 2–3 seconds. For long-idle restores, pipeline a two-stage restore (NVMe → CPU, CPU → GPU) asynchronously.
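The restore latencies quoted above follow from size divided by bandwidth. The sketch below reproduces them under stated assumptions: a Llama-3-70B-style shape (80 layers, 8 KV heads, head dimension 128), sustained bandwidths of 40 GB/s for PCIe and 12 GB/s for NVMe, and no allowance for latency or setup cost. The pipelined figure is an idealized lower bound in which the two stages overlap perfectly.

```python
def restore_seconds(kv_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Transfer time at a sustained bandwidth, ignoring latency and setup cost."""
    return kv_bytes / (bandwidth_gb_per_s * 1e9)


TOKENS = 128 * 1024
ELEMS_PER_TOKEN = 2 * 80 * 8 * 128   # K and V; assumed 80 layers x 8 KV heads x dim 128
kv_fp8 = TOKENS * ELEMS_PER_TOKEN * 1    # 1 byte per element
kv_fp16 = TOKENS * ELEMS_PER_TOKEN * 2   # 2 bytes per element

PCIE_GB_PER_S = 40.0
NVME_GB_PER_S = 12.0

for name, size in (("FP8", kv_fp8), ("FP16", kv_fp16)):
    cpu_to_gpu = restore_seconds(size, PCIE_GB_PER_S)
    nvme_to_cpu = restore_seconds(size, NVME_GB_PER_S)
    sequential = nvme_to_cpu + cpu_to_gpu      # stages run back to back
    pipelined = max(nvme_to_cpu, cpu_to_gpu)   # stages fully overlapped
    print(f"{name}: CPU->GPU {cpu_to_gpu:.2f}s, NVMe->CPU {nvme_to_cpu:.2f}s, "
          f"sequential {sequential:.2f}s, pipelined >= {pipelined:.2f}s")
```

With FP8 KV this gives roughly 0.54 s for the PCIe hop and 1.79 s for the NVMe hop; with FP16 KV the figures double to about 1.07 s and 3.58 s. Running the two stages back to back at FP8 costs about 2.3 s, while overlapping them is bounded below by the slower NVMe stage.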

Tiering policy: across KGA client deployments, one heuristic has worked consistently. KV touched within the last 30 seconds stays in HBM, KV idle for 30 seconds to 10 minutes moves to CPU memory, and KV idle for more than 10 minutes moves to NVMe. In multi-tenant environments, each tenant needs its own TTL.
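The tiering heuristic reduces to a small lookup keyed on idle time, with per-tenant overrides. A minimal sketch follows; the tenant names and the premium thresholds are hypothetical, and only the 30-second and 10-minute defaults come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierTTL:
    hbm_seconds: float = 30.0    # idle time up to which KV stays in HBM
    cpu_seconds: float = 600.0   # idle time up to which KV stays in CPU memory


DEFAULT_TTL = TierTTL()

# Hypothetical per-tenant override: a premium tenant keeps KV in faster tiers longer.
TENANT_TTL = {"premium": TierTTL(hbm_seconds=120.0, cpu_seconds=3600.0)}


def target_tier(idle_seconds: float, tenant: str) -> str:
    """Return the tier a session's KV should live in given its idle time."""
    ttl = TENANT_TTL.get(tenant, DEFAULT_TTL)
    if idle_seconds <= ttl.hbm_seconds:
        return "hbm"
    if idle_seconds <= ttl.cpu_seconds:
        return "cpu"
    return "nvme"


for tenant, idle in (("basic", 10), ("basic", 45), ("premium", 45), ("basic", 900)):
    print(tenant, idle, "->", target_tier(idle, tenant))
```

A background sweeper would call something like `target_tier` periodically and issue the demotions; promotions back to HBM happen on demand when a session resumes.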

LMCache and SGLang RadixAttention Benchmarks

Benchmark collected in-house at KGA in Q1 2026. Workload: RAG chat (1.5K system prompt, 8K retrieved context, 200-token average user turn, 6 average turns). Model: Qwen3-72B FP8. Hardware: 8x H200 SXM.

  • vLLM prefix caching only: aggregate throughput 2,100 tok/s, TTFT p50 210ms, TTFT p99 720ms, prefill recomputation rate 38%.
  • SGLang RadixAttention: throughput 2,650 tok/s, TTFT p50 140ms, TTFT p99 510ms, prefill recomputation rate 17%.
  • vLLM + LMCache (local CPU+NVMe): throughput 2,450 tok/s, TTFT p50 160ms, TTFT p99 430ms, prefill recomputation rate 11%.
  • vLLM + LMCache (distributed, shared NVMe): throughput 2,380 tok/s, TTFT p50 180ms, TTFT p99 480ms, prefill recomputation rate 6%. Distributed mode doesn't beat node-local HBM reuse, but cluster-wide cache sharing has a notable impact on p99.

Conclusion: for a single node, SGLang RadixAttention; for multi-node shared cache, vLLM + LMCache. TensorRT-LLM has equivalent features but the above two lead in configuration flexibility.
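The prefill recomputation rate in these results depends on how much of each incoming prompt matches a cached prefix. The toy token trie below illustrates longest-prefix matching. It is not SGLang's RadixAttention, which compresses edges into a radix tree and evicts unused branches; the class name and token values are made up for illustration.

```python
class PrefixCache:
    """Toy token-level trie that records which prefixes have cached KV."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        # Walk or create one trie node per token.
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        # Length of the longest cached prefix of `tokens`.
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
system_prompt = list(range(100))             # stand-in for shared system prompt tokens
cache.insert(system_prompt + [1000, 1001])   # first request: prompt plus user turn

request = system_prompt + [2000, 2001, 2002]  # second request shares the prompt
reused = cache.match_len(request)
recompute_fraction = (len(request) - reused) / len(request)
print(reused, f"{recompute_fraction:.1%}")
```

Here 100 of the 103 tokens are served from cache and only the 3 new tokens need prefill. In the RAG workload above, the retrieved context differs between sessions, which is one plausible reason the recomputation rate stays well above zero in every configuration.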

Multi-Tenant Isolation and Fairness

Multi-tenant SaaS serving adds four design constraints at the KV layer.

Leakage risk: reusing KV for a system prompt that tenants genuinely share is harmless in itself, but a shared prefix cache opens a cache-timing side channel: by measuring TTFT, one tenant can probe whether a given prefix is already cached, and therefore whether another tenant has sent it. The possibility that one tenant's request is served from KV containing another tenant's data is also non-zero. For high-security verticals (finance, healthcare, government), the realistic answer is physical isolation: separate GPU processes or GPU groups per tenant with physically separate KV.

Fairness: naive LRU lets a high-volume tenant monopolize KV cache and degrade other tenants' TTFT. KGA recommends a hybrid policy: per-tenant KV quota, with LRU for quota overflows, but capping any single tenant at 50% during normal operation.
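One way to read the hybrid policy is as an LRU whose victim selection is quota-aware. The sketch below is an interpretation, not KGA's implementation: it treats each entry as one page, assumes the 50% cap applies to total cache capacity, and uses hypothetical class and tenant names.

```python
from collections import OrderedDict


class QuotaLRU:
    """Toy KV page cache: per-tenant quota, hard cap, LRU among over-quota tenants."""

    def __init__(self, capacity: int, quotas: dict, max_share: float = 0.5):
        self.capacity = capacity
        self.quotas = quotas                       # tenant -> guaranteed pages
        self.max_pages = int(capacity * max_share)  # hard cap per tenant (assumed)
        self.entries = OrderedDict()               # key -> tenant, oldest first
        self.used = {}                             # tenant -> pages currently held

    def _evict_one(self, prefer_tenant=None):
        # Pick the oldest entry of prefer_tenant if given, otherwise the oldest
        # entry of any tenant above its quota; fall back to the oldest overall.
        for key, tenant in self.entries.items():
            if prefer_tenant is not None:
                if tenant == prefer_tenant:
                    break
            elif self.used[tenant] > self.quotas.get(tenant, 0):
                break
        else:
            key, tenant = next(iter(self.entries.items()))
        del self.entries[key]
        self.used[tenant] -= 1
        return key

    def put(self, key: str, tenant: str) -> list:
        """Insert one page; return the list of keys evicted to make room."""
        if key in self.entries:
            self.entries.move_to_end(key)          # refresh recency
            return []
        evicted = []
        if self.used.get(tenant, 0) >= self.max_pages:
            evicted.append(self._evict_one(prefer_tenant=tenant))
        elif len(self.entries) >= self.capacity:
            evicted.append(self._evict_one())
        self.entries[key] = tenant
        self.used[tenant] = self.used.get(tenant, 0) + 1
        return evicted


cache = QuotaLRU(capacity=8, quotas={"a": 2, "b": 3, "c": 3})
cache.put("b1", "b")                     # oldest entry, tenant within quota
for k in ("a1", "a2", "a3", "a4"):
    cache.put(k, "a")                    # tenant "a" runs 2 pages over quota
for k in ("c1", "c2", "c3"):
    cache.put(k, "c")                    # cache is now full
victims = cache.put("b2", "b")
print(victims)
```

A naive LRU would evict `b1`, the oldest page, and penalize the low-volume tenant. Here the victim is `a1`, the oldest page of the tenant that is over quota. The same `put` also enforces the cap: a tenant already holding `max_pages` evicts its own oldest page even when the cache has free space.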

SLA tiering: premium tenants get guaranteed HBM residency; basic tenants are deprioritized to CPU/NVMe offload. Both vLLM and SGLang have pluggable schedulers for implementing custom policies.

Visibility: a Prometheus exporter dashboarding per-tenant KV hit rate, eviction rate, and quota utilization allows billing and capacity planning to stay in sync. KGA's standard stack visualizes this in Grafana.

Configuration Example: vLLM + LMCache + FP8 KV

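```python
from vllm import LLM
from lmcache.integration.vllm import LMCacheConnector

llm = LLM(
    model="Qwen/Qwen3-72B-Instruct",
    tensor_parallel_size=8,           # shard across the 8 GPUs of one node
    kv_cache_dtype="fp8_e5m2",        # FP8 KV cache, E5M2 variant
    enable_prefix_caching=True,       # reuse KV for requests sharing a prefix
    enable_chunked_prefill=True,      # split long prefills into chunks
    max_num_batched_tokens=8192,
    kv_transfer_config={
        "kv_connector": "LMCacheConnector",  # connector is selected by name
        "kv_role": "kv_both",                # this instance both stores and loads KV
        "kv_buffer_size": 5e9,               # bytes
    },
)
```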

On the LMCache side, allocate 256 GB for the CPU tier and 2 TB for the NVMe tier, with logical namespaces per tenant for quota management.

Summary

In 2026, KV cache is no longer "brute-force fit everything into HBM." The full six-layer strategy — halve capacity with FP8 KV, reduce it structurally with MLA-class models, eliminate fragmentation with PagedAttention, leverage prefix sharing with LMCache or RadixAttention, preserve long-lived KV with CPU/NVMe tiering, and protect service quality with multi-tenant isolation and fairness policies — all must operate simultaneously to achieve thousands of tok/s effective throughput per node alongside a p99 TTFT SLO of 500ms. In 2026, whoever masters KV cache controls LLM serving.
