LLM Serving Batching Strategies 2026: Continuous Batching, Chunked Prefill, RadixAttention — KGA Tech Blog

Batching Strategy Is What Sets Your SLO

In LLM serving architecture, nothing affects SLO attainment more directly than how you construct your batches. No matter how powerful the GPU, simultaneously meeting TTFT (Time To First Token), ITL (Inter-Token Latency), and aggregate throughput targets requires batching strategies designed to fit the actual distribution of request arrival patterns, input lengths, output lengths, and KV cache occupancy. This post covers the five techniques deployed in production in 2026 — continuous batching, chunked prefill, PagedAttention, RadixAttention, and sorted batching — their characteristics, their SLO impact, and how they compose with prompt caching.

Continuous Batching

Continuous batching (also called iteration-level scheduling) reconstructs the batch at every decode step. Where static batching holds a batch until all requests complete before accepting new ones, continuous batching immediately removes completed requests and fills their slots with new arrivals. The result is GPU utilization that is much less sensitive to variance in request length — particularly effective for chat workloads with high output-length variance, where effective throughput gains of 2–4× are common.
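The difference is easy to see in a toy model. The sketch below counts decode steps for both policies, assuming every request is already queued, each step emits one token per running request, and prefill cost is ignored. The function names and the request lengths are illustrative only.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when freed slots are refilled at every iteration."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Iteration-level scheduling: refill free slots before each step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each running request emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical chat-like mix: mostly short replies, a few long ones.
lengths = [4, 32, 4, 4, 32, 4, 4, 4]
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
```

With these made-up lengths the static policy needs 64 steps and the continuous policy 36, because short requests no longer wait for the longest request in their batch.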

Continuous batching alone, however, leaves a problem: prefill and decode collide within the same step. A request with a long prompt (8K+ tokens) can take hundreds of milliseconds to prefill, stalling decode for every other in-flight request. This prefill stall is what creates the long tail in ITL distributions.

Chunked Prefill

Chunked prefill splits long prompts into small chunks (typically 512–2048 tokens) and processes "one prefill chunk + other requests' decode" simultaneously within each iteration. This spreads prefill load across multiple steps and dramatically reduces ITL jitter. For any system with a p99 ITL SLO of 50ms or tighter, chunked prefill is non-negotiable — and it is standard in vLLM, SGLang, and TensorRT-LLM.

Chunk size tuning matters: too small and kernel launch overhead dominates; too large and decodes stall again. KGA's rule of thumb is 4096–8192 tokens on H200 hardware and 8192–16384 tokens on B200. A secondary benefit: mixing compute-bound prefill with memory-bound decode in the same step keeps both the compute units and memory bandwidth busy, which lifts MFU above what either workload achieves alone.
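As a rough illustration of the scheduling idea, the sketch below plans one iteration under a per-step token budget: decodes are scheduled first at one token each, and the remaining budget goes to prefill chunks. The `Request` class, `schedule_step`, and the budget of 2048 are hypothetical and do not mirror any framework's internals.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens whose KV has already been computed

    @property
    def in_decode(self):
        return self.prefilled >= self.prompt_len


def schedule_step(requests, token_budget):
    """Return [(rid, num_tokens)] for one iteration under a token budget."""
    plan = []
    budget = token_budget
    # Decodes go first so in-flight requests keep emitting tokens.
    for r in requests:
        if r.in_decode and budget > 0:
            plan.append((r.rid, 1))
            budget -= 1
    # Whatever budget is left is spent on prefill chunks, in arrival order.
    for r in requests:
        if not r.in_decode and budget > 0:
            chunk = min(budget, r.prompt_len - r.prefilled)
            r.prefilled += chunk
            plan.append((r.rid, chunk))
            budget -= chunk
    return plan


reqs = [Request("chat-1", 10, 10), Request("chat-2", 20, 20), Request("long", 5000)]
for _ in range(4):
    print(schedule_step(reqs, token_budget=2048))
```

In this toy run the 5000-token prompt is spread over three steps (2046, 2046, and 908 tokens) while both chat requests decode in every step; on the fourth step the long request joins them in decode.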

PagedAttention

PagedAttention, the basis of vLLM, manages KV cache in fixed-size pages (typically 16 tokens each) to eliminate fragmentation. Memory lost to fragmentation and over-reservation drops sharply, and effective KV cache capacity improves 2–4× compared to contiguous allocation. As of 2026, most serving frameworks, including TensorRT-LLM, SGLang, and DeepSpeed-Inference, have adopted similar paging mechanisms.

PagedAttention is also the foundation for prefix sharing: multiple requests using the same system prompt reference the same physical pages, holding only a single copy of the KV. However, vLLM's native prefix caching works only on exact-match prefixes. For partial matches and cross-tenant optimization, RadixAttention-style approaches win.
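A minimal sketch of page-level prefix sharing follows. It is a toy allocator, not vLLM's implementation: the `BlockAllocator` class is hypothetical, shared blocks are keyed by the full token prefix (real systems use hashes), and freeing is omitted for brevity.

```python
class BlockAllocator:
    """Toy KV page pool with reference-counted prefix sharing."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}      # block id -> number of sequences using it
        self.prefix_index = {}  # full token prefix (tuple) -> shared block id

    def _new_block(self):
        block = self.free.pop()
        self.refcount[block] = 0
        return block

    def allocate(self, tokens):
        """Return the block table for `tokens`, reusing full prefix blocks."""
        table = []
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            # A block is shareable only if everything before it matches too.
            key = tuple(tokens[:end])
            if key not in self.prefix_index:
                self.prefix_index[key] = self._new_block()
            block = self.prefix_index[key]
            self.refcount[block] += 1
            table.append(block)
        if full < len(tokens):
            # The trailing partial block is private to this sequence.
            block = self._new_block()
            self.refcount[block] = 1
            table.append(block)
        return table


system_prompt = list(range(32))  # stand-in token ids for a shared system prompt
alloc = BlockAllocator(num_blocks=8)
a = alloc.allocate(system_prompt + [100, 101, 102])
b = alloc.allocate(system_prompt + [200, 201])
print(a, b, len(alloc.free))
```

The two sequences share the two pages that hold the system prompt and each gets one private page, so four pages are in use rather than six.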

RadixAttention

SGLang's RadixAttention manages KV cache as a radix tree, enabling sharing of arbitrary prefix fragments. System prompts, few-shot examples, tool definitions, and tenant-common context are all automatically detected and deduplicated down to a single physical KV location.
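The core lookup can be sketched with a plain token-level trie. This is a simplification under stated assumptions: a real radix tree compresses runs of tokens into single edges and attaches KV storage to nodes, and the `PrefixTree` class here is hypothetical rather than SGLang's API.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode


class PrefixTree:
    """Uncompressed trie over token ids; each node stands for cached KV."""

    def __init__(self):
        self.root = PrefixNode()

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node, matched = self.root, 0
        for i, tok in enumerate(tokens):
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, i + 1
        # Only the unmatched suffix would need prefill; record it for later requests.
        for tok in tokens[matched:]:
            node.children[tok] = PrefixNode()
            node = node.children[tok]
        return matched


tree = PrefixTree()
print(tree.match_and_insert([1, 2, 3, 4]))  # nothing cached yet
print(tree.match_and_insert([1, 2, 3, 9]))  # shares the first three tokens
```

The returned count is the number of prompt tokens whose prefill can be skipped; everything after it still has to be computed.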

In benchmarks on multi-turn chat with a shared system prompt (average 8 turns, 256 concurrent sessions), RadixAttention achieves 40–60% reduction in prefill cost and 30–45% improvement in median TTFT compared to vLLM's prefix caching. For agent workloads with large tool definitions, the gap can reach 2–3×.

The hard part is eviction policy design. Naive LRU produces TTFT spikes when a hot tenant's cache is evicted by a cold one. KGA uses weighted LRU with per-tenant priority scores, or hard partitioning with fixed per-tenant allocations, depending on the use case.
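One way to express weighted LRU is to divide each entry's idle time by its tenant's priority weight and evict the entry with the largest result. The sketch below assumes that scoring rule; `pick_victim`, the weights, and the timestamps are illustrative and not a description of KGA's production policy.

```python
def pick_victim(entries, tenant_weight, now):
    """Pick the cache key to evict.

    entries: {key: (tenant, last_access_time)}
    tenant_weight: {tenant: positive weight}; higher means better protected.
    """
    def score(item):
        _, (tenant, last_access) = item
        # Idle time shrinks for high-priority tenants, so they are evicted later.
        return (now - last_access) / tenant_weight.get(tenant, 1.0)

    return max(entries.items(), key=score)[0]


entries = {
    "hot-tenant/system-prompt": ("hot", 90.0),
    "cold-tenant/system-prompt": ("cold", 95.0),
}
weights = {"hot": 4.0, "cold": 1.0}
print(pick_victim(entries, weights, now=100.0))
```

Plain LRU would evict the hot tenant's entry here because it was touched earlier; with the weights applied, the cold tenant's entry goes first.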

Sorted Batching

Sorted batching groups requests with similar input lengths or similar remaining output lengths. Naive continuous batching over a mix of long and short requests wastes work on padding and masked positions; sorted batching mitigates this. In 2026, this is typically implemented in the upper-layer router or scheduler rather than in the serving framework itself.

Separating short-response endpoints (classification, reranking, short summarization) from long-response endpoints (report generation, long-form code generation) and applying sorted batching within each is the simplest approach — and measurably achieves 20–35% improvement in effective throughput.
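A small sketch makes the padding argument concrete. It assumes each batch is padded to its longest member and counts the padded positions; the helper names and the four request lengths are made up for illustration.

```python
def make_batches(lengths, batch_size, sort=False):
    """Split request lengths into batches, optionally sorting by length first."""
    items = sorted(lengths) if sort else list(lengths)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def padding_waste(batches):
    """Padded token positions if every batch is padded to its longest request."""
    return sum(max(b) * len(b) - sum(b) for b in batches)


lengths = [10, 900, 12, 950]  # two short and two long requests, interleaved
print(padding_waste(make_batches(lengths, 2)))             # arrival order
print(padding_waste(make_batches(lengths, 2, sort=True)))  # sorted by length
```

In arrival order the two batches pad 1,828 positions; after sorting, the short requests share one batch and the long ones the other, and the padding falls to 52.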

SLO Impact Summary

TTFT: Dominated by chunked prefill and prefix caching / RadixAttention. With both in place, p50 TTFT is typically around 100ms; p99 stays under 500ms.
ITL: Continuous batching and chunked prefill both contribute. Without chunked prefill, p99 ITL easily exceeds 200ms.
Throughput: Maximized by the combination of PagedAttention for memory efficiency, continuous batching for slot utilization, and RadixAttention for reducing wasted prefill.
p99 tail: Protected by sorted batching and admission control. A queuing admission policy for low-priority requests is necessary during QPS spikes.
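The admission policy mentioned for the p99 tail can be sketched as a capacity gate with a priority queue behind it. This is a minimal illustration, assuming a fixed in-flight limit and integer priorities where a lower number means higher priority; the `AdmissionController` class is hypothetical.

```python
import heapq
import itertools


class AdmissionController:
    """Admit requests up to a limit; queue the rest by priority, then FIFO."""

    def __init__(self, max_inflight):
        self.max_inflight = max_inflight
        self.inflight = 0
        self._queue = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, rid, priority):
        """Return True if admitted immediately, False if queued."""
        if self.inflight < self.max_inflight:
            self.inflight += 1
            return True
        heapq.heappush(self._queue, (priority, next(self._seq), rid))
        return False

    def release(self):
        """Call when a request finishes; returns the queued rid admitted next, if any."""
        self.inflight -= 1
        if self._queue:
            _, _, rid = heapq.heappop(self._queue)
            self.inflight += 1
            return rid
        return None


ac = AdmissionController(max_inflight=1)
ac.submit("interactive-1", priority=1)
ac.submit("batch-job", priority=5)      # queued during the spike
ac.submit("interactive-2", priority=0)  # queued, but ahead of the batch job
print(ac.release(), ac.release())
```

During a spike, low-priority work waits in the queue instead of competing for decode slots, and higher-priority arrivals are admitted ahead of it as capacity frees up.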

Prompt Cache Strategy

Four axes for designing prompt caching in a production stack:

System prompt sharing: Pre-warm common system prompts at startup and keep them resident. For RadixAttention, the standard practice is issuing warmup requests in the startup script to populate the KV tree.

Tenant partitioning: In a multi-tenant SaaS context, mixing tenant A and B context raises security and compliance concerns. RadixAttention provides no cryptographic isolation, so in regulated domains (finance, healthcare), the correct design separates tenants at the process or GPU group level, limiting the shared layer to non-sensitive system prompts only.

TTL and eviction: For long-session chat agents, caching intermediate multi-turn KV in a temporary cache and evicting it N minutes after the conversation pauses works well. KGA clients commonly set TTLs of 15–30 minutes to balance memory pressure against re-prefill cost.

Cross-request deduplication: Agent workloads frequently pass the same tool responses to multiple LLM calls. Fingerprinting those tool responses and loading them into the KV tree only once reduces TTFT by 30% or more.
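The TTL and deduplication axes can be sketched together as an index keyed by content fingerprint with idle-time eviction. This is a toy bookkeeping layer under stated assumptions: the `ToolResponseCache` class is hypothetical, it tracks only which content is cached rather than the KV tensors themselves, and in a prefix cache the KV is reusable only when the tokens before the tool response also match.

```python
import hashlib


class ToolResponseCache:
    """Toy index of cached content keyed by fingerprint, with an idle TTL."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._last_used = {}  # fingerprint -> last access time in seconds

    @staticmethod
    def fingerprint(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def lookup_or_add(self, text, now):
        """Return True on a hit (KV can be reused), False if prefill is needed."""
        fp = self.fingerprint(text)
        hit = fp in self._last_used
        self._last_used[fp] = now
        return hit

    def evict_idle(self, now):
        """Drop entries idle for longer than the TTL; return how many were dropped."""
        expired = [fp for fp, t in self._last_used.items() if now - t > self.ttl]
        for fp in expired:
            del self._last_used[fp]
        return len(expired)


cache = ToolResponseCache(ttl_seconds=20 * 60)  # a TTL inside the 15–30 minute range
print(cache.lookup_or_add('{"weather": "sunny"}', now=0))   # first call: miss
print(cache.lookup_or_add('{"weather": "sunny"}', now=10))  # repeated call: hit
print(cache.evict_idle(now=1500))                           # idle past the TTL
```

Passing `now` explicitly keeps the sketch deterministic; a real service would read a monotonic clock instead.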

Configuration Example: vLLM + Chunked Prefill + Prefix Caching

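```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-72B-Instruct",
    tensor_parallel_size=8,        # shard the model across 8 GPUs
    kv_cache_dtype="fp8",          # FP8 KV cache to fit more tokens
    enable_prefix_caching=True,    # reuse KV for shared prefixes
    enable_chunked_prefill=True,   # split long prefills across steps
    max_num_batched_tokens=8192,   # per-step token budget (bounds the prefill chunk)
    max_num_seqs=256,              # max concurrent sequences per step
    gpu_memory_utilization=0.92,
    block_size=16,                 # KV page size in tokens
    preemption_mode="recompute",   # under memory pressure, drop KV and recompute
)
```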

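Setting `max_num_seqs` too high causes frequent preemption under memory pressure (recomputation with the settings above, swapping if `preemption_mode` is set to swap instead). In practice, back-calculate from peak QPS and p99 output length: `QPS_peak × p99_output_length / expected_decode_rate` is roughly the upper bound in KGA's experience, where `expected_decode_rate` is the per-sequence decode speed in tokens per second. This is Little's law: concurrent sequences ≈ arrival rate × the time each request spends decoding.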
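A worked instance of that back-calculation, with made-up numbers: the helper name is hypothetical, and the decode rate is assumed to be per sequence, which is the only reading that makes the units come out as a sequence count.

```python
import math


def max_num_seqs_upper_bound(qps_peak, p99_output_len, decode_rate_per_seq):
    """Little's law estimate of concurrent decoding sequences at peak load.

    qps_peak: requests per second at peak
    p99_output_len: p99 output length in tokens
    decode_rate_per_seq: tokens per second each sequence receives while decoding
    """
    return math.ceil(qps_peak * p99_output_len / decode_rate_per_seq)


# Hypothetical workload: 10 req/s peak, 800-token p99 output, 40 tok/s per sequence.
print(max_num_seqs_upper_bound(10, 800, 40))
```

With these numbers the estimate is 200 sequences, so a `max_num_seqs` of 256 would leave some headroom; treat the result as a starting point for load testing rather than a final setting.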

Conclusion

Batching strategy in 2026 is not a matter of choosing one of five techniques. The correct answer is to enable all of them and tune the parameters to your SLO. To protect TTFT: chunked prefill and RadixAttention. To protect ITL: chunked prefill and sorted batching. To maximize throughput: continuous batching and PagedAttention. To protect tenant isolation: hard partitioning and tenant-aware eviction. Designing around these four axes, an 8× H200 SXM system running a 70B model can realistically target aggregate 2,500 tok/s and p99 TTFT under 400ms.
