What IT services does KGA provide?

KGA provides comprehensive IT support services including software installation and setup, SaaS system maintenance, application configuration, technical support, digital consulting (including website development), security services, and data management & backup solutions.

What areas do you cover?

Based in Kosai, Shizuoka, we provide remote support nationwide across Japan. On-site support is available primarily in the Tokai region.

Can I consult before signing a contract?

Yes, initial consultation and estimates are completely free. We will listen to your IT challenges and propose the optimal solution.

Is emergency support available?

Yes, the Business plan (monthly) includes 24-hour emergency support. The Annual Basic and Annual Premium plans provide priority response during business hours.

Can you set up international TV apps?

Yes, we support the installation and configuration of international TV applications and media players. We help set up environments for legal access to international content.

Do you offer multilingual support?

We support 9 languages: Japanese, English, Portuguese, Korean, Chinese, Malay, Filipino, Vietnamese, and Spanish.

Are there any setup or hidden fees?

No. All prices displayed are final and tax-included. There are no setup fees, hidden charges, or surprise invoices. What you see is exactly what you pay.

Can I change plans later?

Yes. You can upgrade, downgrade, or cancel at any time. Upgrades take effect immediately and we will prorate the difference. Downgrades take effect at the next renewal cycle.

Which payment methods do you accept?

We accept all major credit cards (Visa, Mastercard, JCB, American Express) through Stripe and Komoju, as well as bank transfers and convenience store payments in Japan. Invoicing is available for Business IT Plan customers.

Do you offer refunds?

Yes. We offer a 14-day money-back guarantee on all annual plans — no questions asked. Monthly Business IT Plan subscriptions can be cancelled at any time with prorated refunds for unused service.

What is the difference between the annual plans and the Business IT Plan?

Annual plans cover app configuration and support for individuals and small teams. The Business IT Plan is a comprehensive monthly subscription for companies that require website development, system management, automation, security, and a dedicated account manager.

Do you provide support in English?

Yes. Our team provides full multilingual support in Japanese, English, Portuguese, Korean, Chinese, Malay, Filipino, Vietnamese, and Spanish — by email, chat, and scheduled video calls.

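Inference Engine Wars 2026: A Complete Comparison of vLLM, SGLang, TensorRT-LLM, llama.cpp, and MLX | KGA Tech Blog

The Inference Engine Landscape in 2026

After the upheaval of 2025, the LLM inference stack has settled into a clear best choice for each use case. Writing from a production-operations perspective, this article summarizes the characteristics of five major engines (vLLM 0.8.3, SGLang 0.4.5, TensorRT-LLM 0.18, llama.cpp (b4800), and MLX 0.22) and offers tuning guidelines for the 7B, 70B, and 405B scales.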

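Measured Throughput and Latency

The results below were measured on a single H100 80GB with Llama 3.1 8B Instruct, FP8 quantization, 1k input tokens / 512 output tokens, and 64 concurrent requests. The llama.cpp and MLX rows deviate from this setup as noted in the table (Q4_K_M quantization and an M3 Ultra, respectively).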

| Engine | Output tok/s | p50 latency | p99 latency | VRAM usage |
|---|---|---|---|---|
| vLLM 0.8 | 4820 | 182ms | 412ms | 68.4GB |
| SGLang 0.4 | 5640 | 158ms | 378ms | 71.2GB |
| TensorRT-LLM | 5310 | 164ms | 342ms | 62.8GB |
| llama.cpp | 1820 | 486ms | 1240ms | 9.8GB (Q4_K_M) |
| MLX (M3 Ultra) | 312 | 2840ms | 5120ms | 16GB |

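SGLang benefits from prefix-cache sharing through RadixAttention and delivers the highest throughput for chat workloads that share a prompt template. TensorRT-LLM leads on p99 latency, which suits SLA-driven production deployments. vLLM sits between the two and offers a good balance of model coverage and development velocity.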

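The Evolution of Batch Scheduling

Continuous Batching (popularized by vLLM and now adopted by every engine) merges requests at iteration granularity instead of waiting for a whole batch to finish, which pushed GPU utilization into the 90% range. In 2026 it is simply the baseline.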
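To make the scheduling difference concrete, here is a minimal toy simulation. It is a sketch, not any engine's scheduler: the function names and the request lengths are made up for illustration, and each "step" stands for one decode iteration that produces one token per running request.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled at the very next iteration."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token.
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]
        steps += 1
    return steps

# Hypothetical workload: output lengths (in tokens) of eight requests.
lengths = [8, 2, 2, 2, 8, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With this workload the static scheduler leaves three of four slots idle while each long request finishes, whereas the continuous scheduler backfills them, so it needs fewer iterations for the same total number of tokens.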

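RadixAttention (SGLang) manages the KV cache in a radix tree (a compressed trie), which avoids recomputing shared system prompts and multi-turn history. For a SaaS with a 2k-token system prompt and around 1,000 concurrent users, a 3x improvement in effective throughput has been reported.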
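The prefix-sharing idea can be sketched with a plain token-level trie. This is an illustration only, not SGLang's implementation: `PrefixCache` and `match_and_insert` are hypothetical names, and a real radix tree would compress runs of tokens into single edges, keep KV tensor handles on the nodes, and evict entries under memory pressure.

```python
class PrefixCacheNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixCacheNode

class PrefixCache:
    """Toy trie that records which token prefixes have already been computed."""

    def __init__(self):
        self.root = PrefixCacheNode()

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, caching the rest."""
        node = self.root
        reused = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                # Cache miss: from here on, every token needs fresh prefill.
                child = PrefixCacheNode()
                node.children[tok] = child
            else:
                reused += 1
            node = child
        return reused

cache = PrefixCache()
system_prompt = list(range(2000))  # stand-in for a 2k-token system prompt
first = cache.match_and_insert(system_prompt + [9001, 9002])
second = cache.match_and_insert(system_prompt + [9003])
print(first, second)
```

The first request pays for the whole prompt, while the second one reuses every system-prompt token and only needs prefill for its own suffix.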

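Speculative Decoding is supported by all engines, and in 2026 three families dominate: EAGLE-3, Medusa-2, and Lookahead. EAGLE-3 achieves a 2.4x speedup with a 70B target model and 2.8x with a 405B one. A draft model 1/30 to 1/50 the size of the target model is sufficient.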

```bash
# Example EAGLE-3 configuration in SGLang
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-70B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path meta-llama/Llama-3.2-1B \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --tp 4
```
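The draft-then-verify loop behind speculative decoding can be sketched in a few lines. Note that this is the generic greedy variant, not EAGLE-3 itself (which drafts from the target model's hidden features and verifies a token tree). The two "models" below are hypothetical toy functions, and `num_steps` only loosely mirrors the `--speculative-num-steps` setting.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, num_steps):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The cheap draft model proposes num_steps tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(num_steps):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks every position; a real engine does this in one
        #    batched forward pass, which is where the speedup comes from.
        target_passes += 1
        accepted, ctx = [], list(tokens)
        for tok in proposal:
            expected = target_next(ctx)
            if tok != expected:
                accepted.append(expected)  # target's token replaces the first miss
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))  # bonus token when all drafts match
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes

# Toy models: the target counts upward modulo 10; the draft errs after a 4.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

output, passes = speculative_decode(target_next, draft_next, [0], 12, num_steps=5)

# Baseline: plain greedy decoding with the target model only.
baseline = [0]
for _ in range(12):
    baseline.append(target_next(baseline))

print(output == baseline, passes)
```

The output is identical to decoding with the target alone, but the target is invoked far fewer times than once per generated token. How much is gained in practice depends on how often the draft agrees with the target.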

Chunked Prefill splits long prompts into chunks and interleaves them with decode steps, which improves TTFT. It is essential for RAG workloads with inputs over 32k tokens.
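As a rough sketch of how a long prompt gets interleaved with ongoing decodes, the toy planner below gives each iteration a fixed token budget, reserves one token per decoding request, and spends the remainder on the next prefill chunk. The function name, the budget of 8192, and the 64 decoding requests are assumptions chosen for illustration, not defaults of any engine.

```python
def schedule_chunked_prefill(prompt_len, num_decoding, token_budget):
    """Return a list of (prefill_tokens, decode_tokens) per iteration."""
    if token_budget <= num_decoding:
        raise ValueError("token_budget must leave room for prefill tokens")
    remaining = prompt_len
    plan = []
    while remaining > 0:
        # Decodes go first (one token each); prefill fills what is left.
        chunk = min(remaining, token_budget - num_decoding)
        plan.append((chunk, num_decoding))
        remaining -= chunk
    return plan

# Hypothetical case: a 32k-token RAG prompt arriving while 64 requests decode.
plan = schedule_chunked_prefill(prompt_len=32768, num_decoding=64, token_budget=8192)
for prefill_tokens, decode_tokens in plan:
    print(prefill_tokens, decode_tokens)
```

Because every iteration still carries the 64 decode tokens, running requests keep producing output while the long prompt is being ingested, instead of stalling behind one monolithic prefill.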

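Quantization Options: GPTQ / AWQ / FP8 / MXFP4

In 2026, quantization increasingly delivers both accuracy and throughput, and GPTQ 4-bit is becoming a thing of the past.

FP8 (E4M3): Natively supported on H100 and Blackwell. Almost no accuracy loss, 1.8x throughput. The production standard in 2026.
AWQ 4-bit: An option for memory-bound 70B+ models. Accuracy loss of 0.5 to 1.5 points, half the VRAM.
MXFP4 (Microscaling): Native on Blackwell B200; 4-bit, yet close to FP8 accuracy. Support began in vLLM 0.8 and TRT-LLM 0.18.
GPTQ 4-bit: Legacy. Not recommended for new deployments.
llama.cpp Q4_K_M / IQ4_XS: For CPU and Apple Silicon. IQ4_XS improves perplexity by 3 to 5% over Q4_K_M at the same size.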
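For a feel of what these formats mean for memory, the following back-of-the-envelope sketch computes the weight footprint from the nominal bit width alone. It deliberately ignores per-group scales and zero points, activations, and the KV cache, so real checkpoints come out somewhat larger. The helper name and the 70B example are assumptions for illustration.

```python
def weight_memory_gb(num_params_billion, bits_per_weight):
    """Approximate weight footprint in GB (10^9 bytes) from nominal bit width."""
    return num_params_billion * bits_per_weight / 8

# Nominal bits per weight; metadata overhead (scales, zero points) is ignored.
formats = {"FP16": 16, "FP8 (E4M3)": 8, "AWQ 4-bit": 4, "MXFP4": 4}

footprints = {name: weight_memory_gb(70, bits) for name, bits in formats.items()}
for name, gb in footprints.items():
    print(f"{name}: {gb:.0f} GB")
```

Under these simplifications a 70B model goes from about 140 GB in FP16 to about 70 GB in FP8 and about 35 GB at 4 bits, which is the halving of VRAM mentioned for AWQ relative to FP8.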

Tuning Guidelines by Scale

7B class (Llama 3.1 8B, Qwen 3 7B)

Production: SGLang + FP8 + continuous batching + EAGLE-3; 5000+ tok/s on a single H100
Local: llama.cpp + IQ4_XS; 60 tok/s on an M3 Max, 140 tok/s on an RTX 4090
Edge: MLX + 4-bit; 25 tok/s on an M2 iPad

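70B class (Llama 4 70B, Qwen 3 72B)

Production: vLLM + FP8 + TP=4; 1800 tok/s on 4x H100 at batch size 32
Cost-optimized: 2x MI300X + FP8; 35% cheaper than H100
Trick: `--kv-cache-dtype fp8_e5m2` effectively doubles the usable context length relative to a 16-bit KV cache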

405B class (Llama 4 405B, DeepSeek R2)

TP=8 + EP (expert parallel) is required
For MoE models, `--enable-expert-parallel --ep-size 8` optimizes communication among the active experts
KV cache offloading (`--swap-space 64`) can double the batch size
For long contexts, TRT-LLM's chunked prefill + PagedAttention is the fastest

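Common Operational Pitfalls

The most common failure in production rollouts is blindly trusting whichever engine tops the benchmarks. SGLang is indeed fast, but vLLM is ahead when it comes to keeping pace with new model support. A practical rule of thumb: choose vLLM if you need to run new models on release day, and SGLang or TRT-LLM if you want to squeeze maximum speed out of a fixed model.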

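Another pitfall is estimating KV cache capacity. With a 70B model in FP8, a 128k context × 16 concurrent requests consumes more than 40GB for the KV cache alone. This configuration cannot fit on a single H100, yet it is overlooked time and again. Combining PagedAttention with an FP8 KV cache can cut the footprint by 60%.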
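The estimate is easy to check with a few lines of arithmetic. The sketch below assumes a Llama-3.1-70B-style shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128). These numbers are an assumption for illustration, so substitute the values from your model's config. It also computes the worst case in which every request fills its whole context window.

```python
GIB = 1024 ** 3

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, num_requests, bytes_per_value):
    """Worst-case KV cache size; the factor 2 covers keys and values."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * num_requests

# Assumed architecture (Llama-3.1-70B-style); check your model's config.
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)
context_len = 128 * 1024

one_request_fp16 = kv_cache_bytes(**shape, context_len=context_len,
                                  num_requests=1, bytes_per_value=2)
sixteen_fp8 = kv_cache_bytes(**shape, context_len=context_len,
                             num_requests=16, bytes_per_value=1)

print(f"1 request, 16-bit KV: {one_request_fp16 / GIB:.0f} GiB")
print(f"16 requests, FP8 KV:  {sixteen_fp8 / GIB:.0f} GiB")
```

Under these assumptions a single full 128k sequence already needs about 40 GiB with a 16-bit KV cache, and 16 of them stay far beyond one 80GB H100 even with FP8 KV. Actual usage is usually lower because PagedAttention allocates blocks only for tokens that exist, but sizing for the worst case shows why this configuration breaks.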

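Preparing for the Second Half of 2026

The full-scale rollout of Blackwell B200 / GB200 NVL72, vLLM integration for Intel Gaudi 3, and the arrival of used Hopper H200 cards on the secondary market will be the focal points of the second half of this year. Choosing an inference engine is a three-dimensional optimization problem over model × hardware × quantization, and re-evaluating every quarter is the key to sound operations.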
