What IT services does KGA provide?

KGA provides comprehensive IT support services including software installation and setup, SaaS system maintenance, application configuration, technical support, digital consulting (including website development), security services, and data management & backup solutions.

What areas do you cover?

Based in Kosai, Shizuoka, we provide remote support nationwide across Japan. On-site support is available primarily in the Tokai region.

Can I consult before signing a contract?

Yes, initial consultation and estimates are completely free. We will listen to your IT challenges and propose the optimal solution.

Is emergency support available?

Yes, the Business plan (monthly) includes 24-hour emergency support. The Annual Basic and Annual Premium plans provide priority response during business hours.

Can you set up international TV apps?

Yes, we support the installation and configuration of international TV applications and media players. We help set up environments for legal access to international content.

Do you offer multilingual support?

We support 9 languages: Japanese, English, Portuguese, Korean, Chinese, Malay, Filipino, Vietnamese, and Spanish.

Are there any setup or hidden fees?

No. All prices displayed are final and tax-included. There are no setup fees, hidden charges, or surprise invoices. What you see is exactly what you pay.

Can I change plans later?

Yes. You can upgrade, downgrade, or cancel at any time. Upgrades take effect immediately and we will prorate the difference. Downgrades take effect at the next renewal cycle.

Which payment methods do you accept?

We accept all major credit cards (Visa, Mastercard, JCB, American Express) through Stripe and Komoju, as well as bank transfers and convenience store payments in Japan. Invoicing is available for Business IT Plan customers.

Do you offer refunds?

Yes. We offer a 14-day money-back guarantee on all annual plans — no questions asked. Monthly Business IT Plan subscriptions can be cancelled at any time with prorated refunds for unused service.

What is the difference between the annual plans and the Business IT Plan?

Annual plans cover app configuration and support for individuals and small teams. The Business IT Plan is a comprehensive monthly subscription for companies that require website development, system management, automation, security, and a dedicated account manager.

Do you provide support in English?

Yes. Our team provides full multilingual support in Japanese, English, Portuguese, Korean, Chinese, Malay, Filipino, Vietnamese, and Spanish — by email, chat, and scheduled video calls.

Building an LLM Evaluation Framework: How to Measure Quality - KGA Tech Blog

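No More "It Feels Better Somehow"

The most dangerous habit in LLM application development is judging a prompt change or a model switch by feel. "The responses seem better than the previous version" is how regressions slip through and turn into production incidents. KGA began building an LLM evaluation framework in late 2024, and it now runs in the CI/CD pipeline of every LLM project.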

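The Three Axes of Evaluation

KGA's evaluation framework measures quality along three axes. Accuracy: is the output factually correct and in the requested format? Consistency: does the same input yield output of stable quality? Usefulness: do end users actually find value in it?

Accuracy is the easiest to measure with automated tests: structural validation of JSON output, range checks on numeric values, and exact or partial match against known answers. For consistency, we run the same input 10 times at temperature 0 and measure how much the outputs vary. Usefulness requires human evaluation, but because that is expensive, we combine it with proxy evaluation by LLM-as-Judge.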
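The automated accuracy checks mentioned above could look roughly like the following. This is a minimal sketch rather than KGA's actual implementation: the function names are hypothetical, and the partial-match rule (fraction of expected phrases found) is one possible definition among several.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """True when the output equals the known answer after trimming whitespace."""
    return output.strip() == expected.strip()


def partial_match(output: str, expected_phrases: list[str]) -> float:
    """Fraction of expected phrases found in the output (case-insensitive)."""
    if not expected_phrases:
        return 1.0
    lowered = output.lower()
    hits = sum(1 for phrase in expected_phrases if phrase.lower() in lowered)
    return hits / len(expected_phrases)


def in_range(raw_json: str, field: str, low: float, high: float) -> bool:
    """True when the output is a JSON object whose `field` is a number in [low, high]."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    value = data.get(field)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return low <= value <= high
```

Each check returns a plain boolean or ratio, so results can be aggregated per benchmark run without any model calls.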

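Designing Custom Benchmarks

General-purpose benchmarks (MMLU, HumanEval, and so on) are useful when choosing a model, but they cannot serve as quality assurance for production. You need to build a domain-specific benchmark yourself.

KGA's methodology is as follows.
Step 1: Sample 100-500 representative inputs from production logs.
Step 2: Have a human write the "ideal output" for each input (the Golden Dataset).
Step 3: Write the evaluation criteria down explicitly (a checklist covering structure, content, tone, and so on).
Step 4: Calibrate the automated evaluation metrics against the Golden Dataset.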
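Steps 1 to 3 can be sketched as a small data structure plus a reproducible sampler. The names `GoldenExample`, `sample_inputs`, and `to_annotation_queue` are hypothetical, and plain random sampling is an assumption: getting a truly representative sample may require stratifying by category, which is omitted here.

```python
import random
from dataclasses import dataclass, field


@dataclass
class GoldenExample:
    input_text: str
    ideal_output: str = ""  # Step 2: filled in by a human annotator
    checklist: list[str] = field(default_factory=list)  # Step 3: explicit criteria


def sample_inputs(log_inputs: list[str], k: int, seed: int = 0) -> list[str]:
    """Step 1: draw up to k distinct inputs from production logs, reproducibly."""
    unique = sorted(set(log_inputs))  # dedupe; sorting keeps the sample deterministic
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


def to_annotation_queue(inputs: list[str]) -> list[GoldenExample]:
    """Wrap sampled inputs as empty records awaiting human-written ideal outputs."""
    return [GoldenExample(input_text=text) for text in inputs]
```

Fixing the seed means the same log snapshot always produces the same benchmark candidates, which makes the dataset easier to review and extend.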

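Creating a Golden Dataset is laborious, but it determines the quality of the evaluation. KGA creates at least 200 golden examples per project. This costs roughly 20-40 person-hours, and without that investment no evaluation can be trusted.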

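LLM-as-Judge: The Core of Automated Evaluation

LLM-as-Judge compensates for the limited scalability of human evaluation. We give an evaluator LLM (KGA uses Claude 3.5 Sonnet) the input, the output, and the evaluation criteria, and have it produce a score from 1 to 5 along with its rationale.

Three techniques improve the accuracy of LLM-as-Judge. First, write the evaluation criteria in very specific terms. Instead of "Is the text easy to understand?", use criteria that can be quantified, such as "Are technical terms explained on first use?" or "Is the average sentence length 80 characters or fewer?" Second, provide a reference answer (Golden Data). Instructing the judge to "evaluate quality by comparing against the reference answer below" makes its standards more consistent. Third, use pairwise comparison. Relative evaluation (which of A and B is better?) is more consistent than absolute evaluation (a score from 1 to 5).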
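The three techniques can be combined in a single pairwise judging prompt. The sketch below is illustrative only: the prompt wording, the function names, and the `call_judge` callable (standing in for whatever client sends a prompt to the evaluator model) are all assumptions, not KGA's actual prompt or code.

```python
from typing import Callable, Optional

PAIRWISE_TEMPLATE = """You are evaluating two responses to the same input.

Input:
{input_text}

Reference answer:
{reference}

Criteria:
{criteria}

Response A:
{response_a}

Response B:
{response_b}

Compare both responses against the reference answer and the criteria.
Explain your reasoning, then answer with exactly one letter on the last line: A or B."""


def build_pairwise_prompt(input_text: str, reference: str, criteria: list[str],
                          response_a: str, response_b: str) -> str:
    """Fill the template with specific criteria, a reference answer, and both responses."""
    bullet_list = "\n".join(f"- {c}" for c in criteria)
    return PAIRWISE_TEMPLATE.format(
        input_text=input_text, reference=reference, criteria=bullet_list,
        response_a=response_a, response_b=response_b,
    )


def parse_verdict(judge_reply: str) -> Optional[str]:
    """Read the verdict from the last non-empty line; None if it is not A or B."""
    lines = [line.strip() for line in judge_reply.splitlines() if line.strip()]
    if not lines:
        return None
    last = lines[-1].upper()
    return last if last in ("A", "B") else None


def judge_pair(call_judge: Callable[[str], str], **prompt_fields) -> Optional[str]:
    """Send the pairwise prompt through the (hypothetical) judge client and parse the reply."""
    return parse_verdict(call_judge(build_pairwise_prompt(**prompt_fields)))
```

Returning `None` for an unparseable reply lets the caller retry or exclude that example instead of silently counting it as a win for either side.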

In KGA's measurements, LLM-as-Judge with the techniques above agrees with human evaluation 82% of the time. For pairwise comparison alone, agreement reaches 89%.
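An agreement rate like the ones quoted above can be computed by comparing judge and human labels example by example. The article does not say how agreement is defined for 1-5 scores, so this sketch assumes exact label agreement; `agreement_rate` is a hypothetical helper name.

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Share of examples where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("at least one labeled example is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

For pairwise comparison the labels would be "A" or "B" per example; for absolute scoring they would be the 1-5 scores.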

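Designing the Automated Test Pipeline

KGA's LLM evaluation pipeline runs on GitHub Actions. When a prompt change is submitted as a PR, the following pipeline runs automatically.

1. Run inference with both the old and new prompts on the benchmark dataset (200 examples).
2. Structural validation (JSON schema compliance rate, presence of required fields, and so on).
3. Pairwise comparison by LLM-as-Judge (new vs. old).
4. Regression check (has any category dropped in quality by 5% or more compared with the old version?).
5. Cost calculation (change in token consumption).
6. Automatically post a summary of the results as a PR comment.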
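The regression check in step 4 could be implemented along these lines. This is a sketch under stated assumptions: per-category scores are taken to be numbers between 0 and 1, the 5% threshold is interpreted as a relative drop from the old score (the article does not say whether it means relative or percentage points), and `find_regressions` is a hypothetical name.

```python
def find_regressions(old_scores: dict[str, float],
                     new_scores: dict[str, float],
                     threshold: float = 0.05) -> dict[str, float]:
    """Return categories whose score fell by `threshold` or more, relative to the old score."""
    regressions = {}
    for category, old in old_scores.items():
        new = new_scores.get(category)
        if new is None or old <= 0:
            continue  # category missing from the new run, or no baseline to compare against
        drop = (old - new) / old
        if drop >= threshold:
            regressions[category] = drop
    return regressions
```

In a CI job, a non-empty result would typically fail the check so that the PR cannot merge until the affected categories are reviewed.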

A run over the 200-example benchmark takes about 15 minutes and costs about $3 in API fees. Given how often prompts change, the investment pays for itself.

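Choosing Metrics

As automated metrics for text generation tasks, BLEU and ROUGE are of limited use. KGA focuses on the following metrics.

Structure compliance rate: the share of outputs that conform to the specified format (JSON, Markdown, and so on). This is the most basic metric and the easiest to automate. Each output is scored pass/fail. The target is 98% or higher.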
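For JSON output, the structure compliance rate can be computed with the standard library alone. The sketch below assumes that compliance means "parses as a JSON object and contains every required field"; a full JSON Schema check would be stricter. The function names and example fields are hypothetical.

```python
import json


def is_compliant(raw: str, required_fields: set[str]) -> bool:
    """Pass/fail: the output parses as a JSON object and has every required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_fields.issubset(data.keys())


def compliance_rate(outputs: list[str], required_fields: set[str]) -> float:
    """Share of outputs that pass the structural check."""
    if not outputs:
        raise ValueError("at least one output is required")
    passed = sum(is_compliant(o, required_fields) for o in outputs)
    return passed / len(outputs)
```

The result can then be compared directly with the 98% target, for example `compliance_rate(outputs, {"subject", "body"}) >= 0.98`.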

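Factual consistency: whether the output is consistent with the information contained in the input data, which amounts to hallucination detection. KGA uses an NLI (Natural Language Inference) model to classify the relationship between input and output as entailment or contradiction.

Response consistency: the semantic similarity of the outputs from N runs on the same input, measured as cosine similarity between embeddings. A value of 0.85 or higher is considered stable.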
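Given embeddings for the N outputs, response consistency can be sketched as follows. The article does not specify how the N outputs are aggregated, so averaging over all pairs is an assumption here; the embeddings themselves would come from whatever embedding model the project uses, which is outside this sketch.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(embeddings: list[list[float]]) -> float:
    """Average cosine similarity over every pair of output embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("at least two embeddings are required")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def is_stable(embeddings: list[list[float]], threshold: float = 0.85) -> bool:
    """Apply the 0.85 stability threshold to the mean pairwise similarity."""
    return mean_pairwise_similarity(embeddings) >= threshold
```

With 10 runs per input this compares 45 pairs, which is cheap next to the inference calls themselves.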

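Task completion rate: whether the task was completed correctly end to end. For an "email draft" task, for example, we check that the subject, body, and signature are all present.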
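The email draft example can be turned into a simple completion check. This sketch assumes the model returns each draft as a dictionary with `subject`, `body`, and `signature` keys; that output shape and the function names are hypothetical, and a plain-text draft would need a parsing step first.

```python
REQUIRED_PARTS = ("subject", "body", "signature")


def email_task_completed(draft: dict) -> bool:
    """True when every required part is present as a non-blank string."""
    return all(
        isinstance(draft.get(part), str) and bool(draft[part].strip())
        for part in REQUIRED_PARTS
    )


def task_completion_rate(drafts: list[dict]) -> float:
    """Share of drafts in which the task was fully completed."""
    if not drafts:
        raise ValueError("at least one draft is required")
    return sum(email_task_completed(d) for d in drafts) / len(drafts)
```

Unlike the structure check, this rejects drafts whose fields exist but are empty, which is closer to asking whether the task was actually finished.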

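Limits of the Evaluation Framework

Frankly, nobody yet knows the "right answer" for LLM evaluation. LLM-as-Judge has its own biases: a known tendency to rate verbose answers highly (verbosity bias) and a tendency to grade its own output leniently (self-preference bias). KGA monitors the agreement rate between human evaluation and LLM-as-Judge every month and adjusts the evaluation prompt when the gap widens. Full automation is not possible, but automating 80% of quality assurance and concentrating human effort on the remaining 20% is a realistic approach.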
