What IT services does KGA provide?

KGA provides comprehensive IT support services including software installation and setup, SaaS system maintenance, application configuration, technical support, digital consulting (including website development), security services, and data management & backup solutions.

What areas do you cover?

Based in Kosai, Shizuoka, we provide remote support nationwide across Japan. On-site support is available primarily in the Tokai region.

Can I consult before signing a contract?

Yes, initial consultation and estimates are completely free. We will listen to your IT challenges and propose the optimal solution.

Is emergency support available?

Yes, the Premium plan includes 24-hour emergency support. The Standard plan also provides priority response during business hours.

Can you set up international TV apps?

Yes, we support the installation and configuration of international TV applications and media players. We help set up environments for legal access to international content.

Do you offer multilingual support?

We support 9 languages: Japanese, English, Portuguese, Korean, Chinese, Malay, Filipino, Vietnamese, and Spanish.

Are there any setup or hidden fees?

No. All prices displayed are final and tax-included. There are no setup fees, hidden charges, or surprise invoices. What you see is exactly what you pay.

Can I change plans later?

Yes. You can upgrade, downgrade, or cancel at any time. Upgrades take effect immediately and we will prorate the difference. Downgrades take effect at the next renewal cycle.

Which payment methods do you accept?

We accept all major credit cards (Visa, Mastercard, JCB, American Express) through Komoju, as well as bank transfers and convenience store payments in Japan. Invoicing is available for Business IT Plan customers.

Do you offer refunds?

Yes. We offer a 14-day money-back guarantee on all annual plans — no questions asked. Monthly Business IT Plan subscriptions can be cancelled at any time with prorated refunds for unused service.

What is the difference between the annual plans and the Business IT Plan?

Annual plans cover app configuration and support for individuals and small teams. The Business IT Plan is a comprehensive monthly subscription for companies that require website development, system management, automation, security, and a dedicated account manager.

Do you provide support in English?

Yes. Our team provides full multilingual support in Japanese, English, Portuguese, Korean, Chinese, Malay, Filipino, Vietnamese, and Spanish — by email, chat, and scheduled video calls.

Raising RAG Retrieval Quality in 2026: Hybrid, Rerankers, HyDE, and Late Chunking — KGA Tech Blog

Retrieval Sets the Ceiling for RAG Quality

In 2026, it's rare for a RAG project to see a dramatic improvement by switching the generation model from GPT-4 to Claude Opus 4.7. The vast majority of problems come down to Retrieval failing to surface the needed information, or retrieving noise that confuses the LLM. Looking back at RAG projects KGA has supported over the past 18 months, Retrieval improvements alone drove 30–60% gains in user-facing quality scores, while LLM model switches produced an average improvement of around 8%.

This article organizes 2026's established RAG Retrieval practices across five layers: (1) hybrid search, (2) rerankers, (3) query rewriting, (4) chunking strategy, and (5) evaluation metrics.

Layer 1: Hybrid Search (BM25 + Dense)

Pure dense retrieval frequently fails on model numbers ("ZX-450B"), abbreviations ("TLS1.3"), proper names, and specific numeric conditions. BM25, conversely, is weak on semantic paraphrasing. Fusing both using Reciprocal Rank Fusion (RRF) has become the industry standard.

RRF is elegantly simple: convert each search result's rank to `1 / (k + rank)` and sum the values per document (default k=60). Because it uses only ranks and never raw scores, differences in score scale between retrievers do not matter, which makes it a dependable way to combine different retrieval systems.
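Written out as a formula, the rule looks roughly like this. The notation is introduced here for illustration only: $R$ is the set of ranked lists being fused, $\operatorname{rank}_r(d)$ is the 1-based position of document $d$ in list $r$, and a document that is missing from a list contributes nothing for that list.

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \operatorname{rank}_r(d)}
\qquad
\text{e.g. } k = 60:\quad
\frac{1}{60 + 1} + \frac{1}{60 + 3} \approx 0.0323
\;>\;
\frac{1}{60 + 1} \approx 0.0164
```

In the worked numbers, a document ranked 1st by one retriever and 3rd by the other outscores a document ranked 1st by only one retriever, so agreement between retrievers counts for more than a single top placement.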

```python
from collections import defaultdict


def rrf_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])


# bm25_index, vector_db, query and query_embed are assumed to exist already;
# each search call is expected to return a ranked list of document IDs.
bm25_top = bm25_index.search(query, k=50)
dense_top = vector_db.search(query_embed, k=50)
fused = rrf_fusion([bm25_top, dense_top])[:20]
```

On a legal search project at KGA, BM25 alone achieved nDCG@10 = 0.52, dense alone 0.61, and the RRF hybrid 0.73. The technique is simple but remarkably effective.

Layer 2: Rerankers (Cohere Rerank-3, Voyage Rerank-2)

After hybrid search returns the top 50–200 candidates, running a cross-encoder reranker to re-score them precisely is the 2026 de facto standard. Dense retrieval uses a bi-encoder structure that vectorizes query and document independently, so it loses query-document interaction. A reranker passes both through the same Transformer together, producing a relevance score that reflects how the query and the document text actually relate.

Cohere Rerank-3 (released September 2025): supports 100 languages, 128k context, $2/1k queries
Voyage Rerank-2: 32k context, top-tier on MTEB Reranking benchmark, $0.5/1k queries
Jina Reranker v2: OSS-friendly option; Japanese performance is one step behind Cohere
BGE-Reranker-v2-m3: MIT-licensed OSS, the first choice for self-hosted deployments

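```python
import cohere

# top50 is assumed to hold the first-stage candidates, each exposing a .text attribute.
# The client needs an API key, passed explicitly or supplied via the environment.
co = cohere.Client()
results = co.rerank(
    model="rerank-3",  # name as written in this article; check Cohere's docs for the exact identifier
    query="Requirements for cross-border transfers under data protection law",
    documents=[d.text for d in top50],
    top_n=10,
)
# Each result carries the index of the document in the input list.
reranked = [top50[r.index] for r in results.results]
```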

The effect of reranking is dramatic: at KGA, nDCG@10 improves on average from around 0.65 to around 0.82. However, reranking a top-50 set adds 200–400 ms of latency, so search UX needs either asynchronous processing or a design that serves fast first-stage results while reranking runs in the background.
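One way to serve first-stage results while the reranker runs in the background is an async generator that yields twice. This is a minimal sketch, not KGA's implementation: `progressive_results` and the `rerank` coroutine are hypothetical names, and the caller is assumed to push each yielded list to the UI (for example over a streaming response).

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Sequence


async def progressive_results(
    first_stage: Sequence[str],
    rerank: Callable[[Sequence[str]], Awaitable[list[str]]],  # hypothetical async reranker
    top_n: int = 10,
) -> AsyncIterator[tuple[str, list[str]]]:
    """Yield the first-stage order immediately, then the reranked order."""
    # Start reranking in the background before yielding anything.
    task = asyncio.create_task(rerank(first_stage))
    # The UI can render these results without waiting for the reranker.
    yield "first_stage", list(first_stage[:top_n])
    # Once the reranker finishes, yield the refined order.
    reranked = await task
    yield "reranked", reranked[:top_n]
```

A consumer iterates with `async for stage, docs in progressive_results(...)` and replaces the displayed list when the `"reranked"` stage arrives.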

Layer 3: Query Rewriting (HyDE, Query2Doc, Multi-Query)

User queries are often short and ambiguous ("contract termination," "memory utilization"). Expanding them with an LLM before searching has become a standard part of 2026 Retrieval design.

HyDE (Hypothetical Document Embeddings): Have the LLM write a hypothetical answer document and use its embedding as the search key
Query2Doc: Concatenate the query with the pseudo-document and embed the result (a HyDE variant that tends to be more stable in practice)
Multi-Query: Have the LLM generate 3–5 alternative phrasings of the same intent, run parallel searches, and fuse with RRF

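```python
def hyde_search(query: str, vector_db, llm, embedding_model):
    # llm, embedding_model and vector_db are placeholders for your own clients.
    hypothetical = llm.generate(
        f"Write a 3-sentence hypothetical answer to the following question: {query}"
    )
    # Query2Doc variant: embed the query together with the pseudo-document.
    # For plain HyDE, embed `hypothetical` alone.
    combined = query + " " + hypothetical
    embed = embedding_model.encode(combined)
    return vector_db.search(embed, k=50)
```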

Multi-Query offers the best cost-to-performance ratio — on an internal FAQ RAG at KGA, it improved Recall@20 by 12% over single-query search. The cost is one additional LLM call per query, but running it on Claude Haiku or GPT-4.1-mini keeps the cost well under ¥0.1 per query.
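Multi-Query can be sketched in a few lines by reusing RRF for the fusion step. The following is an illustrative sketch under stated assumptions: `generate_variants` stands in for an LLM call that returns paraphrases, and `search` stands in for any retriever that maps a query string to a ranked list of document IDs. Both are hypothetical callables, not part of any specific library.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def multi_query_search(
    query: str,
    generate_variants: Callable[[str], list[str]],  # hypothetical LLM wrapper
    search: Callable[[str], list[str]],             # hypothetical retriever
    k: int = 60,
    top_n: int = 20,
) -> list[str]:
    # Keep the original query, add the paraphrases, and drop exact duplicates.
    queries = list(dict.fromkeys([query, *generate_variants(query)]))
    # Run one search per phrasing in parallel.
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        ranked_lists = list(pool.map(search, queries))
    # Fuse the ranked lists with Reciprocal Rank Fusion.
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda item: -item[1])
    return [doc_id for doc_id, _ in ordered][:top_n]
```

Keeping the original query in the set means a poor paraphrase can only add candidates, never remove the ones the user's own wording would have found.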

Layer 4: Chunking Strategy

Any RAG still operating on "split every 1,000 characters" is out of date by 2026 standards. The chunking approach directly determines Retrieval quality.

Fixed-size: The classic N-character split with 10–20% overlap
Recursive Character Splitting: Splits recursively on progressively finer separators (paragraph → line → word → character). This is the default separator order in LangChain's `RecursiveCharacterTextSplitter`
Semantic Chunking: Embeds consecutive sentences and splits at points where cosine similarity drops sharply, producing semantically cohesive chunks
Late Chunking (Jina 2024/2025): Passes the full long document through the embedding model first, extracts per-token embeddings, then pools them to form chunk embeddings. Because the chunks retain surrounding context, pronouns and co-references are preserved
Hierarchical / Parent-Child: Search on small chunks; pass the parent chunk to generation. Balances precision and context length
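As a rough illustration of Semantic Chunking from the list above, the sketch below starts a new chunk whenever the cosine similarity between consecutive sentence embeddings falls below a threshold. The `embed` callable is a hypothetical stand-in for an embedding model, and the fixed threshold of 0.5 is an assumption made for simplicity; detecting a "sharp drop" relative to the document's own similarity distribution is a common refinement.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding function
    threshold: float = 0.5,                   # illustrative value; tune per corpus
) -> list[list[str]]:
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # similarity dropped: start a new chunk
        else:
            chunks[-1].append(sentence)  # still on topic: extend the current chunk
    return chunks
```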

Late Chunking implementation example:

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")


def late_chunking(long_text: str, chunk_boundaries: list[tuple[int, int]]):
    # chunk_boundaries are (start, end) token indices, not character offsets.
    inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        # Encode the whole document once so every token sees the full context.
        token_embeds = model(**inputs).last_hidden_state[0]
    chunk_embeds = []
    for start, end in chunk_boundaries:
        # Mean-pool the contextualized token embeddings inside each chunk.
        chunk_embeds.append(token_embeds[start:end].mean(dim=0))
    return torch.stack(chunk_embeds)
```

After deploying Late Chunking in a contract RAG at KGA, Recall@10 improved from 71% to 88% on documents with heavy cross-referencing (e.g., "as defined in Article 3").
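The Hierarchical / Parent-Child strategy listed earlier is also easy to prototype. The sketch below is a simplified illustration with hypothetical names: children are fixed-size character slices of each parent (any chunker could be substituted), and ranked child hits are resolved to de-duplicated parent texts for the generation step.

```python
def build_parent_child_index(
    parents: dict[str, str],
    child_size: int = 200,  # illustrative size in characters
) -> dict[str, tuple[str, str]]:
    """Map child_id -> (parent_id, child_text)."""
    index = {}
    for parent_id, text in parents.items():
        for n, start in enumerate(range(0, len(text), child_size)):
            index[f"{parent_id}#{n}"] = (parent_id, text[start:start + child_size])
    return index


def retrieve_parents(
    ranked_child_ids: list[str],
    index: dict[str, tuple[str, str]],
    parents: dict[str, str],
    max_parents: int = 3,
) -> list[str]:
    """Resolve ranked child hits to unique parent texts, keeping rank order."""
    seen: list[str] = []
    for child_id in ranked_child_ids:
        parent_id = index[child_id][0]
        if parent_id not in seen:
            seen.append(parent_id)
        if len(seen) == max_parents:
            break
    return [parents[p] for p in seen]
```

Only the child texts are embedded and searched; the parent texts are what reach the LLM, which is where the balance between precision and context length comes from.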

Layer 5: Evaluation Metrics and Offline Evaluation

Retrieval cannot be judged by "it seems to work." Quantitative metrics and regression tests are mandatory.

Recall@k: The proportion of ground-truth documents appearing in the top k. The most important metric — it sets the RAG ceiling
nDCG@10 (Normalized Discounted Cumulative Gain): Accounts for ranking quality, not just presence. The de facto standard in commercial search
MRR (Mean Reciprocal Rank): Mean of the reciprocal of the rank of the first correct result. Closely approximates user-perceived quality
Hit Rate@k: Binary — did the correct document appear in the top k?
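For a single query, the simpler metrics above reduce to a few lines each. This is a minimal sketch with hypothetical function names; `relevant` is the set of ground-truth document IDs for the query, and MRR is the mean of `reciprocal_rank` over all queries in the evaluation set.

```python
def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Share of ground-truth documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def hit_rate_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one ground-truth document is in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: list[str], relevant: set[str]) -> float:
    """1 / rank of the first ground-truth document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: list[float]) -> float:
    """Average a per-query metric over the evaluation set."""
    return sum(values) / len(values) if values else 0.0
```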

Evaluation sets should have a minimum of 100 queries, ideally 500. Maintain query and ground-truth document ID pairs in CSV format and run as regression tests in CI.

```python
import numpy as np


def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int = 10) -> float:
    # relevance maps doc_id -> graded relevance label; unlisted documents count as 0.
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    # Exponential gain (2^g - 1), discounted by log2(rank + 1) with 1-based rank.
    dcg = sum((2 ** g - 1) / np.log2(idx + 2) for idx, g in enumerate(gains))
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum((2 ** g - 1) / np.log2(idx + 2) for idx, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0
```

Frameworks like LlamaIndex's `RetrieverEvaluator`, RAGAS, and TruLens standardize these calculations. In 2026, start by installing RAGAS rather than building from scratch.

Reference Production Configuration

The standard production RAG configuration KGA proposes in 2026:

Ingest: Document → Late Chunking (Jina v3) → store parent/child
Index: Store both dense + sparse in pgvector or Qdrant
Query: Multi-Query (generate 3 variations with LLM)
First stage: Hybrid BM25 + Dense → RRF fusion → top 100
Second stage: Cohere Rerank-3 → compress to top 10
Generation: Pass top 10 to Claude Opus 4.7 for final answer
Eval: Run regression tests across 500 queries weekly in CI; block on nDCG@10 / MRR / Recall@20 threshold violations
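The Eval step can start out as a small script that CI runs against the evaluation CSV. The sketch below is one possible shape, not a prescribed format: the column names `query` and `doc_ids`, the `|` separator between document IDs, the `search` callable, and the 0.85 threshold are all assumptions chosen for illustration.

```python
import csv
import io
from typing import Callable


def load_eval_set(csv_text: str) -> list[tuple[str, set[str]]]:
    """Parse rows of `query,doc_ids`, with ground-truth IDs separated by '|'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["query"], set(row["doc_ids"].split("|"))) for row in reader]


def recall_gate(
    eval_set: list[tuple[str, set[str]]],
    search: Callable[[str], list[str]],  # hypothetical retriever under test
    k: int = 20,
    threshold: float = 0.85,             # illustrative threshold
) -> tuple[bool, float]:
    """Return (passed, mean Recall@k) over the evaluation set."""
    recalls = []
    for query, relevant in eval_set:
        top_k = set(search(query)[:k])
        recalls.append(len(top_k & relevant) / len(relevant))
    mean_recall = sum(recalls) / len(recalls)
    return mean_recall >= threshold, mean_recall
```

In a CI job, the script would exit with a non-zero status when `passed` is false so that the pipeline blocks the change; the same pattern extends to nDCG@10 and MRR.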

Build this configuration rigorously, and "the RAG gets smarter without changing the LLM" becomes a real, felt experience. Retrieval is the core of RAG, and in 2026 it is the highest-priority layer for project investment.
