Why Embedding Models Have Become the Primary Battleground Again
In the early days of the 2023 RAG boom, embedding models were largely dismissed: "text-embedding-ada-002 is good enough" was the prevailing attitude. The situation in 2026 is entirely different. As LLM generation quality approaches saturation for some tasks, it is now widely understood that what you fail to retrieve sets the ceiling for the entire RAG system, because a generator cannot use a passage it was never given. Embedding model selection and tuning have regained their status as the highest-ROI investment available.
The MTEB (Massive Text Embedding Benchmark) leaderboard on Hugging Face lists 200+ models as of April 2026, but the practical options for Japanese-language products narrow to around 10. This post compares models from OpenAI, Voyage, Cohere, BAAI, and Jina on MTEB, BEIR, and JMTEB scores, along with ease of implementation.
April 2026 Score Summary
- OpenAI text-embedding-3-large: MTEB 64.6, JMTEB 75.2, 3072 dimensions, 8192-token context, $0.13/M tokens
- OpenAI Embed-4 (released 2026/02): MTEB 68.9, JMTEB 79.8, 4096 dimensions, 32k-token context, $0.18/M tokens
- Voyage-3-large: MTEB 67.8, BEIR 58.3, 1024/2048/4096 dimensions, 32k-token context, $0.18/M tokens
- Voyage-3: MTEB 64.2, 1024 dimensions, 32k-token context, $0.06/M tokens (among the best value-for-cost)
- Cohere Embed v4: MTEB 68.1, JMTEB 77.5, 1536 dimensions, 128k-token context, $0.12/M tokens, multimodal
- BGE-M3 (BAAI): MTEB 59.4, JMTEB 73.1, 1024 dimensions, 8192-token context, OSS (MIT)
- Jina Embeddings v3: MTEB 65.5, JMTEB 74.8, 1024 dimensions (Matryoshka), 8192-token context, OSS + API
On raw MTEB scores, Embed-4, Cohere Embed v4, and Voyage-3-large sit at the top, but the production decision depends on four axes: (1) fine-tuning feasibility for your domain, (2) latency and dimensionality, (3) multilingual and Japanese-language performance, (4) data residency compliance.
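To make the pricing column of the summary concrete, here is a rough sketch of the one-time cost of embedding a corpus with each hosted model. The prices are copied from the list above; the corpus size (10M chunks of 500 tokens) and the function name `embedding_cost_usd` are hypothetical, and the estimate ignores query traffic and re-indexing.

```python
# Prices in USD per million tokens, copied from the score summary above.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-large": 0.13,
    "Embed-4": 0.18,
    "Voyage-3-large": 0.18,
    "Voyage-3": 0.06,
    "Cohere Embed v4": 0.12,
}


def embedding_cost_usd(num_docs: int, avg_tokens_per_doc: int, price_per_m: float) -> float:
    """Cost of embedding a corpus once (no re-indexing, no query traffic)."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_m


# Hypothetical corpus: 10M chunks of 500 tokens each (5B tokens in total).
costs = {
    name: embedding_cost_usd(10_000_000, 500, price)
    for name, price in PRICE_PER_M_TOKENS.items()
}

# Print cheapest first.
for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name:24s} ${cost:,.0f}")
```

At this scale the gap between the cheapest and the most expensive option is a few hundred dollars, so for a one-off index build price is rarely the deciding axis; it matters more when the corpus is re-embedded frequently.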
Japanese Performance: Where JMTEB Matters
JMTEB, a Japanese-language counterpart to MTEB maintained by SB Intuitions, evaluates Retrieval, STS, Classification, Clustering, and Reranking and reports a combined score. Models that rank highly on English MTEB can shift significantly on JMTEB.
April 2026 JMTEB trends:
- On the Japanese Retrieval subset: Embed-4 > Voyage-3-large > Cohere Embed v4 > BGE-M3 > text-embedding-3-large
- On STS (sentence similarity): Cohere Embed v4 leads. It handles the wide variety of Japanese keigo and phrasing variations particularly well.
- On cross-lingual retrieval (Japanese-English semantic search): BGE-M3 performs surprisingly well, within close range of Voyage-3-large.
For domestic finance and public sector projects where data cannot cross borders, OpenAI, Voyage, and Cohere are off the table. Self-hosting BGE-M3 is the only viable path. In 2026, running GGUF/AWQ-quantized BGE-M3 via llama.cpp or vLLM on a single H100 can handle 2,000 req/s, and it has become the default embedding model for on-premises RAG deployments.
Matryoshka Representation: Layering Dimensions
Matryoshka Representation Learning (MRL) is a technique that, as of 2026, is common to Voyage-3, Jina v3, and OpenAI's text-embedding-3 series. The model is trained with a hierarchical loss function so that just the first k dimensions of the resulting vector still carry sufficient semantic meaning.
Previously, "3072 dimensions gives high accuracy but is heavy; truncating to 256 dimensions causes accuracy to collapse" was the tradeoff. With MRL-enabled models, a two-stage approach becomes possible: index at 3072 dimensions, run first-stage retrieval at 256 dimensions for speed, then re-rank with the full 3072 dimensions at the second stage.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()
texts = ["first document", "second document"]  # example inputs

# Request 256-dimension vectors; the API shortens the MRL embedding server-side
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=256,
)
short_vecs = np.array([d.embedding for d in resp.data])

# Separately obtain the full 3072-dimension vectors for re-ranking
resp_full = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
)
full_vecs = np.array([d.embedding for d in resp_full.data])
```
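The two-stage flow described above can be sketched offline with NumPy. This is a minimal illustration under stated assumptions, not a production design: brute-force dot products stand in for the ANN index, the function name `two_stage_search` and its parameters are hypothetical, and the random vectors are stand-ins that do not carry the MRL property of a real model. The short vectors are derived by slicing the full vectors and re-normalizing.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def two_stage_search(query, doc_vecs, short_dim=256, shortlist=100, top_k=10):
    """Coarse search on the first `short_dim` dims, then re-rank with all dims."""
    # Stage 1: truncate to the leading dimensions and re-normalize before comparing.
    doc_short = l2_normalize(doc_vecs[:, :short_dim])
    q_short = l2_normalize(query[None, :short_dim])[0]
    coarse = doc_short @ q_short
    shortlist = min(shortlist, len(doc_vecs))
    candidates = np.argsort(-coarse)[:shortlist]

    # Stage 2: cosine similarity on the full-width vectors, shortlist only.
    q_full = l2_normalize(query[None, :])[0]
    fine = l2_normalize(doc_vecs[candidates]) @ q_full
    order = np.argsort(-fine)[:top_k]
    return candidates[order], fine[order]


# Random stand-in data; real vectors would come from the embeddings API.
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 3072)).astype(np.float32)
query = docs[42] + 0.01 * rng.normal(size=3072).astype(np.float32)

ids, scores = two_stage_search(query, docs)
print(ids[0], round(float(scores[0]), 4))
```

In a real deployment the short vectors would be precomputed and held in the vector database's index, with the full vectors kept on cheaper storage and fetched only for the shortlist.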
Storing 100M vectors at 3072 dimensions as float32 requires about 1.2 TB of raw data. Trimming to 256 dimensions with MRL brings that down to about 100 GB, making HNSW construction and memory residency practical. Combined with the multi-vector capabilities in Qdrant and Weaviate, this becomes even more powerful.
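The storage figures follow from simple arithmetic. The sketch below assumes 4-byte float32 values and counts only the raw vectors, with no index overhead or metadata; the helper name `raw_vector_bytes` is illustrative.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense vector table, excluding index overhead and metadata."""
    return num_vectors * dims * bytes_per_value


full = raw_vector_bytes(100_000_000, 3072)   # float32 assumed
short = raw_vector_bytes(100_000_000, 256)

print(f"3072 dims: {full / 1e12:.2f} TB")    # 3072 dims: 1.23 TB
print(f" 256 dims: {short / 1e9:.1f} GB")    #  256 dims: 102.4 GB
print(f"reduction: {full // short}x")        # reduction: 12x
```

The 12x reduction is just the ratio of dimensions (3072 / 256), so the same helper can be used to size any other truncation length or a lower-precision format.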
ColBERT and Late Interaction
In 2026, production RAG systems implementing Late Interaction (ColBERT-style retrieval) alongside or in place of single dense vectors have become more common. ColBERT does not compress a document into a single vector. It retains one vector per token, and similarity is computed against the query-side token vectors using MaxSim: each query token is matched to its most similar document token, and those maxima are summed.
- Nuance retention in long documents is dramatically better than single dense vectors
- Storage cost is 10–50× that of dense (depending on token count)
- Qdrant 1.12, Vespa, and Weaviate 1.28 all support multi-vector natively
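The MaxSim scoring mentioned above can be sketched in a few lines of NumPy. This is a toy illustration with hand-written 2-dimensional "token embeddings", not the implementation used by any of the engines listed; the function name `maxsim_score` is an assumption for the example.

```python
import numpy as np


def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late-interaction score.

    query_tokens: (num_query_tokens, dim), rows L2-normalized
    doc_tokens:   (num_doc_tokens, dim), rows L2-normalized
    """
    sim = query_tokens @ doc_tokens.T        # token-to-token cosine similarities
    return float(sim.max(axis=1).sum())      # best document token per query token, summed


# Toy example with 2-dimensional unit vectors standing in for token embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])   # matches both query tokens
doc_b = np.array([[1.0, 0.0], [0.8, 0.6]])               # matches only the first

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every document token keeps its own vector, the index grows with document length, which is where the storage multiple in the list above comes from.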
Jina-ColBERT-v2 and ColBERTv2 (Stanford) deliver retrieval performance approaching top dense models on MTEB asymmetric tasks while remaining more robust to domain shift. They are particularly effective for long contracts, academic papers, and source code — content that cannot be adequately compressed into a single vector.
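```python
from ragatouille import RAGPretrainedModel

documents = ["full text of contract 1", "full text of contract 2"]  # example inputs

# Load the pretrained late-interaction model
rag = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

# Build the index; long documents are split into chunks of up to 512 tokens
rag.index(
    collection=documents,
    index_name="contracts",
    max_document_length=512,
    split_documents=True,
)

# Retrieve the 20 best-matching chunks
hits = rag.search("Warranty period clauses for LCD panels", k=20)
```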
Domain-Specific Fine-Tuning
General-purpose embedding models are broad but inevitably lose to specialized domain models in areas like healthcare, legal, or proprietary terminology. In 2026, the toolchain has matured to the point where a contrastive fine-tuning run takes a matter of hours.
- sentence-transformers 3.x: `SentenceTransformerTrainer` with LoRA fine-tuning
- Voyage Fine-tuning API: Generates a custom model from domain query/document pairs in around 2 hours
- Cohere Custom Models: Domain learning for both Rerank and Embed
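For readers unfamiliar with what contrastive fine-tuning optimizes, the sketch below computes an InfoNCE loss with in-batch negatives in plain NumPy. This is the idea behind sentence-transformers' `MultipleNegativesRankingLoss`; whether the tools listed above use exactly this objective is an assumption, and the function name, the `scale` value of 20, and the random data are illustrative only.

```python
import numpy as np


def in_batch_contrastive_loss(q: np.ndarray, d: np.ndarray, scale: float = 20.0) -> float:
    """InfoNCE with in-batch negatives.

    q[i] and d[i] are a matching query/document pair; every d[j] with j != i
    in the same batch is treated as a negative for q[i].
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = scale * (q @ d.T)                        # (batch, batch) scaled cosine sims
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # cross-entropy against the diagonal


rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 32))
aligned_queries = docs + 0.05 * rng.normal(size=(8, 32))   # queries close to their documents
random_queries = rng.normal(size=(8, 32))                  # queries unrelated to the documents

loss_aligned = in_batch_contrastive_loss(aligned_queries, docs)
loss_random = in_batch_contrastive_loss(random_queries, docs)
print(loss_aligned, loss_random)
```

Training pushes the loss from the "random" regime toward the "aligned" one by moving each query embedding closer to its paired document and away from the other documents in the batch, which is why in-domain query/document pairs are the key input.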
On a KGA healthcare project, fine-tuning BGE-M3 on 50,000 pairs of in-house Q&A produced an 8-point improvement in nDCG@10 on the JMTEB healthcare Retrieval subset and a 14-point improvement on the internal test set. The value of breaking dependency on the best general-purpose model is significant.
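For reference, nDCG@10 can be computed as in the sketch below. It uses the linear-gain form of DCG, and the document IDs and relevance labels are made up for illustration; they are not the project's data. On the 0 to 1 scale used here, an "8-point improvement" would presumably correspond to a gain of 0.08.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_ids: document IDs in the order the retriever returned them
    relevance:  mapping from document ID to graded relevance (missing means 0)
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0


# Hypothetical query with two relevant documents.
relevance = {"d1": 1, "d2": 1}
before = ndcg_at_k(["d9", "d1", "d7", "d2"], relevance)   # relevant docs at ranks 2 and 4
after = ndcg_at_k(["d1", "d2", "d9", "d7"], relevance)    # relevant docs at ranks 1 and 2
print(round(before, 3), round(after, 3))
```

A benchmark score is the mean of this value over all evaluation queries, so the metric rewards moving relevant documents up the ranking even when the retrieved set itself does not change.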
Selection Rules
- Broad domain, English-centric, API access acceptable — Voyage-3-large
- Japanese-centric, 128k long-context required — Cohere Embed v4
- On-premises required, MIT license — BGE-M3
- Cost efficiency above all — Voyage-3 or text-embedding-3-small
- Unified image + text retrieval — Cohere Embed v4 (multimodal)
- Long-text nuance at highest priority — Jina-ColBERT-v2 (Late Interaction)
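The rules above can also be written down as a small decision function, which is handy for documenting a team's choice. The model names come from the list; the `Requirements` fields and, in particular, the order in which the rules are checked are assumptions made for this sketch, since the list does not say which rule wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    on_prem_required: bool = False
    multimodal: bool = False
    long_text_nuance_first: bool = False
    japanese_centric: bool = False
    needs_128k_context: bool = False
    cost_first: bool = False


def pick_embedding_model(req: Requirements) -> str:
    """Map requirements to a recommendation; the first matching rule wins."""
    if req.on_prem_required:                 # hard constraint, so checked first (assumption)
        return "BGE-M3"
    if req.multimodal:
        return "Cohere Embed v4"
    if req.long_text_nuance_first:
        return "Jina-ColBERT-v2"
    if req.japanese_centric and req.needs_128k_context:
        return "Cohere Embed v4"
    if req.cost_first:
        return "Voyage-3 or text-embedding-3-small"
    return "Voyage-3-large"                  # broad domain, English-centric, API acceptable


print(pick_embedding_model(Requirements(on_prem_required=True)))
print(pick_embedding_model(Requirements(japanese_centric=True, needs_128k_context=True)))
print(pick_embedding_model(Requirements()))
```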
Embedding selection is not "pick one model and done." The 2026 production architecture is a layered design: first-stage dense + second-stage Late Interaction + domain reranker.