The Synthetic Data Era Has Arrived
The open-source LLM fine-tuning landscape of 2026 has shifted from the human-annotation-centered approach of 2023–2024 to a teacher-model-driven synthetic data approach. High-quality datasets distilled from frontier closed-source models such as Claude Opus 4.7, GPT-5, and Gemini 2.5 Ultra are publicly available, and 7B–13B base models can now reach instruction-following capabilities comparable to 70B models from 2024.
This article organizes April 2026 best practices across five axes: data generation, algorithms, Japanese-language specialization, reproducible recipes, and ethics.
Standard Teacher Model Distillation Pipeline
Microsoft's Phi series pioneered the "textbook-quality data" philosophy, and it has been refined further in 2026. Community datasets replicating the Phi-5/Phi-5-mini recipe have standardized on the following pipeline:
- Seed data extraction: pull the top 5% by quality score from Common Crawl + GitHub + arXiv + Stack Exchange
- Question generation via teacher model: prompt Claude Opus 4.7 to generate "10 questions a graduate student might ask about this document"
- Answer generation with chain-of-thought: GPT-5 generates answers with reasoning traces, which are then checked for self-consistency
- Difficulty balancing: mix easy/medium/hard at a 3:5:2 ratio, keeping examples of 200–4000 tokens in length
- Rejection sampling: a separate teacher model scores the outputs, and the bottom 30% are discarded
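The last two steps of the pipeline above can be sketched in a few lines. This is a minimal illustration, not the recipe's actual code: the `Example` record, its field names, and both function names are hypothetical, and the difficulty label and teacher score are assumed to have been attached to each example already.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    text: str
    difficulty: str  # "easy", "medium", or "hard" (assumed pre-labeled)
    n_tokens: int
    score: float     # quality score from the second teacher model (assumed)


def balance_difficulty(examples, ratio=None, min_tokens=200, max_tokens=4000, seed=0):
    """Filter by length, then subsample to the easy:medium:hard ratio."""
    ratio = ratio or {"easy": 3, "medium": 5, "hard": 2}
    rng = random.Random(seed)
    buckets = {name: [] for name in ratio}
    for e in examples:
        if min_tokens <= e.n_tokens <= max_tokens and e.difficulty in buckets:
            buckets[e.difficulty].append(e)
    # The scarcest bucket, relative to its share, limits how many ratio "units" fit.
    units = min(len(buckets[name]) // share for name, share in ratio.items())
    mixed = []
    for name, share in ratio.items():
        mixed.extend(rng.sample(buckets[name], units * share))
    rng.shuffle(mixed)
    return mixed


def reject_bottom(examples, drop_fraction=0.30):
    """Drop the lowest-scoring fraction of examples (rejection sampling step)."""
    ranked = sorted(examples, key=lambda e: e.score, reverse=True)
    keep = len(ranked) - int(len(ranked) * drop_fraction)
    return ranked[:keep]
```

Applying `reject_bottom` after `balance_difficulty` follows the order of the list, but it can skew the 3:5:2 mix if scores correlate with difficulty; rejecting within each difficulty bucket would preserve the ratio.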
The MAP-Neo-v2 dataset published in March 2026 (2.1T tokens, CC-BY-4.0) is a Japanese-English-Chinese multilingual corpus built with this pipeline. The compute required for continued pretraining on a Llama 3 8B base was equivalent to roughly ¥3 billion, and the dataset is being distributed for free.
Choosing Between DPO, IPO, and KTO
Preference learning has moved past the RLHF era into computationally lighter offline methods. Here is when to use each:
- DPO (Direct Preference Optimization): first choice when you have abundant pairwise preference data. Simple to implement, 1/5 the compute cost of PPO. Weaker reward-hacking resistance than PPO.
- IPO (Identity Preference Optimization): theoretically addresses DPO's overfitting problem. Outperforms DPO especially on small datasets (under 10K pairs).
- KTO (Kahneman-Tversky Optimization): no pairs required; it learns from binary good/bad labels alone. It can directly use thumbs-up/thumbs-down logs from users, which is a major practical advantage.
- SimPO: improves on DPO by removing the reference model. 40% memory reduction with performance maintained. Close to becoming the 2026 standard.
- RLAIF (Reinforcement Learning from AI Feedback): replaces human labelers with Claude or GPT. 1/100 the cost, at ~95% of human-annotation quality.
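To make the KTO point concrete, the sketch below turns thumbs-up/thumbs-down logs into unpaired training records. The log field names (`prompt`, `response`, `thumb`) and the function name are hypothetical, and the output layout (`prompt`, `completion`, boolean `label`) is an assumption modelled on the unpaired format that KTO trainers commonly expect; check your trainer's documentation for the exact schema.

```python
def logs_to_kto_records(feedback_logs):
    """Turn thumbs-up/down logs into unpaired (prompt, completion, label) records."""
    records = []
    for log in feedback_logs:
        if log.get("thumb") not in ("up", "down"):
            continue  # skip interactions without explicit feedback
        records.append({
            "prompt": log["prompt"],
            "completion": log["response"],
            "label": log["thumb"] == "up",  # True = desirable, False = undesirable
        })
    return records
```

No chosen/rejected pairing is needed: each rated response becomes one record, which is why raw product logs map onto KTO more directly than onto DPO-style pair data.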
```yaml
# SimPO configuration in axolotl (Qwen 3 7B base)
base_model: Qwen/Qwen3-7B-Base
rl: simpo
simpo_gamma: 1.4
simpo_beta: 2.0
datasets:
  - path: argilla/ultrafeedback-binarized-preferences-cleaned
    type: chatml.ultra
learning_rate: 5.0e-7
num_epochs: 1
sample_packing: true
gradient_checkpointing: true
adapter: lora
lora_r: 64
lora_alpha: 128
```
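The objectives compared above differ mainly in what they do with response log-probabilities. The sketch below writes the per-example losses in plain Python, following the forms given in the respective papers. Inputs are summed log-probabilities of a response under the policy (`pi_*`) and the reference model (`ref_*`). The function names are illustrative, the SimPO defaults simply mirror the `beta` and `gamma` values in the config, and treating `gamma` as the raw target margin is an assumption about how that config maps onto the objective. In KTO, `z_ref` stands in for the KL-based reference point, which a real trainer estimates from the batch.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def sigmoid(x):
    return math.exp(log_sigmoid(x))


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO: logistic loss on the policy-vs-reference log-ratio margin."""
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    return -log_sigmoid(beta * margin)


def ipo_loss(pi_w, pi_l, ref_w, ref_l, tau=0.1):
    """IPO: squared loss pulling the same margin toward 1 / (2 * tau)."""
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(pi_w, pi_l, len_w, len_l, beta=2.0, gamma=1.4):
    """SimPO: length-normalized log-probs, target margin gamma, no reference model."""
    return -log_sigmoid(beta * (pi_w / len_w - pi_l / len_l) - gamma)


def kto_loss(pi_y, ref_y, desirable, z_ref=0.0, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """KTO: unpaired loss on a single completion with a binary label."""
    reward = pi_y - ref_y
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (reward - z_ref)))
    return lambda_u * (1.0 - sigmoid(beta * (z_ref - reward)))
```

Only `simpo_loss` takes no `ref_*` arguments, which is where its memory saving comes from: no reference model needs to be kept in memory or run forward. The bounded squared target in `ipo_loss` is what keeps the margin from growing without limit, the overfitting behaviour attributed to DPO above.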
Japanese-Language Model Progress
By 2026, the route of continued training on top of foreign base models has decisively won for Japanese LLMs. Here's the current status of the three major lineages:
Swallow v3 (Institute of Science Tokyo, formerly Tokyo Institute of Technology): continued pretraining + instruction tuning on Llama 4 70B. 600B additional Japanese tokens, JMT-Bench 8.52, Jaster 77.4. Free for research; commercial use follows the Llama 4 Community License.
Rinna Nekomata-2 (rinna): Qwen 3 72B base, commercially usable under Apache 2.0. Outperforms Swallow in honorifics, formal register, and business document fluency; JMT-Bench 8.47.
Sarashina 2.5 (SB Intuitions): hybrid of scratch training and Llama 4 distillation. Two sizes: 405B and 70B. As the standard-bearer for domestically developed "sovereign AI," adoption in finance, healthcare, and municipal government is accelerating rapidly.
The key 2026 trend: Japanese-specific model development has decomposed into three stages (base model selection × Japanese synthetic data × lightweight preference learning), each reproducible by anyone with a few hundred lines of axolotl YAML.
Reproducible Recipe: axolotl × unsloth
unsloth in its 2026 version has improved QLoRA memory efficiency by 4.2x, reaching the point where a 70B QLoRA run fits on a single RTX 4090. axolotl supports both distributed training and preference learning with high reproducibility in multi-node, multi-GPU setups.
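As a sanity check on the memory claim above, the weights-only footprint can be estimated from parameter count and bits per weight. The sketch below is arithmetic only. The 24 GB figure is the RTX 4090's VRAM, the function names are illustrative, and the estimate ignores activations, LoRA adapters, optimizer state, and any CPU offloading, none of which the text specifies.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the quantized base weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def max_bits_per_param(n_params, vram_gb, reserved_gb=0.0):
    """Largest average bits per weight that still fits in VRAM after a reserve."""
    return (vram_gb - reserved_gb) * 1e9 * 8 / n_params


print(weight_memory_gb(70e9, 4))               # 35.0 GB at 4 bits per weight
print(round(max_bits_per_param(70e9, 24), 2))  # 2.74 bits per weight fills 24 GB
```

At 4 bits per weight, 70B parameters need about 35 GB for weights alone, so fitting the run in 24 GB implies an average below roughly 2.7 bits per weight or some form of offloading.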
A typical Japanese instruction-tuning recipe:
- Choose base model (Qwen 3 7B Base)
- Japanese synthetic data: 500K examples (Claude Opus 4.7 distillation, CC-BY-4.0)
- unsloth + QLoRA r=128, 3 epochs, 18 hours on a single 3090
- SimPO phase: 100K pairs from rinna/ultrafeedback-ja, 6 hours on a single 4090
- Evaluation: JMT-Bench, Jaster, elyza-tasks-100
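The rank in the QLoRA step sets how many parameters are actually trained: for a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below counts them for a hypothetical layout (four 4096×4096 attention projections per layer, 32 layers). These shapes are placeholders for illustration, not the dimensions of any model named in the recipe.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def adapter_params(layer_shapes, n_layers, r):
    """Total trainable parameters when every listed projection gets an adapter."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in layer_shapes)
    return per_layer * n_layers


# Hypothetical layout: q/k/v/o projections, each 4096 x 4096, across 32 layers.
shapes = [(4096, 4096)] * 4
print(adapter_params(shapes, n_layers=32, r=128))  # 134217728
print(adapter_params(shapes, n_layers=32, r=64))   # 67108864
```

Adapter size grows linearly with `r`, so moving from r=64 to r=128 doubles the trainable parameters and the optimizer state kept for them.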
Total cost: roughly $180 in cloud compute equivalents. At that price, building a Japanese model that outperforms 2024 commercial APIs is within reach of individual practitioners.
Ethics and Data Provenance
The most important point to emphasize is data provenance. Even synthetic data carries the shadow of the teacher model's training data and its copyright implications. With the EU AI Act's obligations phasing in (it entered into force in August 2024, and most provisions apply from August 2026), models intended for European deployment must document:
- License list of seed data (including robots.txt compliance status)
- Teacher model ToS and derivative work clauses
- PII removal methodology and filter accuracy
- Bias evaluation (BBQ-ja, StereoSet-ja, etc.)
- Right-to-erasure compliance procedures
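One way to keep this checklist from drifting is to store it as a structured record and fail a release when any item is blank. The sketch below is a hypothetical schema, not the Hugging Face dataset card format or anything prescribed by the EU AI Act; all field and function names are illustrative.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ProvenanceRecord:
    seed_licenses: list = field(default_factory=list)  # licenses of seed data sources
    robots_txt_checked: bool = False                   # robots.txt compliance status recorded
    teacher_tos_notes: str = ""                        # teacher ToS and derivative-work clauses
    pii_method: str = ""                               # how PII was removed
    pii_filter_accuracy: Optional[float] = None        # measured filter accuracy
    bias_evals: dict = field(default_factory=dict)     # benchmark name -> result reference
    erasure_procedure: str = ""                        # right-to-erasure process


def missing_fields(record):
    """Return the names of checklist items that are still empty."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or value is False or value == "" or value == [] or value == {}:
            missing.append(f.name)
    return missing
```

Calling `missing_fields(ProvenanceRecord())` lists all seven items, so a publishing script can refuse to upload until the list comes back empty.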
Hugging Face made Dataset Cards v2 mandatory in March 2026; datasets lacking the above information are excluded from download statistics displays. If you're building for commercial use, provenance documentation is a high-ROI investment.
What to Watch in H2 2026
Self-improvement loops (self-play, self-reward) are moving from research to practical application. Successors to Meta's Self-Rewarding Language Models, public implementations of Anthropic's Constitutional AI, and a domestically developed Japanese-language "Constitutional AI" are all anticipated. Fine-tuning practitioners are now differentiated less by algorithm mastery than by their skill in data design and evaluation design.