AI / Machine Learning · In Development

LoRA-JP — Japanese Domain LoRA Fine-tuning Pipeline

A reproducible QLoRA/LoRA fine-tuning pipeline for Japanese business documents. Uses Unsloth+TRL to accelerate training and vLLM dynamic LoRA adapters for hot-swappable serving. Internal R&D project.

2026 · Ongoing (R&D prototype) · 2026-05
#LoRA #Fine-tuning #vLLM #Japanese #QLoRA

Live Demo

Preview of the application interface (demo: app.finetune.jp/dashboard)

  • Train loss: 0.174 (step 4212/5000)
  • Eval F1: 0.884 (+26pt vs. base)
  • Throughput: 1,842 tok/s (4x A100-80G)
  • LoRA rank: r=64 (α=128, QLoRA 4-bit)

Training curve

[Chart: train vs. eval loss over steps 0–5k]

VRAM usage

48 GB of 80 GB (4-bit QLoRA)

  • base weights: 16.4 GB (4-bit)
  • adapters + KV cache: 31.6 GB
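The 48 GB figure decomposes as a back-of-envelope check, assuming a ~32.8B-parameter base model (Qwen2.5-32B class); the numbers below are illustrative, not measured:

```python
# Back-of-envelope check of the 48 GB breakdown, assuming a ~32.8B-
# parameter base model (Qwen2.5-32B class). Illustrative, not measured.

def base_weights_gb(n_params_billions: float, bits: int = 4) -> float:
    """Memory for the quantized base weights in GB (1 GB = 1e9 bytes)."""
    return n_params_billions * 1e9 * bits / 8 / 1e9

base = base_weights_gb(32.8)   # 4-bit base weights
rest = 48.0 - base             # adapters + KV cache, per the dashboard
print(f"base {base:.1f} GB, adapters+kv {rest:.1f} GB")
# → base 16.4 GB, adapters+kv 31.6 GB
```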

Checkpoint registry

internal benchmark · in-house validation

ckpt      | base model    | dataset           | adapter | F1    | status
ckpt-4212 | Qwen2.5-32B   | jp-legal-42k      | 148 MB  | 0.884 | deployed
ckpt-4198 | Llama-3.1-70B | jp-medical-18k    | 312 MB  | 0.871 | eval
ckpt-4180 | Qwen2.5-32B   | jp-finance-26k    | 148 MB  | 0.852 | deployed
ckpt-4155 | Phi-3-14B     | jp-customer-94k   | 82 MB   | 0.818 | archive
ckpt-4142 | Qwen2.5-32B   | jp-legal-42k (v1) | 148 MB  | 0.806 | archive

Domain benchmark

base vs. tuned

  • Legal QA (legal-ja): base 62 → tuned 88 (+26pt)
  • Medical NER (med-ner-ja): base 71 → tuned 91 (+20pt)
  • Financial summarization (fin-sum): base 58 → tuned 84 (+26pt)
  • Long-document comprehension (jp-mmlu): base 66 → tuned 79 (+13pt)

Hot-swap adapter

live

  • checkpoint: ckpt-4212
  • adapter size: 148 MB
  • swap latency: 112 ms
  • base weights: kept hot, shared

Adapter swapped in place while base weights stay pinned in VRAM. Zero cold-start.
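The swap can be sketched against vLLM's runtime LoRA-loading endpoint (available when the server runs with `--enable-lora` and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True`); the host URL, adapter name, and paths below are placeholders:

```python
# Sketch of driving vLLM's runtime LoRA-loading endpoint. Assumes the
# server was started with --enable-lora and
# VLLM_ALLOW_RUNTIME_LORA_UPDATING=True; host and paths are placeholders.
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # hypothetical serving host

def swap_payload(name: str, path: str) -> bytes:
    """JSON body expected by /v1/load_lora_adapter."""
    return json.dumps({"lora_name": name, "lora_path": path}).encode()

def load_adapter(name: str, path: str) -> None:
    """Load (or replace) an adapter in place; base weights stay pinned."""
    req = request.Request(
        f"{BASE_URL}/v1/load_lora_adapter",
        data=swap_payload(name, path),
        headers={"Content-Type": "application/json"},
    )
    request.urlopen(req)

# Example, against a live server:
# load_adapter("ckpt-4212", "/adapters/jp-legal-42k/ckpt-4212")
```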

Challenge

Generic LLMs handle internal abbreviations, in-house terminology, and document formats poorly, and prompt engineering alone can't close the gap. Full fine-tuning, on the other hand, is too VRAM- and cost-heavy for small in-house experiments.

Solution

Unsloth-driven QLoRA (4-bit quantization + LoRA) makes training feasible on a single A100 40GB. TRL's SFTTrainer and DPOTrainer combine supervised and preference tuning, with WandB tracking loss and sample outputs. Resulting adapters are hot-swapped through vLLM's dynamic LoRA loading.
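A minimal sketch of the SFT step described above, assuming the Unsloth and TRL APIs named in the text; r=64 and α=128 match the dashboard, while the model name, sequence length, and batch settings are illustrative:

```python
# Minimal sketch of the QLoRA SFT step, assuming the Unsloth + TRL APIs
# described above. r=64 / alpha=128 match the dashboard; the model name,
# sequence length, and step budget are illustrative.

def lora_kwargs(r: int = 64, alpha: int = 128) -> dict:
    """LoRA settings shared across training jobs (4-bit QLoRA base)."""
    return {
        "r": r,
        "lora_alpha": alpha,
        "lora_dropout": 0.0,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    }

def run_sft(dataset, model_name: str = "Qwen/Qwen2.5-32B-Instruct"):
    # Heavy imports kept inside the function: they require a CUDA GPU.
    from unsloth import FastLanguageModel
    from trl import SFTConfig, SFTTrainer

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name, max_seq_length=4096, load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(model, **lora_kwargs())
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=SFTConfig(
            output_dir="ckpt-sft",
            max_steps=5000,              # dashboard shows step 4212/5000
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,
            learning_rate=2e-4,
            report_to="wandb",           # loss + sample outputs tracked
        ),
    )
    trainer.train()
```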

Results

  • 1.9× training throughput on identical hardware (Unsloth vs. vanilla HF)
  • +14% task accuracy on the internal eval set (internal benchmark)
  • Three adapters served concurrently via vLLM hot-swap with zero restarts
  • Training-job configs unified into reviewable YAML in PRs
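One bullet above mentions reviewable YAML job configs; a hypothetical job spec might look like this (the schema and field names are assumptions, the values mirror figures elsewhere on this page):

```yaml
# Illustrative training-job spec; schema is hypothetical.
job: sft-jp-legal
base_model: Qwen/Qwen2.5-32B-Instruct
dataset: jp-legal-42k
quantization: 4bit          # QLoRA via bitsandbytes
lora:
  r: 64
  alpha: 128
trainer: trl.SFTTrainer     # or trl.DPOTrainer for preference runs
max_steps: 5000
tracking: wandb
```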
Key Metrics

Measured Impact

  • Training throughput: 1.9x (vs. standard Hugging Face)
  • Internal task accuracy: +14% (internal evaluation)
  • Concurrent adapters: 3 (vLLM dynamic LoRA)
  • Training cost: 1x A100 40GB (single GPU)

Features

What it does

Training

  • QLoRA 4-bit: minimizes VRAM use with bitsandbytes + PEFT.
  • Preference learning: trains on preferred responses with TRL's DPOTrainer.

Serving

  • Dynamic LoRA switching: hot-swap via vLLM, no restarts.
  • Multi-adapter multiplexing: three adapter lines served concurrently on one base model.
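The preference-learning feature can be sketched with TRL's DPOTrainer; the prompt/chosen/rejected schema is TRL's documented DPO dataset format, while the hyperparameters, checkpoint name, and `processing_class` keyword (recent TRL versions) are assumptions:

```python
# Sketch of the preference-tuning step with TRL's DPOTrainer. The
# prompt/chosen/rejected schema is TRL's DPO dataset format; the
# hyperparameters and checkpoint name are placeholders.

def preference_record(prompt: str, chosen: str, rejected: str) -> dict:
    """One DPO example: a preferred and a dispreferred response."""
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def run_dpo(model, tokenizer, train_dataset):
    # Heavy imports kept local: requires a GPU and the SFT checkpoint.
    from trl import DPOConfig, DPOTrainer

    trainer = DPOTrainer(
        model=model,
        args=DPOConfig(output_dir="ckpt-dpo", beta=0.1, max_steps=500),
        train_dataset=train_dataset,
        processing_class=tokenizer,  # tokenizer argument in recent TRL
    )
    trainer.train()
```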

Architecture

System Layers

Layered architecture showing components, responsibilities, and data flow.

L1 · Data layer

Training pairs version-managed as Parquet.

Components: internal-document ETL · Sudachi preprocessing · Parquet

L2 · Training layer

QLoRA keeps VRAM low while integrating preference learning.

Components: Unsloth · TRL SFTTrainer · DPOTrainer · DeepSpeed ZeRO-2

L3 · Serving layer

Multiple adapters multiplexed on one base model.

Components: vLLM · dynamic LoRA · Triton · Envoy
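The L1 data layer can be illustrated with a dependency-free sketch; production would write Parquet and run Sudachi tokenization (both elided here), and all field names below are assumptions:

```python
# Dependency-free sketch of the L1 ETL output: internal documents
# normalized into SFT pairs. Production writes Parquet and runs Sudachi
# tokenization (both elided); field names are assumptions.
import hashlib
import json
from pathlib import Path

def to_sft_pair(doc_text: str, question: str, answer: str) -> dict:
    """One prompt/response pair derived from a source document."""
    return {
        "prompt": f"{question}\n\n参考文書:\n{doc_text}",
        "response": answer,
        # Content hash so a changed source doc invalidates derived pairs.
        "source_hash": hashlib.sha256(doc_text.encode()).hexdigest()[:12],
    }

def write_shard(pairs: list, path: Path) -> int:
    """Write one JSON Lines shard; returns the number of records."""
    with path.open("w", encoding="utf-8") as f:
        for p in pairs:
            f.write(json.dumps(p, ensure_ascii=False) + "\n")
    return len(pairs)
```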
Development Process

How we built it

Step 1

Data design

Designed prompt/response pairs for SFT and DPO.

Deliverables

  • data schema
  • 500 sample pairs
Step 2

Training pipeline

Defined training jobs as YAML with Unsloth + TRL.

Deliverables

  • training-job YAML
  • reproduction scripts
Step 3

Evaluation hookup

Automated integration with the Kotoba evaluation harness.

Deliverables

  • CI job
  • diff reports
Step 4

Serving validation

Load-tested concurrent adapter serving on vLLM.

Deliverables

  • load-test report
  • operations runbook
Roadmap

Delivery Timeline

  • Phase 1 · In Progress · 2026-05

    Data preprocessing

    Build the ETL that converts internal documents into SFT/DPO format.

  • Phase 2 · Planned · 2026-06

    Unsloth training platform

    Make QLoRA training jobs reproducible on Kubernetes.

  • Phase 3 · Planned · 2026-07

    vLLM hot-swap

    Validate dynamic LoRA loading in a production-like environment.

  • Phase 4 · Planned · 2026-09

    Evaluation-loop integration

    Connect the Kotoba evaluation harness and run it as regression tests.

Team

Who built it

2 engineers

Roles

  • ML engineer (lead)
  • Data engineer
Tech Stack

Tools & Platforms

Backend

Python 3.12 · vLLM

Infrastructure

Docker

Other

PyTorch 2.4 · Unsloth · TRL · QLoRA · PEFT · bitsandbytes · WandB · DeepSpeed · Hugging Face Hub · MLflow