Agent-Assisted Fine-Tuning
How coding agents automate the LLM fine-tuning workflow, from GPU selection to model deployment, using natural language instructions.
Fine-tuning LLMs has traditionally demanded deep MLOps expertise: selecting hardware, configuring training scripts, managing datasets, monitoring jobs, and deploying models. Coding agents have reduced much of this friction. Tools like Claude Code can now handle a significant portion of the workflow, making custom model training more accessible to developers without deep ML infrastructure experience.
Why Agent-Assisted Fine-Tuning?
The friction in fine-tuning is rarely the math — it is the dozens of operational decisions that precede actual training. Which GPU should you pick for a 7B model with LoRA? What batch size fits in 24 GB of VRAM? Does your dataset have the right column format for DPO? Coding agents handle these decisions automatically: they validate dataset format before incurring GPU costs, select hardware appropriate to model size and budget, generate training configuration, submit jobs to compute platforms, and monitor progress through a conversational interface.
Teams report spending $20–30 total for multiple training runs, including failed experiments, which is less than one hour of ML consulting. The agent handles hardware selection, job orchestration, and monitoring end to end.
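The format check an agent runs before spending GPU money can be reproduced in a few lines. Below is a minimal sketch using the datasets library; the dataset name is a placeholder, and the check assumes an SFT-style conversation format:
from datasets import load_dataset

# Placeholder repository name; substitute your own dataset or local file
dataset = load_dataset("your-username/customer-support", split="train")

# SFT expects a "messages" column holding role/content conversation turns
assert "messages" in dataset.column_names, "missing 'messages' column"
first_turn = dataset[0]["messages"][0]
assert {"role", "content"} <= set(first_turn), "each turn needs 'role' and 'content' keys"
print(f"OK: {len(dataset)} examples ready for SFT")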
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ CODING AGENT │
│ (Claude Code / Codex / Gemini CLI) │
│ │
│ User: "Fine-tune Qwen-7B on my customer support data" │
│ │
│ Agent Actions: │
│ 1. Validate dataset format │
│ 2. Select hardware (a10g-large for 7B + LoRA) │
│ 3. Generate training configuration │
│ 4. Submit job to compute platform │
│ 5. Monitor progress and report status │
│ 6. Convert to GGUF for local deployment │
└─────────────────────────────────────────────────────────────────┘
│
│ Skills / Plugins
│
┌────────────────────┼────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Hugging Face │ │ Unsloth │ │ Local LLM │
│ Jobs │ │ │ │ (llama.cpp) │
│ │ │ │ │ │
│ - Managed GPU │ │ - 2x faster │ │ - Private data │
│ - Auto scaling │ │ - 30% less VRAM │ │ - No API costs │
│ - Trackio logs │ │ - GGUF export │ │ - Offline use │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└────────────────────┼────────────────────┘
│
▼
┌─────────────────┐
│ Fine-Tuned │
│ Model │
│ │
│ • HF Hub │
│ • GGUF local │
│ • API endpoint │
└─────────────────┘
Training Methods
Agents support multiple training methods and automatically select the best approach based on your dataset and goals. Supervised Fine-Tuning (SFT) is best for high-quality input-output demonstration pairs — customer support conversations, code generation examples, and domain-specific question answering. The dataset needs a messages column with conversation format. Direct Preference Optimization (DPO) suits alignment tasks where you have preference-annotated data with chosen and rejected columns; it requires no separate reward model and is typically applied after SFT. Group Relative Policy Optimization (GRPO) excels at verifiable tasks with programmatic success criteria such as math reasoning and structured code generation, where the model learns from comparing multiple sampled responses against a reward function.
For best results, combine methods in sequence: SFT to teach behaviors, DPO for alignment, and GRPO for reasoning capability.
| Method | Best For | Dataset Requirements |
|---|---|---|
| SFT | Teaching specific behaviors, domain adaptation | messages column with conversations |
| DPO | Alignment, preference learning, safety | chosen and rejected columns |
| GRPO | Math, code, verifiable reasoning tasks | Tasks with programmatic success criteria |
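To make the table concrete, here are minimal example records for each format. The field names follow common TRL conventions; the contents are invented for illustration:
# One SFT record: a conversation under a "messages" column
sft_record = {
    "messages": [
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security and choose 'Reset password'."},
    ]
}

# One DPO record: a prompt with a preferred ("chosen") and a dispreferred ("rejected") answer
dpo_record = {
    "prompt": "Summarize our refund policy in one sentence.",
    "chosen": "Unused items can be returned for a full refund within 14 days of purchase.",
    "rejected": "We have a refund policy.",
}

# GRPO needs no labels, only a reward function with a programmatic success criterion
def math_reward(completion: str, expected_answer: str) -> float:
    tokens = completion.strip().split()
    return 1.0 if tokens and tokens[-1] == expected_answer else 0.0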
Fine-Tuning Frameworks
Several frameworks can be driven by coding agents. Unsloth delivers 2x faster training with 30% less VRAM through kernel-level optimizations and supports native GGUF export for local deployment:
from unsloth import FastLanguageModel
import torch
# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
max_seq_length=4096,
dtype=None, # Auto-detect
load_in_4bit=True,
)
# Add LoRA adapters (2x faster than standard)
model = FastLanguageModel.get_peft_model(
model,
r=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=64,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth", # 30% less VRAM
random_state=42,
)
# Load the training dataset (placeholder name; replace with your own data,
# which needs a "messages" column for SFT)
from datasets import load_dataset
dataset = load_dataset("your-username/customer-support", split="train")
# Train with TRL's SFTTrainer
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
warmup_steps=100,
num_train_epochs=3,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
output_dir="outputs",
),
)
trainer.train()
# Save and convert to GGUF
model.save_pretrained_gguf(
"outputs-gguf",
tokenizer,
quantization_method="q4_k_m"
)
Axolotl handles multi-GPU production workloads via YAML configuration with DeepSpeed and FSDP support. LLaMA-Factory provides a zero-code web UI for beginners and supports over 100 model architectures. HuggingFace TRL is the official library for RLHF pipelines and research-grade flexibility.
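As an example of driving TRL directly, a DPO pass over a chosen/rejected dataset looks roughly like this. This is a sketch assuming a recent TRL release (argument names such as processing_class have changed across versions) and a placeholder preference dataset:
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # small model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder preference dataset with "prompt", "chosen", and "rejected" columns
preference_data = load_dataset("your-username/support-preferences", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-outputs", beta=0.1, per_device_train_batch_size=2),
    train_dataset=preference_data,
    processing_class=tokenizer,
)
trainer.train()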
LoRA and QLoRA
Parameter-efficient fine-tuning through LoRA trains small adapter layers, typically with rank 32 and alpha 64, rather than full model weights. This updates roughly 1% of the original parameters while preserving base model quality and allowing multiple adapters per base model. QLoRA combines LoRA with 4-bit NormalFloat quantization, cutting memory requirements far enough that the original QLoRA work fine-tuned a 65B model on a single 48 GB GPU.
Use LoRA for most production fine-tuning needs: it is efficient and maintains quality. Use QLoRA when VRAM is limited or when training very large models. Reserve full fine-tuning for cases where maximum accuracy is critical and resources are abundant.
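Outside Unsloth, the same adapter setup can be expressed with the standard PEFT and bitsandbytes APIs. Here is a sketch of a QLoRA-style configuration using the rank-32/alpha-64 settings discussed above; the base model name is only an example:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NormalFloat (NF4) quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters: only these low-rank matrices receive gradient updates
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically around 1% of total parameters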
Hardware and Cost Guide
| Model Size | Recommended GPU | Training Time | Estimated Cost |
|---|---|---|---|
| <1B | t4-small | Minutes | $1–2 |
| 1–3B | t4-medium / a10g-small | Hours | $5–15 |
| 3–7B | a10g-large (LoRA) | Hours | $15–40 |
| 7–13B | a100-large (LoRA) | Hours | $40–100 |
| 70B+ | Multi-GPU / QLoRA | Many hours | $100+ |
Start with small test runs (100 examples) to validate your workflow before committing to full training. The agent automatically suggests appropriate hardware to balance cost and performance.
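Carving out that test run is a one-liner with the datasets library; the dataset name is a placeholder:
from datasets import load_dataset

# Load only the first 100 examples for a cheap smoke test before the full run
test_subset = load_dataset("your-username/customer-support", split="train[:100]")
print(len(test_subset))  # 100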
Skill Transfer as an Alternative
An alternative to weight-based fine-tuning is transferring expertise from expensive frontier models to cheaper ones via structured context — what HuggingFace’s upskill tool calls the “Robin Hood” approach. A capable teacher model (like Claude Opus) solves a problem and exports the execution trace as a skill: a structured directory with a SKILL.md file (~500 tokens) encoding domain expertise and a skill_meta.json defining evaluation test cases. Smaller student models then load the skill as context.
Skills don’t improve all models equally. Always measure per model using upskill eval before deployment — some models may regress. Skills work best for specialized domains like CUDA kernel patterns, API usage conventions, and project-specific guidelines, where concise structured context can meaningfully encode expertise.
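For orientation, a skill is just a small directory. The layout below is a schematic of the two files described above; any further contents or schema are not specified here:
my-cuda-kernels-skill/
├── SKILL.md          # ~500 tokens of distilled domain expertise, loaded as context
└── skill_meta.json   # evaluation test cases consumed by upskill eval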
Best Practices
Always validate dataset format on CPU before incurring GPU costs, then run a test job with 100 examples before full training. Save checkpoints every 500 steps for long runs to enable recovery from failures and evaluation at intermediate stages. Watch training loss via Trackio or Weights & Biases — flat loss means the model is not learning; spiking loss indicates issues with learning rate or data quality. Push datasets to Hugging Face Hub with version tags for reproducibility. Run the fine-tuned model through your evaluation suite before production deployment.
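The checkpointing and monitoring cadence above maps onto a few TrainingArguments fields. A sketch follows; the report_to value depends on which tracker you use:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    save_steps=500,        # checkpoint every 500 steps for recovery and mid-run evaluation
    save_total_limit=3,    # keep only the most recent checkpoints to limit disk use
    logging_steps=10,      # log loss often enough to catch flat or spiking curves early
    report_to="wandb",     # or another supported experiment tracker
)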