Shanvit S Shetty

Web Dev. Currently obsessed with AI workflow, agents & automations.

How I Trained an Email Triage SLM Using Tinker APIs

Feb 21, 2026

A learning project exploring LoRA fine-tuning and Tinker's distributed training API. Email triage served as a concrete use case to understand the mechanics of fine-tuning small language models—how LoRA works, what happens during training, and what infrastructure Tinker abstracts away.


Why This Project

After a year of building agents and prompting LLMs, I got curious about the math behind it—how do these models actually learn? That led me to machine learning, and I worked through Google's Machine Learning Crash Course. I'd always wanted to train a model but lacked the compute. When Tinker came along, I gave it a try and picked an SLM (see "Small Language Models are the Future of Agentic AI"). Email triage provided a clear, structured task to explore these concepts.

The goal wasn't to build a production system, but to understand:

  • How LoRA adapters modify model behavior
  • What happens during forward and backward passes
  • How tokenization and masking affect learning
  • What Tinker handles vs. what we implement

Email triage output format:

Step 1: Classification  {category}
Step 2: Priority  {High|Medium|Low}
Step 3: Suggested actions  {actions}
Step 4: Drafted response  {draft}

This format made it easy to see what the model learned—format adherence, category recognition, priority mapping. A good test case for understanding fine-tuning mechanics.


How LoRA (Low-Rank Adaptation) Works

Base Model: Qwen/Qwen3-4B-Instruct-2507 (4B parameters, instruction-tuned)

Why LoRA? Full fine-tuning updates all 4B parameters, requiring significant compute and risking catastrophic forgetting—the model loses previously learned knowledge. LoRA (Low-Rank Adaptation) adds trainable adapter matrices to attention layers while keeping base weights frozen, preserving general knowledge while specializing for the task. In simpler terms: instead of retraining the whole model, you add a small "plugin" that learns your task—the original model stays as-is, and the plugin handles the new behavior.

LoRA Mathematical Foundation:

The core insight: weight updates during fine-tuning have low intrinsic rank—meaning the changes we need can be captured with far fewer numbers than the full matrix. You don't need to tweak millions of parameters; a smaller set of adjustments does the job. For a weight matrix W ∈ R^(d×d), instead of learning ΔW directly, LoRA decomposes it:

W' = W + ΔW
ΔW = BA where B ∈ R^(d×r), A ∈ R^(r×d)

This factorization exploits the low-rank property of weight updates. Think of it like tuning a radio: you don't need to rebuild the whole circuit—a few knobs (volume, bass, treble) can get you the sound you want. Same here: the model doesn't need millions of independent tweaks; a smaller set of adjustments (the B and A matrices) captures what's needed. The rank r << d is that "smaller set"—it captures the essential patterns for adaptation.

Why Low-Rank Works:

During fine-tuning, weight updates don't require full rank. The model needs to adjust attention patterns, not completely restructure representations. Low-rank matrices B and A can express these pattern changes efficiently. With rank r=32 and d=4096, we reduce trainable parameters from ~16M to ~260K per matrix—about 1.6% of the original size.
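
The arithmetic here is easy to check directly (values from this project: d = 4096, r = 32):

```python
# Parameter count for one d x d projection: full fine-tuning vs. LoRA.
d, r = 4096, 32

full_params = d * d              # learning ΔW directly
lora_params = d * r + r * d      # B (d x r) plus A (r x d)

print(full_params)               # 16777216  (~16M)
print(lora_params)               # 262144    (~260K)
print(lora_params / full_params) # 0.015625  (~1.6%)
```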

Rank Selection:

Rank controls the expressiveness-capacity trade-off:

  • Lower ranks (8-16): May underfit—limited capacity for task-specific patterns
  • Rank 32: Common default—enough capacity for most tasks without overfitting
  • Higher ranks (64-128): Diminishing returns—more parameters don't improve performance proportionally, increase overfitting risk

The optimal rank depends on task complexity. For structured output tasks like this, rank 32 is a reasonable starting point—enough expressiveness without excessive capacity.

LoRA Application:

LoRA adapters are applied to attention layers (query, key, value projections) and optionally feed-forward layers. During forward pass:

h' = (W + BA)h = Wh + BAh

The base model computes Wh (frozen), while LoRA computes BAh (trainable). This additive modification allows the model to specialize attention patterns without disrupting base representations.
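
A minimal numpy sketch of this forward path—tiny dimensions for illustration, with the standard LoRA initialization (A random, B zero) so the adapter starts as a no-op:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # toy sizes; the real setup uses d = 4096, r = 32

W = rng.normal(size=(d, d))   # frozen base weight
B = np.zeros((d, r))          # LoRA init: B starts at zero...
A = rng.normal(size=(r, d))   # ...A is random, so BA = 0 initially

h = rng.normal(size=(d,))

# Forward pass: frozen base path plus additive trainable LoRA path.
out = W @ h + B @ (A @ h)

# With B = 0, the adapter contributes nothing: output equals the base model's.
assert np.allclose(out, W @ h)
```

Training moves B and A away from zero, so the additive term BAh gradually specializes the layer while Wh stays fixed.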

Prompt Engineering vs LoRA: Tradeoffs

When deciding between prompt engineering and LoRA fine-tuning, several factors matter:

Prompt Engineering:

  • Pros:
    • No training required—immediate iteration
    • No additional model weights—use base model as-is
    • Easy to experiment with different instructions
    • No compute cost for training
  • Cons:
    • Format consistency unreliable—model may deviate from structure
    • Token overhead—long prompts consume context window
    • Limited pattern learning—can't teach complex mappings reliably
    • Per-request cost—prompt tokens add to inference cost
    • Generic responses—model doesn't learn domain-specific patterns

LoRA Fine-Tuning:

  • Pros:
    • Format consistency—model learns structure from training data
    • Pattern learning—can teach complex mappings (category→priority, etc.)
    • Shorter prompts—model internalizes instructions
    • Domain specialization—learns task-specific features
    • Better generalization—learns patterns, not just instructions
  • Cons:
    • Training cost—requires compute for fine-tuning
    • Training time—can't iterate instantly
    • Storage—need to store adapter weights (~10-50MB)
    • One-time setup—requires data preparation and training loop

When to Use Each:

Prompt engineering makes sense when:

  • Task is simple and can be described clearly in prompts
  • Format consistency isn't critical
  • You need immediate iteration without training
  • Task changes frequently (prompts easier to update than retraining)
  • Limited compute budget for training

LoRA fine-tuning makes sense when:

  • Format consistency is critical (structured outputs)
  • Complex mappings need to be learned (category→priority, etc.)
  • Task is stable and won't change frequently
  • You have training compute available
  • Domain-specific patterns matter (email triage, code generation, etc.)

Hybrid Approach:

You can combine both: use prompt engineering for flexibility, then fine-tune with LoRA for consistency. Fine-tune on a core set of patterns, then use prompts for variations or edge cases.

For this project, LoRA made sense because:

  • Structured output format needed consistency
  • Category-to-priority mappings were complex to prompt
  • Task was well-defined and stable
  • Training compute was available via Tinker

Data Preparation: Supervised Learning Setup

Dataset: jason23322/high-accuracy-email-classifier from HuggingFace

Transformation Process:

Raw emails with category labels are transformed into prompt-completion pairs for supervised fine-tuning. The prompt provides context and instruction, while the completion contains the desired structured output.

Prompt Construction:

  • System instruction: "You are an efficient admin assistant..."
  • Task specification: "Analyze this incoming email step-by-step..."
  • Email content: The actual email text

Completion Construction:

  • Structured format: Step 1-4 with consistent delimiters
  • Category mapping: Direct label → formatted category name
  • Priority logic: Category → priority level (verify_code=High, spam=Low)
  • Action generation: Context-aware suggestions based on category
  • Draft response: Category-appropriate template

Domain Knowledge Encoding:

The mapping logic encodes explicit domain knowledge:

  • verify_code → High priority (time-sensitive, urgent action)
  • spam/promotions → Low priority (non-essential, can be deleted)
  • updates → Medium priority (informational, acknowledge if needed)

This teaches the model the relationship between content type and urgency, rather than learning it from scratch.
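
As a hypothetical sketch (the helper name and draft text are illustrative, not the actual pipeline code), the mapping logic looks roughly like:

```python
# Category -> priority rules encoded as data, as described above.
PRIORITY = {
    "verify_code": "High",
    "spam": "Low",
    "promotions": "Low",
    "updates": "Medium",
    # remaining categories (forum, social_media) follow similar rules
}

def build_completion(category: str, actions: str, draft: str) -> str:
    """Render the Step 1-4 completion for one labeled email."""
    return (
        f"Step 1: Classification  {category.capitalize()}\n"
        f"Step 2: Priority  {PRIORITY[category]}\n"
        f"Step 3: Suggested actions  {actions}\n"
        f"Step 4: Drafted response  {draft}"
    )

print(build_completion("spam",
                       "Delete or mark as spam; no reply needed",
                       "No response required."))
```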

Dataset Statistics:

  • 400 examples (train[:400] split)
  • 6 categories: Spam, Promotions, Verify_code, Updates, Forum, Social_media
  • Balanced distribution prevents category bias

Why This Format Works:

The structured format with explicit steps teaches the model:

  1. Format consistency (always Step 1-4)
  2. Category recognition (content → category)
  3. Priority mapping (category → priority)
  4. Action generation (context → actions)
  5. Response drafting (category → draft)

This is more effective than free-form completions because it provides clear structure for the model to learn.


Tokenization: Next-Token Prediction Setup

Tokenization:

Language models are trained on next-token prediction: given tokens [t₁, t₂, ..., tₙ], predict tₙ₊₁. This requires careful tokenization to separate what the model sees (input) from what it predicts (target).

Prompt-Completion Separation:

The key insight: we want the model to learn to generate completions given prompts, not modify prompts. This requires:

  1. Prompt tokenization: Tokenize with special tokens (BOS) to mark sequence start
  2. Completion tokenization: Tokenize without special tokens to avoid double BOS/EOS
  3. Weight masking: Set prompt weights to 0, completion weights to 1

Weight Masking Rationale:

Loss is computed as:

L = Σᵢ wᵢ × CrossEntropy(predicted_i, target_i)

Where wᵢ = 0 for prompt tokens, wᵢ = 1 for completion tokens. This ensures:

  • Model doesn't learn to modify prompts (w=0 means no gradient)
  • Model learns to generate completions (w=1 means full gradient)
  • Training focuses on completion generation, not prompt understanding
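
A numpy sketch of this weighted loss over a toy vocabulary—illustrative only, not Tinker's internal implementation:

```python
import numpy as np

def weighted_ce(logits, targets, weights):
    """Cross-entropy per position, masked by per-token weights."""
    # numerically stable log-softmax over the vocabulary axis
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_losses = -log_probs[np.arange(len(targets)), targets]
    return (weights * token_losses).sum()

# 4 positions, vocab of 5: first two are prompt tokens (w=0), last two completion (w=1)
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
targets = np.array([1, 3, 0, 2])
weights = np.array([0.0, 0.0, 1.0, 1.0])

loss = weighted_ce(logits, targets, weights)
# Changing a prompt-position target does not change the loss: its weight is zero.
assert np.isclose(loss, weighted_ce(logits, np.array([4, 0, 0, 2]), weights))
```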

EOS Token Handling:

End-of-sequence (EOS) tokens teach stopping behavior. Without EOS:

  • Model continues generating indefinitely
  • No clear stopping signal
  • Repetition and hallucination increase

With EOS appended to completion:

  • Model learns to predict EOS after completion
  • Clear stopping signal during inference
  • Prevents over-generation

Token Shifting:

For next-token prediction, we shift tokens:

  • Input: [t₁, t₂, ..., tₙ₋₁]
  • Target: [t₂, t₃, ..., tₙ]

This aligns with the autoregressive nature: predict token i+1 given tokens 1..i.

Implementation Flow:

  1. Tokenize prompt with add_special_tokens=True → includes BOS
  2. Tokenize completion with add_special_tokens=False → no BOS
  3. Append EOS to completion tokens
  4. Combine: [prompt_tokens, completion_tokens]
  5. Create weights: [0, ..., 0, 1, ..., 1] (prompt=0, completion=1)
  6. Shift for next-token prediction: input = tokens[:-1], target = tokens[1:]
  7. Shift weights accordingly: weights = weights[1:]

This format ensures the model learns the prompt→completion mapping correctly.
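
The seven steps above, sketched with made-up token IDs standing in for a real tokenizer's output:

```python
# Toy token IDs standing in for tokenizer output (BOS=1, EOS=2).
BOS, EOS = 1, 2
prompt_tokens = [BOS, 101, 102, 103]       # tokenized with special tokens
completion_tokens = [201, 202]             # tokenized without special tokens
completion_tokens = completion_tokens + [EOS]  # append EOS for stopping behavior

tokens = prompt_tokens + completion_tokens
weights = [0] * len(prompt_tokens) + [1] * len(completion_tokens)

# Shift for next-token prediction: predict token i+1 from tokens 1..i.
input_tokens = tokens[:-1]
target_tokens = tokens[1:]
target_weights = weights[1:]

print(input_tokens)    # [1, 101, 102, 103, 201, 202]
print(target_tokens)   # [101, 102, 103, 201, 202, 2]
print(target_weights)  # [0, 0, 0, 1, 1, 1]
```

Note that after shifting, the position that predicts the first completion token already carries weight 1, and the final position is trained to emit EOS.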


Training Loop: Implementation Outline

Training Flow:

  1. Epoch Loop: Iterate through dataset multiple times (3 epochs)
  2. Batch Processing: Process examples in batches of 4
  3. Data Preparation: Convert prompt-completion pairs to Datum objects with tokenization
  4. Forward-Backward Pass: Compute loss and gradients
  5. Optimizer Step: Update LoRA adapter weights using Adam
  6. Checkpointing: Save adapter weights periodically

Batch Processing:

For each batch:

  • Load examples from dataset
  • Tokenize each prompt-completion pair
  • Create Datum objects with input tokens, target tokens, and weights
  • Send batch to Tinker for forward-backward computation

Forward-Backward Pass:

Tinker handles:

  • Forward pass through transformer layers
  • Loss computation (cross-entropy with weights)
  • Backpropagation through LoRA adapters only
  • Gradient computation for B and A matrices

Optimizer Step:

Adam optimizer updates LoRA weights:

  • Maintains running averages of gradients (momentum)
  • Adapts learning rate per parameter (adaptive)
  • Updates: θ_new = θ_old - lr × m_t / (√v_t + ε)

Hyperparameters:

  • Batch size: 4 (small batches reduce memory, add gradient noise for regularization)
  • Learning rate: 1e-4 (conservative for fine-tuning, prevents catastrophic forgetting)
  • Epochs: 3 (sufficient convergence without overfitting on 400 examples)
  • Optimizer: Adam with weight_decay=0.0 (no L2 regularization, Tinker default)

Loss Function: Cross-entropy with weighted masking. Only completion tokens contribute to loss, enforcing the prompt→completion mapping while preserving prompt understanding.


Forward and Backward Pass: Deep Dive

Forward Pass Mechanics:

  1. Tokenization: Text → token IDs via HuggingFace tokenizer (subword tokenization)
  2. Embedding Lookup: Token IDs → dense vectors (embedding matrix lookup, typically 4096-dim for Qwen)
  3. Positional Encoding: Add positional information to embeddings (sinusoidal, learned, or rotary—Qwen uses RoPE)
  4. Transformer Layers (repeated N times):
    • Multi-head attention:
      • Query, Key, Value projections (with LoRA: W_qkv + B_qkv A_qkv)
      • Attention scores: QK^T / √d_k
      • Weighted sum: Attention(Q, K, V) = softmax(QK^T / √d_k) V
      • LoRA adapters modify attention patterns to focus on task-specific features
    • Feed-forward networks:
      • Two linear transformations with activation (with LoRA adapters)
      • Typically: FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂
    • Layer normalization: Normalize activations (stabilizes training)
    • Residual connections: x + sublayer(x) (enables gradient flow)
  5. Output projection: Final hidden states → vocabulary logits (|V| dimensions)
  6. Loss computation: Cross-entropy between predicted logits and target tokens, masked by weights

Backward Pass Mechanics:

  1. Loss computation: L = Σᵢ wᵢ × CrossEntropy(logits_i, target_i)

    • Only completion tokens contribute (wᵢ = 1)
    • Prompt tokens ignored (wᵢ = 0)
  2. Gradient computation: Backpropagation computes ∂L/∂θ for LoRA parameters

    • Chain rule: ∂L/∂θ = ∂L/∂output × ∂output/∂hidden × ... × ∂hidden/∂θ
    • Only LoRA adapters (B, A matrices) receive gradients
    • Base model weights frozen (no gradients)
  3. Gradient accumulation: Gradients averaged over batch

    • ∇θ = (1/batch_size) × Σᵢ ∇θᵢ
    • Reduces variance, stabilizes training
  4. Optimizer step: Adam updates LoRA adapter weights

    • Momentum: m_t = β₁m_{t-1} + (1-β₁)∇θ
    • Variance: v_t = β₂v_{t-1} + (1-β₂)(∇θ)²
    • Update: θ_new = θ_old - lr × m_t / (√v_t + ε)
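
The update equations can be sketched in a few lines of numpy (this version adds the standard bias-correction terms that practical Adam implementations use in early steps):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update on a parameter vector (bias-corrected form)."""
    m = b1 * m + (1 - b1) * grad        # momentum: running mean of gradients
    v = b2 * v + (1 - b2) * grad**2     # variance: running mean of squared gradients
    m_hat = m / (1 - b1**t)             # bias correction for step t
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
grad = np.array([1.0, -1.0, 0.0])

theta, m, v = adam_step(theta, grad, m, v, t=1)
# The first step moves each parameter by ~lr against its gradient's sign.
print(theta)  # ≈ [-1e-4, 1e-4, 0.0]
```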

Why LoRA Works Theoretically:

LoRA exploits the low-rank property of weight updates. During fine-tuning, the model needs to adjust attention patterns, not completely restructure representations. The low-rank matrices B and A can express these pattern changes efficiently:

  • Attention modification: LoRA adapters change how the model attends to different parts of the input
  • Feature specialization: Instead of generic attention, the model learns to focus on task-specific features relevant to the training data
  • Parameter efficiency: Low-rank factorization reduces trainable parameters while maintaining expressiveness

The key insight: fine-tuning updates have low intrinsic rank, so we can capture them with far fewer parameters than full fine-tuning.


What the Model Learns

After training, the LoRA adapters encode:

  1. Format adherence: Always output Step 1-4 structure
  2. Category recognition: Email content → category mapping
  3. Priority logic: Category → priority level (verify_code=High, spam=Low)
  4. Action generation: Context → appropriate actions
  5. Response drafting: Category → suitable draft response
  6. Stopping behavior: EOS token placement prevents repetition

The base model's general language understanding remains intact. LoRA adapters specialize attention mechanisms for the specific task patterns learned during training.


What Tinker Offloads: Infrastructure Abstraction

What We Implement:

  1. Data Preparation: Format data into prompt-completion pairs
  2. Tokenization: Convert text to tokens with proper masking and EOS handling
  3. Training Loop: Iterate through batches, call forward-backward, optimizer step
  4. Checkpoint Management: Save adapter weights to Tinker dashboard

What Tinker Handles:

  1. GPU Infrastructure:

    • GPU allocation and management
    • CUDA setup and configuration
    • Memory management across distributed nodes
    • No need to provision or configure GPUs
  2. Distributed Training:

    • Multi-GPU training orchestration
    • Gradient synchronization across nodes
    • Batch distribution and aggregation
    • Fault tolerance and recovery
  3. Model Loading and Management:

    • Base model loading from HuggingFace
    • Model sharding across GPUs
    • LoRA adapter initialization
    • Weight checkpointing and restoration
  4. Training Operations:

    • Forward pass through transformer layers
    • Backpropagation computation
    • Gradient computation for LoRA adapters only
    • Optimizer state management (Adam momentum/variance)
  5. Infrastructure Details:

    • Batch scheduling and queuing
    • Resource allocation
    • Network communication between nodes
    • Checkpoint storage and retrieval

API Abstraction:

Tinker provides a simple API:

  • create_lora_training_client(): Initialize training with base model and rank
  • get_tokenizer(): Get tokenizer matching Tinker's configuration
  • forward_backward(): Compute loss and gradients (async future)
  • optim_step(): Update weights with optimizer (async future)
  • save_weights_for_sampler(): Save checkpoint to dashboard
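
Composed together, these calls form roughly the following loop—pseudocode based on the API names above, with `batches` and `make_datum` as placeholder helpers, not a verified program:

```
client = create_lora_training_client(base_model="Qwen/Qwen3-4B-Instruct-2507", rank=32)
tokenizer = client.get_tokenizer()

for epoch in range(3):
    for batch in batches(dataset, size=4):
        data = [make_datum(ex, tokenizer) for ex in batch]  # tokens + weights
        client.forward_backward(data).result()              # loss + gradients (future)
        client.optim_step().result()                        # Adam update on adapters
    client.save_weights_for_sampler(name=f"epoch-{epoch}")  # checkpoint to dashboard
```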

What This Means:

Instead of managing:

  • PyTorch distributed training setup
  • CUDA device allocation
  • Multi-GPU communication
  • Gradient synchronization
  • Checkpoint storage infrastructure

We focus on:

  • Data preparation and formatting
  • Training loop logic
  • Hyperparameter selection
  • Model evaluation

Trade-offs:

  • Vendor lock-in: Weights stored in Tinker's format, tied to their infrastructure
  • Less control: Can't customize batch scheduling, gradient accumulation strategies
  • Cost: Scales with training time (pay for compute usage)
  • Abstraction: Less visibility into low-level training details

Benefits:

  • Faster iteration: No infrastructure setup, start training immediately
  • Scalability: Distributed training handled automatically
  • Reliability: Fault tolerance and recovery built-in
  • Simplicity: Focus on model and data, not infrastructure

What Changed After Training

Base Model Behavior:

{"category": "spam", "priority": "medium", "analysis": "This appears to be promotional content..."}

Inconsistent format, incorrect priority mappings, generic responses. The model doesn't follow the structured format reliably.

After Fine-Tuning:

Step 1: Classification  Spam
Step 2: Priority  Low
Step 3: Suggested actions  Delete or mark as spam; no reply needed
Step 4: Drafted response  No response required. Marking as spam/promotion.

Consistent format, correct mappings, task-specific responses. The model learned the structure and mappings.

Observations: Format adherence improved significantly. The model learned to follow the Step 1-4 structure and apply category-to-priority mappings correctly. This was useful for understanding what fine-tuning actually changes in model behavior.


Hyperparameter Choices

Base Model: Qwen3-4B-Instruct

  • Instruction-tuned models understand structured formats better than base models
  • 4B parameters: reasonable size for experimentation without excessive cost
  • Good starting point for understanding fine-tuning mechanics

Rank 32:

  • Common default for LoRA fine-tuning
  • Lower ranks (8-16) may underfit; higher ranks (64+) show diminishing returns
  • ~260K trainable params per attention layer—manageable for learning

3 Epochs:

  • 1 epoch: insufficient for pattern learning
  • 3 epochs: reasonable convergence on 400 examples
  • More epochs: risk of overfitting (memorization vs. generalization)

Batch Size 4:

  • Small batches: more gradient updates per epoch
  • Adds gradient noise (regularization effect)
  • Tinker handles batching, so this is more about update frequency

Learning Rate 1e-4:

  • Standard range for LoRA fine-tuning (1e-5 to 1e-4)
  • Conservative: prevents catastrophic forgetting
  • Higher rates can destabilize training; lower rates converge slowly

These were reasonable defaults for exploration. Systematic hyperparameter tuning would require more experimentation.


What I Learned

Key Takeaways:

  1. LoRA's efficiency: The low-rank factorization works well—small adapter matrices capture task-specific patterns without retraining the entire model. The math checks out in practice.

  2. Tokenization matters: Prompt masking and EOS token placement significantly affect what the model learns. Small details in tokenization have large impacts on behavior.

  3. Training mechanics: Understanding forward/backward passes helps debug issues. Seeing how gradients flow through LoRA adapters clarifies why they work.

  4. Infrastructure abstraction: Tinker handles a lot—distributed training, GPU management, checkpointing. Understanding what's abstracted away helps evaluate when to use APIs vs. building custom infrastructure.


Implementation Outline

What We Build:

  1. Data Pipeline:

    • Load dataset from HuggingFace
    • Format into prompt-completion pairs
    • Apply task-specific mappings (category-to-priority in this case)
    • Save as JSONL for training
  2. Tokenization Layer:

    • Convert prompt-completion pairs to Datum objects
    • Handle prompt masking (weights = 0)
    • Handle completion weighting (weights = 1)
    • Add EOS tokens for stopping behavior
    • Shift tokens for next-token prediction
  3. Training Orchestration:

    • Initialize Tinker training client with base model and rank
    • Get tokenizer from Tinker (matches their configuration)
    • Iterate through dataset in batches
    • Call forward-backward for each batch
    • Call optimizer step to update weights
    • Save checkpoints periodically
  4. Checkpoint Management:

    • Save adapter weights to Tinker dashboard
    • Download weights as .tar.gz archives
    • Extract for local use or merge with base model

Key Implementation Details:

  • Tokenizer synchronization: Use training_client.get_tokenizer() to ensure tokenizer matches Tinker's configuration (critical for compatibility)
  • Datum format: Use raw lists for loss_fn_inputs, not TensorData (Tinker handles tensor conversion internally)
  • Synchronous execution: Use .result() on futures for simplicity (could be async for better throughput)
  • Checkpoint format: Save via save_weights_for_sampler() for dashboard visibility and easy retrieval

Using Trained Adapters:

  • Tinker SamplingClient: Use Tinker's API for inference (no local setup)
  • Local PEFT: Download adapters, use with Transformers + PEFT library
  • Merged model: Merge adapters into base model for faster inference (one-time merge cost)

Code: GitHub · Model: Hugging Face


Further Reading