Choosing the Right Small Model to Fine-Tune

Carlos Mendez
6 min read

Most guides on fine-tuning start with the training process. But the decision that matters most happens before you write a single line of training code: choosing which model to fine-tune. I got this wrong for years by defaulting to "pick the biggest model that fits in VRAM." The approach I've landed on is more nuanced, and the results back it up.

When I needed a production extraction model for my knowledge graph system, I had benchmark data on seven models. The top performer was Gemma 3 at 4 billion parameters: 100% valid JSON, 0.81 entity F1, solid across the board. The model I actually chose to fine-tune was Gemma 4 E2B at 2.3 billion parameters. It scored lower on almost every metric. And that turned out to be exactly the right call.

Model Selection Decision Matrix

The Obvious Choice vs. the Right Choice

Here's the benchmark data that drove the decision:

Model        Params   Valid JSON   Entity F1   Entity Recall   Speed
gemma3:4b    4B       100%         0.81         0.85            44.9 tok/s
gemma4:e2b   2.3B     75%          0.76         1.00            42.1 tok/s

If you're picking a model for deployment without fine-tuning, Gemma 3 wins. It works out of the box. But I wasn't picking a model for deployment. I was picking a model to train, which changes the calculus entirely.

Look at that entity recall column. Gemma 4 E2B hits 1.00. It finds every entity in every test case. Its weakness is format compliance: 25% of the time, the output isn't valid JSON. It knows what to extract; it just hasn't learned how to format it consistently.

That's exactly the kind of weakness LoRA fine-tuning is built to fix. Format compliance is a surface behavior, a pattern the model can learn from a few hundred examples. Entity recall, on the other hand, reflects deeper semantic understanding. You can't easily train a model to "see" entities it's currently missing.
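To make "format compliance" concrete, here's a minimal sketch of how that metric can be measured. The helper and the sample outputs are illustrative, not the actual benchmark code:

```python
import json

def valid_json_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that parse as JSON."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

# Hypothetical outputs: same entities found, inconsistent formatting.
outputs = [
    '{"entities": [{"name": "Ada Lovelace", "type": "person"}]}',  # valid JSON
    "entities: Ada Lovelace (person)",  # right content, wrong format
]
print(valid_json_rate(outputs))  # 0.5
```

A model can score 0.75 here while finding every entity, which is the E2B situation in a nutshell: the content is right, only the serialization is unreliable.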

The Throughput Calculation

Beyond accuracy, there was a hardware constraint driving the decision. The extraction model needed to run on a Radeon RX 6600 with 8GB of VRAM. That GPU costs about $180 and draws 100 watts. It sits in an LXC container alongside an entity classification model that's already using part of the available memory.

The 4B model at F16 precision needs roughly 8GB of VRAM. That's the entire capacity of the card, leaving nothing for KV cache or concurrent model loading. At Q8_0 quantization, it's around 4.5GB, which works but cuts into the headroom for the classification model.

The 2.3B model at Q8_0 sits at 4.9GB. That includes overhead from Gemma 4's multimodal architecture (the vision encoder weights are carried along even though we only use the language model). It fits comfortably with room for the classification model to stay warm in memory.
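The arithmetic behind these figures is simple enough to sketch. The bytes-per-parameter constants and the overhead split for the vision encoder below are rough assumptions for illustration, not measured values:

```python
def weight_size_gb(params_billions: float, bytes_per_param: float,
                   overhead_gb: float = 0.0) -> float:
    """Rough VRAM for model weights; ignores KV cache and runtime buffers."""
    return params_billions * bytes_per_param + overhead_gb

# F16 is ~2 bytes/param; Q8_0 is ~1.06 bytes/param in common GGUF builds.
print(round(weight_size_gb(4.0, 2.0), 1))        # ~8.0 GB: fills the RX 6600
print(round(weight_size_gb(4.0, 1.06), 1))       # ~4.2 GB weights alone
print(round(weight_size_gb(2.3, 1.06, 2.5), 1))  # ~4.9 GB with assumed multimodal overhead
```

The useful habit here is estimating weight size before downloading anything: parameters times bytes per parameter gets you within a few hundred megabytes, and the remainder is architecture-specific overhead.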

But the throughput argument was what sealed it. The 4B model's advertised inference speed ranges from 45 to 90 tokens per second depending on the hardware review you read. In practice, on consumer GPUs with real-world batch sizes, models tend to settle near the bottom of their advertised range. I expected 45 tokens per second, maybe 50 on a good day.

The E2B model, being smaller, has less work to do per token. I expected it to comfortably clear 50 tokens per second. In production, it hit 55.6. For a pipeline that processes every piece of content entering the knowledge graph, that 10-20% throughput advantage compounds.
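To put a number on "compounds," here's a back-of-the-envelope capacity calculation. The 400-token average output length is an assumed figure, not from the production pipeline:

```python
def daily_capacity(tok_per_s: float, avg_tokens_per_doc: int = 400,
                   hours: float = 24.0) -> int:
    """Documents processed per day at a given generation speed (rough; ignores prompt processing)."""
    return int(tok_per_s * 3600 * hours / avg_tokens_per_doc)

print(daily_capacity(44.9))  # 4B baseline
print(daily_capacity(55.6))  # E2B in production
```

At these assumed output lengths, the gap works out to a couple of thousand extra documents per day on the same $180 card.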

Throughput and VRAM Analysis

The Fine-Tuning Hypothesis

My thesis going in: take the model with perfect entity recall and train it to format consistently. LoRA with rank 16 and alpha 32, targeting all attention and MLP projection layers. 513 training examples, 3 epochs.
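As a sketch, that configuration would look something like the following with the peft library. The exact module names vary by checkpoint, and the dropout value is an assumption; verify both against the model you're actually training:

```python
from peft import LoraConfig

# Rank 16, alpha 32, all attention and MLP projection layers,
# matching the run described above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.05,  # assumed value, not from the original run
    task_type="CAUSAL_LM",
)
```

Targeting the MLP projections as well as attention matters for format-compliance training: output structure is a distributed behavior, not something that lives in one layer.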

The hypothesis played out exactly as expected. The fine-tuned E2B model produces valid structured JSON on every test I've thrown at it. Not 75% (its baseline score). Not 95%. Every single response is valid, parseable JSON with correctly typed entities and contextual relationships.

The entity recall that was already at 1.00 held steady. The format compliance that was at 75% jumped to an effective 100%. The model learned the output schema without losing the semantic understanding that made it worth training in the first place.

What This Teaches About Model Selection

The pattern generalizes beyond my specific use case. When evaluating base models for fine-tuning, the selection criteria should be different from the criteria you'd use for zero-shot deployment:

Prioritize capabilities that are hard to train. Semantic understanding, entity recognition, reasoning quality: these are deep model behaviors that reflect the pre-training data and architecture. They're expensive to improve through fine-tuning. Pick a model that's strong where it's hard to improve.

Deprioritize capabilities that are easy to train. Output formatting, JSON compliance, prompt-following, style adherence: these are surface behaviors that respond well to LoRA with a few hundred examples. A model that's weak on formatting but strong on understanding is a better fine-tuning candidate than one that formats perfectly but misses entities.

Model the deployment constraints. VRAM, throughput, power consumption, concurrent model loading. These constraints eliminate options before accuracy even enters the conversation. There's no point fine-tuning a model that won't fit on your production hardware.

Benchmark the candidates, don't guess. I ran all seven models through the same 20-case evaluation before choosing. Without that data, I would have picked Gemma 3 based on parameter count and general reputation. The benchmark revealed that E2B's recall was perfect, which isn't something you'd learn from a model card.
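A per-case scorer for that kind of evaluation can be very small. The output schema and the helper below are illustrative, not the actual harness:

```python
import json

def score_case(output: str, expected_entities: set[str]) -> dict:
    """Per-case metrics: JSON validity plus entity recall against a gold set."""
    try:
        found = {e["name"] for e in json.loads(output).get("entities", [])}
        valid = True
    except (json.JSONDecodeError, TypeError, AttributeError):
        found, valid = set(), False
    recall = len(found & expected_entities) / len(expected_entities)
    return {"valid_json": valid, "recall": recall}

# One hypothetical case out of a 20-case suite.
case = score_case('{"entities": [{"name": "Gemma"}]}', {"Gemma", "LoRA"})
print(case)  # {'valid_json': True, 'recall': 0.5}
```

Scoring validity and recall separately is the whole point: averaged into a single number, E2B's perfect recall would have been invisible behind its formatting failures.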

Fine-Tuning Before and After

The Cost of Getting This Wrong

Choosing the wrong base model doesn't just waste training time. It locks you into hardware constraints and performance ceilings that are difficult to escape later.

If I had chosen the 4B model, I'd have a model that fills the RX 6600's VRAM, runs 10-20% slower, and starts from a baseline that's already strong on formatting but weaker on recall. The fine-tuning might improve accuracy by a few percentage points, but the throughput ceiling is lower and the VRAM pressure is higher. Every downstream decision in the pipeline would be constrained by that initial choice.

The 2.3B model gave me a faster model, more VRAM headroom, and a fine-tuning baseline that responded perfectly to targeted training. The total training time was under two hours. The deployment cost is effectively zero (hardware I already own, power I'm already paying for).

The irony of model selection is that the model with the "worse" benchmark scores was the better engineering decision. Benchmarks measure current capability. Fine-tuning investments should optimize for trainable potential.

Carlos Mendez

Solo developer and entrepreneur building personal AI infrastructure. With a background in systems administration and web development, he writes about the systems, tools, and ideas that shape how independent developers work with AI.
