Why a Fine-Tuned 1.5B Model Destroys Off-the-Shelf 4B Models at Structured Tasks
I spent the last week benchmarking seven small language models on a task that sounds simple: read a sentence, extract entities and relationships, return valid JSON. Four of them scored literally zero. Not low. Zero. They couldn't produce parseable output even once across 20 test cases.
The model that handles a similar structured task best in my production stack? A 1.5 billion parameter model I fine-tuned with LoRA on a single consumer GPU. It runs at 71% accuracy on entity classification, beating every off-the-shelf model I've tested, including models three times its size.
This isn't a fluke. It's a pattern that has significant implications for how we should think about deploying AI in production systems.
Figure 1: Seven models benchmarked on entity extraction. Thinking models (orange) scored 0% valid JSON across the board.
The Benchmark
I run a multi-layer knowledge graph system called MemBrane as part of my personal AI infrastructure. One of its core operations is entity extraction: given natural language text, identify entities and their relationships, then return structured JSON that downstream systems can consume.
The extraction prompt is straightforward. Given a text like "Docker is deployed on the production Kubernetes cluster managed by the DevOps team," the model should return:
```json
{
  "entities": [
    {"name": "Docker", "role": "subject"},
    {"name": "Kubernetes cluster", "role": "object"},
    {"name": "DevOps team", "role": "subject"}
  ],
  "relationships": [
    {"source": "Docker", "relation": "DEPLOYED_ON", "target": "Kubernetes cluster"},
    {"source": "DevOps team", "relation": "MANAGES", "target": "Kubernetes cluster"}
  ]
}
```
I built a 20-case evaluation set covering simple through complex scenarios and ran seven models through it on local hardware (Tesla M40 24GB GPUs). Here's what happened.
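The scoring is deliberately simple. A minimal sketch of the two core checks, JSON validity and entity F1 (simplified from the real harness, which also scores relationships; function names here are illustrative):

```python
import json


def valid_json(raw: str):
    """Return the parsed object if the output is parseable JSON, else None."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def entity_f1(predicted, expected):
    """Micro F1 over entity names (case-insensitive exact match)."""
    pred = {e["name"].lower() for e in predicted}
    gold = {e["name"].lower() for e in expected}
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)  # true positives: names in both sets
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

A model that never emits parseable JSON fails the first check on every case, which is why the F1 columns below bottom out at exactly 0.00 rather than some small number.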
The Results
| Model | Parameters | Valid JSON | Entity F1 | Relationship F1 | Speed |
|---|---|---|---|---|---|
| gemma3:4b | 4B | 100% | 0.81 | 0.29 | 44.9 tok/s |
| gemma4:e2b | 2.3B | 75% | 0.76 | 0.29 | 42.1 tok/s |
| gemma4:e4b | 4B | 50% | 0.78 | 0.30 | 9.7 tok/s |
| qwen3:4b | 4B | 0% | 0.00 | 0.00 | 20.9 tok/s |
| qwen3.5:4b | 4B | 0% | 0.00 | 0.00 | 42.2 tok/s |
| nemotron-mini:4b | 4B | 0% | 0.00 | 0.00 | 32.3 tok/s |
| nemotron-3-nano:4b | 4B | 0% | 0.00 | 0.00 | 13.5 tok/s |
Read that table again. Four out of seven models, all with 4 billion parameters, couldn't produce valid JSON even once. Not "low accuracy." Zero. The F1 scores are 0.00 because there was nothing to score.
The Thinking Model Problem
The four models that scored zero share a common trait: they're all "thinking" models. Qwen3, Qwen3.5, nemotron-mini, and nemotron-3-nano all implement some form of chain-of-thought reasoning as a default behavior. When asked to extract entities and return JSON, they instead produce paragraphs of internal reasoning.
Here's what nemotron-3-nano actually outputs when asked for structured extraction:
```
We need to identify the entities in this text. Let me think about what
constitutes an entity here. The text mentions Docker, which is a
containerization platform. It also mentions Kubernetes...
```
It never gets to the JSON. It burns its entire token budget reasoning about the task instead of doing the task. You can pass `num_predict: 1024` and the model will fill all 1024 tokens with chain-of-thought text, never producing a single byte of the requested output format.
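The benchmark isn't even strict about formatting: an output counts as valid if a JSON object can be salvaged from anywhere in it. A sketch of that salvage step (my own simplification, not a general-purpose JSON extractor; it ignores braces inside strings):

```python
import json


def extract_json(raw: str):
    """Try to recover a JSON object from model output.

    First attempt a direct parse; if the model wrapped its answer in
    reasoning text, scan for a balanced {...} span and parse that.
    Returns the parsed object, or None if no JSON is present at all.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    start = raw.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(raw)):
            if raw[i] == "{":
                depth += 1
            elif raw[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(raw[start:i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but unparseable; try the next "{"
        start = raw.find("{", start + 1)
    return None
```

Even with this lenient salvage pass, the four thinking models scored zero: their outputs contained no JSON to recover, just prose.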
This is a fundamental architectural mismatch. Thinking models are optimized for multi-step reasoning problems where showing your work matters. Structured extraction is the opposite: you want the model to recognize a pattern, apply it, and output formatted data. No deliberation needed.
Some of these models advertise a "no thinking" flag or mode. In practice, the results don't improve meaningfully. The reasoning behavior appears to be deeply embedded in the fine-tuning, not a simple toggle.
Figure 2: Thinking models burn their token budget on chain-of-thought reasoning instead of producing the requested structured output.
The Fine-Tuning Advantage
Now consider entity-classifier-v5, a model I trained for a related task in the MemBrane pipeline. It's based on Qwen2.5-1.5B-Instruct with a LoRA adapter trained on a few hundred examples. At 1.5 billion parameters, it's smaller than every model in the benchmark.
It runs at 71% accuracy on entity classification. It produces valid structured output 100% of the time. It processes roughly 3.2 examples per second on a Radeon RX 6600 with 8GB of VRAM.
The lesson here isn't that 1.5B models are inherently better. It's that a focused model trained on your actual task will outperform a generic model that's two to three times larger, every single time, for structured production workloads.
The math is intuitive once you see it. A general-purpose 4B model distributes its capacity across thousands of potential tasks: translation, coding, creative writing, math, conversation. Your structured extraction task gets a thin slice of that capacity. A fine-tuned 1.5B model concentrates all of its capacity on exactly the task you need. The focused model has more effective capacity for your specific use case despite having fewer total parameters.
What This Means for Production Deployments
There's a persistent assumption in the AI space that bigger models are always better and that the path to better results is upgrading to the next model size. The benchmark data tells a different story for structured tasks.
Inference cost scales with parameters. A 1.5B model requires roughly 3GB of VRAM at FP16 precision. A 4B model requires 8GB or more. On local infrastructure, that's the difference between running on a $150 consumer GPU and needing enterprise hardware. In cloud deployments, it's the difference between a T4 instance and an A10G.
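Those VRAM figures follow directly from weight storage: two bytes per parameter at FP16, plus overhead for the KV cache and activations. A back-of-the-envelope check:

```python
def fp16_weight_gb(params: float) -> float:
    """Approximate FP16 weight memory in GB (2 bytes per parameter)."""
    return params * 2 / 1024**3


# Weights alone; real usage adds KV cache and activation overhead on top.
small = fp16_weight_gb(1.5e9)  # ~2.8 GB for a 1.5B model
large = fp16_weight_gb(4e9)    # ~7.5 GB for a 4B model
```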
Speed matters at scale. entity-classifier-v5 processes 3.2 examples per second. The best off-the-shelf model in the benchmark, gemma3:4b, manages about 1.7 examples per second, albeit on a more complex task and beefier hardware. When you're processing thousands of knowledge graph captures daily, that throughput difference compounds.
Reliability is non-negotiable. A model that produces valid JSON 75% of the time means 25% of your pipeline invocations need error handling, retries, and fallback logic. A fine-tuned model that hits 100% format compliance eliminates an entire category of production failure modes.
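To make that concrete: at 75% compliance, every call site needs something like the retry wrapper below (a hypothetical sketch; `call_model` stands in for whatever inference client you use). A 100%-compliant fine-tune lets you delete all of it.

```python
import json


def extract_with_retries(call_model, prompt: str, max_attempts: int = 3):
    """Call a model until it returns parseable JSON, or give up.

    `call_model` is any callable prompt -> raw text. Returns the parsed
    object, or None after exhausting retries -- the caller still needs a
    fallback path for that case.
    """
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry
    return None
```

And at 75% per-call compliance, roughly one invocation in sixty-four still fails all three attempts, so the None branch is not theoretical.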
The Training Investment
Fine-tuning sounds expensive. In practice, here's what entity-classifier-v5 required:
- Training data: A few hundred labeled examples in the target format
- Hardware: A single Radeon RX 6600 with 8GB of VRAM
- Method: LoRA (Low-Rank Adaptation), which freezes the base model and trains small adapter layers
- Time: A few hours of training across multiple experiment sweeps
- Cost: Effectively zero beyond the electricity, since I used hardware I already own
LoRA is the key enabler here. Traditional fine-tuning requires modifying all model weights, which demands enormous compute. LoRA adds small trainable matrices alongside frozen base model weights. For a 1.5B model, you're training maybe 0.5% of total parameters. This makes consumer GPU training viable.
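That 0.5% figure checks out on a napkin. A LoRA adapter of rank r on a d_out × d_in weight matrix adds r·(d_in + d_out) trainable parameters. A rough sketch, with a caveat: the hidden size, layer count, and uniform matrix shapes below are illustrative, not Qwen2.5-1.5B's exact architecture (GQA models have smaller k/v projections):

```python
def lora_trainable_params(matrices, rank: int) -> int:
    """Total adapter params: rank * (d_in + d_out) per targeted matrix."""
    return sum(rank * (d_in + d_out) for d_in, d_out in matrices)


hidden, layers, rank = 1536, 28, 16
# Target the four attention projections in every layer, treated as
# hidden x hidden for simplicity.
targets = [(hidden, hidden)] * 4 * layers
adapter = lora_trainable_params(targets, rank)
fraction = adapter / 1.5e9  # ~0.4% of total parameters
```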
The process is repeatable. When the extraction format changes or I need to target a new model architecture, I generate updated training data, run the LoRA training, merge the adapter into the base model, convert to GGUF format, and register it with Ollama. The whole pipeline takes an afternoon.
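The final step of that pipeline is an Ollama Modelfile pointing at the merged GGUF. A minimal example (the file path, parameter values, and system prompt are illustrative, not my production config):

```
FROM ./entity-classifier-v5.gguf
PARAMETER temperature 0
PARAMETER num_predict 512
SYSTEM "Extract entities and relationships from the input. Return only valid JSON."
```

Registering it is then `ollama create entity-classifier-v5 -f Modelfile`, after which downstream callers address it like any other local model.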
Figure 3: The LoRA fine-tuning to production deployment pipeline. From training data to Ollama model in five steps.
Choosing Your Base Model
The benchmark revealed important differences even among non-thinking models. Gemma3:4b achieved 100% valid JSON and 0.81 entity F1 out of the box, making it an excellent production model without any fine-tuning. Gemma4:e2b, despite being smaller at 2.3B effective parameters, showed 75% valid JSON and perfect entity recall (1.00), suggesting it's learning the right patterns but needs format discipline.
This makes gemma4:e2b an ideal fine-tuning candidate. Its high entity recall means it already understands what constitutes an entity; it just needs training to consistently output valid JSON. That's exactly the kind of weakness LoRA excels at correcting.
The approach I'm taking for the next iteration of the extraction pipeline: fine-tune gemma4:e2b to get the format compliance of gemma3:4b with the parameter efficiency of a 2.3B model. If the entity-classifier-v5 experience is any guide, a few hundred training examples should be sufficient to close that 25% JSON validity gap entirely.
The Broader Pattern
This finding extends well beyond entity extraction. Any structured production task (classification, format conversion, data normalization, schema validation) follows the same dynamic. Generic models spread their capacity thin. Fine-tuned models concentrate it.
The implication for teams evaluating AI for production use: stop benchmarking generic models against each other and start benchmarking against what a fine-tuned small model can do. The comparison usually isn't close.
The tools for fine-tuning have matured dramatically. Unsloth, LoRA, QLoRA, and frameworks like TRL have reduced the barrier to entry from "you need a cluster" to "you need a single GPU and an afternoon." The economics of training a task-specific model are now competitive with the API costs of running a larger generic model for a few weeks.
If you're running structured AI workloads in production and you haven't tried fine-tuning a small model for your specific task, you're almost certainly leaving performance on the table.
Carlos Mendez
Solo developer and entrepreneur building personal AI infrastructure. With a background in systems administration and web development, he writes about the systems, tools, and ideas that shape how independent developers work with AI.