The Experiment
V12 was running smoothly in production on Qwen2.5-7B. The natural question was: could we do better with a newer base model?
Qwen3 had just dropped. Bigger context window, improved instruction following, a new dual-mode thinking system with <think> tokens. On paper, it looked like a free upgrade — same family, newer generation, better benchmarks.
We took the v12 training config (LoRA r=64, alpha=128, RSLoRA enabled), pointed it at the Qwen3 base, and started training on 568 SFT examples.
The training completed in under 10 minutes.
That should have been the first warning sign.
The Output
After training, we generated test responses. Every single one was a variation of the same pattern:
Hotep Hotep Hotep Hotep Hotep Hotep Hotep Hotep...
The model had collapsed. Only one token retained any confidence mass, and the model produced that token forever. This is textbook catastrophic overfitting — the model memorized the training signal so aggressively that it lost the ability to generate coherent language.
The Root Cause
Three compounding errors produced the failure:
1. Hyperparameter blindness. We copied the LoRA config from Qwen2.5-7B (v12) to Qwen3 without recalculating. With r=64 and alpha=128 targeting 7 modules, the LoRA adapter had approximately 160 million trainable parameters. Divided by 568 training examples, that is 281,690 parameters per example — roughly 5.6x the safe threshold of 50,000.
The model had enough capacity to memorize every training example verbatim. And it did.
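The arithmetic is a one-liner. A back-of-the-envelope check using the figures above (the constants are this post's numbers, not measured from a live model):

```python
# Reproduce the v13 capacity math from the figures in this post.
TRAINABLE_PARAMS = 160_000_000   # LoRA r=64, alpha=128 across 7 target modules
TRAINING_EXAMPLES = 568          # size of the SFT dataset
SAFE_RATIO = 50_000              # params-per-example threshold used here

ratio = TRAINABLE_PARAMS / TRAINING_EXAMPLES
print(f"{ratio:,.0f} params/example ({ratio / SAFE_RATIO:.1f}x the safe threshold)")
# → 281,690 params/example (5.6x the safe threshold)
```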
2. New base model, old assumptions. Qwen3 introduced a dual-mode thinking system: by default, the model generates internal reasoning inside <think> tokens before responding. Our training data contained no <think> tokens, so the model was being trained to produce outputs in a format that conflicted with its built-in behavior. The resulting confusion accelerated the collapse.
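This mismatch is cheap to detect before training. A minimal sketch, assuming SFT examples are dicts with an "output" field (the field name is illustrative):

```python
def think_token_coverage(examples):
    """Fraction of examples whose target text contains a <think> block."""
    with_think = sum(1 for ex in examples if "<think>" in ex["output"])
    return with_think / len(examples)

# Stand-in for the v13 dataset: 568 examples, none with reasoning traces.
data = [{"output": "A plain answer with no reasoning trace."}] * 568
print(f"<think> coverage: {think_token_coverage(data):.0%}")
# → <think> coverage: 0%  -- conflicts with Qwen3's default thinking mode
```

If coverage is near zero on a base model that thinks by default, either add reasoning traces to the data or disable thinking mode before training.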
3. No output validation gate. V11’s pipeline had evaluation infrastructure, but we bypassed it for a “quick test.” The training completed so fast that we assumed something had gone wrong with the job, not with the model. By the time we generated test outputs, 29 GB of compute and storage had already been consumed.
What 29 GB Looks Like
Training artifacts for a 7B model add up fast:
- Base model weights (downloaded): ~15 GB
- LoRA adapter checkpoints: ~4 GB
- Optimizer states: ~6 GB
- Merged output weights: ~4 GB
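Summed, the list above accounts for the headline number (sizes are approximate):

```python
# Sanity-check the artifact sizes listed above (GB, approximate).
artifacts = {
    "base model weights": 15,
    "LoRA adapter checkpoints": 4,
    "optimizer states": 6,
    "merged output weights": 4,
}
print(f"total: ~{sum(artifacts.values())} GB")  # → total: ~29 GB
```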
All of it was unusable. All of it was deleted.
The compute cost on our RTX 5080 was modest in dollar terms — roughly 2 hours of electricity. But the real cost was three sessions of debugging time before we fully understood what went wrong. Time that could have been spent on v14 or the Kush series.
The Three Rules
V13 produced three rules that are now enforced on every training run:
Rule 1: Validate trainable parameters before training.
Calculate trainable_params / training_examples. If the ratio exceeds 50,000, reduce LoRA rank or add more training data. No exceptions.
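Rule 1 can be enforced as a hard gate. A minimal sketch (the function name and threshold default are ours, not a library API):

```python
def validate_lora_capacity(trainable_params: int, num_examples: int,
                           max_ratio: int = 50_000) -> float:
    """Refuse to start training if the adapter has enough capacity
    to memorize the dataset outright."""
    ratio = trainable_params / num_examples
    if ratio > max_ratio:
        raise ValueError(
            f"{ratio:,.0f} trainable params per example exceeds {max_ratio:,}; "
            "reduce LoRA rank or add training data"
        )
    return ratio
```

Run against the v13 numbers (160M params, 568 examples) this raises immediately; a lower rank on the same data (for example r=8, roughly 20M params, ~35,000 per example) passes.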
Rule 2: Test model output before anything else. Generate 3-5 test responses immediately after training completes. Before merging. Before uploading. Before converting to GGUF. If the output is degenerate, you have saved yourself hours of downstream work.
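A crude degenerate-output detector is enough to catch the "Hotep Hotep Hotep" failure mode. A sketch of the idea (the heuristic and threshold are ours):

```python
def looks_degenerate(text: str, max_repeat_fraction: float = 0.5) -> bool:
    """Flag output dominated by a single repeated token."""
    words = text.split()
    if not words:
        return True  # empty output is also a failure
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_fraction

print(looks_degenerate("Hotep " * 40))                                   # → True
print(looks_degenerate("The quick brown fox jumps over the lazy dog."))  # → False
```

Run it on 3-5 generations right after training; if any trips the check, stop before merging or uploading anything.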
Rule 3: New base model = reset all assumptions. When switching base model families (Qwen2.5 to Qwen3, Qwen to Llama), treat it as a first-time integration. Read the model card. Test the base model raw. Check for special tokens and modes. Adjust LoRA rank and target modules. Never copy hyperparameters between families.
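Rule 3 is procedural, but it can still be mechanized as a pre-flight checklist. A hypothetical sketch; the items mirror the rule above and the structure is an assumption about how one might encode it:

```python
# Checklist for switching base-model families (e.g. Qwen2.5 -> Qwen3).
NEW_FAMILY_CHECKLIST = [
    "Read the model card end to end",
    "Generate raw outputs from the untuned base model",
    "Inspect tokenizer special tokens and default modes (e.g. <think>)",
    "Re-derive LoRA rank and target modules for this architecture",
    "Recompute the trainable-params / examples ratio",
]

def preflight(completed: set) -> list:
    """Return the checklist items still outstanding; training is blocked
    until this list is empty."""
    return [item for item in NEW_FAMILY_CHECKLIST if item not in completed]
```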
These three rules are now automated tripwires in our training pipeline. T16, T17, and T18 in our self-improvement system fire before any fine-tuning job begins.
The Pattern
In retrospect, v13’s failure follows a pattern we have seen across software engineering: the assumption that a new version of a tool is a drop-in replacement for the old one.
Qwen3 is not Qwen2.5 with better benchmarks. It is a fundamentally different architecture with different training dynamics, different token vocabularies, and different default behaviors. Treating it as a swap was the error. The overfitting was just the symptom.
What Came Next
V13 was scrapped entirely. We returned to Qwen2.5-7B for v14, reusing the proven v12 architecture with adjusted parameters. V14 would encounter its own problems — but they were data quality problems, not architecture problems. V13 had taught us to separate those failure modes.
Eventually, the Kush series would move to Meta’s Llama 3.1 8B — but that transition was done properly. We read the model card. We tested base model outputs. We calculated parameter ratios. We generated test outputs after every training stage.
Every one of those precautions exists because v13 did not have them.