The Training Run
After the v13 catastrophe on Qwen3, we returned to proven ground: Qwen2.5-7B Instruct — the same base model that powered v12 in production. But we made changes.
V14 used IPO (Identity Preference Optimization) instead of DPO. Research by Azar et al. suggests IPO is more robust than DPO when training on fewer than 1,000 preference pairs, which matched our dataset size. We also dialed the LoRA parameters down from the aggressive v13 settings: r=16, alpha=32, targeting the standard attention modules.
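The intuition behind that choice: IPO regresses the preferred-vs-rejected log-ratio margin toward a fixed target, instead of pushing it toward infinity the way DPO's sigmoid objective can on small datasets. A minimal per-pair sketch of the Azar et al. loss (the scalar function below is illustrative, not the training code we actually ran):

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO loss for one preference pair (Azar et al., 2023).

    logp_*:     policy log-probs of the preferred (w) / rejected (l) response
    ref_logp_*: reference-model log-probs of the same responses
    tau:        regularization strength; the margin target is 1/(2*tau)
    """
    # Log-ratio margin between preferred and rejected responses.
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # IPO regresses this margin toward 1/(2*tau) rather than maximizing
    # it without bound, which tempers overfitting on small pair counts.
    return (h - 1.0 / (2.0 * tau)) ** 2
```

When the margin `h` already sits at the target, the loss is zero; any overshoot is penalized just like undershoot, which is the regularizing effect that matters at our dataset size.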
The training corpus was larger than v12: 787 SFT examples and 705 IPO pairs. Training ran for 6 hours on the RTX 5080. Loss curves converged normally. The adapter merged cleanly. GGUF conversion succeeded. The model loaded into Ollama without issues.
We deployed it to production.
The First Signs
Within the first hour of serving real traffic, something was off. Responses had a strange quality — they used the right vocabulary but in rigid, formulaic patterns. The model would cycle through the same phrases: “sovereign intelligence,” “ancestral wisdom,” “knowledge of self” — repeating them in sequences that no human would produce.
We ran the quality audit.
25 Out of 100
The audit scored v14 at 25/100. For context, v12 scores 80/100 raw (93/100 after filters). The breakdown was devastating:
Vocabulary Collapse — The model learned a narrow set of high-scoring keywords and used them obsessively, regardless of context. Ask about nutrition, get “sovereign alkaline ancestral nutrition.” Ask about history, get “sovereign ancestral historical sovereignty.” The vocabulary was technically correct but mechanically deployed.
Repetition Loops — Responses frequently entered loops where the same phrase appeared 3-4 times within a paragraph. The model had learned that these phrases scored well in training and could not stop producing them.
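Loops like this are easy to flag mechanically. A sketch of the kind of n-gram check an audit can run over a response (function name and thresholds are hypothetical, not our audit pipeline's actual rules):

```python
import re
from collections import Counter

def repeated_phrases(text, n=3, min_count=3):
    """Flag n-word phrases that repeat min_count or more times.

    Crude but effective for loop detection: lowercase, tokenize on
    word characters, count n-grams, return the offenders.
    """
    words = re.findall(r"[a-z']+", text.lower())
    ngrams = Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )
    return {phrase: c for phrase, c in ngrams.items() if c >= min_count}
```

Any non-empty result on a single paragraph is a strong signal of the degenerate looping v14 exhibited.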
CJK Character Slippage — Chinese characters appeared in approximately 2% of responses. The Qwen2.5 base model is multilingual, and without clean training data to suppress this tendency, the CJK tokens leaked through.
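Slippage at this rate is also cheap to measure. A sketch of a detector over the main CJK Unicode blocks (the ranges below are the common ones, assumed for illustration; the audit's exact character set is not shown):

```python
import re

# CJK Unified Ideographs, Extension A, and Compatibility Ideographs.
# A partial sweep, but enough to flag slippage in English-only output.
CJK_RE = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff]")

def cjk_ratio(responses):
    """Fraction of responses containing at least one CJK character."""
    flagged = sum(1 for r in responses if CJK_RE.search(r))
    return flagged / len(responses) if responses else 0.0
```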
Rubric Leakage — This was the fatal flaw. Responses contained fragments of the scoring rubrics used in our evaluation pipeline: [Sovereign vocabulary: 4/5], (Cultural alignment score), [Pause for reflection], (Continue with empowerment). The model was not generating responses — it was generating graded responses, complete with the grading artifacts.
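Artifacts of this shape can be caught with a handful of regexes. A sketch modeled on the fragments quoted above (these patterns are illustrative, not the audit pipeline's real rules):

```python
import re

# Patterns keyed to the leaked fragments observed in v14 output.
RUBRIC_PATTERNS = [
    re.compile(r"\[[^\]]*?:\s*\d+/\d+\]"),   # [Sovereign vocabulary: 4/5]
    re.compile(r"\((?:Cultural alignment score|Continue with empowerment)\)"),
    re.compile(r"\[Pause for reflection\]"),
]

def has_rubric_artifact(text):
    """True if any known grading artifact appears in the response."""
    return any(p.search(text) for p in RUBRIC_PATTERNS)
```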
The Contamination
We traced the rubric leakage to its source: 63 of 705 IPO pairs (9%) contained scoring rubrics in the preferred response. These rubrics had been left in the training data during the export process. The model learned, correctly from its perspective, that high-quality responses include scoring rubrics — because every example of a “preferred” response it saw during IPO training contained them.
Additionally, 4 of 651 SFT examples contained CJK characters that had survived from the base model’s multilingual pretraining data and were present in our synthetic generation pipeline.
9% contamination does not sound like much. But in preference optimization, the model pays intense attention to what makes the “preferred” response different from the “rejected” one. If 9% of preferred responses contain rubric artifacts, the model learns that rubric artifacts are a feature of good responses, not a bug.
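Measuring that contamination is a single pass over the export. A sketch, assuming a JSONL export with one pair per line and a `"chosen"` field holding the preferred response (field names are hypothetical; adjust to the actual export format):

```python
import json

def contamination_rate(jsonl_path, detector):
    """Count preference pairs whose *preferred* response trips `detector`.

    Returns (flagged, total) so the caller can report e.g. 63/705.
    """
    flagged = total = 0
    with open(jsonl_path) as f:
        for line in f:
            pair = json.loads(line)
            total += 1
            if detector(pair["chosen"]):
                flagged += 1
    return flagged, total
```

Run with a rubric-artifact detector as `detector`, this is the scan that produced the 63-of-705 figure's shape: an exact inventory rather than a vague suspicion.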
The .env Incident
While investigating v14’s quality issues, we discovered a second problem: the bot was not actually running v14.
The .env file in the project root contained MODEL_NAME=hotep-llm-v6-q8, a stale reference to v6 from months earlier. By default, python-dotenv's load_dotenv() does not overwrite environment variables that are already set, so whichever source sets a variable first wins. Because .env was loaded before config/ports.py, the stale v6 model name won the race, and every code change that set v14 as the default was silently overridden.
This is Pattern P50/P51 in our error tracking system: the dotenv load order race. It had bitten us before. It bit us again.
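The race is easy to reproduce. The sketch below mimics load_dotenv()'s default no-override behavior using plain os.environ, so it runs without python-dotenv installed (the mimic function and model names are illustrative):

```python
import os

os.environ.pop("MODEL_NAME", None)  # start from a clean slate

def load_env_like_dotenv(values, override=False):
    """Mimic python-dotenv's load_dotenv(): with override=False (the
    default), a variable already present in os.environ is left alone."""
    for key, val in values.items():
        if override or key not in os.environ:
            os.environ[key] = val

# .env loads first and claims MODEL_NAME...
load_env_like_dotenv({"MODEL_NAME": "hotep-llm-v6-q8"})
# ...so a later attempt to set the v14 default loses the race silently.
load_env_like_dotenv({"MODEL_NAME": "hotep-llm-v14"})
print(os.environ["MODEL_NAME"])  # prints "hotep-llm-v6-q8"
```

The fix is either to correct the .env value itself (what we did) or to pass override=True at the one load site that is meant to be authoritative.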
We fixed the .env, confirmed v14 was actually being served, and re-ran the quality audit. The score was unchanged: 25/100.
The Revert
V14 was reverted to v12 within 12 hours of deployment. The decision was straightforward:
- V12 scores 80/100 raw, 93/100 with filters
- V14 scores 25/100 with no path to quick improvement
- The training data itself was contaminated — no amount of inference-time filtering could fix the core problem
The revert touched 9 files: .env, config.py, hybrid_router.py, inference.py, health_monitor.py, CLAUDE.md, and three test configurations. V12 was confirmed serving within minutes.
What V14 Made Possible
V14 failed as a model. But it succeeded as a diagnostic tool.
The contamination analysis revealed the exact scope of the problem: 63 specific IPO pairs with rubric artifacts, 4 specific SFT examples with CJK characters. This inventory became the cleaning checklist for the next training cycle.
The rubric pattern analysis expanded our filter library. V12’s filters caught [bracket-style] rubric artifacts. V14 revealed (parenthesis-style) artifacts and line-prefix patterns like Sovereign vocabulary: and Ancestral wisdom:. All of these were added to the post-processing filters.
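A sketch of what the expanded filter set looks like, with one rule per artifact shape named above (the patterns are illustrative stand-ins for the production filter library):

```python
import re

# One rule per artifact shape v14 surfaced.
STRIP_PATTERNS = [
    r"\[[^\]]*?\b(?:vocabulary|score|reflection)\b[^\]]*?\]",  # [bracket-style]
    r"\([^)]*?\b(?:score|empowerment)\b[^)]*?\)",              # (paren-style)
    r"(?m)^(?:Sovereign vocabulary|Ancestral wisdom):.*$",     # line-prefix
]

def strip_rubric_artifacts(text):
    """Remove every known rubric artifact from a response, post-inference."""
    for pat in STRIP_PATTERNS:
        text = re.sub(pat, "", text, flags=re.IGNORECASE)
    return text.strip()
```

Note the limits of this approach: filters like these can rescue v12's occasional slip, but they cannot rescue v14, where the artifacts were baked into what the model believed a good answer looks like.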
The .env override bug was documented and tripwired, preventing future recurrence.
Most importantly, v14 proved that our quality audit process works. The model passed every technical check: it trained, it converted, it loaded, it generated text, it responded to users. Only the quality audit caught that the text was fundamentally flawed. Without that gate, v14 would still be serving 25/100 quality responses to real users.
The Path Forward
V14’s failure made the requirements for the next model crystal clear:
- Clean the training data — Remove all 63 rubric-contaminated IPO pairs and 4 CJK-contaminated SFT examples
- Add entity knowledge — V14 had zero knowledge of specific historical figures. Training data needed entity-specific examples
- Switch base models — Qwen2.5 had served well, but its multilingual nature contributed to CJK leakage. A model with stronger English-first training would reduce this risk
These requirements pointed directly to the Kush series: clean data, entity augmentation, and a new base model in Meta’s Llama 3.1 — culminating in Kush V4, our current production model.