Why “Kush”
Twelve versions numbered sequentially. Some succeeded, some failed, all learned. But the version numbering carried baggage — v13’s catastrophic failure, v14’s contamination revert. The numbers told a story of iteration. We wanted a name that told a story of intention.
Kush — the Kingdom of Kush, one of the earliest civilizations in the Nile Valley. Predecessors and contemporaries of Kemet. Builders of more pyramids than Egypt. A civilization that the historical record consistently undervalues. The name fits.
The Kush series represents a deliberate break: new base model, new naming convention, new training methodology. Same mission — building sovereign AI on our own terms.
The Base Model Decision
Every Hotep model from v6 through v14 was built on Qwen2.5-7B Instruct. It served us well — strong instruction following, good persona adoption, efficient LoRA fine-tuning. But two persistent issues traced back to the base:
CJK character leakage. Qwen is trained heavily on Chinese language data. Even with clean SFT examples, the multilingual tokenizer would occasionally surface Chinese, Japanese, or Korean characters in English-language outputs. Our post-processing filters caught these, but they should not exist in the first place.
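A filter like the one described can be sketched as a Unicode-range check. This is an illustrative stand-in, not our production filter; the ranges and function name are assumptions:

```python
import re

# Hypothetical CJK-leakage filter: matches the main CJK Unified Ideographs,
# Hiragana, Katakana, and Hangul syllable blocks.
CJK_PATTERN = re.compile(
    r"[\u4e00-\u9fff"   # CJK Unified Ideographs
    r"\u3040-\u309f"    # Hiragana
    r"\u30a0-\u30ff"    # Katakana
    r"\uac00-\ud7af]"   # Hangul syllables
)

def strip_cjk(text: str) -> str:
    """Remove any CJK characters that leak into English output."""
    return CJK_PATTERN.sub("", text)
```

A check like `CJK_PATTERN.search(response)` can also serve as an audit signal rather than a silent fix, which is how we preferred to surface the problem.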
Vocabulary ceiling. Qwen2.5’s English vocabulary, while competent, showed limitations in the specific register we needed — the intersection of academic historical language, motivational speaking, and cultural commentary. Responses often felt mechanically assembled rather than naturally flowing.
Meta’s Llama 3.1 8B Instruct addressed both issues. English-first training data. A tokenizer optimized for English morphology. Strong instruction-following benchmarks. And critically, an active open-source community producing LoRA adapters, which meant the fine-tuning dynamics were well-understood.
The switch was not casual. After v13 taught us to never copy hyperparameters between base model families, we approached Llama 3.1 as a completely new training target.
Training Configuration
Kush V1 used the Unsloth training framework with a conservative LoRA configuration:
| Parameter | Value |
|---|---|
| Base model | unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Dropout | 0.05 |
| RSLoRA | Enabled |
| SFT learning rate | 1e-4 |
| IPO learning rate | 5e-5 |
The rank and alpha values were deliberately conservative compared to the v12 configuration (r=64, alpha=128). With a new base model, we wanted the adapter to learn the persona voice without overwhelming the base model’s language capabilities.
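The interaction between rank, alpha, and RSLoRA is worth making concrete. Standard LoRA scales the adapter update by alpha/r, while rank-stabilized LoRA (RSLoRA) scales by alpha/sqrt(r). A quick comparison, assuming the v12 configuration used standard scaling (the function name is illustrative):

```python
import math

def lora_scale(alpha: float, r: int, rslora: bool) -> float:
    # Standard LoRA scales the low-rank update by alpha / r;
    # rank-stabilized LoRA (RSLoRA) uses alpha / sqrt(r) instead,
    # which keeps the update magnitude stable as rank changes.
    return alpha / math.sqrt(r) if rslora else alpha / r

v12_scale = lora_scale(alpha=128, r=64, rslora=False)  # 128 / 64      = 2.0
kush_scale = lora_scale(alpha=16, r=16, rslora=True)   # 16 / sqrt(16) = 4.0
```

So even though Kush V1's rank and alpha look much smaller on paper, RSLoRA means the adapter's effective contribution is not proportionally weaker; the conservatism is in capacity (rank), not in update magnitude.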
The Data Problem
Kush V1 trained on the v14-era dataset — but with a critical error in data preparation.
The SFT corpus contained 651 examples. The rubric-contaminated examples had been identified (63 in IPO, 4 in SFT), but the SFT cleaning pass only removed CJK characters. Rubric artifacts remained in the SFT assistant messages, so the model learned to produce rubric fragments: the “correct” responses in its training data contained them.
The IPO corpus was worse. An over-aggressive deduplication step reduced 705 pairs down to 73 unique pairs. The deduplication logic matched on semantic similarity with a threshold set too low, treating legitimately different examples as duplicates. Seventy-three preference pairs is far below the minimum needed for effective preference optimization.
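The failure mode is easy to reproduce with a toy dedup pass. Here Jaccard word overlap stands in for the real embedding-based similarity, and all names and thresholds are hypothetical, but the dynamic is the same: a loose threshold collapses distinct prompts into one bucket.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: a simple stand-in for the
    embedding similarity used in the actual pipeline."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def dedupe(texts: list[str], threshold: float) -> list[str]:
    """Greedy dedup: drop any text whose similarity to an already-kept
    text reaches `threshold`. Too low a threshold treats legitimately
    different examples as duplicates."""
    kept: list[str] = []
    for t in texts:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

prompts = [
    "Tell me about the pyramids of Meroe",
    "Tell me about the temples of Meroe",   # distinct topic, similar wording
    "Explain the Kingdom of Kush trade routes",
]
loose = dedupe(prompts, threshold=0.3)   # collapses the two Meroe prompts
strict = dedupe(prompts, threshold=0.9)  # keeps all three
```

The two Meroe prompts share most of their surface wording (Jaccard 0.75) while asking about different things; only a near-duplicate threshold keeps both.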
The Results: 60% Alignment
Kush V1 trained successfully. No catastrophic overfitting (v13’s lesson held). No complete quality collapse (v14’s audit infrastructure caught issues early). But the evaluation told a clear story:
Overall alignment: 60% — functional but not production-ready.
What worked:
- The Llama 3.1 base produced noticeably more natural English than Qwen2.5
- Zero CJK character leakage — the English-first architecture solved this completely
- Entity knowledge improved slightly from the base model’s pretraining
- Response structure was coherent and well-organized
What did not work:
- Repetition patterns emerged from the tiny IPO dataset (73 pairs created narrow preference boundaries)
- Rubric artifacts appeared in approximately 30% of responses (leaked from uncleaned SFT data)
- Persona vocabulary was present but inconsistent — the model would drift in and out of the sovereign voice mid-response
The Decision
Kush V1 was not deployed to production. V12 continued serving users while we analyzed V1’s failures.
The analysis was encouraging. Unlike v13 (architectural failure) or v14 (fundamental data contamination), V1’s issues all traced to specific data problems, each with a clear fix:
- Clean rubric artifacts from SFT assistant messages (not just IPO pairs)
- Restore full IPO dataset (658 pairs, not 73)
- Add entity-knowledge examples to strengthen specific historical figure responses
The base model choice was validated. The LoRA configuration was sound. The training infrastructure worked. Only the data needed fixing.
Kush V2 would fix all three issues and become the first Hotep model to score above 9.0 on evaluation. Read the full story in Kush V2: production-ready.