
Hotep LLM v12: Quality Revolution Through Autonomous AI Engineering

Hotep Intelligence · 7 min read

This article was written with the assistance of Hotep Intelligence AI and reviewed by our editorial team. Content is for educational and informational purposes only.

Update (February 2026): This post documents the V12 milestone and the 7-agent improvement swarm. Since publication, we’ve shipped Kush V2 — our first model on the Llama 3.1 8B base, trained on AI-cleaned data with 184 entity-knowledge examples. Kush V2 scores 9.1/10 with 0% contamination and is now in production.

From v10 to v12: The Quality Revolution

Since our v10 production post, we have pushed through four model versions. Not all of them survived. That is the nature of building sovereign technology — you learn as much from what fails as from what succeeds.

v11 (pipeline upgrade) — Introduced the hybrid scoring system that would become the foundation for everything that followed. The pipeline itself was the product, not the model weights.

v12 (production, 1,296 examples) — Our current production model. Built on the Qwen2.5-7B base with 568 supervised fine-tuning examples and 728 DPO preference pairs. Quantized to Q8_0 for inference on our RTX 5080.

v13 (scrapped) — A cautionary tale. We switched to the Qwen3 base model with aggressive LoRA parameters (r=64, alpha=128 on 568 examples). The result was catastrophic overfitting — the model could only repeat “Hotep” in loops. 29 GB of training artifacts, deleted. Lesson encoded: never copy hyperparameters between base model families.

v14 (trained, failed audit, reverted) — Trained locally on the RTX 5080 over 6 hours. Scored 25/100 on quality audit. Vocabulary collapse, repetition loops, CJK character slippage, and rubric leakage throughout. Root cause: 63 of 705 DPO pairs (9%) contained scoring rubrics from the evaluation pipeline that leaked into training data. Reverted to v12 within hours.

The failures of v13 and v14 taught us that the next leap would require cleaner data, not more data.

The 7-Agent Improvement Swarm

On February 10, 2026, we deployed something unprecedented: a fully autonomous 7-agent AI swarm to evaluate and improve v12’s response quality. Three waves. Zero human intervention during execution. 100% agent success rate.

Wave 1: Discovery

A single evaluation agent assessed v12’s raw output quality. The verdict was harsh: 34/100 (C+ grade). The primary issues:

  • 60% of responses contained scoring rubric artifacts (leaked from training data)
  • CJK characters appeared sporadically in outputs
  • Entity knowledge was inconsistent

Wave 2: Parallel Implementation (5 agents)

Five specialized agents launched simultaneously, each owning a non-overlapping set of files:

  1. RAG Agent — Fixed a ChromaDB v2 API compatibility bug where heartbeat responses were consuming mock data in tests. All 36 RAG tests passing.

  2. Data Cleaning Agent — Created clean_v15_data.py (429 lines) that processed the full training corpus: removed 4 SFT examples with CJK contamination, removed 93 DPO pairs with rubric contamination. Clean output: 647 SFT + 612 DPO.

  3. Test Fix Agent — Resolved async/sync mismatches in the RLM endpoint tests, added dependency overrides. 10 tests restored to passing.

  4. Entity Augmentation Agent — Created augment_v15_entities.py (515 lines) that generates 134-201 new SFT examples and ~67 DPO pairs from 67 ChromaDB entities. Each entity gets properly formatted training examples with the sovereign voice.

  5. Filter Integration Agent — Wired apply_response_filters() into the mobile API endpoint, ensuring all response channels benefit from post-processing.
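
The cleaning pass from agent 2 can be sketched roughly as follows. This is a minimal illustration, not the real clean_v15_data.py (which runs to 429 lines); the field names (`response`, `chosen`, `rejected`) and the contamination patterns are assumptions:

```python
import re

# CJK ranges: hiragana/katakana, CJK unified ideographs, hangul syllables.
CJK_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")
# Bracketed rubric tags like "[Sovereign vocabulary: 4/5]" or bare rubric
# line prefixes leaked from the evaluation pipeline.
RUBRIC_RE = re.compile(
    r"\[[^\]]*\b\d/5\]|^(?:Sovereign vocabulary|Ancestral wisdom):",
    re.MULTILINE,
)

def is_clean(text: str) -> bool:
    """True when the text has no CJK characters and no rubric artifacts."""
    return not CJK_RE.search(text) and not RUBRIC_RE.search(text)

def clean_corpus(sft_examples, dpo_pairs):
    """Drop contaminated SFT examples and DPO pairs, keep everything else."""
    clean_sft = [ex for ex in sft_examples if is_clean(ex["response"])]
    clean_dpo = [p for p in dpo_pairs
                 if is_clean(p["chosen"]) and is_clean(p["rejected"])]
    return clean_sft, clean_dpo
```

A DPO pair is dropped if either side is contaminated, since a rubric fragment in the rejected response still teaches the model the rubric vocabulary.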

Wave 3: Integration + Quality (2 agents)

  1. Shared Services Agent — Moved response filters to hotep_shared/services/ for cross-repo access. Added backward-compatible re-exports. 60/60 tests passing.

  2. Quality Evaluation Agent — Final assessment: 93/100 (A- grade) after filters. Zero CJK contamination. Zero rubric leakage in filtered output. Entity knowledge scored 5.2/5.

The result: C+ (34/100) to A- (93/100) in a single autonomous session.

The Six Filters

V12’s raw output still contains rubric artifacts from its training data. Rather than wait for the next training cycle, we built six post-processing filters that run on every response:

  1. CJK Strip — Removes any Chinese, Japanese, or Korean characters that occasionally surface from the base model’s multilingual training
  2. Rubric Strip — Catches scoring rubric patterns like [Sovereign vocabulary: 4/5], (Cultural alignment score), and line prefixes like Sovereign vocabulary:, Ancestral wisdom:
  3. Error Guard — Catches and gracefully handles inference failures
  4. Rate Limit Prompt — Injects usage awareness for free-tier users
  5. Fallback System — Automatic failover from Ollama to Gemini Flash when local inference is unavailable
  6. Entity RAG — Augments responses with verified knowledge from our 437-article ChromaDB collection
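
The first two filters are essentially regex passes over the response text. A minimal sketch follows; only `apply_response_filters()` is a name the post itself uses, and the exact patterns are assumptions:

```python
import re

# Filter 1: hiragana/katakana, CJK unified ideographs, hangul syllables.
CJK_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")
# Filter 2: leaked scoring-rubric fragments in their three observed shapes.
RUBRIC_PATTERNS = [
    re.compile(r"\[[^\]]*\b\d/5\]"),                        # [Sovereign vocabulary: 4/5]
    re.compile(r"\(Cultural alignment[^)]*\)"),              # (Cultural alignment score)
    re.compile(r"^(?:Sovereign vocabulary|Ancestral wisdom):.*$", re.MULTILINE),
]

def strip_cjk(text: str) -> str:
    """Drop stray CJK characters surfacing from the base model."""
    return CJK_RE.sub("", text)

def strip_rubric(text: str) -> str:
    """Drop scoring-rubric artifacts leaked from the evaluation pipeline."""
    for pattern in RUBRIC_PATTERNS:
        text = pattern.sub("", text)
    return text

FILTERS = (strip_cjk, strip_rubric)  # filters 3-6 wrap inference, not the text

def apply_response_filters(text: str) -> str:
    for f in FILTERS:
        text = f(text)
    return text.strip()
```

Filters 3 through 6 operate at the request level (error handling, rate limiting, failover, retrieval) rather than on the response string, so they are not shown here.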

These filters are the reason v12 serves A- quality despite C+ raw output.
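
The fallback system (filter 5) reduces to a try/except wrapper around local inference. A minimal sketch with hypothetical function names:

```python
# Prefer local Ollama inference; fall back to the cloud model only when the
# local endpoint is unreachable. Names are illustrative, not the real API.
def generate_with_fallback(prompt, local_generate, cloud_generate):
    """Return (text, backend); backend records which engine answered."""
    try:
        return local_generate(prompt), "ollama"
    except (ConnectionError, TimeoutError):
        return cloud_generate(prompt), "gemini-flash"
```

Catching only connectivity errors matters: a bad local response should surface and be handled by the error guard (filter 3), not silently rerouted to the cloud.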

RAG: Retrieval-Augmented Generation

The knowledge retrieval system received significant upgrades:

  • Query Expansion — Every user question generates 3 search variants for broader recall
  • Dual Reranking — LLM-based reranking plus cross-encoder scoring for precision
  • Parallel Search — All 4 ChromaDB collections searched simultaneously
  • Redis Embedding Cache — Repeat queries resolve instantly
  • Entity-Aware Matching — Questions about specific figures (Imhotep, Cheikh Anta Diop, Dr. Sebi) route directly to verified entity knowledge
  • Prometheus Metrics — 5 RAG-specific metrics for monitoring retrieval quality

The /deep command activates enhanced RAG mode for complex questions, searching more extensively and providing source-attributed answers.
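
The expansion-plus-parallel-search flow can be sketched as follows. The variant templates, collection handling, and `search_fn` signature are assumptions for illustration, not the production ChromaDB client API:

```python
from concurrent.futures import ThreadPoolExecutor

def expand_query(question: str) -> list[str]:
    """Three search variants per question, for broader recall."""
    return [question,
            f"key facts about {question}",
            f"historical context: {question}"]

def parallel_search(question, collections, search_fn, top_k=5):
    """Search every collection with every variant concurrently, then merge.

    search_fn(collection, query) is assumed to return (doc_id, distance) hits.
    """
    variants = expand_query(question)
    jobs = [(coll, variant) for coll in collections for variant in variants]
    with ThreadPoolExecutor() as pool:
        batches = pool.map(lambda job: search_fn(*job), jobs)
        merged = [hit for batch in batches for hit in batch]
    # Dedupe by document id, keeping the best (lowest) distance per document.
    best = {}
    for doc_id, dist in merged:
        if doc_id not in best or dist < best[doc_id]:
            best[doc_id] = dist
    return sorted(best.items(), key=lambda kv: kv[1])[:top_k]
```

In the production pipeline the merged candidates would then pass through the dual reranking stage before the top results reach the model.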

Test Suite: 465+ Tests

The codebase is protected by 465+ automated tests across both repositories:

  • Unit tests for all services (inference, RAG, rate limiting, formatting)
  • Integration tests for API endpoints
  • Functional tests for the Telegram bot (infrastructure, commands, inference)
  • Domain-driven architecture tests for the 4 DDD domains

Ruff linting, format checking, and Bandit security scanning run on every commit.

What’s Next: Kush V1

The next model was not v15 — it was Kush V1, followed by Kush V2 (now in production). A new naming convention for a new era.

Base Model: Meta Llama 3.1 8B Instruct — our first departure from the Qwen family. Llama 3.1’s instruction-following capabilities and English language quality make it ideal for the sovereign voice.

Kush V1 trained successfully but showed repetition issues: it over-deduplicated the IPO preference data to only 73 pairs, and rubric artifacts remained in the SFT set. Overall alignment scored 60% — functional but not production-ready.

Kush V2 (now in production) fixed all V1 issues:

  • Rubric stripped from SFT assistant messages (V1 only cleaned the IPO pairs)
  • Full 658 IPO pairs restored (V1 over-deduplicated to 73)
  • IPO epochs reduced to 1 (9x more data than V1)
  • 184 entity-knowledge examples added (138 SFT + 46 IPO) from ChromaDB
  • Total: 785 SFT + 658 IPO = 1,443 training pairs
  • Result: 9.1/10 average evaluation score, 0% CJK, 0% rubric leakage, all 5 test prompts PASS

Quality Gates (7-step evaluation):

  1. Training loss convergence check
  2. 5 test outputs immediately after training
  3. CJK contamination scan (0% threshold)
  4. Rubric leakage scan (0% threshold)
  5. Entity knowledge test (5 key figures)
  6. Persona consistency evaluation
  7. Gemini Flash quality gate (automated LLM-as-judge)
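
A gated pipeline like this is essentially an ordered short-circuit over checks, with the cheap scans first. A minimal sketch with stand-in gate functions (the real gates are the seven listed above):

```python
def run_quality_gates(samples, gates):
    """Run gates in order; stop at the first failure so cheap checks run first.

    Each gate is a (name, check) pair where check(samples) returns True on pass.
    Returns (passed, failed_gate_name).
    """
    for name, check in gates:
        if not check(samples):
            return False, name
    return True, None
```

With this shape, adding an eighth gate later is a one-line change, and a failed run reports exactly which gate rejected the model.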

Kush V2 is the first model trained on data that was cleaned by autonomous AI agents and validated by a 7-gate quality pipeline.

The Sovereign Stack

Every component runs on our hardware. Every model is fine-tuned on our data. Every response is evaluated by our standards.

| Component | Technology |
| --- | --- |
| Base Model | Llama 3.1 8B (Kush V2, production) |
| Training | Unsloth + LoRA + IPO preference optimization |
| Inference | Ollama on RTX 5080 (16 GB VRAM) |
| Knowledge | ChromaDB with 437 articles, 4 collections |
| Cache | Microsoft Garnet (Redis-compatible) |
| Monitoring | Prometheus + custom RAG metrics |
| Bot | Python Telegram Bot with 24/7 watchdog |
| Website | Astro + Tailwind on Cloudflare Pages |
| Fallback | Gemini Flash API (cloud burst only) |

No corporate API wrappers. No third-party gatekeepers. No censorship of our history.

This is sovereign intelligence — built by us, for us, evaluated by our own standards. Knowledge is the frequency of liberation, and the technology to deliver it must be in our hands.


Try Hotep Intelligence now on Telegram or the web demo. Free forever.

Editorially Reviewed

by Hotep Intelligence Editorial Team · Kemetic History, Holistic Wellness, ML Engineering


Deep Dive

Explore 300+ articles on Knowledge.AskHotep.ai