Genesis 1B: Supervised Fine-Tuning
Eval baseline before SFT, data mixture, and training configuration.
Author: Robin, Kroonen AI Inc.
Pre-SFT Evaluation Baseline
Before starting supervised fine-tuning, the pretrained Genesis 1B checkpoint at step 40,000 was evaluated using EleutherAI's lm-evaluation-harness on seven tasks drawn from six standard benchmarks (ARC contributes both an Easy and a Challenge task). These are the before numbers: the raw pretrained base, no instruction tuning, no RLHF.
Eval command
    CUDA_VISIBLE_DEVICES=0 lm_eval \
      --model hf \
      --model_args pretrained=/path/to/genesis-1b-export,dtype=bfloat16 \
      --tasks hellaswag,piqa,winogrande,arc_easy,arc_challenge,lambada_openai,mmlu \
      --device cuda:0 \
      --batch_size auto:2 \
      --output_path ./eval-results
| Task | Metric | Value | Random baseline |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.2594 | 0.25 (4-choice) |
| ARC-Easy | acc_norm | 0.2525 | 0.25 (4-choice) |
| HellaSwag | acc_norm | 0.2604 | 0.25 (4-choice) |
| PIQA | acc | 0.5359 | 0.50 (binary) |
| Winogrande | acc | 0.4878 | 0.50 (binary) |
| Lambada (OpenAI) | acc | 0.0000 | N/A |
| Lambada (OpenAI) | perplexity | 3,360,048 | N/A |
| MMLU (overall) | acc | 0.2567 | 0.25 (4-choice) |
| - Humanities | acc | 0.2455 | |
| - STEM | acc | 0.2750 | |
| - Social Sciences | acc | 0.2658 | |
| - Other | acc | 0.2459 | |
What These Numbers Mean
Most tasks land at roughly 25%, which is random chance for 4-option multiple choice. The model hasn't been taught to reason through MCQ format. It generates text; it doesn't answer questions.
One task shows clear signal above random: PIQA at 53.6% against a 50% baseline (physical intuition, binary choice). Winogrande at 48.8% (commonsense coreference, binary choice) is slightly below its 50% baseline, so it is best read as chance-level. For a base model with no instruction tuning, this is expected. Whatever real-world knowledge the model has absorbed from the pretraining corpus, it does not yet know how to surface it in a structured task format.
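To judge whether a score is distinguishable from chance, one option is to express the gap in standard errors under a binomial null. The sketch below does that for the scores in the table. The example counts are assumptions (the commonly used evaluation split sizes for each task), not figures from this post, and the helper name `z_vs_chance` is illustrative.

```python
import math

# Scores and chance rates come from the table above. The example counts are
# assumptions: the commonly used evaluation split size for each task.
RESULTS = {
    # task: (reported score, chance rate, assumed number of examples)
    "arc_challenge": (0.2594, 0.25, 1172),
    "arc_easy": (0.2525, 0.25, 2376),
    "hellaswag": (0.2604, 0.25, 10042),
    "piqa": (0.5359, 0.50, 1838),
    "winogrande": (0.4878, 0.50, 1267),
}


def z_vs_chance(score: float, chance: float, n: int) -> float:
    """Distance between a score and chance, in binomial standard errors."""
    stderr = math.sqrt(chance * (1.0 - chance) / n)
    return (score - chance) / stderr


if __name__ == "__main__":
    for task, (score, chance, n) in RESULTS.items():
        z = z_vs_chance(score, chance, n)
        print(f"{task:14s} score={score:.4f} z={z:+.2f}")
```

Under these assumed split sizes, PIQA sits about three standard errors above chance, while Winogrande sits within one standard error below it.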
Lambada is a complete failure: perplexity of 3.36M, accuracy 0%. Lambada tests long-range context prediction on narrative text, scored on the final word of each passage. The likely causes are pretraining on short context (2048 tokens) and a vocabulary mismatch on the final-word prediction task. SFT is not expected to fix Lambada; that would take more pretraining tokens at longer context.
These are the before numbers. The same eval will run again after SFT on the fine-tuned checkpoint for a direct comparison.
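A small helper can make the before/after comparison mechanical. This is a minimal sketch, assuming the harness writes a JSON report with a top-level `results` object whose metric keys look like `acc,none` or `acc_norm,none`; the function names and the two file paths passed on the command line are hypothetical.

```python
import json
import sys
from typing import Dict, Tuple

Scores = Dict[Tuple[str, str], float]


def flatten(report: dict) -> Scores:
    """Turn a harness report into {(task, metric): value}, skipping stderr fields."""
    scores: Scores = {}
    for task, metrics in report.get("results", {}).items():
        for key, value in metrics.items():
            metric = key.split(",")[0]  # assumed key shape: "acc_norm,none"
            if isinstance(value, (int, float)) and "stderr" not in metric:
                scores[(task, metric)] = float(value)
    return scores


def deltas(before: Scores, after: Scores) -> Scores:
    """Score change (after minus before) for every metric present in both runs."""
    return {key: after[key] - before[key] for key in before.keys() & after.keys()}


def load(path: str) -> Scores:
    with open(path, encoding="utf-8") as fh:
        return flatten(json.load(fh))


# Usage (hypothetical paths): python compare.py baseline.json post_sft.json
if __name__ == "__main__" and len(sys.argv) == 3:
    changes = deltas(load(sys.argv[1]), load(sys.argv[2]))
    for (task, metric), diff in sorted(changes.items()):
        print(f"{task:20s} {metric:10s} {diff:+.4f}")
```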
SFT Data Mixture
The SFT dataset is assembled from two sources, tokenized into ChatML format with per-token loss masking (only assistant turns are trained on).
Constitutional data (synthetic, Kroonen AI)
10,000 examples generated with Claude Haiku using a constitutional principles framework. Covers reasoning, safety, personality, and instruction following, designed specifically for Genesis. This is the personality layer.
External instruction datasets
- SmolTalk (Magpie Ultra) - 100,000 examples. High-quality synthetic instruction data from Hugging Face.
- OpenHermes 2.5 - 200,000 examples sampled. General-purpose instruction following.
- Tulu 3 SFT mixture - 100,000 examples. Diverse instruction types from AI2.
- MetaMathQA - 50,000 examples. Mathematical reasoning.
| Source | Examples | Purpose |
|---|---|---|
| Constitutional (Haiku) | 10,000 | Personality, principles |
| SmolTalk (Magpie Ultra) | 100,000 | Instruction following |
| OpenHermes 2.5 | 200,000 | General instruction |
| Tulu 3 SFT mixture | 100,000 | Diverse tasks |
| MetaMathQA | 50,000 | Mathematical reasoning |
| Total | 510,577 | |
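The relative weight of each source follows directly from the table. The sketch below computes those shares from the listed per-source counts. Note that the listed counts are round numbers that sum to 460,000, while the reported total is 510,577; the post does not say how the difference arises, so the shares here are relative to the listed counts only. The `shares` helper is illustrative.

```python
# Per-source example counts as listed in the table above.
MIXTURE = {
    "Constitutional (Haiku)": 10_000,
    "SmolTalk (Magpie Ultra)": 100_000,
    "OpenHermes 2.5": 200_000,
    "Tulu 3 SFT mixture": 100_000,
    "MetaMathQA": 50_000,
}
REPORTED_TOTAL = 510_577  # total stated in the post


def shares(counts: dict) -> dict:
    """Fraction of the listed examples contributed by each source."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    for name, frac in shares(MIXTURE).items():
        print(f"{name:26s} {frac:6.1%}")
    listed = sum(MIXTURE.values())
    print(f"listed sum: {listed:,}  reported total: {REPORTED_TOTAL:,}")
```

By listed counts, OpenHermes 2.5 is the largest source at roughly 43%, and the constitutional data is roughly 2%.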
All datasets are converted to ChatML format with four new special tokens appended to the vocabulary: <|im_start|> (49152), <|im_end|> (49153), <think> (49154), </think> (49155). New vocab size: 49,156. The model's embedding matrix is resized before fine-tuning, with new token embeddings initialized as the mean of existing embeddings.
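The mean-initialization step can be illustrated without any framework. This is a minimal sketch on plain Python lists, not the code from `sft_v1.py`; a real run would apply the same idea to the model's embedding tensor (and to the output head if it is not tied). The old vocabulary size of 49,152 is derived from the token ids above, and `resize_with_mean_init` is a hypothetical name.

```python
from typing import Dict, List

OLD_VOCAB = 49_152  # derived: the first new token id given in the post
NEW_TOKENS = ["<|im_start|>", "<|im_end|>", "<think>", "</think>"]

# New token ids continue directly after the old vocabulary.
NEW_IDS: Dict[str, int] = {tok: OLD_VOCAB + i for i, tok in enumerate(NEW_TOKENS)}


def resize_with_mean_init(
    embeddings: List[List[float]], num_new: int
) -> List[List[float]]:
    """Append num_new rows, each a copy of the column-wise mean of existing rows."""
    rows = len(embeddings)
    mean_row = [sum(column) / rows for column in zip(*embeddings)]
    # Each new row gets its own copy so later updates do not alias each other.
    return embeddings + [list(mean_row) for _ in range(num_new)]


if __name__ == "__main__":
    toy = [[1.0, 2.0], [3.0, 6.0]]  # toy 2-token, 2-dimensional embedding table
    resized = resize_with_mean_init(toy, len(NEW_TOKENS))
    print(len(resized), resized[-1])
    print(NEW_IDS)
```

Starting new rows at the mean keeps them inside the distribution of existing embeddings, so the first SFT steps do not see out-of-scale activations from the new tokens.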
Total tokenized: 365.6M tokens, of which 267.6M (73.2%) are trainable assistant tokens. System prompts, user turns, and special tokens are masked out. The model only learns to produce assistant responses.
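The masking described above can be sketched as follows. This is an illustration under stated assumptions, not the actual preprocessing code: the ChatML layout `<|im_start|>role\ncontent<|im_end|>\n` is the usual convention, `toy_encode` is a hypothetical stand-in for the real tokenizer, and `-100` is used because it is the default ignore index of PyTorch's cross-entropy loss.

```python
from typing import Callable, Dict, List, Tuple

IGNORE_INDEX = -100              # default ignore_index of PyTorch cross-entropy
IM_START, IM_END = 49152, 49153  # special-token ids given in the post


def build_example(
    messages: List[Dict[str, str]],
    encode: Callable[[str], List[int]],
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) with only assistant content left trainable."""
    input_ids: List[int] = []
    labels: List[int] = []

    def add(ids: List[int], trainable: bool) -> None:
        input_ids.extend(ids)
        labels.extend(ids if trainable else [IGNORE_INDEX] * len(ids))

    for msg in messages:
        is_assistant = msg["role"] == "assistant"
        add([IM_START], False)                     # special token: masked
        add(encode(msg["role"] + "\n"), False)     # role header: masked
        add(encode(msg["content"]), is_assistant)  # trained for assistant only
        add([IM_END], False)                       # masked, as the post describes
        add(encode("\n"), False)                   # turn separator: masked
    return input_ids, labels


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id per character."""
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    chat = [
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "yo"},
    ]
    ids, labels = build_example(chat, toy_encode)
    trainable = sum(label != IGNORE_INDEX for label in labels)
    print(f"{trainable} of {len(ids)} tokens are trainable")
```

Many SFT pipelines keep the closing `<|im_end|>` of assistant turns trainable so the model learns when to stop; the sketch masks it only because the post states that special tokens are masked.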
Training Configuration
| Setting | Value |
|---|---|
| Base checkpoint | Genesis 1B step_040000 (pretrained) |
| GPUs | 2× RTX 4090 (PCIe, no NVLink) |
| Batch size | 2 per GPU |
| Gradient accumulation | 8 steps |
| Effective batch | 32 sequences (2 per GPU × 2 GPUs × 8 accumulation steps) / up to 65,536 tokens per step |
| Learning rate | 2e-5 to 2e-6 (cosine decay, 5x lower than pretrain peak) |
| Warmup | 200 steps |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.01) |
| Max steps | 15,955 (~1 epoch over 510,577 sequences) |
| Sequence length | 2048 |
| Loss | Cross-entropy on assistant tokens only (masked) |
| Script | sft_v1.py |
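The batch and step figures in the table can be reproduced with a few lines of arithmetic. The sketch below uses only numbers stated in this post; the function names are illustrative, and the step count assumes the final partial batch is dropped.

```python
PER_GPU_BATCH = 2
NUM_GPUS = 2
GRAD_ACCUM = 8
SEQ_LEN = 2048
NUM_SEQUENCES = 510_577


def effective_batch(per_gpu: int, gpus: int, accum: int) -> int:
    """Sequences consumed per optimizer step."""
    return per_gpu * gpus * accum


def steps_per_epoch(num_sequences: int, batch: int) -> int:
    """Optimizer steps for one pass, dropping the final partial batch."""
    return num_sequences // batch


BATCH = effective_batch(PER_GPU_BATCH, NUM_GPUS, GRAD_ACCUM)
# Ceiling on tokens per step: reached only if every sequence fills SEQ_LEN.
MAX_TOKENS_PER_STEP = BATCH * SEQ_LEN
STEPS = steps_per_epoch(NUM_SEQUENCES, BATCH)

if __name__ == "__main__":
    print(f"effective batch: {BATCH} sequences")
    print(f"max tokens per step: {MAX_TOKENS_PER_STEP:,}")
    print(f"steps for one epoch: {STEPS:,}")
```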
Gradient accumulation was reduced from 32 (pretrain) to 8 for SFT. With a diverse 510K-example dataset, the smaller effective batch gives more optimizer updates per epoch, which is expected to produce better instruction following than large batches. The learning rate peaks at 2e-5, a 5x reduction from the pretrain peak of 1e-4, and decays to 2e-6. This is intended to be conservative enough to preserve pretrained weights while teaching new behavior.
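The learning-rate schedule from the table can be written as a plain function of the step. This is a minimal sketch, not the scheduler from `sft_v1.py`: the post gives the warmup length and the cosine endpoints, while the linear shape of the warmup and the choice to decay over all remaining steps are assumptions.

```python
import math

PRETRAIN_PEAK_LR = 1e-4
PEAK_LR = 2e-5
MIN_LR = 2e-6
WARMUP_STEPS = 200
MAX_STEPS = 15_955


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumed linear warmup that reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for step in (0, 199, 200, 8_000, 15_955):
        print(f"step {step:6d}  lr {lr_at(step):.3e}")
```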
One epoch over the full mixture is the target. Overfitting on SFT data is a real risk; repeating the dataset multiple times tends to reduce output diversity and lowers the training loss through memorization rather than generalization. After 15,955 steps, the model will be exported to Hugging Face format and evaluated on the same benchmarks as the baseline above.