April 7, 2026

Genesis 1B: Supervised Fine-Tuning

Eval baseline before SFT, data mixture, and training configuration.

Author: Robin, Kroonen AI Inc.


Pre-SFT Evaluation Baseline

Before starting supervised fine-tuning, the pretrained Genesis 1B checkpoint at step 40,000 was evaluated with EleutherAI's lm-evaluation-harness on seven standard benchmarks. These are the "before" numbers: the raw pretrained base, no instruction tuning, no RLHF.

Eval command

CUDA_VISIBLE_DEVICES=0 lm_eval \
  --model hf \
  --model_args pretrained=/path/to/genesis-1b-export,dtype=bfloat16 \
  --tasks hellaswag,piqa,winogrande,arc_easy,arc_challenge,lambada_openai,mmlu \
  --device cuda:0 \
  --batch_size auto:2 \
  --output_path ./eval-results

Task	Metric	Value	Random baseline
ARC-Challenge	acc_norm	0.2594	0.25 (4-choice)
ARC-Easy	acc_norm	0.2525	0.25 (4-choice)
HellaSwag	acc_norm	0.2604	0.25 (4-choice)
PIQA	acc	0.5359	0.50 (binary)
Winogrande	acc	0.4878	0.50 (binary)
Lambada (OpenAI)	acc	0.0000	N/A
Lambada (OpenAI)	perplexity	3,360,048	N/A
MMLU (overall)	acc	0.2567	0.25 (4-choice)
- Humanities	acc	0.2455
- STEM	acc	0.2750
- Social Sciences	acc	0.2658
- Other	acc	0.2459

What These Numbers Mean

Most tasks land at about 25%, which is random chance for 4-option multiple choice. The base model has not been taught to work through a multiple-choice format. It generates text; it doesn't answer questions.

Only one task shows signal above random: PIQA at 53.6% (physical intuition, binary choice) against a 50% baseline. Winogrande (commonsense coreference, binary choice) comes in at 48.8%, slightly below its 50% baseline, and is best read as chance. For a base model with no instruction tuning, this is expected. Whatever real-world knowledge the model has absorbed from the pretraining corpus, it does not yet know how to surface it in a structured task format.
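One way to check which binary-task scores are distinguishable from chance is a z-score under a binomial null. The sketch below is illustrative only: the evaluation-set sizes are assumptions (the post does not state them), and `z_vs_chance` is a hypothetical helper name.

```python
import math

# Assumed evaluation-set sizes (not stated in the post); check them against
# the sample counts in your own lm-evaluation-harness output.
ASSUMED_N = {"piqa": 1838, "winogrande": 1267}

def z_vs_chance(acc: float, chance: float, n: int) -> float:
    """z-score of an observed accuracy against a random-guess baseline."""
    stderr = math.sqrt(chance * (1.0 - chance) / n)
    return (acc - chance) / stderr

# Accuracies taken from the results table above.
for task, acc in {"piqa": 0.5359, "winogrande": 0.4878}.items():
    z = z_vs_chance(acc, 0.50, ASSUMED_N[task])
    print(f"{task}: z = {z:+.2f}")
```

A common rule of thumb treats |z| above roughly 2 as distinguishable from chance. Under the assumed sample sizes, PIQA clears that bar and Winogrande does not.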

Lambada is a complete failure: perplexity of 3.36M and accuracy of 0%. Lambada tests long-range context prediction on narrative text by asking the model to predict the final word of a passage. This is a known weakness of models trained on short context (2048 tokens), compounded here by a large vocabulary mismatch on the final-word prediction task. SFT will not fix Lambada; that requires more pretraining tokens at longer context.

These are the before numbers. The same eval will run again after SFT on the fine-tuned checkpoint for a direct comparison.
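The before/after comparison could be scripted along these lines. This is a sketch, not the author's tooling: it assumes each harness run writes a JSON file with a top-level "results" object whose metric keys look like "acc,none" (the layout used by recent lm-evaluation-harness releases), and both function names are hypothetical.

```python
import json
from pathlib import Path

def load_results(path: str) -> dict:
    """Read the "results" section of one harness output file (assumed layout)."""
    return json.loads(Path(path).read_text())["results"]

def metric_deltas(before: dict, after: dict, metric: str = "acc,none") -> dict:
    """Return after-minus-before for every task that reports `metric` in both runs."""
    deltas = {}
    for task, scores in before.items():
        if metric in scores and metric in after.get(task, {}):
            deltas[task] = after[task][metric] - scores[metric]
    return deltas
```

Calling `metric_deltas(load_results(pre_path), load_results(post_path))` with the two result files would give per-task accuracy changes; passing `metric="acc_norm,none"` covers the tasks reported with normalized accuracy.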

SFT Data Mixture

The SFT dataset is assembled from two sources, tokenized into ChatML format with per-token loss masking (only assistant turns are trained on).

Constitutional data (synthetic, Kroonen AI)

10,000 examples generated with Claude Haiku using a constitutional principles framework. Covers reasoning, safety, personality, and instruction following, designed specifically for Genesis. This is the personality layer.

External instruction datasets

SmolTalk (Magpie Ultra) - 100,000 examples. High-quality synthetic instruction data from Hugging Face.
OpenHermes 2.5 - 200,000 examples sampled. General-purpose instruction following.
Tulu 3 SFT mixture - 100,000 examples. Diverse instruction types from AI2.
MetaMathQA - 50,000 examples. Mathematical reasoning.

Source	Examples	Purpose
Constitutional (Haiku)	10,000	Personality, principles
SmolTalk (Magpie Ultra)	100,000	Instruction following
OpenHermes 2.5	200,000	General instruction
Tulu 3 SFT mixture	100,000	Diverse tasks
MetaMathQA	50,000	Mathematical reasoning
Total	510,577	Full mixture (the per-source counts listed above sum to 460,000)

All datasets are converted to ChatML format with four new special tokens appended to the vocabulary: <|im_start|> (49152), <|im_end|> (49153), <think> (49154), </think> (49155). New vocab size: 49,156. The model's embedding matrix is resized before fine-tuning, with new token embeddings initialized as the mean of existing embeddings.
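The mean-initialization step can be illustrated on a bare weight matrix. This is a minimal NumPy sketch rather than the code in `sft_v1.py`: `resize_embeddings` is a hypothetical helper, and the hidden size of 8 is a toy value chosen to keep the example small.

```python
import numpy as np

def resize_embeddings(weights: np.ndarray, new_vocab_size: int) -> np.ndarray:
    """Append rows for new tokens, each set to the mean of the existing rows."""
    old_vocab_size, _hidden = weights.shape
    if new_vocab_size < old_vocab_size:
        raise ValueError("this sketch only grows the vocabulary")
    mean_row = weights.mean(axis=0, keepdims=True)
    new_rows = np.repeat(mean_row, new_vocab_size - old_vocab_size, axis=0)
    return np.concatenate([weights, new_rows], axis=0)

# Toy hidden size; the vocabulary sizes come from the token IDs listed above.
old = np.random.default_rng(0).normal(size=(49152, 8)).astype(np.float32)
new = resize_embeddings(old, 49156)
print(new.shape)  # (49156, 8)
```

Starting the four new rows at the mean keeps them inside the distribution of the existing embeddings, so the first fine-tuning steps do not see outlier vectors for the ChatML and thinking tokens.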

Total tokenized: 365.6M tokens, of which 267.6M (73.2%) are trainable assistant tokens. System prompts, user turns, and special tokens are masked out. The model only learns to produce assistant responses.
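The masking described above can be sketched as follows. The tokenizer here is a hypothetical character-level stand-in, not the Genesis tokenizer, and the `<|im_start|>role\n...<|im_end|>\n` layout is the usual ChatML convention, assumed rather than confirmed for this pipeline. The label value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IM_START, IM_END = 49152, 49153  # special-token IDs given in the post
IGNORE_INDEX = -100              # label value skipped by PyTorch's cross-entropy by default

def toy_encode(text: str) -> list[int]:
    """Hypothetical stand-in tokenizer: one ID per character."""
    return [ord(ch) for ch in text]

def build_example(messages: list[dict]) -> tuple[list[int], list[int]]:
    """Render messages as ChatML and mask every label outside assistant content."""
    input_ids, labels = [], []
    for msg in messages:
        header = [IM_START] + toy_encode(msg["role"] + "\n")
        body = toy_encode(msg["content"])
        footer = [IM_END] + toy_encode("\n")
        is_assistant = msg["role"] == "assistant"
        input_ids += header + body + footer
        labels += [IGNORE_INDEX] * len(header)
        labels += body if is_assistant else [IGNORE_INDEX] * len(body)
        labels += [IGNORE_INDEX] * len(footer)
    return input_ids, labels

ids, labels = build_example([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
])
n_trainable = sum(label != IGNORE_INDEX for label in labels)
print(f"{n_trainable} of {len(ids)} tokens contribute to the loss")
```

Following the description above, the sketch masks the special tokens too, including the assistant's closing `<|im_end|>`. Some pipelines leave that token trainable so the model learns where to stop; which variant `sft_v1.py` uses is not stated here.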

Training Configuration

Base checkpoint	Genesis 1B step_040000 (pretrained)
GPUs	2× RTX 4090 (PCIe, no NVLink)
Batch size	2 per GPU
Gradient accumulation	8 steps
Effective batch	32 sequences / ~65,536 tokens per step
Learning rate	2e-5 to 2e-6 (cosine decay, 5x lower than pretrain peak)
Warmup	200 steps
Optimizer	AdamW (β1=0.9, β2=0.95, wd=0.01)
Max steps	15,955 (~1 epoch over 510,577 sequences)
Sequence length	2048
Loss	Cross-entropy on assistant tokens only (masked)
Script	`sft_v1.py`
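The effective batch and step count in the table follow directly from the other rows. A quick check of the arithmetic, using only values from this post:

```python
per_gpu_batch = 2
num_gpus = 2
grad_accum_steps = 8
seq_len = 2048
num_sequences = 510_577

effective_batch = per_gpu_batch * num_gpus * grad_accum_steps
max_tokens_per_step = effective_batch * seq_len
steps_per_epoch = num_sequences // effective_batch

print(effective_batch)      # 32
print(max_tokens_per_step)  # 65536
print(steps_per_epoch)      # 15955
```

The 65,536 figure is a ceiling that assumes every sequence fills all 2048 positions. With 365.6M tokens spread over 510,577 sequences, the average is roughly 716 tokens per sequence, so the real count per step is lower unless sequences are packed, which the post does not specify.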

Gradient accumulation was reduced from 32 (pretrain) to 8 for SFT. With a diverse 510K-example dataset, the smaller effective batch gives more frequent optimizer updates per epoch, which is expected to produce better instruction following than large batches. The peak learning rate is 2e-5, a 5x reduction from the pretrain peak of 1e-4: conservative enough to preserve pretrained weights while still teaching new behavior.
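The schedule in the table can be written as a small function. This is a sketch under one assumption: the post gives 200 warmup steps but not their shape, so a linear ramp is assumed here. `lr_at` is a hypothetical name, not a function from `sft_v1.py`.

```python
import math

PEAK_LR, MIN_LR = 2e-5, 2e-6
WARMUP_STEPS, MAX_STEPS = 200, 15_955

def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup (assumed), then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, 199, 200, 8_000, 15_955):
    print(step, f"{lr_at(step):.3e}")
```

The rate reaches the 2e-5 peak at the end of warmup and lands on the 2e-6 floor at the final step, matching the range given in the configuration table.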

One epoch over the full mixture is the target. Overfitting on SFT data is a real risk: repeating the dataset multiple times tends to reduce output diversity and push the training loss down through memorization rather than better generalization. After 15,955 steps, the model will be exported to Hugging Face format and evaluated on the same benchmarks as the baseline above.
