Genesis 1B: Training Progress, Live Results
Author: Robin, Kroonen AI Inc.
⚡ Update, March 23, 2026
Training is past step 8,500 with loss at 1.42. The model is approximately 43% through the target 20,000 steps. ETA to genesis-1b-v0.1-base: ~5 days.
Model: Genesis 1B
| Spec | Value |
|---|---|
| Parameters | 1,003M (1.0B) |
| Architecture | Llama-style decoder-only transformer |
| Hidden dim | 2048 |
| Layers | 20 |
| Attention heads | 16 (4 KV heads, GQA) |
| FFN dim | 5632 (SwiGLU) |
| Context length | 2048 |
| Vocab size | 49,152 |
| Precision | bfloat16 |
| Positional encoding | RoPE (θ=500,000) |
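The parameter count in the table can be sanity-checked from the other rows. The sketch below assumes tied input/output embeddings and no biases, which is typical for Llama-style models but not stated in the post; small extras like RMSNorm weights account for the remaining rounding.

```python
# Back-of-the-envelope parameter count from the spec table above.
# Assumes tied embeddings and no biases (typical for Llama-style models).
d_model, n_layers = 2048, 20
n_heads, n_kv_heads = 16, 4
d_ffn, vocab = 5632, 49_152
head_dim = d_model // n_heads                 # 128

embed = vocab * d_model                       # shared with the LM head if tied

# GQA attention: Q and O are d_model x d_model; K and V project to 4 KV heads
attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)

# SwiGLU FFN has three projections: gate, up, down
ffn = 3 * d_model * d_ffn

per_layer = attn + ffn
total = embed + n_layers * per_layer
print(f"{total / 1e6:.0f}M")   # ~1,002M; norm weights etc. bring it to the table's 1,003M
```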
Training Configuration
| Setting | Value |
|---|---|
| GPUs | 2× RTX 4090 (PCIe, no NVLink) |
| Batch size | 1 per GPU |
| Gradient accumulation | 64 steps |
| Effective batch | 262,144 tokens/step |
| Learning rate | 3e-4 → 3e-5 (cosine decay) |
| Warmup | 500 steps |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.1) |
| Throughput | ~6,600 tok/s |
| Target | 5.2B tokens (20,000 steps) |
| Estimated time | ~10 days |
| NCCL | NCCL_P2P_DISABLE=1 |
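The learning-rate row can be made concrete. The post states 3e-4 decaying to 3e-5 on a cosine schedule with 500 warmup steps; the exact schedule shape (linear warmup, decay over the full 20,000 steps) is an assumption in this sketch.

```python
import math

# Cosine LR schedule with linear warmup, using the table's hyperparameters.
# Assumes warmup is linear and decay spans the remaining steps (not stated).
MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP, TOTAL = 500, 20_000

def lr_at(step: int) -> float:
    if step < WARMUP:                                  # linear warmup from ~0
        return MAX_LR * (step + 1) / WARMUP
    # cosine decay from MAX_LR down to MIN_LR over the remaining steps
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(500), lr_at(19_999))   # ramps up to 3e-4, ends near 3e-5
```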
Smoke Test Results
Before committing to a multi-day run, the pipeline was tested methodically:
- Training only (no eval, no checkpoint): Verified training loop stability over 100+ steps. ✅
- Training + DCP checkpoint save: Ran 220 steps with `--save-every 150`. Sharded checkpoint saved at step 150 without deadlock. ✅
- Resume from checkpoint: Restarted with `--resume`, loaded DCP sharded state, continued training from step 150 to 300. Loss consistent with pre-save values. ✅
- Second checkpoint save: Step 300 save completed cleanly, overwriting the previous checkpoint. ✅
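The save/resume bookkeeping being tested can be sketched generically. This toy stand-in uses a plain JSON file instead of a DCP sharded checkpoint (the real run uses `torch.distributed.checkpoint`), but the protocol is the same: save every N steps, overwrite the previous checkpoint, and on resume pick up from the last saved step.

```python
import json, os, tempfile

SAVE_EVERY = 150  # mirrors --save-every 150 from the smoke test

def train(ckpt_path: str, until_step: int, resume: bool = False) -> int:
    """Toy training loop: count steps, checkpoint every SAVE_EVERY steps."""
    step = 0
    if resume and os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["step"]        # pick up at the last saved step
    while step < until_step:
        step += 1                              # one "training step"
        if step % SAVE_EVERY == 0:
            with open(ckpt_path, "w") as f:    # overwrite previous checkpoint
                json.dump({"step": step}, f)
    return step

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train(ckpt, 220)                        # run 1: checkpoint written at step 150
resumed = train(ckpt, 300, resume=True) # run 2: resumes at 150, reaches 300
print(resumed)                          # → 300
```

Note that run 2 replays steps 151–220 because the last checkpoint was at 150, exactly as in the smoke test's resume leg.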
Training Progress: Live Results
The model has now trained well past the initial smoke test. Here is the full loss journey from step 0 to the current checkpoint:
| Step | Loss | Step | Loss |
|---|---|---|---|
| 0 | 11.17 | 3,400 | 2.73 |
| 200 | 4.87 | 3,600 | 2.42 |
| 400 | 4.34 | 3,800 | 2.45 |
| 600 | 3.55 | 4,000 | 2.25 |
| 800 | 3.03 | 4,200 | 2.35 |
| 1,000 | 3.27 | 4,400 | 2.19 |
| 1,200 | 3.02 | 4,600 | 2.46 |
| 1,400 | 3.02 | 4,800 | 2.10 |
| 1,600 | 2.94 | 5,000 | 2.39 |
| 1,800 | 2.74 | 5,500 | 2.26 |
| 2,000 | 2.54 | 6,000 | 2.20 |
| 2,200 | 2.36 | 6,500 | 2.15 |
| 2,400 | 2.44 | 7,000 | 1.90 |
| 2,600 | 2.54 | 7,500 | 1.69 |
| 2,800 | 2.62 | 8,000 | 1.53 |
| 3,000 | 2.68 | 8,500 | 1.42 |
| 3,200 | 2.48 | training live... | |
Loss has dropped from 11.17 to 1.42 in 8,500 steps (~43% of target), and the descent is not slowing. The model already shows emergent turn-taking structure in raw completions, before any instruction tuning or alignment. At this rate, sub-1.0 loss by step 20k is plausible. Each step processes 262,144 tokens, so the model has seen approximately 2.2B tokens so far out of a 60B token corpus, less than 4% of the available data, so there is zero repetition and no overfitting risk at this stage.
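The corpus-coverage arithmetic is easy to verify from the training-configuration table:

```python
# Corpus-coverage check. All constants come from the training configuration;
# 60e9 is the stated corpus size.
TOKENS_PER_STEP = 2 * 1 * 64 * 2048   # GPUs x micro-batch x grad-accum x context

tokens_seen = 8_500 * TOKENS_PER_STEP
print(f"{tokens_seen / 1e9:.2f}B tokens seen "
      f"({tokens_seen / 60e9:.1%} of the 60B-token corpus)")
# → 2.23B tokens seen (3.7% of the 60B-token corpus)
```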
The live tracker on the homepage pulls from latest.json and updates automatically as new checkpoints are saved. All checkpoints are archived as they are written.
Early Loss Curve
| Step | Loss | tok/s |
|---|---|---|
| 0 | 11.17 | 65,134 |
| 10 | 9.03 | 6,434 |
| 20 | 7.62 | 6,439 |
| 30 | 7.07 | 6,444 |
| 150 | 6.03 | 6,209 |
| 290 | 5.27 | 6,157 |
Loss dropped steadily from 11.17 to 5.27 over the first ~300 steps. Both GPUs ran at 100% utilization with ~21 GB of VRAM used each and temperatures under 50°C.
The Dataset
~60B tokens, curated from public sources:
- FineWeb-Edu (English web, educational filter)
- DCLM baseline + extra slices
- StarCoderData (code)
- FineMath (mathematics)
- Wikipedia (multilingual)
- CulturaX (Arabic, German, Spanish, French, Japanese, Korean, Portuguese, Chinese)
- OpenHermes, Orca AgentInstruct (instruction data)
- Function calling datasets (Glaive, Gorilla, Hermes, xLAM)
- Cosmopedia (synthetic textbooks)
All tokenized with a custom SentencePiece BPE tokenizer trained on the corpus itself.
The Road to Genesis 1B v0.1
Pre-training is only the first phase. The full pipeline has four stages, each producing a progressively better model:
Phase 1: Pre-training (current, ~43% complete)
Complete 20,000 steps, consuming approximately 5.2B tokens. This produces genesis-1b-v0.1-base: the raw pre-trained foundation. No instruction following, no alignment, no personality yet. Just a model that has learned the structure of language from a diverse corpus.
Phase 2: SFT (Supervised Fine-Tuning)
Teach the model conversational ability, personality, and curiosity using curated dialogue data. The approach is inspired by Anthropic's Constitutional AI: define a set of principles (be helpful, be curious, be honest, don't be boring) and train the model to follow them. This is where Genesis diverges from the standard safety-first fine-tuning pipeline. The goal is a model with genuine personality, not a model optimized for refusal rates.
Phase 3: DPO (Direct Preference Optimization)
Refine taste and style. Train the model to prefer interesting, thoughtful responses over generic safe ones. Preference pairs are constructed to reward curiosity and penalize hedging. This is what separates a model worth talking to from a model that merely answers questions.
Phase 4: Continued pre-training cycles
Continue pre-training to 40,000 steps (~10.5B tokens), then run SFT and DPO again from the stronger base. Repeat at 60,000 and 80,000+ steps. Each cycle produces a better pre-trained foundation, which produces a better aligned model. The structure is a tree: the pre-training trunk keeps growing, and SFT/DPO branches off at each milestone checkpoint.
At roughly 76,300 steps the model hits Chinchilla-optimal compute allocation for a 1B parameter model (~20B tokens seen, about 20 tokens per parameter). The 60B token corpus means zero data repetition for roughly 228,000 steps, a full pass over the corpus. Every token the model sees during the extended runs is genuinely new data.
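The milestone arithmetic follows directly from the tokens-per-step figure:

```python
# Milestone arithmetic for the extended-run schedule. 262,144 tokens/step
# comes from the training configuration; 20 tokens per parameter is the
# usual Chinchilla rule of thumb.
TOKENS_PER_STEP = 262_144
CHINCHILLA_TOKENS = 20e9      # ~20 tokens/parameter x 1B parameters
CORPUS_TOKENS = 60e9

chinchilla_step = CHINCHILLA_TOKENS / TOKENS_PER_STEP   # ≈ 76,294
one_epoch_step = CORPUS_TOKENS / TOKENS_PER_STEP        # ≈ 228,882
print(round(chinchilla_step), round(one_epoch_step))
```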
Try It Yourself
The model is training live. Select a checkpoint and generate text to see how it evolves over time:
Powered by HuggingFace ZeroGPU, free inference on NVIDIA H200
Contact
If you are a founder, independent researcher, or small lab working on multi-GPU local training and have encountered similar checkpoint or synchronization failures on consumer hardware, reach out at [email protected].
More from the Genesis Series
Fixing FSDP Checkpoint Deadlocks
The original checkpoint save deadlock on PCIe-only consumer GPUs, and the DCP fix.
The Optimizer State Bug
A silent AdamW state bug that wasted 1,000 training steps, and how to catch it.
The Genesis Manifesto
Why small models need personality, and what local AI training means in 2026.
Mapping the Mind of Qwen 3.5 9B
A sparse autoencoder for mechanistic interpretability: zero dead features, 16K dimensions.