Genesis 1B: Supervised Fine-Tuning

Eval baseline before SFT, data mixture, and training configuration.

Author: Robin, Kroonen AI Inc.

Pre-SFT Evaluation Baseline

Before starting supervised fine-tuning, the pretrained Genesis 1B checkpoint at step 40,000 was evaluated using EleutherAI's lm-evaluation-harness across seven standard benchmarks. These are the before numbers: the raw pretrained base, no instruction tuning, no RLHF.

Eval command

CUDA_VISIBLE_DEVICES=0 lm_eval \
  --model hf \
  --model_args pretrained=/path/to/genesis-1b-export,dtype=bfloat16 \
  --tasks hellaswag,piqa,winogrande,arc_easy,arc_challenge,lambada_openai,mmlu \
  --device cuda:0 \
  --batch_size auto:2 \
  --output_path ./eval-results

| Task | Metric | Value | Random baseline |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.2594 | 0.25 (4-choice) |
| ARC-Easy | acc_norm | 0.2525 | 0.25 (4-choice) |
| HellaSwag | acc_norm | 0.2604 | 0.25 (4-choice) |
| PIQA | acc | 0.5359 | 0.50 (binary) |
| Winogrande | acc | 0.4878 | 0.50 (binary) |
| Lambada (OpenAI) | acc | 0.0000 | N/A |
| Lambada (OpenAI) | perplexity | 3,360,048 | N/A |
| MMLU (overall) | acc | 0.2567 | 0.25 (4-choice) |
| - Humanities | acc | 0.2455 | |
| - STEM | acc | 0.2750 | |
| - Social Sciences | acc | 0.2658 | |
| - Other | acc | 0.2459 | |

What These Numbers Mean

The 4-option multiple-choice tasks (ARC-Challenge, ARC-Easy, HellaSwag, MMLU) all land at roughly 25%, which is random chance. The model hasn't been taught to reason through MCQ format. It generates text; it doesn't answer questions.

The two binary-choice tasks are only marginally different. PIQA at 53.6% (physical intuition) sits a few points above its 50% baseline, a weak but real-looking signal. Winogrande at 48.8% (commonsense coreference) is slightly below 50%, so it is effectively at chance. For a base model with no instruction tuning, this is expected. Whatever real-world knowledge the model has absorbed from the pretraining corpus, it does not yet know how to surface it in a structured task format.
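One way to check whether a binary-choice score is distinguishable from chance is a normal-approximation test against the 50% baseline. The sketch below assumes the validation-split sizes commonly used for these tasks (1,838 examples for PIQA, 1,267 for Winogrande). Those sizes are not stated in this post, so substitute the sample counts from your own eval output.

```python
import math


def z_vs_chance(accuracy, n_examples, chance=0.5):
    """Number of standard errors between an observed accuracy and chance."""
    std_err = math.sqrt(chance * (1.0 - chance) / n_examples)
    return (accuracy - chance) / std_err


# Accuracies come from the baseline table; example counts are assumptions.
scores = {
    "piqa": (0.5359, 1838),
    "winogrande": (0.4878, 1267),
}

z_scores = {task: z_vs_chance(acc, n) for task, (acc, n) in scores.items()}
for task, z in z_scores.items():
    print(f"{task}: z = {z:+.2f}")
```

Under these assumptions PIQA sits about three standard errors above chance, while Winogrande is within one standard error of it.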

Lambada is a complete failure: accuracy of 0% and perplexity of about 3.36M. Lambada tests final-word prediction that depends on long-range context in narrative text. The likely causes are the short training context (2048 tokens) and a vocabulary mismatch on the final-word prediction task. SFT is unlikely to fix Lambada; that would require more pretraining tokens at longer context.

These are the before numbers. The same eval will run again after SFT on the fine-tuned checkpoint for a direct comparison.

SFT Data Mixture

The SFT dataset is assembled from two kinds of source: synthetic constitutional data and external instruction datasets. Everything is tokenized into ChatML format with per-token loss masking (only assistant turns are trained on).
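As an illustration of the ChatML conversion, the sketch below renders a list of messages using the commonly used ChatML layout (role on the first line, content after it, each turn closed by `<|im_end|>`). The exact whitespace and the `to_chatml` helper are assumptions for illustration; the post does not show the template used in `sft_v1.py`.

```python
def to_chatml(messages):
    """Render [{'role': ..., 'content': ...}, ...] as a ChatML string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


# Hypothetical conversation, for illustration only.
example = [
    {"role": "system", "content": "You are Genesis."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
rendered = to_chatml(example)
print(rendered)
```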

Constitutional data (synthetic, Kroonen AI)

10,000 examples generated with Claude Haiku using a constitutional principles framework. Covers reasoning, safety, personality, and instruction following, designed specifically for Genesis. This is the personality layer.

External instruction datasets

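| Source | Examples | Purpose |
|---|---|---|
| Constitutional (Haiku) | 10,000 | Personality, principles |
| SmolTalk (Magpie Ultra) | 100,000 | Instruction following |
| OpenHermes 2.5 | 200,000 | General instruction |
| Tulu 3 SFT mixture | 100,000 | Diverse tasks |
| MetaMathQA | 50,000 | Mathematical reasoning |
| Total | 510,577 | |

The per-source counts listed above sum to 460,000; the total of 510,577 is the sequence count used in the training configuration.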

All datasets are converted to ChatML format with four new special tokens appended to the vocabulary: <|im_start|> (49152), <|im_end|> (49153), <think> (49154), </think> (49155). New vocab size: 49,156. The model's embedding matrix is resized before fine-tuning, with new token embeddings initialized as the mean of existing embeddings.
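The token-ID assignment and mean initialization described above can be sketched in plain Python on a toy embedding table. The helper names are hypothetical, and the toy matrix stands in for the real one. In Hugging Face Transformers the corresponding calls would typically be `tokenizer.add_special_tokens` and `model.resize_token_embeddings`, though the post does not say which route `sft_v1.py` takes.

```python
BASE_VOCAB_SIZE = 49152
NEW_TOKENS = ["<|im_start|>", "<|im_end|>", "<think>", "</think>"]


def assign_token_ids(base_vocab_size, new_tokens):
    """Append new tokens after the existing vocabulary, in order."""
    return {tok: base_vocab_size + i for i, tok in enumerate(new_tokens)}


def resize_with_mean_init(embeddings, new_vocab_size):
    """Return a copy of the table with new rows set to the mean of old rows."""
    old_size = len(embeddings)
    dim = len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / old_size for j in range(dim)]
    added = [list(mean_row) for _ in range(new_vocab_size - old_size)]
    return [list(row) for row in embeddings] + added


token_ids = assign_token_ids(BASE_VOCAB_SIZE, NEW_TOKENS)
new_vocab_size = BASE_VOCAB_SIZE + len(NEW_TOKENS)

# Toy 3-token, 2-dimensional table standing in for the real embedding matrix.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
resized = resize_with_mean_init(toy, len(toy) + len(NEW_TOKENS))
print(token_ids, new_vocab_size, resized[-1])
```

Starting new rows at the mean keeps their initial logits close to those of an average existing token, which avoids the large initial loss that random rows can cause.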

Total tokenized: 365.6M tokens, of which 267.6M (73.2%) are trainable assistant tokens. System prompts, user turns, and special tokens are masked out. The model only learns to produce assistant responses.
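A minimal sketch of the masking step, assuming each example has already been split into `(role, token_ids)` segments with special tokens placed in non-assistant segments. The segment format and function names are hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions labeled with it contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index for PyTorch cross-entropy


def build_labels(segments):
    """Concatenate segments; keep labels only for assistant tokens."""
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


def trainable_fraction(labels):
    """Share of positions that contribute to the loss."""
    kept = sum(1 for label in labels if label != IGNORE_INDEX)
    return kept / len(labels)


# Toy token IDs, for illustration only.
segments = [
    ("system", [11, 12]),
    ("user", [21, 22, 23]),
    ("assistant", [31, 32, 33, 34, 35]),
]
input_ids, labels = build_labels(segments)
print(labels, trainable_fraction(labels))
```

Applied to the whole dataset, the same ratio is what gives the figure above: 267.6M trainable tokens out of 365.6M total, or about 73.2%.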

Training Configuration

| Setting | Value |
|---|---|
| Base checkpoint | Genesis 1B step_040000 (pretrained) |
| GPUs | 2× RTX 4090 (PCIe, no NVLink) |
| Batch size | 2 per GPU |
| Gradient accumulation | 8 steps |
| Effective batch | 32 sequences / ~65,536 tokens per step |
| Learning rate | 2e-5 to 2e-6 (cosine decay, 5x lower than pretrain peak) |
| Warmup | 200 steps |
| Optimizer | AdamW (β1=0.9, β2=0.95, wd=0.01) |
| Max steps | 15,955 (~1 epoch over 510,577 sequences) |
| Sequence length | 2048 |
| Loss | Cross-entropy on assistant tokens only (masked) |
| Script | sft_v1.py |
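The derived figures in the configuration follow directly from the base settings. The short calculation below reproduces them; the variable names are illustrative, and the values are taken from this post.

```python
per_gpu_batch = 2
num_gpus = 2
grad_accum_steps = 8
seq_len = 2048
num_sequences = 510_577
total_tokens = 365.6e6  # tokenized dataset size reported in the post

# Sequences consumed per optimizer step.
effective_batch = per_gpu_batch * num_gpus * grad_accum_steps
# Token positions per optimizer step at full sequence length.
token_positions_per_step = effective_batch * seq_len
# Full optimizer steps in one pass over the data.
steps_per_epoch = num_sequences // effective_batch
# Average real tokens per sequence.
avg_tokens_per_sequence = total_tokens / num_sequences

print(effective_batch, token_positions_per_step, steps_per_epoch)
print(round(avg_tokens_per_sequence))
```

The ~65,536 figure counts token positions per step. Since the average sequence is around 716 tokens, the number of real (non-padding) tokens per step would be lower if sequences are padded to 2048 rather than packed; the post does not say which.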

Gradient accumulation was reduced from 32 (pretrain) to 8 for SFT. With a diverse 510K-example dataset, a smaller effective batch means more optimizer updates per epoch, which is expected to help instruction following more than large batches would. The peak learning rate is 2e-5, a 5x reduction from the pretrain peak of 1e-4: conservative enough to preserve pretrained weights while teaching new behavior.
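The learning-rate schedule in the configuration (200 warmup steps, then cosine decay from 2e-5 to 2e-6 over 15,955 steps) can be sketched as a function of the step index. Linear warmup is an assumption here, since the post only gives the warmup length, and `lr_at` is a hypothetical helper rather than code from `sft_v1.py`.

```python
import math


def lr_at(step, peak=2e-5, floor=2e-6, warmup=200, max_steps=15955):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup:
        # Assumed linear warmup that reaches the peak on the last warmup step.
        return peak * (step + 1) / warmup
    progress = min((step - warmup) / max(1, max_steps - warmup), 1.0)
    # Cosine decay from peak down to floor.
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


for step in (0, 199, 200, 8000, 15955):
    print(step, f"{lr_at(step):.3e}")
```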

One epoch over the full mixture is the target. Overfitting on SFT data is a real risk; repeating the dataset multiple times tends to reduce output diversity and push the model toward memorizing training responses rather than generalizing. After 15,955 steps, the model will be exported to HuggingFace format and evaluated on the same benchmarks as the baseline above.