# pretrain-experiments

Take a language model checkpoint, continue training with targeted data interventions, and evaluate the result — all from a single YAML config. Built to support the experiments in Train Once, Answer All (ICLR 2026).
- 🧬 Inject texts or tokens at precise positions in the training data
- 🔌 Supports OLMo and OLMo-Core, extensible to other frameworks
- 📊 Run benchmarks and custom evaluation scripts on every checkpoint
- 📋 Automatic Weights & Biases logging
- ⚙️ Everything is configurable in YAML
## Installation

```shell
git clone https://github.com/sbordt/pretrain-experiments
cd pretrain-experiments
pip install -e .
```

You need at least one training backend. Each requires a modified fork with data insertion support.
**OLMo (for OLMo-2)**

```shell
git clone https://github.com/sbordt/OLMo
cd OLMo
git checkout pretrain-experiments
pip install -e .[all]
pip install h5py
```

**OLMo-Core (for newer OLMo models)**
```shell
git clone https://github.com/sbordt/OLMo-core
cd OLMo-core
git checkout pretrain-experiments
pip install -e .[all]
pip install h5py
```

## Example

The following example inserts ARC-Challenge benchmark questions into OLMo-3 7B midtraining data and evaluates how much the model overfits on them. The full config is at `config/OLMo-3-1025-7B-midtrain.yaml`.
```yaml
experiment: example-experiments

wandb:
  name: olmo-3-midtrain
  entity: your-entity

framework: olmo_core

model:
  config: ${OLMO_CORE_REPO}/src/scripts/official/OLMo3/OLMo-3-1025-7B-midtrain.py
  checkpoint_url: "https://olmo-checkpoints.org/ai2-llm/Olmo-3-1025-7B/stage2/"
  checkpoint_step: 10000

training:
  num_steps: 100

experiments:
  experiments:
    - type: add-texts-from-file
      file: ${PRETRAIN_EXPERIMENTS}/resources/.../olmes_arc_challenge_test.jsonl
      repetitions: 4  # each text is inserted 4 times

evaluation:
  eval_on_load: true  # evaluate before and after training
  evaluations:
    - script: olmes.py
      args:
        task: arc_challenge::olmes
        split: test
```

The config specifies a model checkpoint to continue training from, data interventions to apply, and evaluations to run. Environment variables (`${...}`) are substituted at runtime.
Texts to insert are stored as JSONL, one JSON object per line with a `"text"` field:

```jsonl
{"text": "Question: An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?\nAnswer: Planetary days will become shorter."}
{"text": "Question: The end result in the process of photosynthesis is the production of sugar and oxygen. Which step signals the beginning of photosynthesis?\nAnswer: Chlorophyll in the leaf captures light energy."}
...
```

## Running

```shell
pretrain-experiments config/OLMo-3-1025-7B-midtrain.yaml
```

This will download the checkpoint, insert the texts into the training data, train for 100 steps, and evaluate the result. Any config parameter can be overridden from the command line:
```shell
pretrain-experiments config/OLMo-3-1025-7B-midtrain.yaml --training.num_steps 50
```

Example W&B logs: pretrain-experiments log · OLMo-Core training log
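One plausible way to think about such dotted overrides is splitting the key path and walking the nested config dict (a hypothetical sketch, not the actual CLI implementation):

```python
def apply_override(config: dict, dotted_key: str, value) -> dict:
    """Set a nested config value addressed by a dotted path,
    e.g. 'training.num_steps' -> config['training']['num_steps']."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config

cfg = {"training": {"num_steps": 100}}
apply_override(cfg, "training.num_steps", 50)
# cfg is now {"training": {"num_steps": 50}}
```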
See the `config/` directory for more examples. For a full reference of all configuration options, see `docs/user-guide/configuration.md`.
## Insertions

Insertions modify the training data that the model sees during continued pretraining. Each insertion is a sequence of tokens, either raw text (automatically tokenized) or pre-tokenized token IDs, that gets spliced into the training stream.
**Placement.** By default, insertions are placed at random positions across the training steps. You can also restrict placement to a specific range of steps, or specify exact token positions for full control.
**Multiple sources.** A single experiment can combine insertions from multiple JSONL files. Each source is configured independently with its own repetition count and placement mode.
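For instance, two independently configured sources might look like this (the file paths are placeholders; the keys mirror the example config above):

```yaml
experiments:
  experiments:
    - type: add-texts-from-file
      file: ${PRETRAIN_EXPERIMENTS}/resources/source_a.jsonl  # placeholder path
      repetitions: 4
    - type: add-texts-from-file
      file: ${PRETRAIN_EXPERIMENTS}/resources/source_b.jsonl  # placeholder path
      repetitions: 1
```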
**Repetitions.** Each text can be repeated multiple times (e.g., `repetitions: 4`) to increase exposure during training. Fractional values like `0.5` randomly sample a subset.
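One way to interpret fractional repetitions is that the integer part sets a fixed number of copies and the fractional remainder acts as an inclusion probability. A sketch under that assumption (the tool's actual sampling may differ):

```python
import random

def expand_repetitions(texts, repetitions, seed=0):
    """Repeat each text floor(repetitions) times, then add one extra copy
    with probability equal to the fractional remainder."""
    rng = random.Random(seed)
    whole = int(repetitions)
    frac = repetitions - whole
    expanded = []
    for text in texts:
        expanded.extend([text] * whole)
        if frac > 0 and rng.random() < frac:
            expanded.append(text)
    return expanded

# repetitions=2 duplicates every text; repetitions=0.5 keeps roughly half
```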
For details on all insertion types and modes, see `docs/user-guide/insertions.md`.
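Since insertion files are plain JSONL with one `"text"` object per line, a custom one can be generated in a few lines (illustrative sketch; the texts and output filename are arbitrary):

```python
import json

texts = [
    "Question: What is 2 + 2?\nAnswer: 4",
    "Question: What color is the sky?\nAnswer: Blue",
]

# Write one JSON object per line, as expected by add-texts-from-file.
with open("my_insertions.jsonl", "w", encoding="utf-8") as f:
    for text in texts:
        f.write(json.dumps({"text": text}) + "\n")
```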
## Contributing

Contributions are welcome. Please open an issue for questions or submit a pull request.
## License

This project is licensed under the MIT License.
## Citation

If you use this software in your research, please cite:
```bibtex
@inproceedings{bordt2026train,
  title={Train Once, Answer All: Many Pretraining Experiments for the Cost of One},
  author={Bordt, Sebastian and Pawelczyk, Martin},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```