A curated collection of ready-to-use training recipes for machine learning on Baseten. Whether you’re starting from scratch or fine-tuning an existing model, these recipes provide practical, copy-paste solutions for every stage of your ML pipeline.
- Training recipes - End-to-end examples for training models from scratch
- Fine-tuning workflows - Adapt pre-trained models to your specific use case
- Best practices - Optimized configurations and common patterns
From data preprocessing to checkpointed and trained models, these recipes cover the complete ML lifecycle on Baseten's platform.
Before getting started, ensure you have the following:
- A Baseten account. Sign up here if you don't have one.
- Python 3.8 to 3.11 installed. A Conda environment is recommended.
Use the appropriate command for your package manager:

```bash
# pip
pip install -U truss

# uv
uv add truss && uv sync --upgrade-package truss
```

```bash
# for any example (replace with the specific example name)
truss train init --examples <example-name> && cd <example-name>
```

Make sure you've plugged in the proper secrets (e.g. a Hugging Face token) via Baseten Secrets and Environment Variables, then kick off your job:

```bash
truss train push config.py
```

For more details, take a look at the docs.
If you'd like to fire off jobs from within this repository directly, you can clone the repository and navigate to the appropriate workspaces:

```bash
git clone https://github.com/basetenlabs/ml-cookbook.git
```

- examples/ are runnable, model/framework-specific projects you can launch directly with `truss train push config.py`.
- recipes/ are reusable implementation guides and patterns that help you choose an approach and adapt it to your own project.
The Programmatic Training API lets you launch and manage machine learning training jobs directly from your Python code, rather than relying solely on CLI commands or configuration files.
recipes/programmatic-training-api/README.md
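As a purely illustrative sketch: the definition classes below follow the truss_train conventions used by the examples' config.py files, but the submission helper at the bottom is a hypothetical stand-in, so treat the linked README as the source of truth.

```python
# Hypothetical sketch -- see recipes/programmatic-training-api/README.md for
# the actual entry points. The submission call at the bottom is an assumption.
from truss_train import definitions

job = definitions.TrainingJob(
    image=definitions.Image(base_image="pytorch/pytorch:2.3.1-cuda12.1-cudnn8-runtime"),
    compute=definitions.Compute(node_count=1),
    runtime=definitions.Runtime(start_commands=["/bin/sh -c 'chmod +x run.sh && ./run.sh'"]),
)
project = definitions.TrainingProject(name="programmatic-demo", job=job)

# Instead of `truss train push config.py`, submit and monitor from Python:
# handle = submit_job(project)    # hypothetical helper from the recipe
# print(handle.id, handle.status) # poll status or stream logs from code
```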
"Long-context supervised fine-tuning (SFT)" refers to adapting large language models to handle and learn from sequences with a much greater length than standard context windows. This enables models to process, reason about, and generate long-form documents, conversations, or codebases in a single pass.
This example demonstrates how to set up a supervised fine-tuning project targeting long-context models.
For detailed instructions and code, see recipes/sft/long_context/README.md.
recipes/sft/long_context/README.md
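As a rough sketch of the pattern (not the recipe's exact code), long-context SFT with trl mostly comes down to raising the sequence length and making memory fit. The model and dataset IDs and the 32k length below are placeholder assumptions, and parameter names vary slightly across trl versions:

```python
# Minimal sketch, not the recipe's exact code -- see the README above.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

config = SFTConfig(
    output_dir="long-context-sft",
    max_seq_length=32768,          # the "long context" part: far beyond typical 2k-8k windows
    packing=True,                  # pack short samples into full-length sequences
    gradient_checkpointing=True,   # trade compute for memory at long sequence lengths
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model id
    train_dataset=dataset,
    args=config,
)
trainer.train()
```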
Fine-tune GPT OSS 20B with LoRA and trl
If using a model with gated access, make sure you have access to the model on Hugging Face and have uploaded your API token to your secrets. This example requires an HF access token.
examples/oss-gpt-20b-lora/training/train.py contains all training code.
examples/oss-gpt-20b-lora/training/config.py is the entry point for training, where you define your training configuration, including the start commands that launch your training job. Make sure these commands include any file permission changes needed to make shell scripts executable (e.g. `chmod +x`); we do not change any file system permissions for you.
Make sure to update hf_access_token in config.py to match the name under which this access token is saved in your secrets. In this example, trained checkpoints are written directly to Hugging Face; the Hub IDs for models and datasets are configured in examples/oss-gpt-20b-lora/training/run.sh. Update run.sh with a repo you have write access to.
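For orientation, the secret and start-command wiring in config.py looks roughly like this hedged sketch (field names follow truss_train conventions; verify against the actual file):

```python
# Sketch only -- see examples/oss-gpt-20b-lora/training/config.py for the real file.
from truss_train import definitions

runtime = definitions.Runtime(
    # The start command marks run.sh executable itself, since the platform
    # does not change file system permissions for you.
    start_commands=["/bin/sh -c 'chmod +x run.sh && ./run.sh'"],
    environment_variables={
        # "hf_access_token" must match the secret name in your Baseten workspace.
        "HF_TOKEN": definitions.SecretReference(name="hf_access_token"),
    },
)
```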
```bash
cd examples/oss-gpt-20b-lora/training
truss train push config.py
```

Upon successful submission, the CLI will output helpful information about your job:

```
✨ Training job successfully created!
🪵 View logs for your job via `truss train logs --job-id e3m512w [--tail]`
🔍 View metrics for your job via `truss train metrics --job-id e3m512w`
```
Keep the Job ID handy, as you’ll use it for managing and monitoring your job.
Alternatively, you can view all your training jobs at [https://app.baseten.co/training/](https://app.baseten.co/training/).
- As checkpoints are generated, you can access them on Hugging Face at the same location defined in run.sh.
Fine-tune Qwen3 8B with LoRA and trl
If using a model with gated access, make sure you have access to the model on Hugging Face and have uploaded your API token to your secrets.
examples/qwen3-8b-lora-dpo-trl/training/train.py contains the training code.
examples/qwen3-8b-lora-dpo-trl/training/config.py is the entry point for training, where you define your training configuration, including the start commands that launch your training job. Make sure these commands include any file permission changes needed to make shell scripts executable (e.g. `chmod +x`); we do not change any file system permissions for you.
```bash
cd examples/qwen3-8b-lora-dpo-trl/training
truss train push config.py
```

Upon successful submission, the CLI will output helpful information about your job:

```
✨ Training job successfully created!
🪵 View logs for your job via `truss train logs --job-id e3m512w [--tail]`
🔍 View metrics for your job via `truss train metrics --job-id e3m512w`
```
Alternatively, you can view all your training jobs at [https://app.baseten.co/training/](https://app.baseten.co/training/).
In this example, since checkpointing is enabled in config.py, checkpoints are stored in cloud storage and can be accessed with:

```bash
truss train get_checkpoint_urls --job-id $JOB_ID
```
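For reference, enabling checkpointing in config.py looks roughly like the sketch below; the field names follow truss_train conventions but should be checked against the example's actual config.py:

```python
# Sketch only -- check examples/qwen3-8b-lora-dpo-trl/training/config.py for
# the real configuration. CheckpointingConfig field names are assumptions.
from truss_train import definitions

runtime = definitions.Runtime(
    start_commands=["/bin/sh -c 'chmod +x run.sh && ./run.sh'"],
    checkpointing_config=definitions.CheckpointingConfig(
        enabled=True,  # tells the platform to sync checkpoints to cloud storage
    ),
)
```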
examples/mnist-single-gpu/training/train_mnist.py contains a PyTorch example of an MNIST classifier using CNNs.
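For context, the kind of model that file trains follows the classic PyTorch MNIST pattern; here is a minimal sketch (not train_mnist.py's exact code):

```python
# Minimal sketch of a CNN MNIST classifier, not the exact contents of train_mnist.py.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)   # 1x28x28 grayscale in
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.fc1 = nn.Linear(64 * 12 * 12, 128)
        self.fc2 = nn.Linear(128, 10)                  # 10 digit classes

    def forward(self, x):
        x = F.relu(self.conv1(x))   # -> 32 x 26 x 26
        x = F.relu(self.conv2(x))   # -> 64 x 24 x 24
        x = F.max_pool2d(x, 2)      # -> 64 x 12 x 12
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)
```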
examples/mnist-single-gpu/training/config.py is the entry point for training, where you define your training configuration, including the start commands that launch your training job. Make sure these commands include any file permission changes needed to make shell scripts executable (e.g. `chmod +x`); we do not change any file system permissions for you.
```bash
cd examples/mnist-single-gpu/training
truss train push config.py
```

Upon successful submission, the CLI will output helpful information about your job:

```
✨ Training job successfully created!
🪵 View logs for your job via `truss train logs --job-id e3m512w [--tail]`
🔍 View metrics for your job via `truss train metrics --job-id e3m512w`
```
Keep the Job ID handy, as you’ll use it for managing and monitoring your job.
In this example, since checkpointing is enabled in config.py, checkpoints are stored in cloud storage and can be accessed with:

```bash
truss train get_checkpoint_urls --job-id $JOB_ID
```
Contributions are welcome! Please open issues or submit pull requests.