Transformer model research codebase
TMRC (Transformer Model Research Codebase) is a simple, explainable codebase to train transformer-based models. It was developed with simplicity and ease of modification in mind, particularly for researchers. The codebase will eventually be used to train foundation models and experiment with architectural and training modifications.
-
Step 1: Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh -
Step 2: Clone the repository
git clone git@github.com:KempnerInstitute/tmrc.git
-
Step 3: Create the environment and install the package
From the repo root,
uv sync will create a .venv/ using the Python version pinned in .python-version (3.12) and install all dependencies from uv.lock:
cd tmrc
uv sync
Activate the environment with:
source .venv/bin/activate
Alternatively, prefix any command with uv run to run it inside the environment without activating it (e.g., uv run python src/tmrc/core/training/train.py).
Note
uv brings its own Python, and PyTorch wheels bundle the CUDA runtime and cuDNN, so no module load is required on the Kempner AI cluster for the standard training path. Only load cuda/12.4.1-fasrc01 if you build custom CUDA extensions (nvcc) or compile something like flash-attention from source. TMRC has been tested with torch 2.6.0 and Python 3.12.
-
Step 1: Login to Weights & Biases to enable experiment tracking
wandb login
-
Step 2: Request compute resources. For example, on the Kempner AI cluster, to request an H100 80GB GPU run
salloc --partition=kempner_h100 --account=<fairshare account> --nodes=1 --ntasks=1 --cpus-per-task=24 --mem=375G --gres=gpu:1 --time=00-07:00:00
If you are not using the Kempner AI cluster, you can run experiments on your local machine (if you have a GPU) or on cloud services like AWS, GCP, or Azure. TMRC should automatically find the available GPU. If there are no GPUs available, it will run on CPU (though this is not recommended, since training will be prohibitively slow for any reasonable model size).
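The device auto-detection described above follows the standard PyTorch pattern. The sketch below is an illustration of that general pattern, not TMRC's actual implementation, and it falls back gracefully when PyTorch is not installed:

```python
# Sketch of automatic device selection (assumed pattern, not TMRC's exact code).
try:
    import torch
    # Prefer a CUDA GPU when one is visible; otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    # Without PyTorch installed, only CPU execution is possible.
    device = "cpu"

print(f"Training will run on: {device}")
```

As noted above, CPU-only runs work but are impractically slow for anything beyond toy model sizes.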
-
Step 3: Activate the environment
source .venv/bin/activate
-
Step 4: Launch training
python src/tmrc/core/training/train.py
-
Step 2: Request compute resources. For example, on the Kempner AI cluster, to request eight H100 80GB GPUs on two nodes run
salloc --partition=kempner_h100 --account=<fairshare account> --nodes=2 --ntasks-per-node=4 --ntasks=8 --cpus-per-task=24 --mem=375G --gres=gpu:4 --time=00-07:00:00
-
Step 3: Activate the environment
source .venv/bin/activate
-
Step 4: Launch training
srun python src/tmrc/core/training/train.py
Note
For distributed training, TMRC uses Distributed Data Parallel (DDP) by default. For larger models, to use Fully Sharded Data Parallel (FSDP) instead, set distributed_strategy to fsdp in the training section of the config file, or see the next section on using a custom config file.
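Under the srun launch above, each of the eight tasks receives its own rank through standard SLURM environment variables. The sketch below shows how that rank information is typically read; the variable names are standard SLURM, but how TMRC's training script actually derives them is an assumption here, and the defaults make the sketch runnable outside SLURM as a single process:

```python
import os

# Standard SLURM environment variables set by srun for each task.
rank = int(os.environ.get("SLURM_PROCID", "0"))         # global rank across all nodes
world_size = int(os.environ.get("SLURM_NTASKS", "1"))   # total task count (8 in the salloc above)
local_rank = int(os.environ.get("SLURM_LOCALID", "0"))  # rank within this node (0-3 for gpu:4)

print(f"rank {rank} of {world_size}, local rank {local_rank}")
```

With --ntasks-per-node=4 and --gres=gpu:4, each task's local rank maps onto one GPU of its node.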
By default, the training script uses the configuration defined in configs/training/default_train_config.yaml.
To use a custom configuration file
python src/tmrc/core/training/train.py --config-name YOUR_CONFIG
Note
The --config-name parameter should be specified without the .yaml extension.
Tip
Configuration files should be placed in the configs/training/ directory. For example, if your config is named my_experiment.yaml, use --config-name my_experiment.
Make sure to update the path under the datasets block in your config file.
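As an illustration, a custom config such as configs/training/my_experiment.yaml might contain entries like the following. The field layout here is an assumption based on the options mentioned above (distributed_strategy under training, a path under datasets); check default_train_config.yaml for the actual schema:

```yaml
# configs/training/my_experiment.yaml -- hypothetical example; field names
# are illustrative and should be checked against default_train_config.yaml.
training:
  distributed_strategy: fsdp   # or ddp (the default)
datasets:
  path: /path/to/your/dataset  # update this to point at your data
```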
-
Step 1: Install the required packages
uv sync --group docs
-
Step 2: Build the documentation
cd docs
make html
-
Step 3: Open the documentation in your browser
open _build/html/index.html