A from-scratch PyTorch implementation of GPT-2, designed for training on the TinyStories dataset and deployable to the cloud.
- Custom GPT-2 Architecture: built from scratch, with both standard PyTorch transformer components and hand-written equivalents
- Decoder-Only Transformer: Implements causal attention masking for autoregressive text generation
- BPE Tokenization: Custom Byte-Pair Encoding tokenizer trained on the dataset
- Cloud-Ready: Designed for deployment on Modal cloud platform with GPU support
- Complete Training Pipeline: From data preprocessing to model training and text generation
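A BPE tokenizer builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus. A minimal pure-Python sketch of one merge step (the toy corpus and helper names are illustrative, not the project's code):

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair in the corpus.

    `words` maps a tuple of symbols (a word split into sub-units)
    to its frequency in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> count, each word split into characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lot"): 3}
pair = most_frequent_pair(words)   # ('l', 'o') occurs 10 times
words = merge_pair(words, pair)    # "low" becomes ('lo', 'w'), etc.
```

Repeating this merge step until the vocabulary reaches the target size (10K here) yields the learned merge table the tokenizer applies at encoding time.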
- Model Configuration: 8-layer decoder with 16 attention heads, 512-dimensional embeddings, and 10K vocabulary
- Total Parameters: ~34M parameters (including 5.12M for token embeddings with weight tying)
- Sequence Length: 1024 tokens
- Optimizer: AdamW with learning rate scheduling and warmup
- Training: Up to 900M tokens with configurable logging intervals
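As a rough sanity check on the figures above, a back-of-the-envelope parameter count (this assumes the standard GPT-2 4x MLP expansion and learned positional embeddings; biases, LayerNorms, and any architectural differences account for the remainder of the quoted ~34M):

```python
# Approximate parameter count for the stated configuration.
vocab, d_model, n_layers, seq_len = 10_000, 512, 8, 1024

tok_emb = vocab * d_model              # 5,120,000 — shared with the output head via weight tying
pos_emb = seq_len * d_model            # 524,288
attn    = 4 * d_model * d_model        # Q, K, V, and output projections
mlp     = 2 * d_model * (4 * d_model)  # up- and down-projection (4x expansion assumed)
per_layer = attn + mlp                 # ~3.1M per layer, ignoring biases and LayerNorms
total = tok_emb + pos_emb + n_layers * per_layer  # ~31M under these assumptions
```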
- Python 3.8+
- PyTorch ≥ 2.0.0
- Transformers, Datasets, Tokenizers
- Additional dependencies listed in `requirements.txt`
Install dependencies:

`pip install -r requirements.txt`
Configure the model by modifying the JSON files in the `config/` directory:
- `model_config.json`: model architecture parameters
- `preprocess_config.json`: data preprocessing settings
- `train_config.json`: training hyperparameters
- `generator_config.json`: text generation settings
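For illustration, a `model_config.json` consistent with the architecture described above might look like the following (the key names here are assumptions — check the actual file in the repo):

```json
{
  "n_layers": 8,
  "n_heads": 16,
  "d_model": 512,
  "vocab_size": 10000,
  "max_seq_len": 1024,
  "dropout": 0.1,
  "weight_tying": true
}
```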
Run the training pipeline with `python main.py`.

The project is designed for cloud deployment on Modal with GPU support. Detailed instructions for cloud training are available in README-modal.md.
- `main.py`: main execution file orchestrating the training pipeline
- `config/`: JSON configuration files for model, training, preprocessing, and generation
- `model/`: GPT-2 architecture, optimizer, training loop, and text generation
- `preprocessing/`: data loading, tokenization, and dataset preparation
- `utils/`: helper functions for logging, configuration loading, and utilities
- `packages/`: centralized imports for all required libraries
The implementation includes both a custom decoder implementation and the option to use PyTorch's built-in transformer encoder with causal masking. The model supports:
- Gradient checkpointing for memory efficiency
- Weight tying between token embeddings and output projection
- Configurable dropout rates
- Gradient clipping for stable training
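The features above can be sketched in a few lines of PyTorch (illustrative only — class and variable names are not the project's):

```python
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 512, 10_000, 8

# Weight tying: the token embedding and the output projection share one
# (vocab_size, d_model) matrix, saving ~5.12M parameters here.
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = tok_emb.weight

# Causal mask for using PyTorch's built-in encoder autoregressively:
# zeros on and below the diagonal, -inf above it.
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

# Gradient clipping after backward(), before the optimizer step:
tokens = torch.randint(0, vocab_size, (2, seq_len))
loss = lm_head(tok_emb(tokens)).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(tok_emb.parameters(), max_norm=1.0)
```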
After training, the model can generate text with configurable parameters:
- Temperature for sampling diversity
- Maximum length for generated sequences
- Top-k sampling for controlled generation
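These knobs combine in a sampling step roughly like the following sketch (function name and defaults are illustrative, not the project's API):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample one token id from a (vocab,)-shaped logits tensor using
    temperature scaling and optional top-k filtering."""
    logits = logits / temperature                   # higher T -> flatter, more diverse
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]  # k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

token_id = sample_next_token(torch.tensor([0.0, 10.0, 1.0]), temperature=0.8, top_k=1)
```

With `top_k=1` this degenerates to greedy decoding (only the largest logit survives the filter), while raising the temperature flattens the distribution before the top-k cut.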
This project is licensed under the terms found in the LICENSE file.