A from-scratch PyTorch implementation of GPT-2, designed for training on the TinyStories dataset and deployable to the cloud.
- Custom GPT-2 Architecture: built from scratch, with both standard PyTorch transformer components and hand-written equivalents
- Decoder-Only Transformer: Implements causal attention masking for autoregressive text generation
- BPE Tokenization: Custom Byte-Pair Encoding tokenizer trained on the dataset
- Cloud-Ready: Designed for deployment on Modal cloud platform with GPU support
- Complete Training Pipeline: From data preprocessing to model training and text generation
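A BPE tokenizer builds its vocabulary by repeatedly merging the most frequent adjacent symbol pair in the corpus. A minimal pure-Python sketch of one merge step (the toy corpus and helper names are illustrative, not the project's code):

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair in the corpus.

    `words` maps a tuple of symbols (a word split into sub-units)
    to its frequency in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> count, each word split into characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lot"): 3}
pair = most_frequent_pair(words)   # ('l', 'o') occurs 10 times
words = merge_pair(words, pair)    # "low" becomes ('lo', 'w'), etc.
```

Repeating this merge step until the vocabulary reaches the target size (10K here) yields the learned merge table the tokenizer applies at encoding time.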
- Model Configuration: 8-layer decoder with 16 attention heads, 512-dimensional embeddings, and 10K vocabulary
- Total Parameters: ~34M parameters (including 5.12M for token embeddings with weight tying)
- Sequence Length: 1024 tokens
- Optimizer: AdamW with learning rate scheduling and warmup
- Training: Up to 900M tokens with configurable logging intervals
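As a rough sanity check on the figures above, a back-of-the-envelope parameter count (this assumes the standard GPT-2 4x MLP expansion and learned positional embeddings; biases, LayerNorms, and any architectural differences account for the remainder of the quoted ~34M):

```python
# Approximate parameter count for the stated configuration.
vocab, d_model, n_layers, seq_len = 10_000, 512, 8, 1024

tok_emb = vocab * d_model              # 5,120,000 — shared with the output head via weight tying
pos_emb = seq_len * d_model            # 524,288
attn    = 4 * d_model * d_model        # Q, K, V, and output projections
mlp     = 2 * d_model * (4 * d_model)  # up- and down-projection (4x expansion assumed)
per_layer = attn + mlp                 # ~3.1M per layer, ignoring biases and LayerNorms
total = tok_emb + pos_emb + n_layers * per_layer  # ~31M under these assumptions
```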
- Python 3.8+
- PyTorch ≥ 2.0.0
- Transformers, Datasets, Tokenizers
- Additional dependencies listed in `requirements.txt`
Install dependencies:

`pip install -r requirements.txt`
Configure the model by modifying the JSON files in the `config/` directory:
- `model_config.json`: model architecture parameters
- `preprocess_config.json`: data preprocessing settings
- `train_config.json`: training hyperparameters
- `generator_config.json`: text generation settings
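For illustration, a `model_config.json` consistent with the architecture described above might look like the following (the key names here are assumptions — check the actual file in the repo):

```json
{
  "n_layers": 8,
  "n_heads": 16,
  "d_model": 512,
  "vocab_size": 10000,
  "max_seq_len": 1024,
  "dropout": 0.1,
  "weight_tying": true
}
```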
Run the training pipeline with `python main.py`.

The project is designed for cloud deployment on Modal with GPU support. Detailed instructions for cloud training are available in README-modal.md.
- `main.py`: main execution file orchestrating the training pipeline
- `config/`: JSON configuration files for model, training, preprocessing, and generation
- `model/`: GPT-2 architecture, optimizer, training loop, and text generation
- `preprocessing/`: data loading, tokenization, and dataset preparation
- `utils/`: helper functions for logging, configuration loading, and utilities
- `packages/`: centralized imports for all required libraries
The implementation includes both a custom decoder implementation and the option to use PyTorch's built-in transformer encoder with causal masking. The model supports:
- Gradient checkpointing for memory efficiency
- Weight tying between token embeddings and output projection
- Configurable dropout rates
- Gradient clipping for stable training
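The features above can be sketched in a few lines of PyTorch (illustrative only — class and variable names are not the project's):

```python
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 512, 10_000, 8

# Weight tying: the token embedding and the output projection share one
# (vocab_size, d_model) matrix, saving ~5.12M parameters here.
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = tok_emb.weight

# Causal mask for using PyTorch's built-in encoder autoregressively:
# zeros on and below the diagonal, -inf above it.
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

# Gradient clipping after backward(), before the optimizer step:
tokens = torch.randint(0, vocab_size, (2, seq_len))
loss = lm_head(tok_emb(tokens)).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(tok_emb.parameters(), max_norm=1.0)
```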
After training, the model can generate text with configurable parameters:
- Temperature for sampling diversity
- Maximum length for generated sequences
- Top-k sampling for controlled generation
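These knobs combine in a sampling step roughly like the following sketch (function name and defaults are illustrative, not the project's API):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample one token id from a (vocab,)-shaped logits tensor using
    temperature scaling and optional top-k filtering."""
    logits = logits / temperature                   # higher T -> flatter, more diverse
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]  # k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

token_id = sample_next_token(torch.tensor([0.0, 10.0, 1.0]), temperature=0.8, top_k=1)
```

With `top_k=1` this degenerates to greedy decoding (only the largest logit survives the filter), while raising the temperature flattens the distribution before the top-k cut.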
This project is licensed under the terms found in the LICENSE file.