Amir-Hofo/GPT2

GPT-2 Implementation from Scratch

A custom implementation of GPT-2 from scratch using PyTorch, designed for training on the TinyStories dataset with cloud deployment capabilities.

Features

  • Custom GPT-2 Architecture: Built from scratch, with the option to use either standard PyTorch transformer components or fully custom modules
  • Decoder-Only Transformer: Implements causal attention masking for autoregressive text generation
  • BPE Tokenization: Custom Byte-Pair Encoding tokenizer trained on the dataset
  • Cloud-Ready: Designed for deployment on Modal cloud platform with GPU support
  • Complete Training Pipeline: From data preprocessing to model training and text generation
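
The causal attention masking mentioned above can be sketched in a few lines of PyTorch (a minimal illustration of the technique, not the repository's exact code):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where True marks future positions that must be hidden."""
    # Upper triangle above the diagonal: position i may not attend to j > i.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Applied to raw attention scores before the softmax:
scores = torch.randn(4, 4)
masked = scores.masked_fill(causal_mask(4), float("-inf"))
```

After the softmax, the `-inf` entries become zero probability, so each token only attends to itself and earlier tokens — the property that makes autoregressive generation work.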

Architecture

  • Model Configuration: 8-layer decoder with 16 attention heads, 512-dimensional embeddings, and 10K vocabulary
  • Total Parameters: ~34M parameters (including 5.12M for token embeddings with weight tying)
  • Sequence Length: 1024 tokens
  • Optimizer: AdamW with learning rate scheduling and warmup
  • Training: Up to 900M tokens with configurable logging intervals
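
As a rough sanity check on the numbers above, the parameter count can be estimated under standard GPT-2 assumptions (4× feed-forward width, learned positional embeddings, biases omitted); the exact total depends on implementation details:

```python
d_model, n_layers, vocab, seq_len = 512, 8, 10_000, 1024

tok_emb = vocab * d_model        # 5,120,000 — the 5.12M embedding figure above
pos_emb = seq_len * d_model      # learned positional embeddings
per_layer = 12 * d_model ** 2    # 4*d^2 for attention + 8*d^2 for the feed-forward
total = tok_emb + pos_emb + n_layers * per_layer
print(f"~{total / 1e6:.1f}M parameters")  # lands near the stated ~34M
```

With weight tying, the output projection reuses the token-embedding matrix, so it adds no parameters of its own.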

Requirements

  • Python 3.8+
  • PyTorch ≥ 2.0.0
  • Transformers, Datasets, Tokenizers
  • Additional dependencies listed in requirements.txt

Setup

  1. Install dependencies:

    pip install -r requirements.txt
  2. Configure the model by modifying JSON files in the config/ directory:

    • model_config.json: Model architecture parameters
    • preprocess_config.json: Data preprocessing settings
    • train_config.json: Training hyperparameters
    • generator_config.json: Text generation settings
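
As an illustration, a hypothetical model_config.json matching the architecture described in this README might look like the following — the field names and the dropout value are assumptions, so check the actual file in config/ for the real schema:

```json
{
  "n_layers": 8,
  "n_heads": 16,
  "d_model": 512,
  "vocab_size": 10000,
  "seq_len": 1024,
  "dropout": 0.1,
  "weight_tying": true
}
```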

Usage

Local Training

python main.py

Cloud Training with Modal

The project is designed for cloud deployment on Modal with GPU support. Detailed instructions for cloud training are available in README-modal.md.

Project Structure

  • main.py: Main execution file orchestrating the training pipeline
  • config/: JSON configuration files for model, training, preprocessing, and generation
  • model/: GPT-2 architecture, optimizer, training loop, and text generation
  • preprocessing/: Data loading, tokenization, and dataset preparation
  • utils/: Helper functions for logging, configuration loading, and utilities
  • packages/: Centralized imports for all required libraries

Model Details

The implementation includes both a custom decoder implementation and the option to use PyTorch's built-in transformer encoder with causal masking. The model supports:

  • Checkpointing for memory efficiency
  • Weight tying between token embeddings and output projection
  • Configurable dropout rates
  • Gradient clipping for stable training
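
Of the features above, weight tying is worth a quick sketch: in PyTorch it amounts to sharing one matrix between the input embedding and the output projection (module names here are illustrative, not the repository's):

```python
import torch.nn as nn

vocab_size, d_model = 10_000, 512
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

# Tie the output projection to the input embedding: both layers now share
# a single (vocab_size x d_model) matrix, saving ~5.12M parameters here.
lm_head.weight = tok_emb.weight
```

Because the two layers share storage, gradients from both the embedding lookup and the final projection accumulate into the same tensor during training.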

Text Generation

After training, the model can generate text with configurable parameters:

  • Temperature for sampling diversity
  • Maximum length for generated sequences
  • Top-k sampling for controlled generation
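
The temperature and top-k parameters above combine as in the following sketch (a standard decoding routine, assumed rather than taken from the repository's generator code):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 50) -> int:
    """Sample one token id from a 1-D logit vector with temperature and top-k."""
    logits = logits / temperature                     # <1 sharpens, >1 flattens
    k = min(top_k, logits.size(-1))
    topk_vals, topk_idx = torch.topk(logits, k=k)     # keep only the k best logits
    probs = torch.softmax(topk_vals, dim=-1)          # renormalise over those k
    choice = torch.multinomial(probs, num_samples=1)  # draw from the truncated dist.
    return topk_idx[choice].item()

logits = torch.randn(10_000)  # one vocabulary-sized logit vector
token_id = sample_next_token(logits, temperature=0.8, top_k=40)
```

Maximum length is then enforced by the surrounding generation loop, which repeatedly appends the sampled token and stops after the configured number of steps.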

License

This project is licensed under the terms found in the LICENSE file.

About

Implementation of the GPT-2 architecture using PyTorch, trained on the TinyStories dataset. Features custom training pipelines on Modal (cloud computing) and integration with the Hugging Face ecosystem.
