THL is a hierarchical recurrent architecture that enables large language model inference on consumer hardware with as little as 4 GB of VRAM. Unlike traditional Transformers, whose KV cache memory grows with sequence length, THL achieves O(1) memory per layer through a sequence-length-independent memory design.
Traditional Transformer models face a critical bottleneck: their KV cache grows linearly with sequence length, O(T), making long-context generation impractical on consumer hardware. A 7B parameter model processing 8K tokens can easily exceed 24 GB of VRAM once weights, activations, and the cache are accounted for.
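The linear growth is easy to see with a back-of-the-envelope calculation. The dimensions below (32 layers, hidden size 4096, fp16) are typical assumptions for a 7B-class model, not values taken from THL itself:

```python
# KV cache footprint for a Transformer: one K and one V tensor per layer,
# each of shape (batch, seq_len, hidden). Assumed dims: 32 layers,
# hidden 4096, fp16 (2 bytes), batch 1 -- typical for a 7B model.
def kv_cache_bytes(seq_len, num_layers=32, hidden=4096, dtype_bytes=2, batch=1):
    return 2 * num_layers * batch * seq_len * hidden * dtype_bytes  # 2x for K and V

for t in (1024, 8192, 32768):
    print(f"{t:>6} tokens -> {kv_cache_bytes(t) / 2**30:.1f} GiB")
# 1024 -> 0.5 GiB, 8192 -> 4.0 GiB, 32768 -> 16.0 GiB
```

At 8K tokens the cache alone adds ~4 GiB on top of ~14 GiB of fp16 weights, and it doubles every time the context doubles.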
THL replaces the unbounded KV cache with a fixed-slot memory bank (default: 1024 slots), enabling:
- ✅ Infinite context length without memory overflow
- ✅ Inference on 4GB VRAM devices
- ✅ Competitive performance with Transformer architectures
- ✅ Mobile and edge device deployment
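By contrast, a fixed-slot bank's footprint depends only on (slots × dim), never on how many tokens have been processed. A quick sketch using the README's default of 1024 slots, with an assumed slot dimension of 768 and fp16 storage:

```python
# Memory bank footprint: slots x dim x bytes-per-element, constant in T.
# 1024 slots is the documented default; dim=768 and fp16 are assumptions.
def memory_bank_bytes(slots=1024, dim=768, dtype_bytes=2):
    return slots * dim * dtype_bytes

print(f"{memory_bank_bytes() / 2**20:.1f} MiB per layer")  # 1.5 MiB, for any context length
```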
- Bounded Memory (O(1)): Fixed memory slots eliminate KV cache explosion
- Hierarchical Recurrence: Multi-timescale GRU tiers process information at exponential intervals (τ = 2^k)
- Sparse Routing: Multi-head Top-K routing accesses relevant memories efficiently
- Low VRAM Inference: Layered inference engine enables 7B+ models on <4GB VRAM
- Production Ready: Comprehensive test suite and documented APIs
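The multi-timescale schedule behind the hierarchical recurrence can be sketched in a few lines. Here we assume tier k updates whenever the step index is a multiple of τ = 2^k; the exact trigger inside THL may differ:

```python
# Which tiers fire at step t, assuming tier k updates every tau = 2**k steps.
def active_tiers(t, num_tiers=3):
    return [k for k in range(num_tiers) if t % (2 ** k) == 0]

for t in range(1, 9):
    print(t, active_tiers(t))
# tier 0 fires every step, tier 1 every 2 steps, tier 2 every 4 steps
```

This gives fast tiers a fine-grained view of the sequence while slow tiers integrate information over exponentially longer horizons.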
- Python 3.8+
- PyTorch 1.12+
- CUDA 11.0+ (for GPU acceleration)
# Clone the repository
git clone https://github.com/EGen-V/Transformer-Hierarchical-Layers.git
cd Transformer-Hierarchical-Layers/Core
# Install dependencies
pip install -r requirements.txt
# Install THL
pip install -e .

# Or install from PyPI
pip install Transformer-Hierarchical-Layers

import torch
from thl.config import THLConfig
from thl.model import THLModel
# Configure model for 4GB VRAM
config = THLConfig(
    num_tiers=3,        # Hierarchical depth
    memory_slots=1024,  # Fixed memory size
    dim=768,            # Model dimension
    vocab_size=50257,   # Vocabulary size
)
# Initialize model
model = THLModel(config)
# Run inference
input_ids = torch.randint(0, 50257, (1, 32))
logits, state = model(input_ids)
print(f"Output shape: {logits.shape}")  # [1, 32, 50257]

For larger models, use the layered inference engine to stream layers through the GPU:
from thl.inference.layered import LayeredInferenceEngine
from thl.inference.state import InferenceState
# Initialize streaming engine
engine = LayeredInferenceEngine(model, device="cuda")
# Create inference state
state = InferenceState.init(
    batch_size=1,
    config=config,
    tiers=model.tiers,
    memory_bank=model.memory_bank,
)
# Generate tokens one at a time
generated_tokens = []
for _ in range(100):
    token = torch.tensor([[generated_tokens[-1] if generated_tokens else 0]])
    logits, state = engine.step(token, state)
    next_token = logits.argmax(dim=-1)
    generated_tokens.append(next_token.item())

from thl.generation import generate_text
prompt = "The future of AI is"
output = generate_text(
    model=model,
    tokenizer=tokenizer,  # assumes a compatible tokenizer has been loaded
    prompt=prompt,
    max_length=200,
    temperature=0.8,
    top_k=50,
)
print(output)

THL employs a hierarchical recurrent architecture with four key components:
| Component | Symbol | Description |
|---|---|---|
| Memory Bank | M_t | Fixed-size matrix (J × d) storing long-term context |
| Sparse Router | r_t | Top-K attention mechanism for efficient memory access |
| Hierarchical Tiers | s_t^(k) | Stack of GRU cells updating at exponential intervals τ = 2^k |
| Novelty Writer | w_t | Gated mechanism writing only novel information to memory |
- Read: Sparse router retrieves Top-K relevant memory slots
- Process: Hierarchical tiers update at different timescales
- Write: Novelty gate determines what new information to store
- Predict: Output layer generates next-token logits
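The four steps above can be sketched as a single-step toy loop. All shapes and module choices here are assumptions for illustration (J=1024 slots, d=768, 3 tiers, Top-K of 8, a GRUCell per tier, a linear novelty gate); THL's real modules live in the `thl` package and are more involved:

```python
import torch
import torch.nn.functional as F

# Single-step sketch of Read -> Process -> Write -> Predict.
torch.manual_seed(0)
J, d, num_tiers, top_k, vocab = 1024, 768, 3, 8, 50257

memory = torch.randn(J, d)                       # memory bank M_t
tiers = [torch.nn.GRUCell(d, d) for _ in range(num_tiers)]
states = [torch.zeros(1, d) for _ in range(num_tiers)]
novelty_gate = torch.nn.Linear(2 * d, 1)
output_head = torch.nn.Linear(d, vocab)

x = torch.randn(1, d)                            # current token embedding
t = 4                                            # current step index

with torch.no_grad():
    # Read: sparse Top-K routing over memory slots
    scores = memory @ x.squeeze(0)               # (J,) relevance scores
    vals, idx = scores.topk(top_k)
    weights = F.softmax(vals, dim=-1)
    read = (weights.unsqueeze(1) * memory[idx]).sum(0, keepdim=True)

    # Process: tier k updates only when t is a multiple of 2**k
    h = x + read
    for k in range(num_tiers):
        if t % (2 ** k) == 0:
            states[k] = tiers[k](h, states[k])
        h = states[k]

    # Write: a gate blends the new state into one slot (slot choice here
    # is illustrative; THL gates on novelty of the content)
    slot = idx[-1]
    g = torch.sigmoid(novelty_gate(torch.cat([h, memory[slot:slot + 1]], dim=-1))).squeeze()
    memory[slot] = (1 - g) * memory[slot] + g * h.squeeze(0)

    # Predict: next-token logits from the top tier state
    logits = output_head(h)

print(read.shape, h.shape, logits.shape)
```

Note that every step touches only `top_k` of the J slots, so both the read and the write cost are independent of how many tokens have been processed.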
| Metric | THL-7B | Transformer-7B |
|---|---|---|
| VRAM (8K ctx) | 3.8 GB | 26.4 GB |
| Perplexity | ~12.4 | ~11.8 |
| Throughput | 42 tok/s | 38 tok/s |
| Max Context | Unlimited | 8K tokens |
Benchmarked on NVIDIA RTX 3060 (12GB)
We maintain comprehensive test coverage. Run the full suite:
# Run all tests
./scripts/run_tests.sh
# Run specific test categories
pytest tests/test_model.py # Model tests
pytest tests/test_inference.py # Inference tests
pytest tests/test_memory.py     # Memory management tests

- Pre-trained model checkpoints
- PyPI package release
- ONNX export support
- Mobile deployment (iOS/Android)
- Web deployment (WASM)
- Multi-GPU training support
- Quantization (INT8/INT4)
We welcome contributions! Please see our Contributing Guidelines for details.
# Set up development environment
git clone https://github.com/EGen-V/Transformer-Hierarchical-Layers.git
cd Transformer-Hierarchical-Layers
pip install -e ".[dev]"
pre-commit install

If you use THL in your research, please cite:
@software{thl2026,
  title={THL: Transformer Hierarchical Layers},
  author={EGen Team},
  year={2026},
  url={https://github.com/EGen-V/Transformer-Hierarchical-Layers}
}

This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by recurrent memory architectures and efficient transformers research
- Built with PyTorch and the open-source ML community
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
Made with ❤️ by the EGen Team
