LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique for neural networks that dramatically reduces the number of trainable parameters by using low-rank decomposition methods.
Instead of updating all parameters during fine-tuning, LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the neural network. This approach:
- Significantly reduces memory requirements
- Speeds up training
- Maintains model performance comparable to full fine-tuning
Traditionally, when fine-tuning a model, we update the original weights W directly:
W_updated = W + ΔW
LoRA approximates the weight updates (ΔW) using the product of two low-rank matrices A and B:
W_updated = W + A·B
Where:
- W is the frozen pre-trained weight matrix
- A and B are smaller matrices with a bottleneck dimension r
- r is a hyperparameter controlling the rank of the decomposition
Let's illustrate with concrete numbers:
- Suppose a layer has a weight matrix W of size 5,000×10,000 (50M parameters)
- With LoRA using rank r=8:
  - Matrix A has dimensions 5,000×8 (40,000 parameters)
  - Matrix B has dimensions 8×10,000 (80,000 parameters)
Total trainable parameters: 40,000 + 80,000 = 120,000
That's more than 400× fewer trainable parameters than full fine-tuning (50,000,000 / 120,000 ≈ 417)!
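The arithmetic above can be verified with a short script (plain Python, numbers taken straight from the example):

```python
# Parameter-count comparison for the example layer: 5,000×10,000, LoRA rank 8.
d_in, d_out, r = 5_000, 10_000, 8

full_params = d_in * d_out          # full fine-tuning updates every weight
lora_params = d_in * r + r * d_out  # LoRA trains only A (d_in×r) and B (r×d_out)

print(full_params)                  # 50000000
print(lora_params)                  # 120000
print(full_params // lora_params)   # 416 — a >400× reduction
```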
During the forward pass of input x through a LoRA-adapted layer:
output = x·W + α·(x·A·B)
Where α is a scaling factor to control the magnitude of the LoRA update.
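This forward pass can be written out directly in PyTorch. The shapes follow the document's convention (W is d_in×d_out, A is d_in×r, B is r×d_out); the concrete dimensions here are arbitrary:

```python
import torch

# output = x·W + α·(x·A·B), with the standard LoRA initialization:
# A small random, B zero — so the adapter starts as an exact no-op.
d_in, d_out, r, alpha = 16, 32, 4, 8.0
x = torch.randn(2, d_in)          # a batch of 2 inputs
W = torch.randn(d_in, d_out)      # frozen pre-trained weights
A = torch.randn(d_in, r) * 0.01   # small random init
B = torch.zeros(r, d_out)         # zero init

output = x @ W + alpha * (x @ A @ B)

# With B = 0, the adapted layer matches the original layer exactly.
assert torch.allclose(output, x @ W)
```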
- Efficiency: Fine-tune large models on consumer hardware
- Parameter Sharing: Multiple LoRA adapters can be trained for different tasks while sharing the base model
- Quick Adaptation: Train specialized versions of a model for different domains or tasks
- Reduced Storage: Store only the small adapter weights (A and B matrices) instead of full model copies
A basic LoRA layer adds a low-rank update to the original layer's output.
The α hyperparameter controls the magnitude of the LoRA adaptation:
- Higher α = larger modifications to model behavior
- Lower α = more subtle changes
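The basic LoRA layer described above can be sketched as a small PyTorch module. This is a minimal illustration, not a reference implementation — the class name and the 0.01 init scale are assumptions:

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Minimal low-rank adapter: computes alpha * (x @ A @ B)."""
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A: small random init; B: zeros, so the adapter starts as a no-op.
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

# Before any training, the adapter contributes exactly zero.
adapter = LoRALayer(8, 16, rank=4, alpha=2.0)
x = torch.randn(3, 8)
assert torch.equal(adapter(x), torch.zeros(3, 16))
```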
- Freeze weights of the pre-trained model
- Initialize LoRA matrices (A with random small weights, B with zeros)
- Train only the LoRA matrices A and B
- At inference time, you can either:
  - Continue computing the base output and the LoRA update α·(A·B) separately
  - Merge the weights once: W_merged = W + α·(A·B)
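The merge step can be checked numerically: folding the adapter into the base weights produces a layer that computes exactly the same function, with no extra inference cost. A small sketch with arbitrary dimensions:

```python
import torch

# One-time merge of a trained LoRA adapter into the frozen base weights.
d_in, d_out, r, alpha = 8, 8, 2, 4.0
W = torch.randn(d_in, d_out)   # frozen base weights
A = torch.randn(d_in, r)       # trained LoRA factors
B = torch.randn(r, d_out)

W_merged = W + alpha * (A @ B)

x = torch.randn(3, d_in)
# Merged and unmerged forms agree (up to floating-point error).
assert torch.allclose(x @ W_merged, x @ W + alpha * (x @ A @ B), atol=1e-5)
```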
- Rank Selection: Lower ranks save memory but provide less modeling capacity
- Which Layers to Adapt: Commonly applied to attention layers in transformers, but can be used on any linear layers
- Initialization Strategy: Proper initialization of A and B matrices is important for stable training
While LoRA was initially developed for large language models, the technique works for any neural network with linear layers, including:
- Computer vision models
- Audio processing networks
- Multimodal architectures
- Reinforcement learning policies
The simplest way to use LoRA is to replace standard linear layers with LoRA-augmented versions:
```python
# Instead of:
layer = nn.Linear(input_dim, output_dim)

# Use:
layer = LinearWithLoRA(nn.Linear(input_dim, output_dim), rank=4, alpha=8)
```

Then freeze the original weights and train only the LoRA parameters.
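`LinearWithLoRA` is not defined above; a minimal sketch of such a wrapper (the class name is taken from the snippet, everything else is an assumption) could combine a frozen `nn.Linear` with the low-rank update like this:

```python
import torch
import torch.nn as nn

class LinearWithLoRA(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update."""
    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():  # freeze the pre-trained weights
            p.requires_grad = False
        in_dim, out_dim = linear.in_features, linear.out_features
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.linear(x) + self.alpha * (x @ self.A @ self.B)

layer = LinearWithLoRA(nn.Linear(8, 16), rank=4, alpha=8)
# Only A and B require gradients; the wrapped linear layer is frozen.
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
```

Dropping such a wrapper into an existing model leaves the base weights untouched, so the same base model can serve many tasks with different adapters.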
For a deeper dive into LoRA, check the original paper:
"LoRA: Low-Rank Adaptation of Large Language Models" by Hu et al.
For any questions, please open an issue on GitHub: @Siddharth Mishra
Made with ❤️ and lots of ☕

