This project explores two primary techniques for model compression: Pruning and Quantization. The repository contains the work and findings from a series of exercises designed to evaluate the impact of these methods on model size, inference speed, and accuracy.
The project is divided into two main parts:
- Pruning: Investigating various methods (layer-wise, global, iterative) to remove redundant weights.
- Quantization: Analyzing the effects of different data formats (float16, bfloat16) and manual integer quantization (8-bit, 16-bit).
This section focuses on reducing model size by removing redundant weights using different pruning strategies.
- Task: Applied L1 unstructured pruning to each layer individually. The model's accuracy was evaluated at different sparsity ratios (0% to 80%) to find the optimal compression trade-off.
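A minimal sketch of how such a layer-wise sweep can be set up with PyTorch's `torch.nn.utils.prune` (the `evaluate()` helper, `test_loader`, and the exact sparsity values below are assumptions, not the repository's actual code):

```python
# Minimal sketch: layer-wise L1 unstructured pruning with torch.nn.utils.prune.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_layerwise(model: nn.Module, amount: float) -> nn.Module:
    """Prune the lowest-magnitude weights of each Conv2d/Linear layer independently."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            # Zero out `amount` (e.g. 0.2 = 20%) of this layer's weights by L1 magnitude.
            prune.l1_unstructured(module, name="weight", amount=amount)
    return model

# Illustrative sweep: prune a fresh copy of the model at each sparsity ratio
# and evaluate it (evaluate() and test_loader are assumed to exist).
# for amount in (0.0, 0.2, 0.4, 0.6, 0.8):
#     pruned = prune_layerwise(copy.deepcopy(model), amount)
#     accuracy = evaluate(pruned, test_loader)
```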
- Task: Applied global L1 unstructured pruning, allowing the algorithm to remove the lowest-magnitude weights from the entire model, not just within individual layers.
- Findings:
- Global pruning preserved accuracy better than the layer-by-layer approach.
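Global pruning ranks weights across all layers jointly, which helps explain the better accuracy: layers whose weights are uniformly important lose few of them, while more redundant layers absorb most of the sparsity. A minimal sketch using `prune.global_unstructured` (restricting the targets to Conv2d/Linear layers is an assumption):

```python
# Minimal sketch: global L1 unstructured pruning across the whole model.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_global(model: nn.Module, amount: float) -> nn.Module:
    """Prune the lowest-magnitude weights model-wide, not per layer."""
    parameters_to_prune = [
        (module, "weight")
        for module in model.modules()
        if isinstance(module, (nn.Conv2d, nn.Linear))
    ]
    prune.global_unstructured(
        parameters_to_prune,
        pruning_method=prune.L1Unstructured,
        amount=amount,  # fraction of all (remaining) weights removed model-wide
    )
    return model
```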
- Task: Fine-tuned the globally pruned models from Exercise 3 for 5 epochs to allow the network to recover from the weight removal.
- Findings:
- Fine-tuning caused a significant recovery in accuracy for all models.
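Because `torch.nn.utils.prune` leaves the pruning mask attached to each module, the recovery fine-tuning can be an ordinary training loop: the masked (effective) weights stay at zero while the surviving weights are updated. A minimal sketch, where the optimizer, learning rate, and data loader are assumptions:

```python
# Minimal sketch: fine-tune a pruned model to recover accuracy.
import torch
import torch.nn as nn

def finetune(model: nn.Module, train_loader, epochs: int = 5,
             lr: float = 1e-3, device: str = "cuda") -> nn.Module:
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()  # pruned positions remain zero via the attached mask
    return model
```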
- Task: Implemented an iterative process: prune the model by a small amount, fine-tune for 3 epochs, and repeat this cycle until a high sparsity (80%) was reached.
- Findings:
- This was the most effective method, yielding the best accuracy of all the pruning strategies tested.
- The model was able to adapt to the gradual pruning at each step.
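A minimal sketch of the iterative schedule, reusing the hypothetical `prune_global()` and `finetune()` helpers sketched above (the step size and target sparsity are illustrative):

```python
# Minimal sketch: iterative prune -> fine-tune cycle up to a target sparsity.
def iterative_prune(model, train_loader, step: float = 0.2, target_sparsity: float = 0.8):
    sparsity = 0.0
    while sparsity < target_sparsity:
        # Each global pruning call removes `step` of the *remaining* weights,
        # so the overall sparsity compounds gradually toward the target.
        prune_global(model, amount=step)
        finetune(model, train_loader, epochs=3)
        sparsity = 1.0 - (1.0 - sparsity) * (1.0 - step)
    return model
```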
- Task: Implemented structured pruning to remove entire filters/channels instead of individual weights.
- Findings:
- Unstructured pruning is much more effective at preserving accuracy, as structured pruning removes entire learned features.
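A minimal sketch of the structured variant, using `prune.ln_structured` to remove whole convolutional filters ranked by their L1 norm (`dim=0` selects output channels; applying it only to Conv2d layers is an assumption):

```python
# Minimal sketch: structured pruning of entire filters/channels.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_structured(model: nn.Module, amount: float) -> nn.Module:
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # Remove `amount` of the output channels with the smallest L1 norm.
            prune.ln_structured(module, name="weight", amount=amount, n=1, dim=0)
    return model
```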
This section explores reducing model size and latency by changing the numerical precision of the model's weights.
- Task: Evaluated the model's performance (accuracy, latency, size) using `float32`, `float16`, and `bfloat16` data types on both CPU and GPU. A large model (ResNet152) was also evaluated on the GPU.
- Findings (ConvNet):
  - On GPU: `float16` and `bfloat16` reduced model size by 50% (1.31MB to 0.65MB) while maintaining identical accuracy (all ~0.83). Latency saw a minor increase (0.53ms to ~0.61ms), likely due to conversion overhead.
  - On CPU: `float32` was significantly faster (2.04ms) than `float16` (37.69ms) and `bfloat16` (29.17ms).
  - Conclusion (CPU): `float32` is the preferred format for CPU inference.
- Findings (ResNet152 on GPU):
  - `float16` and `bfloat16` cut the model size in half (229.6MB to 114.8MB).
  - `float16` and `bfloat16` also decreased latency (7.83ms to ~6.68ms).
  - Conclusion (GPU w/ 4GB RAM): For a large model on a memory-constrained GPU, `float16` or `bfloat16` is essential to reduce memory usage.
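A minimal sketch of the kind of measurement behind these numbers, using `torchvision`'s `resnet152` as a stand-in: cast the model to a lower-precision dtype, then compare parameter size and synchronized GPU latency. The input shape, warm-up count, and number of timed runs are assumptions:

```python
# Minimal sketch: model size and GPU latency at different precisions.
import time
import torch
from torchvision.models import resnet152

def model_size_mb(model: torch.nn.Module) -> float:
    """Parameter + buffer bytes in MiB (2**20), matching the ~229.6 figure above."""
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return n_bytes / (1024 ** 2)

@torch.no_grad()
def gpu_latency_ms(model: torch.nn.Module, dtype=torch.float16, runs: int = 100) -> float:
    device = torch.device("cuda")
    model = model.to(device=device, dtype=dtype).eval()
    x = torch.randn(1, 3, 224, 224, device=device, dtype=dtype)
    for _ in range(10):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1e3

model = resnet152(weights=None)
print(f"fp32 size: {model_size_mb(model):.1f} MB")                     # ~229.6 for ResNet152
print(f"bf16 size: {model_size_mb(model.to(torch.bfloat16)):.1f} MB")  # ~114.8
# print(f"fp16 GPU latency: {gpu_latency_ms(resnet152(weights=None)):.2f} ms")
```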
- Task: Performed manual linear quantization on a single layer to 8-bit and 16-bit integer formats. This involved calculating the scale (`S`) and zero-point (`z`) parameters to map the float weights to integers.
- Findings:
  - The de-quantized weights were compared to the originals using Mean Squared Error (MSE).
  - 8-bit: Produced a very small error (MSE = 1.02 × 10⁻⁵), showing it's an accurate approximation.
  - 16-bit: Was nearly lossless, with an extremely small error (MSE = 1.57 × 10⁻¹⁰).
- Parameters to Save: To store and later reconstruct the quantized model, three components must be saved for each layer:
  1. Quantized weights (`Q`)
  2. Scale (`S`)
  3. Zero-point (`z`)
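A minimal sketch of the asymmetric linear quantization described above: the scale `S` maps the float range onto the integer range, and the zero-point `z` is the integer that represents the float value 0.0. The tensor shape and the use of an unsigned integer range are assumptions:

```python
# Minimal sketch: manual linear (affine) quantization of one weight tensor.
import torch

def quantize(w: torch.Tensor, num_bits: int = 8):
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = w.min(), w.max()
    # S = (w_max - w_min) / (qmax - qmin): step size between adjacent integer levels.
    S = (w_max - w_min) / (qmax - qmin)
    # z = integer level that corresponds to the float value 0.0.
    z = torch.round(qmin - w_min / S).clamp(qmin, qmax)
    # Quantized weights (stored here as integer values in a float tensor).
    Q = torch.clamp(torch.round(w / S + z), qmin, qmax)
    return Q, S, z

def dequantize(Q, S, z):
    return S * (Q - z)

w = torch.randn(64, 128)                  # stand-in for a layer's weight matrix
Q, S, z = quantize(w, num_bits=8)
w_hat = dequantize(Q, S, z)
mse = torch.mean((w - w_hat) ** 2)        # error of the kind reported above
```

The returned `Q`, `S`, and `z` are exactly the three per-layer components listed above as the parameters to save.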