This project explores two primary techniques for model compression: Pruning and Quantization. The repository contains the work and findings from a series of exercises designed to evaluate the impact of these methods on model size, inference speed, and accuracy.
The project is divided into two main parts:
- Pruning: Investigating various methods (layer-wise, global, iterative) to remove redundant weights.
- Quantization: Analyzing the effects of different data formats (float16, bfloat16) and manual integer quantization (8-bit, 16-bit).
This section focuses on reducing model size by removing redundant weights using different pruning strategies.
- Task: Applied L1 unstructured pruning to each layer individually. The model's accuracy was evaluated at different sparsity ratios (0% to 80%) to find the optimal compression trade-off.
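A minimal sketch of how such a layer-wise sweep can be set up with PyTorch's `torch.nn.utils.prune` (the `evaluate()` helper, `test_loader`, and the exact sparsity values below are assumptions, not the repository's actual code):

```python
# Minimal sketch: layer-wise L1 unstructured pruning with torch.nn.utils.prune.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_layerwise(model: nn.Module, amount: float) -> nn.Module:
    """Prune the lowest-magnitude weights of each Conv2d/Linear layer independently."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            # Zero out `amount` (e.g. 0.2 = 20%) of this layer's weights by L1 magnitude.
            prune.l1_unstructured(module, name="weight", amount=amount)
    return model

# Illustrative sweep: prune a fresh copy of the model at each sparsity ratio
# and evaluate it (evaluate() and test_loader are assumed to exist).
# for amount in (0.0, 0.2, 0.4, 0.6, 0.8):
#     pruned = prune_layerwise(copy.deepcopy(model), amount)
#     accuracy = evaluate(pruned, test_loader)
```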
- Task: Applied global L1 unstructured pruning, allowing the algorithm to remove the lowest-magnitude weights from the entire model, not just within individual layers.
- Findings:
- Global pruning preserved accuracy better than the layer-by-layer approach.
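Global pruning ranks weights across all layers jointly, which helps explain the better accuracy: layers whose weights are uniformly important lose few of them, while more redundant layers absorb most of the sparsity. A minimal sketch using `prune.global_unstructured` (restricting the targets to Conv2d/Linear layers is an assumption):

```python
# Minimal sketch: global L1 unstructured pruning across the whole model.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_global(model: nn.Module, amount: float) -> nn.Module:
    """Prune the lowest-magnitude weights model-wide, not per layer."""
    parameters_to_prune = [
        (module, "weight")
        for module in model.modules()
        if isinstance(module, (nn.Conv2d, nn.Linear))
    ]
    prune.global_unstructured(
        parameters_to_prune,
        pruning_method=prune.L1Unstructured,
        amount=amount,  # fraction of all (remaining) weights removed model-wide
    )
    return model
```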
- Task: Fine-tuned the globally pruned models from Exercise 3 for 5 epochs to allow the network to recover from the weight removal.
- Findings:
- Fine-tuning caused a significant recovery in accuracy for all models.
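Because `torch.nn.utils.prune` leaves the pruning mask attached to each module, the recovery fine-tuning can be an ordinary training loop: the masked (effective) weights stay at zero while the surviving weights are updated. A minimal sketch, where the optimizer, learning rate, and data loader are assumptions:

```python
# Minimal sketch: fine-tune a pruned model to recover accuracy.
import torch
import torch.nn as nn

def finetune(model: nn.Module, train_loader, epochs: int = 5,
             lr: float = 1e-3, device: str = "cuda") -> nn.Module:
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()  # pruned positions remain zero via the attached mask
    return model
```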
- Task: Implemented an iterative process: prune the model by a small amount, fine-tune for 3 epochs, and repeat this cycle until a high sparsity (80%) was reached.
- Findings:
- This was the most effective method, yielding the best accuracy of all the pruning strategies tested.
- The model was able to adapt to the gradual pruning at each step.
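A minimal sketch of the iterative schedule, reusing the hypothetical `prune_global()` and `finetune()` helpers sketched above (the step size and target sparsity are illustrative):

```python
# Minimal sketch: iterative prune -> fine-tune cycle up to a target sparsity.
def iterative_prune(model, train_loader, step: float = 0.2, target_sparsity: float = 0.8):
    sparsity = 0.0
    while sparsity < target_sparsity:
        # Each global pruning call removes `step` of the *remaining* weights,
        # so the overall sparsity compounds gradually toward the target.
        prune_global(model, amount=step)
        finetune(model, train_loader, epochs=3)
        sparsity = 1.0 - (1.0 - sparsity) * (1.0 - step)
    return model
```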
- Task: Implemented structured pruning to remove entire filters/channels instead of individual weights.
- Findings:
- Unstructured pruning is much more effective at preserving accuracy, as structured pruning removes entire learned features.
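A minimal sketch of the structured variant, using `prune.ln_structured` to remove whole convolutional filters ranked by their L1 norm (`dim=0` selects output channels; applying it only to Conv2d layers is an assumption):

```python
# Minimal sketch: structured pruning of entire filters/channels.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_structured(model: nn.Module, amount: float) -> nn.Module:
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            # Remove `amount` of the output channels with the smallest L1 norm.
            prune.ln_structured(module, name="weight", amount=amount, n=1, dim=0)
    return model
```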
This section explores reducing model size and latency by changing the numerical precision of the model's weights.
- Task: Evaluated the model's performance (accuracy, latency, size) using `float32`, `float16`, and `bfloat16` data types on both CPU and GPU. A large model (ResNet152) was also evaluated on the GPU.
- Findings (ConvNet):
  - On GPU: `float16` and `bfloat16` reduced model size by 50% (1.31MB to 0.65MB) while maintaining identical accuracy (all ~0.83). Latency saw a minor increase (0.53ms to ~0.61ms), likely due to conversion overhead.
  - On CPU: `float32` was significantly faster (2.04ms) than `float16` (37.69ms) and `bfloat16` (29.17ms).
  - Conclusion (CPU): `float32` is the preferred format for CPU inference.
- Findings (ResNet152 on GPU):
  - `float16` and `bfloat16` cut the model size in half (229.6MB to 114.8MB).
  - `float16` and `bfloat16` also decreased latency (7.83ms to ~6.68ms).
  - Conclusion (GPU w/ 4GB RAM): For a large model on a memory-constrained GPU, `float16` or `bfloat16` is essential to reduce memory usage.
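A minimal sketch of the kind of measurement behind these numbers, using `torchvision`'s `resnet152` as a stand-in: cast the model to a lower-precision dtype, then compare parameter size and synchronized GPU latency. The input shape, warm-up count, and number of timed runs are assumptions:

```python
# Minimal sketch: model size and GPU latency at different precisions.
import time
import torch
from torchvision.models import resnet152

def model_size_mb(model: torch.nn.Module) -> float:
    """Parameter + buffer bytes in MiB (2**20), matching the ~229.6 figure above."""
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return n_bytes / (1024 ** 2)

@torch.no_grad()
def gpu_latency_ms(model: torch.nn.Module, dtype=torch.float16, runs: int = 100) -> float:
    device = torch.device("cuda")
    model = model.to(device=device, dtype=dtype).eval()
    x = torch.randn(1, 3, 224, 224, device=device, dtype=dtype)
    for _ in range(10):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1e3

model = resnet152(weights=None)
print(f"fp32 size: {model_size_mb(model):.1f} MB")                     # ~229.6 for ResNet152
print(f"bf16 size: {model_size_mb(model.to(torch.bfloat16)):.1f} MB")  # ~114.8
# print(f"fp16 GPU latency: {gpu_latency_ms(resnet152(weights=None)):.2f} ms")
```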
- Task: Performed manual linear quantization on a single layer to 8-bit and 16-bit integer formats. This involved calculating the scale (`S`) and zero-point (`z`) parameters to map the float weights to integers.
- Findings:
  - The de-quantized weights were compared to the originals using Mean Squared Error (MSE).
  - 8-bit: Produced a very small error (MSE = 1.02 × 10⁻⁵), showing it's an accurate approximation.
  - 16-bit: Was nearly lossless, with an extremely small error (MSE = 1.57 × 10⁻¹⁰).
- Parameters to Save: To store and later reconstruct the quantized model, three components must be saved for each layer:
  1. Quantized weights (`Q`)
  2. Scale (`S`)
  3. Zero-point (`z`)
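A minimal sketch of the asymmetric linear quantization described above: the scale `S` maps the float range onto the integer range, and the zero-point `z` is the integer that represents the float value 0.0. The tensor shape and the use of an unsigned integer range are assumptions:

```python
# Minimal sketch: manual linear (affine) quantization of one weight tensor.
import torch

def quantize(w: torch.Tensor, num_bits: int = 8):
    qmin, qmax = 0, 2 ** num_bits - 1
    w_min, w_max = w.min(), w.max()
    # S = (w_max - w_min) / (qmax - qmin): step size between adjacent integer levels.
    S = (w_max - w_min) / (qmax - qmin)
    # z = integer level that corresponds to the float value 0.0.
    z = torch.round(qmin - w_min / S).clamp(qmin, qmax)
    # Quantized weights (stored here as integer values in a float tensor).
    Q = torch.clamp(torch.round(w / S + z), qmin, qmax)
    return Q, S, z

def dequantize(Q, S, z):
    return S * (Q - z)

w = torch.randn(64, 128)                  # stand-in for a layer's weight matrix
Q, S, z = quantize(w, num_bits=8)
w_hat = dequantize(Q, S, z)
mse = torch.mean((w - w_hat) ** 2)        # error of the kind reported above
```

The returned `Q`, `S`, and `z` are exactly the three per-layer components listed above as the parameters to save.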