Model Compression Techniques: Pruning & Quantization with PyTorch

This project explores two primary techniques for model compression: Pruning and Quantization. The repository contains the work and findings from a series of exercises designed to evaluate the impact of these methods on model size, inference speed, and accuracy.

The project is divided into two main parts:

  • Pruning: Investigating various methods (layer-wise, global, iterative) to remove redundant weights.
  • Quantization: Analyzing the effects of different data formats (float16, bfloat16) and manual integer quantization (8-bit, 16-bit).

🏋️‍♂️ Part 1: Pruning

This section focuses on reducing model size by removing redundant weights using different pruning strategies.

Exercise 1 & 2: Layer-by-Layer L1 Unstructured Pruning

  • Task: Applied L1 unstructured pruning to each layer individually. The model's accuracy was evaluated at different sparsity ratios (0% to 80%) to find the optimal compression trade-off.
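
A minimal sketch of this setup using torch.nn.utils.prune; the layer selection, the sparsity grid, and the evaluate() helper are assumptions for illustration, not the repository's exact code:

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_layerwise(model: nn.Module, amount: float) -> nn.Module:
    """Prune the `amount` fraction of lowest-magnitude weights in each conv/linear layer."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=amount)
    return model

# Sweep sparsity ratios and record accuracy (evaluate() and test_loader are assumed):
# for amount in (0.0, 0.2, 0.4, 0.6, 0.8):
#     pruned = prune_layerwise(copy.deepcopy(model), amount)
#     acc = evaluate(pruned, test_loader)
```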

Exercise 3: Global Unstructured Pruning

  • Task: Applied global L1 unstructured pruning, allowing the algorithm to remove the lowest-magnitude weights from the entire model, not just within individual layers (see the sketch below).
  • Findings:
    • Global pruning preserved accuracy better than the layer-by-layer approach.
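
A minimal sketch of the global variant with prune.global_unstructured; which layers are included is again an assumption:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_global(model: nn.Module, amount: float) -> nn.Module:
    """Remove the `amount` fraction of lowest-magnitude weights across all listed layers."""
    params_to_prune = [
        (module, "weight")
        for module in model.modules()
        if isinstance(module, (nn.Conv2d, nn.Linear))
    ]
    # Weights are ranked by magnitude across the whole model, so layers with more
    # redundancy can absorb a larger share of the pruning budget.
    prune.global_unstructured(
        params_to_prune,
        pruning_method=prune.L1Unstructured,
        amount=amount,
    )
    return model
```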

Exercise 4: Re-training after Pruning

  • Task: Fine-tuned the globally pruned models from Exercise 3 for 5 epochs to allow the network to recover from the weight removal (see the sketch below).
  • Findings:
    • Fine-tuning caused a significant recovery in accuracy for all models.
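
The recovery step can be sketched as a standard training loop; the optimizer, learning rate, and loss below are assumptions. Because torch.nn.utils.prune re-applies the stored masks on every forward pass, pruned weights stay at zero while the surviving weights are updated.

```python
import torch

def finetune(model, train_loader, epochs=5, lr=1e-3, device="cuda"):
    """Fine-tune a pruned model; masks registered by torch.nn.utils.prune keep pruned weights at zero."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
    return model
```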

Exercise 5: Iterative Pruning and Re-training

  • Task: Implemented an iterative process: prune the model by a small amount, fine-tune for 3 epochs, and repeat this cycle until a high sparsity (80%) was reached (see the sketch below).
  • Findings:
    • This was the most effective approach, producing the best accuracy of all the pruning strategies tested.
    • The model was able to adapt to the gradual pruning at each step.
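
A sketch of the iterative schedule, reusing the hypothetical prune_global() and finetune() helpers from the sketches above; the step size is illustrative. With torch.nn.utils.prune, the amount passed in repeated calls applies to the weights that are still unpruned, so overall sparsity compounds across cycles.

```python
target_sparsity = 0.8
step = 0.2                      # fraction of the *remaining* weights removed per cycle
sparsity = 0.0

while sparsity < target_sparsity - 1e-6:
    model = prune_global(model, amount=step)           # prune a little...
    model = finetune(model, train_loader, epochs=3)    # ...then let the model recover
    # Each cycle removes `step` of what is left: s -> 1 - (1 - s) * (1 - step)
    sparsity = 1.0 - (1.0 - sparsity) * (1.0 - step)
```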

Bonus: Structured Pruning

  • Task: Implemented structured pruning to remove entire filters/channels instead of individual weights (see the sketch below).
  • Findings:
    • Unstructured pruning is much more effective at preserving accuracy, as structured pruning removes entire learned features.
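
A minimal sketch of the structured variant using prune.ln_structured, which removes whole output channels (filters) of each convolution ranked by their L2 norm; the amount and layer selection are illustrative:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_structured(model: nn.Module, amount: float) -> nn.Module:
    """Zero out the `amount` fraction of output channels (dim=0) with the smallest L2 norm."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.ln_structured(module, name="weight", amount=amount, n=2, dim=0)
    return model
```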

⚡ Part 2: Quantization

This section explores reducing model size and latency by changing the numerical precision of the model's weights.

Exercise 1: Evaluation of Floating-Point Data Formats

  • Task: Evaluated the model's performance (accuracy, latency, size) using float32, float16, and bfloat16 data types on both CPU and GPU. A large model (ResNet152) was also evaluated on the GPU (see the sketch below).
  • Findings (ConvNet):
    • On GPU: float16 and bfloat16 reduced model size by 50% (1.31MB to 0.65MB) while maintaining identical accuracy (all ~0.83). Latency saw a minor increase (0.53ms to ~0.61ms), likely due to conversion overhead.
    • On CPU: float32 was significantly faster (2.04ms) than float16 (37.69ms) and bfloat16 (29.17ms), likely because most CPU kernels lack optimized low-precision paths.
    • Conclusion (CPU): float32 is the preferred format for CPU inference.
  • Findings (ResNet152 on GPU):
    • float16 and bfloat16 cut the model size in half (229.6MB to 114.8MB).
    • float16 and bfloat16 also decreased latency (7.83ms to ~6.68ms).
    • Conclusion (GPU with 4GB of memory): for a large model on a memory-constrained GPU, float16 or bfloat16 is essential to reduce memory usage.
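
The kind of measurement behind these numbers can be sketched as follows; the input shape, timing protocol, and use of torchvision's ResNet152 are assumptions, and the repository's benchmarking code may differ:

```python
import copy, io, time
import torch
import torchvision

def model_size_mb(model):
    """Size of the serialized state_dict in megabytes."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

base = torchvision.models.resnet152(weights=None).eval().to("cuda")
x = torch.randn(1, 3, 224, 224, device="cuda")

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    model = copy.deepcopy(base).to(dtype)
    inp = x.to(dtype)
    with torch.no_grad():
        model(inp)                          # warm-up pass
        torch.cuda.synchronize()
        start = time.perf_counter()
        model(inp)
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) * 1e3
    print(dtype, f"{model_size_mb(model):.1f} MB", f"{latency_ms:.2f} ms")
```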

Exercise 2: Manual Linear Quantization

  • Task: Performed manual linear quantization on a single layer to 8-bit and 16-bit integer formats. This involved calculating the scale (S) and zero-point (z) parameters that map the float weights to integers (see the sketch below).

  • Findings:

    • The de-quantized weights were compared to the originals using Mean Squared Error (MSE).
    • 8-bit: Produced a very small error (MSE = 1.02 × 10⁻⁵), showing it's an accurate approximation.
    • 16-bit: Was nearly lossless, with an extremely small error (MSE = 1.57 × 10⁻¹⁰).
    • Parameters to Save:
      To store and later reconstruct the quantized model, three components must be saved for each layer:
      1. Quantized weights (Q)
      2. Scale (S)
      3. Zero-point (z)
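
A minimal sketch of this manual affine quantization applied to a single weight tensor; the signed integer range and the random stand-in tensor are assumptions, and the repository's exact scheme may differ:

```python
import torch

def linear_quantize(w: torch.Tensor, n_bits: int):
    """Map float weights onto a signed n-bit integer grid via scale S and zero-point z."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    w_min, w_max = w.min(), w.max()
    S = (w_max - w_min) / (qmax - qmin)            # scale
    z = torch.round(qmin - w_min / S)              # zero-point
    Q = torch.clamp(torch.round(w / S + z), qmin, qmax)
    return Q, S, z

def linear_dequantize(Q, S, z):
    """Reconstruct approximate float weights from (Q, S, z)."""
    return S * (Q - z)

w = torch.randn(64, 128)                           # stand-in for one layer's weights
for bits in (8, 16):
    Q, S, z = linear_quantize(w, bits)
    mse = torch.mean((w - linear_dequantize(Q, S, z)) ** 2)
    print(f"{bits}-bit MSE: {mse.item():.3e}")
```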
