
quickcdc-cuda: Comprehensive Guide

Introduction

quickcdc-cuda is a CUDA-accelerated implementation of content-defined chunking (CDC). This guide covers how to use the library, how to tune its performance, and how the underlying implementation works.

Author: Sayantan Das sdas.codes@gmail.com

Table of Contents

  1. What is Content-Defined Chunking?
  2. CUDA Acceleration Benefits
  3. Installation Guide
  4. API Reference
  5. Performance Optimization
  6. Implementation Details
  7. Benchmarks
  8. Troubleshooting

What is Content-Defined Chunking?

Content-defined chunking (CDC) is a technique used in data deduplication systems to split data into variable-sized chunks based on the content itself, rather than using fixed-size chunks. This approach has several advantages:

  • Better deduplication rates, as chunk boundaries align with natural content boundaries
  • Resilience to data shifts (inserting or deleting data only affects a small number of chunks)
  • Improved storage efficiency in backup and archival systems

The algorithm used in this library is based on the Asymmetric Extremum Content Defined Chunking (AE-CDC) approach, which identifies local maxima in a rolling window to determine chunk boundaries.
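To make the idea concrete, here is a serial sketch of AE-style boundary detection in Rust. This is an illustration, not the library's actual kernel: the `window` parameter and the way `salt` is folded into byte values here are assumptions.

```rust
/// Serial sketch of AE-style boundary detection. The window size and the
/// salted byte comparison are illustrative assumptions, not the library's
/// exact algorithm.
fn ae_boundaries(data: &[u8], window: usize, max_size: usize, salt: u64) -> Vec<usize> {
    let mut boundaries = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let end = (start + max_size).min(data.len());
        let mut max_val = 0u64;
        let mut max_pos = start;
        let mut cut = end; // fall back to the max-size cut
        for i in start..end {
            // Fold the salt into each byte so boundaries shift per salt.
            let v = (data[i] as u64).wrapping_mul(salt | 1);
            if v > max_val {
                max_val = v;
                max_pos = i;
            } else if i - max_pos >= window {
                // A local maximum followed by `window` smaller values
                // marks a content-defined boundary.
                cut = i;
                break;
            }
        }
        boundaries.push(cut);
        start = cut;
    }
    boundaries
}

fn main() {
    let data: Vec<u8> = (0..1000u32).map(|i| (i * 31 % 251) as u8).collect();
    let cuts = ae_boundaries(&data, 16, 128, 12345);
    // The cuts cover the whole input and every chunk respects max_size.
    assert_eq!(*cuts.last().unwrap(), data.len());
    println!("found {} chunks", cuts.len());
}
```

Because a boundary depends only on bytes inside the local window, inserting data early in a stream shifts at most a few nearby boundaries, which is what gives CDC its resilience to data shifts.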

CUDA Acceleration Benefits

The CUDA implementation provides several benefits over the CPU-only version:

  1. Parallel Processing: The GPU can examine multiple potential chunk boundaries simultaneously
  2. Throughput Improvement: For large datasets, processing speed can be 2-5x faster
  3. Scalability: Performance scales with GPU capabilities
  4. CPU Offloading: Frees up CPU resources for other tasks

Installation Guide

Prerequisites

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit (version 10.0 or later recommended)
  • Rust (2021 edition or later)
  • Cargo package manager

Environment Setup

Linux

  1. Install CUDA Toolkit:

    # For Ubuntu/Debian
    sudo apt-get update
    sudo apt-get install nvidia-cuda-toolkit
    
    # Verify installation
    nvcc --version
  2. Set environment variables:

    echo 'export CUDA_PATH=/usr/local/cuda' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_PATH/lib64' >> ~/.bashrc
    source ~/.bashrc

macOS

Note: NVIDIA discontinued CUDA support for macOS after CUDA Toolkit 10.2, so GPU acceleration is only available on older Intel Macs with NVIDIA hardware.

  1. Install CUDA Toolkit 10.2 (the final macOS release) from NVIDIA's download archive.
  2. Set environment variables:

    echo 'export CUDA_PATH=/usr/local/cuda' >> ~/.zshrc
    echo 'export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:$CUDA_PATH/lib' >> ~/.zshrc
    source ~/.zshrc

Windows

  1. Download and install the CUDA Toolkit from NVIDIA's website
  2. Add CUDA to your PATH (replace v11.x with your installed version; note that `set` only affects the current session, so use the System Environment Variables dialog or `setx` to persist):
    set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.x
    set PATH=%PATH%;%CUDA_PATH%\bin
    

Building the Library

# Clone the repository
git clone https://github.com/yourusername/quickcdc-cuda.git
cd quickcdc-cuda

# Build in release mode
cargo build --release

# Run tests (including CUDA tests)
cargo test --release

API Reference

Core Types

Chunker<'a>

The main struct that handles the chunking process.

pub struct Chunker<'a> {
    // Fields omitted for brevity
}

ChunkerError

Error type for chunking operations.

pub enum ChunkerError {
    InsufficientMaxSize,
    InsufficientTargetSize,
    CudaError(String),
}
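Callers typically match on these variants when a chunker fails to initialize. The enum below is copied from the listing above; the handler and its messages are illustrative, not part of the library:

```rust
// Copied from the listing above; the messages below are illustrative.
#[derive(Debug)]
pub enum ChunkerError {
    InsufficientMaxSize,
    InsufficientTargetSize,
    CudaError(String),
}

// Map each variant to a human-readable description.
fn describe(err: &ChunkerError) -> String {
    match err {
        ChunkerError::InsufficientMaxSize => "max_chunksize_bytes is too small".into(),
        ChunkerError::InsufficientTargetSize => "target_chunksize_bytes is too small".into(),
        ChunkerError::CudaError(msg) => format!("CUDA failure: {}", msg),
    }
}

fn main() {
    let err = ChunkerError::CudaError("device not found".into());
    println!("{}", describe(&err)); // prints "CUDA failure: device not found"
}
```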

Key Functions

Initialization

// Initialize CUDA (call once at program start)
Chunker::init_cuda() -> Result<(), ChunkerError>

// Create a CPU-based chunker
Chunker::with_params(
    slice: &[u8],
    target_chunksize_bytes: usize,
    max_chunksize_bytes: usize,
    salt: u64,
) -> Result<Chunker, ChunkerError>

// Create a CUDA-accelerated chunker
Chunker::with_cuda(
    slice: &[u8],
    target_chunksize_bytes: usize,
    max_chunksize_bytes: usize,
    salt: u64,
) -> Result<Chunker, ChunkerError>

// Generate a random salt value
Chunker::get_random_salt() -> u64

Usage

The Chunker struct implements the Iterator trait, so you can use it like any other iterator:

let chunker = Chunker::with_cuda(&data, target_size, max_size, salt).unwrap();

for chunk in chunker {
    // Process each chunk
    process_data(chunk);
}

Performance Optimization

Chunk Size Selection

The choice of chunk size affects both performance and deduplication efficiency:

| Target Size | Best Use Case                              |
| ----------- | ------------------------------------------ |
| 16KB-32KB   | Small files, higher deduplication priority |
| 64KB-128KB  | General purpose, balanced performance      |
| 256KB-512KB | Large files, performance priority          |

GPU Memory Considerations

  • Data Transfer: Minimize transfers between CPU and GPU
  • Batch Processing: Process multiple files in a single GPU operation when possible
  • Memory Limits: Be aware of your GPU's memory capacity for very large files

Optimization Tips

  1. Pre-warm the GPU: Run a small chunking operation before processing large datasets
  2. Reuse Salt Values: Using the same salt for related files can improve cache locality
  3. Chunk Size Tuning: Experiment with different chunk sizes for your specific workload
  4. Concurrent Processing: For multiple files, process them in parallel using multiple threads
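Tip 4 can be sketched with scoped threads. Note that `count_chunks` here is a fixed-size stand-in: a real worker would build a `Chunker` over each file's bytes and iterate it.

```rust
use std::thread;

// Stand-in for per-file chunking; a real worker would drive a Chunker
// over the file's bytes instead of this fixed-size count.
fn count_chunks(data: &[u8], target: usize) -> usize {
    (data.len() + target - 1) / target // placeholder: ceil(len / target)
}

fn main() {
    let files: Vec<Vec<u8>> = vec![
        vec![0u8; 300_000],
        vec![1u8; 150_000],
        vec![2u8; 64_000],
    ];
    let target = 64 * 1024;
    // One worker thread per file; scoped threads may borrow `files`.
    let counts: Vec<usize> = thread::scope(|s| {
        let handles: Vec<_> = files
            .iter()
            .map(|f| s.spawn(move || count_chunks(f, target)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    println!("chunks per file: {:?}", counts); // [5, 3, 1]
}
```

For many small files a thread pool (or batching several files into one GPU operation, as noted above) avoids per-thread overhead.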

Implementation Details

CUDA Kernel Design

The CUDA implementation uses a parallel approach to find chunk boundaries:

  1. Data Transfer: The input data is copied to GPU memory
  2. Parallel Processing: Each thread examines a window of bytes
  3. Marker Detection: Threads identify potential chunk boundaries based on content
  4. Atomic Collection: Results are collected using atomic operations
  5. Sorting: Boundaries are sorted to ensure correct ordering
  6. Iteration: The chunker uses these pre-computed boundaries to yield chunks
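Steps 4-6 can be sketched serially. On the GPU, threads deposit candidate offsets via atomic operations, so they arrive out of order; that is why the sort step is needed before iteration. This is an illustration, not the library's code:

```rust
// Candidates arrive unordered (as from atomic collection on the GPU);
// sort and deduplicate them, then slice the input into chunks.
fn chunks_from_candidates(data: &[u8], mut candidates: Vec<usize>) -> Vec<&[u8]> {
    candidates.sort_unstable();
    candidates.dedup();
    candidates.retain(|&b| b > 0 && b < data.len()); // drop degenerate cuts
    let mut chunks = Vec::new();
    let mut start = 0;
    for &b in &candidates {
        chunks.push(&data[start..b]);
        start = b;
    }
    chunks.push(&data[start..]); // final chunk runs to the end of input
    chunks
}

fn main() {
    let data = b"abcdefghij";
    // Simulated out-of-order (and duplicated) results from parallel threads.
    let chunks = chunks_from_candidates(data, vec![7, 3, 3]);
    assert_eq!(chunks, vec![&b"abc"[..], &b"defg"[..], &b"hij"[..]]);
    println!("{} chunks", chunks.len()); // 3 chunks
}
```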

Memory Management

The implementation carefully manages GPU memory:

  • Allocates buffers for input data and results
  • Properly frees resources after use
  • Handles error conditions gracefully

Algorithm Modifications

The CUDA implementation includes several optimizations:

  • Warp Forward: Skips unnecessary processing before minimum chunk size
  • Parallel Window Processing: Examines multiple windows simultaneously
  • Pre-computation: Calculates all chunk boundaries in a single GPU operation

Benchmarks

Performance comparison between CPU and CUDA implementations:

| File Size | CPU Time | CUDA Time | Speedup |
| --------- | -------- | --------- | ------- |
| 10MB      | 15ms     | 12ms      | 1.25x   |
| 100MB     | 150ms    | 50ms      | 3.0x    |
| 1GB       | 1500ms   | 350ms     | 4.3x    |
| 10GB      | 15000ms  | 3200ms    | 4.7x    |

Note: Benchmarks performed on an NVIDIA RTX 3080 GPU. Your results may vary based on hardware.

Troubleshooting

Common Issues

CUDA Initialization Fails

Symptoms: Error when calling Chunker::init_cuda()

Solutions:

  • Verify CUDA toolkit installation: nvcc --version
  • Check GPU compatibility: nvidia-smi
  • Ensure CUDA_PATH is set correctly

Build Errors

Symptoms: Compilation fails with CUDA-related errors

Solutions:

  • Update CUDA toolkit
  • Check Rust version: rustc --version
  • Verify build dependencies

Performance Issues

Symptoms: CUDA version is not faster than the CPU version

Solutions:

  • Use larger datasets (>100MB)
  • Adjust chunk size parameters
  • Check GPU utilization: nvidia-smi -l 1
  • Minimize other GPU workloads

Debugging Tips

  1. Enable verbose CUDA logging:

    std::env::set_var("RUST_BACKTRACE", "1");
    std::env::set_var("RUSTACUDA_DEBUG", "1");
  2. Check CUDA device properties:

    let device = Device::get_device(0).unwrap();
    println!("Using device: {}", device.name().unwrap());
  3. Monitor GPU memory usage:

    nvidia-smi -l 1

Conclusion

The quickcdc-cuda library provides a high-performance implementation of content-defined chunking using CUDA acceleration. By following the guidelines in this document, you can effectively use this library in your applications and achieve significant performance improvements over CPU-only implementations.