
Chapter 11: Data Version Control

Use DVC to version datasets and build reproducible ML pipelines for your data science projects.

Setup

# From project root
uv sync --group chapter11

Complete DVC Workflow

This chapter demonstrates a complete customer churn prediction project using DVC for:

Data versioning: Track and switch between dataset versions
Remote storage: Share large datasets efficiently
Reproducible pipelines: Automate ML workflows
Experiment tracking: Compare model performance across runs

Tutorial: Customer Churn Prediction with DVC

1. Initialize the Project

# Initialize DVC (run once per project)
uv run dvc init
git add .dvc/
git commit -m "Initialize DVC"

# Track initial dataset
uv run dvc add data/raw/customer_churn.csv
git add data/raw/customer_churn.csv.dvc data/raw/.gitignore
git commit -m "Add customer churn dataset"

What just happened?

DVC created the .dvc/ directory, which holds its configuration
customer_churn.csv.dvc records metadata about your dataset (content hash, size, and path)
customer_churn.csv was added to data/raw/.gitignore, so Git ignores the raw file
Git tracks only this metadata, not the large file itself
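
To make the pointer file less abstract, here is a minimal sketch of the kind of metadata a .dvc file records. It assumes DVC identifies a file by the MD5 digest of its contents and stores that digest with the size and file name. The names `file_md5` and `pointer_metadata` are hypothetical, and the real .dvc file is YAML written by DVC itself, not by this code.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pointer_metadata(path: str) -> dict:
    """Return the kind of metadata a .dvc pointer file records for one file.

    Illustrative only: the field names mirror what a pointer file stores,
    but DVC writes the real file.
    """
    target = Path(path)
    return {
        "md5": file_md5(target),        # content hash used to identify the data
        "size": target.stat().st_size,  # size in bytes
        "path": target.name,            # file name next to the .dvc file
    }
```

Because the hash depends only on the file contents, two identical copies of the dataset produce the same pointer, and any change to the data produces a different one. That is what lets Git version the small pointer instead of the data.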

2. Build Reproducible ML Pipeline

The pipeline, defined in dvc.yaml, has four stages: prepare → featurize → train → evaluate

# Run the complete pipeline
uv run dvc repro

Pipeline benefits:

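Only stages whose dependencies changed are re-run; unchanged stages are skipped
Stage outputs are tracked by DVC and their hashes recorded in dvc.lock
Results can be reproduced on any machine with the same code, data, and parameters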
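
The decision to skip or re-run a stage can be pictured as a hash comparison. The sketch below is a simplified model, not DVC's implementation: a plain dict stands in for the hashes DVC records in dvc.lock, and `stage_needs_rerun` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Content hash of a single dependency file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def stage_needs_rerun(deps: list[Path], recorded: dict[str, str]) -> bool:
    """Return True if any dependency is new or differs from its recorded hash.

    `recorded` maps a dependency path to the hash seen on the last run
    (a stand-in for the lock file).
    """
    return any(recorded.get(str(dep)) != file_md5(dep) for dep in deps)
```

A stage whose dependencies all match their recorded hashes can be skipped. Editing one input file changes its hash, so the check returns True for that stage.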

3. Work with Data Versions

Scenario: Your team receives updated customer data with additional features.

# Simulate new data arrival (example)
# In practice, you'd receive new data files
echo "customer_id,age,tenure,monthly_charges,total_charges,churn
1001,25,12,65.5,786.0,0
1002,45,24,89.2,2140.8,1
1003,32,6,45.0,270.0,0" > data/raw/customer_churn_v2.csv

# Replace with new version
cp data/raw/customer_churn_v2.csv data/raw/customer_churn.csv

# Track new version
uv run dvc add data/raw/customer_churn.csv
git add data/raw/customer_churn.csv.dvc
git commit -m "Update to customer data v2"

# Re-run pipeline with new data
uv run dvc repro

Switch between data versions:

# Go back to version 1
git checkout HEAD~1
uv run dvc checkout

# Return to version 2
git checkout main
uv run dvc checkout
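
After switching between versions, a quick schema comparison confirms which features the update added. This is a minimal sketch using only the standard library. The v2 header is the sample data from this tutorial, while the v1 header is a hypothetical stand-in because the original file's columns are not shown here.

```python
import csv
import io


def header_of(csv_text: str) -> list[str]:
    """Return the column names from the first row of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


def added_columns(old_csv: str, new_csv: str) -> list[str]:
    """Columns present in the new version but not the old, in file order."""
    old = set(header_of(old_csv))
    return [col for col in header_of(new_csv) if col not in old]


# Hypothetical v1 header (assumption); v2 matches the tutorial's sample rows.
V1 = "customer_id,age,tenure,churn\n1001,25,12,0\n"
V2 = (
    "customer_id,age,tenure,monthly_charges,total_charges,churn\n"
    "1001,25,12,65.5,786.0,0\n"
)

print(added_columns(V1, V2))
```

In practice you would read the two files from disk after each `dvc checkout` instead of using inline strings.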

4. Remote Storage and Team Collaboration

Set up local remote storage (for testing):

# Configure remote storage
mkdir -p /tmp/dvc-storage
uv run dvc remote add -d storage /tmp/dvc-storage
git add .dvc/config
git commit -m "Configure DVC remote"

# Push data to remote
uv run dvc push
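
Conceptually, pushing to a directory remote means copying over the cached objects the remote does not have yet. The sketch below models that idea under stated assumptions: objects are files named by their content hash, so an existing name implies identical content. `push_missing` is a hypothetical helper, and the directory layout is illustrative, not DVC's actual cache format.

```python
import shutil
from pathlib import Path


def push_missing(cache_dir: Path, remote_dir: Path) -> list[str]:
    """Copy every file under cache_dir that remote_dir lacks.

    Returns the relative paths that were copied, in sorted order.
    """
    copied = []
    for src in sorted(p for p in cache_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(cache_dir)
        dst = remote_dir / rel
        if not dst.exists():  # same name means same content, so skip it
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied
```

Running it a second time copies nothing, which mirrors why repeated pushes of unchanged data are cheap.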

Team member workflow:

# New team member joins
git clone <repository>
cd chapter11_data_version_control

# Install dependencies
uv sync --group chapter11

# Pull data from remote
uv run dvc pull

# Reproduce results
uv run dvc repro

5. Experiment with Parameters

Modify hyperparameters in params.yaml:

model:
  n_estimators: 200  # changed from 100
  max_depth: 10      # changed from 5
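
To see exactly which parameters an experiment changed, nested settings can be flattened into dotted keys and compared, similar in spirit to what `dvc params diff` reports. This is a minimal sketch, not DVC's code: `flatten` and `params_diff` are hypothetical helpers, and the dicts stand in for the parsed params.yaml before and after the edit above.

```python
def flatten(params: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. {'model.max_depth': 5}."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def params_diff(old: dict, new: dict) -> dict:
    """Map each changed key to an (old, new) pair; missing values are None."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        key: (old_flat.get(key), new_flat.get(key))
        for key in sorted(old_flat.keys() | new_flat.keys())
        if old_flat.get(key) != new_flat.get(key)
    }


# Values taken from the params.yaml comments in this tutorial.
OLD = {"model": {"n_estimators": 100, "max_depth": 5}}
NEW = {"model": {"n_estimators": 200, "max_depth": 10}}

print(params_diff(OLD, NEW))
```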

Run experiment:

uv run dvc repro

Compare results:

uv run dvc metrics show
uv run dvc metrics diff HEAD~1
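
The comparison itself is a per-metric subtraction between two metrics files. The sketch below assumes metrics are stored as flat JSON objects of numbers. The metric names and values are placeholders for illustration only, not results from this project, and `metrics_delta` is a hypothetical helper.

```python
import json


def metrics_delta(old_json: str, new_json: str) -> dict:
    """Return new minus old for every metric present in both runs."""
    old, new = json.loads(old_json), json.loads(new_json)
    return {
        name: round(new[name] - old[name], 4)  # rounding hides float noise
        for name in sorted(old.keys() & new.keys())
    }


# Placeholder numbers (assumption), chosen only to show the calculation.
OLD_METRICS = '{"accuracy": 0.80, "f1": 0.70}'
NEW_METRICS = '{"accuracy": 0.85, "f1": 0.72}'

print(metrics_delta(OLD_METRICS, NEW_METRICS))
```

A positive delta means the metric went up in the new run. Whether that is an improvement depends on the metric.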

6. Pipeline Management

Useful DVC commands:

# Visualize pipeline
uv run dvc dag

# Check pipeline status
uv run dvc status

# Force re-run of the train stage and the stages it depends on
# (add --single-item to re-run only train)
uv run dvc repro --force train
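
The pipeline graph also determines execution order and which stages are affected by a change. The sketch below models the four stages as a dependency graph using the standard library (Python 3.9+). The linear chain follows the stage order given in this tutorial; `execution_order` and `downstream_of` are hypothetical helpers, not DVC functions.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
PIPELINE = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train"},
}


def execution_order(pipeline: dict) -> list:
    """Order stages so that every stage runs after its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


def downstream_of(pipeline: dict, changed: str) -> list:
    """Stages invalidated by a change: the stage itself and all that follow."""
    stale = {changed}
    order = execution_order(pipeline)
    for stage in order:
        if pipeline[stage] & stale:  # depends on something already stale
            stale.add(stage)
    return [stage for stage in order if stage in stale]


print(execution_order(PIPELINE))
print(downstream_of(PIPELINE, "train"))
```

Changing the train stage invalidates train and evaluate but leaves prepare and featurize untouched, which is why editing a late stage is cheaper than editing an early one.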

Advanced: Cloud Storage (Optional)

AWS S3 setup:

# Configure AWS credentials first (the S3 remote also needs DVC's S3 support, e.g. the dvc[s3] extra)
aws configure

# Add S3 remote
uv run dvc remote add -d s3remote s3://my-dvc-bucket/dvc-store
git add .dvc/config
git commit -m "Add S3 remote"

# Push to S3
uv run dvc push

Key Benefits Demonstrated

Reproducibility: Anyone with the repository and the DVC remote can recreate the same results
Efficiency: Caching skips stages whose inputs have not changed
Collaboration: Team members work with identical data
Experimentation: Parameters are easy to change and results easy to compare
Storage: The Git repository stays lightweight while large datasets are tracked by DVC

Why This Matters

Data version control makes results reproducible, keeps pipeline runs efficient, and lets teams collaborate on large datasets without bloating the Git repository.

