FastLloyd: Federated, Accurate, Secure, and Tunable k-Means Clustering with Differential Privacy

FastLloyd addresses the challenge of performing $k$-means clustering in a horizontally federated setting while preserving data privacy. Existing methods suffer from either high computational overhead or significantly degrade clustering utility. FastLloyd overcomes these limitations through a novel federated protocol that enhances both the differential privacy mechanism and the secure computation components.

A new DP clustering algorithm is introduced, which incorporates a radius constraint on clusters and uses relative updates to improve utility. This algorithm is integrated into a federated setting using a lightweight, secure aggregation protocol (Masked Secure Aggregation - MSA) that leverages the computational DP model. This allows intermediate differentially private updates to be published, significantly reducing the overhead of secure computation by performing expensive operations like assignments and divisions locally.

FastLloyd significantly outperforms previous work; It achieves up to a five-orders-of-magnitude speed-up in runtime compared to state-of-the-art secure federated $k$-means approaches, while also reducing communication by up to seven orders of magnitude. Furthermore, FastLloyd not only matches the utility of central DP models but improves upon state-of-the-art DP $k$-means algorithms, especially in higher dimensions and for a larger number of clusters, demonstrating up to an 88% reduction in clustering error.

Overview

This repository implements the FastLloyd protocol described in the paper "FastLloyd: Federated, Accurate, Secure, and Tunable k-Means Clustering with Differential Privacy". FastLloyd addresses the challenging problem of collaborative clustering across multiple data owners without compromising privacy, through:

A novel differentially private k-means algorithm with radius constraints
A lightweight secure aggregation protocol for federated settings

Installation

Requirements

Python 3.8 or higher
Open MPI (for multiparty communication)
Required Python packages listed in env.yml

Setup

Clone the repository:

git clone https://github.com/D-Diaa/FastLloyd.git
cd FastLloyd

Create and activate the conda environment:

conda env create -f env.yml
conda activate fastlloyd

Repository Structure

├── configs/                                                              
│   ├── defaults.py       # Default configuration settings, dataset definitions
│   └── params.py         # Parameter class for clustering and privacy settings
│                                                                                                                                                    
├── data_io/                                                                                                                                          
│   ├── comm.py           # MPI communication wrapper with delay simulation
│   ├── data_handler.py   # Functions for loading and processing datasets
│   └── fixed.py          # Fixed-point arithmetic implementation
│                                                                                                                                                    
├── parties/                                                                                                                                          
│   ├── client.py         # Client implementations (masked and unmasked)
│   └── server.py         # Server implementation with DP mechanisms
│                                                                                                                                                    
├── plots/                                                                                                                                           
│   ├── ablation_plots.py # Visualization for ablation studies
│   ├── per_dataset.py    # Dataset-specific result visualization
│   ├── scale_heatmap.py  # Heatmap generation for scalability results
│   ├── synthetic_bar.py  # Bar charts for synthetic dataset results
│   └── timing_analysis.py # Analysis of timing experiments
│                                                                                                                                                    
├── scripts/                                                                                                                                         
│   ├── generator.R        # R script for generating synthetic datasets
│   ├── experiment_runner.sh # Script for running accuracy and scale experiments
│   └── timing_runner.sh   # Script for running timing experiments
│                                                                                                                                                    
├── utils/                                                                                                                                           
│   ├── evaluations.py    # Clustering quality evaluation metrics
│   └── utils.py          # General utility functions
│                                                                                                                                                    
├── experiments.py        # Main experiment runner
├── env.yml               # Conda environment specification
└── README.md             # Project documentation

Usage

Running Experiments

FastLloyd supports multiple experiment types:

Accuracy: Evaluate clustering quality across different privacy settings

python experiments.py --exp_type "accuracy"

Scale: Analyze scalability with dataset size, dimensions, and number of clusters

python experiments.py --exp_type "scale"

Timing: Measure communication and computation time

mpirun -np 3 python experiments.py --exp_type "timing"

You can also use the provided scripts to run multiple experiment types:

bash scripts/experiment_runner.sh  # For accuracy and scale experiments
bash scripts/timing_runner.sh      # For timing experiments with varying numbers of clients

Visualization

The repository includes several visualization tools in the plots directory:

per_dataset.py: Creates performance visualizations for individual datasets
scale_heatmap.py: Generates heatmaps to analyze scalability
synthetic_bar.py: Creates bar plots comparing performance on synthetic datasets
ablation_plots.py: Creates plots for ablation studies
timing_analysis.py: Analyzes and reports execution timing data

Customization

You can customize various aspects of the experiments through the argument parser in experiments.py:

python experiments.py --exp_type "test" --datasets "mnist" "adult" --method "diagonal_then_frac" --alpha 0.8 --post "fold" --results_folder "my_results"

Key parameters include:

--exp_type: Type of experiment to run (accuracy, scale, timing, test)
--datasets: Datasets to use for the experiment
--method: Maximum distance method to use
--alpha: Maximum distance parameter
--post: Post-processing method for centroids
--results_folder: Folder to store results

Citation

If you use FastLloyd in your research, please cite the paper:

@article{diaa2024fastlloyd,
  title={FastLloyd: Federated, Accurate, Secure, and Tunable $ k $-Means Clustering with Differential Privacy},
  author={Diaa, Abdulrahman and Humphries, Thomas and Kerschbaum, Florian},
  journal={arXiv preprint arXiv:2405.02437},
  year={2024}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FastLloyd: Federated, Accurate, Secure, and Tunable k-Means Clustering with Differential Privacy

Overview

Installation

Requirements

Setup

Repository Structure

Usage

Running Experiments

Visualization

Customization

Citation

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
configs		configs
data_io		data_io
parties		parties
plots		plots
scripts		scripts
utils		utils
README.md		README.md
env.yml		env.yml
experiments.py		experiments.py

D-Diaa/FastLloyd

Folders and files

Latest commit

History

Repository files navigation

FastLloyd: Federated, Accurate, Secure, and Tunable k-Means Clustering with Differential Privacy

Overview

Installation

Requirements

Setup

Repository Structure

Usage

Running Experiments

Visualization

Customization

Citation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages