FastLloyd addresses the challenge of performing
A new DP clustering algorithm is introduced, which incorporates a radius constraint on clusters and uses relative updates to improve utility. This algorithm is integrated into a federated setting using a lightweight, secure aggregation protocol (Masked Secure Aggregation - MSA) that leverages the computational DP model. This allows intermediate differentially private updates to be published, significantly reducing the overhead of secure computation by performing expensive operations like assignments and divisions locally.
FastLloyd significantly outperforms previous work; It achieves up to a five-orders-of-magnitude speed-up in runtime compared to state-of-the-art secure federated
This repository implements the FastLloyd protocol described in the paper "FastLloyd: Federated, Accurate, Secure, and Tunable k-Means Clustering with Differential Privacy". FastLloyd addresses the challenging problem of collaborative clustering across multiple data owners without compromising privacy, through:
- A novel differentially private k-means algorithm with radius constraints
- A lightweight secure aggregation protocol for federated settings
- Python 3.8 or higher
- Open MPI (for multiparty communication)
- Required Python packages listed in
env.yml
- Clone the repository:
git clone https://github.com/D-Diaa/FastLloyd.git
cd FastLloyd- Create and activate the conda environment:
conda env create -f env.yml
conda activate fastlloyd├── configs/
│ ├── defaults.py # Default configuration settings, dataset definitions
│ └── params.py # Parameter class for clustering and privacy settings
│
├── data_io/
│ ├── comm.py # MPI communication wrapper with delay simulation
│ ├── data_handler.py # Functions for loading and processing datasets
│ └── fixed.py # Fixed-point arithmetic implementation
│
├── parties/
│ ├── client.py # Client implementations (masked and unmasked)
│ └── server.py # Server implementation with DP mechanisms
│
├── plots/
│ ├── ablation_plots.py # Visualization for ablation studies
│ ├── per_dataset.py # Dataset-specific result visualization
│ ├── scale_heatmap.py # Heatmap generation for scalability results
│ ├── synthetic_bar.py # Bar charts for synthetic dataset results
│ └── timing_analysis.py # Analysis of timing experiments
│
├── scripts/
│ ├── generator.R # R script for generating synthetic datasets
│ ├── experiment_runner.sh # Script for running accuracy and scale experiments
│ └── timing_runner.sh # Script for running timing experiments
│
├── utils/
│ ├── evaluations.py # Clustering quality evaluation metrics
│ └── utils.py # General utility functions
│
├── experiments.py # Main experiment runner
├── env.yml # Conda environment specification
└── README.md # Project documentation
FastLloyd supports multiple experiment types:
- Accuracy: Evaluate clustering quality across different privacy settings
python experiments.py --exp_type "accuracy"- Scale: Analyze scalability with dataset size, dimensions, and number of clusters
python experiments.py --exp_type "scale"- Timing: Measure communication and computation time
mpirun -np 3 python experiments.py --exp_type "timing"You can also use the provided scripts to run multiple experiment types:
bash scripts/experiment_runner.sh # For accuracy and scale experiments
bash scripts/timing_runner.sh # For timing experiments with varying numbers of clientsThe repository includes several visualization tools in the plots directory:
per_dataset.py: Creates performance visualizations for individual datasetsscale_heatmap.py: Generates heatmaps to analyze scalabilitysynthetic_bar.py: Creates bar plots comparing performance on synthetic datasetsablation_plots.py: Creates plots for ablation studiestiming_analysis.py: Analyzes and reports execution timing data
You can customize various aspects of the experiments through the argument parser in experiments.py:
python experiments.py --exp_type "test" --datasets "mnist" "adult" --method "diagonal_then_frac" --alpha 0.8 --post "fold" --results_folder "my_results"Key parameters include:
--exp_type: Type of experiment to run (accuracy, scale, timing, test)--datasets: Datasets to use for the experiment--method: Maximum distance method to use--alpha: Maximum distance parameter--post: Post-processing method for centroids--results_folder: Folder to store results
If you use FastLloyd in your research, please cite the paper:
@article{diaa2024fastlloyd,
title={FastLloyd: Federated, Accurate, Secure, and Tunable $ k $-Means Clustering with Differential Privacy},
author={Diaa, Abdulrahman and Humphries, Thomas and Kerschbaum, Florian},
journal={arXiv preprint arXiv:2405.02437},
year={2024}
}