This project implements a robust machine learning pipeline to predict Chronic Kidney Disease (CKD) using 24 clinical features (e.g., Albumin, Hemoglobin, Specific Gravity).
Unlike standard textbook examples, this project focuses on real-world data engineering challenges: handling corrupt ARFF formats, imputing missing clinical data using k-Nearest Neighbors (KNN) to preserve biological relationships, and building an automated inference engine for deployment.
Key Achievement: The model achieves >98% Accuracy with a focus on high Recall, ensuring that potential CKD cases are not missed (False Negatives minimized).
- Challenge: The raw UCI dataset contained corrupt formats, mixed encodings (bytes/strings), and inconsistent typos (e.g.,
\tno,yes). - Solution: Built a custom parser to bypass standard library errors, utilizing Regex for deep cleaning and standardizing 24 feature columns.
- Methodology: Instead of simple mean/median imputation (which distorts clinical variance), I implemented KNNImputer.
- Logic: Missing values are estimated based on the "nearest" patients in the n-dimensional feature space, preserving the underlying medical correlations.
- Algorithm: k-Nearest Neighbors (k-NN).
-
Scaling: Applied Min-Max Scaling to normalize all features to
$[0, 1]$ , preventing features with larger magnitudes (e.g., Blood Pressure) from dominating the Euclidean distance calculation. -
Tuning: Used the Elbow Method to scientifically determine the optimal
$k$ value ($k=5$ ) to balance bias and variance.
The project is designed as a modular pipeline following standard data science lifecycles:
| Notebook | Description | Key Tech |
|---|---|---|
| 01_EDA_and_Cleaning | Raw ARFF loading, byte decoding, and typo resolution. | Pandas, Regex |
| 02_Feature_Engineering | Label Encoding, KNN Imputation, and Scaling. | KNNImputer, MinMaxScaler |
| 03_Model_Tuning_and_Training | Cross-Validation and Hyperparameter tuning (Elbow Plot). | GridSearch, Elbow Method |
| 04_Final_Evaluation_and_Inference | ROC-AUC analysis, Confusion Matrix, and Inference. | Seaborn, Scikit-learn |
| 05_Executive_Dashboard | High-level summary and interactive prediction tool. | Data Visualization |
- Optimal K: 5
- Accuracy: ~98%
- ROC-AUC Score: 0.99
- Key Predictors: According to Permutation Importance, the most critical features for diagnosis are:
- Hemoglobin (hemo)
- Specific Gravity (sg)
- Albumin (al)
- Red Blood Cell Count (rc)
To run this project, install the required dependencies:
pip install pandas numpy scikit-learn seaborn matplotlib scipyMobin Zamani Data Scientist & Machine Learning Engineer