This project builds a complete customer churn prediction pipeline for a telecom company using the IBM Telco Customer Churn dataset (7,043 customers, 21 features). The goal is to identify which customers are at risk of leaving and understand the key drivers behind churn — delivering actionable insights through machine learning and an interactive Power BI dashboard.
This mirrors a real-world data analyst/scientist workflow: raw data → cleaning → SQL analysis → ML modelling → business dashboard.
Telecom companies lose millions annually to customer churn. Acquiring a new customer costs 5–7× more than retaining an existing one. By predicting which customers are likely to churn before they leave, the business can intervene with targeted retention offers — saving revenue and improving customer lifetime value.
Key questions this project answers:
- What is the overall churn rate and which segments are most at risk?
- Which factors most strongly predict churn?
- Which specific customers should the retention team contact first?
```
customer-churn-analysis/
├── data/                         # Raw and processed data (not tracked by git)
│   ├── churn_cleaned.csv
│   ├── churn_summary.csv
│   └── churn_predictions.csv
├── notebooks/
│   ├── 01_eda.ipynb              # Exploratory data analysis
│   ├── 02_sql.ipynb              # SQL analysis with SQLite
│   └── 03_model_comparison.ipynb # ML model training & comparison
├── sql/
│   └── queries.sql               # All SQL queries documented
├── .gitignore
└── README.md
```
IBM Telco Customer Churn — publicly available via Kaggle
- 7,043 customer records
- 21 features including contract type, tenure, monthly charges, internet service and support options
- Target variable: `Churn` (Yes / No)
- Churn rate: 26.5%
| Layer | Tools |
|---|---|
| Data cleaning & EDA | Python, pandas, matplotlib, seaborn |
| Data storage & querying | SQLite, SQL (window functions, views, CASE WHEN) |
| Machine learning | scikit-learn, XGBoost, LightGBM |
| Explainability | Feature importance, correlation analysis |
| Visualisation | Power BI (slicers, KPI cards, drill-through) |
| Version control | Git, GitHub |
- Loaded and inspected 7,043 rows across 21 columns
- Fixed data quality issue: `TotalCharges` stored as object instead of numeric
- Analysed churn distribution, numerical feature distributions and categorical breakdowns
- Identified class imbalance: 73% retained vs 27% churned
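The `TotalCharges` fix can be sketched as below on a toy frame (the real notebook reads the Kaggle CSV; values here are illustrative). Blank strings for brand-new customers force pandas to read the column as object dtype:

```python
import pandas as pd

# Toy frame mimicking the raw Kaggle file: TotalCharges arrives as object
# dtype because new customers have a blank string instead of a number
df = pd.DataFrame({
    "tenure": [0, 12, 38],
    "TotalCharges": [" ", "845.5", "2570.2"],
})

# Coerce to numeric; blanks become NaN instead of raising an error
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# New customers (tenure == 0) have accrued no charges yet
df["TotalCharges"] = df["TotalCharges"].fillna(0)
print(df["TotalCharges"].tolist())
```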
Key findings:
- Churned customers average 18 months tenure vs 38 months for retained customers
- Churned customers pay ~$15/month more on average
- Customers without TechSupport churn significantly more
- Loaded cleaned data into a SQLite database using Python
- Wrote analytical queries using `CASE WHEN`, `GROUP BY`, `ROUND` and window functions
- Created reusable summary views for Power BI consumption
Queries written:
- Overall churn rate calculation
- Churn rate by contract type, internet service and tenure bucket
- `RANK() OVER (PARTITION BY ...)` to rank customers by monthly charges within contract groups
- High-risk customer segment identification
- Aggregated summary view for dashboard use
Top SQL finding: Month-to-month contract customers churn at 43% vs only 3% for two-year contracts.
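The SQLite workflow can be sketched as follows: load a frame with `to_sql`, then run the same style of `CASE WHEN` / `GROUP BY` / `ROUND` query used in `sql/queries.sql`. The four-row frame here is a stand-in for the cleaned data, so the rates it prints are illustrative only:

```python
import sqlite3
import pandas as pd

# Toy slice of the cleaned data (column names follow the Kaggle schema)
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Two year", "Two year"],
    "Churn":    ["Yes",            "No",             "No",       "No"],
})

con = sqlite3.connect(":memory:")
df.to_sql("churn", con, index=False)

# Churn rate per contract type using CASE WHEN + ROUND
query = """
SELECT Contract,
       ROUND(100.0 * SUM(CASE WHEN Churn = 'Yes' THEN 1 ELSE 0 END)
             / COUNT(*), 1) AS churn_rate_pct
FROM churn
GROUP BY Contract
ORDER BY churn_rate_pct DESC;
"""
result = pd.read_sql(query, con)
print(result)
```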
Three models trained and compared head-to-head:
| Model | Accuracy | AUC Score | F1 (Churn) |
|---|---|---|---|
| Logistic Regression | ~81% | ~0.84 | ~0.60 |
| Random Forest | ~85% | ~0.88 | ~0.65 |
| XGBoost | ~87% | ~0.91 | ~0.69 |
Best model: XGBoost
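The head-to-head comparison loop can be sketched like this. Synthetic data stands in for the engineered churn features, and only the two scikit-learn models are shown; the notebook adds XGBoost and LightGBM to the same dict with the same fit/score pattern:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the project's 73/27 class split
X, y = make_classification(n_samples=2000, weights=[0.73], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]   # churn probability
    preds = model.predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, preds):.2f} "
          f"auc={roc_auc_score(y_te, proba):.2f} "
          f"f1={f1_score(y_te, preds):.2f}")
```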
Top churn drivers identified:
- Contract type (month-to-month)
- Tenure (short tenure = higher risk)
- Monthly charges (higher charges = higher risk)
- Internet service type (Fiber optic)
- Absence of TechSupport
Every customer was scored with a churn probability and assigned to a risk segment: Low / Medium / High risk — exported for Power BI.
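The scoring-and-bucketing step can be sketched with `pd.cut`. The probabilities below are hypothetical model outputs and the 0.3 / 0.6 cut-offs are illustrative, not the project's actual thresholds:

```python
import pandas as pd

# Hypothetical churn probabilities from the fitted model
scores = pd.Series([0.05, 0.22, 0.48, 0.81, 0.93], name="churn_prob")

# Bucket into the three risk segments exported for Power BI
segments = pd.cut(scores, bins=[0, 0.3, 0.6, 1.0],
                  labels=["Low", "Medium", "High"])
out = pd.DataFrame({"churn_prob": scores, "risk_segment": segments})
print(out)
```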
- KPI cards: total churn rate, high-risk customer count, average monthly charges
- Churn breakdown by contract type, tenure group and internet service
- Customer risk segment distribution
- Slicers: filter by contract type, risk segment and internet service
1. Clone the repository

```bash
git clone https://github.com/Thesineo/customer-churn-analysis.git
cd customer-churn-analysis
```

2. Create a virtual environment

```bash
python3 -m venv venv
source venv/bin/activate
```

3. Install dependencies

```bash
pip install pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm jupyter ipykernel
```

4. Download the dataset

Download from Kaggle and place `WA_Fn-UseC_-Telco-Customer-Churn.csv` in the `data/` folder.

5. Run notebooks in order

`01_eda.ipynb` → `02_sql.ipynb` → `03_model_comparison.ipynb`
By applying XGBoost with SHAP-informed feature selection, the model identifies high-risk customers with 87% accuracy and an AUC of 0.91. The top retention strategy recommendation: target month-to-month customers in their first 12 months with tenure-based loyalty incentives and tech support upgrades — this segment represents 43% churn risk and the highest revenue loss exposure.
- End-to-end data pipeline design from raw CSV to deployed model
- Real-world SQL skills: window functions, CTEs, views and analytical aggregations
- Handling class imbalance in ML with `class_weight` and threshold tuning
- Model comparison methodology: why AUC and F1 matter more than accuracy for imbalanced data
- Translating ML outputs into business recommendations, not just metrics
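The imbalance-handling approach listed above can be sketched as below, on synthetic data with roughly the churn target's 73/27 split: `class_weight="balanced"` re-weights the loss by inverse class frequency, and threshold tuning sweeps the decision cut-off instead of defaulting to 0.5:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data (~27% positives, like the churn target)
X, y = make_classification(n_samples=2000, weights=[0.73], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights the minority (churn) class
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Threshold tuning: pick the cut-off that maximises F1 on held-out data
proba = clf.predict_proba(X_te)[:, 1]
thresholds = np.linspace(0.1, 0.9, 17)
f1s = [f1_score(y_te, proba >= t) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(f"best threshold={best:.2f}, F1={max(f1s):.2f}")
```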
- Phase 1 — EDA with Python & pandas
- Phase 2 — SQL analysis with SQLite
- Phase 3 — Model comparison (Logistic Regression, Random Forest, XGBoost)
- Phase 4 — Power BI dashboard (in progress)
Built by Aniket Nerali