A machine learning pipeline designed to identify fraudulent transactions with high precision and recall. This project focuses on handling extreme class imbalance and preventing data leakage through a robust pipeline architecture.
The project is organised as follows:
credit-card-fraud-ml/
│
├── data/ # Source CSV data (e.g. creditcard.csv)
├── docs/ # Technical report and documentation
├── figures/ # Exported Precision-Recall curves and EDA plots
├── models/ # Serialised champion model (.pkl)
├── notebooks/ # EDA and benchmarking experiments
├── src/ # Python scripts and utilities
├── requirements.txt
└── README.md
- Anti-Leakage Pipeline: Utilises `imblearn.pipeline.Pipeline` to ensure that the `RobustScaler` is fitted strictly on training data, eliminating look-ahead bias.
- Cost-Sensitive Learning: Addresses the extreme 0.17% fraud imbalance by utilising `scale_pos_weight` in XGBoost and `class_weight='balanced'` in Linear models. This proved more effective than naive scaling during experimentation.
- Metric Focus: Prioritises AUPRC (Area Under the Precision-Recall Curve) to ensure high detection (Recall) while minimising false alarms (Precision).
- Model Suite: Compares Logistic Regression, Random Forest, Hist-Gradient Boosting, XGBoost, and Linear SVM.
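
The points above can be sketched roughly as follows. This is a minimal illustration, not the project's actual training code: the data is synthetic (generated with `make_classification` rather than read from `data/`), Logistic Regression stands in for the full model suite, and `average_precision_score` is used as the usual scikit-learn estimate of AUPRC. The import falls back to scikit-learn's `Pipeline` if `imbalanced-learn` is not installed, which is an assumption made only so the sketch runs anywhere.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

try:
    # The pipeline class named in this README.
    from imblearn.pipeline import Pipeline
except ImportError:
    # Assumption: fall back to scikit-learn's Pipeline, which exposes the
    # same fit/predict interface for this scaler + classifier case.
    from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the real transaction data.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)

# Split first, so nothing about the test rows can reach the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler lives inside the pipeline, so fit() learns its statistics
# from the training rows only and merely applies them at predict time.
pipeline = Pipeline([
    ("scaler", RobustScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(X_train, y_train)

# Ratio of negatives to positives in the training set. A value like this
# is what would be passed as XGBClassifier(scale_pos_weight=...).
n_pos = int(np.sum(y_train == 1))
n_neg = int(np.sum(y_train == 0))
scale_pos_weight = n_neg / n_pos

# Score with average precision, which summarises the Precision-Recall curve.
scores = pipeline.predict_proba(X_test)[:, 1]
auprc = average_precision_score(y_test, scores)
print(f"scale_pos_weight={scale_pos_weight:.1f}, AUPRC={auprc:.3f}")
```

Keeping the scaler inside the pipeline also means cross-validation refits it on each training fold, so the leakage guarantee holds there as well.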
- Install dependencies: `pip install -r requirements.txt`
- Explore the Research: Review `notebooks/eda.ipynb` for data insights and `notebooks/experiments.ipynb` to see the benchmarking process and leakage analysis.
- Train the Champion Model: Execute the training suite to benchmark all models and serialise the best performer to the `models/` directory: `python src/train.py`
- Verify Results: Run the final verification script to load the serialised pipeline and test it against the held-out dataset. This will also save a final verification plot to `figures/`: `python src/test.py`