End-to-end phishing website detection project built with Python, scikit-learn, MongoDB, FastAPI, and GitHub Actions.
The project trains a classifier on structured phishing features, saves a preprocessor and model to `final_models/`, exposes prediction through a FastAPI app, and stores timestamped pipeline outputs under `Artifacts/`.
- Loads phishing data from MongoDB into the training pipeline
- Validates input schema and train/test drift
- Applies preprocessing with a saved scikit-learn pipeline
- Trains multiple classification models and selects the best one by F1 score
- Saves artifacts for reuse in inference
- Serves predictions through a FastAPI endpoint
- Syncs pipeline outputs and saved models to S3 during API-triggered training
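The model-selection step in the list above (train several classifiers, keep the best by F1) can be sketched as follows; the candidate models and the helper name `select_best_model` are illustrative, not the repo's exact code.

```python
# Hedged sketch: fit each candidate, score it on the test split by F1,
# and keep the highest-scoring model. Candidates are illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def select_best_model(candidates, X_train, y_train, X_test, y_test):
    best_name, best_model, best_score = None, None, -1.0
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        score = f1_score(y_test, model.predict(X_test))
        if score > best_score:
            best_name, best_model, best_score = name, model, score
    return best_name, best_model, best_score

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}
```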
```
NetworkSecurity/
|-- .github/workflows/main.yml
|-- Artifacts/
|-- data_schema/schema.yaml
|-- final_models/
|-- logs/
|-- Network_Data/phisingData.csv
|-- networksecurity/
|   |-- cloud/s3_syncer.py
|   |-- components/
|   |-- constant/
|   |-- entity/
|   |-- exception/
|   |-- logging/logger.py
|   |-- pipeline/
|   `-- utils/
|-- prediction_output/
|-- templates/table.html
|-- app.py
|-- Dockerfile
|-- main.py
|-- push_data.py
|-- requirements.txt
`-- setup.py
```
Set these in .env for local development or as secrets/env vars in deployment:
| Variable | Required | Purpose |
|---|---|---|
| `MONGO_DB_URL` | Yes | Primary MongoDB connection string used by ingestion, training, and `push_data.py` |
| `MONGODB_URL_KEY` | Optional | Fallback MongoDB variable supported by `app.py` |
| `PORT` | Optional | FastAPI port, defaults to 8000 locally |
| `TRAIN_API_KEY` | Optional | If set, `/train` requires an `x-api-key` header |
| `AWS_ACCESS_KEY_ID` | For S3 sync/deploy | AWS credential used when training syncs artifacts to S3 |
| `AWS_SECRET_ACCESS_KEY` | For S3 sync/deploy | AWS credential used when training syncs artifacts to S3 |
| `AWS_REGION` | For S3 sync/deploy | AWS region for deployment and AWS CLI access |
- Create and activate a virtual environment.

```
python -m venv venv
venv\Scripts\Activate.ps1
```

- Install dependencies.

```
pip install -r requirements.txt
```

- Add a `.env` file.

```
MONGO_DB_URL="your-mongodb-connection-string"
PORT=8000
TRAIN_API_KEY=your-optional-train-key
```

If your MongoDB collection is empty, push the local CSV dataset into MongoDB first:
```
python push_data.py
```

This reads `Network_Data/phisingData.csv` and inserts it into:

- database: `NetworkSecurity`
- collection: `NetworkData`
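The ingest step above amounts to reading the CSV, converting rows to JSON-like records, and inserting them into the `NetworkSecurity`/`NetworkData` collection. A minimal sketch, assuming `pymongo` and illustrative helper names (not the repo's exact `push_data.py`):

```python
# Read the CSV and push its rows into MongoDB. Helper names are illustrative;
# pymongo is assumed for the insert step.
import json

import pandas as pd

def csv_to_records(csv_path: str) -> list[dict]:
    # Convert DataFrame rows into the JSON-like records Mongo stores.
    df = pd.read_csv(csv_path)
    return json.loads(df.to_json(orient="records"))

def push_records(records: list[dict], mongo_url: str) -> int:
    from pymongo import MongoClient  # lazy import so csv_to_records works without it
    client = MongoClient(mongo_url)
    result = client["NetworkSecurity"]["NetworkData"].insert_many(records)
    return len(result.inserted_ids)
```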
```
python main.py
```

This runs the component flow directly:

- Data ingestion
- Data validation
- Data transformation
- Model training

Outputs are written under `Artifacts/<timestamp>/` and `final_models/`.
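The `Artifacts/<timestamp>/` layout can be produced with a small helper like this sketch; the exact timestamp format the repo uses is an assumption:

```python
from datetime import datetime
from pathlib import Path

def artifact_dir(root: str = "Artifacts") -> Path:
    # One directory per pipeline run, named by run timestamp
    # (the format here is an assumed example, e.g. 05_21_2024_14_30_05).
    stamp = datetime.now().strftime("%m_%d_%Y_%H_%M_%S")
    return Path(root) / stamp
```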
Start the app:
```
python app.py
```

Then call the training endpoint:

```powershell
Invoke-WebRequest `
  -Method Post `
  -Uri "http://127.0.0.1:8000/train" `
  -Headers @{ "x-api-key" = "your-optional-train-key" }
```

Notes:

- If `TRAIN_API_KEY` is not set, the `x-api-key` header is not required.
- API-triggered training uses `TrainingPipeline.run_pipeline()`, which also syncs:
  - `Artifacts/<timestamp>/` to `s3://netsecmlops/artifact/<timestamp>`
  - `final_models/` to `s3://netsecmlops/final_models/<timestamp>`
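The sync step can be sketched with the AWS CLI's `aws s3 sync`; whether `s3_syncer.py` shells out to the CLI or uses boto3 is an assumption, but the bucket paths match the note above:

```python
import subprocess

def build_sync_command(folder: str, s3_uri: str) -> list[str]:
    # `aws s3 sync` copies new/changed files from the local folder to the bucket.
    return ["aws", "s3", "sync", folder, s3_uri]

def sync_folder_to_s3(folder: str, s3_uri: str) -> None:
    subprocess.run(build_sync_command(folder, s3_uri), check=True)

# Example targets, matching the note above:
# sync_folder_to_s3("Artifacts/<timestamp>", "s3://netsecmlops/artifact/<timestamp>")
# sync_folder_to_s3("final_models", "s3://netsecmlops/final_models/<timestamp>")
```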
Start the FastAPI app:
```
python app.py
```

Open the docs UI:

```
http://127.0.0.1:8000/docs
```

Or upload a CSV directly to `/predict`:

```
curl -X POST "http://127.0.0.1:8000/predict" ^
  -H "accept: text/html" ^
  -H "Content-Type: multipart/form-data" ^
  -F "file=@valid_data/test.csv"
```

Behavior:

- The app loads `final_models/preprocessor.pkl` and `final_models/model.pkl` once at startup
- Predictions are rendered as an HTML table
- A CSV copy is saved to `prediction_output/output.csv`
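The prediction behavior above (transform with the saved preprocessor, predict, keep the result alongside the input) reduces to a small function; the `predicted_column` name is an assumed example, not necessarily the column the app emits:

```python
import pandas as pd

def predict_dataframe(df: pd.DataFrame, preprocessor, model) -> pd.DataFrame:
    # Transform with the saved preprocessor, predict, and attach the result
    # as a new column; the column name is an assumed example.
    out = df.copy()
    out["predicted_column"] = model.predict(preprocessor.transform(df))
    return out
```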
The pipeline works in this order:

1. `data_ingestion.py`: reads records from MongoDB, exports a feature-store CSV, and performs a stratified train/test split.
2. `data_validation.py`: validates expected columns from `data_schema/schema.yaml` and writes a drift report.
3. `data_transformation.py`: applies a `KNNImputer` preprocessing pipeline and saves transformed arrays plus the preprocessor.
4. `model_trainer.py`: trains several classifiers, selects the best one by F1 score, logs metrics to MLflow/DagsHub, and saves the final model.
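The transformation step can be sketched as a scikit-learn `Pipeline` around `KNNImputer`, fitted on training features and then persisted for inference; `n_neighbors=3` is an assumed value, not necessarily the repo's setting:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

def build_preprocessor() -> Pipeline:
    # KNN-based imputation: missing values are filled from the nearest rows.
    return Pipeline([("imputer", KNNImputer(n_neighbors=3))])

# Fit on training features, then persist next to the model (e.g. with pickle):
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 6.0], [2.0, np.nan]])
preprocessor = build_preprocessor().fit(X_train)
X_transformed = preprocessor.transform(X_train)  # NaNs replaced by neighbor averages
```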
Main generated outputs:
- `Artifacts/<timestamp>/...`
- `final_models/model.pkl`
- `final_models/preprocessor.pkl`
- `prediction_output/output.csv`
- `logs/<timestamp>.log`
GitHub Actions currently handles:
- dependency installation
- Python syntax/compile checks
- `pytest` execution when a `tests/` directory exists
- Docker build and push to Amazon ECR
- container deployment on the self-hosted runner
Workflow file: `.github/workflows/main.yml`
- The app expects `final_models/` artifacts to exist before prediction.
- `final_models/` is currently present in the repository and used directly by the app.
- The repository also contains historical artifact and cache files from previous local runs.