A scalable movie recommendation system built with Apache Spark and MLlib, using collaborative filtering (the ALS algorithm). The system includes a FastAPI backend and a Streamlit dashboard, and is fully containerized with Docker.
- Apache Spark 3.5.0 - Distributed data processing and machine learning
- PySpark - Python API for Spark
- MLlib ALS - Collaborative filtering algorithm for recommendations
- FastAPI - REST API backend
- Streamlit - Interactive web dashboard
- Docker & Docker Compose - Containerization
- MovieLens 25M Dataset - 25M ratings across 62K movies
- API: FastAPI, Uvicorn, Pydantic, NumPy
- Dashboard: Streamlit, Pandas, Plotly, Requests
- ML/Processing: PySpark, Pandas, NumPy, Matplotlib
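To give a feel for what the ALS collaborative filtering listed above does, here is a minimal pure-Python sketch of alternating least squares with a single latent factor. It is a simplified illustration, not MLlib's implementation (MLlib uses higher ranks, distributed solvers, and its own regularization scaling). The function name `rank1_als` and the toy ratings are assumptions made up for this example.

```python
import random

def rank1_als(ratings, n_users, n_items, reg=0.1, iters=10, seed=0):
    """Fit rating ~ u[user] * v[item] by alternating closed-form updates."""
    rng = random.Random(seed)
    u = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    v = [rng.uniform(0.5, 1.5) for _ in range(n_items)]

    def loss():
        # Squared error on observed ratings plus an L2 penalty on the factors.
        err = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
        return err + reg * (sum(x * x for x in u) + sum(x * x for x in v))

    history = [loss()]
    for _ in range(iters):
        # Hold item factors fixed; each user factor has an exact 1-D ridge solution.
        for i in range(n_users):
            num = sum(r * v[j] for ui, j, r in ratings if ui == i)
            den = reg + sum(v[j] ** 2 for ui, j, r in ratings if ui == i)
            u[i] = num / den
        # Hold user factors fixed; solve each item factor the same way.
        for j in range(n_items):
            num = sum(r * u[i] for i, vj, r in ratings if vj == j)
            den = reg + sum(u[i] ** 2 for i, vj, r in ratings if vj == j)
            v[j] = num / den
        history.append(loss())
    return u, v, history

# Made-up (user, item, rating) triples for illustration only.
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
u, v, history = rank1_als(toy, n_users=3, n_items=3)
print("predicted rating for user 0, item 2:", u[0] * v[2])
```

Because each half-step solves its subproblem exactly, the regularized loss in `history` never increases from one iteration to the next, which is the property that makes the alternating scheme attractive.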
This project uses the MovieLens 25M dataset containing:
- 25,000,095 ratings
- 62,423 movies
- 162,541 users
- Rating period: 1995-2019
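The figures above can be sanity-checked against the raw data. The sketch below assumes the standard MovieLens `ratings.csv` layout (`userId,movieId,rating,timestamp`, with timestamps in Unix seconds) and runs on a few made-up rows rather than the real file; the `summarize` helper is a hypothetical name for this example.

```python
import csv
import io
from datetime import datetime, timezone

# Illustrative rows in the ratings.csv layout; the values are made up.
SAMPLE = """userId,movieId,rating,timestamp
1,10,4.0,1000000000
1,20,3.5,1100000000
2,10,5.0,1500000000
"""

def summarize(handle):
    """Count ratings, distinct users and movies, and the rating year range."""
    users, movies = set(), set()
    count, first, last = 0, None, None
    for row in csv.DictReader(handle):
        count += 1
        users.add(row["userId"])
        movies.add(row["movieId"])
        ts = int(row["timestamp"])
        first = ts if first is None else min(first, ts)
        last = ts if last is None else max(last, ts)

    def year(ts):
        return datetime.fromtimestamp(ts, tz=timezone.utc).year

    return {
        "ratings": count,
        "users": len(users),
        "movies": len(movies),
        "years": (year(first), year(last)) if count else None,
    }

summary = summarize(io.StringIO(SAMPLE))
print(summary)
```

To run it on the real file, pass an open file handle instead of the `io.StringIO` sample. Note that a movie count derived from `ratings.csv` only covers movies that received at least one rating, so it may differ from the number of entries in `movies.csv`.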
- Docker and Docker Compose
- At least 8 GB of RAM recommended
- Clone the repository
git clone <repository-url>
cd MovieRecSystem
- Build and start services
docker-compose up --build
- Access the applications
  - Dashboard: http://localhost:8501
  - API: http://localhost:8000
  - Spark UI: http://localhost:4040
- Download the MovieLens 25M dataset and place it in data/ml-25m/
- Run data preprocessing
docker exec -it pyspark-movie-rec python /app/src/data_processing/preprocessor.py
- Train the model
docker exec -it pyspark-movie-rec python /app/src/train.py \
  --rank 100 \
  --max-iter 10 \
  --reg-param 0.1 \
  --save-model
- Run full evaluation
docker exec -it pyspark-movie-rec python /app/src/run_full_evaluation.py
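The flags in the training command above correspond naturally to ALS hyperparameters. The actual argument handling in `train.py` is not shown in this README, so the parser below is a hypothetical sketch: the defaults mirror PySpark's own ALS defaults, and the `als_params` helper is an assumed name. Only the keyword names `rank`, `maxIter`, and `regParam` are taken from `pyspark.ml.recommendation.ALS`.

```python
import argparse

def parse_args(argv=None):
    """Parse training flags matching those used in the training command."""
    parser = argparse.ArgumentParser(description="Train an ALS recommender")
    parser.add_argument("--rank", type=int, default=10,
                        help="number of latent factors")
    parser.add_argument("--max-iter", type=int, default=10,
                        help="number of ALS iterations")
    parser.add_argument("--reg-param", type=float, default=0.1,
                        help="regularization strength")
    parser.add_argument("--save-model", action="store_true",
                        help="persist the fitted model after training")
    return parser.parse_args(argv)

def als_params(args):
    """Map CLI flags to the keyword names accepted by PySpark's ALS."""
    return {"rank": args.rank, "maxIter": args.max_iter, "regParam": args.reg_param}

args = parse_args(["--rank", "100", "--max-iter", "10",
                   "--reg-param", "0.1", "--save-model"])
params = als_params(args)
print(params, args.save_model)

# With PySpark available, the dict could be passed on like this
# (column names assume the MovieLens ratings schema):
#   ALS(userCol="userId", itemCol="movieId", ratingCol="rating", **params)
```

A higher `--rank` lets the model capture more structure at the cost of memory and training time, while `--reg-param` counteracts overfitting on sparsely rated users and movies.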
- Streamlit web interface
- Interactive movie recommendations
- Model performance metrics
- User-friendly recommendation interface
MovieRecSystem/
├── src/
│ ├── api/ # FastAPI backend
│ ├── dashboard/ # Streamlit frontend
│ ├── data_processing/ # Data preprocessing
│ ├── model/ # ALS model and evaluation
│ ├── train.py # Model training script
│ └── run_full_evaluation.py # Computes metrics for the trained model
├── data/
│ ├── ml-25m/ # MovieLens dataset
│ └── processed/ # Processed data
├── models/ # Trained model artifacts
├── docker/ # Dockerfiles
├── docker-compose.yml
└── requirements-*.txt # Dependencies
- Start the system with docker-compose up
- Open the dashboard at http://localhost:8501
- Change the API URL in the sidebar to http://api:8000 (the API's hostname on the Docker Compose network)
- Enter a user ID to get personalized movie recommendations
- Use the API directly at http://localhost:8000/docs for programmatic access
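For programmatic access from the host, a small client might look like the sketch below. The route `/recommendations/{user_id}` and the `top_n` query parameter are hypothetical; the real paths and parameters are listed in the interactive docs at http://localhost:8000/docs. Only the standard library is used, and nothing is requested over the network unless `fetch_recommendations` is called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "http://localhost:8000"  # host-side API address

def recommendations_url(user_id, top_n=10, base=API_BASE):
    """Build the request URL. The path and `top_n` parameter are hypothetical."""
    query = urlencode({"top_n": top_n})
    return f"{base}/recommendations/{int(user_id)}?{query}"

def fetch_recommendations(user_id, top_n=10, base=API_BASE, timeout=10):
    """GET the URL and decode the JSON body; the response shape is not assumed."""
    with urlopen(recommendations_url(user_id, top_n, base), timeout=timeout) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Only prints the URL; call fetch_recommendations(42) once the API is running.
    print(recommendations_url(42))
```

From inside another container on the same Compose network, pass `base="http://api:8000"` instead of the localhost address.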
- Algorithm: Alternating Least Squares (ALS)
- Framework: Apache Spark MLlib
- Training Data: MovieLens 25M ratings
- Evaluation: RMSE, Precision, Recall metrics
- Scalability: Designed for distributed processing
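As a reference for the evaluation metrics named above, here is a minimal sketch of RMSE and of precision and recall at a cutoff k for a single user. How the project defines a "relevant" movie (for example, a held-out rating above some threshold) is not stated here, so the relevant set is simply passed in; the function names and sample values are made up for illustration.

```python
import math

def rmse(pairs):
    """Root mean squared error over (predicted, actual) rating pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up values for illustration only.
error = rmse([(4.0, 5.0), (3.0, 3.0), (2.0, 4.0)])
precision, recall = precision_recall_at_k([10, 20, 30, 40], {20, 40, 50}, k=4)
print(error, precision, recall)
```

RMSE measures how close predicted ratings are to actual ones, while precision and recall measure ranking quality, so a model can improve on one without improving on the other. Reported figures are typically averages of the per-user values over all evaluated users.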