This project performs in-depth SQL-based analysis on the MovieLens dataset — a popular dataset containing millions of movie ratings, tags, and metadata. The analysis spans beginner to advanced levels, covering SQL queries, views, CTEs, stored procedures, and window functions.
- Dataset setup and table creation using PostgreSQL
- Designed 30+ custom SQL questions across beginner, intermediate, and advanced levels
- Dataset setup and table creation using PostgreSQL
- Imported and managed 1.5+ GB of real-world movie rating data from MovieLens
- Performed genre classification, tag relevance analysis, and user behavior insights
- Implemented views, stored procedures, window functions, and recursive CTEs
- Wrote optimized queries with
EXPLAIN ANALYZEfor performance evaluation - Simulated ACID transactions and used temporary/materialized views for advanced use cases
| Tool/Service | Purpose |
|---|---|
| Kaggle | Dataset exploration |
| PostgreSQL | Database management |
| pgAdmin 4 | GUI for database operations |
- 📥 Downloaded data from MovieLens
- 🗃 Created database and tables using PostgreSQL
- 📌 Imported CSV data (~1.5 GB) via pgAdmin 4
- 🔎 Executed 30+ SQL queries across 3 difficulty levels
- 🧠 Created views, functions, materialized views, and CTEs
- 📈 Extracted insights on movie trends, top genres, user behavior, and tag relevance
📄 Click here to view the full project log (PDF)
MoviesAnalysisSQL/
├── Project_log.pdf
├── Questions.md
└── README.md
- 📊 Analyzing user behavior through ratings and tags
- 🎯 Finding most rated or top-rated movies by genre
- 🧬 Working with tag genome and relevance analysis
- 🏗 Practicing SQL queries from beginner to advanced levels
- 🧪 Demonstrating database normalization and function creation
- 📚 Building a PostgreSQL portfolio project
- Install PostgreSQL and pgAdmin 4
- Download dataset from: MovieLens Dataset
- Run the SQL scripts from the project log to create tables
- Import CSV files into respective tables (Format: CSV, Encoding: UTF-8)
- Run the SQL queries as per your learning phase (Beginner → Advanced)
- The dataset includes 33,832,162 ratings and 2,328,315 tag applications for 86,537 movies
- User and movie IDs are anonymized and consistent across all files
- Ratings are made on a 0.5–5.0 star scale; genres are pipe-separated (e.g., Action|Adventure|Sci-Fi)
- All files are encoded in UTF-8 — ensure correct encoding during import in pgAdmin
- Timestamps are stored in UNIX format (seconds since January 1, 1970 UTC)
Anushka Sharma
🌐 LinkedIn • 🐱 GitHub
🎓 Learning Data Science, Analytics & Machine Learning
If you found this project helpful or inspiring:
- ⭐ Star this repository
- 🛠️ Fork it to build upon or adapt it for your own use
- 💬 Share feedback or suggestions via Issues/Discussions