Load song play events of a mobile application and register to Redshift database.
This project's goal is analyze data to find what attribute of users often play songs, what songs or artists are played often and when songs are often played.
First, script create_tables.py creates staging, dimension and fact tables.
Then, script etl.py loads json data from S3 to Redshift staging table.
Finally, it loads from staging table to dimension and fact table by using INSERT INTO ... SELECT ... syntax.
-
log_dataprovide event data recorded in json file underlog_datadirectory -
song_dataprovide song and artist data recorded in json file undersong_datadirectory
-
songsprovide song data which were played -
artistsprovide artist data whose songs were played -
usersprovide user data who played songs -
timeprovide time data when users played songs
songplaysprovide event data which describe who, when and what songs are played
Those tables are designed using STAR schema, which can easily analyze various aspects. I assumed we can specify artists and songs of event data by its name and title.
# pull docker images and build docker images
docker-compose up --build -d
# create tables
docker-compose exec app python3 create_tables.py
# run ETL
docker-compose exec app python3 etl.pyQuery:
SELECT sp.artist_id, a.name, COUNT(sp.artist_id)
FROM songplays sp
INNER JOIN artists a on sp.artist_id = a.artist_id
GROUP BY sp.artist_id, a.name
ORDER BY 3 DESC
LIMIT 5
Result:
| artist_id | name | cnt |
|---|---|---|
| AR5E44Z1187B9A1D74 | Dwight Yoakam | 37 |
| ARD46C811C8A414F3F | Kid Cudi / Kanye West / Common | 10 |
| ARD46C811C8A414F3F | Kid Cudi | 10 |
| AR37SX11187FB3E164 | Ron Carter | 9 |
| AR5EYTL1187B98EDA0 | Lonnie Gordon | 9 |
Query:
SELECT sp.song_id, s.title, COUNT(sp.song_id) AS cnt
FROM songplays sp
INNER JOIN songs s on sp.song_id = s.song_id
INNER JOIN time t ON sp.start_time = t.start_time
WHERE t.start_time BETWEEN '2018-11-01 00:00:00' AND '2018-11-08 00:00:00'
GROUP BY sp.song_id, s.title
ORDER BY 3 DESC
LIMIT 5
| song_id | title | cnt |
|---|---|---|
| SOBONKR12A58A7A7E0 | You're The One | 8 |
| SONTFNG12A8C13FF69 | If I Ain't Got You | 3 |
| SOHTKMO12AB01843B0 | Catch You Baby (Steve Pitron & Max Sanna Radio Edit) | 3 |
| SOULTKQ12AB018A183 | Nothin' On You [feat. Bruno Mars] (Album Version) | 2 |
| SOLZOBD12AB0185720 | Hey Daddy (Daddy's Home) | 2 |
-
create_tables.pycreates necessary tables -
docker-compose.ymlpreference file to organize docker images -
Dockerfileprovides docker image template -
dwh.cfgparameter file for paths of json data, IAM Role and redshift connection -
etl.pydefines etl processes -
init.shscript launched initially when docker image started -
README.mddescribes about this project -
requirements.txtdefines libraries need to execute processes -
sql_queries.pydefines SQLs need to execute