Movalytics

Project Description


Architecture


The technical architecture for this project is as shown below:

(Architecture diagram)

  1. Data extraction is done using the Kaggle API and a GET request to the St Louis FRED Consumer Price Index (CPI) dataset.
    Set up an EC2 instance with Python and pip installed. Then, run `pip install kaggle` (the Kaggle CLI expects an API token at `~/.kaggle/kaggle.json`). To download the MovieLens dataset, run
kaggle datasets download -d "rounakbanik/the-movies-dataset"

For St Louis Fred's Consumer Price Index dataset, run

wget "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CUSR0000SS62031&scale=left&cosd=1999-01-01&coed=2020-04-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2009-06-01&line_index=1&transformation=lin&vintage_date=2020-05-31&revision_date=2020-05-31&nd=1999-01-01"
  2. Next, copy the downloaded files from the EC2 instance to S3. Make sure the instance has the AWS CLI installed. Run `aws configure`, then `aws s3 cp {FILE} s3://{S3_BUCKET}/{S3_FOLDER}/` to transfer the files to S3. Note that if the situation changes such that this becomes a daily job, we can write a shell script containing these commands and add a task that runs this shell script to our Airflow data pipeline (see the sketch after this list).

  3. Run the ETL pipeline, scheduled using Airflow. Data processing is done using Spark, and the data is eventually ingested into Redshift.
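
If the copy to S3 does become a scheduled job, one way to wire it in is a `BashOperator` task in the Airflow DAG. This is only a minimal sketch, assuming Airflow 1.10-style imports and a hypothetical `upload_to_s3.sh` script that wraps the `aws s3 cp` commands above:

    # Hypothetical sketch: run the S3 upload script as an Airflow task.
    from airflow.operators.bash_operator import BashOperator

    upload_to_s3 = BashOperator(
        task_id="upload_raw_files_to_s3",
        # upload_to_s3.sh is assumed to contain the `aws s3 cp` commands shown above.
        # The trailing space stops Airflow from treating the .sh path as a Jinja template file.
        bash_command="/home/ec2-user/scripts/upload_to_s3.sh ",
        dag=dag,  # assumes the `dag` object defined in the DAG file
    )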

Choice of Technologies


Data Model


(Data model diagram)

The approach taken is to normalize the data. Because each fact is stored in exactly one place, this leads to more efficient UPDATEs and DELETEs as and when required.
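
To illustrate the trade-off, a sketch of what two normalized Redshift tables might look like is shown below; the column names and types here are assumptions for illustration, and the actual schema is the one in the data model diagram above:

    # Hypothetical DDL sketch for two normalized tables (column names assumed).
    CREATE_MOVIES = """
    CREATE TABLE IF NOT EXISTS movies (
        movie_id BIGINT PRIMARY KEY,
        title    VARCHAR(512),
        budget   BIGINT
    );
    """

    CREATE_RATINGS = """
    CREATE TABLE IF NOT EXISTS ratings (
        user_id  BIGINT,
        movie_id BIGINT REFERENCES movies(movie_id),
        rating   DECIMAL(2,1),
        rated_at TIMESTAMP
    );
    """

With this layout, a change to a movie's metadata touches a single row in `movies` instead of being repeated across every rating.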

ETL Pipeline


The ETL process runs through an Airflow DAG:

(Airflow DAG)

The process is as follows:

  1. Create the tables and staging tables (if they do not exist)
  2. Perform an update and insert (upsert), based on new data coming in
  3. Run a data quality check (verify that tables have more than one row and that there are no null ids)
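
The data quality step can be expressed as a Python callable that the DAG runs last. This is a minimal sketch, assuming Airflow 1.10-style imports, the `redshift` connection defined in the setup section below, and hypothetical table and id-column names:

    # Hypothetical sketch of the data quality check task.
    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.operators.python_operator import PythonOperator

    def check_data_quality(**context):
        hook = PostgresHook(postgres_conn_id="redshift")
        # Table -> id column pairs are assumptions; use the real data model here.
        for table, id_col in [("movies", "movie_id"), ("ratings", "user_id")]:
            rows = hook.get_first(f"SELECT COUNT(*) FROM {table}")
            if rows is None or rows[0] < 1:
                raise ValueError(f"Data quality check failed: {table} has no rows")
            nulls = hook.get_first(f"SELECT COUNT(*) FROM {table} WHERE {id_col} IS NULL")
            if nulls[0] > 0:
                raise ValueError(f"Data quality check failed: {table} has null {id_col} values")

    data_quality_check = PythonOperator(
        task_id="data_quality_check",
        python_callable=check_data_quality,
        provide_context=True,
        dag=dag,  # assumes the `dag` object defined in the DAG file
    )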

Potential Improvements


  1. <ins> What if data is increased by 100x? </ins>
  2. <ins> What if the data pipeline needs to be run by 7am daily? </ins>
  3. <ins> What if the database needs to be accessed by 100+ users? </ins>

Development


<ins>Setting up</ins>

<ins>Add the necessary connections and variables in the Airflow UI</ins>
There are 4 variables to be defined:

  1. movie_s3_config. It is defined in JSON format as follows (see the sketch after this list for how the pipeline can read it back):

    {
        "aws_key": "{AWS_KEY}",
        "aws_secret_key": "{AWS_SECRET_KEY}",
        "s3_bucket": "{AWS_S3_BUCKET}",
        "s3_key": "{AWS_S3_KEY}"
    }
    
    • The AWS_S3_BUCKET is the S3 bucket with S3_KEY (folder) containing the csv files:
      1. credits.csv (from Kaggle Movielens dataset)
      2. links.csv (from Kaggle Movielens dataset)
      3. movies_metadata.csv (from Kaggle Movielens dataset)
      4. ratings.csv (from Kaggle Movielens dataset)
      5. consumer_price_index.csv (from the St Louis FRED CPI dataset)
  2. db_user (username of a user with access to the Redshift database)

  3. db_pass (password of that user)

  4. redshift_conn_string (Redshift JDBC connection string used when writing Spark dataframes to Redshift)
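
Inside the pipeline, these values can be read back with Airflow's `Variable` API and handed to Spark's JDBC writer. This is a minimal sketch, assuming the Redshift JDBC driver is available to Spark and that `movies_df` is a Spark dataframe produced by the processing step (both are assumptions, not project code):

    # Hypothetical sketch: read the Airflow variables and write a Spark dataframe to Redshift.
    from airflow.models import Variable

    s3_config = Variable.get("movie_s3_config", deserialize_json=True)
    db_user = Variable.get("db_user")
    db_pass = Variable.get("db_pass")
    redshift_conn_string = Variable.get("redshift_conn_string")

    # movies_df and the target table name are placeholders for illustration.
    (movies_df.write
        .format("jdbc")
        .option("url", redshift_conn_string)
        .option("dbtable", "movies")
        .option("user", db_user)
        .option("password", db_pass)
        .mode("append")
        .save())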

In addition, define the Hook to connect to Redshift:

Conn Id: `redshift`  
Conn Type: `Postgres`  
Host: Enter the endpoint of your Redshift cluster, excluding the port at the end  
Schema: The Redshift database you want to connect to  
Login: Enter the Redshift user  
Password: Enter the Redshift password  
Port: Enter `5439`
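
Once the connection is saved, tasks can reach Redshift through it via `PostgresHook`. A minimal sketch, just to confirm the connection resolves:

    # Hypothetical sketch: using the `redshift` connection from a task.
    from airflow.hooks.postgres_hook import PostgresHook

    redshift_hook = PostgresHook(postgres_conn_id="redshift")
    print(redshift_hook.get_first("SELECT 1"))  # prints (1,) if the connection works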

After configuring, visit the Airflow UI and enable the DAG to start the data pipeline.

Acknowledgements


Many thanks to: