southparkr <img src="sticker/southparkr-sticker.png" align="right" width="150" />
The package scrapes South Park transcripts and IMDB ratings for each episode. The processed data can then be used for text analysis.
About South Park
South Park is an American satirical animated TV show about four elementary school boys. Those are just the main characters; many more appear throughout the series. Pretty much everybody famous has already been made fun of by South Park creators Trey Parker and Matt Stone. You can watch all the episodes for free on their official site https://southpark.cc.com. Well, at least in the Czech Republic.
Installation
The development version can be installed using devtools:
devtools::install_github("pdrhlik/southparkr")
Scraping South Park scripts
The main resource for South Park scripts is a community-driven website, https://southpark.wikia.com/. Its subsection Portal:Scripts has a unified table of scripts for each episode.
episode_list <- fetch_episode_list()
episode_lines <- fetch_all_episodes(episode_list)
Scraping IMDB ratings
This is done by parsing official IMDB interfaces. South Park's IMDB ID is tt0121955. It is used to get the IMDB IDs of every South Park episode from the file title.episode.tsv.gz. Once the IDs are obtained, the package gathers episode information from title.basics.tsv.gz and episode ratings from title.ratings.tsv.gz.
imdb_ratings <- fetch_ratings()
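The lookup and joins described above can be sketched on toy data. This is a minimal base R illustration, not the package's actual implementation. The column names follow the published IMDB dataset layout (tconst, parentTconst, seasonNumber, episodeNumber, primaryTitle, averageRating, numVotes); every episode ID, title and rating below is invented, and only tt0121955 comes from the text.

```r
# Toy stand-ins for the three IMDB files. All values are invented
# for illustration; only the South Park ID is real.
south_park_id <- "tt0121955"

title_episode <- data.frame(
  tconst = c("tt9000001", "tt9000002", "tt9000003"),
  parentTconst = c("tt0121955", "tt0121955", "tt0000001"),
  seasonNumber = c(1L, 1L, 1L),
  episodeNumber = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

title_basics <- data.frame(
  tconst = c("tt9000001", "tt9000002", "tt9000003"),
  primaryTitle = c("Example Episode A", "Example Episode B", "Other Show Pilot"),
  stringsAsFactors = FALSE
)

title_ratings <- data.frame(
  tconst = c("tt9000001", "tt9000002", "tt9000003"),
  averageRating = c(8.1, 7.4, 6.0),
  numVotes = c(1200L, 950L, 40L),
  stringsAsFactors = FALSE
)

# Step 1: keep only the episodes whose parent title is South Park.
episodes <- title_episode[title_episode$parentTconst == south_park_id, ]

# Step 2: attach episode information, then ratings, by episode ID.
ratings_sketch <- merge(episodes, title_basics, by = "tconst")
ratings_sketch <- merge(ratings_sketch, title_ratings, by = "tconst")

# Step 3: order by season and episode number.
ratings_sketch <- ratings_sketch[
  order(ratings_sketch$seasonNumber, ratings_sketch$episodeNumber),
]
```

The real files are tab-separated and mark missing values with \N, so reading them takes a bit more care than this toy example suggests.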
Usage
The package contains 3 precomputed datasets: episode_list, episode_lines and imdb_ratings. It also has a set of functions that can recreate these datasets. That should be done when new episodes are released. You can experiment with those functions on your own, but remember that scraping takes quite a lot of time.
The following function can be used to process the prepared datasets. It will create a new dataset where each row is a word. It will also add a sentiment_score, a word_stem and a swear_word logical flag.
episode_words <- process_episode_words(episode_lines, imdb_ratings, keep_stopwords = FALSE)
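To give a feel for what a one-word-per-row dataset looks like, here is a minimal base R sketch on toy data. It is not how process_episode_words is implemented. The input columns character and text, the tiny stop word list and the tiny swear word list are all assumptions made up for the example.

```r
# Toy script lines; the column names `character` and `text` are hypothetical.
lines_toy <- data.frame(
  character = c("Stan", "Kyle"),
  text = c("Oh my God, they killed Kenny!", "You bastards!"),
  stringsAsFactors = FALSE
)

# Hypothetical miniature word lists, for illustration only.
toy_stopwords <- c("oh", "my", "they", "you")
toy_swear_words <- c("bastards")

# Lowercase, strip everything except letters and spaces, split on spaces.
clean_text <- gsub("[^a-z ]", "", tolower(lines_toy$text))
words <- strsplit(clean_text, " +")

# One row per word, repeating the speaker for each of their words.
word_rows <- data.frame(
  character = rep(lines_toy$character, lengths(words)),
  word = unlist(words),
  stringsAsFactors = FALSE
)

# Logical flag marking swear words.
word_rows$swear_word <- word_rows$word %in% toy_swear_words

# Rough analogue of keep_stopwords = FALSE: drop the stop words.
words_no_stop <- word_rows[!word_rows$word %in% toy_stopwords, ]
```

Here word_rows keeps every word, while words_no_stop keeps only the words that are not in the toy stop word list.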
Analysis
I used the package to investigate two questions. The functions I used are in the R/analyses.R and R/plots.R files.
- Are naughty episodes more popular?
- Is Eric Cartman the naughtiest character in the show?
You can try to answer these yourself!
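One possible starting point for both questions is sketched below in base R. It is an assumption about how such an analysis could look, not the code in R/analyses.R. All data are invented toy values, so the numbers it produces say nothing about the show; the column names episode_name, swear_share, rating and character are hypothetical.

```r
# Question 1: are naughty episodes more popular?
# Toy per-episode summary with invented values.
episodes_toy <- data.frame(
  episode_name = c("A", "B", "C", "D", "E"),
  swear_share = c(0.02, 0.05, 0.01, 0.08, 0.04),
  rating = c(7.9, 8.4, 8.0, 8.8, 8.1),
  stringsAsFactors = FALSE
)

# Rank correlation between the share of swear words and the rating.
naughty_vs_popular <- cor(
  episodes_toy$swear_share,
  episodes_toy$rating,
  method = "spearman"
)

# Question 2: which character swears the most?
# Toy word-level data with invented values.
words_toy <- data.frame(
  character = c("Cartman", "Cartman", "Cartman", "Kyle", "Kyle",
                "Stan", "Stan", "Stan"),
  swear_word = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Share of swear words per character, highest first.
by_character <- aggregate(swear_word ~ character, data = words_toy, FUN = mean)
by_character <- by_character[order(-by_character$swear_word), ]
naughtiest <- as.character(by_character$character[1])
```

With the real datasets, a correlation alone would not settle the first question, so it is worth plotting the data as well.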
I will be writing an article about my findings. The first part, which describes in more detail how I obtained the data, is on my blog: South Park Analysis I - Script Scraping.
I also gave a talk about my findings at the Why R? 2018 conference. You can check the slides yourself!