Oscars Script Analysis — 1989, 2015 and 2017

This repository contains data, analytic code, and findings supporting BuzzFeed News's analysis of diversity in the dialogue of Best Picture–nominated films, published March 2, 2018. Please read that article, which contains important context and details, before proceeding.

Data

This analysis relies on two data files, both found in data/.

data/actor-metrics.csv lists each actor in our analysis, and contains these columns:

data/character-word-counts.csv counts each character + word combination (excluding "stop words"; see below for details), for each actor in our analysis. It contains these columns:

Data sources

The analyses in this repository use, as their main source material, the scripts of the 22 films nominated for Best Picture for the 1990, 2016, and 2018 Academy Awards. (Those films were released in 1989, 2015, and 2017, respectively.)

For two films, Mad Max and My Left Foot, we could not locate a script, so we instead relied on film transcripts, which we then checked against the final film. We then entered these transcripts into the Writer Duet scriptwriting program, and exported the results as XML (in the same format that we used for other screenplays).

The list of nominated films came from the Oscars Awards Database and the Oscars website.

The character names and dialogue were extracted from film scripts, which were found on public websites (such as Script Slug and The Internet Movie Script Database) and on the websites of various film distributors.

It is important to note:

The official names for each script's characters were drawn from Variety Insights and IMDb.

The source for each actor's gender and race/ethnicity was primarily Variety Insights. In cases where an actor's gender or race/ethnicity could not be confirmed in Variety Insights, we sometimes made a judgment call based on photos, biographies and other information. In cases where an actor's ethnicity or gender was at all in question, we confirmed the facts with their representative.

In some cases, names could not be matched to actors either because the character's part was not included in the finished film, or because the actor was not credited. These names were removed from the analysis.

Data processing steps

First, we converted PDFs of the movie scripts into XML files, using Writer Duet or Story Writer. Then, we used Python's Beautiful Soup, TextBlob, and ftfy libraries to extract the character names and dialogue from the XML files, clean them up, and "tokenize" the dialogue into sentences and words. Then, we exported each character's lines and total word and sentence counts to a CSV file.
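As a rough illustration of the extraction and counting described above, here is a minimal sketch. It is not the repository's code: it uses only the Python standard library as a stand-in for Beautiful Soup, TextBlob, and ftfy, and the XML layout (`<para type="character">` followed by `<para type="dialogue">`), the sample text, and every function name are assumptions made for the example. The real exported files may be structured differently.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for this sketch.
SAMPLE_XML = """
<script>
  <para type="character">ANNA</para>
  <para type="dialogue">Where are you going? I thought we had a deal.</para>
  <para type="character">BEN</para>
  <para type="dialogue">Out.</para>
  <para type="character">ANNA</para>
  <para type="dialogue">Fine.</para>
</script>
"""


def split_sentences(text):
    # Naive stand-in for a real sentence tokenizer:
    # split after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(text):
    # Naive stand-in for a real word tokenizer.
    return re.findall(r"[A-Za-z0-9']+", text)


def extract_counts(xml_text):
    """Return {character: {"words": n, "sentences": n}} for one script."""
    root = ET.fromstring(xml_text)
    totals = {}
    current = None  # the character whose dialogue we are reading
    for para in root.iter("para"):
        kind = para.get("type")
        text = (para.text or "").strip()
        if kind == "character":
            current = text
            totals.setdefault(current, {"words": 0, "sentences": 0})
        elif kind == "dialogue" and current is not None:
            totals[current]["words"] += len(split_words(text))
            totals[current]["sentences"] += len(split_sentences(text))
    return totals


def to_csv(totals):
    """Serialize the per-character totals as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["character", "words", "sentences"])
    for name in sorted(totals):
        writer.writerow([name, totals[name]["words"], totals[name]["sentences"]])
    return buf.getvalue()


counts = extract_counts(SAMPLE_XML)
print(to_csv(counts))
```

A dedicated tokenizer handles abbreviations, ellipses, and contractions far better than the regular expressions used here, which is one reason to prefer a library such as TextBlob for real scripts.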

Using that CSV file, we manually assigned each character we could to an actor, using the sources listed above. Then, we removed characters who fit any of the following criteria:

Ultimately, we removed 11 characters who did speak at least 100 words:

To generate the character-word-counts.csv file, we took the following steps:
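The kind of counting behind this file can be sketched as follows. This is only an illustration under stated assumptions, not the method used in the repository: the stop-word list is a tiny hypothetical one, the sample dialogue is invented, and `count_character_words` is a name made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "i", "is", "it", "the", "to", "you"}


def count_character_words(lines):
    """Count each (character, word) pair, skipping stop words.

    `lines` is an iterable of (character, dialogue) tuples.
    """
    counts = Counter()
    for character, dialogue in lines:
        # Lowercase first so "The" and "the" are treated as one word.
        for word in re.findall(r"[a-z']+", dialogue.lower()):
            if word not in STOP_WORDS:
                counts[(character, word)] += 1
    return counts


lines = [
    ("ANNA", "I want the money. The money is mine."),
    ("BEN", "You want it? Take it."),
]
word_counts = count_character_words(lines)

for (character, word), n in sorted(word_counts.items()):
    print(character, word, n)
```

Keying the counter on a (character, word) tuple keeps one row per combination, which maps directly onto one CSV row per character + word pair.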

Analysis

This repository uses Python code and Jupyter notebooks to process the data. That code can be found here:

Feedback / Questions?

Contact Lam Thuy Vo at lam.vo@buzzfeed.com and Scott Pham at scott.pham@buzzfeed.com.

Looking for more from BuzzFeed News? Click here for a list of our open-sourced projects, data, and code.