The Fjelstul World Cup Database

The Fjelstul World Cup Database is a comprehensive database about the FIFA World Cup created by Joshua C. Fjelstul, Ph.D. that covers all 22 men's tournaments (1930-2022) and all 8 women's tournaments (1991-2019). The database includes 27 datasets (over 1.58 million data points) that cover all aspects of the World Cup.

The database has been featured by The Washington Post, FiveThirtyEight, The Markup, Data is Plural, The Times, Agence France-Presse (AFP), Barron's, Latinometrics, Hindustan Times, and DataCamp.

Users can use the database to calculate statistics about teams, players, managers, and referees, and to predict match results. With many units of analysis and many opportunities for merging and reshaping data, the database is also an excellent resource for teaching data science skills.
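As a minimal sketch of the kind of team statistic described above, the following base R code reshapes match results to one row per team per match and then aggregates them. The toy data frame stands in for the matches dataset; its column names (home_team_name, away_team_name, home_team_score, away_team_score) are assumptions for illustration and should be checked against the codebook.

```r
# Toy stand-in for the matches dataset. Column names and values are
# assumptions for illustration, not taken from the database.
matches <- data.frame(
  match_id = c("M-1", "M-2", "M-3"),
  home_team_name = c("Brazil", "France", "Brazil"),
  away_team_name = c("France", "Germany", "Germany"),
  home_team_score = c(2, 1, 0),
  away_team_score = c(1, 1, 3),
  stringsAsFactors = FALSE
)

# Reshape from one row per match to one row per team per match.
team_matches <- rbind(
  data.frame(
    team = matches$home_team_name,
    goals_for = matches$home_team_score,
    goals_against = matches$away_team_score
  ),
  data.frame(
    team = matches$away_team_name,
    goals_for = matches$away_team_score,
    goals_against = matches$home_team_score
  )
)
team_matches$win <- team_matches$goals_for > team_matches$goals_against

# Aggregate to one row per team: total goals for, goals against, and wins.
team_stats <- aggregate(
  cbind(goals_for, goals_against, win) ~ team,
  data = team_matches,
  FUN = sum
)
print(team_stats)
```

With the real data, the same reshaping would start from worldcup::matches once the package is installed.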

Here are some example visualizations that use the data:

<div> <img src="https://github.com/jfjelstul/worldcup/blob/master/visualizations/match-results.png?raw=true" width="90%"> </div> <div> <img src="https://github.com/jfjelstul/worldcup/blob/master/visualizations/goals-by-European-teams.png?raw=true" width="70%"> </div> <div> <img src="https://github.com/jfjelstul/worldcup/blob/master/visualizations/goals-by-South-American-teams.png?raw=true" width="70%"> </div>

Overview of the database

The 27 datasets in the database are organized into 5 groups:

  1. A first group of datasets (containing 9 datasets) includes information about each of the 9 basic units of observation in the database: tournaments (tournaments), including the host country, the winner, the dates of the tournament, and information about the format of each tournament; the FIFA confederations (confederations); teams (teams); players (players); managers (managers), including their team and home country; referees (referees), including their home country and confederation; stadiums that have hosted World Cup matches (stadiums); matches (matches), including the stage of the tournament, the location of the match (country, city, stadium), the teams involved, and the result; and the individual awards that are handed out to players at each tournament (awards). Each of these units of observation has a unique ID number.

  2. A second group of datasets (containing 4 datasets) maps teams, players, managers, and referees to tournaments. There is a dataset about which teams qualified (qualified_teams), which indicates how each team performed in the tournament; a dataset about squads (squads), which indicates the name, position, and shirt number of each player; a dataset about manager appointments (manager_appointments), which indicates the team and home country of each manager; and a dataset about referee appointments (referee_appointments), which indicates the home country and confederation of each referee.

  3. A third group of datasets (containing 4 datasets) maps teams, players, managers, and referees to individual matches. There are datasets about team appearances (team_appearances), player appearances (player_appearances), manager appearances (manager_appearances), and referee appearances (referee_appearances). Players who start a game on the bench but who are not substituted in appear in the squads dataset but not the player_appearances dataset.

  4. A fourth group of datasets (containing 4 datasets) covers in-match events, including: all goals (goals); all attempted penalty kicks in penalty shootouts and their outcomes (penalty_kicks); all bookings (bookings), including yellow cards and red cards; and all substitutions (substitutions). Each dataset includes the minute of the event and the player(s) and team involved. Each of these 4 types of in-match events has a unique ID number.

  5. A fifth group of datasets (containing 6 datasets) covers tournament-level attributes. There is a dataset about host countries (host_countries), including the performance of each host country; a dataset about the stages in each tournament (tournament_stages), which records each stage of the tournament, the dates of the stage, and key features of the stage; a dataset about the groups in each group stage (groups), which indicates the name of each group and the number of teams in each group; a dataset about the final standings in each group (group_standings); a dataset about the final standings for each tournament (tournament_standings); and a dataset about all individual player awards handed out at each tournament (award_winners).
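Because players who never leave the bench appear in squads but not in player_appearances, comparing the two datasets is one way to identify squad members who did not play in a tournament. The following is a minimal sketch using toy data frames; the column names (tournament_id, team_id, match_id, player_id) and the ID values are assumptions for illustration, so check the codebook for the actual variables.

```r
# Toy stand-ins for the squads and player_appearances datasets. Column
# names and ID values are assumptions for illustration.
squads <- data.frame(
  tournament_id = "WC-2022",
  team_id = "T-01",
  player_id = c("P-1", "P-2", "P-3"),
  stringsAsFactors = FALSE
)
player_appearances <- data.frame(
  tournament_id = "WC-2022",
  match_id = c("M-1", "M-1", "M-2"),
  player_id = c("P-1", "P-2", "P-1"),
  stringsAsFactors = FALSE
)

# Build a composite key, since a player can be in squads for several
# tournaments.
squad_key <- paste(squads$tournament_id, squads$player_id)
appearance_key <- paste(
  player_appearances$tournament_id,
  player_appearances$player_id
)

# Anti-join: keep squad rows that have no matching appearance.
unused <- squads[!(squad_key %in% appearance_key), ]
print(unused)
```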

Downloading the data

The Fjelstul World Cup Database is available via the R package worldcup, which you can install from this repository (instructions below). Note that this repository is structured as a repository for an R package. You can also download the database directly from this repository in 4 formats: an .RData version of the database is available in the data/ folder, a .csv version is available in the data-csv/ folder, a .json version is available in the data-json/ folder, and a relational database version (SQLite) is available in the data-sqlite/ folder.
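If you download the .csv version, one way to load it is to read every file in data-csv/ into a named list of data frames. The sketch below assumes one .csv file per dataset, named after the dataset (for example, matches.csv). So that it runs anywhere, it first writes two made-up files to a temporary folder that stands in for a local copy of the repository.

```r
# Stand-in for a local copy of the repository: two tiny CSV files written
# to a temporary data-csv/ folder. File names and contents are made up.
repo_dir <- tempfile("worldcup-")
csv_dir <- file.path(repo_dir, "data-csv")
dir.create(csv_dir, recursive = TRUE)
writeLines(c("team_id,team_name", "T-01,Uruguay"), file.path(csv_dir, "teams.csv"))
writeLines(c("match_id,home_team_id", "M-1,T-01"), file.path(csv_dir, "matches.csv"))

# Read every .csv file in the folder into a named list of data frames.
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
database <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
names(database) <- tools::file_path_sans_ext(basename(csv_files))

print(names(database))
```

With a real copy of the repository, set repo_dir to its location and skip the lines that write the toy files.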

The .RData, .csv, and .json versions of the database are all identical except for the file format. These versions of the database are not technically relational because many tables already include variables that have been merged in from other tables for convenience (i.e., some data exists in multiple tables). The SQLite version includes all of the same variables, but variables from other tables are not already merged in. Dummy variables that are coded 0 or 1 are converted to FALSE and TRUE. Users can use the primary and foreign keys in the tables to merge in data from other tables. See the README.md file in the data-sqlite/ folder for more details on using the relational database.
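To illustrate merging on keys, the following sketch joins two toy tables on a shared key using base R. It assumes that the tables have already been read into data frames (for example, with the DBI and RSQLite packages) and that stadium_id is a primary key in stadiums and a foreign key in matches. The column names and values are assumptions for illustration, so see the README.md file in data-sqlite/ for the actual keys.

```r
# Toy stand-ins for two tables from the relational version. Column names
# and values are assumptions for illustration.
matches <- data.frame(
  match_id = c("M-1", "M-2"),
  stadium_id = c("S-2", "S-1"),
  stringsAsFactors = FALSE
)
stadiums <- data.frame(
  stadium_id = c("S-1", "S-2"),
  stadium_name = c("Stadium A", "Stadium B"),
  city_name = c("City A", "City B"),
  stringsAsFactors = FALSE
)

# Left join: keep every match and merge in the stadium variables using
# the foreign key.
matches_with_stadiums <- merge(
  matches,
  stadiums,
  by = "stadium_id",
  all.x = TRUE
)
print(matches_with_stadiums)
```

Note that merge() sorts the result by the key column, so the row order can differ from the order in matches.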

Downloading the codebook

The codebook for the Fjelstul World Cup Database is available in .pdf format in the codebook/pdf/ folder. The codebook is also available in .csv format in the codebook/csv/ folder. There are 2 files: datasets.csv, which describes the contents of each dataset, and variables.csv, which describes each variable.

The codebook for the database is also included in the R package: worldcup::datasets and worldcup::variables. The same information is also available as part of the R documentation for each dataset. For example, you can see the codebook for the worldcup::matches dataset by running ?worldcup::matches.

The license

The copyright for the original structure and organization of the Fjelstul World Cup Database and for all of the documentation and replication code for the database is owned by Joshua C. Fjelstul, Ph.D.

The Fjelstul World Cup Database and the worldcup package are both published under a CC-BY-SA 4.0 license. This means that you can distribute, modify, and use all or part of the database for commercial or non-commercial purposes as long as (1) you provide proper attribution and (2) any new works you produce based on this database also carry the CC-BY-SA 4.0 license.

To provide proper attribution, according to the CC-BY-SA 4.0 license, you must provide the name of the author ("Joshua C. Fjelstul, Ph.D."), a notice that the database is copyrighted ("© 2023 Joshua C. Fjelstul, Ph.D."), a link to the CC-BY-SA 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/legalcode), and a link to this repository (https://www.github.com/jfjelstul/worldcup). You must also indicate any modifications you have made to the database.

Consistent with the CC-BY-SA 4.0 license, I provide this database as-is and as-available, and make no representations or warranties of any kind concerning the database, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable.

The datasets

Data sources and replication code

The data in the Fjelstul World Cup Database is coded based on information from Wikipedia. Some of this information is cross-referenced with the official FIFA match reports to check for accuracy. The Wikipedia pages used to code the data are archived in the data-raw/ folder. Data on tournaments is hand-coded based on the pages in data-raw/Wikipedia-tournament-pages/, data on squads is machine-coded based on the pages in data-raw/Wikipedia-squad-pages/, data on matches is machine-coded based on the pages in data-raw/Wikipedia-match-pages/, and data on awards is hand-coded based on the page in data-raw/Wikipedia-awards-page/. The raw data for all of the datasets that are hand-coded are available in data-raw/hand-coded-tables/.

The replication code for downloading the Wikipedia pages used to code the database is available in data-raw/code/download-wikipedia-pages.R. The replication code for extracting data from these pages is also available in data-raw/code/. The file data-raw/code/parse-wikipedia-match-pages.R extracts data from the match pages in data-raw/Wikipedia-match-pages/ and the file data-raw/code/parse-wikipedia-squad-pages.R extracts data from the squad pages in data-raw/Wikipedia-squad-pages/. The working files produced by this code are stored in data-raw/Wikipedia-data/. The file data-raw/code/build-database.R compiles the database from the parsed data in data-raw/Wikipedia-data/ and the hand-coded data in data-raw/hand-coded-tables/.

The files data-raw/code/build-database-csv.R, data-raw/code/build-database-json.R, and data-raw/code/build-database-sqlite.R build the .csv, .json, and SQLite versions of the database, respectively.

The replication code for the codebook is available in codebook/code/.

Data notes

Installing the R package

You can install the latest development version of the worldcup package from GitHub:

# install.packages("devtools")
devtools::install_github("jfjelstul/worldcup")

Citing the database

If you use the database in a paper or project, please cite the database:

Fjelstul, Joshua C. "The Fjelstul World Cup Database v.1.2.0." July 19, 2023. https://www.github.com/jfjelstul/worldcup.

The BibTeX entry for the database is:

@Manual{Fjelstul2023,
  author = {Fjelstul, Joshua C.},
  title = {The Fjelstul World Cup Database v.1.2.0},
  year = {2023}
}

If you access the database via the worldcup package, please also cite the package:

Joshua C. Fjelstul (2023). worldcup: The Fjelstul World Cup Database. R package version 1.2.0.

The BibTeX entry for the R package is:

@Manual{,
  title = {worldcup: The Fjelstul World Cup Database},
  author = {Fjelstul, Joshua C.},
  year = {2023},
  note = {R package version 1.2.0},
}

Reporting problems

If you notice an error in the data or a bug in the R package, please report it here.