Awesome
Dataset: BuzzFeed News “Trending” Strip, 2018–2023
A tribute to a trailblazing newsroom.
BuzzFeedNews.com launched in July 2018 as the dedicated homepage for BuzzFeed News. (Previously, BuzzFeed’s news coverage was published on BuzzFeed’s main domain, buzzfeed.com
.) One key feature of the site was its “Trending” strip, curated by editors and highlighting up to eight articles at a time:
In mid-November 2018, a few months after the site launched, I wrote a script to fetch that list of articles and to save that information to a simple file. The script ran every five minutes (with occasional interruptions) until the newsroom’s final day of operation in May 2023. This repository contains all the data the script collected, in raw and deduplicated forms.
Disclosure: I worked for BuzzFeed’s news division from March 2014 to January 2022. I undertook this project on personal time and out of personal interest, using only the publicly-available homepage; nothing here should be considered to represent the views of BuzzFeed or BuzzFeed News.
Raw data
The file data/bfn-trending-strip-raw.tsv.gz
contains the raw data I collected. I have compressed it with gzip
, which reduces the size from 390MB to 11MB.
Structure
The file contains 3.1 million rows, each representing one article observed at one point in time.
The file uses these columns:
timestamp
: The time (in UTC) of the fetch. All articles from the same fetch will have the same timestamp.position
: The article's zero-indexed position in the trending strip, from left to right.text
: The text of the link used to highlight the article. Note: Sometimes the same article is associated with different text at different points in time.url
: The link's URL. Note: Sometimes (although relatively rarely) the URL for the same underlying article changes over time.
Note: Although the script generally ran every five minutes, there are some gaps in the data, accounting for roughly 3% of the total time period covered. These gaps owe to two main factors: technical complications (such as server downtime) and periods during which the website swapped out the trending strip with breaking news alerts, single-story highlights, or other notices. Unfortunately, I did not have the foresight to collect data that would distinguish between those scenarios.
Example data
Here are six rows of the dataset, from one particular point in time on August 27, 2020:
timestamp | position | text | url |
---|---|---|---|
2020-08-27T13:35:04 | 0 | Kenosha Protests | https://www.buzzfeednews.com/article/ellievhall/kenosha-suspect-kyle-rittenhouse-trump-rally |
2020-08-27T13:35:04 | 1 | Xinjiang Internment Camps | https://www.buzzfeednews.com/article/meghara/china-new-internment-camps-xinjiang-uighurs-muslims |
2020-08-27T13:35:04 | 2 | NBA | https://www.buzzfeednews.com/article/skbaer/milkwaukee-bucks-boycott-jacob-blake |
2020-08-27T13:35:04 | 3 | Hurricane Laura | https://www.buzzfeednews.com/article/emmanuelfelton/hurricane-laura-could-lead-to-an-environmental-disaster-on |
2020-08-27T13:35:04 | 4 | RNC 2020 | https://www.buzzfeednews.com/article/ryancbrooks/trump-white-house-rnc-backdrop |
2020-08-27T13:35:04 | 5 | Mike Pence | https://www.buzzfeednews.com/article/salvadorhernandez/pence-dhs-officer-death-rnc-speech |
Deduplicated data
Because the trending strip typically updated much less often than the script fetched the data, the raw data file contains much redundancy. I.e., two fetches, five minutes apart, often returned exactly the same data.
To simplify this redundancy, I've also created a smaller data file that contains a deduplicated version of the data: data/bfn-trending-strip-deduped.tsv
. It contains roughly 60x fewer rows, and takes up roughly 50x less space (less than 8MB).
Structure
The file contains 51,344 rows, each representing one article observed across a range of time.
The file uses the same core columns as the raw data, but replaces timestamp
with timestamp_first
and timestamp_last
, which represent the first and last consecutive fetches the script saw identical data. If the positions, text, or URLs changed at all, the file begins a new set of entries.
Note: If you need a precise accounting of the specific fetch timings within the timestamp ranges, please see the "Timestamps of all fetches" section below.
Data sample
Here are six rows of the dataset, from one particular time range on August 27, 2020:
timestamp_first | timestamp_last | position | text | url |
---|---|---|---|---|
2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 0 | Kenosha Protests | https://www.buzzfeednews.com/article/ellievhall/kenosha-suspect-kyle-rittenhouse-trump-rally |
2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 1 | Xinjiang Internment Camps | https://www.buzzfeednews.com/article/meghara/china-new-internment-camps-xinjiang-uighurs-muslims |
2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 2 | NBA | https://www.buzzfeednews.com/article/skbaer/milkwaukee-bucks-boycott-jacob-blake |
2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 3 | Hurricane Laura | https://www.buzzfeednews.com/article/emmanuelfelton/hurricane-laura-could-lead-to-an-environmental-disaster-on |
2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 4 | RNC 2020 | https://www.buzzfeednews.com/article/ryancbrooks/trump-white-house-rnc-backdrop |
2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 5 | Mike Pence | https://www.buzzfeednews.com/article/salvadorhernandez/pence-dhs-officer-death-rnc-speech |
Timestamps of all fetches
The data/all-timestamps.tsv
file contains a simple table of all timestamps for which the script successfully obtained data. If you're using the deduplicated data, this file can provide you with a more precise understanding of the fetch timings within the timestamp_first
and timestamp_last
spans.
timestamp |
---|
2018-11-13T22:10:02 |
2018-11-13T22:15:02 |
2018-11-13T22:20:02 |
2018-11-13T22:25:02 |
2018-11-13T22:30:02 |
2018-11-13T22:35:02 |
Licensing
The data files in this repository are available under Creative Commons’ CC BY-SA 4.0 license terms. The code files in this repository are available under the MIT License terms.