Dataset: BuzzFeed News “Trending” Strip, 2018–2023

A tribute to a trailblazing newsroom.

BuzzFeedNews.com launched in July 2018 as the dedicated homepage for BuzzFeed News. (Previously, BuzzFeed’s news coverage was published on BuzzFeed’s main domain, buzzfeed.com.) One key feature of the site was its “Trending” strip, curated by editors and highlighting up to eight articles at a time:

Screenshot of the trending strip

In mid-November 2018, a few months after the site launched, I wrote a script to fetch that list of articles and to save that information to a simple file. The script ran every five minutes (with occasional interruptions) until the newsroom’s final day of operation in May 2023. This repository contains all the data the script collected, in raw and deduplicated forms.

Disclosure: I worked for BuzzFeed’s news division from March 2014 to January 2022. I undertook this project on personal time and out of personal interest, using only the publicly-available homepage; nothing here should be considered to represent the views of BuzzFeed or BuzzFeed News.

Raw data

The file data/bfn-trending-strip-raw.tsv.gz contains the raw data I collected. I have compressed it with gzip, which reduces the size from 390MB to 11MB.
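For example, here is a minimal sketch of loading the compressed file directly with pandas (assuming a local clone of the repository and that the file includes a header row with the columns described under Structure below):

```python
import pandas as pd

# pandas infers gzip compression from the ".gz" extension,
# so the file can be read without decompressing it first.
raw = pd.read_csv(
    "data/bfn-trending-strip-raw.tsv.gz",
    sep="\t",
    parse_dates=["timestamp"],
)

print(len(raw))    # roughly 3.1 million rows
print(raw.head())
```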

Structure

The file contains 3.1 million rows, each representing one article observed at one point in time.

The file uses these columns:

| column | description |
|---|---|
| timestamp | The date and time at which the script fetched the strip |
| position | The article's zero-indexed position within the strip (0–7) |
| text | The label displayed for the article in the strip |
| url | The URL of the article |

Note: Although the script generally ran every five minutes, there are some gaps in the data, accounting for roughly 3% of the total time period covered. These gaps stem from two main factors: technical complications (such as server downtime) and periods during which the website swapped out the trending strip for breaking news alerts, single-story highlights, or other notices. Unfortunately, I did not have the foresight to collect data that would distinguish between those scenarios.
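As a rough illustration, here is a sketch of locating those gaps from the raw data; the 15-minute threshold is an arbitrary cutoff, chosen only because it is well above the usual five-minute cadence:

```python
import pandas as pd

raw = pd.read_csv(
    "data/bfn-trending-strip-raw.tsv.gz",
    sep="\t",
    parse_dates=["timestamp"],
)

# One entry per successful fetch, in chronological order.
fetches = raw["timestamp"].drop_duplicates().sort_values()

# Interval between each fetch and the previous one.
intervals = fetches.diff()

# Anything well beyond five minutes suggests a gap in collection.
gaps = intervals[intervals > pd.Timedelta(minutes=15)]
print(f"{len(gaps)} gaps, totaling {gaps.sum()}")
```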

Example data

Here are six rows of the dataset, from one particular point in time on August 27, 2020:

| timestamp | position | text | url |
|---|---|---|---|
| 2020-08-27T13:35:04 | 0 | Kenosha Protests | https://www.buzzfeednews.com/article/ellievhall/kenosha-suspect-kyle-rittenhouse-trump-rally |
| 2020-08-27T13:35:04 | 1 | Xinjiang Internment Camps | https://www.buzzfeednews.com/article/meghara/china-new-internment-camps-xinjiang-uighurs-muslims |
| 2020-08-27T13:35:04 | 2 | NBA | https://www.buzzfeednews.com/article/skbaer/milkwaukee-bucks-boycott-jacob-blake |
| 2020-08-27T13:35:04 | 3 | Hurricane Laura | https://www.buzzfeednews.com/article/emmanuelfelton/hurricane-laura-could-lead-to-an-environmental-disaster-on |
| 2020-08-27T13:35:04 | 4 | RNC 2020 | https://www.buzzfeednews.com/article/ryancbrooks/trump-white-house-rnc-backdrop |
| 2020-08-27T13:35:04 | 5 | Mike Pence | https://www.buzzfeednews.com/article/salvadorhernandez/pence-dhs-officer-death-rnc-speech |

Deduplicated data

Because the trending strip typically updated much less often than the script fetched it, the raw data file contains substantial redundancy: two fetches taken five minutes apart often returned exactly the same data.

To reduce this redundancy, I've also created a smaller file containing a deduplicated version of the data: data/bfn-trending-strip-deduped.tsv. It contains roughly 60x fewer rows and takes up roughly 50x less space (less than 8MB).

Structure

The file contains 51,344 rows, each representing one article observed across a range of time.

The file uses the same core columns as the raw data, but replaces timestamp with timestamp_first and timestamp_last, which mark the first and last of the consecutive fetches in which the script saw identical data. If the positions, text, or URLs changed at all, the file begins a new set of entries.

Note: If you need a precise accounting of the specific fetch timings within the timestamp ranges, please see the "Timestamps of all fetches" section below.
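For reference, here is a rough sketch of that collapsing logic, reconstructed from the description above rather than taken from the original script; it assumes the raw file has been loaded as in the earlier example:

```python
import pandas as pd

raw = pd.read_csv(
    "data/bfn-trending-strip-raw.tsv.gz",
    sep="\t",
    parse_dates=["timestamp"],
)

# Represent each fetch as a single hashable snapshot of the whole strip.
snapshots = (
    raw.sort_values(["timestamp", "position"])
       .groupby("timestamp")[["position", "text", "url"]]
       .apply(lambda g: tuple(zip(g["position"], g["text"], g["url"])))
)

# Start a new run whenever the strip differs from the previous fetch.
run_id = (snapshots != snapshots.shift()).cumsum()

rows = []
for _, run in snapshots.groupby(run_id):
    first, last = run.index.min(), run.index.max()
    for position, text, url in run.iloc[0]:
        rows.append((first, last, position, text, url))

deduped = pd.DataFrame(
    rows,
    columns=["timestamp_first", "timestamp_last", "position", "text", "url"],
)
```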

Data sample

Here are six rows of the dataset, from one particular time range on August 27, 2020:

| timestamp_first | timestamp_last | position | text | url |
|---|---|---|---|---|
| 2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 0 | Kenosha Protests | https://www.buzzfeednews.com/article/ellievhall/kenosha-suspect-kyle-rittenhouse-trump-rally |
| 2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 1 | Xinjiang Internment Camps | https://www.buzzfeednews.com/article/meghara/china-new-internment-camps-xinjiang-uighurs-muslims |
| 2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 2 | NBA | https://www.buzzfeednews.com/article/skbaer/milkwaukee-bucks-boycott-jacob-blake |
| 2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 3 | Hurricane Laura | https://www.buzzfeednews.com/article/emmanuelfelton/hurricane-laura-could-lead-to-an-environmental-disaster-on |
| 2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 4 | RNC 2020 | https://www.buzzfeednews.com/article/ryancbrooks/trump-white-house-rnc-backdrop |
| 2020-08-27T13:35:04 | 2020-08-27T17:50:03 | 5 | Mike Pence | https://www.buzzfeednews.com/article/salvadorhernandez/pence-dhs-officer-death-rnc-speech |
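As one simple use of these ranges, here is a sketch that measures roughly how long each entry stayed on the strip; note that timestamp_last marks the last fetch at which the entry was still present, not the moment it was removed:

```python
import pandas as pd

deduped = pd.read_csv(
    "data/bfn-trending-strip-deduped.tsv",
    sep="\t",
    parse_dates=["timestamp_first", "timestamp_last"],
)

# Approximate time on the strip; the true removal time falls somewhere
# between timestamp_last and the next fetch.
deduped["duration"] = deduped["timestamp_last"] - deduped["timestamp_first"]
print(deduped.nlargest(5, "duration")[["text", "url", "duration"]])
```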

Timestamps of all fetches

The data/all-timestamps.tsv file contains a simple table of all timestamps for which the script successfully obtained data. If you're using the deduplicated data, this file can provide you with a more precise understanding of the fetch timings within the timestamp_first and timestamp_last spans.

| timestamp |
|---|
| 2018-11-13T22:10:02 |
| 2018-11-13T22:15:02 |
| 2018-11-13T22:20:02 |
| 2018-11-13T22:25:02 |
| 2018-11-13T22:30:02 |
| 2018-11-13T22:35:02 |
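For example, here is a sketch of counting the individual fetches that fall within one deduplicated span, by combining the two files (again assuming a local clone of the repository):

```python
import pandas as pd

timestamps = pd.read_csv(
    "data/all-timestamps.tsv",
    sep="\t",
    parse_dates=["timestamp"],
)
deduped = pd.read_csv(
    "data/bfn-trending-strip-deduped.tsv",
    sep="\t",
    parse_dates=["timestamp_first", "timestamp_last"],
)

# Pick one span and count the successful fetches inside it.
spans = deduped[["timestamp_first", "timestamp_last"]].drop_duplicates()
row = spans.iloc[0]
in_span = timestamps["timestamp"].between(row["timestamp_first"], row["timestamp_last"])
print(in_span.sum(), "fetches between", row["timestamp_first"], "and", row["timestamp_last"])
```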

Licensing

The data files in this repository are available under Creative Commons’ CC BY-SA 4.0 license terms. The code files in this repository are available under the MIT License terms.