FakeRecogna

FakeRecogna is a dataset comprising real and fake news. The real news items are not directly linked to the fake news items, and vice versa, because such links could bias classification. The news was collected by crawlers developed to mine the pages of well-known news agencies of national importance. Each crawler was tailored to the webpage it analyzed; the extracted information is first separated into categories and then grouped by date. The variety of news across several pages and the different writing styles give the dataset great diversity for natural language processing analysis and machine learning algorithms.

The Dataset

The fake news mining focused mainly on pages listed by the Duke Reporters' Lab, which maintains a list of pages that verify the veracity of news worldwide. In 2019 there were 160 active fact-checking agencies in the world, and Brazil figures as a growing ecosystem that currently has 9 initiatives. Of these 9 pages, 6 were considered during the search, with great variation in the number of fake news items extracted from each one, for a total of 5,951 samples. Table 1 presents the initiatives considered and the number of fake news items collected from each source.

| Fact-Check Agency | Web address | # News |
| --- | --- | --- |
| Boatos.org | https://boatos.org | 2,605 |
| Fato ou Fake | https://oglobo.globo.com/fato-ou-fake | 1,055 |
| E-farsas | https://www.e-farsas.com | 812 |
| UOL Confere | https://noticias.uol.com.br/confere | 582 |
| AFP Checamos | https://checamos.afp.com/afp-brasil | 509 |
| Projeto Comprova | https://projetocomprova.com.br | 388 |
| Total | | 5,951 |

For the real news, the crawlers searched portals such as G1, UOL, and Extra, which are publicly recognized as reliable news outlets, as well as the home page of the Ministry of Health of Brazil, resulting in a collection of over 100,000 samples. From this set, 5,951 samples were selected to keep the two classes balanced, resulting in a dataset of 11,902 samples.
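
The class balancing described above can be illustrated with a minimal sketch. The source does not say how the 5,951 real samples were chosen, so uniform random sampling with a fixed seed is an assumption here, and `balance_classes` and its arguments are hypothetical names used only for illustration.

```python
import random


def balance_classes(fake, real, seed=0):
    """Downsample the larger class so both classes have the same size.

    Assumption: samples are drawn uniformly at random; the dataset authors
    may have used a different selection criterion.
    """
    rng = random.Random(seed)
    n = min(len(fake), len(real))
    # Label convention of the dataset: 0 = fake news, 1 = real news.
    balanced = [(text, 0) for text in rng.sample(fake, n)]
    balanced += [(text, 1) for text in rng.sample(real, n)]
    rng.shuffle(balanced)
    return balanced


# Toy stand-ins for the crawled collections (not real data).
fake_news = [f"fake article {i}" for i in range(5)]
real_news = [f"real article {i}" for i in range(100)]

dataset = balance_classes(fake_news, real_news)
print(len(dataset))  # 10
```

Applied to the figures above, the same idea would reduce the collection of over 100,000 real articles to 5,951, matching the number of fake articles.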

More information

The FakeRecogna dataset is available on GitHub as a single XLSX file with 8 metadata columns, in which each row is one sample (real or fake news), as described in Table 2.

| Column | Description |
| --- | --- |
| Title | Title of the article |
| Sub-title (if available) | Brief description of the news |
| News | Information about the article |
| Category | News grouped according to its subject |
| Author | Publication author |
| Date | Publication date |
| URL | Article web address |
| Class | 0 for fake news and 1 for real news |
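
As a rough illustration of this layout, the sketch below checks that one sample carries the eight columns and a valid label. The column names are taken from the table labels and are an assumption: the headers in the actual XLSX file may be spelled differently. `validate_row` and the sample values are hypothetical.

```python
# Column names as listed in Table 2 (assumed; the XLSX headers may differ).
EXPECTED_COLUMNS = [
    "Title", "Sub-title", "News", "Category",
    "Author", "Date", "URL", "Class",
]


def validate_row(row):
    """Return a list of problems found in one sample (empty if none)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in row]
    if missing:
        problems.append(f"missing columns: {missing}")
    # Class encodes the label: 0 = fake news, 1 = real news.
    if row.get("Class") not in (0, 1):
        problems.append(f"unexpected Class value: {row.get('Class')!r}")
    return problems


# Hypothetical sample, for illustration only.
sample = {
    "Title": "Example headline",
    "Sub-title": None,  # the sub-title is optional
    "News": "Body text of the article.",
    "Category": "Health",
    "Author": "Unknown",
    "Date": "2020-01-01",
    "URL": "https://example.com/article",
    "Class": 1,
}
print(validate_row(sample))  # []
```

With pandas installed, rows in this shape can be obtained with `pandas.read_excel(path).to_dict("records")`, where `path` points to the downloaded XLSX file.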

The collected texts are distributed into six categories according to their main subjects: Brazil, Entertainment, Health, Politics, Science, and World. These categories are based on the sections of the news outlets from which the articles were extracted. The distribution of news by category, with percentages, is given in Table 3.

| Category | # News | % |
| --- | --- | --- |
| Brazil | 904 | 7.6 |
| Entertainment | 1,409 | 11.8 |
| Health | 4,456 | 37.4 |
| Politics | 3,951 | 33.2 |
| Science | 602 | 5.1 |
| World | 580 | 4.9 |
| Total | 11,902 | 100.0 |
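
The percentages can be recomputed from the per-category counts as a consistency check. The short sketch below does so with one-decimal rounding, which is an assumption about how the published figures were rounded, so small differences from the published values are possible.

```python
# Article counts per category, taken from Table 3.
counts = {
    "Brazil": 904,
    "Entertainment": 1409,
    "Health": 4456,
    "Politics": 3951,
    "Science": 602,
    "World": 580,
}

total = sum(counts.values())
# Share of each category in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)  # 11902
for name, share in shares.items():
    print(f"{name}: {share}%")
```

Because each share is rounded independently, the rounded values need not add up to exactly 100.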