FakeRecogna

FakeRecogna is a dataset of real and fake news. The real news items are not directly linked to the fake news items, and vice versa, because such links could bias classification. The news was collected by web crawlers built to mine the pages of well-known news agencies of national importance. Each crawler was tailored to the page it analyzed; the extracted information is first separated into categories and then grouped by date. The variety of news across several pages, together with the different writing styles, gives the dataset great diversity for natural language processing and machine learning.

The Dataset

The fake news mining focused mainly on pages listed by the Duke Reporters' Lab, which maintains a list of sites that verify the veracity of news worldwide. In 2019 there were 160 active fact-checking agencies in the world, and Brazil figures as a growing ecosystem, currently with 9 initiatives. Six of these 9 pages were considered during the search, with great variation in the number of fake news items extracted from each one, for a total of 5,951 samples. Table 1 lists the initiatives considered and the number of fake news items collected from each source.

Fact-Check Agency	Web address	# News
Boatos.org	https://boatos.org	2,605
Fato ou Fake	https://oglobo.globo.com/fato-ou-fake	1,055
E-farsas	https://www.e-farsas.com	812
UOL Confere	https://noticias.uol.com.br/confere	582
AFP Checamos	https://checamos.afp.com/afp-brasil	509
Projeto Comprova	https://projetocomprova.com.br	388
Total		5,951

For the real news, the crawlers searched portals such as G1, UOL, and Extra, which are publicly recognized as reliable news outlets, as well as the home page of the Ministry of Health of Brazil, resulting in a collection of over 100,000 samples. From this set, 5,951 samples were selected to keep the classes balanced, resulting in a dataset of 11,902 samples.
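The class balancing described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the text does not say how the 5,951 real samples were chosen, so uniform random undersampling is an assumption, and `balance_by_undersampling` and the toy inputs are hypothetical names.

```python
import random


def balance_by_undersampling(fake, real, seed=0):
    """Keep all fake samples and randomly keep an equal number of real ones."""
    if len(real) < len(fake):
        raise ValueError("need at least as many real samples as fake samples")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    kept_real = rng.sample(real, k=len(fake))
    # Labels follow the dataset convention: 0 = fake news, 1 = real news.
    return [(text, 0) for text in fake] + [(text, 1) for text in kept_real]


# Toy stand-ins for the crawled articles (not actual dataset content).
fake_news = [f"fake-{i}" for i in range(5)]
real_news = [f"real-{i}" for i in range(100)]
balanced = balance_by_undersampling(fake_news, real_news)
```

With the real collection, the same call would reduce the 100,000+ real articles to 5,951, giving 11,902 labeled samples in total.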

More information

The FakeRecogna dataset is available on GitHub as a single XLSX file with 8 metadata columns, where each row is one sample (real or fake news), as described in Table 2.

Column	Description
Title	Title of article
Sub-title (if available)	Brief description of news
News	Information about the article
Category	News grouped according to its subject
Author	Publication author
Date	Publication date
URL	Article web address
Class	0 for fake news and 1 for real news
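To make the row layout concrete, the sketch below maps one spreadsheet row to a typed record. It assumes the rows have already been read into dictionaries by some spreadsheet reader and that the header names match Table 2; the actual headers in the XLSX file may differ, and `NewsSample`, `COLUMNS`, and `row_to_sample` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

# Assumed header names, taken from Table 2; the real XLSX headers may differ.
COLUMNS = ("Title", "Sub-title", "News", "Category", "Author", "Date", "URL", "Class")


@dataclass(frozen=True)
class NewsSample:
    title: str
    subtitle: Optional[str]  # None when the article has no sub-title
    news: str
    category: str
    author: str
    date: str  # kept as text; the date format is not specified here
    url: str
    is_real: bool  # Class column: 0 = fake news, 1 = real news


def row_to_sample(row: Mapping[str, Any]) -> NewsSample:
    """Validate one row and convert it into a NewsSample."""
    missing = [name for name in COLUMNS if name not in row]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    label = int(row["Class"])
    if label not in (0, 1):
        raise ValueError(f"Class must be 0 or 1, got {label!r}")
    subtitle = row["Sub-title"]
    return NewsSample(
        title=str(row["Title"]),
        subtitle=str(subtitle) if subtitle else None,
        news=str(row["News"]),
        category=str(row["Category"]),
        author=str(row["Author"]),
        date=str(row["Date"]),
        url=str(row["URL"]),
        is_real=(label == 1),
    )


# Made-up row for illustration only (not taken from the dataset).
example = row_to_sample({
    "Title": "Example headline", "Sub-title": "", "News": "Example body text.",
    "Category": "Health", "Author": "Unknown", "Date": "2021-01-01",
    "URL": "https://example.org/article", "Class": 0,
})
```

Converting the `Class` value to a boolean up front avoids confusing the two label values later in a pipeline.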

The collected texts are distributed across six categories according to their main subject: Brazil, Entertainment, Health, Politics, Science, and World. These categories are based on the sections of the news sites from which the articles were extracted. Table 3 shows the number and percentage of news items in each category.

Category	# News	%
Brazil	904	7.6
Entertainment	1,409	11.8
Health	4,456	37.4
Politics	3,951	33.2
Science	602	5.1
World	580	4.9
Total	11,902	100.0
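The category distribution can be recomputed from the `Category` column with a short script. The sketch below is illustrative: `category_shares` is a hypothetical helper, and the label list is rebuilt from the counts in Table 3 rather than read from the XLSX file.

```python
from collections import Counter


def category_shares(categories):
    """Return {category: (count, percentage rounded to one decimal)}."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {
        name: (n, round(100 * n / total, 1))
        for name, n in counts.most_common()  # most frequent category first
    }


# Per-category counts from Table 3, expanded into one label per sample.
TABLE_3_COUNTS = {
    "Brazil": 904, "Entertainment": 1409, "Health": 4456,
    "Politics": 3951, "Science": 602, "World": 580,
}
labels = [name for name, n in TABLE_3_COUNTS.items() for _ in range(n)]

shares = category_shares(labels)
for name, (count, pct) in shares.items():
    print(f"{name}\t{count:,}\t{pct}")
```

Health and Politics together account for roughly 70% of the samples, which is worth keeping in mind when splitting the data or reporting per-category results.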