NoReC: The Norwegian Review Corpus

This repository distributes the Norwegian Review Corpus (NoReC), created for the purpose of training and evaluating models for document-level sentiment analysis. More than 43,000 full-text reviews have been collected from major Norwegian news sources and cover a range of different domains, including literature, movies, video games, restaurants, music and theater, in addition to product reviews across a range of categories. Each review is labeled with a manually assigned score of 1–6, as provided by the rating of the original author. The accompanying paper by Velldal et al. at LREC 2018 describes the (initial release of the) data in more detail.

Sources and partners

NoReC was created as part of the SANT project (Sentiment Analysis for Norwegian Text), a collaboration between the Language Technology Group (LTG) at the Department of Informatics at the University of Oslo, the Norwegian Broadcasting Corporation (NRK), Schibsted Media Group and Aller Media. This second release of the corpus, v2.1, comprises 43,437 review texts extracted from eight different news sources: Dagbladet, VG, Aftenposten, Bergens Tidende, Fædrelandsvennen, Stavanger Aftenblad, DinSide.no and P3.no. In terms of publishing date, the reviews mainly cover the time span 2003–2019, although the corpus also includes a handful of reviews dating back as far as 1998.

Terms of use

The data is distributed under a Creative Commons Attribution-NonCommercial licence (CC BY-NC 4.0). The full licence text is available here: https://creativecommons.org/licenses/by-nc/4.0/

The licence is motivated by the need to prevent third parties from redistributing the original reviews for commercial purposes. Note that machine-learned models, extracted lexicons, embeddings, and similar resources created on the basis of NoReC are not considered to contain the original data, and so can be freely used for commercial purposes as well, despite the non-commercial condition.

Formats and pre-processing

The reviews are distributed as .txt files, split into train, dev, and test sets. The files contain sentence- and paragraph-segmented texts, processed using UDPipe.

Metadata for each review is provided as a JSON object, all listed in a single file, metadata.json, indexed on the document ID. The JSON objects record properties like the numerical rating (an integer in the range 1–6), the thematic category or domain, the URL of the original document, and so on. They also record which of the two official varieties of Norwegian is used, as detected using langid.py.
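As a minimal sketch of how the metadata might be used, the following Python snippet loads metadata.json and tallies reviews per split and rating. The key names "rating" and "split", the helper names, and the data/metadata.json path are assumptions made for illustration; inspect the file to confirm the actual field names before relying on them.

```python
import json
from collections import Counter
from pathlib import Path


def load_metadata(path):
    """Load the metadata file: one JSON object mapping document ID -> properties."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def rating_counts(metadata, rating_key="rating", split_key="split"):
    """Count reviews per (split, rating) pair.

    The default key names are assumptions; adjust them to match the file.
    Missing keys are counted under None rather than raising an error.
    """
    counts = Counter()
    for props in metadata.values():
        counts[(props.get(split_key), props.get(rating_key))] += 1
    return counts


if __name__ == "__main__":
    # Assumed location of the file after cloning the repository.
    path = Path("data") / "metadata.json"
    if path.exists():
        counts = rating_counts(load_metadata(path))
        # Sort on string forms so that None values do not break the ordering.
        for (split, rating), n in sorted(
            counts.items(), key=lambda item: (str(item[0][0]), str(item[0][1]))
        ):
            print(split, rating, n)
```

If the assumed keys are right, the printed counts can be checked against the split and rating statistics reported for the corpus.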

Structure

Each review is stored as a separate file, with the filename given by the review ID. To facilitate replicability of experiments, the corpus comes with pre-defined standard splits for training, development and testing, with an 80–10–10 ratio. The data directory of the distribution is structured as follows, where the train/dev/test directories hold the individual files (e.g. 000042.txt):

data
├── metadata.json
├── train
├── dev
└── test
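
Given this layout, the text of a review can be located from its split and ID. The Python sketch below shows one possible way to read a split and pair each text with its rating. The function names iter_reviews and load_labelled are hypothetical, and the "rating" key and the assumption that metadata.json is keyed by the filename stem (e.g. "000042") should be verified against the actual files.

```python
import json
from pathlib import Path


def iter_reviews(data_dir, split):
    """Yield (doc_id, text) for every .txt file in data_dir/<split>."""
    split_dir = Path(data_dir) / split
    for path in sorted(split_dir.glob("*.txt")):
        # The filename without its extension is the review ID, e.g. "000042".
        yield path.stem, path.read_text(encoding="utf-8")


def load_labelled(data_dir, split, rating_key="rating"):
    """Pair each review text in a split with its rating from metadata.json.

    Assumes metadata.json is keyed by the same ID as the filename stem and
    that the rating is stored under rating_key.
    """
    data_dir = Path(data_dir)
    with open(data_dir / "metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)
    return [
        (text, metadata[doc_id][rating_key])
        for doc_id, text in iter_reviews(data_dir, split)
    ]
```

For example, load_labelled("data", "dev") would return a list of (text, rating) pairs for the development set, ready to pass to a classifier or an evaluation script.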

Obtaining the data

git clone https://github.com/ltgoslo/norec

Citing

If you publish work that uses or references the data, please cite our LREC article. BibTeX entry:

@InProceedings{VelOvrBer18,
  author = {Erik Velldal and Lilja {\O}vrelid and 
            Eivind Alexander Bergem and  Cathrine Stadsnes and 
            Samia Touileb and Fredrik J{\o}rgensen},
  title = {{NoReC}: The {N}orwegian {R}eview {C}orpus},
  booktitle = {Proceedings of the 11th edition of the 
               Language Resources and Evaluation Conference},
  year = {2018},
  address = {Miyazaki, Japan},
  pages = {4186--4191}
}

Some statistics

Distribution over year and publication source

All splits combined

year    ap   bt   db  dinside  fvn   p3   sa    vg  Total
2003*    0    4    0      143    0   25    0   286    458
2004     0   44    0      142    0   12   19   984   1201
2005     0    0    0      179    0    6  224   909   1318
2006     0    0    0      240    0   11  294   778   1323
2007     0    0    0      139    0  127  400   725   1391
2008     0    0    0      119    0  216  369   739   1443
2009     0   52  377      163   27  428  259   815   2121
2010     0  100  642      260  156  571  309   769   2807
2011     1   51  592      284  146  652  362   900   2988
2012     2  150  613      257  332  611  561   763   3289
2013     4  160  527      216  213  619  433  1058   3230
2014    39  291  501      236  357  546  387  1191   3548
2015   249  235  728      245  456  499  620   849   3881
2016   309  340  809      177  321  439  682   715   3792
2017   649  491  921      248  692  567  822   687   5077
2018   605  470  885      194  466  339  860   492   4311
2019   260  167   95       30  160   36  346   165   1259

2003*: Including the 31 documents from 1998–2002

Distribution over split and rating

split     1     2     3      4      5     6  Total
dev      51   225   707   1409   1678   278   4348
test     27   242   706   1385   1714   266   4340
train   379  2287  6004  11304  12614  2161  34749

Distribution over split and category

split  games  literature  misc  music  products  restaurants  screen  sports  stage  Total
dev      179         539    28   1445       347           94    1569      15    132   4348
test     180         547    24   1444       345           98    1579      16    107   4340
train   1453        4337   156  11777      2771          745   12536     118    856  34749

What's new

Version 2.1 November 2023:
We have cleaned NoReC, introducing the following changes:

Updated "category" data

There were previously 4619 texts in the "misc" category. We have assigned the correct category to most of these, based on the source categories, source tags and manual inspection. The remaining 208 texts labeled "misc" should now be truly miscellaneous, such as reviews of podcasts, art exhibitions and politicians taking part in debates.

We consider the "category" tag to be the best representation of domain for the reviewed entity or event.

Removed duplicates

177 reviews were found to be duplicates: cross-postings in more than one news outlet within the same media group. Removing them reduced the total count of reviews from 43614 to 43437.