XED

This is the XED dataset. The dataset consists of emotion-annotated movie subtitles from OPUS, annotated with Plutchik's 8 core emotions. The data is multilabel. The original annotations were made mainly for English and Finnish; the rest were created by annotation projection to aligned subtitles in 41 additional languages, of which 31 languages (each with more than 950 annotated subtitle lines) are included in the final dataset. The dataset is an ongoing project, with additions such as machine-translated datasets forthcoming. Please let us know if you find any errors or come across other issues with the datasets!

Citation

You can read more about XED in the following paper:

Öhman, E., Pàmies, M., Kajava, K. and Tiedemann, J., 2020. XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).

@inproceedings{ohman2020xed,
  title={{XED}: A Multilingual Dataset for Sentiment Analysis and Emotion Detection},
  author={{\"O}hman, Emily and P{\`a}mies, Marc and Kajava, Kaisla and Tiedemann, J{\"o}rg},
  booktitle={The 28th International Conference on Computational Linguistics (COLING 2020)},
  year={2020}
}

Please cite this paper if you use the dataset.

Format

The files are formatted as follows:

sentence1\tlabel1,label2
sentence2\tlabel2,label3,label4...

The numbers indicate the emotions in ascending alphabetical order: anger:1, anticipation:2, disgust:3, fear:4, joy:5, sadness:6, surprise:7, trust:8, with neutral:0 where applicable. Note that if you use our BERT code, it re-arranges the original 1-8 labels into the range 0-7 by switching trust:8 -> 0 (the other labels keep their values).
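As a sketch, the format and label remapping described above can be handled in a few lines of Python (the example line and values below are illustrative, not taken from the data):

```python
def parse_line(line):
    """Split one 'sentence<TAB>label,label,...' row into (sentence, [int labels])."""
    sentence, labels = line.rstrip("\n").split("\t")
    return sentence, [int(x) for x in labels.split(",")]

def remap_for_bert(labels):
    """Shift the 1-8 label range into 0-7: trust (8) becomes 0, others are unchanged."""
    return [x % 8 for x in labels]

sentence, labels = parse_line("An illustrative subtitle line\t2,4")
print(sentence, labels)           # An illustrative subtitle line [2, 4]
print(remap_for_bert([1, 7, 8]))  # [1, 7, 0]
```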

Metadata can be found in the metadata file and in the projection "pairs" files. Detailed metadata is available on the OPUS website; we recommend the use of OPUS Tools. Compatible augmentation data produced by expert annotators is available for a selection of languages in the following repos:

NB! The number of annotated subtitle lines is not the same as listed in the original paper: the original paper gives the number of annotations, whereas the files here contain one row per line with annotations.
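To make the distinction concrete, here is a small sketch (with invented example rows): the number of lines is the number of rows in a file, while the number of annotations is the total count of labels across all rows.

```python
rows = [
    "example line one\t1,5",     # invented examples, not real data
    "example line two\t3",
    "example line three\t2,6,7",
]

num_lines = len(rows)
num_annotations = sum(len(row.split("\t")[1].split(",")) for row in rows)
print(num_lines, num_annotations)  # 3 6
```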

Evaluations

We used BERT to test the annotations for Finnish, English, and a handful of other languages for which complete BERT models are available.

English annotated data

Number of annotations: 24164 + 9384 neutral
Number of unique data points: 17530 + 6420 neutral
Number of emotions: 8 (+pos, neg, neu)
Number of annotators: 108 (63 active)

| data | f1 | accuracy |
| --- | --- | --- |
| English without NER, BERT | 0.530 | 0.538 |
| English with NER, BERT | 0.536 | 0.544 |
| English NER with neutral, BERT | 0.467 | 0.529 |
| English NER binary with surprise, BERT | 0.679 | 0.765 |
| English NER true binary, BERT | 0.838 | 0.840 |
| English NER, one-vs-rest Linear SVC | 0.502 | 0.650-0.789 / class |

Multilingual projections

Results for the other languages with more than 950 annotated lines, using a linear SVM:

| LANG | SIZE | AVG_LEN | ANGER | ANTICIP. | DISGUST | FEAR | JOY | SADNESS | SURPRISE | TRUST | 1 label | 2 labels | 3 labels | 4+ labels | F1_SVM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AR | 3590 | 30.02 | 1012 | 839 | 478 | 565 | 561 | 536 | 615 | 589 | 65.01% | 26.94% | 6.74% | 1.31% | 0.5729 |
| BG | 6974 | 41.3 | 1923 | 1630 | 891 | 1051 | 1174 | 1112 | 1166 | 1239 | 64.01% | 27.89% | 6.62% | 1.48% | 0.6069 |
| BR | 12295 | 38.49 | 3228 | 2846 | 1641 | 1821 | 2128 | 2025 | 2121 | 2098 | 64.69% | 27.02% | 6.66% | 1.63% | 0.6726 |
| BS | 2443 | 33.13 | 632 | 571 | 294 | 367 | 428 | 394 | 397 | 399 | 65.98% | 26.65% | 6.47% | 0.9% | 0.5854 |
| CN | 1395 | 10.92 | 315 | 315 | 140 | 180 | 288 | 221 | 242 | 266 | 66.31% | 27.46% | 5.16% | 1.08% | 0.5004 |
| CS | 6511 | 29.94 | 1728 | 1615 | 807 | 1035 | 1045 | 1011 | 1110 | 1091 | 64.64% | 27.42% | 6.63% | 1.31% | 0.6263 |
| DA | 1838 | 31.03 | 447 | 472 | 193 | 218 | 350 | 282 | 294 | 351 | 66.59% | 26.17% | 6.2% | 1.03% | 0.5989 |
| DE | 5503 | 50.24 | 1492 | 1304 | 742 | 790 | 938 | 889 | 905 | 904 | 64.96% | 27.11% | 6.6% | 1.33% | 0.6059 |
| EL | 8083 | 35.22 | 2238 | 1956 | 1070 | 1162 | 1369 | 1273 | 1345 | 1367 | 64.25% | 27.58% | 6.73% | 1.45% | 0.6192 |
| ES | 11303 | 35.69 | 3007 | 2631 | 1482 | 1765 | 1902 | 1810 | 1959 | 1924 | 64.52% | 27.22% | 6.59% | 1.66% | 0.676 |
| ET | 1476 | 28.66 | 370 | 396 | 144 | 218 | 280 | 210 | 222 | 255 | 65.58% | 27.57% | 6.17% | 0.68% | 0.5449 |
| FI | 8289 | 29.11 | 2175 | 2010 | 1014 | 1281 | 1503 | 1243 | 1383 | 1447 | 64.3% | 27.8% | 6.38% | 1.52% | 0.5859 |
| FR | 7306 | 41.27 | 1946 | 1726 | 994 | 1127 | 1256 | 1200 | 1198 | 1259 | 63.63% | 28.02% | 6.86% | 1.49% | 0.6257 |
| HE | 4449 | 28.97 | 1244 | 1078 | 551 | 658 | 791 | 681 | 754 | 783 | 63.34% | 28.37% | 6.74% | 1.55% | 0.598 |
| HR | 5941 | 31.7 | 1494 | 1408 | 724 | 978 | 1029 | 947 | 991 | 1052 | 64.13% | 28.24% | 6.26% | 1.36% | 0.6503 |
| HU | 5777 | 32.07 | 1539 | 1378 | 715 | 925 | 937 | 899 | 989 | 1028 | 64.19% | 27.77% | 6.63% | 1.42% | 0.5978 |
| IS | 977 | 29.55 | 236 | 230 | 121 | 124 | 175 | 168 | 134 | 180 | 66.84% | 27.12% | 5.32% | 0.72% | 0.5416 |
| IT | 6552 | 44.65 | 1783 | 1514 | 887 | 1092 | 1011 | 1122 | 1065 | 1104 | 63.58% | 28.4% | 6.59% | 1.42% | 0.6907 |
| MK | 300 | 28.9 | 58 | 100 | 33 | 36 | 61 | 53 | 64 | 52 | 58.67% | 31.0% | 9.67% | 0.67% | 0.4961 |
| NL | 5333 | 33.93 | 1392 | 1337 | 658 | 822 | 878 | 857 | 942 | 927 | 64.22% | 27.21% | 6.86% | 1.71% | 0.614 |
| NO | 4257 | 31.1 | 1051 | 1029 | 500 | 584 | 822 | 678 | 731 | 712 | 65.09% | 27.93% | 5.68% | 1.29% | 0.5771 |
| PL | 7179 | 32.44 | 1966 | 1707 | 964 | 1121 | 1206 | 1119 | 1199 | 1220 | 64.03% | 27.72% | 6.69% | 1.56% | 0.6233 |
| PT | 7220 | 33.72 | 1890 | 1710 | 906 | 1101 | 1260 | 1210 | 1234 | 1257 | 63.85% | 27.87% | 6.86% | 1.43% | 0.6203 |
| RO | 9474 | 36.88 | 2543 | 2181 | 1258 | 1433 | 1563 | 1568 | 1579 | 1608 | 64.9% | 27.07% | 6.58% | 1.45% | 0.6387 |
| RU | 2377 | 32.45 | 564 | 590 | 268 | 423 | 376 | 395 | 416 | 405 | 64.7% | 27.6% | 6.6% | 1.09% | 0.5976 |
| SK | 975 | 59.82 | 256 | 234 | 99 | 168 | 168 | 153 | 152 | 159 | 65.44% | 28.0% | 5.54% | 1.03% | 0.5305 |
| SL | 2680 | 29.19 | 679 | 694 | 278 | 402 | 456 | 416 | 481 | 419 | 65.52% | 27.61% | 5.6% | 1.27% | 0.6015 |
| SR | 8984 | 31.69 | 2365 | 2163 | 1131 | 1282 | 1652 | 1399 | 1519 | 1565 | 64.3% | 27.58% | 6.72% | 1.39% | 0.6566 |
| SV | 4905 | 44.34 | 1273 | 1160 | 591 | 691 | 815 | 831 | 866 | 827 | 65.3% | 27.01% | 6.48% | 1.2% | 0.6218 |
| TR | 9202 | 35.95 | 2423 | 2243 | 1212 | 1339 | 1610 | 1469 | 1589 | 1628 | 63.64% | 28.03% | 6.71% | 1.63% | 0.608 |
| VI | 956 | 34.53 | 245 | 224 | 128 | 141 | 187 | 150 | 144 | 178 | 63.28% | 28.56% | 7.11% | 1.05% | 0.5594 |
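For reference, a one-vs-rest linear SVC baseline of the kind reported above can be sketched with scikit-learn. This is a minimal illustration on invented toy data (TF-IDF features, binarized multilabel targets), not the exact pipeline behind the reported scores:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Invented stand-ins for (subtitle line, emotion labels) pairs.
texts = [
    "I hate this place",
    "what a lovely surprise",
    "I knew I could count on you",
    "so sad and so angry",
]
labels = [[1], [5, 7], [8], [1, 6]]

# One binary indicator column per emotion that occurs in the toy data.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# TF-IDF features + one linear SVC per emotion (one-vs-rest).
X = TfidfVectorizer().fit_transform(texts)
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)

pred = clf.predict(X)  # shape: (n_samples, n_emotions_present)
```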

Related publications:

Some preliminary and related work has also been discussed in the following papers:

License: Creative Commons Attribution 4.0 International License (CC-BY)