Overview

This dataset is an annotated variant of the Persuasive Opinion Multimedia (POM) corpus. It was developed for the opinion prediction task and includes opinion annotations at the expression and word levels. Expression-level annotations label the textual span of the opinion. Word-level annotations (e.g. holder, target, polarity) label the word components of the opinion. Further details can be found in (Garcia et al. 2019 (1)). As part of preprocessing, punctuation was removed from the text of the original corpus. The dataset is stored as a pickled pandas MultiIndex DataFrame.

The hierarchical index structure can be understood according to the tuple which forms the MultiIndex object. The first element of the index is one of the following values: features, labels, level_0, seq_level_labels_lvl1 or words.

Each row in words is indexed by the following tuple of values: (index_text, id_sentence, level_1) where index_text indexes the raw filename for each movie review, id_sentence indexes the sentences in the review, and level_1 indexes each word in each sentence. This same tuple indexes the rows of each of the following pieces of data in the dataframe.
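The index layout described above can be sketched on a toy frame. This is only an illustration under stated assumptions: the review names, the words, the interval values, the lower-level column names and the file name toy_pom.pkl are all invented, and the real DataFrame may arrange its column levels differently.

```python
import os
import tempfile

import pandas as pd

# Toy row index mirroring (index_text, id_sentence, level_1); values are invented.
rows = pd.MultiIndex.from_tuples(
    [
        ("review_A", 0, 0),
        ("review_A", 0, 1),
        ("review_A", 1, 0),
        ("review_B", 0, 0),
    ],
    names=["index_text", "id_sentence", "level_1"],
)

# Toy column index: the first element names the block of data.
# The second and third elements for "words" are placeholders, not the real names.
columns = pd.MultiIndex.from_tuples(
    [
        ("words", "word", 0),
        ("features", "intervals", 0),
        ("features", "intervals", 1),
    ]
)

toy = pd.DataFrame(
    [
        ["this", 0.0, 0.2],
        ["movie", 0.2, 0.6],
        ["great", 0.9, 1.3],
        ["boring", 0.0, 0.5],
    ],
    index=rows,
    columns=columns,
)

# Round-trip through a pickle file, the storage format named above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pom.pkl")  # hypothetical file name
    toy.to_pickle(path)
    df = pd.read_pickle(path)

# Top-level blocks present in the column index, in order of appearance.
blocks = list(df.columns.get_level_values(0).unique())

# All words of sentence 0 in review_A, in word order.
words = df[("words", "word", 0)]
sentence_words = words.loc[("review_A", 0)].tolist()

print(blocks)
print(sentence_words)
```

With the real file, the same pd.read_pickle call would take the path of the downloaded pickle, and inspecting df.index.names and df.columns is a quick way to confirm the actual level names before selecting anything.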

Features

Features consist of the tuple (features, [feature name], dimension), where the number of dimensions is the number of columns that make up a particular feature. This data originates from the original POM corpus (Park et al. 2014) but was re-aligned so that it could be incorporated into this dataset.

feature name	feature type	dimensions
feature_COAVAREP	audio	43
feature_FACET 4.1	video	43
feature_FACET 4.2	video	36
feature_glove_vectors	text	300
intervals	word start, word stop	2
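As a rough illustration of how a feature block might be selected, the sketch below builds a zero-filled toy frame whose column widths follow the table above. It assumes the third tuple element is a zero-based integer column counter, which the description does not state explicitly, and it uses three invented rows in place of real words.

```python
import numpy as np
import pandas as pd

# Column counts copied from the table above.
dims = {
    "feature_COAVAREP": 43,
    "feature_FACET 4.1": 43,
    "feature_FACET 4.2": 36,
    "feature_glove_vectors": 300,
    "intervals": 2,
}

# Assumption: the dimension level is a zero-based integer counter.
columns = pd.MultiIndex.from_tuples(
    [("features", name, d) for name, n in dims.items() for d in range(n)]
)

# Three placeholder rows standing in for three words.
toy = pd.DataFrame(np.zeros((3, len(columns))), columns=columns)

# A partial column key returns every dimension of one feature.
glove = toy[("features", "feature_glove_vectors")]

# Width of each feature block, recovered from the frame itself.
widths = {name: toy[("features", name)].shape[1] for name in dims}

print(glove.shape)
print(widths)
```

The same partial-key selection should apply to the labels block, with the label name in place of the feature name.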

Video labels

Video labels consist of the tuple (labels, [label name], dimension), where the number of dimensions is the number of columns that make up a particular label. This data also originates from the original POM corpus (Park et al. 2014) and was re-aligned.

label name	dimensions
label_video_personality	16
label_video_persuasion	1
label_video_sentiment	1

Opinion labels

Opinion labels consist of the tuple (seq_level_labels_lvl1, seq_level_labels_lvl2, [label]). The label field covers all holders, polarities, and targets in the dataset. Each label is boolean. The exception is the sentence-level 4_levels_polarity label, which can take the value '0' (no opinion), '1' (negative opinion), or '2' (positive opinion).

label	granularity
4_levels_polarity	sentence-level
Actor	expression-level
Atmosphere and mood	expression-level
Character design	expression-level
Composer - Singer - Soundmaker	expression-level
Director	expression-level
Music and Sound effects	expression-level
Negative	expression-level
Negative_levels	expression-level
Neutral	expression-level
Other	expression-level
Other people involved in movie making	expression-level
Overall	expression-level
Polarity	word-level
Positive	expression-level
Positive_levels	expression-level
Price	expression-level
Producer	expression-level
Screenplay	expression-level
Target	word-level
Token	word-level
Very_Negative	expression-level
Very_Positive	expression-level
Vision and Special effect	expression-level

There are two unique expression-level labels: Negative_levels and Positive_levels. Both are aggregate labels: Negative_levels takes the value '1' only if Negative or Very_Negative takes the value '1' at the expression level, and Positive_levels takes the value '1' only if Positive or Very_Positive does.
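The aggregation rule for these two labels amounts to a logical OR over the two underlying polarity labels. A minimal sketch with four invented rows (the flat column layout is a simplification and not the dataset's real index):

```python
import pandas as pd

# Invented word-aligned boolean labels for four words.
labels = pd.DataFrame(
    {
        "Negative": [False, True, False, False],
        "Very_Negative": [False, False, True, False],
        "Positive": [True, False, False, False],
        "Very_Positive": [False, False, False, False],
    }
)

# Aggregate labels: true when either underlying label is true.
labels["Negative_levels"] = labels["Negative"] | labels["Very_Negative"]
labels["Positive_levels"] = labels["Positive"] | labels["Very_Positive"]

print(labels[["Negative_levels", "Positive_levels"]])
```

A check of this kind can also be used to verify that the aggregate columns in a loaded copy of the dataset agree with their component labels.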

An example of a sentence from the dataset is:

This movie came out a few years ago and it is awesome

This sentence has a 4_levels_polarity of '2' because the sentence contains the positive expression "it is awesome". The target word is "it", so this word has a value of '1' for the label Target. Finally, "it is" refers to the overall film, so the words "it" and "is" both have values of '1' for the labels Very_Positive, Positive_levels, and Overall.

Considerations

Researchers should keep in mind that this dataset differs from the original POM dataset because of the following data processing steps:

Annotators did not take into account the video portion of the dataset during annotation. Only the transcripts of each review were considered.
While the original dataset contained punctuation (e.g. silent pauses), this dataset does not contain punctuation and only provides sentence segmentation. This could be of significant importance for those who want to use certain audio features from the CMU SDK -- such as pause (Park et al. 2014).
Because punctuation has been removed, the Levenshtein distance was used to re-match the annotated transcripts with the transcripts of the original dataset.
Finally, the annotated transcripts were re-integrated with the remaining features in the original POM dataset.
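To illustrate the re-matching step, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and picks the closest candidate transcript. It is not the code used to build the dataset: the helper closest_transcript, the review names and the example strings are all hypothetical, and the actual procedure may have matched at a different granularity.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute, each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def closest_transcript(annotated, candidates):
    """Return the key of the candidate transcript with the smallest edit distance."""
    return min(candidates, key=lambda name: levenshtein(annotated, candidates[name]))


# Hypothetical annotated text (no punctuation) and original transcripts.
annotated = "this movie came out a few years ago and it is awesome"
originals = {
    "review_A": "This movie came out a few years ago, and it is awesome.",
    "review_B": "I did not enjoy this film at all.",
}

best = closest_transcript(annotated, originals)
print(best)
```

Character-level distance tolerates the small differences introduced by case and punctuation changes, which is why it suits this kind of alignment; a word-level variant would work the same way on lists of tokens.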

Download Link

The dataset is available for download through registration at the following link:

http://service.tsi.telecom-paristech.fr/cgi-bin/user-service/subscribe.cgi?form=&license=1&ident=POM

If prompted to sign in, simply click 'Cancel' to navigate to the registration page.

FileZilla is the recommended FTP client. Please make sure to use the following configuration when connecting to the server.

[Two screenshots of the FileZilla connection configuration]

Acknowledgement

The documentation of this dataset and its issues, and code to parse the data were contributed by Tanvi Dinkar.

Contact Information

Please direct any questions or concerns regarding this dataset to Chloé Clavel (chloe.clavel@telecom-paris.fr) or Tanvi Dinkar (T.Dinkar@hw.ac.uk).

Citation information

@article{garcia2019multimodal,
  title={A multimodal movie review corpus for fine-grained opinion mining},
  author={Garcia, Alexandre and Essid, Slim and d'Alch{\'e}-Buc, Florence and Clavel, Chlo{\'e}},
  journal={arXiv preprint arXiv:1902.10102},
  year={2019}
}

@article{garcia2019token,
  title={From the token to the review: A hierarchical multimodal approach to opinion mining},
  author={Garcia, Alexandre and Colombo, Pierre and Essid, Slim and d'Alch{\'e}-Buc, Florence and Clavel, Chlo{\'e}},
  journal={arXiv preprint arXiv:1908.11216},
  year={2019}
}

@inproceedings{park2014computational,
  title={Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach},
  author={Park, Sunghyun and Shim, Han Suk and Chatterjee, Moitreya and Sagae, Kenji and Morency, Louis-Philippe},
  booktitle={Proceedings of the 16th International Conference on Multimodal Interaction},
  pages={50--57},
  year={2014}
}