Home

Awesome

Sirocco Opinion Extraction Framework

The Sirocco opinion extraction framework was first developed at Cuesense in early 2010s. Since then it was updated and maintained by Sergei Sokolenko @datancoffee. You can follow the news on Sirocco on Twitter and Medium.

Sirocco parses input text (e.g. news articles or threaded conversations on Twitter or Reddit) into subjects and opinions. The subject and the opinion, together with the author of the news article, form a triad that allows us to answer the question: “Who (author) thinks about what (the subject) in what way (the opinion)?”

The theoretical underpinnings of Sirocco are based on a framework of human emotions originally developed by Robert Plutchik, a professor at the Albert Einstein College of Medicine. Plutchik's Wheel of Emotions identifies 8 basic emotions: Joy, Acceptance, Fear, Surprise, Sadness, Disgust, Anger, Anticipation.

In our research and commercial application of Plutchik we were able to use signals for Anticipation (aka Interest) for lead generation, positive emotion of Joy to extract user testimonials, and negative emotions of Sadness, Anger, and Disgust to identify potential leads for objects that compete with the subjects of the emotion.

Sirocco relies on Apache OpenNLP to supply it with constituency-based parse trees of sentences.

Sirocco in Presentations, News and Customer Use-Cases

Sirocco was used by a major pharma company as a solution for automatically identifying high-quality consumer reviews and resurfacing them on a product page.

We collaborated with Kalev Leetaru from The GDELT Project / Georgetown University and published a 3-part blog series (Part 1, Part 2, and Part 3) on sentiment analysis in News using Sirocco.

We co-presented (video recording, slides) together with Reddit’s VP of Engineering Nick Caldwell at Cloud Next’18 on “Predicting user engagement at Reddit” and then developed Deep Learning prediction models for Reddit (blog, kaggle notebook).

Sirocco was featured at Strata NYC 2017:

Using Sirocco

You can use Sirocco either by running the Sirocco Indexer application locally on your machine or by embedding the Sirocco library in your backend application. It is usually a good idea to start the Sirocco evaluation by making it work locally and running the provided test cases. After reviewing the output results produced by the Indexer tool, and understanding how you can use Sirocco functionality in your system, you can add the Sirocco library to your backend app. While the Sirocco library is written in Java, there are frameworks now that will allow you using it in data pipelines written in Python, Go, and even SQL. The Apache Beam SDK is one of such tools, and the Google Cloud Dataflow service in the Google Cloud allows you operating Sirocco in a fully managed and serverless way. Review the Dataflow Opinion Analysis project for examples of how to embed Sirocco in ETL pipelines running in Cloud Dataflow.

How to run Sirocco on your local machine

The steps for configuring and running the Sirocco Indexer locally are as follows:

Installing Prerequisites

Install tools necessary for compiling and deploying the code in this sample, if not already on your system, specifically git, Java and Maven:

brew install git
brew install maven

Install Latest Opinion Model Files

The Sirocco Model Files repo contains all the recent releases of model files. You can also find the source files used for bulding the model jar. Download the latest sirocco-mo-x.y.z.jar on the release page of the repo.

mvn install:install-file \
  -DgroupId=sirocco.sirocco-mo \
  -DartifactId=sirocco-mo \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-mo-x.y.z.jar \
  -DgeneratePom=true

Clone Sirocco repo to local machine

To clone the GitHub repository to your computer, run the following command:

git clone https://github.com/datancoffee/sirocco

Go to the sirocco directory. The exact path depends on where you placed the directory when you cloned the sample files from GitHub.

cd sirocco

Building Sirocco

You need to have Java 8 and Maven 3 installed.

Build and install the Sirocco library

mvn clean install

Alternatively, run the package and install steps separately (don't forget to replace x.y.z with the active version of Sirocco)

mvn clean package
mvn install:install-file -Dfile=target/sirocco-sa-x.y.z.jar -DpomFile=pom.xml

The build process will create a shaded jar sirocco-sa-x.y.z.jar that contains all dependencies (including OpenNLP packages) in the target directory. This jar does not contain model files and these need to be downloaded separately (see one of the previous steps).

Running Included Test Datasets and Your Own Tests

This repo contains a set of blog posts and other text documents that can be used as inputs for verification of changes. These test datasets are located in the src/test/resources/testdatasets folder.

The test scripts that invoke the Sirocco Indexer are located in the src/test/scripts/ folder. The runindexer.sh script processes files with .txt extensions while the runindexercsv.sh script processes .csv files. Both scripts run the indexer through all files of their associated extensions in the src/test/resources/in folder (configurable in the script) and produce outputs in the src/test/resources/out folder (also configurable in the script). You can use these scripts to run the indexer on your own data as well (just make sure to put your files in the input directory specified in the script).

Preparing input and output folders and making scripts executable

To do a test run, copy the following test datasets into the input folder.

cp -r src/test/resources/testdatasets/articles-col1 src/test/resources/in
cp -r src/test/resources/testdatasets/kaggle-rotten-tomato src/test/resources/in

You also need to create output folders, and their names need to match the names of the input folders.

mkdir src/test/resources/out
mkdir src/test/resources/out/articles-col1
mkdir src/test/resources/out/kaggle-rotten-tomato

Lastly, make sure that the test scripts are executable

chmod +x src/test/scripts/*.sh
Processing bag-of-properties TXT files

For our first test run, let's process TXT files in the articles-col1 folder. The runindexer.sh script expects .txt files to have the following format.

Title=<Title>
Author=<Author> 
PubTime=<yyyy-mm-dd hh:mm:ss>
Url=<Url>
Language={EN | UN}

<Text of the article>

You can have multiple documents in a .txt file, but they need to be separated by a special separator, which by default is ASCII character 30 (RS). You can change the separator if you modify the script.

To run the test script, execute the following command in shell

src/test/scripts/runindexer.sh TOPSENTIMENTS SHALLOW ARTICLE

The results of Sirocco indexing can be reviewed in the src/test/resources/out/articles-col1 folder

ls src/test/resources/out/articles-col1

The runindexer.sh test script accepts 3 parameters - the Indexing Type, the Parsing Type and Content Type.

The Indexing Type parameter describes what the output of the indexing operation should be. The acceptable values are FULLINDEX and TOPSENTIMENTS. When TOPSENTIMENTS is specified, the Indexer will select the top 4 sentence chunks (a few sequential sentences in text that have the same sentiement valence) in input text and output them in the output file. When FULLINDEX is selected, all sentence chunks will be output.

The Parsing Type parameter determines what kind of sentence trees and what traversing strategy the Indexer will select when connecting Entities with Sentiments. The acceptable values are DEEP, SHALLOW, DEPENDENCY. DEEP and SHALLOW parsing types will cause Sirocco to use constituency-based trees and DEPENDENCY parsing type will lead to Sirrocco using dependency-based trees.

Note that while DEEP parsing provides the best quality associations of entities and sentiments, this parsing mode is still work-in-progress (as of Feb 2021). Currently, the SHALLOW parsing mode is the most tested mode. The DEPENDENCY parsing mode is also work-in-progress.

The Content Type provides hints to the Sirocco Indexer what kind of textual artifacts it is dealing with. The possible values are ARTICLE or SHORTTEXT. SHORTTEXT covers short messages on Twitter or SMS or one-line product reviews, usually less than 140 chars. ARTICLE represents pieces of text usually longer than 140 characters. The difference between the content types will be in the final steps of indexing. For SHORTTEXT, Sirocco will chunk all sentences it finds into one Chunk, while for ARTICLE it will chunk sentences into one or many Chunks depending on their Sentiment Valence (Orientation). If in doubt how long your text is, chose the ARTICLE content type.

./src/test/scripts/runindexer.sh FULLINDEX DEEP ARTICLE
Processing CSV files

For our second test run, let's process CSV files in the kaggle-rotten-tomato folder.

If you have .csv files with multiple documents (text pieces) per file, you can process them with the runindexercsv.sh script. The CSV file can have an unlimited number of columns, but the indexer will be interested in two specific ones: the column that contains the external ID of the document as well as the column that contains the text.

Here is an example of a CSV file with two documents, each in its own CSV record.

SentenceId,PhraseId,Phrase,Sentiment
6,167,A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .,4
179,4684,"Beautifully crafted , engaging filmmaking that should attract upscale audiences hungry for quality and a nostalgic , twisty yarn that will keep them guessing .",4

Because of this, when processing CSV files you have to specify 5 parameters: Indexing Type, Parsing Type, Content Type, Item Id Column Index, and Text Column Index. Here is an example that does Shallow parsing of the above CSV file, and sets the Item ID column Index to 1 (meaning, it's the second column in the file, labelled PhraseId), and specifies the Text column Index as 2 (meaning, as the third column in the file, labelled Phrase). Because most text pieces in this dataset are one-line or one-sentence movie reviews, we will use the SHORTTEXT content type parameter. Note that if you are dealing with anything but tweets or 1-2 sentence paragraphs, it is better to use the ARTICLE content type when doing SHALLOW parsing. When doiong DEEP parsing, getting content type right is less important.

./src/test/scripts/runindexercsv.sh FULLINDEX SHALLOW SHORTTEXT 1 2

The results of Sirocco indexing can be reviewed in the src/test/resources/out/kaggle-rotten-tomato folder

ls src/test/resources/out/kaggle-rotten-tomato

Note the following when processing CSV files:

How to Incorporate Sirocco in Your Project

After running the Sirocco Indexer on included test datasets and your own data and verifying that its output is useful for you, you might decide that you want to embed Sirocco in your own application.

A good example of the first type of Sirocco usage is the Dataflow Opinion Analysis project. But even if you are building an event-driven app, the Dataflow Opinion Analysis project can provide good pointers to the API and its usage.

Including the Sirocco library and Sirocco Model Files into your binary

Install the Sirocco library sirocco-sa-x.y.z.jar and Sirocco Model files sirocco-mo-x.y.z.jar

Add the dependency to your project's pom.xml file.

<dependency>
  <groupId>sirocco.sirocco-sa</groupId>
  <artifactId>sirocco-sa</artifactId>
  <version>[1.0.0,2.0.0)</version>
</dependency>

You will also need to add a dependency to the model files

<dependency>
  <groupId>sirocco.sirocco-mo</groupId>
  <artifactId>sirocco-mo</artifactId>
  <version>[1.0.0,2.0.0)</version>
</dependency>

Roadmap

We are actively working on improving the quality of the Sirocco models as well as extending its availability on NLP frameworks. We are looking for contributors for helping us to accomplish this mission. See below for contact info.

Additional Publications and Documentation

July 2017 - Life of a Cloud Dataflow service-based shuffle

May 2017 - Designing ETL architecture for a cloud-native data warehouse on Google Cloud Platform

May 2017 - Opinion Analysis of Text using Plutchik

April 2017 - Selecting a Java WordNet API for lemma lookups

April 2017 - Modernizing Sirocco from C# and SharpNLP to Java and Apache OpenNLP

Useful Links

Want to Help or Get in Touch?