Sample: Opinion Analysis of News, Threaded Conversations, and User Generated Content

This sample uses Cloud Dataflow to build an opinion analysis processing pipeline for news, threaded conversations in forums such as Hacker News, Reddit, or Twitter, and other user-generated content such as email.

Opinion Analysis can be used for lead generation purposes, user research, or automated testimonial harvesting.

About the sample

This sample contains three types of artifacts:

Major Changes in current and past Releases

Version 0.7

How to run the sample

The steps for configuring and running this sample are as follows:

Prerequisites

Set up your Google Cloud Platform project and permissions

Install the tools needed to compile and deploy the code in this sample (git, Java, and Maven) if they are not already on your system:

brew install git
brew install maven
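
Before moving on, it can help to confirm that the tools are on your PATH. The following is a minimal sketch and not part of the sample; `missing_tools` is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the sample): prints each named tool
# that cannot be found on the PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the tools this sample needs for compiling and deploying.
missing=$(missing_tools git java mvn)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "git, java and mvn are all available"
fi
```

On macOS a `java` launcher may be present even when no JDK is installed, so running `java -version` is the more reliable check for Java.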

Install the Google Cloud SDK

Create and set up a Cloud Storage bucket and Cloud Pub/Sub topics

(Optional) Create or verify a configuration for your project

By now you have probably created a configuration, for example when you initialized the Google Cloud SDK. If you would rather use a separate configuration for this sample, you can create a new one now.

Verify your configuration

Important: This tutorial uses several billable components of Google Cloud Platform. New Cloud Platform users may be eligible for a free trial.

Clone the sample code

Go to the directory where you typically store your git repos.

To clone the GitHub repository to your computer, run the following command:

git clone https://github.com/GoogleCloudPlatform/dataflow-opinion-analysis

Go to the dataflow-opinion-analysis directory. The exact path depends on where you placed the directory when you cloned the sample files from GitHub.

cd dataflow-opinion-analysis

Activate gcloud configuration and set environment variables

Do this step every time you open a new shell, before creating the BigQuery dataset and before running your demo Dataflow jobs.

gcloud config configurations list

gcloud config configurations activate <config-name>
cd scripts
cp set_env_vars_template.sh set_env_vars_local.sh
chmod +x *.sh

Don't miss the dot at the beginning of this command!

. ./set_env_vars_local.sh
cd ..
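
The leading dot matters because it sources the script into your current shell instead of running it in a child process. The sketch below illustrates the difference in isolation; `demo_env.sh` and `DEMO_VAR` are hypothetical names used only for this demonstration and are not part of the sample.

```shell
#!/bin/sh
# Self-contained demo; demo_env.sh and DEMO_VAR are hypothetical names
# used only for this illustration.
unset DEMO_VAR
tmpdir=$(mktemp -d)
cat > "$tmpdir/demo_env.sh" <<'EOF'
export DEMO_VAR=from_script
EOF

# Executing the script starts a child process; its exports vanish on exit.
sh "$tmpdir/demo_env.sh"
after_exec=${DEMO_VAR:-unset}
echo "after executing: $after_exec"

# Sourcing with the leading dot runs the script in the current shell,
# so the exported variable stays available afterwards.
. "$tmpdir/demo_env.sh"
after_source=${DEMO_VAR:-unset}
echo "after sourcing: $after_source"

rm -rf "$tmpdir"
```

The same reasoning applies to `set_env_vars_local.sh`: without the dot, the variables it sets would not be visible to the commands you run afterwards.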

Create the BigQuery dataset

Table schema definitions are located in the *Schema.json files in the bigquery directory. View definitions are located in the shell script build_views.sh.

Prepare your machine for Dataflow job submissions

Download and install Sirocco, a framework maintained by @datancoffee.

mvn install:install-file \
  -DgroupId=sirocco.sirocco-sa \
  -DartifactId=sirocco-sa \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-sa-x.y.z.jar \
  -DgeneratePom=true
mvn install:install-file \
  -DgroupId=sirocco.sirocco-mo \
  -DartifactId=sirocco-mo \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-mo-x.y.z.jar \
  -DgeneratePom=true
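
To check that both jars landed in your local Maven repository, a sketch like the following may help. It assumes the default repository location `~/.m2/repository` (your Maven settings may override it); `m2_jar_path` and `SIROCCO_VERSION` are hypothetical names introduced here, and `x.y.z` remains a placeholder for the version you downloaded.

```shell
#!/bin/sh
# Hypothetical helper: prints where a jar is expected in the default local
# Maven repository (~/.m2/repository) for the given coordinates.
m2_jar_path() {
  group_id=$1; artifact_id=$2; version=$3
  # Maven turns the dots in the groupId into directory separators.
  group_dir=$(printf '%s' "$group_id" | tr '.' '/')
  printf '%s\n' "$HOME/.m2/repository/$group_dir/$artifact_id/$version/$artifact_id-$version.jar"
}

# SIROCCO_VERSION is a placeholder; set it to the version you downloaded.
SIROCCO_VERSION=${SIROCCO_VERSION:-x.y.z}
for name in sirocco-sa sirocco-mo; do
  jar=$(m2_jar_path "sirocco.$name" "$name" "$SIROCCO_VERSION")
  if [ -f "$jar" ]; then
    echo "installed: $jar"
  else
    echo "not found: $jar"
  fi
done
```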

Run demo jobs

You can use the included news articles (from Google's blogs) and movie reviews in the src/test/resources/testdatasets directory to run demo jobs. News articles are in TXT bag-of-properties format and movie reviews are in CSV format. More information about the formats and the meaning of the parameters is available in the Sirocco repo.

We will run a demo job that processes movie reviews in CSV format.

cd dataflow-opinion-analysis

mvn clean package
scripts/run_indexer_gcs_csv_to_bigquery.sh FULLINDEX SHALLOW SHORTTEXT 1 2 "gs://$GCS_BUCKET/input/kaggle-rotten-tomato/*.csv"

The workflow was automatically rejected by the service because it uses an unsupported SDK Google Cloud Dataflow SDK for Java 2.2.0. Please upgrade to the latest SDK version. To override the SDK version check temporarily, please provide an override token using the experiment flag '--experiments=unsupported_sdk_temporary_override_token=<token>'. Note that this token expires on <date>.

This is because we are still working on upgrading our Beam dependencies to newer versions of Beam. To fix this error, modify your scripts/set_env_vars_local.sh script to set UNSUPPORTED_SDK_OVERRIDE_TOKEN to the token that was returned in the error message.
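
If you saved the error text, a small helper can pull the token out of it so you can paste it into the script. This is only a sketch: `extract_override_token` is a hypothetical name, the sample value `abc123` is made up, and the assumption that the token ends at a quote or whitespace may not hold for every message.

```shell
#!/bin/sh
# Hypothetical helper: reads an error message on stdin and prints the value
# that follows "unsupported_sdk_temporary_override_token=".
# Assumes the token ends at a single quote or at whitespace.
extract_override_token() {
  sed -n "s/.*unsupported_sdk_temporary_override_token=\([^'[:space:]]*\).*/\1/p"
}

# Example with a made-up token value (abc123 is not a real token).
sample="provide an override token using the experiment flag '--experiments=unsupported_sdk_temporary_override_token=abc123'. Note that"
token=$(printf '%s\n' "$sample" | extract_override_token)
echo "UNSUPPORTED_SDK_OVERRIDE_TOKEN=$token"
```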

Set the shell variables again.

. scripts/set_env_vars_local.sh

Resubmit the job.

#standardSQL
SELECT d.CollectionItemId, s.* 
FROM opinions.sentiment s
    INNER JOIN opinions.document d ON d.DocumentHash = s.DocumentHash
WHERE SentimentTotalScore > 0
ORDER BY ProcessingDateId DESC, SentimentTotalScore DESC
LIMIT 1000;

Issues Under Investigation

DELETE FROM opinions.document WHERE 1=1;
DELETE FROM opinions.sentiment WHERE 1=1;
DELETE FROM opinions.webresource WHERE 1=1;

This is because we are using an older version of the Beam SDK, which in turn uses an older version of snappy-java. Snappy-java version 1.1.8.2 is supposed to work on Apple M1 chips, and we will fix the problem when we upgrade to newer versions of Beam. For the time being, build the project and submit jobs on pre-M1 Mac hardware.

If you are seeing pipeline failures, check whether the pipeline logs contain the following errors:

java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.io.IOException: Error executing batch GCS request
...
Caused by: java.util.concurrent.ExecutionException: com.google.api.client.http.HttpResponseException: 404 Not Found

<!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 404 (Not Found)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/logos/errorpage/error_logo-150x54.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/logos/errorpage/error_logo-150x54-2x.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/logos/errorpage/error_logo-150x54-2x.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/logos/errorpage/error_logo-150x54-2x.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>404.</b> <ins>That’s an error.</ins>
  <p>  <ins>That’s all we know.</ins> 

Clean up

Now that you have tested the sample, delete the cloud resources you created to prevent further billing for them on your account.

License

Copyright 2021 Google Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.