Phrase-At-Scale

Phrase-At-Scale provides a fast and easy way to discover phrases from large text corpora using PySpark. Here's an example of phrases extracted from a review dataset:

<div align="center" width="100%"> <img src="phrase-at-scale.png" width="50%"> </div>

Features

Quick Start

Run locally

To re-run phrase discovery using the default dataset:

  1. Install Spark

  2. Clone this repo and move into its top-level directory.

    git clone git@github.com:kavgan/phrase-at-scale.git
    
  3. Run the spark job:

    <your_path_to_spark>/bin/spark-submit --master local[200] --driver-memory 4G phrase_generator.py 
    

This will use settings (including input data files) as specified in config.py.

  4. You should be able to monitor the progress of your job at http://localhost:4040/

Notes:

Configuration

To change configuration, just edit the config.py file.

| Config | Description |
|---|---|
| input_file | Path to your input data files. This can be a single file or a folder of files. The default assumption is one text document (of any size) per line: one sentence per line, one paragraph per line, and so on. |
| output-folder | Path where the annotated corpora are written. Can be a local path or a path on HDFS. |
| phrase-file | Path to the file that will hold the list of discovered phrases. |
| stop-file | Stop-words file used to mark phrase boundaries. |
| min-phrase-count | Minimum number of occurrences for a phrase to be kept. Guidelines: use 50 for < 300 MB of text, 100 for < 2 GB, and larger values for much larger datasets. |
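
To make the roles of stop-file and min-phrase-count more concrete, here is a minimal pure-Python sketch of boundary-based phrase discovery. It is an illustration only, not the code in phrase_generator.py: the function names, the `max_words` limit, and the choice to treat punctuation as a boundary are all assumptions made for this example.

```python
import re
from collections import Counter


def candidate_phrases(line, stop_words, max_words=4):
    """Yield multi-word candidates that never cross a stop word or punctuation.

    `max_words` is a hypothetical limit chosen for this sketch.
    """
    # Assumption: punctuation also ends a phrase, so split into chunks first.
    for chunk in re.split(r"[^\w\s'-]+", line.lower()):
        run = []
        for token in chunk.split() + [None]:  # None flushes the final run
            if token is not None and token not in stop_words:
                run.append(token)
                continue
            # Boundary reached: emit every contiguous n-gram (2..max_words).
            for n in range(2, max_words + 1):
                for i in range(len(run) - n + 1):
                    yield " ".join(run[i:i + n])
            run = []


def discover_phrases(lines, stop_words, min_phrase_count):
    """Count candidates over all documents and keep the frequent ones."""
    counts = Counter()
    for line in lines:  # one text document per line, as in input_file
        counts.update(candidate_phrases(line, stop_words))
    return {p: c for p, c in counts.items() if c >= min_phrase_count}


if __name__ == "__main__":
    docs = [
        "The front desk was great and the room service was slow.",
        "Front desk staff were rude, but room service was quick.",
        "I liked the front desk.",
    ]
    stops = {"the", "was", "and", "were", "but", "i", "a"}
    print(discover_phrases(docs, stops, min_phrase_count=2))
```

A low threshold such as 2 only makes sense for a toy input like this one; on a real corpus the guideline values in the table above filter out most of the noisy candidates.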

Dataset

The default configuration uses a subset of the OpinRank dataset, consisting of about 255,000 hotel reviews. You can use the following to cite the dataset:

@article{ganesan2012opinion,
  title={Opinion-based entity ranking},
  author={Ganesan, Kavita and Zhai, ChengXiang},
  journal={Information retrieval},
  volume={15},
  number={2},
  pages={116--150},
  year={2012},
  publisher={Springer} 
}

Contact

This repository is maintained by Kavita Ganesan. Please send me an e-mail or open a GitHub issue if you have questions.