Spectral LDA on Spark
Summary
This code implements a spectral (third-order tensor decomposition) learning method for the LDA topic model on Spark.
Latent Dirichlet Allocation
Given the Dirichlet prior alpha over topic assignments and the V-by-k topic-word distribution matrix beta, the bag of words of every document is generated by first drawing topic proportions theta from the prior alpha, then, for each word, drawing a topic j from theta and drawing the word itself from the Multinomial distribution beta(:, j).
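A minimal sketch of this generative process (hypothetical standalone code using breeze's Dirichlet and Multinomial distributions, not part of this package's API; RandBasis.mt0 supplies the random source):

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.stats.distributions.{Dirichlet, Multinomial, RandBasis}

implicit val randBasis: RandBasis = RandBasis.mt0

// Hypothetical illustration: generate the word ids of one document
// alpha: length-k Dirichlet prior, beta: V-by-k topic-word matrix
def generateDocument(alpha: DenseVector[Double],
                     beta: DenseMatrix[Double],
                     numWords: Int): IndexedSeq[Int] = {
  // Per-document topic proportions drawn from the Dirichlet prior
  val theta = Dirichlet(alpha).draw()
  (1 to numWords).map { _ =>
    val j = Multinomial(theta).draw()  // topic for this word
    Multinomial(beta(::, j)).draw()    // word id drawn from column beta(:, j)
  }
}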
Spectral Learning Algorithm
We first count the word pairs and word triplets that co-occur within each document. From these empirical counts we obtain moment conditions on alpha and beta. We finally perform a tensor decomposition of these moments to recover alpha and beta. Please refer to report.pdf in this repo for more details.
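For reference, the low-order moment conditions take the following form (a sketch in the notation of the white paper referenced below; see report.pdf for the exact statements). With $\alpha_0 = \sum_{i=1}^{k} \alpha_i$ and $x_1, x_2$ the one-hot encodings of two distinct words in a document,

$$M_1 = \mathbb{E}[x_1] = \sum_{i=1}^{k} \frac{\alpha_i}{\alpha_0}\,\beta_i$$

$$M_2 = \mathbb{E}[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0 + 1}\,M_1 \otimes M_1 = \sum_{i=1}^{k} \frac{\alpha_i}{(\alpha_0 + 1)\,\alpha_0}\,\beta_i \otimes \beta_i$$

An analogous third-order moment decomposes as $\sum_{i=1}^{k} \frac{2\alpha_i}{(\alpha_0 + 2)(\alpha_0 + 1)\,\alpha_0}\,\beta_i^{\otimes 3}$; whitening by the rescaled M2 and decomposing this tensor recovers alpha and beta.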
How do I get set up?
Build and Publish the Package
The code is written for Java 8, Scala 2.11, and Spark 2.3.0+. We use the sbt build system. Download the latest version of sbt and run
sbt package test
sbt publishLocal
which will produce target/scala-2.11/spectrallda-tensor_2.11-<version>.jar. The version number is defined in build.sbt.
Either invoke the Spark shell with the option --packages megadata:spectrallda-tensor_2.11:<version>, or add the following lines to the pom.xml of your new project.
<dependencies>
  <dependency>
    <groupId>megadata</groupId>
    <artifactId>spectrallda-tensor_2.11</artifactId>
    <version>x.xx.xx</version>
  </dependency>
</dependencies>
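For example, the shell invocation looks like

spark-shell --packages megadata:spectrallda-tensor_2.11:<version>

If your project builds with sbt rather than Maven, the equivalent dependency declaration (assuming the same coordinates published by sbt publishLocal) would be

libraryDependencies += "megadata" %% "spectrallda-tensor" % "x.xx.xx"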
API Usage
The API is designed along the lines of Spark's built-in LDA class.
import breeze.linalg.{DenseMatrix, DenseVector}
import megadata.spectralLDA.algorithm.TensorLDA

val lda = new TensorLDA(
  dimK = params.k,
  alpha0 = params.topicConcentration
)

// Denote by V the vocabulary size and by k the number of topics
// beta: V-by-k topic-word distribution matrix, with each column
//   the word distribution for one topic
// alpha: length-k Dirichlet prior parameter for the topic distribution
// eigvecM2: V-by-k matrix of the top k eigenvectors of the rescaled M2
// eigvalM2: length-k vector of the top k eigenvalues of the rescaled M2
// m1: length-V vector for the average word distribution
val (beta: DenseMatrix[Double], alpha: DenseVector[Double],
     eigvecM2: DenseMatrix[Double], eigvalM2: DenseVector[Double],
     m1: DenseVector[Double]) = lda.fit(documents)
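Here documents holds the corpus as word-count vectors in Breeze format; in the example below it is produced by Datasets.bowFeaturesToBreeze from a bag-of-words RDD.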
Example Results
We give example results on the UCI Bag of Words NYTimes dataset. Suppose we placed the docword and vocabulary files under /home/ubuntu/data. For efficient parallel reading of the data in Spark, decompress all the files first. Within the sbt console,
import megadata.spectralLDA.utils.Datasets
import megadata.spectralLDA.algorithm.TensorLDA

// Take the top 20,000 words as the vocabulary
val (nytimesBow, nytimesVocab) = Datasets.readUciBagOfWords(sc,
  "/home/ubuntu/data/docword.nytimes.txt",
  "/home/ubuntu/data/vocab.nytimes.txt", 20000)

// Run Spectral LDA for k = alpha0 = 10
// alpha0 is the sum of the Dirichlet parameters alpha_i: setting
// alpha0 = k gives the non-informative prior with alpha_i = 1.0 for every i
val (beta, alpha, _, eigv, _) =
  new TensorLDA(10, 10).fit(Datasets.bowFeaturesToBreeze(nytimesBow, 20000))

// Print the top 20 words for every topic
val nytIdToWordMap = nytimesVocab.zipWithIndex.map { case (x, y) => (y.toInt, x) }.toMap
TensorLDA.describeTopicsInWords(beta, nytIdToWordMap, 20)
On a 16-core instance the run takes 1-2 minutes to finish. Below are the results:
Topic #0: court, case, law, official, lawyer, government, federal, police, trial, decision, death, judge, attorney, drug, prosecutor, group, officer, cases, right, legal
Topic #1: company, companies, stock, million, percent, market, business, billion, analyst, firm, executive, customer, deal, chief, investor, zzz_enron, quarter, sales, industry, share
Topic #2: palestinian, official, zzz_israel, attack, zzz_bush, zzz_israeli, military, government, zzz_united_states, leader, zzz_u_s, zzz_yasser_arafat, war, peace, terrorist, security, israeli, zzz_afghanistan, zzz_taliban, zzz_american
Topic #3: zzz_al_gore, zzz_george_bush, campaign, zzz_bush, president, election, republican, vote, voter, political, democratic, presidential, zzz_republican, zzz_clinton, zzz_white_house, democrat, tax, zzz_senate, votes, ballot
Topic #4: game, team, point, play, season, games, player, shot, coach, goal, yard, win, played, lead, zzz_laker, half, minutes, left, playing, final
Topic #5: school, student, teacher, program, children, parent, education, high, women, percent, public, district, college, test, kid, job, class, child, home, boy
Topic #6: percent, stock, market, quarter, point, economy, rate, sales, billion, analyst, fund, growth, rates, companies, earning, women, prices, investor, average, economic
Topic #7: run, inning, hit, game, season, home, games, right, zzz_dodger, left, ball, start, pitch, yankees, pitcher, homer, manager, ranger, team, field
Topic #8: com, information, www, web, zzz_eastern, sport, question, daily, newspaper, commentary, site, separate, business, marked, today, holiday, need, spot, eta, reach
Topic #9: million, shares, offering, debt, public, money, billion, initial, percent, bond, revenue, contract, expected, tax, bill, deal, securities, quarter, cost, share
References
- White Paper: http://newport.eecs.uci.edu/anandkumar/pubs/whitepaper.pdf
- New York Times Result Visualization: http://newport.eecs.uci.edu/anandkumar/Lab/Lab_sub/NewYorkTimes3.html
Who do I talk to?
- Repo owner or admin: Furong Huang
- Contact: furongh.uci@gmail.com