Spectral LDA on Spark
Summary
This code implements a spectral (third-order tensor decomposition) learning method for the LDA topic model on Spark.
Latent Dirichlet Allocation
Given the Dirichlet prior alpha over topic assignments and the V-by-k topic-word distribution matrix beta, the bag of words of every document is generated by first drawing topic proportions theta from the prior alpha, then, for each word, drawing a topic j from theta and drawing the word itself from the Multinomial distribution beta(:, j).
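A minimal sketch of this generative process (hypothetical standalone code using breeze's Dirichlet and Multinomial distributions, not part of this package's API; RandBasis.mt0 supplies the random source):

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.stats.distributions.{Dirichlet, Multinomial, RandBasis}

implicit val randBasis: RandBasis = RandBasis.mt0

// Hypothetical illustration: generate the word ids of one document
// alpha: length-k Dirichlet prior, beta: V-by-k topic-word matrix
def generateDocument(alpha: DenseVector[Double],
                     beta: DenseMatrix[Double],
                     numWords: Int): IndexedSeq[Int] = {
  // Per-document topic proportions drawn from the Dirichlet prior
  val theta = Dirichlet(alpha).draw()
  (1 to numWords).map { _ =>
    val j = Multinomial(theta).draw()  // topic for this word
    Multinomial(beta(::, j)).draw()    // word id drawn from column beta(:, j)
  }
}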
Spectral Learning Algorithm
We first count the word pairs and word triplets that co-occur within each document. From these empirical counts we obtain moment conditions on alpha and beta. We finally perform a tensor decomposition of these moments to recover alpha and beta. Please refer to report.pdf in this repo for more details.
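For reference, the low-order moment conditions take the following form (a sketch in the notation of the white paper referenced below; see report.pdf for the exact statements). With $\alpha_0 = \sum_{i=1}^{k} \alpha_i$ and $x_1, x_2$ the one-hot encodings of two distinct words in a document,

$$M_1 = \mathbb{E}[x_1] = \sum_{i=1}^{k} \frac{\alpha_i}{\alpha_0}\,\beta_i$$

$$M_2 = \mathbb{E}[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0 + 1}\,M_1 \otimes M_1 = \sum_{i=1}^{k} \frac{\alpha_i}{(\alpha_0 + 1)\,\alpha_0}\,\beta_i \otimes \beta_i$$

An analogous third-order moment decomposes as $\sum_{i=1}^{k} \frac{2\alpha_i}{(\alpha_0 + 2)(\alpha_0 + 1)\,\alpha_0}\,\beta_i^{\otimes 3}$; whitening by the rescaled M2 and decomposing this tensor recovers alpha and beta.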
How do I get set up?
Build and Publish the Package
The code is written for Java 8, Scala 2.11, and Spark 2.3.0+. We use the sbt build system. Download the latest version of sbt and run
sbt package test
sbt publishLocal
which will produce target/scala-2.11/spectrallda-tensor_2.11-<version>.jar. The version number is defined in build.sbt.
Either invoke the Spark shell with the option --packages megadata:spectrallda-tensor_2.11:<version>, or add the following lines to the pom.xml of your new project.
<dependencies>
  <dependency>
    <groupId>megadata</groupId>
    <artifactId>spectrallda-tensor_2.11</artifactId>
    <version>x.xx.xx</version>
  </dependency>
</dependencies>
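For example, the shell invocation looks like

spark-shell --packages megadata:spectrallda-tensor_2.11:<version>

If your project builds with sbt rather than Maven, the equivalent dependency declaration (assuming the same coordinates published by sbt publishLocal) would be

libraryDependencies += "megadata" %% "spectrallda-tensor" % "x.xx.xx"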
API Usage
The API is designed along the lines of Spark's built-in LDA class.
import breeze.linalg.{DenseMatrix, DenseVector}
import megadata.spectralLDA.algorithm.TensorLDA

val lda = new TensorLDA(
  dimK = params.k,
  alpha0 = params.topicConcentration
)

// Denote by V the vocabulary size and by k the number of topics
// beta: V-by-k topic-word distribution matrix, with each column
//   the word distribution for one topic
// alpha: length-k Dirichlet prior parameter for the topic distribution
// eigvecM2: V-by-k matrix of the top k eigenvectors of the rescaled M2
// eigvalM2: length-k vector of the top k eigenvalues of the rescaled M2
// m1: length-V vector for the average word distribution
val (beta: DenseMatrix[Double], alpha: DenseVector[Double],
     eigvecM2: DenseMatrix[Double], eigvalM2: DenseVector[Double],
     m1: DenseVector[Double]) = lda.fit(documents)
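Here documents holds the corpus as word-count vectors in Breeze format; in the example below it is produced by Datasets.bowFeaturesToBreeze from a bag-of-words RDD.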
Example Results
We give example results on the UCI Bag of Words NYTimes dataset. Suppose we placed the docword and vocabulary files under /home/ubuntu/data. For efficient parallel reading of the data in Spark, decompress all the files first. Within the sbt console,
import megadata.spectralLDA.utils.Datasets
import megadata.spectralLDA.algorithm.TensorLDA

// Take the top 20,000 words as the vocabulary
val (nytimesBow, nytimesVocab) = Datasets.readUciBagOfWords(sc,
  "/home/ubuntu/data/docword.nytimes.txt",
  "/home/ubuntu/data/vocab.nytimes.txt", 20000)

// Run Spectral LDA for k = alpha0 = 10
// alpha0 is the sum of the Dirichlet parameters alpha_i: setting
// alpha0 = k gives the non-informative prior with alpha_i = 1.0 for every i
val (beta, alpha, _, eigv, _) =
  new TensorLDA(10, 10).fit(Datasets.bowFeaturesToBreeze(nytimesBow, 20000))

// Print the top 20 words for every topic
val nytIdToWordMap = nytimesVocab.zipWithIndex.map { case (x, y) => (y.toInt, x) }.toMap
TensorLDA.describeTopicsInWords(beta, nytIdToWordMap, 20)
On a 16-core instance the run takes 1-2 minutes to finish. Below are the results:
Topic #0: court, case, law, official, lawyer, government, federal, police, trial, decision, death, judge, attorney, drug, prosecutor, group, officer, cases, right, legal
Topic #1: company, companies, stock, million, percent, market, business, billion, analyst, firm, executive, customer, deal, chief, investor, zzz_enron, quarter, sales, industry, share
Topic #2: palestinian, official, zzz_israel, attack, zzz_bush, zzz_israeli, military, government, zzz_united_states, leader, zzz_u_s, zzz_yasser_arafat, war, peace, terrorist, security, israeli, zzz_afghanistan, zzz_taliban, zzz_american
Topic #3: zzz_al_gore, zzz_george_bush, campaign, zzz_bush, president, election, republican, vote, voter, political, democratic, presidential, zzz_republican, zzz_clinton, zzz_white_house, democrat, tax, zzz_senate, votes, ballot
Topic #4: game, team, point, play, season, games, player, shot, coach, goal, yard, win, played, lead, zzz_laker, half, minutes, left, playing, final
Topic #5: school, student, teacher, program, children, parent, education, high, women, percent, public, district, college, test, kid, job, class, child, home, boy
Topic #6: percent, stock, market, quarter, point, economy, rate, sales, billion, analyst, fund, growth, rates, companies, earning, women, prices, investor, average, economic
Topic #7: run, inning, hit, game, season, home, games, right, zzz_dodger, left, ball, start, pitch, yankees, pitcher, homer, manager, ranger, team, field
Topic #8: com, information, www, web, zzz_eastern, sport, question, daily, newspaper, commentary, site, separate, business, marked, today, holiday, need, spot, eta, reach
Topic #9: million, shares, offering, debt, public, money, billion, initial, percent, bond, revenue, contract, expected, tax, bill, deal, securities, quarter, cost, share
References
- White Paper: http://newport.eecs.uci.edu/anandkumar/pubs/whitepaper.pdf
- New York Times Result Visualization: http://newport.eecs.uci.edu/anandkumar/Lab/Lab_sub/NewYorkTimes3.html
Who do I talk to?
- Repo owner or admin: Furong Huang
- Contact: furongh.uci@gmail.com