Awesome
Geometric-Topic-Modeling
This is a Python 2 implementation of Geometric Dirichlet Means algorithm for topic inference (M. Yurochkin, X. Nguyen NIPS 2016) and Conic Scan-and-Cover algorithms for nonparametric topic modeling (M. Yurochkin, A. Guha, X. Nguyen NIPS 2017). Code written by Mikhail Yurochkin.
Overview
This is a simple demonstration of GDM, CoSAC and Gibbs sampler (from lda package) on simulated data. More extensive guide is in preparation.
all_func.py Implements data simulation according to LDA model, GDM algorithm and projection estimate of topic proportions $\theta$
geom_tm.py Implements CoSAC algorithm for sparse document-term matrix and wraps it as scikit-learn class
tester_CoSAC.py contains a simulated example
Implementation is designed to be used in the interactive mode (e.g. Python IDE like Spyder).
Usage guide for GDM algorithm
gdm(wdfn, K, ncores=-1)
wdfn: $M \times V$ matrix of normalized document-term counts
K: number of topics to fit
ncores: CPUs to use for k-means
Returns: topic estimates
Usage guide for CoSAC algorithm
geom_tm(delta=0.4, prop_discard=0.5, prop_n=0.01, verbose=False)
Parameters:
delta: cosine cone radius $\omega$
prop_discard: quantile to compute $\mathcal{R}$
prop_n: proportion of data to be used as outlier threshold $\lambda$
verbose: if True, plots as in Figure 2 will be printed
Methods:
fit_a(data, cent)
data: sparse $M \times V$ matrix of normalized document-term counts
cent: data mean $\hat C_p$
Returns: a_betas_: topic estimates from Algorithm 2 without spherical k-means step K_: estimated number of topics
fit_sph(data, cent, init=None, it=10)
data: sparse $M \times V$ matrix of normalized document-term counts
cent: data mean $\hat C_p$
init, it: if None and fit_a was run, will complete Algorithm 2 with \emph{it} spherical k-means iterations
Returns: sph_betas_: updated topics sph_clust_: cluster assignments
fit_all(data, cent, it=5)
Full run of Algorithm 2 with \emph{it} spherical k-means post processing iterations