anchor-topic
This package implements anchor-based topic modeling and variants of the anchoring algorithm in Python 3.
If you use this package for academic research, please cite the relevant papers.
Installation
Install the package from a terminal with this command:
pip install anchor-topic
The dependencies (NumPy, SciPy, Numba) are installed automatically.
Models
To build a topic model, include this import statement:
import anchor_topic.topics
Preprocessing
The anchoring algorithm takes a word-document matrix M as input, where M(i,j) is the frequency of word i in document j. As with other topic models, the corpus should be preprocessed to improve the quality of the model. The word-document matrix M should be of type scipy.sparse.csc_matrix.
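As an illustration of the input format described above, the following minimal sketch builds M from already-tokenized documents. The helper name build_word_document_matrix, the toy documents, and the choice of a float dtype are assumptions made for this example and are not part of the package.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csc_matrix


def build_word_document_matrix(docs):
    """Return (M, vocab), where M[i, j] is the count of word i in document j.

    Hypothetical helper for illustration; not provided by anchor-topic.
    """
    # Fix a stable word order so row i always refers to vocab[i].
    vocab = sorted({word for doc in docs for word in doc})
    index = {word: i for i, word in enumerate(vocab)}

    rows, cols, vals = [], [], []
    for j, doc in enumerate(docs):
        for word, count in Counter(doc).items():
            rows.append(index[word])
            cols.append(j)
            vals.append(count)

    # dtype is an assumption; the README only requires csc_matrix.
    M = csc_matrix(
        (vals, (rows, cols)),
        shape=(len(vocab), len(docs)),
        dtype=np.float64,
    )
    return M, vocab


# Toy corpus: each document is a list of preprocessed tokens.
docs = [
    ["topic", "model", "anchor"],
    ["anchor", "word", "anchor"],
    ["topic", "word"],
]
M, vocab = build_word_document_matrix(docs)
```

Keeping vocab alongside M makes it possible to map row indices back to words later.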
Anchoring
To build an anchor-based topic model for a monolingual corpus, use the following function:
A, Q, anchors = anchor_topic.topics.model_topics(M, k, threshold)
Inputs:
- M, word-document matrix
- k, number of topics
- threshold, minimum percentage of documents in which a word must occur to be considered as an anchor candidate
Outputs:
- A, word-topic matrix
- Q, word-cooccurrence matrix
- anchors, 2D list of anchor words for each topic
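To inspect the result, it can help to list the highest-weighted words of each topic. The sketch below assumes that A has one row per word and one column per topic, and that a vocab list maps row indices to words; both are assumptions, and the small A used here is made-up data rather than output from the package.

```python
import numpy as np


def top_words(A, vocab, n=3):
    """Return the n highest-weighted words for each topic (column) of A.

    Hypothetical helper; assumes A has shape (number of words, k).
    """
    A = np.asarray(A)
    topics = []
    for t in range(A.shape[1]):
        # Sort word indices by descending weight in topic t.
        order = np.argsort(A[:, t])[::-1][:n]
        topics.append([vocab[i] for i in order])
    return topics


# Made-up word-topic matrix for four words and two topics.
vocab = ["anchor", "model", "topic", "word"]
A = np.array([
    [0.6, 0.1],
    [0.2, 0.1],
    [0.1, 0.5],
    [0.1, 0.3],
])
topics = top_words(A, vocab, n=2)
for t, words in enumerate(topics):
    print(f"topic {t}: {' '.join(words)}")
```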
Multilingual anchoring
To build an anchor-based topic model for comparable corpora, use the following function:
A1, A2, Q1, Q2, anchors1, anchors2 = anchor_topic.topics.model_multi_topics(M1, M2, k, threshold1, threshold2, dictionary)
dictionary should be a text file in which each line is a tab-separated dictionary entry, for example:
hello 你好
goodbye 再見
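A dictionary file in this format can be produced and checked with the standard library. This is a minimal sketch: the helper names and the temporary file path are assumptions for illustration, and how the file is then handed to model_multi_topics (for example as a path) is not stated above, so check the package source before relying on it.

```python
import os
import tempfile

# Example entries taken from the lines above: (source word, target word).
entries = [("hello", "你好"), ("goodbye", "再見")]


def write_dictionary(path, entries):
    """Write one tab-separated entry per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for source, target in entries:
            f.write(f"{source}\t{target}\n")


def read_dictionary(path):
    """Read the file back and verify that every line has exactly two fields."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            fields = line.split("\t")
            if len(fields) != 2:
                raise ValueError(f"line {line_number}: expected 2 tab-separated fields")
            pairs.append((fields[0], fields[1]))
    return pairs


# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "dictionary.txt")
write_dictionary(path, entries)
pairs = read_dictionary(path)
```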
Updating topics
To support interactive topic modeling, users can choose their own anchors. First, build a topic model with the anchoring algorithm to get the initial anchors and the word-cooccurrence matrix Q. Then, use the following function to update the topics:
A = update_topics(Q, anchors)
For each topic, the user may pick one or more anchors. Make sure anchors is a 2D list of ints, where each number is the index of the word in Q.
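Users usually think of anchors as words rather than indices, so a small conversion step is useful. The sketch below assumes a vocab list whose order matches the rows of the word-document matrix used to build Q; that correspondence and the helper name are assumptions, not something the package provides.

```python
def words_to_anchor_indices(anchor_words, vocab):
    """Convert a 2D list of anchor words into a 2D list of int indices.

    Hypothetical helper; assumes vocab[i] is the word for row i of Q.
    """
    index = {word: i for i, word in enumerate(vocab)}
    anchors = []
    for topic_words in anchor_words:
        missing = [w for w in topic_words if w not in index]
        if missing:
            raise KeyError(f"words not in vocabulary: {missing}")
        # Plain Python ints, one inner list per topic.
        anchors.append([int(index[w]) for w in topic_words])
    return anchors


vocab = ["anchor", "model", "topic", "word"]
# One anchor for the first topic, two anchors for the second.
anchor_words = [["anchor"], ["topic", "word"]]
anchors = words_to_anchor_indices(anchor_words, vocab)
```

The resulting anchors list has the shape described above and can be passed to update_topics together with Q.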
Publications
If you use this package for academic research, please cite the relevant paper(s) as follows:
@inproceedings{yuan2018mtanchor,
title={Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages},
author={Yuan, Michelle and Van Durme, Benjamin and Boyd-Graber, Jordan},
booktitle={Advances in Neural Information Processing Systems},
year={2018}
}
@inproceedings{lund2017tandem,
title={Tandem anchoring: A multiword anchor approach for interactive topic modeling},
author={Lund, Jeffrey and Cook, Connor and Seppi, Kevin and Boyd-Graber, Jordan},
booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
volume={1},
pages={896--905},
year={2017}
}
@inproceedings{arora2013practical,
title={A practical algorithm for topic modeling with provable guarantees},
author={Arora, Sanjeev and Ge, Rong and Halpern, Yonatan and Mimno, David and Moitra, Ankur and Sontag, David and Wu, Yichen and Zhu, Michael},
booktitle={International Conference on Machine Learning},
pages={280--288},
year={2013}
}
License
Copyright (C) 2018, Michelle Yuan
Licensed under the terms of the MIT License. A full copy of the license can be found in LICENSE.txt.