Awesome
<img align="left" width="82" height="82" src="docs/_static/icon.svg">tweetopic
:zap: Blazing Fast topic modelling over short texts in Python <br>
<p align="center"> <img src="docs/_static/banner.svg" height=400 align="center"> </p>Features
- Fast :zap:
- Scalable :collision:
- High consistency and coherence :dart:
- High quality topics :fire:
- Easy visualization and inspection :eyes:
- Full scikit-learn compatibility :nut_and_bolt:
New in version 0.4.0 ✨
You can now pass random_state
to topic models to make your results reproducible.
from tweetopic import DMM
model = DMM(10, random_state=42)
🛠 Installation
Install from PyPI:
pip install tweetopic
👩💻 Usage (documentation)
Train your a topic model on a corpus of short texts:
from tweetopic import DMM
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
# Creating a vectorizer for extracting document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)
# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_components=30, n_iterations=100, alpha=0.1, beta=0.1)
# Creating topic pipeline
pipeline = Pipeline([
("vectorizer", vectorizer),
("dmm", dmm),
])
You may fit the model with a stream of short texts:
pipeline.fit(texts)
To investigate internal structure of topics and their relations to words and indicidual documents we recommend using topicwizard.
Install it from PyPI:
pip install topic-wizard
Then visualize your topic model:
import topicwizard
topicwizard.visualize(pipeline=pipeline, corpus=texts)
🎓 References
- Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233–242). Association for Computing Machinery.