Learning Clustering (Bahasa Indonesia)
Code
Source: http://brandonrose.org/clustering, modified by kirra
Data sources
- Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
- Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.
Steps
- tokenizing and stemming each article (Bahasa Indonesia)
- transforming the corpus into vector space using tf-idf
- calculating the cosine distance between each pair of documents as a measure of similarity
- clustering the documents using the k-means algorithm
- using multidimensional scaling to reduce dimensionality within the corpus
- plotting the clustering output using matplotlib and mpld3
- conducting a hierarchical clustering on the corpus using Ward clustering
- plotting a Ward dendrogram
- topic modeling using Latent Dirichlet Allocation (LDA)
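
The first few steps above (tokenizing, tf-idf, cosine distance, k-means) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the notebook's actual code: the four sample sentences, the `tokenize` helper, and the choice of two clusters are assumptions made for the example, and stemming for Bahasa Indonesia is left out.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; the notebook works on the Kompas and Tempo articles.
docs = [
    "harga saham naik bursa efek jakarta",
    "bursa saham jakarta ditutup naik",
    "tim sepak bola menang pertandingan final",
    "pertandingan sepak bola berakhir imbang",
]


def tokenize(text):
    # Placeholder tokenizer (assumption): lowercase and split on whitespace.
    # A real pipeline would also apply a Bahasa Indonesia stemmer here.
    return text.lower().split()


# Transform the corpus into a tf-idf weighted document-term matrix.
vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
tfidf = vectorizer.fit_transform(docs)

# Cosine distance = 1 - cosine similarity, one value per pair of documents.
dist = 1 - cosine_similarity(tfidf)

# Cluster the documents with k-means; k = 2 is chosen only for this toy corpus.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tfidf)

print(dist.round(2))
print(labels)
```

The `dist` matrix is what a later step such as multidimensional scaling or Ward clustering would take as input, while k-means runs directly on the tf-idf matrix.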
How to use
- download the news collections (Kompas and Tempo) and extract them to the folder "data"
- create a Python virtual environment >>> $ virtualenv env
- activate the virtual environment >>> $ source env/bin/activate
- install all dependencies >>> $ pip install -r requirements.txt
- run Jupyter >>> $ jupyter notebook
- open the file "Clustering.ipynb"
Example visualization
Sources for visualization
- http://adilmoujahid.com/posts/2015/01/interactive-data-visualization-d3-dc-python-mongodb/
- http://bl.ocks.org/lmatteis/efd9be8f472e673eef6ce9d1951256a9
- https://bl.ocks.org/bricedev/8b2da06ddef27d94cde9
- https://bl.ocks.org/jyucsiro/767539a876836e920e38bc80d2031ba7
- https://bl.ocks.org/emeeks/df6ea0128724289337ef