cca-core
cca-core (Civic CrowdAnalytics Core) offers machine learning and natural language processing utilities for processing civic text input.
Requirements
- scipy >= 0.19.1
- nltk >= 3.2.4
- scikit-learn >= 0.18.2
- beautifulsoup4 >= 4.6.0
- googletrans >= 2.2.0
- pandas >= 0.20.3
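Several classes rely on NLTK data that is downloaded separately from the library itself. A minimal setup sketch; exactly which corpora cca-core needs is an assumption here:
# hypothetical setup: the corpora actually required may differ
import nltk
nltk.download('vader_lexicon')  # lexicon used by the nltk_vader sentiment algorithm
nltk.download('punkt')  # tokenizer models
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger (e.g., for pos_vec)
nltk.download('stopwords')  # stop word lists for the supported languages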
Classes and Usage
SentimentAnalyzer
Analyzes the sentiment polarity of a collection of documents. It determines whether the sentiment of each doc is positive, negative, or neutral.
Parameters:
- neu_inf_lim: float, -0.05 by default. If a doc's polarity score is lower than this parameter, the sentiment is considered negative. Use values greater than -1 and lower than 0.
- neu_sup_lim: float, 0.05 by default. If a doc's polarity score is greater than this parameter, the sentiment is considered positive. Use values greater than 0 and lower than 1.
- language: string, 'english' by default. Language in which the documents are written. Two languages are supported natively:
  - 'english': through the nltk_vader algorithm
  - 'spanish': through the ML_SentiCon algorithm
  If you use another language, the module will first translate each document to English (using the Google Translate AJAX API) so it can then reuse the nltk_vader algorithm for English docs.
Methods
- analyze_docs:
- Description: Takes as input a list of strings. For each document in that list, a sentiment label and a polarity score are assigned. The possible values for the label are 'pos' (positive), 'neu' (neutral), and 'neg' (negative). The score is a float.
- Method Parameters:
- docs: list of strings.
Attributes
- tagged_docs: list of tuples, where each tuple consists of three elements:
- the original text document. Data type: string
- the sentiment label of that document. Data type: string
- the polarity score of that document. Data type: float
Examples
# import the Sentiment Analyzer class
from cca_core import SentimentAnalyzer
# create an instance of the analyzer
sa = SentimentAnalyzer(neu_inf_lim=-0.05,
                       neu_sup_lim=0.05,
                       language='spanish')
# sample docs
docs = [
    'Reciclar me parece buena idea. Reutilizar desechos es muy provechoso.',
    'Mala gestión. Lamentable y pobre manjeo de los encargados.'
]
# analyze docs with the 'analyze_docs' method
sa.analyze_docs(docs)
# results are accessible through the 'tagged_docs' attribute
print(sa.tagged_docs[0])
# ('Reciclar me parece buena idea. Reutilizar desechos es muy provechoso.', 'pos', 1.0)
print(sa.tagged_docs[1])
# ('Mala gestión. Lamentable y pobre manjeo de los encargados.', 'neg', -0.15)
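The example above uses the Spanish analyzer; a minimal sketch of the same workflow with the default English configuration, filtering the negative docs out of 'tagged_docs' (the exact labels and scores depend on the VADER lexicon and are not asserted here):
# default limits (-0.05, 0.05) and default language ('english')
sa_en = SentimentAnalyzer()
sa_en.analyze_docs([
    'The new recycling program is a great idea',
    'The park is dirty and poorly maintained'
])
# keep only the docs tagged as negative
negatives = [doc for (doc, label, score) in sa_en.tagged_docs if label == 'neg']
print(negatives)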
ConceptExtractor
Extracts the most common concepts from a collection of documents.
Parameters:
- num_concepts : int, 5 by default. The number of concepts to extract.
- context_words: list, empty list by default. List of context-specific words that should not be considered in the analysis.
- ngram_range: tuple, (1,1) by default. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
- pos_vec: list, ['NN', 'NNP'] by default (i.e., only words tagged as nouns are considered). List of part-of-speech tags that should be considered in the analysis. See the Penn Treebank POS tag set for a complete list of tags.
- consider_urls: boolean, False by default. Whether URLs should be removed or not.
- language: string, 'english' by default. Language of the documents. Only languages supported by the NLTK library are supported.
Methods
- extract_concepts:
- Description: Extract the most common concepts in the collection of documents.
- Method Parameters:
- docs: list of strings.
Attributes
- common_concepts: list of tuples, where each tuple consists of two elements:
- A concept, represented by a text n-gram. Data type: string.
- The number of occurrences of the concept within the document collection. Data type: integer.
Examples
# import the Concept Extractor class
from cca_core import ConceptExtractor
# create an instance of the extractor
ce = ConceptExtractor(
    num_concepts=4,
    language='english',
    pos_vec=['NN', 'NNP', 'NNS', 'NNPS']
)
# sample docs
docs = [
    'Make new bikes lanes in the park',
    'Clean the campus and add more trash cans',
    'Use bikes instead of cars during weekends',
    'Clean up the streets',
    'Create a bike renting service for employees',
    'Too much garbage. Cleaning needed',
    'Use bikes or another alternative trasnportation',
    'Keep streets clean',
]
# extract the most common concepts with the 'extract_concepts' method
ce.extract_concepts(docs)
# the 'common_concepts' attribute holds the extracted concepts and their number of appearances
print(ce.common_concepts)
# [('bikes', 2), ('use', 2), ('streets', 2), ('lanes', 1)]
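The ngram_range and context_words parameters can change what counts as a concept. A minimal sketch reusing the docs list above; the concepts actually returned are not asserted here:
# extract unigrams and bigrams, ignoring the context-specific word 'campus'
ce2 = ConceptExtractor(
    num_concepts=3,
    ngram_range=(1, 2),
    context_words=['campus'],
    language='english'
)
ce2.extract_concepts(docs)
print(ce2.common_concepts)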
DocumentClustering
Clusters documents by similarity using the k-means algorithm.
Parameters:
- num_clusters: int, 5 by default. The number of clusters in which the documents will be grouped.
- context_words: list, empty list by default. List of context-specific words that should not be considered in the analysis.
- ngram_range: tuple, (1,1) by default. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
- min_df: float in range [0.0, 1.0] or int, 0.1 by default. The minimum number of documents that any term must be contained in. It can either be an integer, which sets the number of documents directly, or a decimal between 0 and 1, which is interpreted as a percentage of all documents.
- max_df: float in range [0.0, 1.0] or int, 0.9 by default. The maximum number of documents that any term can be contained in. It can either be an integer, which sets the number of documents directly, or a decimal between 0 and 1, which is interpreted as a percentage of all documents.
- consider_urls: boolean, False by default. Whether URLs should be removed or not.
- language: string, 'english' by default. Language of the documents. Only languages supported by the NLTK library are supported.
- algorithm: string, 'k-means' by default. Clustering algorithm used to group the documents. Currently available: 'k-means' and 'agglomerative' (hierarchical).
- use_idf: boolean, False by default. If True, TF-IDF vectorization will be used for feature extraction; if False, only TF will be used.
Methods
- clustering:
- Description: Clusters a collection of documents into groups by similarity.
- Method Parameters:
- docs: list of strings.
- top_terms_per_cluster:
- Description: Extracts the most common concepts of each cluster (see the sketch after the example below).
- Method Parameters:
- num_terms_per_cluster: integer, the number of concepts to extract from each cluster.
Attributes
- clusters: list with the cluster label assigned to each doc, in the same order as the input docs.
- num_docs_per_cluster: dict where the keys are the cluster labels and the values are the number of docs assigned to each cluster.
Examples
# import the Document Clustering class
from cca_core import DocumentClustering
# create an instance of the class
clu = DocumentClustering(num_clusters=2,
                         language='english',
                         max_features=5)
# sample docs
docs = [
    'Make new bikes lanes in the park',
    'Clean the campus and add more trash cans',
    'Use bikes instead of cars during weekends',
    'Clean up the streets',
    'Create a bike renting service for employees',
    'Too much garbage. Cleaning needed',
    'Use bikes or another alternative trasnportation',
    'Keep streets clean',
]
# start the clustering process with the 'clustering' method
clu.clustering(docs)
# the 'clusters' attribute has the cluster label assigned to each doc
print(clu.clusters)
# [0, 1, 0, 1, 0, 1, 0, 1]
# the 'num_docs_per_cluster' attribute is a dict that shows how many docs were assigned to each cluster
print(clu.num_docs_per_cluster)
# {'0': 4, '1': 4}
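The top_terms_per_cluster method described above is not shown in the example; a minimal continuation sketch, assuming the method returns the top terms rather than only storing them in an attribute:
# get the 3 most common concepts of each cluster
top_terms = clu.top_terms_per_cluster(num_terms_per_cluster=3)
print(top_terms)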
DocumentClassifier
Trains a classifier with labeled documents and classifies new documents into one of the labeled classes.
Parameters:
- train_p: float, 0.8 by default. The proportion of the 'dev docs' used as 'train docs'. Use values greater than 0 and lower than 1. The remaining docs will be used as 'test docs'.
- n_folds: integer, 10 by default. Number of folds to be used in the k-fold cross-validation technique for choosing different sets as 'train docs'.
- vocab_size: integer, 500 by default. The size of the vocabulary set that will be used for extracting features out of the docs.
- t_classifier: string, 'NB' by default. The type of classifier model used. Available types are 'NB' (Naive Bayes), 'DT' (Decision Tree), 'RF' (Random Forest), and 'SVM' (Support Vector Machine).
- language: string, 'english' by default. Language in which the documents are written.
- train_method: string, 'all_class_train' by default. The method used to train the classifier. There are two options: 'all_class_train' and 'cross_validation'.
Methods
- classify_docs:
- Description: First trains the classifier with the labeled data, then classifies the unlabeled data.
- Method Parameters:
- docs: list of tuples (t, c), where t is a text document and c is the category label of t. Both t and c are strings. If c is an empty string, t is unlabeled and is meant to be classified.
Attributes
- classified_docs: list of tuples (t, c), where t is a text document and c is the category label assigned to t. Both t and c are strings. Only the docs that were unlabeled when calling the classify_docs method are included in this list; the ones that were already labeled are not.
Examples
# import the Document Classifier class
from cca_core import DocumentClassifier
# create an instance of the classifier
cla = DocumentClassifier(
    language="english",
    t_classifier="SVM",
    vocab_size=5
)
# sample docs. The last two are unclassified docs
docs = [
    ('Make new bikes lanes in the park', 'transportation'),
    ('Clean the campus and add more trash cans', 'cleaning'),
    ('Use bikes instead of cars during weekends', 'transportation'),
    ('Clean up the streets', 'cleaning'),
    ('Create a bike renting service for employees', 'transportation'),
    ('Too much garbage. Cleaning needed', 'cleaning'),
    ('Use bikes or another alternative trasnportation', ''),
    ('Keep streets clean', ''),
]
# classify docs with the 'classify_docs' method
cla.classify_docs(docs)
# all previously unclassified docs are now classified
print(cla.classified_docs[0])
# ('Use bikes or another alternative trasnportation', 'transportation')
print(cla.classified_docs[1])
# ('Keep streets clean', 'cleaning')
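A minimal sketch of the same workflow with a different classifier type and the 'cross_validation' training method described in the parameters, reusing the docs list above; the predicted categories are not asserted here:
# train a Random Forest using 10-fold cross validation
cla_rf = DocumentClassifier(
    t_classifier="RF",
    train_method="cross_validation",
    n_folds=10,
    language="english"
)
cla_rf.classify_docs(docs)
# only the docs that were unlabeled are included in 'classified_docs'
for text, category in cla_rf.classified_docs:
    print(text, '->', category)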