License: MIT

Arabica

Python package for text mining of time-series data

Text data is often recorded as a time series with significant variability over time. Examples of time-series text data include social media conversations, product reviews, research metadata, central banker communication, and newspaper headlines. Arabica makes exploratory analysis of these datasets simple by providing aggregated n-gram frequencies, visualizations (word cloud, heatmap, and line plot), and sentiment analysis with breakpoint identification.

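Arabica automatically removes punctuation from the input. It can also apply all or a selected combination of the following cleaning operations: lowercasing, removal of numbers, removal of common stop words, and removal of additional stop words.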

Arabica works with texts of languages based on the Latin alphabet, uses cleantext for punctuation cleaning, and enables stop words removal for languages in the NLTK corpus of stopwords.
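
The cleaning steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Arabica's implementation: the function clean_text and the small TOY_STOPWORDS set are hypothetical names introduced here, whereas Arabica itself relies on cleantext for punctuation and on the NLTK corpus for stop words.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration only;
# Arabica itself uses the NLTK corpus of stop words.
TOY_STOPWORDS = {"the", "a", "an", "of", "in", "and", "is"}


def clean_text(text, skip=(), numbers=False, lower_case=False, stopwords=frozenset()):
    """Apply the cleaning steps to a single string."""
    # Cut out additional strings verbatim, before any tokenization.
    for s in skip:
        text = text.replace(s, " ")
    # Punctuation is always removed.
    text = text.translate(str.maketrans("", "", string.punctuation))
    if numbers:
        text = re.sub(r"\d+", " ", text)
    if lower_case:
        text = text.lower()
    # Stop words are matched case-insensitively on whitespace tokens.
    tokens = [t for t in text.split() if t.lower() not in stopwords]
    return " ".join(tokens)


print(clean_text("The 3 Banks raised rates in 2023!",
                 numbers=True, lower_case=True, stopwords=TOY_STOPWORDS))
# prints "banks raised rates"
```

Removing the skip strings before tokenization mirrors the description of skip in the function signatures: the characters are cut out as-is, so an overly broad string can bias the dataset.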

It reads dates in European ('eur') and American ('us') formats.

Installation

Arabica requires Python 3.8 - 3.10 and the following packages: NLTK (stop words removal), cleantext (text cleaning), wordcloud (word cloud visualization), plotnine (heatmaps and line graphs), matplotlib (word clouds and graphical operations), vaderSentiment (sentiment analysis), finvader (financial sentiment analysis), and jenkspy (breakpoint identification).

To install using pip, use:

pip install arabica

Usage

from arabica import arabica_freq
from arabica import cappuccino
from arabica import coffee_break 

arabica_freq enables a specific set of cleaning operations (lowercasing and removal of numbers, common stop words, and additional stop words) and returns a dataframe with unigram, bigram, and trigram frequencies aggregated over a period.

def arabica_freq(text: str,                # Text
                 time: str,                # Time
                 date_format: str,         # Date format: 'eur' - European, 'us' - American
                 time_freq: str,           # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 max_words: int,           # Maximum of most frequent n-grams displayed for each period
                 stopwords: [],            # Languages for stop words
                 stopwords_ext: [],        # Languages for extended stop words list, currently provided lists: 'english'
                 skip: [],                 # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
                 numbers: bool = False,    # Remove numbers
                 lower_case: bool = False  # Lowercase text
)
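
To make the idea of n-gram frequencies aggregated over a period concrete, here is a minimal standard-library sketch. It is not how arabica_freq is implemented and it returns a plain dict rather than a dataframe. The helper names ngrams and ngram_freq, the date layouts assumed for 'eur' and 'us', and the comma-joined n-gram strings are all assumptions made for this illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Assumed date layouts for this sketch only: day-first for 'eur', month-first for 'us'.
DATE_FORMATS = {"eur": "%d.%m.%Y", "us": "%m/%d/%Y"}
# Period keys for yearly, monthly, and daily aggregation.
PERIOD_FORMATS = {"Y": "%Y", "M": "%Y-%m", "D": "%Y-%m-%d"}


def ngrams(tokens, n):
    """Return the n-grams of a token list as comma-joined strings."""
    return [",".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq(texts, dates, date_format, time_freq, max_words, n=1):
    """Count n-grams per period and keep the max_words most frequent ones."""
    counts = defaultdict(Counter)
    for text, date in zip(texts, dates):
        parsed = datetime.strptime(date, DATE_FORMATS[date_format])
        period = parsed.strftime(PERIOD_FORMATS[time_freq])
        counts[period].update(ngrams(text.split(), n))
    # Periods are returned in chronological order.
    return {p: c.most_common(max_words) for p, c in sorted(counts.items())}


texts = ["rates rise again", "rates rise", "markets fall"]
dates = ["01.02.2022", "15.02.2022", "03.01.2023"]

# Most frequent bigram per month.
print(ngram_freq(texts, dates, date_format="eur", time_freq="M", max_words=1, n=2))
```

The parameter names date_format, time_freq, and max_words follow the signature above; everything else is illustrative.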

cappuccino enables the same cleaning operations (lowercasing and removal of numbers, common stop words, and additional stop words) and provides plots for descriptive (word cloud) and time-series (heatmap, line plot) visualization.

def cappuccino(text: str,                # Text
               time: str,                # Time
               date_format: str,         # Date format: 'eur' - European, 'us' - American
               plot: str,                # Chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int,               # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str,           # Aggregation period: 'Y'/'M', if no aggregation: 'ungroup'
               max_words: int,           # Maximum of most frequent n-grams displayed for each period
               stopwords: [],            # Languages for stop words
               stopwords_ext: [],        # Languages for extended stop words list, currently provided lists: 'english'
               skip: [],                 # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
               numbers: bool = False,    # Remove numbers
               lower_case: bool = False  # Lowercase text
)

coffee_break provides sentiment analysis and breakpoint identification in aggregated time series of sentiment. The implemented models are VADER for general language ('vader') and FinVADER for financial text ('finvader').

Break points in the time series are identified with the Fisher-Jenks algorithm (Jenks, 1977. Optimal data classification for choropleth maps).

def coffee_break(text: str,                 # Text
                 time: str,                 # Time
                 date_format: str,          # Date format: 'eur' - European, 'us' - American
                 model: str,                # Sentiment classifier, 'vader' - general language, 'finvader' - financial text                
                 skip: [],                  # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
                 preprocess: bool = False,  # Clean data from numbers and punctuation
                 time_freq: str,            # Aggregation period: 'Y'/'M'
                 n_breaks: int              # Number of breakpoints: min. 2
)
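
The Fisher-Jenks idea can be illustrated with a small dynamic program: sort the values and split them into contiguous classes so that the total within-class sum of squared deviations is minimal. The sketch below is a plain-Python illustration with a hypothetical function name, jenks_breaks, and made-up sentiment values. It is not the code path coffee_break uses (Arabica relies on jenkspy), and how n_breaks maps onto the number of classes is not specified here.

```python
def jenks_breaks(values, n_classes):
    """Return class bounds [min, upper_1, ..., upper_k] for an optimal
    split of the sorted values into n_classes contiguous groups."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    s1, s2 = [0.0], [0.0]
    for x in data:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def ssd(i, j):
        """Sum of squared deviations from the mean for data[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: lowest cost of splitting data[:j] into k groups.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    start = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    start[k][j] = i

    # Walk back through the stored split points to collect each group's upper bound.
    bounds = []
    j = n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = start[k][j]
    return [data[0]] + bounds[::-1]


# Made-up aggregated sentiment scores for illustration.
sentiment = [0.10, 0.12, 0.11, 0.50, 0.52, 0.90, 0.88]
print(jenks_breaks(sentiment, n_classes=3))
```

The cubic-time loop keeps the sketch short and is fine for a few hundred aggregated periods; a production implementation would use an optimized library.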

Documentation, examples and tutorials

For more examples of coding, read these tutorials:

General use:

Applications:


💬 Please visit here for any questions, issues, bugs, and suggestions.

Citation

Using arabica in a paper or thesis? Please cite this paper:


@article{Koráb:2024,
  author   = {Koráb, P. and Poměnková, J.},
  title    = {Arabica: A Python package for exploratory analysis of text data},
  journal  = {Journal of Open Source Software},
  volume   = {9},
  number   = {97},
  pages    = {6186},
  year     = {2024},
  doi      = {10.21105/joss.06186},
}