
TextWiser: Text Featurization Library

TextWiser (AAAI'21) is a research library that provides a unified framework for text featurization based on a rich set of methods, while taking advantage of pretrained models provided by state-of-the-art libraries.

The main contributions include:

- A rich set of embeddings, from bag-of-words and TfIdf to Doc2Vec and transformer-based models, behind a single fit/transform interface
- Chainable transformations such as SVD, LDA, NMF, UMAP, and pooling that can follow any embedding
- The Compound Embedding, a context-free grammar for declaring arbitrarily complex featurizations
- Fine-tunable embeddings that can be trained end-to-end as part of a downstream PyTorch model

TextWiser is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments. Documentation is available at fidelity.github.io/textwiser, and a video of the paper presentation at AAAI 2021 is also available.

Quick Start

# Conceptually, TextWiser is composed of an Embedding, potentially with a pretrained model,
# that can be chained into zero or more Transformations
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Data
documents = ["Some document", "More documents. Including multi-sentence documents."]

# Model: TfIdf; the `min_df` parameter is passed through to sklearn automatically
emb = TextWiser(Embedding.TfIdf(min_df=1))

# Model: TfIdf followed by NMF and then SVD
emb = TextWiser(Embedding.TfIdf(min_df=1), [Transformation.NMF(n_components=30), Transformation.SVD(n_components=10)])

# Model: Word2Vec trained from scratch on the input data (no pretrained model)
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained=None), Transformation.Pool(pool_option=PoolOptions.min))

# Model: BERT with the pretrained bert-base-uncased embedding
emb = TextWiser(Embedding.Word(word_option=WordOptions.bert), Transformation.Pool(pool_option=PoolOptions.first))

# Features
vecs = emb.fit_transform(documents)

Available Embeddings

| Embeddings | Notes |
|------------|-------|
| Bag of Words (BoW) | Supported by scikit-learn <br> Defaults to training from scratch |
| Term Frequency Inverse Document Frequency (TfIdf) | Supported by scikit-learn <br> Defaults to training from scratch |
| Document Embeddings (Doc2Vec) | Supported by gensim <br> Defaults to training from scratch |
| Universal Sentence Encoder (USE) | Supported by tensorflow, see requirements <br> Defaults to large v5 |
| Compound Embedding | Supported by a context-free grammar |
| Word Embedding: Word2Vec | Supported by these pretrained embeddings <br> Common pretrained options include crawl, glove, extvec, twitter, and en-news <br> When the pretrained option is None, trains a new model from the given data <br> Defaults to en, FastText embeddings trained on news |
| Word Embedding: Character | Initialized randomly and not pretrained <br> Useful when trained for a downstream task <br> Enable fine-tuning to get good embeddings |
| Word Embedding: BytePair | Supported by these pretrained embeddings <br> Pretrained options can be specified with the string `<lang>_<dim>_<vocab_size>` <br> Default options can be omitted like en, en_100, or en__10000 <br> Defaults to en, which is equal to en_100_10000 |
| Word Embedding: ELMo | Supported by these pretrained embeddings from TensorflowHub <br> Defaults to original |
| Word Embedding: Flair | Supported by these pretrained embeddings <br> Defaults to news-forward-fast |
| Word Embedding: BERT | Supported by these pretrained embeddings <br> Defaults to bert-base-uncased |
| Word Embedding: OpenAI GPT | Supported by these pretrained embeddings <br> Defaults to openai-gpt |
| Word Embedding: OpenAI GPT2 | Supported by these pretrained embeddings <br> Defaults to gpt2-medium |
| Word Embedding: TransformerXL | Supported by these pretrained embeddings <br> Defaults to transfo-xl-wt103 |
| Word Embedding: XLNet | Supported by these pretrained embeddings <br> Defaults to xlnet-large-cased |
| Word Embedding: XLM | Supported by these pretrained embeddings <br> Defaults to xlm-mlm-en-2048 |
| Word Embedding: RoBERTa | Supported by these pretrained embeddings <br> Defaults to roberta-base |
| Word Embedding: DistilBERT | Supported by these pretrained embeddings <br> Defaults to distilbert-base-uncased |
| Word Embedding: CTRL | Supported by these pretrained embeddings <br> Defaults to ctrl |
| Word Embedding: ALBERT | Supported by these pretrained embeddings <br> Defaults to albert-base-v2 |
| Word Embedding: T5 | Supported by these pretrained embeddings <br> Defaults to t5-base |
| Word Embedding: XLM-RoBERTa | Supported by these pretrained embeddings <br> Defaults to xlm-roberta-base |
| Word Embedding: BART | Supported by these pretrained embeddings <br> Defaults to facebook/bart-base |
| Word Embedding: ELECTRA | Supported by these pretrained embeddings <br> Defaults to google/electra-base-generator |
| Word Embedding: DialoGPT | Supported by these pretrained embeddings <br> Defaults to microsoft/DialoGPT-small |
| Word Embedding: Longformer | Supported by these pretrained embeddings <br> Defaults to allenai/longformer-base-4096 |
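
A specific pretrained model from the table is selected through the `pretrained` argument of the embedding. The snippet below is a minimal sketch using the glove option listed for Word2Vec; any other listed option string can be substituted, and it reuses the imports and `documents` from the Quick Start.

# Sketch: Word2Vec backed by the pretrained GloVe vectors listed above,
# pooled into a single vector per document
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained='glove'),
                Transformation.Pool(pool_option=PoolOptions.mean))
vecs = emb.fit_transform(documents)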

Available Transformations

| Transformations | Notes |
|-----------------|-------|
| Singular Value Decomposition (SVD) | Differentiable |
| Latent Dirichlet Allocation (LDA) | Not differentiable |
| Non-negative Matrix Factorization (NMF) | Not differentiable |
| Uniform Manifold Approximation and Projection (UMAP) | Not differentiable |
| Pooling Word Vectors | Applies to word embeddings only <br> Reduces word-level vectors to document-level <br> Pool options include max, min, mean, first, and last <br> Defaults to max |
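
Transformations are chained after the embedding in the order given, exactly as in the Quick Start. As a sketch (the component count is an arbitrary illustrative value), the snippet below pools word vectors with the default max option and then reduces the result with a differentiable SVD:

# Sketch: pool word-level vectors into document vectors, then reduce with SVD
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained=None),
                [Transformation.Pool(pool_option=PoolOptions.max),
                 Transformation.SVD(n_components=2)])
vecs = emb.fit_transform(documents)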

Usage Examples

Examples can be found under the notebooks folder.

Installation

TextWiser requires Python 3.8+ and can be installed from PyPI using `pip install textwiser`, with all optional dependencies using `pip install textwiser[full]`, or by building from source following the instructions in our documentation.

Compound Embedding

A unique research contribution of TextWiser lies in its novel approach to creating embeddings from components, called the Compound Embedding.

This method allows forming arbitrarily complex embeddings, thanks to a context-free grammar that defines a formal language for valid text featurization. You can see the details in our documentation and in the usage example.
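
As a rough sketch of what this looks like in code, a compound embedding is declared through a nested schema and passed to `Embedding.Compound`. The schema structure below is illustrative only; the exact grammar and option names are given in the documentation.

# Sketch (illustrative schema): concatenate pooled word2vec vectors with an
# NMF-reduced TfIdf representation, then compress the concatenation with SVD
schema = {
    "transform": [
        {"concat": [
            {"transform": [("word2vec", {"pretrained": "en"}), "pool"]},
            {"transform": ["tfidf", ("nmf", {"n_components": 30})]}
        ]},
        ("svd", {"n_components": 10})
    ]
}
emb = TextWiser(Embedding.Compound(schema=schema))
vecs = emb.fit_transform(documents)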

Fine-Tuning for Downstream Tasks

All Word2Vec and transformer-based embeddings, as well as any embedding followed by an SVD transformation, are fine-tunable for downstream tasks. In other words, if you pass the resulting fine-tunable embedding to a PyTorch training loop, the features will automatically be trained for your application. You can see the details in our documentation and in the usage example.
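
As a minimal sketch of what that looks like in practice (the classifier, labels, and optimizer settings below are placeholders), the featurizer is simply optimized alongside the rest of a PyTorch model; the `dtype` and `is_finetuneable` arguments follow the documentation and keep the features in the autograd graph.

import torch

# Sketch: a fine-tunable TfIdf -> SVD featurizer trained jointly with a downstream
# classifier; the labels and layer sizes are placeholders for a real task
emb = TextWiser(Embedding.TfIdf(min_df=1), Transformation.SVD(n_components=2),
                dtype=torch.float32, is_finetuneable=True)
clf = torch.nn.Linear(2, 2)

emb.fit(documents)
optimizer = torch.optim.Adam(list(emb.parameters()) + list(clf.parameters()), lr=1e-3)

loss = torch.nn.functional.cross_entropy(clf(emb.transform(documents)),
                                         torch.tensor([0, 1]))
loss.backward()
optimizer.step()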

Tokenization

In general, text data should be whitespace-tokenized before being fed into TextWiser. Customized tokenization is also supported, as described in more detail in our documentation.
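
Since the scikit-learn backed featurizers pass their parameters through (as with `min_df` in the Quick Start), one way to customize tokenization for them is to hand over a tokenizer callable. The function below is a hypothetical example rather than part of the library:

# Sketch: a custom lowercasing whitespace tokenizer passed through to
# scikit-learn's TfidfVectorizer
def whitespace_tokenizer(doc):
    return doc.lower().split()

emb = TextWiser(Embedding.TfIdf(min_df=1, tokenizer=whitespace_tokenizer))
vecs = emb.fit_transform(documents)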

Support

Please submit bug reports, questions and feature requests as Issues.

Citation

If you use TextWiser in a publication, please cite it as:

  @article{textwiser2021,
    author={Kilitcioglu, Doruk and Kadioglu, Serdar},
    title={Representing the Unification of Text Featurization using a Context-Free Grammar},
    url={https://github.com/fidelity/textwiser},
    journal={Proceedings of the AAAI Conference on Artificial Intelligence},
    volume={35},
    number={17},
    year={2021},
    month={May},
    pages={15439-15445}
  }

License

TextWiser is licensed under the Apache License 2.0.
