🛠 ALToolbox

ALToolbox is a framework for practical active learning in NLP.

<hr>

Installation | Quick Start | Overview | Docs | Citation

ALToolbox is a framework for active learning annotation in natural language processing. Currently, the framework supports text classification and sequence tagging tasks. ALToolbox provides state-of-the-art query strategies, a serverless annotation tool for the Jupyter IDE, and a set of tools that help to reduce the computational overhead and duration of AL iterations and to increase the reusability of annotated data.

<a name="installation"></a>⚙️ Installation

pip install acleto

To annotate instances for active learning in Jupyter Notebook or Jupyter Lab, one has to install an additional widget after installing the framework. For Jupyter Notebook, run:

jupyter nbextension install --py --symlink --sys-prefix text_selector
jupyter nbextension enable --py --sys-prefix text_selector

For Jupyter Lab, run:

jupyter labextension install js
jupyter labextension install text_selector

<a name="quick_start"></a>💫 Quick Start

For a quick start, please see the examples of launching active learning annotation or benchmarking a novel query strategy / unlabeled pool subsampling strategy for sequence tagging and text classification tasks:

| # | Notebook |
| - | - |
| 1 | Launching Active Learning for Token Classification |
| 2 | Launching Active Learning for Text Classification |
| 3 | Benchmarking a novel AL query strategy / unlabeled pool subsampling strategy |

<a name="overview"></a>🔭 Overview

1. Query Strategies

| # | Strategy | Citation |
| - | - | - |
| 1 | ALPS | Citation |
| 2 | BADGE | Citation |
| 3 | BAIT | Citation |
| 4 | BALD | Citation |
| 5 | BatchBALD | Citation |
| 6 | Breaking Ties (BT) (also Maximum Margin) | Citation |
| 7 | Contrastive Active Learning (CAL) | Citation |
| 8 | Cluster Margin | Citation |
| 9 | Coreset | Citation |
| 10 | Expected Gradient Length (EGL) | Citation |
| 11 | Embeddings KM | Citation |
| 12 | Entropy | Citation |
| 13 | Least Confidence (LC) | Citation |
| 14 | Mahalanobis Distance | Citation |
| 15 | Maximum Normalized Log-Probability (MNLP) | Citation |
| 16 | Random (No AL) | - |

3. Unlabeled Pool Subsampling Strategies

| # | Strategy | Citation |
| - | - | - |
| 1 | UPS | Citation |
| 2 | Naïve | Citation |
| 3 | Random | - |

4. Pipelines for postprocessing of annotated data and preparation of acquisition models

5. GUI Annotator tool in Jupyter IDE

Our framework provides a serverless GUI annotation tool integrated into the Jupyter IDE.

6. Extensible benchmark for query strategies

TODO:

<a name="documentation"></a>📕 Documentation

Usage

The configs folder contains config files with general settings. The experiments folder contains config files with the experimental design. To run an experiment with a chosen configuration, specify the config file name in the HYDRA_CONFIG_NAME variable and run the train.sh script (see ./examples/al for details).

For example, to launch PLASM on AG-News with ELECTRA as a successor model:

cd PATH_TO_THIS_REPO
HYDRA_CONFIG_PATH=../experiments/ag_news HYDRA_EXP_CONFIG_NAME=ag_plasm python active_learning/run_tasks_on_multiple_gpus.py

Config structure explanation

Output Explanation

By default, the results are saved in the folder RUN_DIRECTORY/workdir_run_active_learning/DATE_OF_RUN/${TIME_OF_RUN}_${SEED}_${MODEL_CHECKPOINT}. For instance, when launching from the repository folder: al_nlp_feasible/workdir/run_active_learning/2022-06-11/15-59-31_23419_distilbert_base_uncased_bert_base_uncased.
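The run folder name packs the time, seed, and model checkpoint into one string, so it can be split back into its parts when collecting results. The following is a minimal sketch, not part of the framework: `parse_run_dir` is a hypothetical helper, and it assumes the last path component follows the `${TIME_OF_RUN}_${SEED}_${MODEL_CHECKPOINT}` pattern and that its parent folder is the run date.

```python
from pathlib import PurePosixPath


def parse_run_dir(path):
    """Split a run directory path into date, time, seed, and checkpoint.

    Hypothetical helper: relies only on the last two path components.
    """
    run_dir = PurePosixPath(path)
    # The checkpoint name may itself contain underscores, so split only twice.
    time_of_run, seed, model_checkpoint = run_dir.name.split("_", 2)
    return {
        "date": run_dir.parent.name,
        "time": time_of_run,
        "seed": int(seed),
        "model_checkpoint": model_checkpoint,
    }


run_info = parse_run_dir(
    "al_nlp_feasible/workdir/run_active_learning/2022-06-11/"
    "15-59-31_23419_distilbert_base_uncased_bert_base_uncased"
)
```

Splitting at most twice keeps checkpoint names with underscores intact, which matters for names such as the one in the example path.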

Post-processing

Our framework provides tools for post-processing annotated data so that it can be reused to build powerful models. PLASM, which aims to alleviate the acquisition-successor mismatch problem and makes it possible to build a model of an arbitrary type on the labeled data without performance degradation, is implemented in post_processing/pipeline_plasm. It uses the config cls_plasm / ner_plasm (from jupyterlab_demo/configs). A brief explanation of the config structure:

🆕 New strategies addition

An AL query strategy should be designed as a function that:

  1. Receives 3 positional arguments and additional strategy kwargs:
    • model, an instance of a class inherited from TransformersBaseWrapper, PytorchEncoderWrapper, or FlairModelWrapper: the model wrapper;
    • X_pool of class Dataset or TransformersDataset: dataset with the unlabeled instances;
    • n_instances of class int: number of instances to query;
    • kwargs: additional strategy-specific arguments.
  2. Outputs 3 objects in the following order:
    • query_idx of class array-like: array with the indices of the queried instances;
    • query of class Dataset or TransformersDataset: dataset with the queried instances;
    • uncertainty_estimates of class np.ndarray: uncertainty estimates of the instances from X_pool. The higher the value, the more uncertain the model is about the instance.

The strategy function should have the same name as the file where it is placed (e.g. function def my_strategy inside a file path_to_strategy/my_strategy.py). To use your strategy, set al.strategy=PATH_TO_FILE_YOUR_STRATEGY in the experiment config.

An example is given in examples/benchmark_custom_strategy.ipynb
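The contract above can be illustrated with a minimal sketch of a least-confidence strategy, as it might look in a file named my_strategy.py. It is an illustration under stated assumptions, not the framework's implementation: `predict_proba` is a hypothetical method standing in for whatever prediction API the real model wrappers expose, `_ToyModel` is a stand-in for a wrapper, and a plain list plays the role of the Dataset.

```python
import numpy as np


def my_strategy(model, X_pool, n_instances, **kwargs):
    """Least-confidence sketch following the (model, X_pool, n_instances) contract."""
    # ASSUMPTION: `predict_proba` is hypothetical; it should return one row of
    # class probabilities per pool instance.
    probas = np.asarray(model.predict_proba(X_pool))
    # Higher value = less confident top prediction = more uncertain.
    uncertainty_estimates = 1.0 - probas.max(axis=1)
    # Indices of the n_instances most uncertain instances, most uncertain first.
    query_idx = np.argsort(-uncertainty_estimates, kind="stable")[:n_instances]
    # ASSUMPTION: the pool supports integer indexing; a real Dataset object
    # would use its own selection mechanism here.
    query = [X_pool[int(i)] for i in query_idx]
    return query_idx, query, uncertainty_estimates


class _ToyModel:
    """Stand-in for a model wrapper, used only to exercise the sketch."""

    def predict_proba(self, X_pool):
        return [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]


pool = ["text a", "text b", "text c", "text d"]
query_idx, query, uncertainty_estimates = my_strategy(_ToyModel(), pool, n_instances=2)
```

Note that the uncertainty estimates are returned for the whole pool, not only for the queried instances, as the output description requires.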

🆕 New pool subsampling strategies addition

Adding a new pool subsampling strategy is similar to adding an AL query strategy. A subsampling strategy should be designed as a function that:

  1. Receives 2 positional arguments and additional subsampling strategy kwargs:
    • uncertainty_estimates of class np.ndarray: uncertainty estimates of the instances, in the order they are stored in the unlabeled data;
    • gamma_or_k_confident_to_save of class float or int: either a share / number of instances to save (as in random / naive subsampling) or an internal parameter (as in UPS);
    • kwargs: additional subsampling-strategy-specific arguments.
  2. Outputs the indices of the instances to use (sampled indices) as an np.ndarray.

The strategy function should have the same name as the file where it is placed (e.g. function def my_subsampling_strategy inside a file path_to_strategy/my_subsampling_strategy.py). To use your subsampling strategy, set al.sampling_type=PATH_TO_FILE_YOUR_SUBSAMPLING_STRATEGY in the experiment config.

An example is given in examples/benchmark_custom_strategy.ipynb
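A minimal sketch of a subsampling function with this signature, as it might look in a file named my_subsampling_strategy.py, is given below. It is not one of the framework's strategies. It assumes that a float argument is a share of the pool, that an int argument is a number of instances, and that the instances worth keeping are the most uncertain ones; all three are assumptions made for the illustration.

```python
import numpy as np


def my_subsampling_strategy(uncertainty_estimates, gamma_or_k_confident_to_save, **kwargs):
    """Keep the most uncertain instances; return their indices as np.ndarray."""
    uncertainty_estimates = np.asarray(uncertainty_estimates)
    n_pool = len(uncertainty_estimates)
    # ASSUMPTION: float = share of the pool to keep, int = number to keep.
    if isinstance(gamma_or_k_confident_to_save, float):
        n_keep = int(gamma_or_k_confident_to_save * n_pool)
    else:
        n_keep = int(gamma_or_k_confident_to_save)
    # Clamp so the request never exceeds the pool size or drops below zero.
    n_keep = max(0, min(n_keep, n_pool))
    # Most uncertain first, then restore the original pool order.
    sampled = np.argsort(-uncertainty_estimates, kind="stable")[:n_keep]
    return np.sort(sampled)


estimates = np.array([0.1, 0.45, 0.4, 0.01])
kept_by_share = my_subsampling_strategy(estimates, 0.5)
kept_by_count = my_subsampling_strategy(estimates, 3)
```

Returning the indices in pool order keeps them aligned with the order in which the uncertainty estimates were passed in.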

Datasets

The research employed 2 token classification datasets (CoNLL-2003, OntoNotes-2012) and 2 text classification datasets (AG-News, IMDB). To launch an experiment on a custom dataset, add it in one of the following ways:

  1. Upload to Hugging Face datasets and set: config.data.path=datasets, config.data.dataset_name=DATASET_NAME, config.data.text_name=COLUMN_WITH_TEXT_OR_TOKENS_NAME, config.data.label_name=COLUMN_WITH_LABELS_OR_NER_TAGS_NAME
  2. Place the dataset in the data/DATASET_NAME folder as a train.csv / train.json file and set: config.data.path=PATH_TO_THIS_REPO/data, config.data.dataset_name=DATASET_NAME, config.data.text_name=COLUMN_WITH_TEXT_OR_TOKENS_NAME, config.data.label_name=COLUMN_WITH_LABELS_OR_NER_TAGS_NAME
  3. * Place train.txt, dev.txt, and test.txt files in data/DATASET_NAME and set the arguments as in the previous point.
  4. ** Create one folder per class inside data/DATASET_NAME, where each file in a folder contains a text labeled with the name of that folder. For details, please see the bbc_news dataset in ./data. The arguments must be set as in the previous two points.

* - only for Token Classification datasets

** - only for Text Classification datasets
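The second option above (a train.csv file plus four data settings) can be sketched as follows. This is an illustration only: `write_train_csv` and `data_overrides` are hypothetical helpers, the column names `text` and `label` are placeholders, and the exact CSV layout the framework expects is an assumption. A temporary directory stands in for PATH_TO_THIS_REPO/data.

```python
import csv
import tempfile
from pathlib import Path


def write_train_csv(data_root, dataset_name, rows, text_name="text", label_name="label"):
    """Write (text, label) pairs to <data_root>/<dataset_name>/train.csv."""
    dataset_dir = Path(data_root) / dataset_name
    dataset_dir.mkdir(parents=True, exist_ok=True)
    train_path = dataset_dir / "train.csv"
    with train_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([text_name, label_name])  # header row = column names
        writer.writerows(rows)
    return train_path


def data_overrides(data_root, dataset_name, text_name="text", label_name="label"):
    """Collect the four data settings listed above into one mapping."""
    return {
        "config.data.path": str(data_root),
        "config.data.dataset_name": dataset_name,
        "config.data.text_name": text_name,
        "config.data.label_name": label_name,
    }


with tempfile.TemporaryDirectory() as tmp:
    path = write_train_csv(tmp, "my_dataset", [("good movie", 1), ("bad movie", 0)])
    with path.open(newline="", encoding="utf-8") as f:
        written_rows = list(csv.reader(f))
    overrides = data_overrides(tmp, "my_dataset")
```

The column names passed to the writer are the same values that go into config.data.text_name and config.data.label_name, so the file and the settings stay in sync.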

Models

The current version of the repository supports all HuggingFace Transformers models that can be used with the AutoModelForSequenceClassification / AutoModelForTokenClassification classes (for text / token classification, respectively). For CNN-based / BiLSTM-CRF models, please see the al_cls_cnn.yaml / al_ner_bilstm_crf_flair.yaml configs in the ./configs folder for details.

Testing

By default, the tests are run on the cuda:0 device if CUDA is available, and on CPU otherwise. If one wants to manually specify the device for running the tests:

We recommend using the CPU for the robustness of the results. The tests for CUDA were written on a Tesla V100-SXM3 32GB with CUDA V.10.1.243.
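The default device choice described in this section can be sketched as follows. This is an illustration, not the test suite's code: `default_test_device` is a hypothetical helper name, and it assumes the availability check is PyTorch's `torch.cuda.is_available()`. When PyTorch cannot be imported, the sketch falls back to CPU.

```python
def default_test_device():
    """Return "cuda:0" when CUDA is usable through PyTorch, else "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = default_test_device()
```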

👯 Alternatives

FAMIE, Small-Text, modAL, ALiPy, libact

<a name="citation"></a>💬 Citation

@inproceedings{tsvigun-etal-2022-altoolbox,
    title = "{ALT}oolbox: A Set of Tools for Active Learning Annotation of Natural Language Texts",
    author = "Tsvigun, Akim  and
      Sanochkin, Leonid  and
      Larionov, Daniil  and
      Kuzmin, Gleb  and
      Vazhentsev, Artem  and
      Lazichny, Ivan  and
      Khromov, Nikita  and
      Kireev, Danil  and
      Rubashevskii, Aleksandr and
      Panchenko, Alexander and
      Shahmatova, Olga and
      Dylov, Dmitry and
      Galitskiy, Igor and
      Shelmanov, Artem",
    booktitle = "Proceedings of the The 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-demos.41",
    pages = "406--434",
    abstract = "We present ALToolbox {--} an open-source framework for active learning (AL) annotation in natural language processing. Currently, the framework supports text classification, sequence tagging, and seq2seq tasks. Besides state-of-the-art query strategies, ALToolbox provides a set of tools that help to reduce computational overhead and duration of AL iterations and increase annotated data reusability. The framework aims to support data scientists and researchers by providing an easy-to-deploy GUI annotation tool directly in the Jupyter IDE and an extensible benchmark for novel AL methods. We prepare a small demonstration of ALToolbox capabilities available online (http://demo.nlpresearch.group). A demo video for ALToolbox is provided at: http://demo-video.nlpresearch.group. The code of the framework is published (https://github.com/AIRI-Institute/al{\_}toolbox) under the MIT license.",
}

📄 License

© 2022 Autonomous Non-Profit Organization "Artificial Intelligence Research Institute" (AIRI). All rights reserved.

Licensed under the MIT License.