Towards Computationally Feasible Deep Active Learning

A repository to reproduce the experiments from the paper "Towards Computationally Feasible Deep Active Learning".

Installation

Install the library:

pip install -e .

Usage

The configs folder contains config files with general settings, and the experiments folder contains config files that define the experimental design. To run an experiment with a chosen configuration, specify the config file name in the HYDRA_CONFIG_NAME variable and run the train.sh script (see ./examples/al for details).

For example, to launch PLASM on AG-News with ELECTRA as the successor model:

cd PATH_TO_THIS_REPO
HYDRA_CONFIG_PATH=../experiments/ag_news HYDRA_EXP_CONFIG_NAME=ag_plasm python active_learning/run_tasks_on_multiple_gpus.py
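The same launch can be scripted from Python, which may be convenient when looping over several experiment configs. The sketch below only assembles the command and the two environment variables from the shell example above; the helper names `build_launch` and `launch` are hypothetical and not part of the repository.

```python
import os
import subprocess
import sys


def build_launch(config_path: str, exp_config_name: str):
    """Return (command, environment) for one active learning run.

    Hypothetical helper: it mirrors the shell example and nothing more.
    """
    env = dict(os.environ)
    env["HYDRA_CONFIG_PATH"] = config_path
    env["HYDRA_EXP_CONFIG_NAME"] = exp_config_name
    cmd = [sys.executable, "active_learning/run_tasks_on_multiple_gpus.py"]
    return cmd, env


def launch(repo_root: str, config_path: str, exp_config_name: str) -> None:
    """Run the experiment from the repository root (PATH_TO_THIS_REPO)."""
    cmd, env = build_launch(config_path, exp_config_name)
    subprocess.run(cmd, env=env, cwd=repo_root, check=True)


# Show what would be executed for the PLASM / AG-News example.
cmd, env = build_launch("../experiments/ag_news", "ag_plasm")
print("HYDRA_CONFIG_PATH=" + env["HYDRA_CONFIG_PATH"])
print("HYDRA_EXP_CONFIG_NAME=" + env["HYDRA_EXP_CONFIG_NAME"])
print(" ".join(cmd))
```

Calling `launch("PATH_TO_THIS_REPO", "../experiments/ag_news", "ag_plasm")` would then start the run, assuming the library has been installed as described in the Installation section.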

Config structure explanation

Output explanation

By default, results are written to the folder RUN_DIRECTORY/workdir/run_active_learning/DATE_OF_RUN/${TIME_OF_RUN}_${SEED}_${MODEL_CHECKPOINT}. For instance, when launching from the repository folder: al_nlp_feasible/workdir/run_active_learning/2022-06-11/15-59-31_23419_distilbert_base_uncased_bert_base_uncased.
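When many runs accumulate, it can help to locate the most recent output folder programmatically. The sketch below assumes the layout of the concrete example path (workdir/run_active_learning/DATE/TIME_SEED_CHECKPOINT); the function names `latest_run_dir` and `parse_run_name` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_run_name(name: str) -> Tuple[str, int, str]:
    """Split 'TIME_SEED_CHECKPOINT' into its three parts.

    The checkpoint part may itself contain underscores, so split only twice.
    """
    time_of_run, seed, checkpoint = name.split("_", 2)
    return time_of_run, int(seed), checkpoint


def latest_run_dir(run_directory: str) -> Optional[Path]:
    """Return the most recent run folder, or None if there is none."""
    root = Path(run_directory) / "workdir" / "run_active_learning"
    if not root.is_dir():
        return None
    runs = [
        run
        for date_dir in root.iterdir()
        if date_dir.is_dir()
        for run in date_dir.iterdir()
        if run.is_dir()
    ]
    # YYYY-MM-DD dates and HH-MM-SS prefixes sort chronologically as strings.
    runs.sort(key=lambda p: (p.parent.name, p.name))
    return runs[-1] if runs else None


print(parse_run_name("15-59-31_23419_distilbert_base_uncased_bert_base_uncased"))
# → ('15-59-31', 23419, 'distilbert_base_uncased_bert_base_uncased')
```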

Datasets

The paper uses two NER datasets (CoNLL-2003, OntoNotes-2012) and two text classification (CLS) datasets (AG-News, IMDB). To run an experiment on a custom dataset, add it in one of the following ways:

  1. Upload it to Hugging Face datasets and set: config.data.path=datasets, config.data.dataset_name=DATASET_NAME, config.data.text_name=COLUMN_WITH_TEXT_OR_TOKENS_NAME, config.data.label_name=COLUMN_WITH_LABELS_OR_NER_TAGS_NAME
  2. Place it in the data/DATASET_NAME folder as a train.csv / train.json file and set: config.data.path=PATH_TO_THIS_REPO/data, config.data.dataset_name=DATASET_NAME, config.data.text_name=COLUMN_WITH_TEXT_OR_TOKENS_NAME, config.data.label_name=COLUMN_WITH_LABELS_OR_NER_TAGS_NAME
  3. (NER datasets only) Place train.txt, dev.txt, and test.txt files in data/DATASET_NAME and set the arguments as in option 2.
  4. (CLS datasets only) Place the data in data/DATASET_NAME with one folder per class, where each file in a folder contains a single text whose label is the folder name. For details, see the bbc_news dataset in ./data. Set the arguments as in options 2 and 3.
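As an illustration of the train.csv option, the sketch below writes a small classification dataset into data/DATASET_NAME and collects the matching settings in a dictionary. The column names `text` and `label`, the presence of a header row, and the helper names are assumptions made for the example; the dictionary simply mirrors the config keys listed above and says nothing about how the repository passes them to Hydra.

```python
import csv
from pathlib import Path


def write_train_csv(repo_root, dataset_name, rows,
                    text_name="text", label_name="label"):
    """Write (text, label) pairs to data/DATASET_NAME/train.csv."""
    dataset_dir = Path(repo_root) / "data" / dataset_name
    dataset_dir.mkdir(parents=True, exist_ok=True)
    path = dataset_dir / "train.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[text_name, label_name])
        writer.writeheader()  # assumed: columns are identified by a header row
        for text, label in rows:
            writer.writerow({text_name: text, label_name: label})
    return path


def data_settings(repo_root, dataset_name,
                  text_name="text", label_name="label"):
    """Collect the config values for a dataset stored in the data folder."""
    return {
        "config.data.path": str(Path(repo_root) / "data"),
        "config.data.dataset_name": dataset_name,
        "config.data.text_name": text_name,
        "config.data.label_name": label_name,
    }


# Hypothetical two-example dataset, written below the current directory.
example_rows = [("Stocks rallied on Friday.", "business"),
                ("The home team won 3-1.", "sport")]
csv_path = write_train_csv(".", "my_dataset", example_rows)
print(csv_path)
print(data_settings(".", "my_dataset"))
```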

Models

The current version of the repository supports any Hugging Face Transformers model that can be loaded with the AutoModelForSequenceClassification class (for CLS) or the AutoModelForTokenClassification class (for NER). For CNN-based / BiLSTM-CRF models, see the al_cls_cnn.yaml / al_ner_bilstm_crf.yaml configs in the ./configs folder for details.
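A quick way to check whether a checkpoint fits this requirement is to try loading it with the corresponding Auto class. The sketch below is a minimal illustration, not the repository's own loading code; `auto_class_name` and `load_model` are hypothetical helpers, and the task keys "cls" and "ner" are chosen here to match the abbreviations used in this README.

```python
def auto_class_name(task: str) -> str:
    """Map a task abbreviation to the Transformers Auto class name."""
    mapping = {
        "cls": "AutoModelForSequenceClassification",
        "ner": "AutoModelForTokenClassification",
    }
    if task not in mapping:
        raise ValueError(f"Unknown task {task!r}; expected 'cls' or 'ner'")
    return mapping[task]


def load_model(checkpoint: str, task: str, num_labels: int):
    """Load a checkpoint with the Auto class for the given task.

    Imports transformers lazily so the mapping above can be used without it.
    Loading downloads the weights unless they are already cached.
    """
    import transformers

    auto_class = getattr(transformers, auto_class_name(task))
    return auto_class.from_pretrained(checkpoint, num_labels=num_labels)


print(auto_class_name("cls"))  # → AutoModelForSequenceClassification
print(auto_class_name("ner"))  # → AutoModelForTokenClassification
```

For instance, `load_model("distilbert-base-uncased", "cls", num_labels=4)` would load a DistilBERT classifier with four output classes, the number of classes in AG-News. A checkpoint whose architecture has no head for the chosen task is expected to fail at this step.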

Citation

@inproceedings{tsvigun-etal-2022-plasm,
    title = "Towards Computationally Feasible Deep Active Learning",
    author = "Tsvigun, Akim  and
      Shelmanov, Artem  and
      Kuzmin, Gleb  and
      Sanochkin, Leonid  and
      Larionov, Daniil  and
      Gusev, Gleb  and
      Avetisian, Manvel  and
      Zhukov, Leonid",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.90",
    pages = "1198--1218",
}

License

© 2022 Autonomous Non-Profit Organization "Artificial Intelligence Research Institute" (AIRI). All rights reserved.

Licensed under the GNU GPLv3 License.