Home

Awesome

New Intent Discovery with Pre-training and Contrastive Learning (ACL main 2022, long paper)

This the official PyTorch implementation of paper MTP-CLNN, please check the paper for more details. New intent discovery aims to uncover novel intent categories from user utterances to expand the set of supported intent classes. It is a critical task for the development and service expansion of a practical dialogue system. Our method provide new solutions to two important research questions for new intent discovery: (1) how to learn semantic utterance representations and (2) how to better cluster utterances. Particularly, we first propose a multi-task pre-training (MTP) strategy to leverage rich unlabeled data along with external labeled data for representation learning. Then, we design a new contrastive loss to exploit self-supervisory signals in unlabeled data for clustering.

method

Data

We have already included the data folder in our code.

Requirements

This is the environment we have testified to be working. However, other versions might be working too.

Download External Pretrained Models

The external pretraining is conducted with IntentBert. You can also download the pretrained checkpoints from following link. And then put them into a folder pretrained_models in root directory. The link includes following models.

File Structure

Please organize the file structure like following:

MTP-CLNN
├── README.md  (this file)
├── data (each folder is a dataset)
    ├── banking
    ├── mcid
    ├── stackoverflow
    └── clinc
├── pretrained_models (external pretrained models)
    ├── banking
    ├── mcid
    └── stackoverflow
├── saved_models (save trained clnn models)
├── scripts (running scripts with hyper-parameters)
├── utils
    ├── contrastive.py (objective function)
    ├── memory.py (memory bank for loading neighbors)
    ├── neighbor_dataset.py
    └── tools.py
├── clnn.py
├── mtp.py
├── init_parameters.py (hyper-parameters)
├── model.py
└── dataloader.py

Run

Run MTP-CLNN on any dataset as following

bash scripts/clnn_${DATASET_NAME}.sh ${GPU_ID}

and please fill in the ${DATASET_NAME} with dataset name and ${GPU_ID} with GPU_ID you want to run on.

Citation

@inproceedings{zhang-etal-2022-new,
    title = "New Intent Discovery with Pre-training and Contrastive Learning",
    author = "Zhang, Yuwei  and
      Zhang, Haode  and
      Zhan, Li-Ming  and
      Wu, Xiao-Ming  and
      Lam, Albert",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.21",
    pages = "256--269"
}

Thanks

Some of the code was adapted from:

Contact

Yuwei Zhang zhangyuwei.work@gmail.com