Resources for our KDD 2023 paper: "QUERT: Continual Pre-training of Language Model for Query Understanding in Travel Domain Search".
TODO
- Release downstream tasks
- Check the early version code
Dependencies
pytorch==1.8.0
transformers==4.18.0
datasets
For convenience, you can also run pip3 install -r requirements.txt
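Note that on PyPI the PyTorch package is installed as torch rather than pytorch, so an equivalent one-line install (adjust the CUDA build for your environment) would look roughly like:
pip3 install torch==1.8.0 transformers==4.18.0 datasets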
Code Structure Explanation
config.json
- Configuration for the model
data_collator.py
- Training data batch generation
model.py
- Model structure
data_preprocess.py
- Corpus preprocessing code
run.py
- Model training code
train.sh
- Shell script that launches training
Parameter Explanation
geohash_open
- Whether to enable the geohash pre-training task; default is true
nsp_open
- Whether to enable the NSP task (query [SEP] item); default is false
cl_open
- Whether to enable contrastive learning (i.e., the UCBL task); default is true
order_open
- Whether to enable the order prediction task; default is true
geo_aware_open
- Whether to enable the Geo-MP task; default is true. If disabled, it degrades to the original MLM
mlm_open
- Whether to enable the MLM task; default is true
hard_negative
- Whether to enable hard negative sample learning for contrastive learning (data needs to be prepared in advance); default is false
triplet_open
- Whether to use triplet loss (under this setting, contrastive learning needs to be turned off); default is false
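A minimal sketch, not QUERT's actual implementation, of how these switches could gate the pre-training objectives when the per-task losses are combined; the flag names mirror the list above and the loss values are placeholders:

# Minimal sketch (assumption, not the repository's code): combining per-task losses
# according to the boolean switches described above.
def combine_losses(losses, mlm_open=True, cl_open=True, geohash_open=True,
                   order_open=True, nsp_open=False, triplet_open=False):
    """losses: dict mapping task name -> loss value (e.g. torch tensors)."""
    enabled = {
        "mlm": mlm_open,          # MLM / Geo-MP masked prediction
        "cl": cl_open,            # UCBL contrastive learning
        "geohash": geohash_open,  # geohash code prediction
        "order": order_open,      # token/phrase order prediction
        "nsp": nsp_open,          # query [SEP] item prediction
        "triplet": triplet_open,  # triplet loss, used instead of cl
    }
    return sum(loss for name, loss in losses.items() if enabled.get(name, False))

# Example: combine_losses({"mlm": 2.1, "cl": 0.7, "order": 1.3}) -> 4.1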
Data Processing
First, run the data_preprocess.py file.
The input file format is [query, item, geo_in_query, geo_in_item, query_tag (tokenized query), query_json, geo_hash, similar_query], with fields separated by \t and one data sample per line.
Example:
318川藏线全包 定制旅行西藏川藏线318自驾稻城亚丁包车拼车定制旅游 none 稻城亚丁 318川藏线;全包 [{"term": "318川藏线", "first_cate": "division2poi", "second_cate": "division2poi", "codes": ""}, {"term": "全包", "first_cate": "abstract", "second_cate": "abstract", "codes": ""}] w j p 3 f h 西藏
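For reference, a minimal parsing sketch (an illustrative assumption, not data_preprocess.py itself) that reads this tab-separated corpus into named fields:

import json

FIELDS = ["query", "item", "geo_in_query", "geo_in_item",
          "query_tag", "query_json", "geo_hash", "similar_query"]

def read_corpus(path):
    # One sample per line, eight \t-separated fields in the order listed above.
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
            record["query_json"] = json.loads(record["query_json"])  # list of term dicts
            yield record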
Run
export TRAIN_FILE=path/to/your/training_file.txt
export VALIDATION_FILE=path/to/your/validation_file.txt
export TRAIN_SINGLE_REF_FILE=path/to/your/training_single_reference_file.txt
export TRAIN_COVER_REF_FILE=path/to/your/training_cover_reference_file.txt
export TRAIN_ORDER_REF_FILE=path/to/your/training_order_reference_file.txt
export TRAIN_GEOHASH_FILE=path/to/your/training_geohash_reference_file.txt
export VALIDATION_SINGLE_REF_FILE=path/to/your/validation_single_reference_file.txt
export VALIDATION_COVER_REF_FILE=path/to/your/validation_cover_reference_file.txt
export VALIDATION_ORDER_REF_FILE=path/to/your/validation_order_reference_file.txt
export VALIDATION_GEOHASH_FILE=path/to/your/validation_geohash_reference_file.txt
export OUTPUT_DIR=path/to/your/output_directory
python run.py \
--model_name_or_path bert-base-chinese \
--tokens_order_size 64 \
--phrases_order_size 64 \
--cl_open True \
--geohash_open True \
--order_open True \
--mlm_open True \
--nsp_open False \
--geo_aware_open True \
--hard_negative False \
--triplet_open False \
--pooler_type "cls" \
--temp 0.1 \
--overwrite_output_dir \
--pos_shuffle_probability 0.15 \
--train_file $TRAIN_FILE \
--validation_file $VALIDATION_FILE \
--train_single_geo_ref_file $TRAIN_SINGLE_REF_FILE \
--train_cover_geo_ref_file $TRAIN_COVER_REF_FILE \
--train_order_ref_file $TRAIN_ORDER_REF_FILE \
--train_geohash_file $TRAIN_GEOHASH_FILE \
--val_single_geo_ref_file $VALIDATION_SINGLE_REF_FILE \
--val_cover_geo_ref_file $VALIDATION_COVER_REF_FILE \
--val_order_ref_file $VALIDATION_ORDER_REF_FILE \
--val_geohash_file $VALIDATION_GEOHASH_FILE \
--do_train \
--do_eval \
--output_dir $OUTPUT_DIR \
--eval_steps 5000 \
--save_total_limit 2 \
--evaluation_strategy steps \
--load_best_model_at_end True \
--num_train_epochs 10 \
--save_steps 5000 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 32
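After training, the checkpoint written to $OUTPUT_DIR can be loaded like any Hugging Face checkpoint for downstream query understanding, assuming the saved weights remain BERT-compatible (a rough sketch; attach whatever task head you need):

from transformers import AutoTokenizer, AutoModel

# Load the continually pre-trained checkpoint saved by run.py.
tokenizer = AutoTokenizer.from_pretrained("path/to/your/output_directory")
model = AutoModel.from_pretrained("path/to/your/output_directory")

inputs = tokenizer("318川藏线全包", return_tensors="pt")
outputs = model(**inputs)  # last_hidden_state feeds a downstream head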
Citation
If our paper or related resources prove valuable to your research, we kindly ask for citation. Please feel free to contact us with any inquiries.
@inproceedings{10.1145/3580305.3599891,
author = {Xie, Jian and Liang, Yidan and Liu, Jingping and Xiao, Yanghua and Wu, Baohua and Ni, Shenghua},
title = {QUERT: Continual Pre-Training of Language Model for Query Understanding in Travel Domain Search},
year = {2023},
isbn = {9798400701030},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3580305.3599891},
doi = {10.1145/3580305.3599891},
booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {5282–5291},
numpages = {10},
keywords = {query understanding, continual pre-training, travel domain search},
location = {Long Beach, CA, USA},
series = {KDD '23}
}
Questions
If you have any questions, please feel free to contact Jian Xie at jianx0321@gmail.com. You can also open an issue directly.