CSpider: A Large-Scale Chinese Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

CSpider is a large-scale Chinese dataset for complex and cross-domain semantic parsing and the text-to-SQL task (natural language interfaces for relational databases). It was released with our EMNLP 2019 paper: A Pilot Study for Chinese SQL Semantic Parsing. This repo contains the code for evaluation, preprocessing, and all baselines used in our paper. Please refer to the task site for a more general introduction and the leaderboard.

Changelog

Citation

When you use the CSpider dataset, we would appreciate it if you cite the following:

@inproceedings{min2019pilot,
  title={A Pilot Study for Chinese SQL Semantic Parsing},
  author={Min, Qingkai and Shi, Yuefeng and Zhang, Yue},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={3643--3649},
  year={2019}
}

Our dataset is based on Spider; please cite it as well.

Baseline models

Environment Setup

  1. The code uses Python 2.7 and PyTorch 0.2.0 (GPU); we will update Python and PyTorch soon.
  2. Install PyTorch via conda: conda install pytorch=0.2.0 -c pytorch
  3. Install the Python dependencies: pip install -r requirements.txt (see the combined sketch below)
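
For reference, here is a minimal setup sketch. The environment name cspider is an assumption; any Python 2.7 environment works:

# create and activate a Python 2.7 conda environment (the name is arbitrary)
conda create -n cspider python=2.7
source activate cspider

# install PyTorch 0.2.0 and the remaining Python dependencies
conda install pytorch=0.2.0 -c pytorch
pip install -r requirements.txt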

Prepare Data, Embeddings, and Pretrained Models

  1. Download the data, embeddings, and database.
  2. (Optional) Download the pretrained GloVe embeddings and put them at chisp/embedding/glove.%dB.%dd.txt
  3. Generate training files for each module: python preprocess_data.py -s char|word (see the example below)
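
For example, to generate the char-based or word-based training files (the -s flag selects the segmentation level, as above):

# character-based preprocessing
python preprocess_data.py -s char

# word-based preprocessing
python preprocess_data.py -s word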

Folder/File Description

Training

Run train_all.sh to train all the modules. It looks like:

python train.py \
    --data_root       path/to/char/or/word/based/generated_data \
    --save_dir        path/to/save/trained/module \
    --train_component <module_name> \
    --emb_path        path/to/embeddings \
    --col_emb_path    path/to/corresponding/embeddings/for/column
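
As a concrete but hypothetical example, training one module on char-based data could look like the following; the directory layout, embedding file names, and the module name col (which follows SyntaxSQLNet's component naming) are assumptions, not fixed by this repo:

# hypothetical invocation: train the column-prediction module on char-based data
# (paths and the module name "col" are assumptions; adjust them to your layout)
python train.py \
    --data_root       generated_data/char \
    --save_dir        saved_models/char \
    --train_component col \
    --emb_path        embedding/char_emb.txt \
    --col_emb_path    embedding/glove.840B.300d.txt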

Testing

Run test_gen.sh to generate SQL queries. test_gen.sh looks like:

python test.py \
    --test_data_path  path/to/char/or/word/based/raw/dev/or/test/data \
    --models          path/to/trained/module \
    --output_path     path/to/print/generated/SQL \
    --emb_path        path/to/embeddings \
    --col_emb_path    path/to/corresponding/embeddings/for/column
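
Again a hypothetical example, decoding the dev set with the char-based modules trained above (all paths are assumptions):

# hypothetical invocation: generate SQL for the dev set with char-based modules
# (all paths are assumptions; adjust them to your layout)
python test.py \
    --test_data_path  data/dev.json \
    --models          saved_models/char \
    --output_path     predicted_sql.txt \
    --emb_path        embedding/char_emb.txt \
    --col_emb_path    embedding/glove.840B.300d.txt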

Evaluation

Run evaluation.sh to evaluate generated SQL queries. evaluation.sh looks like:

python evaluation.py \
    --gold            path/to/gold/dev/or/test/queries \
    --pred            path/to/predicted/dev/or/test/queries \
    --etype           evaluation/metric \
    --db              path/to/database \
    --table           path/to/tables

evaluation.py follows the general evaluation process from the Spider GitHub page.
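
A hypothetical invocation, continuing the example above; the file names are assumptions, and match for --etype follows Spider's convention (match, exec, or all):

# hypothetical invocation: exact set match evaluation on the dev set
# (file names are assumptions; "match" follows Spider's --etype convention)
python evaluation.py \
    --gold   data/dev_gold.sql \
    --pred   predicted_sql.txt \
    --etype  match \
    --db     database/ \
    --table  data/tables.json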

Acknowledgement

The implementation is based on SyntaxSQLNet; please cite it as well if you use this code.