Awesome
CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL
CHASE is a large-scale and pragmatic Chinese dataset for cross-database context-dependent text-to-SQL task (natural language interfaces for relational databases). It is released along with our ACL 2021 paper: CHASE: A Large-Scale and Pragmatic Chinese Dataset for Cross-Database Context-Dependent Text-to-SQL. This repo contains our dataset CHASE.
Citation
Data Content and Format
Question, SQL, and Parsed SQL
Each file intrain.json
and dev.json
contains the following fields:
database_id
: the database id to which this interaction is addressed.interaction
: the query interaction including multiple DB query questions. For each question in the interaction, it includes:utterance
: the natural language questionutterance_toks
: the natural language question tokensquery
: the SQL query corresponding to the question.sql
: parsed results of this SQL query usingprocess_sql.py
. Please refer to the Spider Github page for the detailed documentation.
{
"database_id": "party_host",
"interaction": [
{
"utterance": "主办方都有谁?",
"utterance_toks": [
"主",
"办",
"方",
...
"?"
],
"query": "select 姓名 from 主办方",
"sql": {
"except": null,
"from": {
"conds": [],
"table_units": [
[
"table_unit",
1
]
]
},
...
"where": []
}
},
{
"utterance": "他们来自哪些不同的国家?",
"utterance_toks": [
"他",
"们",
...
"?"
],
"query": "select distinct 国籍 from 主办方",
"sql": {
"except": null,
"from": {
"conds": [],
"table_units": [
[
"table_unit",
1
]
]
},
...
"where": []
}
},
{
"utterance": "每个国家有多少个主办方?",
"utterance_toks": [
"每",
"个",
"国",
"家",
...
"?"
],
"query": "select 国籍 , count(*) from 主办方 group by 国籍",
"sql": {
"except": null,
"from": {
"conds": [],
"table_units": [
[
"table_unit",
1
]
]
},
...
"where": []
}
}
]
}
Tables
tables.json
contains the following information for each database:
db_id
: database idtable_names_original
: original table names stored in the database.table_names
: cleaned and normalized table names. We make sure the table names are meaningful. [to be changed]column_names_original
: original column names stored in the database. Each column looks like:[0, "派对主题"]
.0
is the index of table names intable_names
, which is"派对"
in this case."派对主题"
is the column name.column_names
: cleaned and normalized column names. We make sure the column names are meaningful. [to be changed]column_types
: data type of each columnforeign_keys
: foreign keys in the database.[11, 7]
means column indices in thecolumn_names
. These two columns are foreign keys of two different tables.primary_keys
: primary keys in the database. Each number is the index ofcolumn_names
.
{
"db_id": "party_host",
"table_names_original": [
"派对",
"主办方",
"派对主办方"
],
"table_names": [
"派对",
"主办方",
"派对主办方"
],
"column_names_original": [
[
-1,
"*"
],
[
0,
"派对"
],
[
0,
"派对主题"
],
[
0,
"地点"
],
...
],
"column_names": [
[
-1,
"*"
],
[
0,
"派对"
],
[
0,
"派对主题"
],
[
0,
"地点"
],
...
],
"column_types": [
"text",
"number",
"text",
"text",
...
],
"foreign_keys": [
[
11,
1
],
[
12,
7
]
],
"primary_keys": [
1,
7,
11
]
}