Home

Awesome

EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records

Overview

EHRSQL is a large-scale, high-quality dataset designed for text-to-SQL question answering on Electronic Health Records from MIMIC-III and eICU. The dataset includes questions collected from 222 hospital staff, such as physicians, nurses, insurance reviewers, and health records teams. It can be used to test three aspects of QA models: generating a wide range of SQL queries asked in the hospital workplace, understanding various types of time expressions (absolute, relative, or both), and the capability to abstain from answering (querying the database) when the model's prediction is not confident.

The dataset is released along with our paper titled EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records (NeurIPS 2022 Datasets and Benchmarks). For further details, please refer to our paper.

News

06/22/2024 The complete data for the newest EHRSQL data (MIMIC-IV demo) has been released along with the EHRSQL 2024 Shared Task Overview Paper.

04/27/2024 We have released the entire EHRSQL dataset (including test sets) to the public.

01/29/2024 EHRSQL for MIMIC-IV is being used as one of the shared tasks at NAACL 2024 - ClinicalNLP 2024. For more information, please visit https://sites.google.com/view/ehrsql-2024.

09/21/2022 EHRSQL has been accepted to NeurIPS 2022 Datasets and Benchmarks (3/163)!

Getting Started

Requirments and Installation

git clone https://github.com/glee4810/EHRSQL.git
cd EHRSQL
conda create -n ehrsql python=3.7
conda activate ehrsql
pip install pandas==1.5.3
pip install dask
pip install scikit-learn
pip install func-timeout
pip install transformers==4.19.2 # 4.29.2 works too
pip install sentencepiece
pip install wandb # if needed

Dataset

Question and SQL

The train.json file contains the following fields for each database:

  {
    "db_id": "mimic_iii", 
    "question": "what is the ingesting method of methimazole?", 
    "template": "what is the intake method of methimazole?", 
    "query": "select distinct prescriptions.route from prescriptions where prescriptions.drug = 'methimazole'", 
    "value": {"drug_name": "methimazole"},
    "q_tag": "what is the intake method of {drug_name}?", 
    "t_tag": ["", "", "", "", ""], 
    "o_tag": ["", "", "", "", "", "", "", "", ""], 
    "tag": "what is the intake method of {drug_name}?",
    "department": "['nursing']",
    "importance": "medium", 
    "para_type": "machine", 
    "is_impossible": false,
    "split": "train", 
    "id": "75379177b6a56fb54e946591"
  }

In valid.json, answerable instances have the same structure as train.json. However, unanswerable instances have fewer fields.

 {
    "db_id": "mimic_iii",
    "question": "tell me what medicine to use to relieve a headache in hypertensive patients.",
    "query": "nan",
    "department": "['nursing']",
    "para_type": "human",
    "is_impossible": true,
    "split": "valid",
    "id": "9db3a82be08e143d7976b015"
}

Tables

We follow the same table information style used in Spider. tables.json contains the following information for both databases:

{
    "column_names": [
      [
        0,
        "row id"
      ],
      [
        0,
        "subject id"
      ],
      [
        0,
        "gender"
      ],
      [
        0,
        "dob"
      ],
      ...
    ],
    "column_names_original": [
      [
        0,
        "ROW_ID"
      ],
      [
        0,
        "SUBJECT_ID"
      ],
      [
        0,
        "GENDER"
      ],
      [
        0,
        "DOB"
      ],
      ...
    ],
    "column_types": [
      "number",
      "number",
      "text",
      "time",
      ...
    ],
    "db_id": "mimic_iii",
    "foreign_keys": [
      [
        7,
        2
      ],
      ...
    ],
    "primary_keys": [
      1,
      5,
      ...
    ],
    "table_names": [
      "patients",
      "admissions",
      ...
    ],
    "table_names_original": [
      "PATIENTS",
      "ADMISSIONS",
      ...
    ]
  }

Database

To access the databases, PhysioNet’s credentialed access (see license) is needed. Below are the links to the download pages.

Once completed, run the code below to preprocess the database. This step involves patient sampling, further de-identification, and time-shifting, and more.

cd preprocess
python3 preprocess_db.py --data_dir <path_to_mimic_iii_csv_files> --db_name mimic_iii --deid --timeshift --current_time "2105-12-31 23:59:00" --start_year 2100 --time_span 5 --cur_patient_ratio 0.1

T5 SQL Generation

To train T5-base models, run the code below.

python T5/main.py --config T5/config/ehrsql/training/ehrsql_mimic3_t5_base.yaml --CUDA_VISIBLE_DEVICES <gpu_id>

To generate SQL queries with abstention, run the code below.

python T5/main.py --config T5/config/ehrsql/eval/ehrsql_mimic3_t5_base__mimic3_valid.yaml --output_file prediction_raw.json --CUDA_VISIBLE_DEVICES <gpu_id>
python T5/abstain_with_entropy.py --inference_result_path outputs/eval_ehrsql_mimic3_t5_base__mimic3_valid --input_file prediction_raw.json --output_file prediction.json --threshold 0.14923561

Evaluation

To evaluate the generated SQL queries, run the code below.

python evaluate.py --db_path ./dataset/ehrsql/mimic_iii/mimic_iii.sqlite --data_file dataset/ehrsql/mimic_iii/valid.json --pred_file ./outputs/eval_ehrsql_mimic3_t5_base__mimic3_valid/prediction.json

Have Questions?

Ask us questions on our Github issues page or contact gyubok.lee@kaist.ac.kr.

Citation

When you use the EHRSQL dataset, we would appreciate it if you cite the following:

@article{lee2022ehrsql,
  title={EHRSQL: A Practical Text-to-SQL Benchmark for Electronic Health Records},
  author={Lee, Gyubok and Hwang, Hyeonji and Bae, Seongsu and Kwon, Yeonsu and Shin, Woncheol and Yang, Seongjun and Seo, Minjoon and Kim, Jong-Yeup and Choi, Edward},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={15589--15601},
  year={2022}
}