COVID-QA

A collection of COVID-19 Q&A pairs and transformer baselines for evaluating question-answering models

Links

💾 Official Kaggle Dataset

💻 Official Github Repository

🔖 Alternate Download Link

Data summary

In addition, we included a clean, tabular version of 290k non-COVID Q&A pairs, queried from the same StackExchange communities. You can download it here.

Model summary

Are you releasing a new model for diagnosing COVID-19? Can we start using it for our projects?

The goal of COVID-QA is not to release novel models, but to provide a dataset for evaluating your own Q&A models, along with strong baselines that you can easily reproduce and improve. In fact, the datasets relate more closely to news, public health, and community discussions; they are not intended to be used in a clinical setting, and should not be used to influence clinical outcomes. Both the data and the models are there to help with your research projects or R&D prototypes. If you are planning to build and deploy any model or system that uses COVID-QA in some way, please ensure that it is sufficiently tested and validated by medical and public health experts. The content of this collection has not been medically validated.

How do the baseline models work?

To keep them accessible, we designed our baselines with the simplest Q&A mechanism available for transformer models: concatenate the question with the answer, and let the model learn to predict whether the pair is a correct match (label of 1) or an incorrect match (label of 0). Ideally, when trained correctly, we want our model to behave this way:
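The pairing scheme can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used to train the baselines: the `build_pairs` helper, the `[SEP]` separator string, the toy questions and answers, and the choice of drawing one random mismatched answer per question are all assumptions made for the example.

```python
import random

def build_pairs(qa_pairs, sep=" [SEP] ", seed=0):
    """Turn (question, answer) tuples into (text, label) examples.

    Hypothetical helper: the separator and the negative sampling strategy
    are assumptions, not the method used for the released baselines.
    """
    if len(qa_pairs) < 2:
        raise ValueError("need at least two pairs to sample a mismatched answer")
    rng = random.Random(seed)
    answers = [answer for _, answer in qa_pairs]
    examples = []
    for i, (question, answer) in enumerate(qa_pairs):
        # Correct match: the question concatenated with its own answer.
        examples.append((question + sep + answer, 1))
        # Incorrect match: the same question with another question's answer.
        j = rng.choice([k for k in range(len(answers)) if k != i])
        examples.append((question + sep + answers[j], 0))
    return examples

# Toy data, made up for this illustration only.
qa_pairs = [
    ("What is social distancing?", "Keeping physical space between people."),
    ("Why wash hands?", "Soap removes germs from the skin."),
    ("What is a quarantine?", "Staying apart after a possible exposure."),
]
examples = build_pairs(qa_pairs)
for text, label in examples:
    print(label, text)
```

Each question then contributes one example labelled 1 and one labelled 0, which is the binary target the sigmoid output is trained against.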

Why do we need this type of Q&A model?

The baselines do not auto-regressively generate an answer, so they are not generative models. Instead, they can tell you whether a question-answer pair is reasonable. This is useful when you have a new question (e.g. asked by a user) and a small set of candidate answers (pre-filtered from a database of reliable and verified answers), and your goal is either to select the best answer or to rerank those candidates in order of relevance. The latter is used by Neural Covidex, a search engine about COVID-19. Here's how you could visually think about it:
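As a rough sketch of the reranking use case, the snippet below scores every candidate answer against the question and sorts the candidates by score. The `rerank` and `toy_score` functions are hypothetical names introduced for this example; `toy_score` is a simple word-overlap stand-in so the snippet runs without any model, and in practice you would replace it with the match probability predicted by a trained baseline.

```python
def rerank(question, candidates, score_fn):
    """Return (score, answer) tuples sorted from most to least relevant."""
    scored = [(score_fn(question, answer), answer) for answer in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def toy_score(question, answer):
    """Stand-in scorer: fraction of question words that appear in the answer."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & a_words) / len(q_words)

# Toy question and candidates, made up for this illustration only.
question = "how does the virus spread"
candidates = [
    "wash your hands often",
    "the virus can spread through droplets",
    "masks reduce spread",
]
ranked = rerank(question, candidates, toy_score)
best_score, best_answer = ranked[0]
for score, answer in ranked:
    print(f"{score:.2f}  {answer}")
```

Selecting the best answer is just the first element of the ranked list, so both use cases share the same scoring step.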

Cite this work

We don't currently have a paper about this work. Feel free to link to this repository, or to the Kaggle dataset. Please reach out if you are interested in citing a technical report.

Data Usage

To load the data, simply download the data from Kaggle or from the alternative link. Then, use pandas to load it:

import pandas as pd

community = pd.read_csv("path/to/dataset/community.csv")
community.head()

Model Usage

Preliminary

First, make sure to download the model files from Kaggle or from the alternative link, and unzip the archive. Also, make sure to have the utils script in your current directory. For example:

wget https://github.com/xhlulu/covid-qa/releases/download/v1.0/electra-small-healthtap.zip
unzip electra-small-healthtap.zip

Then, make sure that transformers and tensorflow are correctly installed:

pip install "transformers>=2.8.0"
pip install "tensorflow>=2.1.0" # or pip install "tensorflow-gpu>=2.1.0"

Helper function

Then, define the following helper functions in your Python script:

import os
import pickle

import tensorflow as tf
import tensorflow.keras.layers as L
import transformers as trfm

def build_model(transformer, max_len=None):
    """
    https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
    """
    input_ids = L.Input(shape=(max_len, ), dtype=tf.int32)
    
    x = transformer(input_ids)[0]
    x = x[:, 0, :]
    x = L.Dense(1, activation='sigmoid', name='sigmoid')(x)
    
    # BUILD AND COMPILE MODEL
    model = tf.keras.Model(inputs=input_ids, outputs=x)
    model.compile(
        loss='binary_crossentropy', 
        metrics=['accuracy'], 
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5)
    )
    
    return model

def load_model(sigmoid_dir='transformer', transformer_dir='transformer', architecture="electra", max_len=None):
    """
    Special function to load a keras model that uses a transformer layer
    """
    sigmoid_path = os.path.join(sigmoid_dir,'sigmoid.pickle')
    
    if architecture == 'electra':
        transformer = trfm.TFElectraModel.from_pretrained(transformer_dir)
    else:
        transformer = trfm.TFAutoModel.from_pretrained(transformer_dir)
    model = build_model(transformer, max_len=max_len)
    
    # Load the pickled weights of the sigmoid layer and close the file afterwards
    with open(sigmoid_path, 'rb') as f:
        sigmoid = pickle.load(f)
    model.get_layer('sigmoid').set_weights(sigmoid)
    
    return model

Loading model

Then, you can load it as a tf.keras model:

model = load_model(
  sigmoid_dir='/path/to/sigmoid/dir/', 
  transformer_dir='/path/to/transformer/dir/'
)

Sometimes the sigmoid file is not stored in the same directory as the transformer files, so double-check both paths before loading.
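If you are unsure where the sigmoid file ended up after unzipping, a small search helper can locate it. The sketch below is only an illustration: `find_file_dir` is a hypothetical helper, and the directory layout in the demo is made up rather than the layout of the released archives. The file name `sigmoid.pickle` is the one expected by `load_model`.

```python
import os
import tempfile

def find_file_dir(root, filename="sigmoid.pickle"):
    """Return the first directory under `root` that contains `filename`, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            return dirpath
    return None

# Demo on a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "model", "transformer"))
    with open(os.path.join(root, "model", "sigmoid.pickle"), "wb") as f:
        f.write(b"")
    found = find_file_dir(root)
    found_name = os.path.basename(found)           # directory holding the file
    missing = find_file_dir(root, "missing.pickle")  # no such file in the tree

print(found_name, missing)
```

The returned directory can then be passed as `sigmoid_dir`, independently of `transformer_dir`.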

Loading tokenizer

The tokenizer is exactly the same as the original tokenizer that we loaded from the Hugging Face model repository. E.g.:

tokenizer = trfm.ElectraTokenizer.from_pretrained("google/electra-small-discriminator")

You can also load the fast tokenizer from Hugging Face's tokenizers library:

from tokenizers import BertWordPieceTokenizer
fast_tokenizer = BertWordPieceTokenizer('/path/to/model/vocab.txt', lowercase=True, add_special_tokens=True)

Here, add_special_tokens depends on whether or not you are adding the tags manually.

Then, you can use the following function to encode the questions and answers:

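import numpy as np
from tqdm import tqdm

def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512, enable_padding=False):
    """
    Encode the texts chunk by chunk and return their token ids as a numpy array.
    ---
    Inputs:
        texts: a pandas Series or numpy array of strings (it must support `.tolist()`)
        tokenizer: the `fast_tokenizer` that we imported from the tokenizers library
    """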
    tokenizer.enable_truncation(max_length=maxlen)
    if enable_padding:
        tokenizer.enable_padding(max_length=maxlen)
    
    all_ids = []
    
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])
    
    return np.array(all_ids)
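To make the effect of the truncation and padding options concrete, here is a small pure-Python sketch of what they do to a list of token ids. It does not use the tokenizers library: `pad_or_truncate` is a hypothetical helper, the ids in the demo are made up, and a padding id of 0 is an assumption that you should check against your model's vocabulary. The library's own truncation also takes care of special tokens, which this sketch does not attempt.

```python
def pad_or_truncate(ids, maxlen, pad_id=0):
    """Cut `ids` to at most `maxlen` entries, then right-pad with `pad_id`."""
    ids = list(ids)[:maxlen]
    return ids + [pad_id] * (maxlen - len(ids))

# Two encoded texts of different lengths (made-up ids).
batch = [[101, 2054, 102], [101, 2129, 2515, 2009, 2147, 102]]
fixed = [pad_or_truncate(ids, maxlen=4) for ids in batch]
print(fixed)
```

Without padding, texts of different lengths give rows of different lengths, which cannot be stacked into a rectangular array, so padding is typically enabled when the whole output is fed to the model at once.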

Advanced model usage

For more advanced and complete examples of using the models, please check out the model evaluation section.

Future work for ease of access

We hope to host the base model on the Hugging Face repository. Currently, we are facing problems with the sigmoid layer, which can't be easily added to the model. We will evaluate the next steps needed to make the model available.

We are also planning to provide a utils file that you can download from this repo, so you won't need to copy and paste the helper functions above.

Source Code and Kaggle Notebooks

For this project, our workflow mostly consisted of pipelines of Kaggle notebooks that first preprocess the data, then train a model, and finally evaluate it on each of the tasks we are proposing. To reproduce our results, simply click "Copy and Edit" on any of the notebooks below. If you are not familiar with Kaggle, check out this video.

For archival purposes, we also included all the notebooks inside this repository under notebooks.

Preprocessing

The following notebooks show how to preprocess the relevant datasets for training:

Since the StackExchange dataset consumed a lot of memory, we decided to create and save the encoded input of the training data in a separate notebook:

Model Training

Each of the 6 baselines was trained using a TPU notebook. You can find them here:

Model Validation

Acknowledgements

Thank you to: @JunhaoWang and @Makeshn for helping build the dataset from scratch; Akshatha, Ashita, Louis-Philippe, Jeremy, Joao, Joumana, Mirko, Siva for the helpful and insightful discussions.

Aggregated Results

Below are some aggregated results (macro-averaged across all sources) from the output of our evaluation notebooks. Please check them out for more complete metrics!
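For readers unfamiliar with macro-averaging, the sketch below shows the arithmetic: compute the metric separately for each source, then take the unweighted mean, so that every source counts equally regardless of its size. The `macro_average` helper is a hypothetical name; the three input values are the per-source Community-QA AP scores reported for electra_ht_small on this page. The sources entering the reported aggregates may differ from those listed per source, so treat this as an illustration of the computation only.

```python
def macro_average(per_source_scores):
    """Unweighted mean of a metric computed separately for each source."""
    values = list(per_source_scores.values())
    if not values:
        raise ValueError("need at least one source")
    return sum(values) / len(values)

# Per-source AP of electra_ht_small on Community-QA, as reported on this page.
community_ap = {"biomedical": 0.5851, "general": 0.571, "expert": 0.5265}
macro_ap = round(macro_average(community_ap), 4)
print(macro_ap)  # 0.5609
```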

Community-QA

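| metric | electra_ht_small | electra_ht_base | electra_se_small | electra_se_base |
|---|---|---|---|---|
| ap | 0.5609 | 0.6792 | 0.9429 | 0.9396 |
| roc_auc | 0.5898 | 0.7097 | 0.9559 | 0.9586 |
| f1_score | 0.6744 | 0.6817 | 0.8946 | 0.915 |
| accuracy | 0.5218 | 0.5374 | 0.891 | 0.912 |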

Multilingual-QA

| metric | mdistilbert_ht | mdistilbert_se |
|---|---|---|
| ap | 0.7635 | 0.5611 |
| roc_auc | 0.7709 | 0.5963 |
| f1_score | 0.7219 | 0.688 |
| accuracy | 0.6222 | 0.5501 |

News-QA

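| metric | electra_ht_small | electra_ht_base | electra_se_small | electra_se_base |
|---|---|---|---|---|
| ap | 0.9038 | 0.9273 | 0.6691 | 0.7553 |
| roc_auc | 0.9186 | 0.9327 | 0.7164 | 0.8053 |
| f1_score | 0.8433 | 0.8527 | 0.7113 | 0.7762 |
| accuracy | 0.842 | 0.8524 | 0.659 | 0.7266 |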

AP score by source

Below are the average precisions (AP) for each source, for every task.
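As a reminder of what the metric measures, here is a minimal pure-Python sketch of average precision for binary labels: rank the pairs by predicted score, and average the precision observed at the rank of each correct pair. The `average_precision` function is a hypothetical helper written for this illustration, with made-up labels and scores; it is not the implementation used in the evaluation notebooks, and library implementations may treat tied scores differently.

```python
def average_precision(labels, scores):
    """AP for binary labels: mean of precision@k over the ranks k of the positives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("average precision is undefined without positive labels")
    return sum(precisions) / len(precisions)

# Made-up example: positives are ranked 1st and 3rd out of four pairs.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
print(round(ap, 4))
```

A model that ranks every correct pair above every incorrect one reaches an AP of 1.0.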

Multilingual-QA

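| source | mdistilbert_ht | mdistilbert_se |
|---|---|---|
| chinese | 0.8075 | 0.5281 |
| english | 0.8191 | 0.6495 |
| korean | 0.5926 | 0.5091 |
| spanish | 0.7892 | 0.5546 |
| vietnamese | 0.6264 | 0.5994 |
| arabic | 0.7339 | 0.5669 |
| french | 0.8605 | 0.5876 |
| russian | 0.7951 | 0.4844 |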

News-QA

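| source | electra_ht_small | electra_ht_base | electra_se_small | electra_se_base |
|---|---|---|---|---|
| ABC Australia | 0.8968 | 0.886 | 0.6931 | 0.74 |
| ABC News | 0.8825 | 0.9334 | 0.6492 | 0.6274 |
| BBC | 0.8977 | 0.9259 | 0.7382 | 0.8679 |
| CNN | 0.9525 | 0.9436 | 0.7052 | 0.8598 |
| CTV | 0.8225 | 0.9339 | 0.7062 | 0.8579 |
| Forbes | 0.7534 | 0.8302 | 0.7077 | 0.7361 |
| LA Times | 0.875 | 0.95 | 0.7095 | 0.6458 |
| NDTV | 0.8675 | 0.8915 | 0.679 | 0.7449 |
| NPR | 0.972 | 0.9637 | 0.6752 | 0.8085 |
| NY Times | 0.9604 | 0.9455 | 0.6489 | 0.8077 |
| SCMP | 0.9415 | 0.9464 | 0.8155 | 0.8523 |
| The Australian | 0.8179 | 0.8 | 0.607 | 0.8556 |
| The Hill | 0.9377 | 0.9734 | 0.6382 | 0.7539 |
| Times Of India | 0.9869 | 0.9824 | 0.7823 | 0.7366 |
| USA Today | 0.8995 | 0.9391 | 0.7237 | 0.7689 |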

Community-QA

| source | electra_ht_small | electra_ht_base | electra_se_small | electra_se_base |
|---|---|---|---|---|
| biomedical | 0.5851 | 0.6902 | 0.9508 | 0.947 |
| general | 0.571 | 0.7097 | 0.9538 | 0.956 |
| expert | 0.5265 | 0.6233 | 0.8994 | 0.8858 |