DDXPlus: A New Dataset For Automatic Medical Diagnosis

<img src="images/diagram.png" width="800">

Appearing in the NeurIPS 2022 Datasets and Benchmarks track.

We are releasing a new large-scale dataset for Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the medical domain under the CC-BY licence.

The dataset contains patients synthesized using a proprietary medical knowledge base and a commercial rule-based ASD system. Each patient is characterized by socio-demographic data, the pathology they are suffering from, a set of symptoms and antecedents related to this pathology, and a differential diagnosis. Symptoms and antecedents can be binary, categorical, or multi-choice, which could lead to more efficient and natural interactions between ASD/AD systems and patients.

To the best of our knowledge, this is the first large-scale dataset that includes both a differential diagnosis and non-binary symptoms and antecedents.

Availability

<!-- #FIXME: check the date and add link-->

Dataset documentation

In what follows, we use the term evidence as a general term to refer to a symptom or an antecedent. The dataset contains the following files:

Evidence description

Each evidence in the release_evidences.json file is described using the following entries:

Example

English

{
    "name": "E_130",
    "code_question": "E_129",
    "question_fr": "De quelle couleur sont les lésions?",
    "question_en": "What color is the rash?",
    "is_antecedent": false,
    "default_value": "V_11",
    "value_meaning": {
        "V_11": {"fr": "NA", "en": "NA"},
        "V_86": {"fr": "foncée", "en": "dark"},
        "V_107": {"fr": "jaune", "en": "yellow"},
        "V_138": {"fr": "pâle", "en": "pale"},
        "V_156": {"fr": "rose", "en": "pink"},
        "V_157": {"fr": "rouge", "en": "red"}
    },
    "possible-values": [
        "V_11",
        "V_86",
        "V_107",
        "V_138",
        "V_156",
        "V_157"
    ],
    "data_type": "C"
}

French

{
    "name": "lesions_peau_couleur",
    "code_question": "lesions_peau",
    "question_fr": "De quelle couleur sont les lésions?",
    "question_en": "What color is the rash?",
    "is_antecedent": false,
    "default_value": "NA",
    "value_meaning": {
        "NA": {"fr": "NA", "en": "NA"},
        "foncee": {"fr": "foncée", "en": "dark"},
        "jaune": {"fr": "jaune", "en": "yellow"},
        "pale": {"fr": "pâle", "en": "pale"},
        "rose": {"fr": "rose", "en": "pink"},
        "rouge": {"fr": "rouge","en": "red"}
    },
    "possible-values": [
        "NA",
        "foncee",
        "jaune",
        "pale",
        "rose",
        "rouge"
    ],
    "data_type": "C"
}
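
As a rough illustration of how such an entry might be used, the sketch below decodes a categorical value code into its label. The helper `describe_value` is hypothetical, and the entry is a trimmed inline copy of the English example above rather than data loaded from `release_evidences.json`, whose overall layout is not assumed here.

```python
import json

# Trimmed inline copy of the English example entry. In practice the entry
# would be read from release_evidences.json (file layout not assumed here).
EVIDENCE = json.loads("""
{
    "name": "E_130",
    "question_en": "What color is the rash?",
    "default_value": "V_11",
    "value_meaning": {
        "V_11": {"fr": "NA", "en": "NA"},
        "V_86": {"fr": "foncée", "en": "dark"},
        "V_157": {"fr": "rouge", "en": "red"}
    },
    "possible-values": ["V_11", "V_86", "V_157"],
    "data_type": "C"
}
""")


def describe_value(evidence, value_code, lang="en"):
    """Hypothetical helper: map a value code to its label in `lang`."""
    if value_code not in evidence["possible-values"]:
        raise ValueError(
            f"{value_code!r} is not a possible value of {evidence['name']}"
        )
    return evidence["value_meaning"][value_code][lang]


print(describe_value(EVIDENCE, "V_157"))
print(describe_value(EVIDENCE, EVIDENCE["default_value"]))
```

Rejecting codes that are not listed under `possible-values` is a design choice of this sketch, not a documented requirement of the dataset.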

Pathology description

The file release_conditions.json contains information about the pathologies that patients in the dataset may suffer from. Each pathology has the following attributes:

Example

English

{
    "condition_name": "Myasthenia gravis",
    "cond-name-fr": "Myasthénie grave",
    "cond-name-eng": "Myasthenia gravis",
    "icd10-id": "G70.0",
    "symptoms": {
        "E_65": {},
        "E_63": {},
        "E_52": {},
        "E_172": {},
        "E_84": {},
        "E_66": {},
        "E_90": {},
        "E_38": {},
        "E_176": {}
    },
    "antecedents": {
        "E_28": {},
        "E_204": {}
    },
    "severity": 3
}

French

{
    "condition_name": "Myasthénie grave",
    "cond-name-fr": "Myasthénie grave",
    "cond-name-eng": "Myasthenia gravis",
    "icd10-id": "G70.0",
    "symptoms": {
        "dysphagie": {},
        "dysarthrie": {},
        "diplopie": {},
        "ptose": {},
        "faiblesse_msmi": {},
        "dyspn": {},
        "fatigabilité_msk": {},
        "claud_mâchoire": {},
        "rds_paralys_gen": {}
    },
    "antecedents": {
        "atcdfam_mg": {},
        "trav1": {}
    },
    "severity": 3
}
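
One possible use of these entries is to build a reverse lookup from evidence codes to the pathologies that list them. The sketch below is only illustrative: `evidence_index` is a hypothetical helper, the data is a trimmed inline copy of the English example, and the assumption that the file maps each condition name to its entry is ours.

```python
# Trimmed inline copy of the English example. That release_conditions.json
# maps each condition name to such an entry is an assumption of this sketch.
CONDITIONS = {
    "Myasthenia gravis": {
        "condition_name": "Myasthenia gravis",
        "icd10-id": "G70.0",
        "symptoms": {"E_65": {}, "E_63": {}, "E_52": {}},
        "antecedents": {"E_28": {}, "E_204": {}},
        "severity": 3,
    }
}


def evidence_index(conditions):
    """Hypothetical helper: map each evidence code to the conditions listing it."""
    index = {}
    for name, condition in conditions.items():
        # Symptoms and antecedents are both evidences, so index them together.
        for field in ("symptoms", "antecedents"):
            for code in condition[field]:
                index.setdefault(code, set()).add(name)
    return index


index = evidence_index(CONDITIONS)
print(index["E_204"])
```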

Patient description

Each patient in each of the 3 sets has the following attributes:

Example

English

{
    "AGE": 18,
    "DIFFERENTIAL_DIAGNOSIS": [["Bronchitis", 0.19171203430383882], ["Pneumonia", 0.17579340398940366], ["URTI", 0.1607809719801254], ["Bronchiectasis", 0.12429044460990353], ["Tuberculosis", 0.11367177304035844], ["Influenza", 0.11057936110639896], ["HIV (initial infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]],
    "SEX": "M",
    "PATHOLOGY": "URTI",
    "EVIDENCES": ["E_48", "E_50", "E_53", "E_54_@_V_161", "E_54_@_V_183", "E_55_@_V_89", "E_55_@_V_108", "E_55_@_V_167", "E_56_@_4", "E_57_@_V_123", "E_58_@_3", "E_59_@_3", "E_77", "E_79", "E_91", "E_97", "E_201", "E_204_@_V_10", "E_222"],
    "INITIAL_EVIDENCE": "E_91"
}

French

{
    "AGE": 18, 
    "DIFFERENTIAL_DIAGNOSIS": [["Bronchite", 0.19171203430383882], ["Pneumonie", 0.17579340398940366],["IVRS ou virémie", 0.1607809719801254], ["Bronchiectasies", 0.12429044460990353], ["Tuberculose", 0.11367177304035844], ["Possible influenza ou syndrome virémique typique", 0.11057936110639896], ["VIH (Primo-infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]], 
    "SEX": "M", 
    "PATHOLOGY": "IVRS ou virémie", 
    "EVIDENCES": ["crowd", "diaph", "douleurxx", "douleurxx_carac_@_sensible", "douleurxx_carac_@_une_lourdeur_ou_serrement", "douleurxx_endroitducorps_@_front", "douleurxx_endroitducorps_@_joue_D_", "douleurxx_endroitducorps_@_tempe_G_", "douleurxx_intens_@_4", "douleurxx_irrad_@_nulle_part", "douleurxx_precis_@_3", "douleurxx_soudain_@_3", "expecto", "f17.210", "fievre", "gorge_dlr", "toux", "trav1_@_N", "z77.22"], 
    "INITIAL_EVIDENCE": "fievre"
}
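
In the examples above, some `EVIDENCES` tokens contain the separator `_@_` followed by a value, while others are bare evidence names. A minimal sketch of splitting and grouping such tokens follows; `parse_evidence` and `group_evidences` are hypothetical helpers, and the reading that bare tokens are evidences without an attached value is inferred from the examples only.

```python
SEPARATOR = "_@_"  # separator observed in the example tokens


def parse_evidence(token):
    """Split a token into (evidence name, value); value is None if absent."""
    name, found, value = token.partition(SEPARATOR)
    return name, (value if found else None)


def group_evidences(tokens):
    """Collect all values reported for each evidence name, in input order."""
    grouped = {}
    for token in tokens:
        name, value = parse_evidence(token)
        values = grouped.setdefault(name, [])
        if value is not None:
            values.append(value)
    return grouped


# A few tokens taken from the English patient example.
tokens = ["E_48", "E_54_@_V_161", "E_54_@_V_183", "E_56_@_4", "E_204_@_V_10"]
print(group_evidences(tokens))
```

Values are kept as strings here (for example `"4"`), since the sketch does not assume how numeric values should be typed.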

Dataset statistics

Pathology statistics

<img src="images/global_patho_hist.png" height="300">

Socio-demographic statistics

<img src="images/global_demos_hist.png" height="150">

Distribution of the evidence types

| | Binary | Categorical | Multi-choice | Total |
|---|---|---|---|---|
| Evidences | 208 | 10 | 5 | 223 |
| Symptoms | 96 | 9 | 5 | 110 |
| Antecedents | 112 | 1 | 0 | 113 |

Number of evidences of the synthesized patients

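| | Avg | Std dev | Min | 1st quartile | Median | 3rd quartile | Max |
|---|---|---|---|---|---|---|---|
| Evidences | 13.56 | 5.06 | 1 | 10 | 13 | 17 | 36 |
| Symptoms | 10.07 | 4.69 | 1 | 8 | 10 | 12 | 25 |
| Antecedents | 3.49 | 2.23 | 0 | 2 | 3 | 5 | 12 |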

Differential diagnosis statistics

<img src="images/global_length_and_rank_hist.png" height="150">
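
For a single patient, quantities of this kind (the length of the differential and the rank of the patient's pathology within it) could be computed roughly as sketched below. `differential_length_and_rank` is a hypothetical helper, the patient is the English example from the patient description, and whether the published statistics were computed exactly this way is not something this sketch claims.

```python
# English example patient, reduced to the two fields used here.
PATIENT = {
    "PATHOLOGY": "URTI",
    "DIFFERENTIAL_DIAGNOSIS": [
        ["Bronchitis", 0.19171203430383882],
        ["Pneumonia", 0.17579340398940366],
        ["URTI", 0.1607809719801254],
        ["Bronchiectasis", 0.12429044460990353],
        ["Tuberculosis", 0.11367177304035844],
        ["Influenza", 0.11057936110639896],
        ["HIV (initial infection)", 0.07333003867293564],
        ["Chagas", 0.04984197229703562],
    ],
}


def differential_length_and_rank(patient):
    """Return (differential length, 1-based rank of PATHOLOGY or None)."""
    # Sort by probability, highest first, in case the list is not pre-sorted.
    ranked = sorted(
        patient["DIFFERENTIAL_DIAGNOSIS"], key=lambda pair: pair[1], reverse=True
    )
    names = [name for name, _ in ranked]
    pathology = patient["PATHOLOGY"]
    rank = names.index(pathology) + 1 if pathology in names else None
    return len(ranked), rank


print(differential_length_and_rank(PATIENT))  # (8, 3)
```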

Experiments

Code for reproducing results in the paper can be found in code.

In our paper, we report results for two methods: AARLC, an RL-based method, and BASD, a supervised method adapted from ASD. For instructions on how to run them, see here for AARLC and here for BASD.