<div align="center"> <h1> CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats </h1> </div>
First install radgraph-XL through the radgraph package:
pip install radgraph
There are two ways to access the radgraph-XL annotation from the CSV. First, using row index:
import json
import pandas as pd
# Load the CSV file into a DataFrame
df = pd.read_csv('df_chexpert_plus_240401.csv')
df = df[df['section_findings'].apply(lambda x: isinstance(x, str) and len(x.split()) >= 2)]
# Load radgraph-XL annotations
with open("radgraph-XL-annotations/section_findings.json") as f:
    annotations = json.load(f)
# Get the example at position 5 (the sixth row, since positions are zero-based)
index = 5
findings = df.iloc[index]["section_findings"]
annotation = annotations[index]
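Row-index access only makes sense if the filtered DataFrame and the annotation list have the same length and order. A minimal sketch of a guard for this, using toy in-memory data in place of the real files; `get_by_position` is a hypothetical helper (not part of the radgraph package), and the toy annotation text is made up:

```python
def get_by_position(findings_list, annotations, index):
    """Return (findings, annotation) at a position after basic sanity checks.

    Hypothetical helper for illustration; not part of radgraph.
    """
    # Positional lookup is only meaningful if both sequences line up one-to-one
    if len(findings_list) != len(annotations):
        raise ValueError(
            f"length mismatch: {len(findings_list)} findings vs "
            f"{len(annotations)} annotations"
        )
    if not 0 <= index < len(findings_list):
        raise IndexError(f"index {index} out of range")
    return findings_list[index], annotations[index]


# Toy stand-ins for df["section_findings"].tolist() and the loaded JSON list
toy_findings = ["No acute findings.", "Stable cardiomegaly."]
toy_annotations = [
    {"0": {"text": "No acute findings ."}},
    {"0": {"text": "Stable cardiomegaly ."}},
]

findings, annotation = get_by_position(toy_findings, toy_annotations, 1)
print(findings)
print(annotation["0"]["text"])
```

With the real data, the same call would take `df["section_findings"].tolist()` and `annotations` as its first two arguments.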
Or directly using the section content:
import json
import pandas as pd
from radgraph import utils
# Load the CSV file into a DataFrame
df = pd.read_csv('df_chexpert_plus_240401.csv')
# Load radgraph-XL annotations
with open("radgraph-XL-annotations/section_findings.json") as f:
    annotations = json.load(f)
# Index the annotations by their report text
annotations_dict = {a["0"]["text"]: a for a in annotations}
# Pick an example findings section (row 20)
findings = df.iloc[20]["section_findings"]
preprocessed_findings = utils.radgraph_xl_preprocess_report(findings)
# Retrieve the annotation by its preprocessed text
annotation = annotations_dict[preprocessed_findings]
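A plain dictionary lookup raises `KeyError` when the preprocessed text has no matching annotation, and rows with a missing findings section are read by pandas as NaN rather than as strings. A minimal sketch of a more forgiving lookup on toy data; `toy_preprocess` is a hypothetical stand-in for `utils.radgraph_xl_preprocess_report` (its actual normalization is not reproduced here), and `lookup_annotation` is likewise a name introduced only for illustration:

```python
def toy_preprocess(text):
    # Hypothetical stand-in for the real preprocessing: collapse whitespace only
    return " ".join(text.split())


def lookup_annotation(findings, annotations_dict, preprocess=toy_preprocess):
    """Return the annotation for a findings string, or None if there is none."""
    if not isinstance(findings, str):
        # Missing sections come back from pandas as NaN (a float), not a str
        return None
    return annotations_dict.get(preprocess(findings))


# Toy annotation list with the same {"0": {"text": ...}} shape used above
toy_annotations = [{"0": {"text": "Lungs are clear."}}]
annotations_dict = {a["0"]["text"]: a for a in toy_annotations}

hit = lookup_annotation("Lungs   are clear.", annotations_dict)
miss = lookup_annotation("Small left effusion.", annotations_dict)
nan_case = lookup_annotation(float("nan"), annotations_dict)
print(hit is not None, miss, nan_case)
```

With the real data, `utils.radgraph_xl_preprocess_report` would be passed as the `preprocess` argument.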
The JSON files contain the mapping between CheXpert images and the diseases extracted from the corresponding radiology report section.
import json
# Each line of the file holds one JSON record
with open("chexbert_labels/findings_fixed.json") as f:
    json_diseases = [json.loads(line) for line in f]
print(json.dumps(json_diseases[0], indent=4))
> {
    "path_to_image": "train/patient42142/study5/view1_frontal.jpg",
    "Enlarged Cardiomediastinum": null,
    "Cardiomegaly": null,
    "Lung Opacity": null,
    "Lung Lesion": null,
    "Edema": null,
    "Consolidation": null,
    "Pneumonia": null,
    "Atelectasis": null,
    "Pneumothorax": null,
    "Pleural Effusion": null,
    "Pleural Other": null,
    "Fracture": null,
    "Support Devices": null,
    "No Finding": 1.0
}
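The label values can be read with the usual CheXpert convention, which is assumed here rather than stated in this README: `1.0` positive, `0.0` negative, `-1.0` uncertain, and `null` not mentioned. A small sketch that groups one record's diseases by status; `summarize_labels` is a hypothetical helper and the record is a shortened copy of the example above:

```python
import json

# Assumed CheXpert-style encoding of label values
STATUS_BY_VALUE = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}


def summarize_labels(record):
    """Group disease names by status, skipping the image path."""
    summary = {"positive": [], "negative": [], "uncertain": [], "not_mentioned": []}
    for name, value in record.items():
        if name == "path_to_image":
            continue
        # None (JSON null) and NaN are not keys of the table, so they fall through
        status = STATUS_BY_VALUE.get(value, "not_mentioned")
        summary[status].append(name)
    return summary


line = (
    '{"path_to_image": "train/patient42142/study5/view1_frontal.jpg", '
    '"Cardiomegaly": null, "Edema": null, "No Finding": 1.0}'
)
record = json.loads(line)
summary = summarize_labels(record)
print(summary["positive"])
print(summary["not_mentioned"])
```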
You can merge these disease annotations with the main CSV:
import json
import pandas as pd
# Load the disease labels (one JSON record per line) and the main CSV
jsonl_df = pd.read_json('chexbert_labels/findings_fixed.json', lines=True)
csv_df = pd.read_csv('df_chexpert_plus_240401.csv')
# Merge both DataFrames on the 'path_to_image' column
merged_df = pd.merge(jsonl_df, csv_df, on='path_to_image')
# Keep only rows where 'section_findings' is not null
filtered_df = merged_df[merged_df['section_findings'].notna()]
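The default inner merge silently drops rows whose `path_to_image` appears in only one of the two files. A sketch, on toy DataFrames with made-up paths, of how an outer merge with `indicator=True` can expose unmatched rows before filtering:

```python
import pandas as pd

# Toy stand-ins for the label file and the main CSV (paths are made up)
labels_df = pd.DataFrame({
    "path_to_image": ["train/p1/study1/view1_frontal.jpg",
                      "train/p2/study1/view1_frontal.jpg"],
    "No Finding": [1.0, None],
})
reports_df = pd.DataFrame({
    "path_to_image": ["train/p1/study1/view1_frontal.jpg",
                      "train/p3/study1/view1_frontal.jpg"],
    "section_findings": ["Lungs are clear.", None],
})

# An outer merge keeps every row; the '_merge' column records where it came from
audit_df = pd.merge(labels_df, reports_df, on="path_to_image",
                    how="outer", indicator=True)
counts = audit_df["_merge"].value_counts().to_dict()
print(counts)

# Keep rows present in both inputs, then drop rows without a findings section
merged_df = audit_df[audit_df["_merge"] == "both"].drop(columns="_merge")
filtered_df = merged_df[merged_df["section_findings"].notna()]
print(len(filtered_df))
```

Non-zero `left_only` or `right_only` counts would indicate images that have labels but no report row, or the reverse.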
You can further create the mapping findings -> diseases:
disease_columns = [
    "Enlarged Cardiomediastinum", "Cardiomegaly", "Lung Opacity", "Lung Lesion",
    "Edema", "Consolidation", "Pneumonia", "Atelectasis", "Pneumothorax",
    "Pleural Effusion", "Pleural Other", "Fracture", "Support Devices", "No Finding"
]
# Create a dictionary where the key is 'section_findings' and the value maps each disease to its label
findings_to_diseases = {
    row['section_findings']: {disease: row[disease] for disease in disease_columns}
    for index, row in filtered_df.iterrows()
}
print(json.dumps(findings_to_diseases, indent=4))
> {
    "Unchanged right internal jugular venous catheter. Stable ....": {
        "Enlarged Cardiomediastinum": -1.0,
        "Cardiomegaly": NaN,
        "Lung Opacity": 1.0,
        "Lung Lesion": NaN,
        "Edema": 1.0,
        "Consolidation": NaN,
        "Pneumonia": -1.0,
        "Atelectasis": -1.0,
        "Pneumothorax": NaN,
        "Pleural Effusion": NaN,
        "Pleural Other": NaN,
        "Fracture": NaN,
        "Support Devices": 1.0,
        "No Finding": NaN
    },
    ...
}
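`json.dumps` writes missing labels as `NaN`, which is not valid JSON and is rejected by strict parsers. A minimal sketch, on a toy mapping, of converting NaN to `None` so that the output uses `null` instead; `nan_to_none` is a hypothetical helper and the label values are illustrative:

```python
import json
import math


def nan_to_none(labels):
    """Return a copy of a disease -> label dict with NaN replaced by None."""
    return {
        disease: None if isinstance(value, float) and math.isnan(value) else value
        for disease, value in labels.items()
    }


# Toy mapping in the same shape as findings_to_diseases
toy_mapping = {
    "Unchanged right internal jugular venous catheter.": {
        "Lung Opacity": 1.0,
        "Pneumonia": -1.0,
        "Fracture": float("nan"),
    }
}

clean_mapping = {text: nan_to_none(labels) for text, labels in toy_mapping.items()}
# allow_nan=False makes json.dumps raise ValueError if any NaN slipped through
encoded = json.dumps(clean_mapping, indent=4, allow_nan=False)
print(encoded)
```

The `isinstance(value, float)` check also covers values read from a pandas row, since `numpy.float64` is a subclass of Python's `float`.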
Model Zoo
| Type | Datasets | Model | Link | Tutorial |
|------|----------|-------|------|----------|
| RRG | MIMIC-CXR & CheXpert Plus | Swinv2/bert-decoder-2-layers | 🤗 | Doc |
| VQGAN | MIMIC-CXR & CheXpert Plus & PadChest & BIMCV & Candid-PTX | XrayVQGAN | 🤗 | Doc |
| DINOv2 | MIMIC-CXR & CheXpert Plus & PadChest & BIMCV & Candid-PTX | XrayDINOv2 | 🤗 | Doc |
| CLIP | MIMIC-CXR & CheXpert Plus & PadChest & BIMCV & Candid-PTX | XrayCLIP | 🤗 | Doc |
| LLaMA | - | RadLLaMA | 🤗 | Doc |
title={CheXpert Plus: Augmenting a Large Chest X-ray Dataset with Text Radiology Reports, Patient Demographics and Additional Image Formats},
author={Chambon, Pierre and Delbrouck, Jean-Benoit and Sounack, Thomas and Huang, Shih-Cheng and Chen, Zhihong and Varma, Maya and Truong, Steven QH and Chuong, Chu The and Langlotz, Curtis P},
journal={arXiv preprint arXiv:2405.19538},