DeepIPW

1. Introduction

This repository contains the source code and data description for the paper "A deep learning framework for drug repurposing via emulating clinical trials on real world patient data" (accepted by Nature Machine Intelligence).

In this paper, we present an efficient and easily-customized framework for generating and testing multiple candidates for drug repurposing using a retrospective analysis of real world data (RWD). <img src="img/flowchart.png" width="60%"/>

Building upon well-established causal inference and deep learning methods, our framework emulates randomized clinical trials for drugs present in a large-scale medical claims database. <img src="img/LSTM.png" width="60%"/>

We demonstrate our framework on a coronary artery disease (CAD) cohort of millions of patients. We successfully identify drugs and drug combinations that significantly improve CAD outcomes but have not been indicated for treating CAD, paving the way for drug repurposing.

2. System requirements

OS: Ubuntu 16.04

GPU: NVIDIA 1080ti (11GB memory) is the minimum requirement. We recommend NVIDIA TITAN RTX 6000 GPUs.

3. Dependencies

Python 3.6
Pytorch 1.2.0
Scipy 1.3.1
Numpy 1.17.2
Scikit-learn 0.22.2

4. Preprocessing data

Dataset

The real world patient data used in this paper is MarketScan claims data. Interested parties may contact IBM to obtain access to the data at this link.

Data flow chart

The data flow chart of MarketScan claims data. <img src="img/MarketScan_DataFlow.png" width="70%"/>

Source: 2012 MarketScan® CCAE MDCR User Guide

Data files used

Inpatient Admissions (I) : Admission summary records
Outpatient Services (O): Individual outpatient claim records
Outpatient Pharmaceutical Claims (D): Individual outpatient prescription drug claim records
Population (P): Summarizes demographic information about the eligible population

Input data demo

A demo of the input data can be found in the data folder, which provides the data structures and a synthetic demo of the inputs. Before running the preprocessing code, make sure the input data format matches the provided input demo.

Cohort

The data structure for the cohort table is as follows:

Column Name	Description	Note
ENROLID	Patient enroll ID	Unique identifier for each patient
Index_date	The date of first CAD encounter	i.e., min(ADMDATE [1st CAD admission date for the inpatient records], SVCDATE [1st CAD service date for the outpatient records])
DTSTART	Date of insurance enrollment start	M/D/Y, e.g., 03/25/2732
DTEND	Date of insurance enrollment end	M/D/Y, e.g., 03/25/2732
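
The Index_date rule above (the earlier of the first CAD admission date and the first CAD service date) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the dates are assumed to be M/D/Y strings as in the table, and the enrollment check is an illustrative helper rather than a documented step of the preprocessing code.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # M/D/Y, as in the tables of this README


def parse_date(text):
    """Parse an M/D/Y string such as '03/25/2732' into a date."""
    return datetime.strptime(text, DATE_FORMAT).date()


def index_date(inpatient_cad_dates, outpatient_cad_dates):
    """Earliest CAD encounter over inpatient ADMDATE and outpatient SVCDATE values."""
    dates = [parse_date(d) for d in list(inpatient_cad_dates) + list(outpatient_cad_dates)]
    return min(dates) if dates else None


def is_enrolled_on(day, dtstart, dtend):
    """Hypothetical helper: True if `day` lies within [DTSTART, DTEND]."""
    return parse_date(dtstart) <= day <= parse_date(dtend)


# Illustrative values only, not taken from the data set.
first_cad = index_date(["04/02/2732"], ["03/25/2732", "05/01/2732"])
print(first_cad)  # 2732-03-25
```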

Drug table

The data structure for the drug table is as follows:

Column Name	Description	Note
ENROLID	Patient enroll ID	Unique identifier for each patient
NDCNUM	National drug code (NDC)	We map NDC to Observational Medical Outcomes Partnership (OMOP) ingredient concept ID, and obtain 1,353 unique drugs
SVCDATE	Date to take the prescription	M/D/Y, e.g., 03/25/2732
DAYSUPP	Days supply. The number of days of drug therapy covered by this prescription	Day, e.g., 28
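
As an illustration of how these columns fit together, the sketch below groups prescription records by patient and drug code and derives the end of each supply period from SVCDATE and DAYSUPP. It assumes a CSV file with a header row, uses made-up sample values, and keys drugs by NDCNUM because the NDC-to-OMOP mapping is not reproduced here. The structure actually written by `pre_drug.py` may differ.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta

# Made-up records in the column layout described above (CSV is an assumption).
SAMPLE = """ENROLID,NDCNUM,SVCDATE,DAYSUPP
1001,00000000001,04/22/2732,28
1001,00000000001,03/25/2732,28
1002,00000000002,01/05/2732,30
"""


def load_prescriptions(handle):
    """Return {ENROLID: {NDCNUM: [(start_date, days_supply), ...]}}, sorted by date."""
    taken = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle):
        start = datetime.strptime(row["SVCDATE"], "%m/%d/%Y").date()
        taken[row["ENROLID"]][row["NDCNUM"]].append((start, int(row["DAYSUPP"])))
    for drugs in taken.values():
        for records in drugs.values():
            records.sort()
    return taken


def supply_end(start, days_supply):
    """First day that is no longer covered by the prescription (exclusive end)."""
    return start + timedelta(days=days_supply)


prescriptions = load_prescriptions(io.StringIO(SAMPLE))
first_start, first_days = prescriptions["1001"]["00000000001"][0]
print(first_start, supply_end(first_start, first_days))  # 2732-03-25 2732-04-22
```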

Inpatient table

The data structure for the inpatient table is as follows:

Column Name	Description	Note
ENROLID	Patient enroll ID	Unique identifier for each patient
DX1-DX15	Diagnosis codes. International Classification of Diseases (ICD) codes	57,089 ICD-9/10 codes considered in the dataset. Dictionary for ICD-9 and ICD-10 codes.
DXVER	Flag to denote ICD-9/10 codes	“9” = ICD-9-CM and “0” = ICD-10-CM
ADMDATE	Admission date for this inpatient visit	M/D/Y, e.g., 03/25/2732
Days	The number of days stay in the inpatient hospital	Day, e.g., 28
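
Reading the diagnosis columns together with the DXVER flag could look like the sketch below. It assumes each record is available as a dictionary keyed by column name. `diagnosis_codes` is a hypothetical helper, and the example codes are illustrative. For the outpatient table, which has DX1-DX4, the same function can be called with `n_columns=4`.

```python
ICD_VERSION = {"9": "ICD-9-CM", "0": "ICD-10-CM"}  # DXVER flag values from the table


def diagnosis_codes(row, n_columns=15):
    """Collect the non-empty DX1..DXn codes of one record, tagged with the ICD version."""
    version = ICD_VERSION.get(str(row.get("DXVER", "")).strip())
    codes = []
    for i in range(1, n_columns + 1):
        code = str(row.get(f"DX{i}", "") or "").strip()
        if code:
            codes.append((version, code))
    return codes


# Illustrative inpatient record; the values are not taken from the data set.
row = {"ENROLID": "1001", "DXVER": "9", "DX1": "41401", "DX2": "", "DX3": "4019",
       "ADMDATE": "03/25/2732", "Days": "3"}
print(diagnosis_codes(row))  # [('ICD-9-CM', '41401'), ('ICD-9-CM', '4019')]
```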

Outpatient table

The data structure for the outpatient table is as follows:

Column Name	Description	Note
ENROLID	Patient enroll ID	Unique identifier for each patient
DX1-DX4	Diagnosis codes. International Classification of Diseases (ICD) codes	57,089 ICD-9/10 codes considered in the dataset. Dictionary for ICD-9 and ICD-10 codes.
DXVER	Flag to denote ICD-9/10 codes	“9” = ICD-9-CM and “0” = ICD-10-CM
SVCDATE	Service date for this outpatient visit	M/D/Y, e.g., 03/25/2732

Demographics

The data structure for the demographics table is as follows:

Column Name	Description	Note
ENROLID	Patient enroll ID	Unique identifier for each patient
DOBYR	Birth year	Year, e.g., 2099
SEX	Gender	1 = male; 2 = female

Preprocess drug tables

cd preprocess
python pre_drug.py --input_data_dir ../data/synthetic/drug --output_data_dir 'pickles/cad_prescription_taken_by_patient.pkl'

Preprocess patient cohort

# Note: these parameter values are only a demo. They can be easily adjusted for different application scenarios.
cd preprocess
python run_preprocess.py --min_patients 10 --min_prescription 2 --followup 60 --time_interval 240 --baseline 10 --input_data ../data/synthetic --save_cohort_all save_cohort_all/

Parameters

--min_patients, minimum number of patients for each cohort.
--min_prescription, minimum number of prescriptions for each drug.
--time_interval, minimum time interval between every two prescriptions.
--followup, number of days of followup period.
--baseline, number of days of baseline period.
--input_pickles, data pickles.
--save_cohort_all, save path.
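
To make the `--baseline` and `--followup` parameters above more concrete, the sketch below turns them into date windows around a reference date. This is one possible reading and not the repository's implementation: which date serves as the reference, whether the windows include their end points, and whether enrollment must span both windows are all assumptions. The function names are hypothetical, and `--min_prescription` and `--time_interval` are not modeled.

```python
from datetime import date, timedelta


def observation_windows(reference_day, baseline_days, followup_days):
    """Return ((baseline_start, baseline_end), (followup_start, followup_end))."""
    baseline = (reference_day - timedelta(days=baseline_days), reference_day)
    followup = (reference_day, reference_day + timedelta(days=followup_days))
    return baseline, followup


def enrollment_covers_windows(dtstart, dtend, reference_day, baseline_days, followup_days):
    """Assumed eligibility check: enrollment spans both observation windows."""
    (baseline_start, _), (_, followup_end) = observation_windows(
        reference_day, baseline_days, followup_days
    )
    return dtstart <= baseline_start and followup_end <= dtend


# Values from the demo command: --baseline 10 --followup 60
baseline, followup = observation_windows(date(2732, 3, 25), 10, 60)
print(baseline[0], followup[1])  # 2732-03-15 2732-05-24
```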

5. DeepIPW model

Bash command

bash run_lstm.sh

Python command

cd deep-ipw
python main.py

Parameters

--data_dir, input cohort data
--pickles_dir, pickles file.
--treated_drug_file, the drug currently being evaluated.
--controlled_drug, sampled control drugs (random sampling or by ATC class).
--controlled_drug_ratio, ratio of the number of control drugs.
--input_pickles, data pickles.
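
This README does not spell out the estimator behind the model. The name DeepIPW suggests inverse probability weighting (IPW), so the sketch below shows the generic textbook form of IPW: each patient is weighted by the inverse of the probability of the treatment actually received, and the weighted outcome means of the two groups are compared. It is not the code in `main.py`. The function names are hypothetical, and the propensity scores, which a model would normally predict, are passed in as plain numbers.

```python
def ipw_weights(treated, propensity, clip=1e-6):
    """Weight 1/e(x) for treated patients and 1/(1 - e(x)) for control patients."""
    weights = []
    for t, e in zip(treated, propensity):
        e = min(max(e, clip), 1.0 - clip)  # keep scores away from 0 and 1
        weights.append(1.0 / e if t == 1 else 1.0 / (1.0 - e))
    return weights


def weighted_outcome_difference(treated, outcome, weights):
    """Difference of the weighted mean outcome between treated and control groups."""
    def weighted_mean(group):
        num = sum(w * y for t, y, w in zip(treated, outcome, weights) if t == group)
        den = sum(w for t, w in zip(treated, weights) if t == group)
        return num / den
    return weighted_mean(1) - weighted_mean(0)


# Toy example: 1 = treated, 0 = control; outcome 1 = event observed.
treated = [1, 1, 0, 0]
propensity = [0.5, 0.5, 0.5, 0.5]
outcome = [1, 0, 0, 0]
weights = ipw_weights(treated, propensity)
effect = weighted_outcome_difference(treated, outcome, weights)
print(weights, effect)  # [2.0, 2.0, 2.0, 2.0] 0.5
```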