Latent Meaning Cells
This is the main repository for the Latent Meaning Cells (LMC) model. The LMC is a latent variable model for jointly modeling words and document metadata.
This is the official PyTorch codebase for the paper "Zero-Shot Clinical Acronym Expansion via Latent Meaning Cells", presented at the Machine Learning for Health (ML4H) Workshop at NeurIPS 2020.
Notation
A word is the atomic unit of discrete data and represents an item from a fixed vocabulary.
A word is denoted as <img src="https://render.githubusercontent.com/render/math?math=w"> when representing a center word, and <img src="https://render.githubusercontent.com/render/math?math=c"> for a context word. <img src="https://render.githubusercontent.com/render/math?math=\boldsymbol{c}"> represents the set of context words relative to a center word <img src="https://render.githubusercontent.com/render/math?math=w">. In different contexts, each word operates as both a center word and a context word. For our purposes, metadata are pseudo-documents which contain a sequence of <img src="https://render.githubusercontent.com/render/math?math=N"> words denoted by <img src="https://render.githubusercontent.com/render/math?math=m = (w_1, w_2, ..., w_N)"> where <img src="https://render.githubusercontent.com/render/math?math=w_n"> is the <img src="https://render.githubusercontent.com/render/math?math=n^{th}"> word. A corpus is a collection of <img src="https://render.githubusercontent.com/render/math?math=K"> metadata denoted by <img src="https://render.githubusercontent.com/render/math?math=D = \{m_1,m_2,...,m_K\}">.
Overview of Pseudo-Generative Process
The pseudo-generative process of the LMC model is shown in plate notation and story form below:
Please refer to the paper for more information on the distributions and model parameters.
- K - represents the number of unique metadata in the corpus. This could be the number of unique section headers or even simply the number of documents in the corpus, depending on the modeling choice.
- m<sub>k</sub> - represents the k<sup>th</sup> metadata.
- N<sub>k</sub> - represents the number of unique tokens in the k<sup>th</sup> metadata. For our purposes, metadata are pseudo-documents which contain a sequence of words. For instance, if the metadata is the section header Discharge Medications, that metadata is comprised of the concatenation of the body of every section entitled Discharge Medications across the corpus. However, when computing context windows, we do not combine text from different physical documents.
- w<sub>ik</sub> - represents the i<sup>th</sup> center word belonging to the k<sup>th</sup> metadata.
- z<sub>ik</sub>|w<sub>ik</sub>,m<sub>k</sub> - represents the latent meaning cell (LMC) given the center word w<sub>ik</sub> and metadata m<sub>k</sub>.
- S - denotes the window size, i.e., the number of words drawn from the left and right side of the center word.
- c<sub>ijk</sub>|z<sub>ik</sub> - denotes the j<sup>th</sup> context word given the latent meaning of the i<sup>th</sup> center word in the k<sup>th</sup> metadata.
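Putting these definitions together, the pseudo-generative story can be sketched as follows. This is only a compact summary; the exact distributions, priors, and parameters are given in the paper.

```latex
% Sketch of the LMC pseudo-generative story implied by the definitions above
% (see the paper for the exact distributions and parameters).
\begin{align*}
&\text{for each metadata } m_k,\ k = 1, \dots, K:\\
&\quad \text{for each center word } w_{ik},\ i = 1, \dots, N_k:\\
&\quad\quad \text{draw a latent meaning cell } z_{ik} \sim p(z_{ik} \mid w_{ik}, m_k)\\
&\quad\quad \text{for each context position } j \text{ within the window of size } S:\\
&\quad\quad\quad \text{draw a context word } c_{ijk} \sim p(c_{ijk} \mid z_{ik})
\end{align*}
```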
This formulation allows for the latent meaning of a word to depend on the metadata (section header, paragraph id, etc.) in which it is found, and vice versa. For instance, the latent meaning of a sports article is not the same for all sports articles. Sports can refer to the NBA, the Olympics, or chess. Therefore, the concept of a sports article is refined by knowing the words used inside the article. Conversely, if you see the word net, its latent meaning will shift more to basketball than to fishing if you know that it is used within a sports article. The LMC models both phenomena. This notion is encapsulated in the below figure.
Contents
The repository contains the following modules:
- `acronyms` - Evaluation scripts for clinical acronym expansion. Also contains custom scripts for each model in `modules` to adapt each neural language model to the acronym expansion task.
- `modules` - LMC and baseline language model training scripts, set up to train on MIMIC-III clinical notes:
  - Latent Meaning Cells (LMC)
  - Bayesian Skip-Gram (BSG)
  - ELMo
- `preprocess` - Scripts to preprocess MIMIC-III notes (tokenize, extract section headers).
- `utils` - Utility function store.
- `weights` - Stores trained weights for language model pre-training as well as optional acronym expansion fine-tuning.
Quick Setup
- Clone this repository and place it in `~`.
- Run `pip install -r requirements.txt`.
- Note: packages are pinned to exact versions but may work with older versions. Compatibility is untested for versions not explicitly listed in `requirements.txt`.
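For example, a minimal setup sketch (the clone URL and target directory name are placeholders):

```bash
# Clone the repository into the home directory (substitute this repository's actual URL)
cd ~
git clone <repository-url> lmc
cd lmc

# Install the pinned dependencies
pip install -r requirements.txt
```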
Train Language Model on MIMIC-III
Downloading Data
- If you haven't already, please request access to MIMIC-III notes.
- Follow the instructions to download `NOTEEVENTS.csv` and place it under `preprocess/data/mimic/`.
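For example, assuming `NOTEEVENTS.csv` has already been downloaded to the repository root:

```bash
# Place the MIMIC-III notes where the preprocessing scripts expect them
mkdir -p preprocess/data/mimic
mv NOTEEVENTS.csv preprocess/data/mimic/
```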
Preprocessing
In `./preprocess`, please run the following scripts in order:
- `generate_mini_dataset.py` - Sample from the full dataset to create a mini development set (principally for debugging).
- `compute_sections.py` - Use a custom regex to precompute the names of all section headers in MIMIC-III.
- `mimic_tokenize.py` - Tokenize the data and save the output.
- `subsample_tokens.py` - Subsample frequent tokens to speed up training and increase the effective window size.
The output of these scripts is a series of data files:
- `./preprocess/data/mimic/NOTEEVENTS_token_counts.csv`
- `./preprocess/data/mimic/NOTEEVENTS_tokenized.csv`
- `./preprocess/data/mimic/section_freq.csv`
- `./preprocess/data/mimic/ids.npy`
- `./preprocess/data/mimic/vocab.pk`
NB:
- Please see the individual scripts for optional argument flags along with descriptions.
- We recommend running each of the above scripts with the optional `-debug` boolean flag, which performs all preprocessing on the mini version of the dataset created by `generate_mini_dataset.py` (see the example below).
- The last two files (`ids.npy` and `vocab.pk`) are essential for training the language model.
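For example, a sketch of running the pipeline on the mini dataset (check each script for its full set of flags and defaults):

```bash
cd preprocess

# Create the mini development set, then run the remaining steps on it via -debug
python generate_mini_dataset.py
python compute_sections.py -debug
python mimic_tokenize.py -debug
python subsample_tokens.py -debug
```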
Training LMC Model
In this section, we describe how to train the jointly contextualized token and document metadata embeddings, as described in the LMC paper.
In `./modules/lmc/`, please run the following training script:
- `lmc_main.py` - this script trains on the MIMIC-III data (preprocessed into `ids.npy` and `vocab.pk`) and serializes learned model weights to a corresponding directory in `./weights/lmc/{experiment_name}/`.
- Please see `lmc_main.py` for all command-line arguments along with descriptions.
- Please note that, at this point, the `-bert` flag is an experimental feature.
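For example, a minimal invocation sketch (all argument names, defaults, and optional flags are defined in `lmc_main.py` itself):

```bash
cd modules/lmc

# Train the LMC on the preprocessed MIMIC-III data (ids.npy and vocab.pk);
# learned weights are serialized to ./weights/lmc/{experiment_name}/
python lmc_main.py
```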
Training Baselines
Bayesian Skip-Gram Model (BSG)
As a primary baseline and source of great inspiration for the LMC, we provide our own PyTorch implementation of the BSG model:
- Original Theano source code
- Bražinskas, A., Havrylov, S., & Titov, I. (2017). Embedding words as distributions with a Bayesian skip-gram model. arXiv preprint arXiv:1711.11027.
The training procedure is identical to that of the original paper. However, the encoder architecture is different: we found better performance by encoding context sequences with a bi-LSTM and summarizing with a simple pooled attention mechanism. Please see the paper and code for more details.
In similar fashion to the LMC, to train the BSG embeddings, please run the following script in `./modules/bsg/`:
- `bsg_main.py`
NB:
In the LMC paper, we introduce a modification of the BSG, referred to as the MBSGE (Metadata Bayesian Skip-Gram Ensemble). In this variant, center word ids are randomly replaced with metadata ids. It is controlled by the boolean flag `-multi_bsg`, with the parameters `--multi_weights` controlling the categorical distribution that governs the relative frequency with which words, sections, and note categories are chosen as the pseudo-center word.
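For example, a sketch of both training modes (see `bsg_main.py` for the exact arguments, including the expected format of `--multi_weights`):

```bash
cd modules/bsg

# Train the standard BSG baseline
python bsg_main.py

# Train the MBSGE variant, in which center word ids are randomly replaced with metadata ids
python bsg_main.py -multi_bsg
```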
ELMo Baseline
We also provide a setup to enable training AllenNLP's ELMo model (from the seminal paper) on MIMIC-III.
We use the Transformer-based implementation of ELMo given its promising performance. There is a version mismatch between AllenNLP's import of Huggingface and the version of Huggingface used by our code. As such, we are currently working on a solution and will provide complete documentation for how to run it when available. For now, all ELMo-related code has been commented out.
Evaluating on Clinical Acronym Expansion
To evaluate pre-trained LMC, BSG, and ELMo models on the task of clinical acronym expansion, please refer to the README in the `acronyms` module.
The code is compatible with two acronym expansion datasets:
- CASI dataset - this labeled dataset from the University of Minnesota is a common benchmark for clinical acronym expansion and has been pre-processed to work with all models (included in `shared_data/casi` for convenience).
- MIMIC-III RS dataset - this is a new dataset that uses the same sense inventory as CASI. It is constructed synthetically using reverse substitution (RS). MIMIC requires a license, so please follow the instructions in `preprocess/context_extraction` to generate the dataset.
Each dataset is runnable with the same script by toggling the flag `--dataset {mimic,casi}`, as shown in the sketch below.
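For example (the entry-point script name is a placeholder; the `acronyms` README documents the actual command):

```bash
cd acronyms

# Evaluate on the CASI benchmark
python <evaluation_script>.py --dataset casi

# Evaluate on the MIMIC-III RS dataset (generate it first via preprocess/context_extraction)
python <evaluation_script>.py --dataset mimic
```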
Contact
Please raise a GitHub issue if you encounter any problems running the code. Please also feel free to open a pull request for bug fixes or feature requests.
If you want to discuss the paper and modeling approach more, please contact me at griffin.adams@columbia.edu.