CAMEL: Clinically Adapted Model Enhanced from LLaMA

<p align='center'> <img src="./resources/camel.png" width="400" height="400" center-align="true"> <div align="center"><b>CAMEL</b> from Bing Image Creator</div> </p>

License: MIT Python 3.9+ Code style: black

UPDATE: NEW MODEL ANNOUNCEMENT

We are proud to introduce Asclepius, a more advanced clinical large language model. As this model was trained on synthetic clinical notes, it is publicly accessible via Huggingface. If you are considering using CAMEL, we highly recommend switching to Asclepius instead. For more information, please visit this link.
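Because Asclepius is hosted on Hugging Face, it can be loaded with the standard transformers API. The snippet below is only a minimal sketch: the repository id shown is an assumption/placeholder, so please check the official model page linked from this announcement for the exact name.

```python
# Minimal sketch of loading Asclepius from the Hugging Face Hub.
# The repo id below is an ASSUMPTION -- verify it on the official model page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "starmpcc/Asclepius-7B"  # placeholder id; replace with the published one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Summarize the following clinical note:\n..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```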


Our Blog Post

Our Demo

<br/>

We present CAMEL, Clinically Adapted Model Enhanced from LLaMA. Using LLaMA as its foundation, CAMEL is further pre-trained on MIMIC-III and MIMIC-IV clinical notes and fine-tuned on clinical instructions (Figure 2). Our preliminary evaluation with GPT-4 assessment demonstrates that CAMEL achieves over 96% of the quality of OpenAI's GPT-3.5 (Figure 1). In accordance with the data usage policies of our source data, both our instruction dataset and model will be published on PhysioNet with credentialed access. To facilitate replication, we will also release all code, allowing individual healthcare institutions to reproduce our model using their own clinical notes. For further details, please refer to our blog post.

<p align='center'> <img src="./resources/performance.png" center-align="true" width="70%"> <div align="center">Figure 1. Performance Comparison</div> </p> <p align='center'> <img src="./resources/pipeline.jpg" center-align="true"> <div align="center">Figure 2. Model Pipeline</div> </p>
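For context on Figure 1: the comparison relies on a GPT-4 assessment of model outputs. The sketch below illustrates one way such a pairwise judgment can be scripted with the legacy openai (<1.0) ChatCompletion interface installed in the environment setup; the prompt wording and scoring scale are illustrative assumptions, not the exact protocol behind the figure.

```python
# Illustrative GPT-4-as-judge comparison; the prompt and 1-10 scale are
# assumptions for demonstration, not the exact protocol used for Figure 1.
import openai

openai.api_key = "YOUR_API_KEY"

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4 to score two candidate answers to the same clinical instruction."""
    prompt = (
        f"Question:\n{question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Rate each answer on a 1-10 scale for helpfulness and clinical accuracy, "
        "then briefly justify the scores. Reply as: 'A: <score>, B: <score> - <reason>'."
    )
    response = openai.ChatCompletion.create(  # legacy openai<1.0 interface
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```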

Reproducing Guide

Due to the licensing terms of the MIMIC and i2b2 datasets, we cannot publish the instruction dataset and checkpoints directly. We will publish our model and data via PhysioNet within a few weeks.

<details> <summary>Environment Setup</summary>
conda create -n camel python=3.9 -y
conda activate camel
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install pandarallel pandas jupyter numpy datasets sentencepiece openai fire
# Install transformers from the commit pinned by this repository
pip install git+https://github.com/huggingface/transformers.git@871598be552c38537bc047a409b4a6840ba1c1e4
</details> <details> <summary> Pretraining </summary> </details> <details> <summary>Instruction Finetuning</summary>
    $ torchrun --nproc_per_node=8 --master_port={YOUR_PORT} \
        src/instruction_ft.py \
        --model_name_or_path "decapoda-research/llama-7b-hf" \
        --data_path  {OUTPUT_FILE_FINAL} \
        --bf16 True \
        --output_dir ./checkpoints \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "epoch" \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --fsdp "full_shard auto_wrap" \
        --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
        --tf32 True \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --ddp_timeout 18000
</details> <details> <summary>Evaluation</summary> </details>
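Once instruction finetuning completes, the checkpoint saved under `--output_dir` can be loaded for generation. The sketch below is a minimal example, not the repository's evaluation script: the prompt template is an assumption, and if the tokenizer was not saved alongside the checkpoint it can be loaded from the base LLaMA model instead.

```python
# Minimal generation sketch using the finetuned checkpoint in ./checkpoints.
# The prompt template is an ASSUMPTION -- match it to your instruction data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "./checkpoints"  # --output_dir from the finetuning command

# If the tokenizer was not saved with the checkpoint, load it from the base model.
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir, torch_dtype=torch.bfloat16
).cuda()

prompt = (
    "### Instruction:\nList the discharge medications in the note below.\n\n"
    "### Note:\n...\n\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```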

Citation

@misc{CAMEL,
    title = {CAMEL: Clinically Adapted Model Enhanced from LLaMA},
    author = {Sunjun Kweon and Junu Kim and Seongsu Bae and Eunbyeol Cho and Sujeong Im and Jiyoun Kim and Gyubok Lee and JongHak Moon and JeongWoo Oh and Edward Choi},
    month = {May},
    year = {2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/starmpcc/CAMEL}},
}

Code References