BiMediX: Bilingual Medical Mixture of Experts LLM (EMNLP 2024 Findings)

<p align="center"> <img src="https://i.imgur.com/waxVImv.png" alt="Oryx Video-ChatGPT"> </p>

Sara Pieri*, Sahal Shaji Mullappilly*, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, and Hisham Cholakkal

* Equally contributing first authors

Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), UAE


📢 Latest Updates


:woman_health_worker: Overview

We introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic.
Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question answering.

Our models are available for download at the Project's HuggingFace Page.


🏆 Contributions

Our contributions are as follows:


⚡ Model

| Model Name | Link Download |
|---|---|
| BiMediX-Bilingual | HuggingFace |
| BiMediX-Arabic | HuggingFace |
| BiMediX-English | HuggingFace |

🔍 Data

The BiMed1.3M dataset, central to BiMediX's training, was meticulously compiled to include a wide range of medical interactions. The creation process involved generating multi-turn chat conversations using ChatGPT, based on publicly available medical MCQAs to simulate realistic doctor-patient dialogues. This dataset includes over 200,000 high-quality multi-turn medical dialogues, enriching the model's training material.
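As a rough illustration of the MCQA-to-dialogue step, the sketch below builds a generation prompt from one multiple-choice item. The record schema (`question`, `options`, `answer`), the prompt wording, and the name `build_dialogue_prompt` are assumptions made for illustration, not the prompts actually used to build BiMed1.3M. Sending the prompt to a chat model is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MCQAItem:
    """One multiple-choice item (hypothetical schema, not the dataset's)."""
    question: str
    options: Dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option


def build_dialogue_prompt(item: MCQAItem, turns: int = 3) -> str:
    """Return a prompt asking a chat model for a doctor-patient dialogue."""
    if item.answer not in item.options:
        raise ValueError(f"answer {item.answer!r} is not one of the options")
    option_lines = "\n".join(
        f"{letter}. {text}" for letter, text in sorted(item.options.items())
    )
    return (
        f"Write a {turns}-turn conversation between a patient and a doctor.\n"
        "Ground the medical content in this multiple-choice question:\n"
        f"{item.question}\n{option_lines}\n"
        f"Correct answer: {item.answer}. {item.options[item.answer]}\n"
        "The doctor should ask about symptoms and medical history "
        "before reaching a conclusion."
    )
```

Keeping the correct answer in the prompt is one way to anchor the generated dialogue to the source question; whether the original pipeline did this is not stated here.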

<p align="center"> <img src="images/data.gif" alt="data gif"> </p>

A semi-automated, iterative translation process was employed to create high-quality Arabic versions of the data, utilizing ChatGPT for initial translations and human professionals for refinement. This ensured the dataset's fidelity and relevance across both English and Arabic. Furthermore, we translated the English evaluation set to Arabic to evaluate the models. Through these meticulous data creation and processing efforts, BiMediX is able to excel in understanding and generating medical content across two languages.
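The shape of such a machine-then-human loop can be sketched as follows. This is only an outline under stated assumptions: `machine_translate`, `needs_review`, and `human_refine` are hypothetical callables supplied by the caller, and the review criterion and the `max_rounds` cap are placeholders, since the actual quality checks are not described in this README.

```python
from typing import Callable, Iterable, List


def translate_with_review(
    texts: Iterable[str],
    machine_translate: Callable[[str], str],
    needs_review: Callable[[str, str], bool],
    human_refine: Callable[[str, str], str],
    max_rounds: int = 2,
) -> List[str]:
    """Translate each text, refining flagged drafts up to max_rounds times."""
    results = []
    for source in texts:
        # Initial automatic translation (ChatGPT in the described pipeline).
        draft = machine_translate(source)
        rounds = 0
        # Hand flagged drafts to a human reviewer until they pass or the cap is hit.
        while rounds < max_rounds and needs_review(source, draft):
            draft = human_refine(source, draft)
            rounds += 1
        results.append(draft)
    return results
```

Passing the three steps in as functions keeps the loop independent of any particular translation service or review tool.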


💫 Qualitative Results

<div style="text-align:center;"> <img src="images/bilingual_conv-1.png" alt="Bilingual Conversation" style="height:300px; display:inline-block; margin: 0 auto;"> <img src="images/mcqa-1.png" alt="Multiple Choice Question Answering" style="height:250px; display:inline-block; margin: 0 auto;"> </div>

📊 Quantitative Results

The BiMediX model was evaluated across several benchmarks, demonstrating its effectiveness in medical language understanding and question answering in both English and Arabic.

Medical Benchmarks Used for Evaluation:

Bilingual Benchmark

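| Model | CKG | CBio | CMed | MedGen | ProMed | Ana | MedMCQA | MedQA | PubmedQA | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| Jais-30B | 57.4 | 55.2 | 46.2 | 55.0 | 46.0 | 48.9 | 40.2 | 31.0 | 75.5 | 50.6 |
| Mixtral-8x7B | 59.1 | 57.6 | 52.6 | 59.5 | 53.3 | 54.4 | 43.2 | 40.6 | 74.7 | 55.0 |
| BiMediX (Bilingual) | 70.6 | 72.2 | 59.3 | 74.0 | 64.2 | 59.6 | 55.8 | 54.0 | 78.6 | 65.4 |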

BiMediX shows superior performance in bilingual (Arabic-English) evaluations, outperforming both the Mixtral-8x7B base model and Jais-30B. Its average accuracy is roughly 10 and 15 points higher, respectively (65.4 vs. 55.0 and 50.6).
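The averages and gaps quoted for the bilingual benchmark can be rechecked from the per-benchmark scores. The snippet below is a small sanity check, not part of the project's evaluation code; the helper names `average_scores` and `gaps_over` are made up for this example, and it assumes the AVG column is an unweighted mean over the nine benchmarks.

```python
from statistics import mean

# Per-benchmark accuracies copied from the bilingual table, in column order:
# CKG, CBio, CMed, MedGen, ProMed, Ana, MedMCQA, MedQA, PubmedQA.
SCORES = {
    "Jais-30B": [57.4, 55.2, 46.2, 55.0, 46.0, 48.9, 40.2, 31.0, 75.5],
    "Mixtral-8x7B": [59.1, 57.6, 52.6, 59.5, 53.3, 54.4, 43.2, 40.6, 74.7],
    "BiMediX (Bilingual)": [70.6, 72.2, 59.3, 74.0, 64.2, 59.6, 55.8, 54.0, 78.6],
}


def average_scores(scores):
    """Unweighted mean per model, rounded to one decimal like the table."""
    return {model: round(mean(values), 1) for model, values in scores.items()}


def gaps_over(averages, reference):
    """Difference between the reference model's average and every other model's."""
    ref = averages[reference]
    return {
        model: round(ref - avg, 1)
        for model, avg in averages.items()
        if model != reference
    }


averages = average_scores(SCORES)
gaps = gaps_over(averages, "BiMediX (Bilingual)")
for model, gap in gaps.items():
    print(f"BiMediX (Bilingual) vs {model}: +{gap} points")
```

With the values above, the recomputed averages match the AVG column (50.6, 55.0, 65.4), giving gaps of 14.8 points over Jais-30B and 10.4 points over Mixtral-8x7B.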

Arabic Benchmark

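| Model | CKG | CBio | CMed | MedGen | ProMed | Ana | MedMCQA | MedQA | PubmedQA | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| Jais-30B | 52.1 | 50.7 | 40.5 | 49.0 | 39.3 | 43.0 | 37.0 | 28.8 | 74.6 | 46.1 |
| BiMediX (Arabic) | 60.0 | 54.9 | 55.5 | 58.0 | 58.1 | 49.6 | 46.0 | 40.2 | 76.6 | 55.4 |
| BiMediX (Bilingual) | 63.8 | 57.6 | 52.6 | 64.0 | 52.9 | 50.4 | 49.1 | 47.3 | 78.4 | 56.5 |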

In Arabic-specific evaluations, BiMediX outperforms Jais-30B in all categories, highlighting the effectiveness of the BiMed1.3M dataset and bilingual training.

English Benchmark

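| Model | CKG | CBio | CMed | MedGen | ProMed | Ana | MedMCQA | MedQA | PubmedQA | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| PMC-LLaMA-13B | 63.0 | 59.7 | 52.6 | 70.0 | 64.3 | 61.5 | 50.5 | 47.2 | 75.6 | 60.5 |
| Med42-70B | 75.9 | 84.0 | 69.9 | 83.0 | 78.7 | 64.4 | 61.9 | 61.3 | 77.2 | 72.9 |
| Clinical Camel-70B | 69.8 | 79.2 | 67.0 | 69.0 | 71.3 | 62.2 | 47.0 | 53.4 | 74.3 | 65.9 |
| Meditron-70B | 72.3 | 82.5 | 62.8 | 77.8 | 77.9 | 62.7 | 65.1 | 60.7 | 80.0 | 71.3 |
| BiMediX | 78.9 | 86.1 | 68.2 | 85.0 | 80.5 | 74.1 | 62.7 | 62.8 | 80.2 | 75.4 |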

BiMediX also excels on English medical benchmarks, surpassing other state-of-the-art models such as Med42-70B and Meditron-70B in average accuracy (75.4 vs. 72.9 and 71.3) and in efficiency.

These results underscore BiMediX's advanced capability in handling medical queries and its significant improvement over existing models in both languages, leveraging its unique bilingual dataset and training approach.


📜 License & Citation

BiMediX is released under the CC-BY-NC-SA 4.0 License. For more details, please refer to the LICENSE file included in this repository.

⚠️ Warning! This release, intended for research, is not ready for clinical or commercial use.

Users are urged to employ BiMediX responsibly, especially when applying its outputs in real-world medical scenarios. It is imperative to verify the model's advice with qualified healthcare professionals and not rely on it for medical diagnoses or treatment decisions. Despite its overall advancements, BiMediX shares common challenges with other language models, including hallucinations, toxicity, and stereotypes.
BiMediX's medical diagnoses and recommendations are not infallible.

If you use BiMediX in your research, please cite our work as follows:

@inproceedings{pieri-etal-2024-bimedix,
    title = "{B}i{M}edi{X}: Bilingual Medical Mixture of Experts {LLM}",
    author = "Pieri, Sara  and
      Mullappilly, Sahal Shaji  and
      Khan, Fahad Shahbaz  and
      Anwer, Rao Muhammad  and
      Khan, Salman  and
      Baldwin, Timothy  and
      Cholakkal, Hisham",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.989",
    pages = "16984--17002",
    abstract = "In this paper, we introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question answering. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations. We also introduce a comprehensive evaluation benchmark for Arabic medical LLMs. Furthermore, we introduce BiMed1.3M, an extensive Arabic-English bilingual instruction set that covers 1.3 Million diverse medical interactions, including 200k synthesized multi-turn doctor-patient chats, in a 1:2 Arabic-to-English ratio. Our model outperforms state-of-the-art Med42 and Meditron by average absolute gains of 2.5{\%} and 4.1{\%}, respectively, computed across multiple medical evaluation benchmarks in English, while operating at 8-times faster inference. Moreover, our BiMediX outperforms the generic Arabic-English bilingual LLM, Jais-30B, by average absolute gains of 10{\%} on our Arabic and 15{\%} on our bilingual evaluations across multiple datasets. Additionally, BiMediX exceeds the accuracy of GPT4 by 4.4{\%} in open-ended question UPHILL evaluation and largely outperforms state-of-the-art open source medical LLMs in human evaluations of multi-turn conversations. Our trained models, instruction set, and source code are available at https://github.com/mbzuai-oryx/BiMediX.",
}

🙏 Acknowledgements

We are thankful to Mistral AI for releasing their models and FastChat and Axolotl for their open-source code contributions.

This project is partially supported by a Google Research Award titled "A Climate Change and Sustainability Tailored Arabic LLM".


<img src="images/Oryx_logo.png" width="100" height="100"> <img src="images/MBZUAI_logo.png" width="360" height="85">