# SymptomCheck Bench
SymptomCheck Bench is an OSCE-style benchmark designed to evaluate the diagnostic accuracy of Large Language Model (LLM)-based medical agents in symptom assessment conversations.

The benchmark simulates medical consultations through a structured four-step process (sketched in code after the list):
- Initialization: Selection of a clinical vignette
- Dialogue: Simulated conversation between an LLM agent and patient
- Diagnosis: Generation of top 5 differential diagnoses
- Evaluation: Automated assessment of diagnostic accuracy
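At a high level, each experiment run iterates this loop over the selected vignettes. The sketch below is illustrative only: the function names, signatures, and types are placeholders, not the repository's actual API.

```python
from typing import Callable, Sequence

def run_experiment(
    vignettes: Sequence[dict],
    run_dialogue: Callable[[dict], list[str]],     # doctor LLM questions the simulated patient
    diagnose: Callable[[list[str]], list[str]],    # top-5 differential from the transcript
    evaluate: Callable[[dict, list[str]], float],  # position of the correct diagnosis, if found
) -> list[float]:
    """Illustrative four-step loop; placeholder names, not the real API."""
    positions = []
    for vignette in vignettes:                     # 1. Initialization: select a clinical vignette
        transcript = run_dialogue(vignette)        # 2. Dialogue: simulated doctor-patient conversation
        differential = diagnose(transcript)        # 3. Diagnosis: top 5 differential diagnoses
        positions.append(evaluate(vignette, differential))  # 4. Evaluation: score diagnostic accuracy
    return positions
```

The actual entry point is `medask/benchmark/main.py`, shown in the usage examples below.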
For more details about the benchmark, methodology, and results, read our blog post: https://medask.tech/blogs/introducing-symptomcheck-bench/
## Installation

```bash
conda create -n benchmark python=3.12
conda activate benchmark
pip install -r requirements/development.txt
pip install -e .
export KEY_OPENAI="sk-..."  # Set your API key in an ENV variable.
```
## Usage

### Running the script

Inspecting the available script options.

```bash
# Run without any arguments to see all available options.
python medask/benchmark/main.py

# Output
usage: main.py [-h] [--doctor_llm DOCTOR_LLM] [--patient_llm PATIENT_LLM]
               [--evaluator_llm EVALUATOR_LLM] --file {avey,agentclinic}
               [--num_vignettes NUM_VIGNETTES] [--num_experiments NUM_EXPERIMENTS]
               [--comment COMMENT] [--result_name_suffix RESULT_NAME_SUFFIX]
main.py: error: the following arguments are required: --file
```
Running a complex example.

Starting the script...

```bash
# Run the benchmark on 3 random vignettes from Avey. Repeat the experiment 2 times.
# Use gpt-4o for the LLM simulating the doctor.
# By default, gpt-4o-mini is used for the LLM simulating the patient.
# By default, gpt-4o is used to evaluate the correctness of the resulting diagnoses.
python medask/benchmark/main.py --file=avey --doctor_llm=gpt-4o --num_vignettes=3 --num_experiments=2
```

... generates the following output.

```
2024-10-28 21:52:59,084 - [INFO] - benchmark - Running experiment over vignettes [68, 166, 334]
2024-10-28 21:53:24,619 - [INFO] - benchmark - Dumping results to medask/benchmark/results/2024-10-28T21:52:59_gpt-4o_3.json
position=1 Nephrolithiasis : [Kidney stones, Ureteral obstruction, Urinary tract infection, Pyelonephritis, Renal colic]
position=1 Measles : [Measles, Viral exanthem, Roseola, Rubella, Scarlet fever]
position=1 Cerebral Stroke : [Ischemic stroke, Hemorrhagic stroke, Transient ischemic attack (TIA), Bell's palsy, Intracranial hemorrhage]
2024-10-28 21:53:27,234 - [INFO] - benchmark.evaluate - Results of run i=0
positions=[1.0, 1.0, 1.0]
Number of correct diagnoses: 3 / 3
Average position of correct diagnosis: 1.0
position=1 Nephrolithiasis : [Kidney stones, Ureteral stones, Renal colic, Hematuria, Urinary tract obstruction]
position=1 Measles : [Measles, Viral exanthem, Rubella, Scarlet fever, Roseola]
position=1 Cerebral Stroke : [Acute Ischemic Stroke, Transient Ischemic Attack, Hemorrhagic Stroke, Bell's Palsy, Migraine with Aura]
2024-10-28 21:53:30,660 - [INFO] - benchmark.evaluate - Results of run i=1
positions=[1.0, 1.0, 1.0]
Number of correct diagnoses: 3 / 3
Average position of correct diagnosis: 1.0
```
Understanding the output.

```
# You can see the indices of the vignettes that are used in the experiment.
2024-10-28 21:52:59,084 - [INFO] - benchmark - Running experiment over vignettes [68, 166, 334]

# All the data about the experiment is saved to disk, in this case to medask/benchmark/results/2024-10-28T21:52:59_gpt-4o_3.json
2024-10-28 21:53:24,619 - [INFO] - benchmark - Dumping results to medask/benchmark/results/2024-10-28T21:52:59_gpt-4o_3.json

position=1 Nephrolithiasis : [Kidney stones, Ureteral obstruction, Urinary tract infection, Pyelonephritis, Renal colic]
# This shows the evaluation of the first diagnosis of the first experiment run.
# * Nephrolithiasis (Kidney stones) was the correct diagnosis.
# * [Kidney stones, Ureteral obstruction, Urinary tract infection, Pyelonephritis, Renal colic] are the 5 most likely diagnoses the symptom checker agent decided upon.
# * position=1 is the result of the evaluator LLM. It decided the correct diagnosis was found at position 1 in the differential diagnosis list.

2024-10-28 21:53:27,234 - [INFO] - benchmark.evaluate - Results of run i=0
positions=[1.0, 1.0, 1.0]
Number of correct diagnoses: 3 / 3
Average position of correct diagnosis: 1.0
# This shows the aggregated results of the first experiment run:
# * For the first vignette, the evaluator thought the first diagnosis was the correct one, hence 1.
# * For the second vignette, the evaluator thought the first diagnosis was the correct one, hence 1.
# * For the third vignette, the evaluator thought the first diagnosis was the correct one, hence 1.
# Overall, 3 / 3 diagnoses were correct, and the average position of the correct diagnoses is (1.0 + 1.0 + 1.0) / 3 = 1.0.
```
### Inspecting the results of a benchmark run

Here is how to read the results of the benchmark you ran above back from disk.
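A minimal sketch, assuming only that the dumped results file is ordinary JSON; its exact schema is not documented here, so inspect the top-level structure before relying on specific field names:

```python
import json
from pathlib import Path

# Use the path printed in your own "Dumping results to ..." log line.
result_path = Path("medask/benchmark/results/2024-10-28T21:52:59_gpt-4o_3.json")

with result_path.open() as f:
    results = json.load(f)

# Inspect the top-level structure before digging into specific fields.
if isinstance(results, dict):
    print("Top-level keys:", list(results.keys()))
elif isinstance(results, list):
    print(f"{len(results)} entries; first entry:")
    print(json.dumps(results[0], indent=2))
```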