

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Lianghui Zhu<sup>1,2</sup>, Xinggang Wang<sup>1</sup>, Xinlong Wang<sup>2</sup>

<sup>1</sup>HUST, <sup>2</sup>BAAI



<details><summary>Abstract</summary> Evaluating Large Language Models (LLMs) in open-ended scenarios is challenging because existing benchmarks and metrics cannot measure them comprehensively. To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks. We first propose a comprehensive, large-scale, high-quality dataset containing task seeds, LLMs-generated answers, and GPT-4-generated judgments for fine-tuning high-performance judges, as well as a new benchmark for evaluating the judges. We train JudgeLM at different scales from 7B, 13B, to 33B parameters, and conduct a systematic analysis of its capabilities and behaviors. We then analyze the key biases in fine-tuning LLM as a judge and consider them as position bias, knowledge bias, and format bias. To address these issues, JudgeLM introduces a bag of techniques including swap augmentation, reference support, and reference drop, which clearly enhance the judge's performance. JudgeLM obtains the state-of-the-art judge performance on both the existing PandaLM benchmark and our proposed new benchmark. Our JudgeLM is efficient and the JudgeLM-7B only needs 3 mins to judge 5K samples with 8 A100 GPUs. JudgeLM obtains high agreement with the teacher judge, achieving an agreement exceeding 90% that even surpasses human-to-human agreement. JudgeLM also demonstrates extended capabilities in being judges of the single answer, multimodal models, multiple answers, and multi-turn chat. </details>

JudgeLM is an open platform for training, serving, and evaluating scalable large language model judges.

JudgeLM's core features include:



Install: From source

  1. Clone this repository and navigate to the JudgeLM folder
git clone https://github.com/baaivision/JudgeLM
cd JudgeLM
  2. Install the package
conda create -n judgelm python=3.10.10 -y
conda activate judgelm
pip3 install --upgrade pip 
pip3 install -e .
pip install flash-attn==2.0.4 --no-build-isolation

Model Weights

JudgeLM is based on LLaMA and should be used under LLaMA's model license.

| Model | w/ reference? | Agreement↑ | Precision↑ | Recall↑ | F1↑ | Consistency↑ |
|---|---|---|---|---|---|---|
| JudgeLM-33B 🔥 | ❎ | 89.03 | 80.97 | 84.76 | 82.64 | 91.36 |
| JudgeLM-33B 🔥 | ✅ | 89.32 | 84.00 | 86.21 | 84.98 | 92.37 |
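
For readers who want to see how columns like these can be computed, here is a minimal sketch. It is not the repository's evaluation code: the label names, the macro-averaging over three verdict classes, and the reading of consistency as "same verdict after the two answers are presented in swapped order" are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical verdict labels; the real evaluation scripts may encode verdicts differently.
LABELS = ("answer_1", "answer_2", "tie")


def agreement(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of samples where the judge's verdict equals the teacher's verdict."""
    return 100.0 * sum(p == g for p, g in zip(pred, gold)) / len(gold)


def macro_prf(pred: Sequence[str], gold: Sequence[str],
              labels: Sequence[str] = LABELS) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall and F1 (in percent) over the verdict classes."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(p == label and g == label for p, g in zip(pred, gold))
        fp = sum(p == label and g != label for p, g in zip(pred, gold))
        fn = sum(p != label and g == label for p, g in zip(pred, gold))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return (100.0 * sum(precisions) / n,
            100.0 * sum(recalls) / n,
            100.0 * sum(f1s) / n)


def consistency(pred: Sequence[str], pred_swapped: Sequence[str]) -> float:
    """Percentage of samples whose verdict is unchanged when the answer order is swapped.

    `pred_swapped` is assumed to be already mapped back to the original answer naming.
    """
    return 100.0 * sum(a == b for a, b in zip(pred, pred_swapped)) / len(pred)
```

Macro-averaging gives each verdict class equal weight regardless of how often it occurs, which is one common choice when ties are rare; a different averaging scheme would give different numbers.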




JudgeLM can judge open-ended answers from LLMs as well as from multimodal models.

See instructions for running JudgeLM at judgelm/llm_judge.

Serving with Web GUI


We use Gradio to provide a web server and UI for users to evaluate LLMs' performance on open-ended tasks. The demo can be tried here.

See instructions for running JudgeLM web server at judgelm/serve.



The JudgeLM-100K dataset is available at HuggingFace Datasets.
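
Once the training file is downloaded, it can help to inspect it before fine-tuning. The sketch below assumes only that the data is stored as JSON Lines (one JSON object per line), as the `judgelm_train_100k.jsonl` filename in the training command suggests. The helper names `iter_jsonl` and `summarize` are hypothetical, and no field names are assumed; the code simply reports whatever keys it finds.

```python
import json
from pathlib import Path
from typing import Iterator, List, Tuple, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path: Union[str, Path], limit: int = 3) -> Tuple[int, List[str]]:
    """Return the record count and the sorted field names seen in the first `limit` records."""
    count = 0
    keys = set()
    for record in iter_jsonl(path):
        if count < limit:
            keys.update(record.keys())
        count += 1
    return count, sorted(keys)
```

For example, `summarize("judgelm_train_100k.jsonl")` (the local path is an assumption) would return the number of training samples and the field names of the first few records.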

Code and Hyperparameters

Our code is based on Vicuna, with additional support for judging answer pairs. We use hyperparameters similar to Vicuna's.

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
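
The global batch size can be related to the launch flags of the fine-tuning command in the next section. As a sketch, assuming single-node data-parallel training where every process consumes `per_device_train_batch_size` samples per micro-step, the effective batch per optimizer step is the product of the process count, the per-device batch size, and the gradient accumulation steps. The helper name `global_batch_size` is hypothetical.

```python
def global_batch_size(num_gpus: int,
                      per_device_batch_size: int,
                      grad_accum_steps: int,
                      num_nodes: int = 1) -> int:
    """Samples contributing to one optimizer step under plain data parallelism."""
    return num_nodes * num_gpus * per_device_batch_size * grad_accum_steps


# Values taken from the JudgeLM-7B fine-tuning command in this README:
# --nproc_per_node=4, --per_device_train_batch_size 1, --gradient_accumulation_steps 32
effective = global_batch_size(num_gpus=4, per_device_batch_size=1, grad_accum_steps=32)
print(effective)  # 4 * 1 * 32 = 128
```

When running on a different number of GPUs, adjusting `--gradient_accumulation_steps` so that this product stays the same is one way to keep the optimization setup comparable.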

Fine-tuning JudgeLM-7B with Local GPUs

torchrun --nproc_per_node=4 --master_port=20001 judgelm/train/train_mem.py \
    --model_name_or_path="/share/project/lianghuizhu/vicuna-weights-collection-v1.3/vicuna-7b-v1.3" \
    --data_path /share/project/lianghuizhu/JudgeLM-Project/JudgeLM/judgelm/data/JudgeLM/judgelm_train_100k.jsonl \
    --bf16 True \
    --output_dir="/home/zhulianghui/ProjectC_ChatGPT/alpaca/output/judgelm-debug-evaluator" \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap offload" \
    --fsdp_transformer_layer_cls_to_wrap "LlamaDecoderLayer" \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --run_name 7B-full-model \
    --swap_aug_ratio 0.5 \
    --ref_drop_ratio 0.5
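
The last two flags correspond to the swap augmentation and reference drop techniques named in the abstract. The following is only an illustrative sketch of what ratio-controlled augmentation of this kind could look like, not the repository's implementation: the sample schema (`answer1`, `answer2`, `score1`, `score2`, `reference`) and the function `augment_sample` are hypothetical, and the interpretation of each ratio as a per-sample probability is an assumption.

```python
import random
from typing import Optional


def augment_sample(sample: dict,
                   swap_aug_ratio: float,
                   ref_drop_ratio: float,
                   rng: Optional[random.Random] = None) -> dict:
    """Return an augmented copy of one training sample (hypothetical schema).

    - With probability `swap_aug_ratio`, swap the two answers together with their
      scores, so the judge cannot learn to favor a fixed position.
    - With probability `ref_drop_ratio`, remove the reference answer, so the judge
      sees both the with-reference and the without-reference format.
    """
    rng = rng or random.Random()
    out = dict(sample)  # shallow copy; the input sample is left untouched

    if rng.random() < swap_aug_ratio:
        out["answer1"], out["answer2"] = sample["answer2"], sample["answer1"]
        out["score1"], out["score2"] = sample["score2"], sample["score1"]

    if out.get("reference") is not None and rng.random() < ref_drop_ratio:
        out["reference"] = None

    return out


example = {"answer1": "A", "answer2": "B", "score1": 8, "score2": 5, "reference": "R"}
augmented = augment_sample(example, swap_aug_ratio=0.5, ref_drop_ratio=0.5,
                           rng=random.Random(0))
```

Any judgment text that refers to the answers by position would also need to be rewritten when a swap happens; this sketch leaves that out.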


Acknowledgement ❤️

This project is based on Vicuna (blog, code), PandaLM (paper, code), and LLM-Blender (paper, code). Thanks for their wonderful work.


The code (training, serving, and evaluation) in this repository is mostly developed for or derived from the paper below. Please cite it if you find the repository helpful.

      title={JudgeLM: Fine-tuned Large Language Models are Scalable Judges}, 
      author={Lianghui Zhu and Xinggang Wang and Xinlong Wang},