# Shaberi: A Suite of Japanese Chat Benchmarks

A repo for evaluating Japanese LLMs

## How to run

```bash
# Get the code
git clone https://github.com/shisa-ai/shaberi
cd shaberi

# Create the environment and install requirements
mamba create -n shaberi python=3.11
mamba activate shaberi
pip install -r requirements.txt
```

```bash
# In one terminal, serve the model with the vLLM OpenAI-compatible API, e.g.:
python -m vllm.entrypoints.openai.api_server --model shisa-ai/shisa-v1-llama3-70b -tp 8
# or with the llama.cpp OpenAI-compatible API, e.g.:
./server -ngl 99 -c 8192 -m shisa-v1-llama3-70b.Q4_K_M.gguf --chat-template llama3 --host 0.0.0.0 --port 8000 -a shisa-v1-llama3-70b.q4_k_m
```
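Either server exposes the same OpenAI-compatible API, so you can sanity-check the endpoint before kicking off a long run. A minimal sketch, assuming the `openai` v1 Python client is available in the environment and the server is listening on localhost:8000 (the `EMPTY` key is a placeholder; local servers typically don't check it):

```python
# Sanity-check the local OpenAI-compatible server before generating answers.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The model IDs the server reports are what --model_name must match.
for model in client.models.list():
    print(model.id)
```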

```bash
# In a separate terminal, generate answers:
mamba activate shaberi
# Match the model name to what vLLM is serving.
# We use frequency_penalty=0.5 for all our runs; it generally seems to give the best results.
python generate_answers.py --model_name 'shisa-ai/shisa-v1-llama3-8b' -fp 0.5
```
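Answer generation amounts, roughly, to chat-completion calls against that local endpoint. A hypothetical sketch of a single call, mainly to show where `frequency_penalty` plugs in (the question text is a placeholder; the real prompts come from the benchmark datasets):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# One benchmark question -> one model answer. frequency_penalty=0.5
# mirrors the -fp 0.5 flag above; other sampling settings are left at defaults.
resp = client.chat.completions.create(
    model="shisa-ai/shisa-v1-llama3-8b",
    messages=[{"role": "user", "content": "日本で一番高い山は?"}],
    frequency_penalty=0.5,
)
print(resp.choices[0].message.content)
```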

```bash
# Then run the judge (assumes your OPENAI_API_KEY is in the env already):
python judge_answers.py -m shisa-ai/shisa-v1-llama3-8b
```
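judge_answers.py hands each generated answer to an OpenAI judge model for scoring. A minimal sketch of the general LLM-as-judge pattern it relies on; the prompt wording and 1–10 scale here are illustrative, not the repo's exact rubric:

```python
import os
from openai import OpenAI

judge = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def judge_score(question: str, answer: str) -> str:
    """Illustrative judge call; the repo's actual prompt differs."""
    resp = judge.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{
            "role": "user",
            "content": (
                "Rate the answer to this question on a 1-10 scale "
                "and reply with the number only.\n"
                f"Question: {question}\nAnswer: {answer}"
            ),
        }],
    )
    return resp.choices[0].message.content
```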

```bash
# Make sure you have new answers and judgements
git status
```

```bash
# To generate updated results
python results_vizualization.py
cat output.csv
```
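If you'd rather slice the results than `cat` them, output.csv loads straight into pandas. A quick sketch; the `mean` column name is an assumption taken from the results table below, so check the CSV header first:

```python
import pandas as pd

df = pd.read_csv("output.csv")
# "mean" is an assumed column name; inspect df.columns if it differs.
print(df.sort_values("mean", ascending=False).to_string(index=False))
```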

Changes made:

If we were making our own:


### OLD SHABERI README... 

#### How to run

1. Generating the model's answers for each evaluation dataset

```bash
OPENAI_API_KEY=[your OpenAI API key] python generate_answer.py \
    --model_name [name of the LLM to evaluate] \
    --eval_dataset_name [evaluation dataset name] \
    --num_proc [number of parallel processes]
```

- `--model_name` (required): Name of the LLM to evaluate.

- `--eval_dataset_name` (default="all"): Name of the evaluation dataset. If not set, the script runs against all supported evaluation datasets.

- `--num_proc` (default=8): Number of parallel processes (see the sketch below). Set this to suit your environment; if not set, 8 processes are used.
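For context on what `--num_proc` controls: fanning dataset rows out across worker processes is the standard Hugging Face `datasets` pattern via `map(num_proc=...)`, which is likely what the script does internally. A minimal sketch under that assumption; the `Question`/`ModelAnswer` column names and the stub function are hypothetical, not the repo's actual schema:

```python
from datasets import Dataset

def add_answer(example):
    # Hypothetical stub: the real script would call the model's
    # chat endpoint here and store its reply.
    example["ModelAnswer"] = "(model reply goes here)"
    return example

ds = Dataset.from_dict({"Question": ["Q1", "Q2", "Q3", "Q4"]})
# num_proc is what --num_proc tunes: more processes, more parallel calls.
ds = ds.map(add_answer, num_proc=2)
print(ds[0])
```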

2. Evaluating the model's answers

```bash
OPENAI_API_KEY=[your OpenAI API key] python judge_answers.py \
    --model_name [name of the LLM to evaluate] \
    --eval_dataset_name [evaluation dataset name] \
    --evaluation_model [name of the judge LLM] \
    --num_proc [number of parallel processes]
```

- `--model_name` (required): Name of the LLM to evaluate.

- `--eval_dataset_name` (default="all"): Name of the evaluation dataset. If not set, the script runs against all supported evaluation datasets.

- `--evaluation_model` (default="gpt-4-turbo-preview"): Name of the judge LLM. If not set, gpt-4-turbo-preview is used for evaluation.

- `--num_proc` (default=8): Number of parallel processes. Set this to suit your environment; if not set, 8 processes are used.

#### Models that can be evaluated

OpenAI models, as well as any model supported by a tool (such as vLLM) that can serve an OpenAI-compatible (openai-module-style) API.

For how to launch an OpenAI-compatible server with vLLM, see the official documentation: https://docs.vllm.ai/en/latest/getting_started/quickstart.html

#### Models that can be used for evaluation (judges)

  1. gpt-4-turbo-preview

#### Evaluation dataset names

  1. lightblue/tengu_bench: Tengu-Bench
  2. elyza/ELYZA-tasks-100: ELYZA-tasks-100
  3. lightblue/japanes-mt-bench-oneshot: MT-Bench
  4. yuzuai/rakuda-questions: Rakuda
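To peek at the raw benchmark prompts, these datasets can be loaded directly from the Hugging Face Hub. A small sketch; the `split="test"` name is an assumption, so check each dataset card if it fails:

```python
from datasets import load_dataset

# Dataset IDs from the list above; split names are assumptions,
# check each dataset card on the Hugging Face Hub.
for name in [
    "lightblue/tengu_bench",
    "elyza/ELYZA-tasks-100",
    "lightblue/japanes-mt-bench-oneshot",
    "yuzuai/rakuda-questions",
]:
    ds = load_dataset(name, split="test")
    print(name, len(ds), ds.column_names)
```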

#### Results

| Model | ELYZA-tasks-100 | Rakuda | Tengu-Bench | MT-Bench | Mean |
|---|---|---|---|---|---|
| gpt-4-turbo-2024-04-09 | 8.78 | 9.18 | 8.31 | 8.74 | 8.75 |
| gpt-4-turbo-preview | 8.94 | 9.28 | 7.84 | 8.61 | 8.67 |
| CohereForAI__c4ai-command-r-plus | 7.50 | 9.05 | 6.79 | 7.42 | 7.69 |
| Qwen__Qwen1.5-72B-Chat | 7.60 | 7.85 | 6.81 | 7.16 | 7.36 |
| gpt-3.5-turbo-0125 | 7.24 | 7.64 | 6.82 | 6.97 | 7.17 |
| CohereForAI__c4ai-command-r-v01 | 6.08 | 8.62 | 6.67 | 6.94 | 7.08 |
| Qwen__Qwen1.5-32B-Chat | 7.09 | 7.51 | 6.36 | 6.90 | 6.97 |
| karakuri-ai__karakuri-lm-70b-chat-v0.1 | 6.86 | 7.85 | 6.23 | 6.43 | 6.84 |
| lightblue__ao-karasu-72B | 7.19 | 7.25 | 6.27 | 6.54 | 6.81 |
| Qwen__Qwen1.5-14B-Chat | 6.70 | 6.54 | 6.20 | 6.54 | 6.50 |
| xverse__XVERSE-13B-Chat | 6.34 | 6.65 | 4.88 | 5.34 | 5.80 |
| Rakuten__RakutenAI-7B-chat | 5.92 | 6.58 | 5.24 | 4.60 | 5.58 |
| Nexusflow__Starling-LM-7B-beta | 5.74 | 4.42 | 5.41 | 5.61 | 5.30 |
| Qwen__Qwen1.5-7B-Chat | 5.54 | 4.82 | 5.29 | 5.41 | 5.27 |
| elyza__ELYZA-japanese-Llama-2-13b-instruct | 5.60 | 5.62 | 5.52 | 4.31 | 5.26 |
| lightblue__qarasu-14B-chat-plus-unleashed | 5.58 | 5.46 | 5.01 | 4.74 | 5.20 |
| openchat__openchat-3.5-0106 | 5.82 | 3.77 | 5.49 | 5.04 | 5.03 |
| cyberagent__calm2-7b-chat | 4.90 | 5.75 | 4.81 | 3.58 | 4.76 |
| mistralai__Mistral-7B-Instruct-v0.2 | 5.78 | 3.80 | 4.53 | 4.65 | 4.69 |
| meta-llama__Llama-2-13b-chat-hf | 5.64 | 2.27 | 4.67 | 5.71 | 4.58 |
| meta-llama__Llama-2-7b-chat-hf | 4.78 | 2.08 | 3.92 | 5.45 | 4.06 |
| augmxnt__shisa-7b-v1 | 3.72 | 2.23 | 3.41 | 2.23 | 2.89 |
<details><summary>Click here for detailed results by category.</summary>

| model_name | MT-Bench coding | MT-Bench extraction | MT-Bench humanities | MT-Bench math | MT-Bench reasoning | MT-Bench roleplay | MT-Bench stem | MT-Bench writing | Tengu-Bench function calling | Tengu-Bench idea generation | Tengu-Bench cost estimation | Tengu-Bench puns | Tengu-Bench business | Tengu-Bench formatting | Tengu-Bench project creation | Tengu-Bench conversation summarization | Tengu-Bench ethical control | Tengu-Bench construction | Tengu-Bench extraction | Tengu-Bench politics | Tengu-Bench honorifics | Tengu-Bench math | Tengu-Bench Japan | Tengu-Bench hypothetical questions | Tengu-Bench legal judgment | Tengu-Bench translation | Tengu-Bench table reading | Tengu-Bench logic puzzles | Tengu-Bench long-document closed QA (1,000+ tokens) | Tengu-Bench long-document summarization (1,000+ tokens) | Tengu-Bench small talk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CohereForAI__c4ai-command-r-plus | 6.10 | 8.60 | 8.60 | 5.60 | 5.70 | 8.20 | 7.80 | 8.80 | 8.20 | 10.00 | 9.00 | 3.60 | 2.60 | 9.00 | 10.00 | 10.00 | 2.80 | 7.00 | 9.40 | 5.60 | 9.20 | 1.20 | 4.30 | 3.60 | 7.60 | 8.40 | 5.40 | 2.33 | 8.20 | 10.00 | 9.40 |
| CohereForAI__c4ai-command-r-v01 | 6.90 | 6.60 | 8.40 | 4.70 | 5.20 | 7.67 | 7.70 | 8.40 | 9.20 | 10.00 | 9.80 | 4.60 | 4.80 | 9.00 | 10.00 | 10.00 | 3.40 | 5.40 | 8.80 | 4.60 | 9.20 | 1.00 | 4.50 | 3.60 | 6.60 | 8.40 | 1.80 | 2.60 | 8.60 | 10.00 | 9.80 |
| Nexusflow__Starling-LM-7B-beta | 5.30 | 6.30 | 6.10 | 5.50 | 5.10 | 5.50 | 4.10 | 7.00 | 6.60 | 9.00 | 5.40 | 4.20 | 3.00 | 8.00 | 9.20 | 9.20 | 4.00 | 4.80 | 9.20 | 1.20 | 6.80 | 1.40 | 2.10 | 6.00 | 2.40 | 7.60 | 1.80 | 1.80 | 6.80 | 10.00 | 7.20 |
| Qwen__Qwen1.5-14B-Chat | 5.40 | 7.40 | 7.00 | 5.40 | 5.30 | 7.20 | 7.00 | 7.60 | 6.40 | 10.00 | 8.80 | 4.40 | 4.60 | 7.80 | 9.40 | 10.00 | 4.60 | 5.20 | 9.00 | 2.80 | 8.60 | 1.40 | 3.40 | 5.20 | 5.00 | 4.80 | 3.40 | 3.40 | 9.60 | 10.00 | 7.60 |
| Qwen__Qwen1.5-32B-Chat | 5.70 | 7.40 | 7.60 | 6.20 | 6.00 | 7.20 | 7.20 | 7.90 | 8.20 | 10.00 | 8.80 | 5.20 | 2.60 | 6.00 | 10.00 | 10.00 | 3.40 | 4.40 | 9.80 | 2.80 | 9.00 | 3.80 | 3.60 | 6.00 | 5.40 | 7.80 | 6.00 | 3.00 | 7.40 | 9.00 | 7.60 |
| Qwen__Qwen1.5-72B-Chat | 6.70 | 6.90 | 7.70 | 6.80 | 4.80 | 8.00 | 7.90 | 8.50 | 5.60 | 10.00 | 9.60 | 3.60 | 2.60 | 7.80 | 10.00 | 10.00 | 5.80 | 7.40 | 9.60 | 4.20 | 8.20 | 4.00 | 2.90 | 7.60 | 6.00 | 7.80 | 6.20 | 3.40 | 9.60 | 10.00 | 8.60 |
| Qwen__Qwen1.5-7B-Chat | 4.90 | 6.30 | 5.20 | 4.70 | 4.30 | 5.80 | 5.00 | 7.10 | 5.00 | 9.60 | 7.60 | 5.80 | 2.40 | 4.80 | 7.80 | 9.60 | 4.40 | 3.40 | 8.40 | 2.00 | 7.60 | 0.80 | 1.80 | 5.20 | 3.00 | 6.60 | 3.60 | 1.20 | 9.00 | 9.40 | 6.20 |
| Rakuten__RakutenAI-7B-chat | 4.70 | 3.90 | 6.80 | 4.20 | 3.30 | 4.60 | 4.20 | 5.10 | 3.20 | 9.80 | 8.60 | 3.20 | 2.80 | 4.60 | 10.00 | 6.80 | 7.60 | 4.40 | 7.40 | 2.40 | 5.20 | 1.00 | 4.20 | 4.40 | 4.80 | 6.60 | 3.40 | 2.00 | 5.40 | 8.00 | 5.80 |
| augmxnt__shisa-7b-v1 | 3.50 | 4.00 | 1.70 | 3.30 | 1.90 | 1.10 | 1.00 | 1.30 | 2.40 | 6.00 | 4.40 | 1.00 | 2.20 | 2.20 | 7.80 | 6.40 | 2.40 | 1.20 | 6.40 | 1.00 | 3.80 | 1.00 | 1.70 | 2.80 | 1.60 | 3.60 | 1.20 | 1.60 | 5.80 | 9.80 | 3.80 |
| cyberagent__calm2-7b-chat | 2.30 | 4.10 | 6.10 | 1.30 | 2.40 | 4.30 | 4.40 | 3.70 | 4.40 | 9.00 | 5.20 | 3.60 | 3.00 | 3.40 | 9.20 | 7.20 | 2.80 | 2.80 | 7.60 | 3.40 | 3.60 | 1.00 | 3.40 | 5.20 | 5.80 | 5.20 | 1.00 | 1.00 | 6.40 | 9.60 | 8.20 |
| elyza__ELYZA-japanese-Llama-2-13b-instruct | 3.20 | 5.20 | 5.60 | 2.90 | 4.00 | 4.70 | 4.40 | 4.50 | 9.60 | 8.60 | 6.60 | 5.80 | 3.80 | 5.80 | 10.00 | 6.80 | 2.80 | 3.80 | 7.60 | 5.00 | 6.00 | 0.80 | 2.60 | 4.40 | 5.40 | 5.60 | 1.80 | 1.80 | 7.20 | 9.80 | 8.20 |
| gpt-3.5-turbo-0125 | 7.00 | 8.80 | 7.30 | 6.80 | 4.20 | 7.40 | 7.00 | 7.30 | 6.80 | 10.00 | 9.20 | 4.00 | 4.60 | 9.60 | 10.00 | 9.20 | 3.40 | 4.80 | 10.00 | 2.20 | 7.80 | 4.00 | 3.90 | 6.00 | 6.80 | 8.60 | 7.80 | 4.80 | 8.00 | 9.80 | 8.40 |
| gpt-4-turbo-2024-04-09 | 8.50 | 9.10 | 8.80 | 9.50 | 7.70 | 8.60 | 8.80 | 8.90 | 8.40 | 10.00 | 9.60 | 8.20 | 6.20 | 10.00 | 10.00 | 10.00 | 10.00 | 6.40 | 9.60 | 6.00 | 9.60 | 7.60 | 4.90 | 9.20 | 6.80 | 9.00 | 8.80 | 5.80 | 8.60 | 10.00 | 9.80 |
| gpt-4-turbo-preview | 8.10 | 9.00 | 8.90 | 8.50 | 8.00 | 8.70 | 8.60 | 9.10 | 10.00 | 10.00 | 9.80 | 4.00 | 6.60 | 10.00 | 10.00 | 10.00 | 8.20 | 6.40 | 10.00 | 6.40 | 9.60 | 6.60 | 4.20 | 6.00 | 6.00 | 8.80 | 8.40 | 4.20 | 9.00 | 10.00 | 9.80 |
| karakuri-ai__karakuri-lm-70b-chat-v0.1 | 5.90 | 6.90 | 8.30 | 4.10 | 5.09 | 6.80 | 6.30 | 8.20 | 6.00 | 8.20 | 8.20 | 4.00 | 4.40 | 6.80 | 10.00 | 9.60 | 2.80 | 5.40 | 8.80 | 4.20 | 7.80 | 2.20 | 4.20 | 3.60 | 8.00 | 8.40 | 2.60 | 2.60 | 8.60 | 9.20 | 9.80 |
| lightblue__ao-karasu-72B | 6.00 | 7.30 | 7.50 | 5.00 | 5.60 | 6.50 | 6.60 | 7.70 | 7.80 | 10.00 | 8.40 | 5.40 | 3.00 | 6.20 | 9.00 | 8.80 | 2.60 | 5.60 | 10.00 | 4.60 | 6.60 | 5.20 | 3.90 | 5.20 | 6.20 | 7.80 | 4.40 | 3.00 | 6.60 | 10.00 | 6.20 |
| lightblue__qarasu-14B-chat-plus-unleashed | 3.90 | 6.60 | 5.60 | 4.30 | 4.40 | 3.10 | 5.30 | 4.70 | 4.40 | 10.00 | 6.80 | 4.20 | 2.00 | 4.40 | 7.40 | 9.00 | 2.80 | 5.80 | 8.60 | 2.40 | 5.40 | 2.80 | 1.30 | 5.20 | 6.60 | 6.60 | 3.00 | 0.80 | 7.40 | 7.00 | 5.00 |
| meta-llama__Llama-2-13b-chat-hf | 3.60 | 6.00 | 8.70 | 3.50 | 2.90 | 5.90 | 7.90 | 7.20 | 7.80 | 9.40 | 7.80 | 2.40 | 2.20 | 4.00 | 8.40 | 4.80 | 9.60 | 4.60 | 2.20 | 1.40 | 6.60 | 0.40 | 1.80 | 2.80 | 4.80 | 6.80 | 2.80 | 1.40 | 4.80 | 9.20 | 4.40 |
| meta-llama__Llama-2-7b-chat-hf | 4.20 | 5.60 | 8.10 | 2.70 | 3.20 | 6.10 | 6.40 | 7.30 | 3.00 | 9.40 | 7.40 | 1.60 | 0.60 | 3.80 | 8.80 | 5.00 | 6.00 | 3.80 | 2.80 | 0.20 | 5.40 | 0.80 | 1.90 | 6.00 | 3.40 | 5.00 | 2.40 | 1.40 | 2.60 | 5.60 | 5.20 |
| mistralai__Mistral-7B-Instruct-v0.2 | 4.90 | 4.70 | 5.50 | 2.60 | 3.70 | 5.20 | 4.20 | 6.40 | 6.80 | 7.80 | 7.80 | 3.40 | 1.60 | 4.80 | 8.80 | 7.40 | 2.20 | 3.40 | 9.40 | 1.20 | 5.40 | 0.80 | 1.00 | 3.60 | 2.60 | 5.60 | 0.80 | 0.20 | 9.00 | 9.20 | 5.00 |
| openchat__openchat-3.5-0106 | 5.40 | 5.90 | 5.20 | 4.40 | 4.70 | 5.20 | 4.60 | 4.90 | 7.60 | 9.00 | 6.00 | 2.00 | 3.40 | 7.20 | 9.40 | 10.00 | 2.80 | 3.80 | 10.00 | 2.60 | 5.00 | 0.80 | 2.20 | 5.20 | 4.60 | 7.60 | 3.20 | 1.80 | 8.80 | 9.20 | 7.40 |
| xverse__XVERSE-13B-Chat | 4.60 | 6.70 | 5.50 | 4.00 | 3.30 | 6.10 | 5.90 | 6.60 | 2.00 | 10.00 | 6.60 | 3.00 | 2.40 | 5.60 | 9.40 | 7.20 | 0.80 | 4.20 | 8.00 | 4.40 | 5.60 | 0.80 | 1.30 | 4.40 | 5.80 | 7.20 | 2.20 | 1.80 | 8.00 | 7.80 | 7.40 |

</details>

#### Known errors

An error occurs when generating answers for lightblue/tengu_bench with meta-llama/Llama-2-7b-chat-hf at --num_proc=8.

→ With --num_proc=1 the error does not occur. (Cause unknown: Tengu-Bench's max token count was around 2,600, so lowering it to 1,024 should have fixed this, but it did not.)