
pfmt-bench-fin-ja: Preferred Multi-turn Benchmark for Finance in Japanese

Note: Since September 2024, the judge model has been changed from gpt-4o to gpt-4o-mini for cost reasons.

This is a benchmark for measuring the generation quality of LLMs in Japanese financial conversations. It consists of 360 dialogues, each containing 2 turns, and covers 12 task types: writing, roleplay, knowledge, extraction, reasoning, math, coding, idea, translation, ethics, trustworthiness, and ESGs.

The original MT-bench contains writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities tasks. We replaced the STEM and humanities tasks with a knowledge task for financial conversations, and added idea, translation, ethics, trustworthiness, and ESGs tasks.

Responses are scored on a 10-point scale by gpt-4o-mini.

How to use

Usage is almost the same as for MT-bench.

Set the environment variables using a .env file.
Generate the models' answers:

# for HF models
python gen_model_answers.py --model-path rinna/nekomata-7b rinna/nekomata-7b-instruction ...
# for API models (currently, OpenAI and Anthropic are supported)
python gen_api_answer.py --model gpt-35-turbo --parallel 10

For gen_model_answers.py, you can pass --disable-strict-injection-check to disable the strict injection check. The strict injection check tests whether a model's answer contains self-generated follow-up questions and answers, which could cause problems when the model outputs are evaluated in the next step. If you do not pass this option (the default), output that contains special Q&A markers, such as ###ユーザー and ###アシスタント, is removed.
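The idea behind the strict injection check can be sketched as follows. This is a minimal illustration, not the code used by gen_model_answers.py: the function names are hypothetical, and it assumes that "removed" means the whole answer is discarded when a marker is found. The actual script may instead truncate or strip the affected part.

```python
# Hypothetical sketch; not the implementation in gen_model_answers.py.
# The two markers are the examples named in the text above.
MARKERS = ("###ユーザー", "###アシスタント")


def contains_injected_turn(answer: str, markers=MARKERS) -> bool:
    """Return True if the answer contains any Q&A turn marker."""
    return any(marker in answer for marker in markers)


def filter_answers(answers: list[str], strict: bool = True) -> list[str]:
    """Drop answers containing turn markers when the strict check is on.

    Assumption: a flagged answer is discarded as a whole.
    """
    if not strict:
        return list(answers)
    return [a for a in answers if not contains_injected_turn(a)]


answers = [
    "A weaker yen tends to lift exporters' earnings.",
    "Yes, that is right.\n###ユーザー\nWhat about bonds?\n###アシスタント\n...",
]

# With the check enabled, only the first answer survives.
print(filter_answers(answers))
# With the check disabled, both answers are kept.
print(len(filter_answers(answers, strict=False)))
```

The second sample answer shows the failure mode the check targets: the model continues the conversation on its own, so the judge would otherwise score invented turns along with the real answer.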

Generate judgments using gpt-4o-mini:

# highly recommended: evaluate models using single mode
python gen_judgment.py [--model-list gpt-35-turbo rinna/nekomata-7b-instruction] --mode single --parallel 10
# for pairwise evaluation (not recommended because a lot of requests are required)
python gen_judgment.py [--model-list gpt-35-turbo rinna/nekomata-7b-instruction] --mode pairwise-all --parallel 10

If you omit --model-list, all models for which answers have been generated are evaluated.
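To see why pairwise evaluation needs many more requests, the following rough estimate compares the two modes. It is a back-of-the-envelope sketch, not a description of what gen_judgment.py actually sends: it assumes one judge request per turn in single mode, and one request per turn for every unordered pair of models in pairwise-all mode. The function names are hypothetical.

```python
from math import comb

NUM_DIALOGUES = 360      # from the benchmark description
TURNS_PER_DIALOGUE = 2   # from the benchmark description


def single_mode_requests(num_models: int) -> int:
    # Assumption: one judge request per model, dialogue, and turn.
    return num_models * NUM_DIALOGUES * TURNS_PER_DIALOGUE


def pairwise_all_requests(num_models: int) -> int:
    # Assumption: one judge request per unordered model pair, dialogue, and turn.
    return comb(num_models, 2) * NUM_DIALOGUES * TURNS_PER_DIALOGUE


for n in (2, 5, 10):
    print(n, single_mode_requests(n), pairwise_all_requests(n))
```

Under these assumptions, 10 models need 7,200 requests in single mode and 32,400 in pairwise-all mode: single mode grows linearly with the number of models, while pairwise-all grows quadratically.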

Get aggregated results:

python make_leaderboard.py

This produces the aggregated results, including the scores for each task.

python show_result.py

This calculates the results for each task on each turn.
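The two kinds of aggregation can be illustrated with a small sketch. The record layout, field names, and function names below are assumptions made for illustration; they are not the file format or code used by make_leaderboard.py or show_result.py.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judgment records; the field names are assumptions.
judgments = [
    {"model": "model-a", "task": "math", "turn": 1, "score": 8},
    {"model": "model-a", "task": "math", "turn": 2, "score": 6},
    {"model": "model-a", "task": "math", "turn": 1, "score": 6},
    {"model": "model-a", "task": "ethics", "turn": 1, "score": 9},
    {"model": "model-a", "task": "ethics", "turn": 2, "score": 7},
]


def scores_by_task(records):
    """Average score per (model, task), across both turns."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r["task"])].append(r["score"])
    return {key: mean(values) for key, values in buckets.items()}


def scores_by_task_and_turn(records):
    """Average score per (model, task, turn)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r["task"], r["turn"])].append(r["score"])
    return {key: mean(values) for key, values in buckets.items()}


for key, value in sorted(scores_by_task_and_turn(judgments).items()):
    print(key, value)
```

Keeping the turn in the grouping key shows whether a model degrades on the second turn, which a per-task average alone would hide.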

Citation

@misc{Hirano2024-pfmt-jxiv,
    title={{金融分野に特化した複数ターン日本語生成ベンチマークの構築}},
    author={平野, 正徳 and 今城, 健太郎},
    year={2024},
    doi={10.51094/jxiv.1000},
    url = {https://jxiv.jst.go.jp/index.php/jxiv/preprint/view/1000}
}

@misc{Hirano2024-pfmt,
    title={{pfmt-bench-fin-ja: Preferred Multi-turn Benchmark for Finance in Japanese}},
    author={Masanori Hirano and Kentaro Imajo},
    year={2024},
    url = {https://github.com/pfnet-research/pfmt-bench-fin-ja}
}

Contribution

This project was created by Masanori Hirano and Kentaro Imajo. It is owned by Preferred Networks and maintained by Masanori Hirano.

Note

Models that are tuned based on this benchmark should not be evaluated by this benchmark. For example, model merging should not be performed based on this benchmark.

License

The benchmark code is licensed under the MIT License. However, the questions and some result files are released under each model's license.