Japanese Language Model Financial Evaluation Harness

This is a harness for Japanese language model evaluation in the financial domain.

0-shot Leaderboard

<!-- lb start -->
| Model | Ave. | chabsa | cma_basics | cpa_audit | fp2 | security_sales_1 | prompt |
|---|---|---|---|---|---|---|---|
| anthropic/claude-3-5-sonnet | 77.02 | 93.43 | 81.58 | 61.81 | 72.84 | 75.44 | default |
| nvidia/nemotron-4-340b-instruct | 70.31 | 91.93 | 86.84 | 40.70 | 56.63 | 75.44 | default |
| Qwen/Qwen2-72B | 69.35 | 92.64 | 84.21 | 49.50 | 52.00 | 68.42 | default |
| Qwen/Qwen2-72B-Instruct | 67.71 | 92.18 | 84.21 | 43.72 | 51.79 | 66.67 | default |
| openai/gpt-4-32k | 66.27 | 93.16 | 81.58 | 37.44 | 50.74 | 68.42 | default |
| openai/gpt-4 | 66.07 | 93.20 | 78.95 | 37.69 | 50.32 | 70.18 | default |
| anthropic/claude-3-opus | 65.81 | 93.04 | 71.05 | 42.71 | 55.58 | 66.67 | default |
| openai/gpt-4o | 65.26 | 90.93 | 76.32 | 53.02 | 39.37 | 66.67 | default |
| openai/gpt-4-turbo | 64.59 | 92.86 | 76.32 | 36.18 | 50.95 | 66.67 | default |
| gemini/gemini-1.5-flash | 63.10 | 92.36 | 71.05 | 35.93 | 49.47 | 66.67 | default |
| anthropic/claude-3-sonnet | 61.59 | 89.70 | 71.05 | 38.44 | 42.11 | 66.67 | default |
| Qwen/Qwen1.5-72B-Chat | 59.62 | 92.15 | 71.05 | 31.41 | 36.84 | 66.67 | default |
| Qwen/Qwen2-57B-A14B | 59.45 | 90.52 | 78.95 | 24.62 | 40.00 | 63.16 | default |
| Qwen/Qwen2-57B-A14B-Instruct | 59.40 | 91.03 | 73.68 | 27.39 | 40.00 | 64.91 | 1.0-0.1.2 |
| Qwen/Qwen-72B | 59.08 | 89.46 | 76.32 | 28.64 | 39.58 | 61.40 | 1.0-0.1.2 |
| Qwen/Qwen1.5-72B | 58.82 | 90.77 | 71.05 | 26.38 | 37.47 | 68.42 | 1.0-0.1 |
| meta-llama/Meta-Llama-3-70B-Instruct | 58.48 | 90.61 | 76.32 | 29.90 | 42.95 | 52.63 | 1.0-0.2.1 |
| tokyotech-llm/Swallow-70b-NVE-instruct-hf | 58.32 | 90.72 | 63.16 | 21.11 | 53.47 | 63.16 | default |
| gemini/gemini-1.5-pro | 57.94 | 59.95 | 68.42 | 39.70 | 49.68 | 71.93 | default |
| Qwen/Qwen-72B-Chat | 57.33 | 92.10 | 71.05 | 25.38 | 40.21 | 57.89 | 1.0-0.1.2 |
| meta-llama/Meta-Llama-3-70B | 56.87 | 90.19 | 73.68 | 24.87 | 37.68 | 57.89 | 1.0-0.1.2 |
| tokyotech-llm/Swallow-70b-NVE-hf | 56.26 | 86.42 | 60.53 | 20.10 | 52.84 | 61.40 | default |
| pfnet/plamo-1.0-prime-beta | 55.24 | 89.37 | 60.53 | 21.86 | 41.26 | 63.16 | default |
| anthropic/claude-3-haiku | 55.15 | 82.25 | 73.68 | 29.90 | 37.26 | 52.63 | default |
| tokyotech-llm/Swallow-70b-hf | 54.86 | 89.28 | 68.42 | 19.85 | 45.89 | 50.88 | default |
| Qwen/Qwen1.5-32B-Chat | 54.51 | 91.52 | 57.89 | 25.38 | 38.11 | 59.65 | 1.0-0.1.2 |
| tokyotech-llm/Swallow-70b-instruct-hf | 54.46 | 91.36 | 65.79 | 20.35 | 45.68 | 49.12 | default |
| Qwen/Qwen2-7B-Instruct | 53.78 | 91.94 | 60.53 | 25.13 | 35.16 | 56.14 | 1.0-0.2.1 |
| tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1 | 53.50 | 88.64 | 65.79 | 20.10 | 31.58 | 61.40 | 1.0-0.1.2 |
| Qwen/Qwen1.5-32B | 53.34 | 91.37 | 68.42 | 27.89 | 29.89 | 49.12 | default |
| Qwen/Qwen2-7B | 53.28 | 90.73 | 65.79 | 24.12 | 31.37 | 54.39 | 1.0-0.1.2 |
| Qwen/Qwen1.5-14B-Chat | 52.82 | 90.43 | 57.89 | 25.63 | 35.79 | 54.39 | 1.0-0.1.2 |
| pfnet/nekomata-14b-pfn-qfin | 52.74 | 88.87 | 47.37 | 25.13 | 39.16 | 63.16 | 1.0-0.2.1 |
| Qwen/Qwen1.5-14B | 52.20 | 84.55 | 65.79 | 20.60 | 33.89 | 56.14 | 1.0-0.1.2 |
| karakuri-ai/karakuri-lm-8x7b-instruct-v0.1 | 51.63 | 83.87 | 57.89 | 16.33 | 40.42 | 59.65 | 1.0-0.2.1 |
| pfnet/nekomata-14b-pfn-qfin-inst-merge | 51.12 | 88.93 | 50.00 | 24.62 | 37.68 | 54.39 | 1.0-0.2.1 |
| rinna/nekomata-14b-instruction | 50.91 | 89.40 | 52.63 | 20.35 | 36.00 | 56.14 | 1.0-0.2.1 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 50.63 | 91.02 | 57.89 | 24.37 | 30.74 | 49.12 | 1.0-0.2 |
| gemini/gemini-1.0-pro | 50.52 | 78.94 | 55.26 | 23.37 | 40.63 | 54.39 | default |
| rinna/nekomata-14b | 50.46 | 85.88 | 63.16 | 20.60 | 31.79 | 50.88 | 1.0-0.1.2 |
| Qwen/Qwen-14B | 50.30 | 86.14 | 63.16 | 19.10 | 32.21 | 50.88 | 1.0-0.1.2 |
| openai/gpt-35-turbo | 50.27 | 89.98 | 52.63 | 18.09 | 29.26 | 61.40 | default |
| karakuri-ai/karakuri-lm-8x7b-chat-v0.1 | 50.00 | 85.19 | 60.53 | 19.85 | 37.05 | 47.37 | 1.0-0.2.1 |
| Qwen/Qwen1.5-7B-Chat | 49.73 | 86.27 | 50.00 | 24.87 | 31.37 | 56.14 | 1.0-0.2.1 |
| Qwen/Qwen-14B-Chat | 49.13 | 91.03 | 55.26 | 16.83 | 29.89 | 52.63 | default |
| stabilityai/japanese-stablelm-instruct-beta-70b | 47.93 | 84.77 | 42.11 | 19.85 | 33.26 | 59.65 | 1.0-0.1.2 |
| rinna/nekomata-7b-instruction | 47.75 | 86.71 | 44.74 | 17.34 | 30.32 | 59.65 | default |
| Qwen/Qwen1.5-MoE-A2.7B-Chat | 46.64 | 82.10 | 42.11 | 22.86 | 28.21 | 57.89 | 1.0-0.1 |
| Qwen/Qwen-7B | 45.99 | 82.30 | 47.37 | 19.60 | 31.58 | 49.12 | 1.0-0.1.2 |
| mistralai/Mistral-7B-Instruct-v0.2 | 45.80 | 87.59 | 39.47 | 17.84 | 29.68 | 54.39 | default |
| SakanaAI/EvoLLM-JP-v1-7B | 45.74 | 88.40 | 39.47 | 13.32 | 31.37 | 56.14 | 1.0-0.2.1 |
| Xwin-LM/Xwin-LM-70B-V0.1 | 45.65 | 87.58 | 39.47 | 16.58 | 32.00 | 52.63 | 1.0-0.5 |
| Qwen/Qwen-7B-Chat | 45.33 | 85.40 | 47.37 | 19.85 | 28.42 | 45.61 | 1.0-0.1.2 |
| Rakuten/RakutenAI-7B-instruct | 44.96 | 74.98 | 50.00 | 17.84 | 32.84 | 49.12 | default |
| meta-llama/Meta-Llama-3-8B-Instruct | 44.70 | 86.77 | 39.47 | 16.83 | 33.05 | 47.37 | 1.0-0.2.1 |
| karakuri-ai/karakuri-lm-70b-chat-v0.1 | 44.59 | 88.59 | 36.84 | 18.09 | 30.32 | 49.12 | 1.0-0.2.1 |
| SakanaAI/EvoLLM-JP-A-v1-7B | 44.51 | 86.82 | 55.26 | 13.82 | 26.32 | 40.35 | 1.0-0.3 |
| mistralai/Mixtral-8x7B-v0.1 | 44.29 | 89.39 | 42.11 | 15.58 | 25.26 | 49.12 | default |
| meta-llama/Llama-2-70b-chat-hf | 44.23 | 85.67 | 44.74 | 17.09 | 26.32 | 47.37 | 1.0-0.1 |
| Qwen/Qwen1.5-7B | 43.99 | 85.54 | 39.47 | 18.09 | 29.47 | 47.37 | 1.0-0.1.2 |
| Qwen/Qwen1.5-MoE-A2.7B | 43.12 | 69.29 | 42.11 | 21.61 | 28.21 | 54.39 | 1.0-0.1 |
| stabilityai/japanese-stablelm-base-beta-70b | 43.11 | 79.05 | 36.84 | 16.08 | 25.68 | 57.89 | 1.0-0.1.2 |
| Qwen/Qwen1.5-4B | 42.68 | 82.82 | 42.11 | 13.82 | 29.05 | 45.61 | 1.0-0.1.2 |
| rinna/llama-3-youko-8b | 42.54 | 79.22 | 42.11 | 17.84 | 29.68 | 43.86 | default |
| Qwen/Qwen2-1.5B | 42.21 | 77.46 | 44.74 | 13.82 | 25.89 | 49.12 | 1.0-0.1.2 |
| Qwen/Qwen2-1.5B-Instruct | 42.20 | 74.08 | 44.74 | 13.57 | 29.47 | 49.12 | default |
| meta-llama/Meta-Llama-3-8B | 42.13 | 85.77 | 36.84 | 19.85 | 26.11 | 42.11 | default |
| meta-llama/Llama-2-70b-hf | 41.96 | 84.07 | 34.21 | 16.83 | 29.05 | 45.61 | 1.0-0.1.2 |
| sbintuitions/sarashina2-13b | 41.79 | 82.84 | 26.32 | 19.10 | 26.32 | 54.39 | 1.0-0.1.2 |
| cyberagent/calm2-7b-chat-dpo-experimental | 41.71 | 77.96 | 34.21 | 15.83 | 29.68 | 50.88 | 1.0-0.1 |
| rinna/nekomata-7b | 41.55 | 81.34 | 31.58 | 20.85 | 24.84 | 49.12 | default |
| stabilityai/japanese-stablelm-instruct-gamma-7b | 41.46 | 79.09 | 31.58 | 17.34 | 33.68 | 45.61 | 1.0-0.2.1 |
| tokyotech-llm/Swallow-MS-7b-v0.1 | 41.37 | 79.22 | 23.68 | 17.09 | 25.47 | 61.40 | 1.0-0.2.1 |
| llm-jp/llm-jp-13b-instruct-full-jaster-v1.0 | 41.36 | 84.48 | 34.21 | 21.11 | 23.16 | 43.86 | 1.0-0.1 |
| Qwen/Qwen1.5-4B-Chat | 41.26 | 78.40 | 39.47 | 13.57 | 29.26 | 45.61 | 1.0-0.1.2 |
| llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 | 41.10 | 82.28 | 28.95 | 13.57 | 26.32 | 54.39 | 1.0-0.3 |
| karakuri-ai/karakuri-lm-70b-v0.1 | 41.04 | 58.60 | 39.47 | 18.09 | 31.16 | 57.89 | default |
| tokyotech-llm/Swallow-7b-NVE-hf | 41.03 | 81.34 | 39.47 | 20.10 | 27.37 | 36.84 | 1.0-0.1 |
| mosaicml/mpt-30b-instruct | 40.95 | 83.25 | 34.21 | 19.60 | 27.37 | 40.35 | default |
| Fugaku-LLM/Fugaku-LLM-13B-instruct | 40.90 | 81.91 | 42.11 | 12.81 | 23.79 | 43.86 | 1.0-0.1 |
| meta-llama/Llama-2-7b-chat-hf | 40.67 | 80.32 | 28.95 | 19.85 | 23.37 | 50.88 | default |
| elyza/ELYZA-japanese-Llama-2-7b-instruct | 40.59 | 81.39 | 36.84 | 18.84 | 23.79 | 42.11 | default |
| sbintuitions/sarashina2-7b | 40.51 | 85.12 | 39.47 | 12.56 | 25.05 | 40.35 | 1.0-0.1 |
| rinna/youri-7b-chat | 40.40 | 85.08 | 26.32 | 17.84 | 27.16 | 45.61 | default |
| meta-llama/Llama-2-13b-chat-hf | 40.29 | 80.36 | 39.47 | 13.82 | 25.68 | 42.11 | 1.0-0.1 |
| Rakuten/RakutenAI-7B | 40.29 | 71.87 | 31.58 | 15.33 | 31.79 | 50.88 | 1.0-0.1 |
| tokyotech-llm/Swallow-13b-instruct-hf | 40.24 | 80.08 | 42.11 | 13.82 | 24.84 | 40.35 | 1.0-0.2 |
| stabilityai/japanese-stablelm-base-gamma-7b | 40.17 | 74.80 | 31.58 | 18.34 | 30.53 | 45.61 | 1.0-0.2.1 |
| lmsys/vicuna-7b-v1.5-16k | 39.91 | 79.91 | 28.95 | 16.33 | 25.26 | 49.12 | 1.0-0.1 |
| cyberagent/calm2-7b | 39.80 | 78.27 | 31.58 | 16.58 | 26.95 | 45.61 | 1.0-0.1 |
| elyza/ELYZA-japanese-Llama-2-7b | 39.78 | 79.76 | 36.84 | 13.82 | 24.63 | 43.86 | default |
| cyberagent/calm2-7b-chat | 39.68 | 79.97 | 31.58 | 16.83 | 24.42 | 45.61 | 1.0-0.2 |
| Xwin-LM/Xwin-LM-7B-V0.2 | 39.62 | 67.64 | 34.21 | 17.59 | 27.79 | 50.88 | 1.0-0.2.1 |
| tokyotech-llm/Swallow-7b-NVE-instruct-hf | 39.56 | 74.24 | 34.21 | 18.34 | 27.16 | 43.86 | 1.0-0.1 |
| tokyotech-llm/Swallow-13b-NVE-hf | 39.49 | 60.92 | 31.58 | 15.08 | 32.00 | 57.89 | 1.0-0.1 |
| rinna/youri-7b-instruction | 39.47 | 78.82 | 36.84 | 19.10 | 24.00 | 38.60 | 1.0-0.3 |
| elyza/ELYZA-japanese-Llama-2-13b-instruct | 39.42 | 73.46 | 34.21 | 14.32 | 29.47 | 45.61 | 1.0-0.1 |
| lmsys/vicuna-13b-v1.3 | 39.20 | 78.86 | 31.58 | 16.58 | 23.37 | 45.61 | 1.0-0.2 |
| elyza/ELYZA-japanese-Llama-2-13b-fast-instruct | 39.08 | 55.28 | 47.37 | 18.84 | 26.53 | 47.37 | 1.0-0.1 |
| rinna/japanese-gpt-neox-3.6b-instruction-ppo | 38.90 | 73.66 | 34.21 | 14.07 | 26.95 | 45.61 | default |
| mistralai/Mistral-7B-Instruct-v0.1 | 38.86 | 79.85 | 31.58 | 14.82 | 24.21 | 43.86 | default |
| lmsys/vicuna-7b-v1.3 | 38.51 | 76.81 | 23.68 | 15.08 | 26.11 | 50.88 | 1.0-0.1 |
| elyza/ELYZA-japanese-Llama-2-13b | 38.43 | 76.69 | 36.84 | 14.07 | 24.21 | 40.35 | default |
| mosaicml/mpt-30b-chat | 38.30 | 74.85 | 26.32 | 18.34 | 24.63 | 47.37 | default |
| lmsys/vicuna-33b-v1.3 | 38.28 | 66.31 | 26.32 | 17.59 | 25.05 | 56.14 | 1.0-0.1 |
| rinna/bilingual-gpt-neox-4b-instruction-sft | 38.17 | 77.67 | 23.68 | 17.59 | 26.32 | 45.61 | default |
| stabilityai/japanese-stablelm-3b-4e1t-instruct | 38.13 | 68.37 | 34.21 | 16.33 | 26.11 | 45.61 | 1.0-0.1 |
| stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b | 38.06 | 75.29 | 28.95 | 15.83 | 24.63 | 45.61 | 1.0-0.1.2 |
| lmsys/longchat-7b-v1.5-32k | 37.89 | 79.53 | 31.58 | 14.07 | 25.68 | 38.60 | 1.0-0.2.1 |
| llm-jp/llm-jp-13b-v2.0 | 37.82 | 71.12 | 34.21 | 16.33 | 23.58 | 43.86 | 1.0-0.6 |
| rinna/japanese-gpt-neox-3.6b-instruction-sft | 37.73 | 73.00 | 23.68 | 18.84 | 24.00 | 49.12 | 1.0-0.2.1 |
| openai/text-davinci-003 | 37.68 | 53.92 | 44.74 | 17.59 | 26.53 | 45.61 | default |
| tokyotech-llm/Swallow-13b-hf | 37.54 | 61.28 | 28.95 | 16.08 | 25.26 | 56.14 | 1.0-0.1 |
| mistralai/Mistral-7B-v0.1 | 37.45 | 74.75 | 26.32 | 17.34 | 26.74 | 42.11 | 1.0-0.1.2 |
| rinna/youri-7b | 37.39 | 68.04 | 31.58 | 19.85 | 27.16 | 40.35 | 1.0-0.1 |
| mosaicml/mpt-30b | 37.35 | 76.95 | 23.68 | 16.83 | 27.16 | 42.11 | 1.0-0.2.1 |
| tokyotech-llm/Swallow-7b-plus-hf | 37.25 | 79.04 | 31.58 | 12.81 | 24.21 | 38.60 | 1.0-0.1.2 |
| moneyforward/houou-instruction-7b-v3 | 37.22 | 73.42 | 26.32 | 16.58 | 25.89 | 43.86 | 1.0-0.1.2 |
| Rakuten/RakutenAI-7B-chat | 37.21 | 61.30 | 26.32 | 17.34 | 32.00 | 49.12 | 1.0-0.3 |
| Qwen/Qwen1.5-1.8B | 37.03 | 69.33 | 28.95 | 19.10 | 25.68 | 42.11 | 1.0-0.1 |
| google/recurrentgemma-2b-it | 36.94 | 61.04 | 36.84 | 17.84 | 23.37 | 45.61 | 1.0-0.2.1 |
| google/gemma-2b | 36.93 | 67.09 | 28.95 | 15.08 | 24.42 | 49.12 | 1.0-0.6 |
| meta-llama/Llama-2-7b-hf | 36.89 | 71.97 | 31.58 | 13.82 | 26.74 | 40.35 | 1.0-0.2 |
| llm-jp/llm-jp-1.3b-v1.0 | 36.81 | 57.66 | 31.58 | 18.34 | 27.37 | 49.12 | 1.0-0.1 |
| google/gemma-1.1-2b-it | 36.47 | 61.68 | 34.21 | 13.32 | 24.00 | 49.12 | 1.0-0.2.1 |
| stabilityai/japanese-stablelm-base-beta-7b | 36.36 | 62.03 | 36.84 | 15.33 | 25.47 | 42.11 | 1.0-0.1.2 |
| matsuo-lab/weblab-10b | 36.31 | 69.82 | 31.58 | 13.82 | 24.21 | 42.11 | default |
| rinna/bilingual-gpt-neox-4b-instruction-ppo | 36.23 | 74.15 | 23.68 | 15.33 | 25.89 | 42.11 | 1.0-0.1 |
| google/gemma-2b-it | 36.17 | 66.75 | 28.95 | 15.33 | 24.21 | 45.61 | 1.0-0.1 |
| moneyforward/houou-instruction-7b-v2 | 36.15 | 72.26 | 28.95 | 14.82 | 26.11 | 38.60 | 1.0-0.1 |
| sbintuitions/sarashina1-7b | 36.11 | 58.91 | 39.47 | 13.82 | 22.74 | 45.61 | 1.0-0.1 |
| stockmark/stockmark-100b-instruct-v0.1 | 36.09 | 73.46 | 26.32 | 14.07 | 22.74 | 43.86 | default |
| rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 | 36.06 | 68.52 | 21.05 | 17.59 | 24.00 | 49.12 | 1.0-0.2.1 |
| stabilityai/japanese-stablelm-base-ja_vocab-beta-7b | 36.02 | 63.14 | 36.84 | 13.82 | 24.21 | 42.11 | default |
| Qwen/Qwen1.5-1.8B-Chat | 35.98 | 65.54 | 26.32 | 16.83 | 27.37 | 43.86 | 1.0-0.2 |
| moneyforward/houou-instruction-7b-v1 | 35.45 | 66.86 | 26.32 | 16.33 | 27.37 | 40.35 | 1.0-0.1 |
| llm-jp/llm-jp-13b-instruct-full-dolly-oasst-v1.0 | 35.40 | 66.91 | 23.68 | 13.07 | 24.21 | 49.12 | 1.0-0.6 |
| lmsys/vicuna-13b-v1.5-16k | 35.36 | 69.08 | 26.32 | 13.82 | 25.47 | 42.11 | 1.0-0.2 |
| stockmark/stockmark-13b | 35.33 | 59.20 | 31.58 | 15.83 | 24.42 | 45.61 | 1.0-0.1 |
| pfnet/plamo-13b-instruct | 35.27 | 63.10 | 26.32 | 16.08 | 25.26 | 45.61 | 1.0-0.6 |
| stockmark/stockmark-13b-instruct | 34.98 | 54.32 | 28.95 | 15.83 | 28.42 | 47.37 | 1.0-0.1 |
| stockmark/stockmark-100b | 34.97 | 68.63 | 26.32 | 13.82 | 24.00 | 42.11 | default |
| tokyotech-llm/Swallow-7b-instruct-hf | 34.88 | 49.40 | 31.58 | 20.60 | 25.47 | 47.37 | default |
| cyberagent/open-calm-large | 34.81 | 53.58 | 28.95 | 16.83 | 23.79 | 50.88 | 1.0-0.1 |
| meta-llama/Llama-2-13b-hf | 34.75 | 56.30 | 36.84 | 13.32 | 26.95 | 40.35 | 1.0-0.2.1 |
| llm-jp/llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0 | 34.63 | 56.84 | 31.58 | 16.33 | 26.32 | 42.11 | 1.0-0.5 |
| stabilityai/japanese-stablelm-3b-4e1t-base | 34.58 | 52.32 | 34.21 | 15.58 | 26.95 | 43.86 | 1.0-0.1 |
| elyza/ELYZA-japanese-Llama-2-7b-fast | 34.49 | 37.54 | 36.84 | 17.59 | 26.11 | 54.39 | 1.0-0.1 |
| llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0 | 34.40 | 52.96 | 28.95 | 18.59 | 25.89 | 45.61 | 1.0-0.5 |
| llm-jp/llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0 | 34.35 | 58.90 | 31.58 | 17.84 | 24.84 | 38.60 | 1.0-0.5 |
| pfnet/plamo-13b | 34.26 | 59.69 | 28.95 | 12.81 | 24.21 | 45.61 | 1.0-0.6 |
| stabilityai/japanese-stablelm-instruct-alpha-7b | 34.20 | 53.43 | 26.32 | 15.83 | 26.32 | 49.12 | 1.0-0.3 |
| elyza/ELYZA-japanese-Llama-2-13b-fast | 34.06 | 59.12 | 31.58 | 14.82 | 24.42 | 40.35 | default |
| stabilityai/japanese-stablelm-instruct-beta-7b | 33.87 | 53.64 | 36.84 | 13.82 | 22.95 | 42.11 | 1.0-0.2 |
| rinna/bilingual-gpt-neox-4b | 33.79 | 58.63 | 31.58 | 14.82 | 23.58 | 40.35 | 1.0-0.4 |
| Qwen/Qwen2-0.5B-Instruct | 33.72 | 55.33 | 28.95 | 15.08 | 21.89 | 47.37 | 1.0-0.6 |
| sbintuitions/sarashina1-13b | 33.70 | 45.20 | 36.84 | 16.83 | 24.00 | 45.61 | 1.0-0.2.1 |
| rinna/japanese-gpt-neox-3.6b | 33.57 | 45.72 | 23.68 | 14.57 | 24.21 | 59.65 | 1.0-0.5 |
| Xwin-LM/Xwin-LM-13B-V0.2 | 33.56 | 40.33 | 42.11 | 15.83 | 25.68 | 43.86 | 1.0-0.1 |
| sbintuitions/sarashina1-65b | 33.55 | 57.20 | 21.05 | 14.82 | 29.05 | 45.61 | 1.0-0.1 |
| pfnet/plamo-13b-instruct-nc | 33.18 | 54.15 | 23.68 | 16.33 | 26.11 | 45.61 | 1.0-0.6 |
| Fugaku-LLM/Fugaku-LLM-13B | 32.89 | 55.36 | 28.95 | 12.06 | 24.21 | 43.86 | 1.0-0.6 |
| google/gemma-7b-it | 32.41 | 53.15 | 26.32 | 17.34 | 23.16 | 42.11 | default |
| llm-jp/llm-jp-13b-v1.0 | 32.36 | 60.76 | 21.05 | 13.07 | 24.84 | 42.11 | 1.0-0.6 |
| elyza/ELYZA-japanese-Llama-2-7b-fast-instruct | 32.18 | 36.16 | 39.47 | 18.59 | 26.32 | 40.35 | 1.0-0.1.2 |
| line-corporation/japanese-large-lm-1.7b | 32.10 | 46.77 | 34.21 | 13.82 | 23.58 | 42.11 | 1.0-0.4 |
| cyberagent/open-calm-medium | 32.02 | 49.12 | 26.32 | 13.32 | 24.00 | 47.37 | 1.0-0.2.1 |
| google/recurrentgemma-2b | 31.84 | 49.51 | 26.32 | 15.08 | 24.42 | 43.86 | 1.0-0.6 |
| google/gemma-7b | 31.75 | 48.91 | 23.68 | 16.33 | 24.21 | 45.61 | 1.0-0.3 |
| tokyotech-llm/Swallow-7b-hf | 31.59 | 42.00 | 28.95 | 16.33 | 25.05 | 45.61 | 1.0-0.1 |
| line-corporation/japanese-large-lm-1.7b-instruction-sft | 31.51 | 50.50 | 26.32 | 13.32 | 23.58 | 43.86 | 1.0-0.5 |
| google/gemma-1.1-7b-it | 31.36 | 36.68 | 28.95 | 17.09 | 26.74 | 47.37 | 1.0-0.2 |
| sbintuitions/tiny-lm-chat | 31.20 | 46.74 | 26.32 | 13.82 | 25.26 | 43.86 | default |
| karakuri-ai/karakuri-lm-7b-apm-v0.2 | 31.10 | 35.95 | 36.84 | 18.84 | 25.26 | 38.60 | 1.0-0.2 |
| stockmark/gpt-neox-japanese-1.4b | 31.07 | 51.10 | 26.32 | 15.83 | 25.26 | 36.84 | 1.0-0.6 |
| llm-jp/llm-jp-13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1 | 30.87 | 42.11 | 23.68 | 18.09 | 24.84 | 45.61 | 1.0-0.4 |
| Qwen/Qwen1.5-0.5B | 30.82 | 50.40 | 21.05 | 15.58 | 26.74 | 40.35 | 1.0-0.6 |
| cyberagent/open-calm-3b | 30.76 | 37.49 | 26.32 | 15.33 | 23.79 | 50.88 | 1.0-0.1 |
| stabilityai/japanese-stablelm-instruct-alpha-7b-v2 | 30.55 | 35.95 | 26.32 | 17.09 | 22.53 | 50.88 | 1.0-0.2.1 |
| cyberagent/open-calm-1b | 30.46 | 30.08 | 28.95 | 16.83 | 23.79 | 52.63 | 1.0-0.1 |
| sbintuitions/tiny-lm | 30.30 | 40.42 | 21.05 | 19.60 | 24.84 | 45.61 | 1.0-0.1.2 |
| abeja/gpt-neox-japanese-2.7b | 30.17 | 40.43 | 31.58 | 14.07 | 24.42 | 40.35 | 1.0-0.1.2 |
| stabilityai/japanese-stablelm-base-alpha-7b | 30.16 | 35.95 | 31.58 | 16.33 | 24.84 | 42.11 | default |
| Qwen/Qwen1.5-0.5B-Chat | 29.98 | 36.69 | 34.21 | 15.33 | 25.05 | 38.60 | 1.0-0.1 |
| line-corporation/japanese-large-lm-3.6b-instruction-sft | 29.54 | 35.95 | 26.32 | 14.07 | 24.00 | 47.37 | 1.0-0.2.1 |
| line-corporation/japanese-large-lm-3.6b | 29.54 | 35.95 | 26.32 | 14.07 | 24.00 | 47.37 | 1.0-0.1 |
| Qwen/Qwen2-0.5B | 29.49 | 35.98 | 28.95 | 17.34 | 24.84 | 40.35 | 1.0-0.2 |
| cyberagent/open-calm-small | 29.48 | 35.95 | 23.68 | 18.59 | 23.58 | 45.61 | 1.0-0.6 |
| cyberagent/open-calm-7b | 28.80 | 37.83 | 28.95 | 13.07 | 23.79 | 40.35 | 1.0-0.4 |
<!-- lb end -->

Note: Prompt selection is performed for all models except OpenAI models. For OpenAI models, a response is counted as wrong when the content filter is triggered.
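The Ave. column appears to be the unweighted mean of the five task scores. A minimal sketch (scores copied from two rows of the table above; the dictionary and rounding are illustrative, not part of the harness):

```python
# Per-task scores in leaderboard column order:
# chabsa, cma_basics, cpa_audit, fp2, security_sales_1
scores = {
    "anthropic/claude-3-5-sonnet": [93.43, 81.58, 61.81, 72.84, 75.44],
    "openai/gpt-4o": [90.93, 76.32, 53.02, 39.37, 66.67],
}

# The average matches the published "Ave." column to two decimals.
for model, s in scores.items():
    ave = round(sum(s) / len(s), 2)
    print(f"{model}: {ave}")
```

Running this reproduces 77.02 and 65.26, the Ave. values shown for these two models.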

We recently updated the evaluation policy. Please refer to UPDATE.md for details.

How to evaluate your model

  1. git clone this repository
  2. Install the requirements
    poetry install
    
  3. Choose your prompt template based on docs/prompt_templates.md and the number of few-shot examples. (For this official leaderboard, we use the prompt template that achieves the best score.)
  4. Replace TEMPLATE with the chosen version, set MODEL_PATH to your model path, and save the script as harness.sh
    MODEL_ARGS="pretrained=MODEL_PATH,other_options"
    TASK="chabsa-1.0-TEMPLATE,cma_basics-1.0-TEMPLATE,cpa_audit-1.0-TEMPLATE,security_sales_1-1.0-0.2,fp2-1.0-TEMPLATE"
    python main.py --model hf --model_args $MODEL_ARGS --tasks $TASK --num_fewshot 0 --output_path "result.json"
    
  5. Run the script
    poetry run bash harness.sh
    
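The TEMPLATE substitution in step 4 can be sketched with a small helper (`build_task_string` is a hypothetical function for illustration, not part of the harness); note that security_sales_1 stays pinned at version 0.2, as in the script above:

```python
# Tasks that take the chosen TEMPLATE version, in the order used in step 4.
TEMPLATED_TASKS = ["chabsa", "cma_basics", "cpa_audit", "fp2"]

def build_task_string(template: str) -> str:
    """Fill TEMPLATE into each task name; security_sales_1 is fixed at 0.2."""
    parts = [f"{t}-1.0-{template}" for t in TEMPLATED_TASKS]
    # security_sales_1 comes fourth in the official TASK string.
    parts.insert(3, "security_sales_1-1.0-0.2")
    return ",".join(parts)

print(build_task_string("0.3"))
# chabsa-1.0-0.3,cma_basics-1.0-0.3,cpa_audit-1.0-0.3,security_sales_1-1.0-0.2,fp2-1.0-0.3
```

The resulting string can be passed directly as the TASK variable in harness.sh.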

vLLM is also supported. Please refer to the model examples and the official lm_eval documentation.

Model Regulation

Citation

If you use this repository, please cite the following paper:

@preprint{Hirano2023-pre-finllm,
  title={{金融分野における言語モデル性能評価のための日本語金融ベンチマーク構築}},
  author={平野, 正徳},
  doi={10.51094/jxiv.564},
  year={2023}
}
@inproceedings{Hirano2023-finnlpkdf,
  title={{Construction of a Japanese Financial Benchmark for Large Language Models}},
  author={Masanori Hirano},
  booktitle={Joint Workshop of the 7th Financial Technology and Natural Language Processing (FinNLP), the 5th Knowledge Discovery from Unstructured Data in Financial Services (KDF), and The 4th Workshop on Economics and Natural Language Processing (ECONLP)},
  pages={1-9},
  doi={10.2139/ssrn.4769124},
  url={https://aclanthology.org/2024.finnlp-1.1},
  archivePrefix={arXiv},
  arxivId={2403.15062},
  year={2024}
}

Or cite this repository directly:

@misc{Hirano2023-jlfh,
  title={{Japanese Language Model Financial Evaluation Harness}},
  author={Masanori Hirano},
  year={2023},
  url={https://github.com/pfnet-research/japanese-lm-fin-harness}
}

Note:

The cpa_audit data comes from an existing collection of Japanese CPA Audit exam questions and answers [1]. This dataset was built using data from the Institute of Certified Public Accountants and the Auditing Oversight Board website and is subject to a CC-BY 4.0 license. We obtained special permission to include this data directly in this evaluation, and we thank the authors for their contribution.

[1] Tatsuki Masuda, Kei Nakagawa, Takahiro Hoshino, Can ChatGPT pass the JCPA exam?: Challenge for the short-answer method test on Auditing, JSAI Technical Report, Type 2 SIG, 2023, Volume 2023, Issue FIN-031, Pages 81-88, Released on J-STAGE October 12, 2023, Online ISSN 2436-5556, https://doi.org/10.11517/jsaisigtwo.2023.FIN-031_81

Contribution

This project is owned by Preferred Networks and maintained by Masanori Hirano.

If you would like to add models or evaluation datasets, please let me know via issues or pull requests.