<div style="text-align:center"> <!-- <img src="https://big-cheng.com/k2/k2.png" alt="k2-logo" width="200"/> --> <h2>📈 CFBenchmark: Chinese Financial Assistant Benchmark for Large Language Model</h2> </div> <div align="left"> <a href='https://arxiv.org/abs/2311.05812'><img src='https://img.shields.io/badge/Paper-ArXiv-C71585'></a> <a href='https://huggingface.co/datasets/TongjiFinLab/CFBenchmark'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging Face-CFBenchmark-red'></a> <a href=''><img src='https://img.shields.io/badge/License-Apache--2.0-blue.svg'></a> </div>

English | 简体中文

Introduction

Welcome to CFBenchmark

In recent years, the rapid development of Large Language Models (LLMs) has brought outstanding performance on a wide range of tasks. However, only a limited number of benchmarks focus on assessing the performance of LLMs in specific domains.

The basic version of the "InternLM·JiShi" Chinese Financial Evaluation Benchmark (CFBenchmark) consists of data from CFBenchmark-Basic and CFBenchmark-OpenFinData. It evaluates the capabilities and safety of large models in practical financial applications along four aspects: financial natural language, financial scenario calculation, financial analysis and interpretation, and financial compliance and safety.

In the future, the "InternLM·JiShi" Chinese Financial Evaluation Benchmark will continue to deepen the construction of the evaluation system for large financial models, including assessing the accuracy, timeliness, safety, privacy, and compliance of model-generated content in financial industry applications.

<div align="center"> <img src="imgs/Framework.png" width="100%"/> <br /> <br /></div>

News

[2024.03.17] We added evaluation on the financial dataset CFBenchmark-OpenFinData and released an implementation for evaluating the subjective questions in this dataset. We also reported the test results of 9 LLMs on the OpenFinData dataset.

OpenFinData is publicly released by EastMoney.com and Shanghai AI Lab. See the details on GitHub.

[2023.11.10] We released CFBenchmark-Basic and the corresponding technical report, mainly focusing on a comprehensive evaluation of large models in financial natural language tasks and financial text generation tasks.

Contents

CFBenchmark-Basic

CFBenchmark-Basic includes 3917 financial texts covering eight tasks, organized into three aspects: financial recognition, financial classification, and financial generation.

We provide two examples to show how the few-shot and zero-shot settings work during evaluation.

Example 1: Few-shot input

<div align="center"> <img src="imgs/fewshot.png" width="100%"/> <br /> <br /></div>

Example 2: Zero-shot input

<div align="center"> <img src="imgs/zeroshot.png" width="100%"/> <br /> <br /></div>

QuickStart

Installation

Below are the steps for quick installation.

    conda create --name CFBenchmark python=3.10
    conda activate CFBenchmark
    git clone https://github.com/TongjiFinLab/CFBenchmark
    cd CFBenchmark
    pip install -r requirements.txt

Evaluation

CFBenchmark-Basic

The testing and evaluation code for CFBenchmark-Basic is provided in CFBenchmark-Basic/src.

To begin the evaluation, you can run the following code from the command line:

cd CFBenchmark-Basic/src
python run.py

You can edit the parameters in CFBenchmark-Basic/src/run.py so that the model, input, and output paths match your setup.

from CFBenchmark import CFBenchmark

if __name__ == '__main__':

    # EXPERIMENT SETUP
    modelname = 'YOUR-MODEL-NAME'
    model_type = 'NORMAL'  # NORMAL or LoRA
    model_path = 'YOUR-MODEL-PATH'
    peft_model_path = ''  # pass the path of your own PEFT model if needed
    fewshot_text_path = '../fewshot'  # default path
    test_type = 'few-shot'  # the few-shot test is used as the example here
    response_path = '../cfbenchmark-response'  # where the responses of your model are saved
    scores_path = '../cfbenchmark-scores'  # where the scores of your model are saved
    embedding_model_path = '../bge-zh-v1.5'  # pass the path of your own bge-zh-v1.5 model
    benchmark_path = '../data'  # default path

    # instantiate the CFBenchmark class
    cfb = CFBenchmark(
        model_name=modelname,
        model_type=model_type,
        model_path=model_path,
        peft_model_path=peft_model_path,
        fewshot_text_path=fewshot_text_path,
        test_type=test_type,
        response_path=response_path,
        scores_path=scores_path,
        embedding_model_path=embedding_model_path,
        benchmark_path=benchmark_path,
    )

    cfb.generate_model()  # get the responses from your model
    cfb.get_test_scores()  # compute your model's scores from the responses
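Before launching a long run, it can help to confirm that the input directories configured in run.py exist. The following is a minimal sketch, not part of the repository: `find_missing_paths` is a hypothetical helper, and the paths shown are simply the defaults from the script above. `model_path` is left out because it may not be a local directory.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the labels of configured paths that do not exist on disk."""
    return [label for label, path in paths.items() if not Path(path).exists()]


if __name__ == "__main__":
    # Default input locations taken from run.py; adjust them to your setup.
    config = {
        "fewshot_text_path": "../fewshot",
        "embedding_model_path": "../bge-zh-v1.5",
        "benchmark_path": "../data",
    }
    missing = find_missing_paths(config)
    if missing:
        print("Missing input paths:", ", ".join(missing))
    else:
        print("All input paths found.")
```

Run it from CFBenchmark-Basic/src so that the relative paths resolve the same way they do for run.py.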

The evaluation is implemented by the CFBenchmark class, whose constructor takes the parameters configured in run.py:

class CFBenchmark:
    def __init__(self,
                 model_name,
                 model_type,
                 model_path,
                 peft_model_path,
                 fewshot_text_path,
                 test_type,
                 response_path,
                 scores_path,
                 embedding_model_path,
                 benchmark_path
                 ) -> None:
        ...  # constructor body not shown here

CFBenchmark-OpenFinData

The CFBenchmark-OpenFinData directory contains the code and data for testing and evaluation. The evaluation code is designed similarly to Fineva1.0: CFBenchmark-OpenFinData/src/evaluator defines how the evaluated model is called, and the bash files in CFBenchmark-OpenFinData/run_scripts configure the key parameters and run the experiments.

To run the evaluation, you can execute the following code in the command line:

cd CFBenchmark-OpenFinData/run_scripts
sh run_baichuan2_7b.sh

Note that because the evaluation of OpenFinData involves subjective judgement, our evaluation framework uses ERNIE to score the financial interpretation and analysis questions as well as the financial compliance questions. To use the ERNIE API for evaluation, set BAIDU_API_KEY and BAIDU_SECRET_KEY as environment variables so that the get_access_token function in CFBenchmark-OpenFinData/src/get_score.py can run successfully.

import json
import os

import requests


def get_access_token():
    """
    Obtain an access_token using the API Key and Secret Key,
    read from the BAIDU_API_KEY and BAIDU_SECRET_KEY environment variables.
    """

    url = "https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={}&client_secret={}".format(os.environ.get("BAIDU_API_KEY"), os.environ.get("BAIDU_SECRET_KEY"))

    # The credentials travel in the URL, so the request body is an empty JSON string.
    payload = json.dumps("")
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)
    return response.json().get("access_token")
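Because get_access_token reads both keys with os.environ.get, an unset variable ends up as the literal text None in the request URL instead of raising an error. A minimal pre-flight check is sketched below, assuming only the two variable names given above; `missing_credentials` is a hypothetical helper and is not part of the repository.

```python
import os

REQUIRED_VARS = ("BAIDU_API_KEY", "BAIDU_SECRET_KEY")


def missing_credentials(environ=None):
    """Return the required variable names that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Please set these environment variables:", ", ".join(missing))
    else:
        print("ERNIE API credentials found.")
```

Running this before the evaluation script gives a clear message when a key is missing, rather than a failed token request later on.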

Performance of Existing LLMs

We utilize different metrics to evaluate the performance of LLMs in the financial domain on our CFBenchmark.

For recognition and classification tasks in CFBenchmark-Basic, we employ the F1 score as the evaluation metric, which balances precision and recall.

For the generation tasks in CFBenchmark-Basic, we use the cosine similarity between the vector representations (generated by bge-zh-v1.5) of the ground truth and the generated answer to measure the generation ability.

For knowledge, calculation, and identification in CFBenchmark-OpenFinData, we calculate the accuracy on multiple-choice questions.

For explanation, analysis, and compliance in CFBenchmark-OpenFinData, we use ERNIE-Bot-4 as the scorer to judge the correctness of the generated answer against the ground truth.
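The first three metrics above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repository's scoring code: the F1 here is a simple set-based variant, the toy vectors stand in for bge-zh-v1.5 embeddings, the answer normalization (strip and uppercase) is an assumption, and all function names are hypothetical.

```python
import math


def f1_score(predicted, gold):
    """Set-based F1: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def multiple_choice_accuracy(predictions, answers):
    """Fraction of predictions that match the reference option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    # One of two predicted entities is correct, and one of two gold entities is found.
    print(f1_score(["Company A", "Company B"], ["Company A", "Company C"]))  # 0.5
    # Identical toy "embeddings" have similarity 1.0.
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
    # Three of four options match after normalization.
    print(multiple_choice_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"]))  # 0.75
```

The ERNIE-Bot-4 scoring of subjective answers depends on an external API and is not sketched here.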

The performance of the LLMs is shown below:

CFBenchmark-Basic

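| Model | Size | Company | Product | R.Avg | Sector | Event | Sentiment | C.Avg | Summary | Risk | Suggestion | G.Avg | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 | - | 79.7 | 19.8 | 49.8 | 45.3 | 45.8 | 42.5 | 45.5 | 59.3 | 54.1 | 77.1 | 63.5 | 52.9 |
| GPT-4 | - | 83.3 | 38.2 | 60.8 | 48.2 | 50.1 | 49.9 | 49.4 | 65.3 | 60.2 | 79.2 | 68.2 | 59.4 |
| ERNIE-Bot-3.5 | - | 80.7 | 30.0 | 53.3 | 40.8 | 35.0 | 18.6 | 31.5 | 71.5 | 59.0 | 71.6 | 67.3 | 50.7 |
| ERNIE-Bot-4 | - | 81.9 | 41.7 | 61.8 | 41.8 | 35.8 | 37.5 | 38.4 | 72.1 | 62.9 | 71.8 | 68.9 | 56.4 |
| ChatGLM2-6B | 6B | 74.7 | 31.3 | 53.0 | 28.5 | 30.0 | 35.7 | 31.4 | 65.7 | 45.4 | 67.1 | 59.4 | 47.9 |
| ChatGLM3-6B | 6B | 75.1 | 25.2 | 50.2 | 33.5 | 32.7 | 39.7 | 35.3 | 68.4 | 53.6 | 70.5 | 64.2 | 49.9 |
| GLM4-9B-Chat | 9B | 81.3 | 26.1 | 53.7 | 49.6 | 51.5 | 47.6 | 49.6 | 73.5 | 62.4 | 72.6 | 69.5 | 57.6 |
| Qwen-Chat-7B | 7B | 76.3 | 36.0 | 56.2 | 40.0 | 36.7 | 26.5 | 34.4 | 54.8 | 30.7 | 37.9 | 41.1 | 43.9 |
| Qwen1.5-Chat-7B | 7B | 83.5 | 35.3 | 59.4 | 34.3 | 37.5 | 51.6 | 41.1 | 73.7 | 58.7 | 73.1 | 68.5 | 56.3 |
| Qwen2-Chat-7B | 7B | 82.4 | 34.8 | 58.6 | 54.4 | 49.9 | 41.1 | 48.5 | 75.0 | 55.9 | 76.9 | 69.2 | 58.8 |
| Baichuan2-7B-Chat | 7B | 75.7 | 40.2 | 57.9 | 42.5 | 47.5 | 32.3 | 40.8 | 72.5 | 64.8 | 73.2 | 70.2 | 56.3 |
| Baichuan2-13B-Chat | 13B | 79.7 | 31.4 | 55.6 | 47.2 | 50.7 | 38.7 | 45.5 | 73.9 | 63.4 | 74.6 | 70.6 | 57.2 |
| InternLM2-7B-Chat | 7B | 75.7 | 19.5 | 47.6 | 46.4 | 28.4 | 42.2 | 39.0 | 73.7 | 54.3 | 74.9 | 67.6 | 51.4 |
| InternLM2-20B-Chat | 20B | 74.2 | 27.6 | 50.9 | 48.4 | 32.4 | 37.4 | 39.4 | 73.2 | 58.0 | 74.1 | 68.4 | 52.9 |
| InternLM2.5-7B-Chat | 7B | 75.2 | 24.3 | 49.8 | 53.1 | 34.3 | 45.7 | 44.4 | 74.5 | 57.0 | 73.2 | 68.2 | 54.1 |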

CFBenchmark-OpenFinData

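| Model | Size | Knowledge | Calculation | Explanation | Identification | Analysis | Compliance | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 | - | 77.2 | 68.8 | 81.9 | 76.3 | 75.1 | 35.8 | 63.9 |
| GPT-4 | - | 89.2 | 77.2 | 84.4 | 76.9 | 82.5 | 39.2 | 74.9 |
| ERNIE-Bot-3.5 | - | 78.0 | 70.4 | 82.1 | 75.3 | 77.7 | 36.7 | 70.0 |
| ERNIE-Bot-4 | - | 87.3 | 73.6 | 84.3 | 77.0 | 79.1 | 37.3 | 73.1 |
| ChatGLM2-6B | 6B | 62.4 | 37.2 | 70.8 | 59.2 | 58.3 | 38.7 | 54.4 |
| ChatGLM3-6B | 6B | 66.5 | 38.0 | 76.5 | 61.5 | 60.1 | 32.0 | 55.8 |
| GLM4-9B-Chat | 9B | 81.8 | 56.9 | 79.3 | 63.5 | 78.2 | 29.5 | 64.9 |
| Qwen-Chat-7B | 7B | 71.3 | 40.5 | 71.4 | 58.6 | 51.3 | 40.0 | 55.5 |
| Qwen1.5-Chat-7B | 7B | 67.3 | 53.9 | 84.6 | 67.7 | 76.8 | 30.0 | 63.3 |
| Qwen2-Chat-7B | 7B | 82.5 | 61.3 | 84.2 | 69.8 | 80.1 | 19.3 | 66.2 |
| Baichuan2-7B-Chat | 7B | 46.2 | 37.0 | 76.5 | 60.2 | 55.0 | 28.7 | 50.6 |
| Baichuan2-13B-Chat | 13B | 69.3 | 39.5 | 75.3 | 65.7 | 62.0 | 31.3 | 57.2 |
| InternLM2-7B-Chat | 7B | 70.2 | 39.9 | 73.4 | 62.8 | 61.4 | 39.5 | 57.8 |
| InternLM2-20B-Chat | 20B | 76.4 | 52.6 | 76.3 | 66.2 | 63.9 | 42.1 | 62.9 |
| InternLM2.5-7B-Chat | 7B | 80.7 | 66.6 | 85.0 | 71.7 | 83.1 | 35.4 | 70.4 |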

CFBenchmark

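| Model | Size | Financial Natural Language | Financial Scenario Calculation | Financial Analysis and Interpretation | Financial Compliance and Safety | Average |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 | - | 52.9 | 74.1 | 78.5 | 35.8 | 60.3 |
| GPT-4 | - | 59.4 | 83.5 | 83.5 | 39.2 | 66.4 |
| ERNIE-Bot-3.5 | - | 50.7 | 74.5 | 79.9 | 36.7 | 60.4 |
| ERNIE-Bot-4 | - | 56.4 | 82.8 | 81.7 | 37.3 | 64.6 |
| ChatGLM2-6B | 6B | 47.9 | 64.1 | 64.6 | 38.7 | 53.8 |
| ChatGLM3-6B | 6B | 49.9 | 68.2 | 68.3 | 32.0 | 54.6 |
| GLM4-9B-Chat | 9B | 57.6 | 67.4 | 78.8 | 29.5 | 58.3 |
| Qwen-Chat-7B | 7B | 43.9 | 67.1 | 61.4 | 40.0 | 53.1 |
| Qwen1.5-Chat-7B | 7B | 56.3 | 73.2 | 80.7 | 30.0 | 60.0 |
| Qwen2-Chat-7B | 7B | 58.8 | 78.8 | 82.2 | 19.3 | 59.8 |
| Baichuan2-7B-Chat | 7B | 56.3 | 61.0 | 65.8 | 28.7 | 53.0 |
| Baichuan2-13B-Chat | 13B | 57.2 | 70.1 | 68.6 | 31.3 | 56.8 |
| InternLM2-7B-Chat | 7B | 51.4 | 68.8 | 67.4 | 39.5 | 56.8 |
| InternLM2-20B-Chat | 20B | 52.9 | 73.0 | 70.1 | 42.1 | 59.5 |
| InternLM2.5-7B-Chat | 7B | 54.1 | 79.1 | 84.0 | 35.4 | 63.2 |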
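In the combined table above, the final average column appears to be the unweighted mean of the four aspect scores. This relationship is inferred from the reported numbers rather than stated in the text, so treat the sketch below as a consistency check under that assumption. The three rows are copied from the table; `mean_score` is a hypothetical helper.

```python
def mean_score(scores):
    """Unweighted mean of a list of scores."""
    return sum(scores) / len(scores)


# model: (four aspect scores, reported average), copied from the combined table
rows = {
    "GPT-3.5": ([52.9, 74.1, 78.5, 35.8], 60.3),
    "GPT-4": ([59.4, 83.5, 83.5, 39.2], 66.4),
    "InternLM2.5-7B-Chat": ([54.1, 79.1, 84.0, 35.4], 63.2),
}

for model, (aspects, reported) in rows.items():
    computed = mean_score(aspects)
    print(f"{model}: computed {computed:.2f}, reported {reported}")
```

For these rows the computed means agree with the reported values up to rounding to one decimal place.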

Acknowledgements

CFBenchmark draws on a number of open-source projects. We want to express our gratitude and respect to the researchers behind them.

To-Do

License

CFBenchmark is a research preview intended for non-commercial use only, subject to the Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations. The code is released under the Apache License 2.0.

Thanks to Our Contributors:

<a href="https://github.com/TongjiFinLab/CFBenchmark/graphs/contributors"> <img src="https://contrib.rocks/image?repo=TongjiFinLab/CFBenchmark" /> </a>

Citation

@misc{lei2023cfbenchmark,
      title={{CFBenchmark}: Chinese Financial Assistant Benchmark for Large Language Model}, 
      author={Lei, Yang and Li, Jiangtong and Cheng, Dawei and Ding, Zhijun and Jiang, Changjun},
      year={2023},
      eprint={2311.05812},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}