<div style="text-align:center"> <h2>📈 CFGPT: Chinese Financial Assistant with Large Language Model</h2> </div>

<a href='https://arxiv.org/abs/2309.10654'><img src='https://img.shields.io/badge/Paper-ArXiv-C71585'></a> <a href='https://huggingface.co/TongjiFinLab/CFGPT1-pt-7B'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging Face-CFGPT(pt)-red'></a> <a href='https://huggingface.co/TongjiFinLab/CFGPT1-sft-7B-LoRA'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging Face-CFGPT(sft%20LoRA)-red'></a> <a href='https://huggingface.co/TongjiFinLab/CFGPT1-sft-7B-Full'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging Face-CFGPT(sft%20Full)-red'></a>

English | 简体中文

Introduction

We introduce CFGPT, an open-source language model trained in two stages: we first further pretrain a general LLM on collected and cleaned Chinese financial text (CFData-pt), which combines financial domain-specific data (announcements, finance articles, finance exams, finance news, and finance research papers) with general data (Wikipedia), and we then fine-tune the model on knowledge-intensive instruction-tuning data (CFData-sft). For a preliminary evaluation we use CFBenchmark-Basic, on which CFGPT outperforms several baseline models of similar parameter scale on both objective and subjective tasks.

The following figure gives an overview of the CFGPT training process:

<div align="center"> <img align="center" src=./figs/CFGPT-TRAIN.svg width="100%"/> </div>

Content

Quick Start

1. Prepare the code and the environment

Clone our repository, create a Python environment, and activate it via the following commands:

git clone https://github.com/TongjiFinLab/CFGPT.git
cd CFGPT
conda create -n env_name python=3.10
conda activate env_name
pip install -r requirements.txt
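After installing the requirements, a quick sanity check can confirm that the core libraries used in the snippets below import correctly and that a GPU is visible (we assume requirements.txt provides torch, transformers, and peft):

import torch
import transformers
import peft

# Optional sanity check for the environment created above.
print('transformers:', transformers.__version__)
print('peft:', peft.__version__)
print('CUDA available:', torch.cuda.is_available())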

2. Prepare the pretrained CFGPT1

CFGPT1 consists of three parts: a pretrained model obtained by continuing the pretraining of InternLM-7B on our CFData-pt, an adapter model trained via PEFT on our CFData-sft, and a fully fine-tuned model trained from the pretrained model.

| Pretrain model | Adapter model | Full SFT model |
| --- | --- | --- |
| CFGPT1-pt-7B | CFGPT1-sft-7B-LoRA | CFGPT1-sft-7B-Full |
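If you prefer to fetch the checkpoints ahead of time, they can be downloaded from the Hugging Face Hub; a minimal sketch using huggingface_hub, looping over the three repos listed above:

from huggingface_hub import snapshot_download

# Download each checkpoint to the local Hugging Face cache and print its path.
for repo_id in [
    'TongjiFinLab/CFGPT1-pt-7B',
    'TongjiFinLab/CFGPT1-sft-7B-LoRA',
    'TongjiFinLab/CFGPT1-sft-7B-Full',
]:
    local_dir = snapshot_download(repo_id=repo_id)
    print(repo_id, '->', local_dir)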

3. Use CFGPT1-sft-7B-LoRA

import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model = 'TongjiFinLab/CFGPT1-pt-7B'
lora_weights = 'TongjiFinLab/CFGPT1-sft-7B-LoRA'
device_map = 'cuda:0'

# Load the tokenizer and the further-pretrained base model in bfloat16
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model,
    trust_remote_code=True,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
# Attach the LoRA adapter trained on CFData-sft
model = PeftModel.from_pretrained(
    model,
    lora_weights,
    device_map=device_map,
)
model = model.eval()

# Sentiment-analysis prompt over a Chinese financial news snippet
inputs = tokenizer("""你是一名金融从业者,请对这篇新闻进行情感分析。请从(中性、积极、消极)中选取答案。新闻内容:挖贝快讯:特步国际发布2023年第二季度中国内地业务营运状况,披露截至2023年6月30日止3个月零售销售实现高双位数同比增长(包括线上线下渠道),零售折扣水平约七五折。同时,2022年7月MSCI首次予以特步ESG评级,一年后评级表现即迎来提升。明晟MSCI上调特步ESG评级,由“BB”升至“BBB”。\n回答:""", return_tensors='pt').to(device_map)
pred = model.generate(**inputs, max_new_tokens=64, do_sample=False, repetition_penalty=1.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True).split('回答:')[1])
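If you plan to serve the adapted model without PEFT at inference time, the LoRA weights can optionally be merged into the base model first; a minimal sketch using the `model` and `tokenizer` objects from the snippet above (the save path is illustrative):

# Fold the LoRA adapter into the base weights and save a standalone checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained('path/to/merged/CFGPT1-sft-7B')
tokenizer.save_pretrained('path/to/merged/CFGPT1-sft-7B')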

4. Use CFGPT1-sft-7B-Full

import torch
from transformers import AutoModel, AutoTokenizer

base_model = 'TongjiFinLab/CFGPT1-sft-7B-Full'
device_map = 'cuda:0'

# Load the tokenizer and the fully fine-tuned model in bfloat16
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModel.from_pretrained(
    base_model,
    trust_remote_code=True,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
model = model.eval()

# The same sentiment-analysis prompt as in the LoRA example above
inputs = tokenizer("""你是一名金融从业者,请对这篇新闻进行情感分析。请从(中性、积极、消极)中选取答案。新闻内容:挖贝快讯:特步国际发布2023年第二季度中国内地业务营运状况,披露截至2023年6月30日止3个月零售销售实现高双位数同比增长(包括线上线下渠道),零售折扣水平约七五折。同时,2022年7月MSCI首次予以特步ESG评级,一年后评级表现即迎来提升。明晟MSCI上调特步ESG评级,由“BB”升至“BBB”。\n回答:""", return_tensors='pt').to(device_map)
pred = model.generate(**inputs, max_new_tokens=64, do_sample=False, repetition_penalty=1.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True).split('回答:')[1])
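Both checkpoints share the same prompt format (a task instruction ending with `\n回答:`), so it can be convenient to wrap generation in a small helper; the function below is our own illustration rather than part of the released code, and it reuses the `tokenizer`, `model`, and `device_map` defined above:

def chat(prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy decoding; return only the text after the final '回答:' marker."""
    inputs = tokenizer(prompt, return_tensors='pt').to(device_map)
    pred = model.generate(**inputs, max_new_tokens=max_new_tokens,
                          do_sample=False, repetition_penalty=1.0)
    text = tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)
    return text.split('回答:')[-1].strip()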

Use Cases

Data

In this repo, we share samples of CFData:

Further Pretrain

The pre-training dataset consists of 591 million documents and 193 billion tokens, spanning six sub-datasets.

To further pretrain InternLM-7B, we sample a financial-text sub-corpus of 13.7 billion tokens from CFData-pt. It is drawn mainly from Chinese financial data and analytics that we collected ourselves (announcements, research reports, social media content, and financial news articles), together with a small amount of general-purpose text such as Wikipedia.

Supervised Finetuning

The supervised fine-tuning dataset consists of 1.6 million instruction pairs and 1.5 billion tokens, covering six financial tasks:

We employ high-quality, domain-specific data to achieve financial domain adaptation during supervised finetuning. The dataset combines six financial datasets that reflect different aspects of financial analysis and decision-making: sentiment analysis, event detection, report summarization, topic decomposition, question answering, and stock movement prediction. CFData-sft provides rich financial text, allowing a FinLLM to learn from diverse sources. Considering practical requirements, we reorganize these supervised finetuning datasets into ten tasks.

The details are as follows:

| Task | Task Description | Dataset | Size |
| --- | --- | --- | --- |
| Sentiment | Identify the sentiment associated with the financial document | CFData-SA | 13K |
| Summary | Generate a content summary based on the provided financial document | CFData-RS | 18K |
| Risk | Generate risk alerts based on the provided financial document | CFData-RS | 20K |
| Suggestion | Generate investment recommendations based on the provided financial document | CFData-RS | 18K |
| Event | Identify the event categories associated with the financial document | CFData-ED | 12K |
| Industry | Identify the industry categories associated with the financial document | CFData-ED | 14K |
| Company | Identify the company names associated with the financial document | CFData-ED | 12K |
| Product | Identify the product names associated with the financial document | CFData-ED | 21K |
| Exam | Answer true-false questions related to finance | CFData-QA | 16K |
| Stock | Predict future stock movements | CFData-SP | 15K |

Researchers can refer to the sample cases of CFData in this repo.

Code

Further Pretrain

The training script is in ./code/train/pretrain:

deepspeed --include localhost:0,1,2,3,4,5,6,7 --master_port 60002 bf_16_parallel_train.py --config bf_16_parallel_train.yml > bf_16_parallel_train.log 2>&1
<div align="center"> <img align="center" src=./figs/CFGPT-Training-loss.svg width="100%"/> </div>

The trainer parameters we use are in ./code/train/pretrain/bf_16_parallel_train.yml:

# basic setting
model_name: path/of/your/further/pretrain/model
dataset: path/to/your/further/pretrain/dataset
deepspeed: ./ds_config.json
seed: 42
max_seq_length: 2048

# train setting 
output_dir: ./bf_16_parallel_train
logging_steps: 10
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 16
learning_rate: 2.0e-4
weight_decay: 0.01
warmup_steps: 1000
save_steps: 1000
fp16: 0
bf16: 1
torch_compile: 0
save_strategy: steps
remove_unused_columns: 0
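Most of these keys map directly onto Hugging Face TrainingArguments; the loader below is a minimal sketch of that mapping (our own illustration, not the repo's training script):

import yaml
from transformers import TrainingArguments

# Read the YAML config shown above and build the corresponding TrainingArguments.
with open('bf_16_parallel_train.yml') as f:
    cfg = yaml.safe_load(f)

training_args = TrainingArguments(
    output_dir=cfg['output_dir'],
    num_train_epochs=cfg['num_train_epochs'],
    per_device_train_batch_size=cfg['per_device_train_batch_size'],
    gradient_accumulation_steps=cfg['gradient_accumulation_steps'],
    learning_rate=cfg['learning_rate'],
    weight_decay=cfg['weight_decay'],
    warmup_steps=cfg['warmup_steps'],
    save_steps=cfg['save_steps'],
    save_strategy=cfg['save_strategy'],
    logging_steps=cfg['logging_steps'],
    bf16=bool(cfg['bf16']),
    fp16=bool(cfg['fp16']),
    torch_compile=bool(cfg['torch_compile']),
    remove_unused_columns=bool(cfg['remove_unused_columns']),
    deepspeed=cfg['deepspeed'],
    seed=cfg['seed'],
)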

The deepspeed parameters we use are in ./code/train/pretrain/ds_config.json:

{
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": "auto",
          "betas": "auto",
          "eps": "auto",
          "weight_decay": 0.01
          }
        },
     "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 1,
        "reduce_bucket_size": 5e8
    }
}

Supervised Finetuning

The training script is in ./code/train/lora. Here we use the lora-bf16 setup as an illustration.

deepspeed --include localhost:6,7 --master_port 60005 lora_bf_16_parallel_train.py --config lora_bf_16_parallel_train.yml > lora_bf_16_parallel_train.log 2>&1

The trainer parameters we use are in ./code/train/lora/bf16/bf_16_parallel_train.yml:

# basic setting
model_name: path/of/your/supervised/finetuning/model
dataset: path/to/your/supervised/finetuning/dataset
dataset_eval: path/to/your/evaluate/dataset
deepspeed: ./ds_config.json
seed: 42
max_seq_length: 2048

# train setting 
output_dir: ./lora_bf_16_parallel_train
num_train_epochs: 1
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 2.0e-4
weight_decay: 0.01
warmup_steps: 500
fp16: 0
bf16: 1
torch_compile: 0
save_strategy: steps
save_steps: 500
evaluation_strategy: steps
eval_steps: 100
logging_steps: 10
remove_unused_columns: 0

# lora setting
rank: 64
lora_alpha: 16
lora_dropout: 0.05
target_modules: ['k_proj', 'o_proj', 'down_proj', 'v_proj', 'q_proj', 'gate_proj', 'up_proj']
bias: 'none'

# restart info
resume_from_checkpoint: null
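The `# lora setting` block corresponds to a PEFT LoraConfig; a minimal sketch of that mapping (our own illustration, assuming a causal-LM task type):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,                                   # 'rank' in the YAML above
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=['k_proj', 'o_proj', 'down_proj', 'v_proj',
                    'q_proj', 'gate_proj', 'up_proj'],
    bias='none',
    task_type='CAUSAL_LM',                  # assumption: supervised finetuning of a causal LM
)

# model = get_peft_model(base_model, lora_config)  # wraps the base model with trainable LoRA layers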

The deepspeed parameters we use are in ./code/train/lora/bf16/ds_config.json:

{
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,

    "optimizer": {
      "type": "AdamW",
      "params": {
        "lr": "auto",
        "betas": "auto",
        "eps": "auto",
        "weight_decay": "auto"
        }
      
      },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "bf16": {
      "enabled": true
    },
    "zero_optimization": {
        "stage": 0
    }
}

Evaluation

The performance of our CFGPT2 models (7B and 20B) is shown below:

C-Eval

| Model | Size | STEM | Social Science | Humanities | Others | Average | Average (hard) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | - | 67.1 | 77.6 | 64.5 | 67.8 | 68.7 | 54.9 |
| ChatGPT | 175B | 52.9 | 61.8 | 50.9 | 53.6 | 54.4 | 41.4 |
| InternLM-7B | 7B | 48.0 | 67.4 | 55.4 | 45.8 | 52.8 | 37.1 |
| ChatGLM2-6B | 6B | 48.6 | 60.5 | 51.3 | 49.8 | 51.7 | 37.1 |
| Qwen-7B | 7B | 52.8 | 74.1 | 63.1 | 55.2 | 59.6 | 41.0 |
| Qwen-14B | 14B | 65.7 | 85.4 | 75.3 | 68.4 | 72.1 | 53.7 |
| Baichuan-7B | 7B | 38.2 | 52.0 | 46.2 | 39.3 | 42.8 | 31.5 |
| Baichuan-13B | 13B | 47.0 | 66.8 | 57.3 | 49.8 | 53.6 | 36.7 |
| Baichuan2-13B-Chat | 13B | 48.4 | 70.5 | 60.3 | 55.0 | 56.6 | 37.9 |
| InternLM2-7B | 7B | 52.3 | 71.9 | 64.9 | 61.0 | 60.8 | 38.8 |
| InternLM2-20B | 20B | 56.1 | 75.7 | 62.6 | 62.4 | 63.0 | 46.3 |
| CFGPT2-7B | 7B | 56.7 | 76.4 | 63.9 | 63.0 | 63.5 | 43.2 |
| CFGPT2-20B | 20B | 64.6 | 80.8 | 72.1 | 68.9 | 69.2 | 49.9 |

FinEval

| Model | Size | Finance | Economy | Accounting | Certificate | Average |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | - | 71.0 | 74.5 | 59.3 | 70.4 | 68.6 |
| ChatGPT | 175B | 59.3 | 61.6 | 45.2 | 55.1 | 55.0 |
| InternLM-7B | 7B | 49.0 | 49.2 | 40.5 | 49.4 | 47.1 |
| ChatGLM2-6B | 6B | 46.5 | 46.4 | 44.5 | 51.5 | 47.4 |
| Qwen-Chat-7B | 7B | 51.5 | 52.1 | 44.5 | 53.6 | 50.5 |
| Qwen-7B | 7B | 54.5 | 54.4 | 50.3 | 55.8 | 53.8 |
| Baichuan-7B-Chat | 7B | 44.9 | 41.5 | 34.9 | 45.6 | 42.0 |
| Baichuan-13B-Chat | 13B | 51.6 | 51.1 | 41.7 | 52.8 | 49.4 |
| InternLM2-7B | 7B | 54.2 | 54.0 | 43.5 | 55.4 | 51.9 |
| InternLM2-20B | 20B | 57.3 | 58.9 | 47.4 | 58.6 | 55.5 |
| CFGPT2-7B | 7B | 62.6 | 63.9 | 58.9 | 66.0 | 62.9 |
| CFGPT2-20B | 20B | 64.0 | 64.9 | 62.1 | 67.9 | 64.8 |

CFBenchmark-Basic

| Model | Size | Company | Product | R.Avg | Sector | Event | Sentiment | C.Avg | Summary | Risk | Suggestion | G.Avg | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HUMAN | - | 0.931 | 0.744 | 0.838 | 0.975 | 0.939 | 0.912 | 0.942 | 1.000 | 1.000 | 1.000 | 1.000 | 0.927 |
| ChatGPT | 20B | 0.797 | 0.198 | 0.498 | 0.453 | 0.458 | 0.425 | 0.455 | 0.593 | 0.541 | 0.771 | 0.635 | 0.529 |
| ERNIE-Bot | 260B | 0.807 | 0.300 | 0.533 | 0.408 | 0.350 | 0.186 | 0.315 | 0.715 | 0.590 | 0.716 | 0.673 | 0.507 |
| ERNIE-Bot-4 | - | 0.819 | 0.417 | 0.618 | 0.418 | 0.358 | 0.375 | 0.384 | 0.721 | 0.629 | 0.718 | 0.689 | 0.564 |
| Falcon-7B | 7B | 0.671 | 0.168 | 0.420 | 0.169 | 0.132 | 0.250 | 0.184 | 0.302 | 0.301 | 0.246 | 0.283 | 0.296 |
| Falcon-7B-chat | 7B | 0.582 | 0.046 | 0.314 | 0.112 | 0.142 | 0.153 | 0.135 | 0.307 | 0.299 | 0.258 | 0.288 | 0.246 |
| bloomz-7B1 | 7B | 0.765 | 0.166 | 0.465 | 0.252 | 0.154 | 0.394 | 0.267 | 0.451 | 0.371 | 0.462 | 0.428 | 0.387 |
| bloomz-7Bt1-mt | 7B | 0.751 | 0.157 | 0.454 | 0.087 | 0.182 | 0.380 | 0.216 | 0.425 | 0.379 | 0.396 | 0.400 | 0.357 |
| Qwen-7B | 7B | 0.780 | 0.357 | 0.569 | 0.480 | 0.335 | 0.379 | 0.398 | 0.750 | 0.505 | 0.713 | 0.656 | 0.541 |
| Qwen-Chat-7B | 7B | 0.763 | 0.360 | 0.562 | 0.400 | 0.367 | 0.265 | 0.344 | 0.548 | 0.307 | 0.379 | 0.411 | 0.439 |
| Qwen-14B | 14B | 0.805 | 0.421 | 0.613 | 0.481 | 0.350 | 0.385 | 0.405 | 0.754 | 0.608 | 0.717 | 0.693 | 0.570 |
| Qwen-Chat-14B | 14B | 0.814 | 0.442 | 0.628 | 0.382 | 0.400 | 0.350 | 0.377 | 0.732 | 0.478 | 0.736 | 0.649 | 0.551 |
| ChatGLM2-6B | 6B | 0.747 | 0.313 | 0.530 | 0.285 | 0.300 | 0.357 | 0.314 | 0.657 | 0.454 | 0.671 | 0.594 | 0.479 |
| Baichuan2-7B-Base | 7B | 0.672 | 0.340 | 0.506 | 0.342 | 0.490 | 0.480 | 0.437 | 0.739 | 0.619 | 0.751 | 0.703 | 0.549 |
| Baichuan2-7B-Chat | 7B | 0.757 | 0.402 | 0.579 | 0.425 | 0.475 | 0.323 | 0.408 | 0.725 | 0.648 | 0.732 | 0.702 | 0.563 |
| Baichuan2-13B-Base | 13B | 0.781 | 0.330 | 0.555 | 0.436 | 0.496 | 0.477 | 0.470 | 0.725 | 0.503 | 0.747 | 0.658 | 0.561 |
| Baichuan2-13B-Chat | 13B | 0.797 | 0.314 | 0.556 | 0.472 | 0.507 | 0.387 | 0.455 | 0.739 | 0.634 | 0.746 | 0.706 | 0.572 |
| InternLM-7B | 7B | 0.612 | 0.233 | 0.423 | 0.266 | 0.311 | 0.328 | 0.302 | 0.378 | 0.336 | 0.379 | 0.364 | 0.363 |
| InternLM-7B-Chat | 7B | 0.632 | 0.261 | 0.447 | 0.272 | 0.364 | 0.399 | 0.345 | 0.363 | 0.270 | 0.353 | 0.329 | 0.374 |
| InternLM-20B | 20B | 0.809 | 0.358 | 0.583 | 0.500 | 0.427 | 0.417 | 0.448 | 0.706 | 0.653 | 0.728 | 0.695 | 0.575 |
| InternLM-20B-Chat | 20B | 0.488 | 0.362 | 0.425 | 0.323 | 0.327 | 0.370 | 0.340 | 0.706 | 0.578 | 0.762 | 0.662 | 0.476 |
| CFGPT1-sft-LoRA | 7B | 0.820 | 0.414 | 0.617 | 0.569 | 0.729 | 0.769 | 0.689 | 0.745 | 0.584 | 0.609 | 0.646 | 0.650 |
| CFGPT1-sft-Full | 7B | 0.836 | 0.476 | 0.656 | 0.700 | 0.808 | 0.829 | 0.779 | 0.798 | 0.669 | 0.808 | 0.758 | 0.731 |
| CFGPT2-7B | 7B | 0.834 | 0.470 | 0.652 | 0.644 | 0.750 | 0.793 | 0.729 | 0.801 | 0.692 | 0.790 | 0.761 | 0.714 |
| CFGPT2-20B | 20B | 0.891 | 0.501 | 0.696 | 0.722 | 0.825 | 0.865 | 0.806 | 0.825 | 0.727 | 0.823 | 0.792 | 0.755 |

OpenFinData

| Model | Size | Knowledge | Calculation | Explanation | Identification | Analysis | Compliance | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERNIE-Bot-3.5 | - | 78.0 | 70.4 | 82.1 | 75.3 | 77.7 | 36.7 | 70.0 |
| ERNIE-Bot-4 | - | 87.3 | 73.6 | 84.3 | 77.0 | 79.1 | 37.3 | 73.1 |
| InternLM-7B | 7B | 65.3 | 45.8 | 71.4 | 62.5 | 59.2 | 37.2 | 56.9 |
| ChatGLM2-6B | 6B | 62.4 | 37.2 | 70.8 | 59.2 | 58.3 | 38.7 | 54.4 |
| Qwen-Chat-7B | 7B | 71.3 | 40.5 | 71.4 | 58.6 | 51.3 | 40.0 | 55.5 |
| Qwen-Chat-14B | 14B | 78.0 | 57.6 | 75.6 | 71.6 | 59.3 | 40.6 | 63.8 |
| Baichuan2-7B-Chat | 7B | 46.2 | 37.0 | 76.5 | 60.2 | 55.0 | 28.7 | 50.6 |
| Baichuan2-13B-Chat | 13B | 69.3 | 39.5 | 75.3 | 65.7 | 62.0 | 31.3 | 57.2 |
| InternLM2-7B | 7B | 70.2 | 39.9 | 73.4 | 62.8 | 61.4 | 39.5 | 57.8 |
| InternLM2-20B | 20B | 76.4 | 52.6 | 76.3 | 66.2 | 63.9 | 42.1 | 62.9 |
| CFGPT2-7B | 7B | 81.9 | 62.8 | 75.2 | 71.3 | 64.1 | 68.2 | 70.5 |
| CFGPT2-20B | 20B | 84.6 | 66.5 | 78.1 | 75.9 | 66.0 | 71.9 | 73.8 |

Acknowledgements

CFGPT builds on the following open-source projects, and we would like to express our gratitude to their researchers and developers.

To-Do List

License

The source code of CFGPT is released under the Apache 2.0 License. The CFGPT models also support commercial use under the licenses of the InternLM 7B & 20B base models and the Terms of Use for data generated by OpenAI. Please contact us if you find any potential violations.

Thanks to our contributors:

<a href="https://github.com/TongjiFinLab/CFGPT/graphs/contributors"> <img src="https://contrib.rocks/image?repo=TongjiFinLab/CFGPT" /> </a>

Citation

If you find CFGPT useful for your research, please consider citing the following papers:

@article{li2023cfgpt,
  title={CFGPT: Chinese financial assistant with large language model},
  author={Li, Jiangtong and Bian, Yuxuan and Wang, Guoxuan and Lei, Yang and Cheng, Dawei and Ding, Zhijun and Jiang, Changjun},
  journal={arXiv preprint arXiv:2309.10654},
  year={2023}
}

@article{li2024ra,
  title={RA-CFGPT: Chinese financial assistant with retrieval-augmented large language model},
  author={Li, Jiangtong and Lei, Yang and Bian, Yuxuan and Cheng, Dawei and Ding, Zhijun and Jiang, Changjun},
  journal={Frontiers of Computer Science},
  volume={18},
  number={5},
  pages={185350},
  year={2024},
  publisher={Springer}
}