CALM: Credit and Risk Assessment Large Language Model

1. Preparing the environment

Create the environment with Conda, then install the required packages with pip:

pip install -r requirements.txt

2. Run

2.1 Download data

Before running, please download the raw data to data/CRA_resample_0.045M.json.
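
Optionally, you can sanity-check the download with a short Python snippet. This is only a sketch: it assumes the raw file is either a single JSON array or JSON Lines, since the exact format is not documented here.

import json

raw_path = "data/CRA_resample_0.045M.json"  # downloaded raw data

with open(raw_path, "r", encoding="utf-8") as f:
    text = f.read().strip()

try:
    # Case 1: the whole file is one JSON array.
    records = json.loads(text)
except json.JSONDecodeError:
    # Case 2: JSON Lines, one record per line.
    records = [json.loads(line) for line in text.splitlines() if line.strip()]

print(f"loaded {len(records)} raw records")  # roughly 45k expected for the 0.045M release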

2.1.1 Convert data format

export raw_data=/path_to/CRA_resample_0.045M.json
export conv_data=/path_to/CRA_resample_0.045M_conv.json
export data_name=CRA
export dev_data=/path_to/CRA-resample-dev3k.json
export train_data=/path_to/CRA-resample-train4w.json

python scripts/convert_to_conv_data.py \
    --orig_data ${raw_data} \
    --write_data ${conv_data} \
    --dataset_name CRA
head -n 3000 ${conv_data} > ${dev_data}
tail -n +3001 ${conv_data} > ${train_data}

We designate the first 3000 entries as the validation set, while the remaining data serves as the training set.
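
To double-check the split, the line counts can be verified with a short Python sketch. It assumes the converted file is JSON Lines (one conversation per line), which is also what the head/tail split above relies on; the paths are the same placeholders as in the exports.

conv_data = "/path_to/CRA_resample_0.045M_conv.json"  # converted conversations
dev_data = "/path_to/CRA-resample-dev3k.json"         # first 3000 lines (validation)
train_data = "/path_to/CRA-resample-train4w.json"     # remaining lines (training)

def count_lines(path):
    # One conversation per line, so the line count equals the sample count.
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

n_conv, n_dev, n_train = map(count_lines, (conv_data, dev_data, train_data))
print(f"converted: {n_conv}, dev: {n_dev}, train: {n_train}")
assert n_dev == 3000 and n_dev + n_train == n_conv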

2.2 Model training

Training strategy

The training launch script is train/scripts/run_sft.sh. You will need to modify the parameters in run_sft.sh according to your specific requirements.

bash scripts/run_sft.sh

2.2.1 LoRA

nohup torchrun --nproc_per_node 2 src/entry_point/sft_train.py \
    --model_name_or_path ${model_name_or_path} \
    --bf16 True \
    --llama True \
    --use_lora True \
    --deepspeed configs/deepspeed_config_stage3.json \
    --lora_config configs/lora_config_llama.json \
    --train_file ${train_file} \
    --validation_file ${validation_file} \
    --per_device_train_batch_size 6 \
    --per_device_eval_batch_size 6 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 5 \
    --model_max_length ${cutoff_len} \
    --save_strategy "steps" \
    --save_total_limit 3 \
    --learning_rate 3e-4 \
    --weight_decay 0.00001 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --evaluation_strategy "steps" \
    --seed 1234 \
    --gradient_checkpointing \
    --cache_dir ${cache_dir} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    > ${log_dir}/train.log 2>&1 &

Parameters

Note: The options "use_int8_training" and "deepspeed" are mutually exclusive; enable only one of them.

The structure of the output_dir:

output_dir/
├── checkpoint-244/
│   ├── pytorch_model.bin
│   └── trainer_state.json
├── checkpoint-527/
│   ├── pytorch_model.bin
│   └── trainer_state.json
├── adapter_model.bin
├── print_log.txt
└── adapter_config.json

The top-level directory stores the final model obtained from training (here, the LoRA adapter weights adapter_model.bin together with adapter_config.json).
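
If you want to run inference directly from these outputs without merging, a minimal loading sketch with peft and transformers is shown below. The base-model path is a placeholder (the same style as in section 2.2.2), and the adapter directory is the top-level output_dir above; adjust both to your setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "model_path_to/llama-2-7b-chat-T/"  # base model used for fine-tuning (placeholder)
adapter_dir = "output_dir/"                           # directory containing adapter_model.bin and adapter_config.json

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
base_model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16)

# Attach the trained LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base_model, adapter_dir)
model.eval()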

2.2.2 Merge Model with LoRA

If you wish to merge the LoRA weights into the pre-trained base model, execute the following command:

model_name_or_path=model_path_to/llama-2-7b-chat-T/
lora_path=lora_path_to/checkpoint_2/3739
output_path=out_path_to/CRA__model_2/model_3739

CUDA_VISIBLE_DEVICES=0 python src/merge_llama_with_lora.py \
    --model_name_or_path ${model_name_or_path} \
    --output_path ${output_path} \
    --lora_path ${lora_path} \
    --llama

The merged weights are saved in the output_path directory and can subsequently be loaded directly with from_pretrained.
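
For example, a minimal inference sketch after merging (the path is the same placeholder as ${output_path} above, and the prompt is only illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_path = "out_path_to/CRA__model_2/model_3739"  # same as ${output_path} above

tokenizer = AutoTokenizer.from_pretrained(merged_path)
model = AutoModelForCausalLM.from_pretrained(merged_path, torch_dtype=torch.bfloat16)
model.eval()

prompt = "..."  # your credit and risk assessment prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))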