⚖️LAiW: A Chinese Legal Large Language Models Benchmark

English | Chinese

LAiW: A Comprehensive Benchmark for Chinese Legal Large Language Models (LLMs)

🔥 LAiW Leaderboard

🔥 Technical Report and Official Paper

News

🔄 Recent Updates

📅 Earlier News

Contents

Evaluation structure diagram

<img src="https://github.com/Dai-shen/LAiW/blob/main/resources/task_framework_en.png" width="70%" height="70%"></img>

Scores for LLMs

Using the scoring mechanism described below, we have so far evaluated 18 mainstream LLMs: 7 legal LLMs and 11 general LLMs. The model scores are as follows:

| Model | Size | Model Domain | Total Score | BIR | LFI | CLA | Base Model |
|-------|------|--------------|-------------|-----|-----|-----|------------|
| GPT-4 | - | General | 69.63 | 80.92 | 69.27 | 58.69 | - |
| ChatGPT | - | General | 64.09 | 75.99 | 58.32 | 57.96 | - |
| Baichuan2-Chat | 13B | General | 48.04 | 53.67 | 32.03 | 58.40 | - |
| ChatGLM | 6B | General | 47.01 | 51.51 | 37.08 | 52.44 | - |
| Ziya-LLaMA | 13B | General | 45.79 | 61.47 | 29.44 | 46.45 | Llama-13B |
| Fuzi-Mingcha | 6B | Legal | 40.62 | 39.68 | 27.46 | 54.71 | ChatGLM-6B |
| HanFei | 7B | Legal | 35.69 | 37.42 | 16.33 | 53.31 | - |
| LexiLaw | 6B | Legal | 31.31 | 41.32 | 8.88 | 43.73 | ChatGLM-6B |
| Lawyer-LLaMA | 13B | Legal | 29.25 | 30.85 | 6.39 | 50.50 | Chinese-LLaMA-13B |
| Llama2-Chat | 7B | General | 27.76 | 31.86 | 12.77 | 38.64 | - |
| ChatLaw | 13B | Legal | 25.77 | 58.02 | 12.54 | 6.74 | Ziya-LLaMA-13B |
| Chinese-LLaMA | 13B | General | 24.99 | 21.02 | 19.16 | 34.80 | Llama-13B |
| Chinese-LLaMA | 7B | General | 24.91 | 22.32 | 18.25 | 34.16 | Llama-7B |
| LaWGPT | 7B | Legal | 22.69 | 15.47 | 14.27 | 38.32 | Chinese-LLaMA-7B |
| Baichuan | 7B | General | 22.51 | 21.20 | 15.46 | 30.86 | - |
| Llama | 13B | General | 21.00 | 18.51 | 15.08 | 29.40 | - |
| Wisdom-Interrogatory | 7B | Legal | 18.83 | 12.66 | 10.45 | 33.37 | Baichuan-7B |
| Llama | 7B | General | 16.35 | 11.12 | 15.40 | 22.54 | - |

The overall scores and the scores at each level of legal capability are ranked as follows:

*(Histograms of the Overall, BIR, LFI, and CLA score rankings.)*

Tasks

With the joint efforts of legal experts and artificial intelligence experts, we categorize the legal capabilities of LLMs into three levels, ordered from easy to difficult: Basic Information Retrieval (BIR), Legal Foundation Inference (LFI), and Complex Legal Application (CLA), comprising 14 foundational tasks in total. The diagram above shows the structure of these three capability levels.

Below is a brief description of each evaluation task.

<table>
  <tr> <td>Capability</td> <td>Task</td> <td>Description</td> </tr>
  <tr> <td rowspan="5">BIR</td> <td>Legal Article Recommendation</td> <td>It aims to provide the relevant law articles based on the description of a case.</td> </tr>
  <tr> <td>Element Recognition</td> <td>It analyzes and assesses each sentence to identify the pivotal elements of the case.</td> </tr>
  <tr> <td>Named Entity Recognition</td> <td>It aims to extract nouns and phrases with legal characteristics from various legal documents.</td> </tr>
  <tr> <td>Judicial Summarization</td> <td>It aims to condense, summarize, and synthesize the content of legal documents.</td> </tr>
  <tr> <td>Case Recognition</td> <td>It aims to determine, based on the relevant description of a case, whether it pertains to a criminal or civil matter.</td> </tr>
  <tr> <td rowspan="5">LFI</td> <td>Controversial Focus Mining</td> <td>It aims to extract the logical and interactive arguments between the defense and the prosecution in legal documents, which serve as a key component of the tasks that relate to the case outcome.</td> </tr>
  <tr> <td>Similar Case Matching</td> <td>It aims to find the cases that bear the closest resemblance. This is a core aspect of various legal systems worldwide, since similar cases must receive consistent judgments to ensure the fairness of the law.</td> </tr>
  <tr> <td>Criminal Judgment Prediction</td> <td>It involves predicting the guilt or innocence of the defendant, along with the potential sentencing, based on the results of basic legal NLP, including the facts of the case, the evidence presented, and the applicable law articles. It is therefore divided into two sub-tasks: Charge Prediction and Prison Term Prediction.</td> </tr>
  <tr> <td>Civil Trial Prediction</td> <td>It involves using factual descriptions to predict the judgment on the defendant in response to the plaintiff's claim, for which the controversial focus should be taken into account.</td> </tr>
  <tr> <td>Legal Question Answering</td> <td>It utilizes the model's legal knowledge to answer questions from the national judicial examination, which encompasses various specific types of law.</td> </tr>
  <tr> <td rowspan="3">CLA</td> <td>Judicial Reasoning Generation</td> <td>It aims to generate the relevant legal reasoning text based on the factual description of a case. This is a complex reasoning task, because the court must further elaborate the reasoning behind the judgment based on the determination of the facts of the case; the task also involves aligning with the logical structure of legal syllogism.</td> </tr>
  <tr> <td>Case Understanding</td> <td>It is expected to provide reasonable and compliant answers to questions posed about the case descriptions in judicial documents; this is also a complex reasoning task.</td> </tr>
  <tr> <td>Legal Consultation</td> <td>It covers a wide range of legal areas and aims to provide accurate, clear, and reliable answers to the legal questions posed by different users. It therefore usually requires the combination of all the aforementioned capabilities to deliver professional and reliable analysis.</td> </tr>
</table>

Datasets

We have reorganized and constructed the evaluation datasets for the aforementioned tasks from existing publicly available Chinese legal datasets. These datasets are collectively referred to as the Legal Evaluation Dataset (LED). The table below lists the evaluation dataset for each foundational task, and a loading sketch follows it. For more detailed information about the datasets, please refer to here.

<table>
  <tr> <td>Level</td> <td>Task</td> <td>Main Dataset</td> <td>Evaluation Dataset</td> <td>Data Size</td> <td>Category</td> </tr>
  <tr> <td rowspan="5">BIR</td> <td>Legal Article Recommendation</td> <td>CAIL-2018</td> <td><a href="https://huggingface.co/datasets/daishen/legal-ar">legal_ar</a></td> <td>1,000</td> <td>Classification</td> </tr>
  <tr> <td>Element Recognition</td> <td>CAIL-2019</td> <td><a href="https://huggingface.co/datasets/daishen/legal-er">legal_er</a></td> <td>1,000</td> <td>Classification</td> </tr>
  <tr> <td>Named Entity Recognition</td> <td>CAIL-2021</td> <td><a href="https://huggingface.co/datasets/daishen/legal-ner">legal_ner</a></td> <td>1,040</td> <td>Named Entity Recognition</td> </tr>
  <tr> <td>Judicial Summarization</td> <td>CAIL-2020</td> <td><a href="https://huggingface.co/datasets/daishen/legal-js">legal_js</a></td> <td>364</td> <td>Text Generation</td> </tr>
  <tr> <td>Case Recognition</td> <td>CJRC</td> <td><a href="https://huggingface.co/datasets/daishen/legal-cr">legal_cr</a></td> <td>2,000</td> <td>Classification</td> </tr>
  <tr> <td rowspan="6">LFI</td> <td>Controversial Focus Mining</td> <td>LAIC-2021</td> <td><a href="https://huggingface.co/datasets/daishen/legal-cfm">legal_cfm</a></td> <td>306</td> <td>Classification</td> </tr>
  <tr> <td>Similar Case Matching</td> <td>CAIL-2019</td> <td><a href="https://huggingface.co/datasets/daishen/legal-scm">legal_scm</a></td> <td>260</td> <td>Classification</td> </tr>
  <tr> <td>Charge Prediction</td> <td>Criminal-S</td> <td><a href="https://huggingface.co/datasets/daishen/legal-cp">legal_cp</a></td> <td>827</td> <td>Classification</td> </tr>
  <tr> <td>Prison Term Prediction</td> <td>MLMN</td> <td><a href="https://huggingface.co/datasets/daishen/legal-ptp">legal_ptp</a></td> <td>349</td> <td>Classification</td> </tr>
  <tr> <td>Civil Trial Prediction</td> <td>MSJudge</td> <td><a href="https://huggingface.co/datasets/daishen/legal-ctp">legal_ctp</a></td> <td>800</td> <td>Classification</td> </tr>
  <tr> <td>Legal Question Answering</td> <td>JEC-QA</td> <td><a href="https://huggingface.co/datasets/daishen/legal-lqa">legal_lqa</a></td> <td>855</td> <td>Classification</td> </tr>
  <tr> <td rowspan="3">CLA</td> <td>Judicial Reasoning Generation</td> <td>AC-NLG</td> <td><a href="https://huggingface.co/datasets/daishen/legal-jrg">legal_jrg</a></td> <td>834</td> <td>Text Generation</td> </tr>
  <tr> <td>Case Understanding</td> <td>CJRC</td> <td><a href="https://huggingface.co/datasets/daishen/legal-cu">legal_cu</a></td> <td>1,054</td> <td>Text Generation</td> </tr>
  <tr> <td>Legal Consultation</td> <td>CrimeKgAssitant</td> <td><a href="https://huggingface.co/datasets/daishen/legal-lc">legal_lc</a></td> <td>916</td> <td>Text Generation</td> </tr>
</table>
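Each evaluation set is hosted on Hugging Face, so it can be pulled directly with the `datasets` library. This is a minimal sketch assuming the standard `load_dataset` API and the `daishen/legal-ar` set from the table above; split and field names may vary by task:

```python
from datasets import load_dataset

# Load the Legal Article Recommendation set (legal_ar); the other tasks
# in the table follow the same "daishen/legal-*" naming pattern.
ds = load_dataset("daishen/legal-ar")
print(ds)  # available splits and their sizes

# Inspect one example; field names may differ from task to task.
first_split = next(iter(ds.values()))
print(first_split[0])
```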

Scoring Mechanism

⭐️ Scores for each task

<div align="center">

$$ S_{(Task)} = \begin{cases} F1 \times 100, & \text{if } Task \in \text{Classification} \\ \frac{1}{3}(R1 + R2 + RL) \times 100, & \text{if } Task \in \text{Text Generation} \\ Acc \times 100, & \text{if } Task \in \text{NER} \end{cases} $$

</div>

Currently, our evaluation benchmark consists of three types of tasks: classification, text generation, and named entity recognition (NER). For classification tasks, we use the F1 score. For text generation tasks, we use the average of the ROUGE-1, ROUGE-2, and ROUGE-L scores. For legal NER tasks, we use the extraction accuracy of legal entities as the score.
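As an illustration, the per-task scores can be computed with standard libraries. This is a minimal sketch using scikit-learn's macro-F1 and Google's `rouge_score` package; LAiW's exact preprocessing, averaging, and entity-matching rules may differ:

```python
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer


def classification_score(y_true, y_pred):
    """F1 * 100 for classification tasks (macro averaging assumed here)."""
    return f1_score(y_true, y_pred, average="macro") * 100


def generation_score(reference, prediction):
    """Mean of ROUGE-1/2/L F-measures * 100 for text generation tasks.

    Note: rouge_score tokenizes for English; Chinese text is typically
    pre-tokenized (e.g., space-separated characters) before scoring.
    """
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    scores = scorer.score(reference, prediction)
    return sum(s.fmeasure for s in scores.values()) / 3 * 100


def ner_score(gold_entities, predicted_entities):
    """Acc * 100 for NER: fraction of gold entities extracted exactly."""
    hits = sum(1 for e in gold_entities if e in predicted_entities)
    return hits / len(gold_entities) * 100
```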

🌟 Scores for each LLM

For each LLM, we first compute the average score of the tasks at each level as its legal capability score for that level. We then average these three capability scores to obtain the LLM's final evaluation score. Model evaluation scores can be found here.
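A minimal sketch of this two-stage averaging, checked against the leaderboard above (GPT-4: (80.92 + 69.27 + 58.69) / 3 ≈ 69.63):

```python
def llm_score(level_task_scores: dict[str, list[float]]) -> float:
    """Average task scores within each level, then average across levels."""
    level_scores = [sum(s) / len(s) for s in level_task_scores.values()]
    return sum(level_scores) / len(level_scores)


# Sanity check: GPT-4's three level scores reproduce its total of 69.63.
print(round(llm_score({"BIR": [80.92], "LFI": [69.27], "CLA": [58.69]}), 2))
```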

Run

We will continue to evaluate the performance of existing LLMs on the 14 foundational tasks according to the structure diagram above. For details, please refer to the leaderboard.

1. Preparation

```bash
git clone https://github.com/Dai-shen/LAiW.git --recursive
cd LAiW
pip install -r requirements.txt
cd src/financial-evaluation
pip install -e .[multilingual]
```

2. Output of LLM

Select the model and the legal tasks to be evaluated. Running the following command produces the model's outputs.

```bash
export CUDA_VISIBLE_DEVICES="1,2"
# $pretrained_model should point to the checkpoint (or Hugging Face model ID) to evaluate
python eval.py \
    --model "hf-causal-experimental" \
    --model_args "use_accelerate=True,pretrained=$pretrained_model,tokenizer=$pretrained_model,use_fast=False,trust_remote_code=True" \
    --tasks "legal_ar,legal_er,legal_js" \
    --no_cache \
    --num_fewshot 0 \
    --write_out \
    --output_base_path ""
```

Parameter Description
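The flags in the command above follow the underlying evaluation framework's CLI; the summary below reflects our understanding of them:

- `--model`: the model adapter to use; `hf-causal-experimental` wraps Hugging Face causal language models.
- `--model_args`: comma-separated loading arguments, here the pretrained checkpoint, tokenizer path, and `trust_remote_code` for custom model code.
- `--tasks`: comma-separated task names, matching the evaluation dataset names listed in the tables above.
- `--no_cache`: disable caching so the model is re-run on every invocation.
- `--num_fewshot`: the number of in-context examples (0 means zero-shot evaluation).
- `--write_out`: write the per-example prompts and model outputs to disk.
- `--output_base_path`: the directory where those outputs are saved.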

Contributors

Disclaimer

This project is provided for academic and educational purposes only. We do not take responsibility for any issues, risks, or adverse consequences that may arise from the use of this project.

Acknowledgements

This project is built upon the following open-source projects, and we are grateful to them:

Cite

If this project has been helpful to your research, please consider citing our paper:

```bibtex
@article{dai2023laiw,
  title={LAiW: A Chinese legal large language models benchmark},
  author={Dai, Yongfu and Feng, Duanyu and Huang, Jimin and Jia, Haochen and Xie, Qianqian and Zhang, Yifang and Han, Weiguang and Tian, Wei and Wang, Hao},
  journal={arXiv preprint arXiv:2310.05620},
  year={2023}
}
```