Awesome
<!-- * @Author: lihaitao * @Date: 2023-02-28 23:34:00 * @LastEditors: Do not edit * @LastEditTime: 2023-03-02 15:45:21 * @FilePath: /LeCaRD2.0/LeCaRDv2/README.md -->LeCaRDv2:A Large-Scale Chinese Legal Case Retrieval Dataset
Overview
Introduction
LeCaRDv2 is one of the largest Chinese legal case retrieval datasets with the widest coverage of criminal charges. The dataset comprises 800 query cases and 55,192 candidate cases extracted from 4.3 million criminal case documents. We believe it can be a reliable benchmark that promotes relevant research in the field.
Compared with LeCaRDv1, the data size of LeCaRDv2 is approximately 8 times larger.Moreover, we enrich the existing relevance criteria and consider characterization, penalty, procedure three aspects together. Also, we propose a new candidate pooling strategy to alleviate potential bias in the annotation process. All cases are annotated by multiple legal experts majoring in criminal law.
The statistics of LeCaRDv2 shows as follow:
LeCaRDv2 | |
---|---|
#Queries | 800 |
#Candidate cases/query | 55,192 |
Avg.length per case document | 4,766 |
#Avg.relevant case per query | 20.89 |
Data Structure
Query
query.json
has 800 lines. Each line is a dictionary representing a query. An example of a dictionary is:
{"id": 233, "query":"告石康军贩卖毒品一审判决书湖南省衡阳市珠晖区人民法院刑事判决书(2017)湘0405刑初149号:湖南省衡阳市珠晖区人民检察院以珠检公诉刑诉〔2017〕127号起诉书指控被告人石康军犯贩卖毒品罪,于2017年9月30日向本院提起公诉。本院于当日受理后,依法适用普通程序,由审判员陈东升担任审判长,与审判员阳福明、人民陪审员肖才闰组成合议庭,于2017年11月30日公开开庭审理了本案。代理书记员段佳丽担任记录。衡阳市珠晖区人民检察院指派检察员谢小芸出庭支持公诉。被告人石康军及其辩护人王剑清到庭参加诉讼......", "fact": "公诉机关指控:贩卖毒品罪被告人石康军系无业吸毒人员,以贩养吸,2017年5月至6月期间,被告人6次贩卖甲基苯丙胺(冰毒)给贩毒人员吸食共计约5.6克,查获尚未贩卖的甲基苯丙胺208.32克、甲基苯丙胺片剂3.94克。二、容留他人吸毒罪被告人石康军在其位于本市珠晖区家中,三次容留吸毒人员张某某吸食毒品甲基苯丙胺。"}
where id
is each query's unique ID, query
is the query content, and fact
is the fact description of this query case.
common_query.json
/controversial_query.json
/procedural_query.json
save the common query/controversial query/procedural query respectively. Details about the different types of queries can be viewed in our paper.
train_query.json
and test_query.json
are training data and testing data respectively, where the training data has 640 lines and the testing data has 160 lines.
query_allcontext.json
contains the entire content of the case in the query, which includes the title, programme, type, fulltext, fact, reason, result, claim, law, xf fields.”’
Candidates
You can download the candidate cases from this link download.
candidates
directory consists of 55,192 candidate documents, An example of a candidate documents is:
{"pid": 0, "qw": "山东省武城县人民法院刑事判决书(2019)鲁1428刑初153号公诉机关山东省武城县人民检察院。被告人张宝玉,男,1983年11月出生于山东省武城县,汉族,初中肄业,户籍所在地山东省武城县,住山东省武城县......", "fact": "经审理查明:2019年7月1日7时8分,被告人张宝玉驾驶鲁N×××××号小型轿车沿宏海路由北向南行驶至腾鲁大街路口处向东转弯时,与沿宏海路由南向北行驶的徐某1驾驶、徐某2乘坐的鲁N×××××号三轮汽车相撞......", "reason": "本院认为,被告人张宝玉违反交通运输管理法规,发生重大事故,致一人死亡,且承担事故的主要责任,其行为已构成交通肇事罪,武城县人民检察院指控的事实及罪名成立,本院予以确认......", "result": "公诉机关认为,被告人张宝玉违反交通运输管理法规,发生重大交通事故致一人死亡,并承担事故的主要责任,其行为触犯了《中华人民共和国刑法》第一百三十三条之规定,犯罪事实清楚,证据确实充分,应当以交通肇事罪追究其刑事责任。其法定刑为三年以下有期徒刑或者拘役,结合其自首、认罪认罚、赔偿谅解等量刑情节,建议判处被告人张宝玉有期徒刑六个月至十个月或拘役四个月至六个月,可以适用缓刑.....", "charge": ["交通肇事罪"], "article": [133, 67, 72, 73]}
where pid
is the ID of the case, qw
is the full content, fact
is the basic facts of the case, reason
is the analysis process of judges, result
is the case judgment, charge
is the criminal charge(s) of the case, article
is the the criminal law article of this case document.
Labels
relevence.trec
is the relevance annotation with TREC qrels format. Specifically, each line has the qid \t 0 \t pid \t label
format.
relevence_gold.trec
is annotated with relevance at level 2 or higher. For convenience, the annotation labels of the test data are organized in test_relevence.trec
/test_relevence_gold.trec
To facilitate the study of performance on ranking, ranking_pool.json provides the query and the corresponding ranking pool containing 100 documents, where qid
is the id of the query and rank_doc_id
contains document ids.
Experiment
The traditional methods are implemented with pyserini toolkit and the code of the pre-trained model is referenced from dense.
We conduct experiments on state-of-the-art models with zero-shot and fine-tuning settings. Under the zero-shot setting, all annotated data in LeCaRDv2 are employed to evaluate. The zero-shot results are:
Model | R@100 | R@200 | R@500 | R@1000 |
---|---|---|---|---|
BM25 | 0.6262 | 0.6629 | 0.6949 | 0.7207 |
QLD | 0.5984 | 0.6576 | 0.7065 | 0.7424 |
Bert | 0.1165 | 0.1526 | 0.2184 | 0.2805 |
Roberta | 0.3753 | 0.4739 | 0.6152 | 0.7126 |
Bert_xs | 0.0453 | 0.0614 | 0.0949 | 0.1343 |
Lawformer | 0.2432 | 0.3040 | 0.4054 | 0.4833 |
Condenser | 0.2215 | 0.2987 | 0.4321 | 0.5452 |
coCondenser | 0.2255 | 0.3093 | 0.4460 | 0.5514 |
SEED | 0.3544 | 0.4474 | 0.5745 | 0.6657 |
RetroMAE | 0.3193 | 0.3947 | 0.5010 | 0.5821 |
Under the fine-tuning setting, we sampled 20% cases from each charge as the test set. There are 640 queries for training and 160 queries for testing. The fine-tuning results are:
Model | R@100 | R@200 | R@500 | R@1000 |
---|---|---|---|---|
BM25 | 0.6050 | 0.6428 | 0.6735 | 0.7015 |
QLD | 0.5749 | 0.6354 | 0.6882 | 0.7222 |
Bert | 0.3849 | 0.5026 | 0.6649 | 0.7797 |
Roberta | 0.4136 | 0.5330 | 0.6964 | 0.7998 |
Bert_xs | 0.2074 | 0.2750 | 0.3935 | 0.4941 |
Lawformer | 0.3651 | 0.4851 | 0.6443 | 0.7629 |
Condenser | 0.3982 | 0.5003 | 0.6761 | 0.7969 |
coCondenser | 0.3998 | 0.5024 | 0.6861 | 0.8036 |
SEED | 0.4201 | 0.5437 | 0.7160 | 0.8132 |
RetroMAE | 0.4210 | 0.5397 | 0.7093 | 0.8174 |
Requirements
python=3.8
transformers==4.25.1
tqdm==4.64.1
datasets==2.8.0
torch==1.11.0
faiss==1.7.2
jieba==0.42.1
pytorch==1.12.1
pyserini==0.19.2
Authors
Haitao Li liht22@mails.tsinghua.edu.cn
Yunqiu Shao shaoyq18@mails.tsinghua.edu.cn
Yueyue Wu wuyueyue@mail.tsinghua.edu.cn
Qingyao Ai aiqy@tsinghua.edu.cn
Yixiao Ma mayx20@mails.tsinghua.edu.cn
Yiqun Liu yiqunliu@tsinghua.edu.cn
Contact
If you have any questions or suggestions, please contact liht22@mails.tsinghua.edu.cn
License
MIT © Haitao Li