Legal-HeBERT

Legal-HeBERT is a BERT model for the Hebrew legal and legislative domains. It is intended to advance legal NLP research and tool development in Hebrew. We release two versions of Legal-HeBERT. The first is HeBERT fine-tuned on legal and legislative documents. The second follows HeBERT's architecture guidelines to train a BERT model from scratch. We continue to collect legal data, examine different architectural designs, and build tagged datasets and legal tasks for evaluating and developing Hebrew legal tools.

Training Data

Our training datasets are:

Name	Hebrew Description	Size (GB)	Documents	Sentences	Words	Notes
The Israeli Law Book	ספר החוקים הישראלי	0.05	2,338	293,352	4,851,063
Judgments of the Supreme Court	מאגר פסקי הדין של בית המשפט העליון	0.7	212,348	5,790,138	79,672,415
Custody courts	החלטות בתי הדין למשמורת	2.46	169,708	8,555,893	213,050,492
Law memoranda, drafts of secondary legislation and drafts of support tests that have been distributed to the public for comment	תזכירי חוק, טיוטות חקיקת משנה וטיוטות מבחני תמיכה שהופצו להערות הציבור	0.4	3,291	294,752	7,218,960
Supervisors of Land Registration judgments	מאגר פסקי דין של המפקחים על רישום המקרקעין	0.02	559	67,639	1,785,446
Decisions of the Labor Court - Corona	מאגר החלטות בית הדין לעניין שירות התעסוקה – קורונה	0.001	146	3,505	60,195
Decisions of the Israel Lands Council	החלטות מועצת מקרקעי ישראל		118	11,283	162,692	aggregate file
Judgments of the Disciplinary Tribunal and the Israel Police Appeals Tribunal	פסקי דין של בית הדין למשמעת ובית הדין לערעורים של משטרת ישראל	0.02	54	83,724	1,743,419	aggregate files
Disciplinary Appeals Committee in the Ministry of Health	ועדת ערר לדין משמעתי במשרד הבריאות	0.004	252	21,010	429,807	465 scanned files could not be parsed
Attorney General's Positions	מאגר התייצבויות היועץ המשפטי לממשלה	0.008	281	32,724	813,877
Legal-Opinion of the Attorney General	מאגר חוות דעת היועץ המשפטי לממשלה	0.002	44	7,132	188,053

Total		3.665	389,139	15,161,152	309,976,419

We thank Yair Gardin for referring us to the governance data, Elhanan Schwarts for collecting and parsing The Israeli Law Book, and Jonathan Schler for collecting the judgments of the Supreme Court.

Training process

Vocabulary size: 50,000 tokens
4 epochs (approximately 1M steps)
lr=5e-5
mlm_probability=0.15
batch size = 32 (per GPU)
NVIDIA GeForce RTX 2080 Ti + NVIDIA GeForce RTX 3090 (1 week of training)
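
As a rough sanity check, the reported step count can be related to the corpus size. The sketch below assumes that each sentence in the training data table is one training example and that the two GPUs run data-parallel with a batch of 32 each (an effective batch of 64). Neither assumption is stated explicitly on this page, and the variable names are illustrative.

```python
import math

# Figures taken from the training data table and the settings list.
total_sentences = 15_161_152
batch_size_per_gpu = 32
epochs = 4

# Assumptions (not stated in the source): one sentence per training example,
# and both GPUs process their own batch in parallel.
num_gpus = 2

effective_batch = batch_size_per_gpu * num_gpus
steps_per_epoch = math.ceil(total_sentences / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch:,}")  # 236,893
print(f"total steps:     {total_steps:,}")      # 947,572
```

Under these assumptions the total comes to roughly 0.95M steps, which is in line with the step count reported for 4 epochs.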

Additional training settings:

Fine-tuned HeBERT model: The first eight layers were frozen, as suggested by Lee et al. (2019).
Legal-HeBERT trained from scratch: The training process is similar to HeBERT's and is inspired by Chalkidis et al. (2020).
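
The layer freezing described for the fine-tuned model could be sketched roughly as follows. This is not the authors' training code: the helper names `is_frozen` and `freeze_first_layers` are hypothetical, and the sketch assumes a PyTorch BERT model whose encoder parameters are named `encoder.layer.<index>.…`, as in the `transformers` BERT implementation. Whether the embeddings were also frozen is not stated here, so this version leaves them trainable.

```python
import re

# Matches names such as "encoder.layer.3.attention.self.query.weight" or
# "bert.encoder.layer.3.output.dense.bias" and captures the layer index.
_LAYER_RE = re.compile(r"(?:^|\.)encoder\.layer\.(\d+)\.")


def is_frozen(param_name: str, n_frozen_layers: int = 8) -> bool:
    """Return True if the parameter belongs to one of the first n encoder layers."""
    match = _LAYER_RE.search(param_name)
    return match is not None and int(match.group(1)) < n_frozen_layers


def freeze_first_layers(model, n_frozen_layers: int = 8) -> int:
    """Disable gradients for the first n encoder layers of a PyTorch model.

    Assumption: embeddings and any task head stay trainable.
    Returns the number of parameter tensors that were frozen.
    """
    frozen = 0
    for name, param in model.named_parameters():
        if is_frozen(name, n_frozen_layers):
            param.requires_grad = False
            frozen += 1
    return frozen
```

With a model loaded through `transformers` (for example via `AutoModelForMaskedLM.from_pretrained`), calling `freeze_first_layers(model)` before training would leave only the upper encoder layers, the embeddings, and the head trainable.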

How to use

The models are available on the Hugging Face Hub and can be fine-tuned for any downstream task:

# !pip install transformers==4.14.1
from transformers import AutoModel, AutoTokenizer, pipeline

# Choose one of the two released checkpoints:
# model_name = 'avichr/Legal-heBERT_ft'  # the fine-tuned HeBERT model
model_name = 'avichr/Legal-heBERT'  # Legal-HeBERT trained from scratch

# Load the tokenizer and the base encoder (e.g., as a starting point for fine-tuning)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Predict the masked token with the fill-mask pipeline
fill_mask = pipeline(
    "fill-mask",
    model=model_name,
)
# The Hebrew input reads: "The coronavirus took [MASK] and we were left with nothing."
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
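
For an input with a single mask, the `fill-mask` pipeline returns a list of candidate dictionaries, each with `score`, `token`, `token_str` and `sequence` keys. A small helper for reading that output might look like the sketch below. `top_predictions` is a hypothetical name and is not part of `transformers`, and the sample data is a made-up placeholder rather than real model output.

```python
from typing import Dict, List, Tuple


def top_predictions(results: List[Dict], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


# Placeholder data in the shape the pipeline returns; NOT real model output.
sample = [
    {"score": 0.10, "token": 11, "token_str": "b", "sequence": "..."},
    {"score": 0.70, "token": 10, "token_str": "a", "sequence": "..."},
    {"score": 0.05, "token": 12, "token_str": "c", "sequence": "..."},
]
print(top_predictions(sample, k=2))
```

In practice the pipeline result would be passed in directly, for example `top_predictions(fill_mask(text))`.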

Stay tuned!

We are still working on our models and datasets, and we will update this page as we progress. We are open to collaboration.

If you use this model, please cite us as:

Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai, Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts (June 27, 2022). Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4147127

@article{chriqui2021hebert,
  title={Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts},
  author={Chriqui, Avihay and Yahav, Inbal and Bar-Siman-Tov, Ittai},
  journal={SSRN preprint:4147127},
  year={2022}
}

Contact us

Avichay Chriqui, The Coller AI Lab
Inbal Yahav, The Coller AI Lab
Ittai Bar-Siman-Tov, the BIU Innovation Lab for Law, Data-Science and Digital Ethics

Thank you, תודה, شكرا