Legal Linking

The repository for Legal Linking: Citation Resolution and Suggestion in Constitutional Law, published at the Natural Legal Language Processing workshop at NAACL 2019.

Our trained models are available in the releases section, and the data is available in this repository.

Requirements

AllenNLP is easy to install (preferably in a conda environment):

$ pip install allennlp

Running the Code

$ allennlp train legal.json --include-package mylib -s tmp

Here -s specifies the serialization directory where the trained model will be stored. In practice you will want something more descriptive than tmp.

NOTE: for some reason, training creates a bert.txt vocab file under the model's vocabulary folder. This causes issues with OOV tokens (I haven't figured out why this happens). To work around it, go into that folder and run cp bert.txt .bert.txt; the vocabulary reader ignores hidden files. You will then have to recreate the model archive (in the serialization directory):

$ cp best.th weights.th
$ tar czvf model.tar.gz weights.th vocabulary/ config.json

Preparing Data

All data is originally stored in JSON files, and the mkdata.sh script converts it into the lines format. The format that the datareader expects is line<TAB>label1,label2.
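
For illustration, here is a minimal sketch of reading that format; the parsing is an assumption based on the description above, not the actual datareader in mylib:

def read_lines(path):
    # Parse the lines format: text<TAB>label1,label2
    with open(path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.rstrip("\n")
            if not raw:
                continue
            text, _, label_str = raw.partition("\t")
            labels = label_str.split(",") if label_str else []
            yield text, labels

for text, labels in read_lines("all_lines_labeled"):
    print(labels, text[:60])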

You can give mkdata.sh a number as an argument to limit the number of examples (you usually don't want this, except for quick testing).

Creating the all_lines_labeled file takes a while (about 15 minutes), so if it already exists, mkdata.sh won't regenerate it. Be careful though: if the underlying data has changed, delete the file first so it gets rebuilt.

The data/stats.py script will give some stats on the data.

Get Rule-based Results

Look at score_all.sh for an example. You want to use tag_data.py with the -d (destructive) flag, which removes all prior annotations before adding new ones. If you omit this flag, the gold annotations stay in place and you will get an artificially high score (close to 100%).
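
To see why, here is a toy illustration of the tagging step; this is a hypothetical sketch, not the actual tag_data.py logic:

# If prior (gold) labels are kept, they leak into the "predictions",
# so scoring against the gold file looks nearly perfect.
def tag(lines, rules, destructive=True):
    tagged = []
    for text, old_labels in lines:
        labels = set() if destructive else set(old_labels)
        for pattern, label in rules:
            if pattern in text:  # simple string-match rule
                labels.add(label)
        tagged.append((text, sorted(labels)))
    return tagged

rules = [("Fourteenth Amendment", "55")]  # made-up rule table
lines = [("...violates the Fourteenth Amendment...", ["55"])]
print(tag(lines, rules, destructive=True))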

Get Linear Model Results

Make sure to set the training data and output parameters in linear_model.py, then run:

$ python linear_model.py
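
The model itself is described in the paper. As a rough idea of the shape of such a baseline, here is a hypothetical sketch using TF-IDF features and a one-vs-rest logistic regression; linear_model.py may differ in both features and classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up training examples in the spirit of the lines format: text plus labels.
train_texts = ["...equal protection of the laws...", "...freedom of speech..."]
train_labels = [["55"], ["1"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(train_labels)
X = TfidfVectorizer().fit_transform(train_texts)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(mlb.inverse_transform(clf.predict(X)))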

Get NN Model Results

Run this:

$ allennlp predict legal-bert-model.tar.gz data/validation/all_validation --include-package mylib --cuda-device 0 --use-dataset-reader --output bert.txt --predictor legal_predictor --silent

How to score result files

To score a result file against a gold file, run score.py on the predictions and gold line files.

$ python score.py --gold data/validation/all_validation --pred results/linear.txt
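
For a quick sanity check, label-level precision, recall, and F1 can be computed by hand like this; score.py's exact metrics may differ:

# Read the comma-separated label sets out of a lines-format file.
def labels(path):
    with open(path, encoding="utf-8") as f:
        return [set(filter(None, l.rstrip("\n").partition("\t")[2].split(",")))
                for l in f]

gold = labels("data/validation/all_validation")
pred = labels("results/linear.txt")

tp = sum(len(g & p) for g, p in zip(gold, pred))
fp = sum(len(p - g) for g, p in zip(gold, pred))
fn = sum(len(g - p) for g, p in zip(gold, pred))
prec = tp / (tp + fp) if tp + fp else 0.0
rec = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
print(f"P={prec:.3f} R={rec:.3f} F1={f1:.3f}")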

Running the Demo

Make sure the model is in the right place (if it was trained on a GPU, you will need a CUDA device available to load it), and run ./demo.sh. Very easy.

Static files for the demo are kept in demo_files.

The Models

See the paper for a description of the models.

Training on the GPU

To train on the GPU, change the device parameter in legal.json to 0. You will probably also want to change the batch_size. I typically copy legal.json to legal_gpu.json and modify the copy.
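
For reference, the fragment of legal_gpu.json being changed might look like the following; the field names assume the standard AllenNLP config layout and the batch size is just an example, so check them against your legal.json:

{
  "iterator": { "batch_size": 32 },
  "trainer": { "cuda_device": 0 }
}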

Data structure

Cases

For the US Supreme Court (USSC) files, data are organized as a jsonlines file, with one entry (representing a single Supreme Court case) per line. Each case is a list of paragraph objects, structured as follows:

[
  {
    'text': "In 2008, the California Supreme Court held that...",
    'meta': {
             'doc_type': "opinion",
             'id': 0,
             'source_url': "https://www.law.cornell.edu/supremecourt/text/12-144"
            },
    'matches': [["Fourteenth Amendment", "https://www.law.cornell.edu/constitution/amendmentxiv", '55']]
  },
  ...
]

Each element of this list corresponds to a paragraph in the source case. The fields are as follows:

- text: the text of the paragraph.
- meta: metadata for the paragraph, including the doc_type (e.g. opinion), a numeric id, and the source_url the case was scraped from.
- matches: the constitutional entity matches found in the paragraph, each of the form [matched string, target URL, key], where the key indexes into the constitutional text file described below.

For case data, two files are included: a full version, which contains the unedited text of each case, and a stripped version, in which the text strings corresponding to constitutional named entities are removed and replaced with the placeholder string @@@. Each file is split into chunks of approximately 100
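
To work with these files programmatically, a minimal sketch (the file name here is hypothetical):

import json

# Iterate the paragraphs of each case and print its constitutional
# matches. Assumes one JSON case (a list of paragraphs) per line.
with open("ussc_cases.jsonl", encoding="utf-8") as f:
    for line in f:
        case = json.loads(line)
        for para in case:
            for span, url, key in para["matches"]:
                print(para["meta"]["id"], key, span, url)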

Constitutional text

For the Constitutional text, data are organized as a single JSON file. The data structure is as follows:

{
  '55': {'text': 'Amendment XIV\nSection 1.\n\nAll persons born or naturalized in the United States...',
         'i': 55,
         'link': 'https://www.law.cornell.edu/constitution/articlexiv'},
  ...
}

Each key in this file corresponds to a hyperlinked entity match scraped from the Cornell website. These keys match the indices from the matches field in the case files.
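
Putting the two files together, a match key from a case paragraph can be resolved to the constitutional text it cites; the file name here is hypothetical:

import json

with open("constitution.json", encoding="utf-8") as f:
    constitution = json.load(f)

# A match as it appears in a case paragraph's matches field.
match = ["Fourteenth Amendment",
         "https://www.law.cornell.edu/constitution/amendmentxiv", "55"]

entry = constitution[match[2]]  # look up by the '55' key
print(entry["link"])
print(entry["text"][:80])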