ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference

Intro

Downloading dataset

The latest version of the dataset is publicly available on Zenodo.

Dataset preparation

We highly recommend downloading the latest version of the dataset as described above. If you want to prepare the dataset manually, follow the instructions below.

Requirements

Steps

  1. Clone the dataset:

    python -m repo_cloner -i ./mypy-dependents-by-stars.json -o repos
    
  2. To reset the cloned repositories to the state used in ManyTypes4Py, run the following script with the ManyTypes4PyDataset.spec file:

    ./scripts/reset_commits.sh  ./ManyTypes4PyDataset.spec repos
    
  3. Tokenize the dataset and detect duplicates using cd4py:

    cd4py --p repos --ot tokens --od manytypes4py_dataset_duplicates.jsonl.gz --d 1024
    
  4. Gather the duplicate files from the cd4py output and write them to a single text file (using collect_dupes.py):

    python3 scripts/collect_dupes.py manytypes4py_dataset_duplicates.jsonl.gz manytypes4py_dup_files.txt
    
  5. Create a copy of the dataset with the previously collected duplicate files removed (using process_dataset.py):

    python3 scripts/process_dataset.py repos manytypes4py_dup_files.txt [new dataset path]
    
  6. Split the dataset into training, validation, and test sets (using split_dataset.py):

    python3 scripts/split_dataset.py [new dataset path] manytypes4py_split.csv
    
  7. To process the Python repositories and produce JSON output files, run the libsa4py pipeline as follows:

    libsa4py process --p [new dataset path] --o [processed projects path] --s manytypes4py_split.csv --j [WORKERS COUNT]
    

    Check out the libsa4py README for more info on its usage.

  8. Create a compressed tar archive of the full dataset and its artifacts, gathered in one folder:

    tar -czvf [output path] [dataset artifacts path]
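
The eight steps above can also be driven from a single script. The following is a minimal sketch and not part of the repository: the function names `build_pipeline_commands` and `run_pipeline` are hypothetical, and the commands are copied from the steps, with the bracketed placeholders turned into parameters. It defaults to a dry run that only prints each command.

```python
import shlex
import subprocess
from typing import List


def build_pipeline_commands(
    dataset_path: str,
    processed_path: str,
    workers: int,
    archive_path: str,
    artifacts_path: str,
) -> List[List[str]]:
    """Return the preparation commands from steps 1-8, in order, as argument lists.

    Hypothetical helper: the parameters stand in for the bracketed
    placeholders used in the steps above.
    """
    dupes_gz = "manytypes4py_dataset_duplicates.jsonl.gz"
    dupes_txt = "manytypes4py_dup_files.txt"
    split_csv = "manytypes4py_split.csv"
    return [
        ["python", "-m", "repo_cloner", "-i", "./mypy-dependents-by-stars.json", "-o", "repos"],
        ["./scripts/reset_commits.sh", "./ManyTypes4PyDataset.spec", "repos"],
        ["cd4py", "--p", "repos", "--ot", "tokens", "--od", dupes_gz, "--d", "1024"],
        ["python3", "scripts/collect_dupes.py", dupes_gz, dupes_txt],
        ["python3", "scripts/process_dataset.py", "repos", dupes_txt, dataset_path],
        ["python3", "scripts/split_dataset.py", dataset_path, split_csv],
        ["libsa4py", "process", "--p", dataset_path, "--o", processed_path,
         "--s", split_csv, "--j", str(workers)],
        ["tar", "-czvf", archive_path, artifacts_path],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_pipeline_commands(...))` prints the commands for review; pass `dry_run=False` only once the paths have been checked, since the later steps depend on the output of the earlier ones.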
    

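To illustrate the kind of file-level split that step 6 produces, here is a minimal sketch of a deterministic split. It does not reproduce the logic of split_dataset.py: the 70/10/20 ratios, the `train`/`valid`/`test` labels, the two-column CSV layout, and the function names are all assumptions made for illustration.

```python
import csv
import random
from typing import Dict, Iterable, List


def split_files(
    files: Iterable[str],
    train_pct: int = 70,   # assumed ratio, not taken from the repository
    valid_pct: int = 10,   # assumed ratio, not taken from the repository
    seed: int = 42,
) -> Dict[str, List[str]]:
    """Shuffle file paths reproducibly and cut them into three subsets."""
    shuffled = sorted(files)               # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)  # seeded shuffle keeps the split reproducible
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # the remainder goes to the test set
    }


def write_split_csv(split: Dict[str, List[str]], csv_path: str) -> None:
    """Write one `subset,path` row per file (assumed layout)."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for subset, paths in split.items():
            for path in paths:
                writer.writerow([subset, path])
```

Fixing the seed and sorting the input means the same set of files always yields the same split, which matters when the split file is shared alongside the dataset.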
Citing the dataset

If you have used the dataset in your research work, please consider citing it.

@inproceedings{mt4py2021,
author = {A. M. Mir and E. Latoskinas and G. Gousios},
booktitle = {IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},
title = {ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference},
year = {2021},
pages = {585-589},
doi = {10.1109/MSR52588.2021.00079},
publisher = {IEEE Computer Society},
month = {May}
}

Roadmap