Megadiff, a dataset of source code changes

If you use Megadiff, please cite the following technical report:

"Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size". Technical Report 2108.04631, arXiv, 2021.

@techreport{megadiff,
  TITLE = {{Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size}},
  AUTHOR = {Martin Monperrus and Matias Martinez and He Ye and Fernanda Madeiral and Thomas Durieux and Zhongxing Yu},
  URL = {http://arxiv.org/pdf/2108.04631},
  INSTITUTION = {Arxiv},
  NUMBER = {2108.04631},
  YEAR = {2021},
}

Architecture

Example usage:

xzcat ./8/ae49f3458915859104ebd1e0858a409e01291e6d.diff.xz
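For programmatic access, the same file can be read without calling `xzcat`. The following is a minimal sketch, not part of the Megadiff tooling: it assumes each `.diff.xz` file is an xz-compressed text file in unified diff format, and the helper names `read_diff` and `count_changed_lines` are hypothetical.

```python
import lzma


def read_diff(path):
    """Return the decompressed text of an xz-compressed diff file."""
    # errors="replace" keeps reading even if a diff contains non-UTF-8 bytes.
    with lzma.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def count_changed_lines(diff_text):
    """Count added and removed lines, assuming unified diff format.

    File header lines ("+++ ..." and "--- ...") are skipped. This is a
    heuristic: a changed line whose own content starts with "++" or "--"
    would be skipped as well.
    """
    added = removed = 0
    for line in diff_text.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


if __name__ == "__main__":
    # Path taken from the xzcat example above.
    text = read_diff("./8/ae49f3458915859104ebd1e0858a409e01291e6d.diff.xz")
    print(count_changed_lines(text))
```

Using the standard-library `lzma` module avoids a dependency on the `xz` command-line tools, which may be convenient when iterating over many files.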

The datasets are also available on Hugging Face.

Benchmark Leakage

Megadiff contains samples extracted from some of the projects included in Defects4J. An analysis of single-function Megadiff and Defects4J samples shows that:

Note that this analysis considers only the changed functions and does not take other code into account.