FixJS
The FixJS dataset contains information on every publicly available bug-fixing commit that affected JavaScript files in the first half of 2012. We started from scratch and created ~300k samples in three settings, using different source code representations. The table below summarizes the assembled datasets: the Large subset contains the majority of the samples, and this, together with the sheer size of its functions, means that its size in megabytes is also the greatest.
FixJS: A Dataset of Bug-fixing JavaScript Commits
This repository contains the open science data used in the paper
V. Csuvik and L. Vidacs, FixJS: A Dataset of Bug-fixing JavaScript Commits
submitted to the Proceedings of the 19th International Conference on Mining Software Repositories (MSR '22). If you use FixJS for academic purposes, please cite the appropriate publication:
@inproceedings{csuvik:fixjs,
title = {FixJS: A Dataset of Bug-fixing JavaScript Commits},
author = {Csuvik, Viktor and Vidacs, Laszlo},
booktitle = {2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR)},
year = {2022},
doi = {10.1145/3524842.3528480},
}
FixJS contains both single- and multi-line bugs. The before and after states of the mined functions are differentiated using their Abstract Syntax Trees, meaning that samples in which only comments have changed are filtered out.
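The exact tooling is not part of this repository, but the idea can be sketched with a rough, regex-based approximation (the dataset itself relies on AST comparison, which also handles cases such as comment markers inside string literals that this sketch does not):

import re

# Matches // line comments and /* ... */ block comments (rough approximation).
COMMENT_RE = re.compile(r'//[^\n]*|/\*.*?\*/', re.DOTALL)

def normalize(code):
    # Drop comments and collapse whitespace so layout-only changes are ignored too.
    return re.sub(r'\s+', ' ', COMMENT_RE.sub('', code)).strip()

def is_comment_only_change(before, after):
    # A before/after pair is dropped when both versions are identical after normalization.
    return normalize(before) == normalize(after)

print(is_comment_only_change('var x = 1; // init', 'var x = 1;'))  # True -> filtered out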
The repository contains two folders:
- commits: raw information about the mined commits
- input: the constructed dataset
In each directory we provide a README file that describes the structure of the folder in question.
Get your hands dirty!
- Clone the repository and pick a dataset size (50, 100 or 100+)
git clone https://github.com/RGAI-USZ/FixJS.git
cd FixJS/input/50/
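Each size directory should contain one text file per representation and state, following the before_/after_ naming scheme used in the next step. A quick way to see what is available (a small sketch, assuming you are in FixJS/input/50/ after the commands above):

from pathlib import Path

# List the representation files in the current dataset directory.
for path in sorted(Path('.').glob('*_*.txt')):
    print(path.name)  # e.g. before_idiom.txt, before_mapped.txt, before_tokenized.txt, ...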
- Load the before_<rep>.txt and after_<rep>.txt files (where <rep> can be one of idiom, mapped, or tokenized)
from pathlib import Path
import numpy as np
buggy_data = Path('./before_tokenized.txt').read_text(encoding='utf-8').splitlines()
fixed_data = Path('./after_tokenized.txt').read_text(encoding='utf-8').splitlines()
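The two files are parallel: line i of the before file is the buggy version of the fixed function on line i of the after file, so a quick sanity check is worthwhile before training:

# The files must stay aligned, otherwise buggy/fixed pairs get mixed up.
assert len(buggy_data) == len(fixed_data)
print(f'{len(buggy_data)} buggy/fixed pairs loaded')
print('buggy:', buggy_data[0])
print('fixed:', fixed_data[0])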
- Split the dataset and train the model
data_len = len(buggy_data)
indices = np.arange(data_len)
np.random.seed(13)  # fixed seed so the shuffle (and thus the split) is reproducible
np.random.shuffle(indices)
buggy_data = np.array(buggy_data, dtype=object)[indices].tolist()
fixed_data = np.array(fixed_data, dtype=object)[indices].tolist()
# 80% train / 10% validation / 10% test split
valid_start = int(data_len * 0.8)
test_start = valid_start + int(data_len * 0.1)
train_input, train_target = buggy_data[:valid_start], fixed_data[:valid_start]
valid_input, valid_target = buggy_data[valid_start:test_start], fixed_data[valid_start:test_start]
test_input, test_target = buggy_data[test_start:], fixed_data[test_start:]
train_model(train_input, train_target, valid_input, valid_target)
- Evaluate the model on the test set
evaluate_model(test_input, test_target)
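train_model and evaluate_model are not provided by the dataset; they stand in for whatever learning setup you plug in. Purely as a hypothetical stub that makes the snippets above runnable end-to-end, one could define an identity baseline scored with exact-match accuracy (by construction it scores near zero on FixJS, since pairs that are identical up to comments are filtered out; it only illustrates the expected plumbing). Define these, or your own implementations, before running the training step above.

def train_model(train_input, train_target, valid_input, valid_target):
    # Hypothetical placeholder: a real setup would train e.g. a seq2seq model here.
    print(f'training on {len(train_input)} pairs, validating on {len(valid_input)} pairs')

def evaluate_model(test_input, test_target):
    # Identity baseline: "predict" the buggy function unchanged, then report
    # exact-match accuracy against the fixed function.
    predictions = test_input
    correct = sum(pred == target for pred, target in zip(predictions, test_target))
    print(f'exact match: {correct}/{len(test_target)} ({correct / len(test_target):.2%})')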