
XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence

Introduction

Recent advances in machine learning have benefited a number of code-related tasks, such as code translation, code summarization, and code synthesis. Open-source code repository websites like GitHub provide an enormous amount of source code data, which enables the training of large-scale code language models such as CodeBERT (Feng et al., 2020), PLBART (Ahmad et al., 2021a), TransCoder (Roziere et al., 2020) and CodeT5 (Wang et al., 2021). Although open-source code data is abundant, it has several disadvantages when serving as training data for code-related models. First, most of the available code data is unlabeled. For tasks like Code Translation, Code Summarization, and Code Synthesis, high-quality parallel data is critical for model training.

We introduce XLCoST, a machine learning benchmark dataset that contains fine-grained parallel data in 7 commonly used programming languages (C++, Java, Python, C#, JavaScript, PHP, C) and natural language (English). The data is parallel across the 7 languages at both the code snippet level and the program level. This means that given a program in one language, the dataset contains the same program in up to 6 other programming languages. Each program is divided into several code snippets, and programs in all the languages are aligned at the snippet level. Moreover, each snippet is accompanied by a comment, and the comment for a particular snippet is the same across all the languages. Please find the full paper here.
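The alignment described above can be pictured with a small in-memory model. This is only a sketch for illustration: the class names `AlignedSnippet` and `ParallelProgram`, the helper `snippet_pairs`, and the toy example are hypothetical and do not reflect the actual file layout of the released data.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, Iterator, List, Tuple


@dataclass
class AlignedSnippet:
    comment: str            # one comment, shared by every language
    code: Dict[str, str]    # language name -> snippet text in that language


@dataclass
class ParallelProgram:
    description: str                # problem description (natural language)
    snippets: List[AlignedSnippet]  # snippets in program order

    def program(self, lang: str) -> str:
        """Rebuild the full program in one language by joining its snippets."""
        return "\n".join(s.code[lang] for s in self.snippets if lang in s.code)


def snippet_pairs(prog: ParallelProgram) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (lang_a, lang_b, code_a, code_b) for every aligned snippet.

    Language pairs are unordered; each pair can be used in either
    translation direction.
    """
    for snip in prog.snippets:
        for a, b in combinations(snip.code, 2):
            yield a, b, snip.code[a], snip.code[b]


# Toy example (made up for illustration, not taken from the dataset).
example = ParallelProgram(
    description="Sum of two numbers",
    snippets=[
        AlignedSnippet(
            "Function to add two numbers",
            {"Python": "def add(a, b):\n    return a + b",
             "Java": "static int add(int a, int b) { return a + b; }"},
        ),
        AlignedSnippet(
            "Driver code",
            {"Python": "print(add(2, 3))",
             "Java": "System.out.println(add(2, 3));"},
        ),
    ],
)

for lang_a, lang_b, code_a, code_b in snippet_pairs(example):
    print(f"{lang_a} <-> {lang_b}: {code_a!r} | {code_b!r}")
```

Because every snippet carries its comment and each program carries its description, the same structure would also yield snippet-comment and program-description pairs.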

The figure below shows a schematic diagram of how the dataset is organised and the possible tasks that can be performed with it.

Tasks

We introduce the following 10 cross-lingual tasks. All the tasks have pairwise data at both snippet level and program level in 7 programming languages: C++, Java, Python, C#, JavaScript, PHP, and C. The tasks fall into two categories, generation and retrieval. The generation tasks are Code Translation, Code Summarization, and Code Synthesis; the retrieval tasks are NL (natural language) Code Search and XL (cross-lingual) Code Search. We use 3 state-of-the-art baselines for the generation tasks and 2 for the retrieval tasks.

Category	Task type	Task	Data (train/val/test)	Description	Baselines
Generation	Code-to-Code	Snippet Translation	872K/47K/83K	Translate code snippet across programming languages	CodeBERT(enc-dec), PLBART, CodeT5
		Program Translation	106K/6K/11K	Translate program across programming languages
	Code-to-Text	Snippet Summarization	446K/22K/41K	Generate comment for given code snippet
		Program Summarization	50K/3K/5K	Generate problem description for given program
	Text-to-Code	Snippet Synthesis	446K/22K/41K	Generate code snippet given comment
		Program Synthesis	50K/3K/5K	Generate program given problem description and comments
Retrieval	NL Code Search	Comment-to-Snippet Search	446K/22K/41K	Retrieve code snippet for given comment	RoBERTa, CodeBERT
		Problem-to-Program Search	50K/3K/5K	Retrieve program for given problem description
	XL Code Search	Snippet-to-Snippet Search	872K/47K/83K	Retrieve code snippets in other languages for given snippet
		Program-to-Program Search	106K/6K/11K	Retrieve programs in other languages for given program

How to use this repository

Use the requirements.txt file to set up your environment.

Code for this repository has been adapted from CodeXGLUE and PLBART.

Instructions to run the generation tasks can be found here.

Instructions to run the code search tasks can be found here.

Data

The data can be downloaded here.

Data Description (Metadata)

Details about the data files and metadata can be found here.

Statistics

Some basic averaged statistics of the dataset are presented below. "#" means number. #comments/program is the same as #snippets/program. (Py is short for Python; JS for JavaScript; TOK for tokens; SN for snippets; PR for programs; com for comments; desc for problem description.)

	C++	Java	Python	C#	JS	PHP	C	Avg
# tokens/snippet	21.52	24.1	21.63	23.06	22.52	28.14	25.37	22.83
# tokens/program	204.97	227.09	188.54	215.29	184.63	163.51	197.95	201.96
# tokens/comment	8.25	8.14	7.97	8.23	7.96	8.45	9.67	8.15
# tokens/desc	10.68	10.67	10.75	10.7	10.87	9.91	8.19	10.66
# snippets/program	9.52	9.42	8.51	9.33	8.2	5.81	7.77	8.81
# lines/snippet	3.41	3.71	2.41	3.82	3.23	4	4.05	3.37
# lines/program	32.45	34.93	20.54	35.64	26.47	23.23	31.5	29.71
total snippets	106,397	103,703	92,446	100,032	81,511	20,639	4,363	-
total programs	11,198	11,028	10,622	10,735	9,951	3,553	574	-

The number of pairwise code-code data in the training, validation, and testing splits for each language pair is presented in the following table. The upper triangle shows the number of parallel code snippets, and the lower triangle shows the number of parallel programs. This data is used for the Code Translation and XL Code Search tasks. (Py is short for Python. JS is short for JavaScript.)

Code-Code Pairs		C++	Java	Python	C#	JS	PHP	C
C++	train		89,040	80,100	85,662	69,507	17,811	3,386
	val		4,419	3,913	4,408	3,808	923	352
	test		8,059	7,228	7,922	6,965	1,647	222
Java	train	9,450		77,759	87,065	69,341	17,853	2,996
	val	490		3,938	4,437	3,826	929	353
	test	901		7,259	8,011	7,005	1,672	238
Python	train	9,139	8,991		75,843	67,219	17,616	2,478
	val	468	471		3,922	3,750	923	311
	test	878	882		7,215	6,861	1,655	203
C#	train	9,187	9,301	8,826		68,093	17,873	2,958
	val	488	491	470		3,826	928	352
	test	890	898	877		6,961	1,668	238
JS	train	8,482	8,470	8,182	8,367		17,117	1,875
	val	472	475	459	475		921	309
	test	878	881	864	877		1,617	200
PHP	train	3,056	3,068	3,003	3,071	2,971		856
	val	157	158	153	158	157		271
	test	303	307	304	307	302		183
C	train	402	409	380	394	308	170
	val	59	59	59	59	59	55
	test	45	49	48	49	49	43
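As a quick consistency check, the training counts in the upper triangle of the table above can be summed and compared with the roughly 872K training pairs listed for Snippet Translation in the task table. The sketch below simply copies those numbers into a dictionary; the variable names are illustrative and nothing here reads the released data files.

```python
# Training-split snippet pair counts, copied from the upper triangle
# of the code-code pairs table (one entry per unordered language pair).
snippet_train_pairs = {
    ("C++", "Java"): 89_040, ("C++", "Python"): 80_100, ("C++", "C#"): 85_662,
    ("C++", "JS"): 69_507, ("C++", "PHP"): 17_811, ("C++", "C"): 3_386,
    ("Java", "Python"): 77_759, ("Java", "C#"): 87_065, ("Java", "JS"): 69_341,
    ("Java", "PHP"): 17_853, ("Java", "C"): 2_996,
    ("Python", "C#"): 75_843, ("Python", "JS"): 67_219,
    ("Python", "PHP"): 17_616, ("Python", "C"): 2_478,
    ("C#", "JS"): 68_093, ("C#", "PHP"): 17_873, ("C#", "C"): 2_958,
    ("JS", "PHP"): 17_117, ("JS", "C"): 1_875,
    ("PHP", "C"): 856,
}

# 7 languages give 7 * 6 / 2 = 21 unordered pairs.
assert len(snippet_train_pairs) == 21

total = sum(snippet_train_pairs.values())
print(f"{total:,} parallel training snippet pairs (about {round(total / 1000)}K)")
# prints: 872,448 parallel training snippet pairs (about 872K)
```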

The number of pairwise code-text data in each language is presented in the table below. "Snippet" means snippet-comment pairs, and "Program" means program-description (problem description) pairs. This data is used for the Code Summarization (Code-to-Text), Code Synthesis (Text-to-Code), and NL Code Search tasks.

NL-Code Pairs		C++	Java	Python	C#	JS	PHP	C	Total
Snippet	train	93,847	91,089	81,207	87,583	70,649	18,027	3,763	446,165
	valid	4,432	4,460	3,946	4,436	3,829	930	350	22,383
	test	8,118	8,154	7,293	8,013	7,033	1,682	250	40,543
Program	train	9,797	9,623	9,263	9,345	8,590	3,087	463	50,168
	valid	492	494	472	491	475	158	60	2,642
	test	909	911	887	899	886	308	51	4,851
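The Total column can be re-derived from the per-language counts. The following sketch copies the rows of the table above into a dictionary (the names `LANGS` and `nl_code_pairs` are illustrative) and checks that each row sums to its reported total.

```python
LANGS = ["C++", "Java", "Python", "C#", "JS", "PHP", "C"]

# (level, split) -> (per-language counts in LANGS order, reported total),
# copied from the NL-code pairs table.
nl_code_pairs = {
    ("Snippet", "train"): ([93_847, 91_089, 81_207, 87_583, 70_649, 18_027, 3_763], 446_165),
    ("Snippet", "valid"): ([4_432, 4_460, 3_946, 4_436, 3_829, 930, 350], 22_383),
    ("Snippet", "test"): ([8_118, 8_154, 7_293, 8_013, 7_033, 1_682, 250], 40_543),
    ("Program", "train"): ([9_797, 9_623, 9_263, 9_345, 8_590, 3_087, 463], 50_168),
    ("Program", "valid"): ([492, 494, 472, 491, 475, 158, 60], 2_642),
    ("Program", "test"): ([909, 911, 887, 899, 886, 308, 51], 4_851),
}

for (level, split), (counts, reported) in nl_code_pairs.items():
    assert len(counts) == len(LANGS)
    computed = sum(counts)
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{level:8s} {split:5s} computed={computed:>7,} reported={reported:>7,} {status}")
```

Every row matches its reported total, and the rounded values (446K/22K/41K and 50K/3K/5K) agree with the Data column of the task table.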

With the release of this dataset, we hope to enable more research into the domain of Deep Learning for Software Engineering tasks. We believe that this dataset is a valuable asset for the research community and can potentially benefit a number of code-related research problems.

Citation

If you use this dataset in your work, please consider citing us. The arXiv version of the paper can be found here.

@misc{zhu2022xlcost,
     title = {XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence},
     url = {https://arxiv.org/abs/2206.08474},
     author = {Zhu, Ming and Jain, Aneesh and Suresh, Karthik and Ravindran, Roshan and Tipirneni, Sindhu and Reddy, Chandan K.},
     year = {2022},
     eprint={2206.08474},
     archivePrefix={arXiv}
}