Awesome

MathComp Corpus

A corpus of code for the Coq proof assistant along with several independently machine-readable representations, derived from verification projects related to the Mathematical Components (MathComp) library. The corpus can be used, e.g., in machine learning and data mining applications. Machine-readable representations are in the form of S-expressions (sexps), and were created using the SerAPI library.

Obtaining the corpus

The latest released version of the corpus can be downloaded from GitHub. The archive includes both the Coq source files and corresponding machine-readable representations.

Corpus contents

The latest corpus release is based on Coq 8.10.2 and MathComp 1.9.0, and contains 449 source files from 21 Coq projects --- in total over 297k lines of code (LOC). For each source file (e.g., theory.v), there are two corresponding files with lists of sexps, organized at the Coq sentence level, for tokens (theory.tok.sexp) and abstract syntax trees (theory.ast.sexp). Moreover, three sexp representations are provided for each Coq lemma statement in the corpus: tokens, abstract syntax tree, and elaborated term. All machine-readable representations were created using SerAPI 0.7.1, via the SerAPI-bundled programs sercomp, sertok, and sername.

A research paper describes the corpus in more detail and provides additional statistics. The corpus is divided into three tiers based on how well projects conform to the MathComp conventions.

Project	Revision SHA	No. files	LOC	Tier	License
finmap	27642a8	4	6451	1	CECILL-B
fourcolor	0851d49	60	37138	1	CECILL-B
math-comp	748d716	89	84713	1	CECILL-B
odd-order	ca602a4	34	36125	1	CECILL-B
analysis	9e5fe1d	17	11739	2	CECILL-C
bigenough	5f79a32	1	78	2	CECILL-B
elliptic-curves	631af89	18	9596	2	CECILL-B
grobner	dfa54f9	1	1330	2	CECILL-B
infotheo	6c17242	81	42295	2	GPL-3.0-only
multinomials	691d795	5	7363	2	CECILL-B
real-closed	495a1fa	10	8925	2	CECILL-B
robot	b341ad1	13	11130	2	LGPL-3.0-only
two-square	1c09aca	2	1721	2	CECILL-B
bits	3cd298a	10	4041	3	Apache-2.0
comp-dec-pdl	c1f813b	16	4419	3	CECILL-B
disel	e8aa80c	20	4473	3	BSD-2-Clause
fcsl-pcm	eef4503	12	5789	3	Apache-2.0
games	3d3bd31	12	4953	3	BSD-2-Clause
monae	9d0e461	18	6655	3	GPL-3.0-only
reglang	da333e0	12	3033	3	CECILL-B
toychain	97bd697	14	5275	3	BSD-2-Clause

The structure of the corpus is as follows:

File/directory	Contents
`projects.yml`	List of projects, along with their URL, SHA, build command, installation command, etc.
`raw-files`	Project source files and their machine-readable representations.
`lemmas`	Lemmas for all projects in the corpus and their machine-readable statements.
`lemmas-filtered`	Subset of lemmas obeying restrictions on the maximum sizes of elaborated terms.
`definitions`	Definitions extracted from the corpus.

Applications

The corpus was used to train and evaluate the tool Roosterize, which suggests lemma names in Coq code using neural networks.

If you have used the corpus in a research project, please cite the corresponding research paper in any related publication:

@inproceedings{NieETAL20Roosterize,
  author = {Nie, Pengyu and Palmskog, Karl and Li, Junyi Jessy and Gligoric, Milos},
  title = {Deep Generation of {Coq} Lemma Names Using Elaborated Terms},
  booktitle = {International Joint Conference on Automated Reasoning},
  pages = {97--118},
  doi = {10.1007/978-3-030-51054-1_6},
  year = {2020},
}

Example data

The Coq lemma sentence

Lemma mg_eq_proof L1 L2 (N1 : mgClassifier L1) : L1 =i L2 -> nerode L2 N1.

is serialized into the following tokens (simplified):

(Sentence((IDENT Lemma)(IDENT mg_eq_proof)(IDENT L1)(IDENT L2)
  (KEYWORD"(")(IDENT N1)(KEYWORD :)(IDENT mgClassifier)
  (IDENT L1)(KEYWORD")")(KEYWORD :)(IDENT L1)(KEYWORD =i)(IDENT L2)
  (KEYWORD ->)(IDENT nerode)(IDENT L2)(IDENT N1)(KEYWORD .)))

and the corresponding parsed syntax (simplified):

(VernacExpr()(VernacStartTheoremProof Lemma (Id mg_eq_proof)
 (((CLocalAssum(Name(Id L1))(CHole()IntroAnonymous()))
   (CLocalAssum(Name(Id L2))(CHole()IntroAnonymous()))
   (CLocalAssum(Name(Id N1))
    (CApp(CRef(Ser_Qualid(DirPath())(Id mgClassifier)))(CRef(Ser_Qualid(DirPath())(Id L1))))))
  (CNotation(InConstrEntrySomeLevel"_ -> _")
   (CNotation(InConstrEntrySomeLevel"_ =i _")
    (CRef(Ser_Qualid(DirPath())(Id L1)))(CRef(Ser_Qualid(DirPath())(Id L2))))
   (CApp(CRef(Ser_Qualid(DirPath())(Id nerode)))
    (CRef(Ser_Qualid(DirPath())(Id L2)))(CRef(Ser_Qualid(DirPath())(Id N1))))))))

and finally, the corresponding elaborated term (simplified):

(Prod (Name (Id char)) ... (Prod (Name (Id L1)) ...
 (Prod (Name (Id L2)) ... (Prod (Name (Id N1)) ...
  (Prod Anonymous (App (Ref (DirPath ((Id ssrbool) (Id ssr) (Id Coq))) (Id eq_mem)) ...
   (Var (Id L1)) ... (Var (Id L2)))
  (App (Ref (DirPath ((Id myhill_nerode) (Id RegLang))) (Id nerode)) ...
   (Var (Id L2)) ... (Var (Id N1))))))))

License

As described in more detail in the license file, all corpus metadata is licensed under the MIT license, while all Coq source files and corresponding S-expression files are licensed under their respective original open-source licenses.