Home

Awesome

MathComp Corpus

A corpus of code for the Coq proof assistant along with several independently machine-readable representations, derived from verification projects related to the Mathematical Components (MathComp) library. The corpus can be used, e.g., in machine learning and data mining applications. Machine-readable representations are in the form of S-expressions (sexps), and were created using the SerAPI library.

Obtaining the corpus

The latest released version of the corpus can be downloaded from GitHub. The archive includes both the Coq source files and corresponding machine-readable representations.

Corpus contents

The latest corpus release is based on Coq 8.10.2 and MathComp 1.9.0, and contains 449 source files from 21 Coq projects --- in total over 297k lines of code (LOC). For each source file (e.g., theory.v), there are two corresponding files with lists of sexps, organized at the Coq sentence level, for tokens (theory.tok.sexp) and abstract syntax trees (theory.ast.sexp). Moreover, three sexp representations are provided for each Coq lemma statement in the corpus: tokens, abstract syntax tree, and elaborated term. All machine-readable representations were created using SerAPI 0.7.1, via the SerAPI-bundled programs sercomp, sertok, and sername.

A research paper describes the corpus in more detail and provides additional statistics. The corpus is divided into three tiers based on how well projects conform to the MathComp conventions.

ProjectRevision SHANo. filesLOCTierLicense
finmap27642a8464511CECILL-B
fourcolor0851d4960371381CECILL-B
math-comp748d71689847131CECILL-B
odd-orderca602a434361251CECILL-B
analysis9e5fe1d17117392CECILL-C
bigenough5f79a321782CECILL-B
elliptic-curves631af891895962CECILL-B
grobnerdfa54f9113302CECILL-B
infotheo6c1724281422952GPL-3.0-only
multinomials691d795573632CECILL-B
real-closed495a1fa1089252CECILL-B
robotb341ad113111302LGPL-3.0-only
two-square1c09aca217212CECILL-B
bits3cd298a1040413Apache-2.0
comp-dec-pdlc1f813b1644193CECILL-B
disele8aa80c2044733BSD-2-Clause
fcsl-pcmeef45031257893Apache-2.0
games3d3bd311249533BSD-2-Clause
monae9d0e4611866553GPL-3.0-only
reglangda333e01230333CECILL-B
toychain97bd6971452753BSD-2-Clause

The structure of the corpus is as follows:

File/directoryContents
projects.ymlList of projects, along with their URL, SHA, build command, installation command, etc.
raw-filesProject source files and their machine-readable representations.
lemmasLemmas for all projects in the corpus and their machine-readable statements.
lemmas-filteredSubset of lemmas obeying restrictions on the maximum sizes of elaborated terms.
definitionsDefinitions extracted from the corpus.

Applications

The corpus was used to train and evaluate the tool Roosterize, which suggests lemma names in Coq code using neural networks.

If you have used the corpus in a research project, please cite the corresponding research paper in any related publication:

@inproceedings{NieETAL20Roosterize,
  author = {Nie, Pengyu and Palmskog, Karl and Li, Junyi Jessy and Gligoric, Milos},
  title = {Deep Generation of {Coq} Lemma Names Using Elaborated Terms},
  booktitle = {International Joint Conference on Automated Reasoning},
  pages = {97--118},
  doi = {10.1007/978-3-030-51054-1_6},
  year = {2020},
}

Example data

The Coq lemma sentence

Lemma mg_eq_proof L1 L2 (N1 : mgClassifier L1) : L1 =i L2 -> nerode L2 N1.

is serialized into the following tokens (simplified):

(Sentence((IDENT Lemma)(IDENT mg_eq_proof)(IDENT L1)(IDENT L2)
  (KEYWORD"(")(IDENT N1)(KEYWORD :)(IDENT mgClassifier)
  (IDENT L1)(KEYWORD")")(KEYWORD :)(IDENT L1)(KEYWORD =i)(IDENT L2)
  (KEYWORD ->)(IDENT nerode)(IDENT L2)(IDENT N1)(KEYWORD .)))

and the corresponding parsed syntax (simplified):

(VernacExpr()(VernacStartTheoremProof Lemma (Id mg_eq_proof)
 (((CLocalAssum(Name(Id L1))(CHole()IntroAnonymous()))
   (CLocalAssum(Name(Id L2))(CHole()IntroAnonymous()))
   (CLocalAssum(Name(Id N1))
    (CApp(CRef(Ser_Qualid(DirPath())(Id mgClassifier)))(CRef(Ser_Qualid(DirPath())(Id L1))))))
  (CNotation(InConstrEntrySomeLevel"_ -> _")
   (CNotation(InConstrEntrySomeLevel"_ =i _")
    (CRef(Ser_Qualid(DirPath())(Id L1)))(CRef(Ser_Qualid(DirPath())(Id L2))))
   (CApp(CRef(Ser_Qualid(DirPath())(Id nerode)))
    (CRef(Ser_Qualid(DirPath())(Id L2)))(CRef(Ser_Qualid(DirPath())(Id N1))))))))

and finally, the corresponding elaborated term (simplified):

(Prod (Name (Id char)) ... (Prod (Name (Id L1)) ...
 (Prod (Name (Id L2)) ... (Prod (Name (Id N1)) ...
  (Prod Anonymous (App (Ref (DirPath ((Id ssrbool) (Id ssr) (Id Coq))) (Id eq_mem)) ...
   (Var (Id L1)) ... (Var (Id L2)))
  (App (Ref (DirPath ((Id myhill_nerode) (Id RegLang))) (Id nerode)) ...
   (Var (Id L2)) ... (Var (Id N1))))))))

License

As described in more detail in the license file, all corpus metadata is licensed under the MIT license, while all Coq source files and corresponding S-expression files are licensed under their respective original open-source licenses.

Authors