


Python tools for creating a suitable dataset for OpenAI's im2latex task: https://openai.com/requests-for-research/#im2latex. You can download a prebuilt dataset from here. The data is split into train (~84k), validation (~9k) and test (~10k) sets, which may not be enough data for this task.

Note: This code is very ad hoc and requires tinkering with the source.



Building your own dataset

  1. Download a bunch of LaTeX sources packed in .tar files (for example, by using latex_urls.txt)
  2. Run python latex2formulas.py [directory where .tars are stored]
  3. Run python formula2image.py [path to generated formula text file]
  4. Run python formula2image.py [dataset_file] [formula_file] [image_dir] to confirm the dataset is valid
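
To give a rough idea of what step 2 involves, here is a minimal sketch of pulling formulas out of .tar archives of LaTeX sources. It is not the code in latex2formulas.py: the function names, the regular expression, and the choice to match only equation environments are assumptions made for illustration, and the real script may handle more cases.

```python
import io
import re
import tarfile

# Assumption: only equation / equation* environments are matched here.
EQUATION_RE = re.compile(
    r"\\begin\{equation\*?\}(.*?)\\end\{equation\*?\}", re.DOTALL
)


def extract_formulas(tex_source):
    """Return the formulas found in a LaTeX source string."""
    formulas = []
    for match in EQUATION_RE.finditer(tex_source):
        # Collapse whitespace so each formula fits on one line of a text file.
        formula = " ".join(match.group(1).split())
        if formula:
            formulas.append(formula)
    return formulas


def formulas_from_tar(fileobj):
    """Collect formulas from every .tex member of a tar archive.

    `fileobj` is a seekable binary file-like object, e.g. open(path, "rb").
    """
    formulas = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive.getmembers():
            # Skip directories and anything that is not a .tex file.
            if not (member.isfile() and member.name.endswith(".tex")):
                continue
            data = archive.extractfile(member).read()
            # Source encodings vary, so undecodable bytes are dropped.
            text = data.decode("utf-8", errors="ignore")
            formulas.extend(extract_formulas(text))
    return formulas
```

Writing the returned list to a text file, one formula per line, would give something like the formula file that step 3 takes as input, though the exact format the scripts expect should be checked in the source.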

Issues and possible TODOs

Ultimate goals (Update: Likely not going to happen, but kept here as food for thought)