Awesome

im2latex-dataset

Python tools for creating suitable dataset for OpenAI's im2latex task: https://openai.com/requests-for-research/#im2latex. You can download a prebuilt dataset from here. The data is split into train (~84k), validation (~9k) and test (~10k) sets, which possibly isn't quite enough for this task.

Note: This code is very ad-hoc and requires tinkering with the source

/src/latex2formulas.py
- Script for parsing downloaded latex sources for formulas. Stores formulas in single .txt file (one formula per line)
/src/stackexchange2formulas.py
- Similar to latex2formulas.py, but for parsing StackExchange XMLs.
/src/arxiv2formulas.py
- Similar to latex2formulas.py, but for parsing arXiv .tar/.tar.gz files (source downloads).
/src/formula2image.py
- Creates images and dataset from a file of formulas
/src/im2latex_utils.py
- Collection of misc functions for handling these formulas
latex_urls.txt
- Text file containing urls to LaTeX dataset from here. Use wget -i latex_urls.txt to download these files.

Dependencies

Python 2.x or 3.x (only ran on 2.x, should work on 3.x too. Haven't tried running on Windows)
For running the script with current settings and generating full-page images:
- Properly installed LaTeX-to-PDF chain (eg. calling pdflatex outputs .pdf for .tex file)
- ImageMagick installed so that convert command works
For creating more compact images of formulas (image cropped so that formula fits)
- textogif and its dependencies
- textogif needs to be placed in same directory where images are generated, otherwise it won't work.

Building your own dataset

Download bunch of LaTeX sources packed in .tar files (by using the latex_urls.txt, for example)
Run python latex2formulas.py [directory where .tars are stored]
Run python formula2image.py [path to generated formula text file]
Run python formula2image.py [dataset_file] [formula_file] [image_dir] to confirm dataset is valid

The end result should have two files and one directory (names can be changed in formula2image.py:
- im2latex.lst
  - Each line is in format formula_idx image_name render_type
    - formula_idx is the line number where formula is in im2latex_formulas.lst
    - image_name is the name of the image connected to this rendering (without '.png')
    - render_type is the name of render setup used, defined in formula2image.py
- im2latex_formulas.lst
  - Each line contains one formula
- /formula_images
  - Directory where images are stored
Sometimes pdflatex gets stuck inside an infinite loop when compiling an image.
- To fix this you need to manually kill stuck pdflatex processes, otherwise script won't end

Issues and possible TODOs

If pdflatex is used with convert this will generate pictures of whole page
- While this might be a good thing (eg. fixed input size), it might also severly slow down training
textogif generates smaller images but these will have varying dimensions.
Possible TODOs:
- Finish tokenizer function / output list of tokens instead of raw formula in formula list
- Add accuracy metric (eg. word-error-rate or similar).
  - Check this repository for some evaluation scripts: https://github.com/harvardnlp/im2markup
- Combine ...2formula.py scripts into one, or at least make system more sensible rather than bunch of separate scripts.

Ultimate goals (Update: Likely not going to happen, but kept here as a food for thought)