Paperify

Paperify transforms any document, web page, or ebook into a research paper.

The text of the generated paper is the same as the text of the original document, but figures and equations from real papers are interspersed throughout.

A paper title and abstract are added (optionally generated by ChatGPT, if you provide an API key), and the entire paper is compiled with the IEEE $\LaTeX$ template for added realism.

Install

First, install the dependencies (or use Docker):

For example, on Debian-based systems (e.g., Debian, Ubuntu, Kali, WSL):

sudo apt update
sudo apt install --no-install-recommends \
  pandoc \
  curl ca-certificates \
  jq \
  python3 \
  imagemagick \
  texlive texlive-publishers texlive-science lmodern texlive-latex-extra
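
To confirm the tools are on the PATH after installation, a small check along these lines may help. This is a sketch, not part of Paperify: `missing_deps` is a hypothetical helper, and the binary names `convert` (ImageMagick) and `pdflatex` (TeX Live) are assumptions about what those packages provide on your system.

```shell
# Hypothetical helper: print every command from the list that is NOT on the PATH.
# With no arguments it checks an assumed default list of binaries.
missing_deps() {
  [ "$#" -gt 0 ] || set -- pandoc curl jq python3 convert pdflatex
  for cmd in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}
```

Calling `missing_deps` with no arguments checks the default list; an empty result means every listed command was found.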

Then, clone the repo (or directly pull the script), and execute it.

curl -L https://github.com/jstrieb/paperify/raw/master/paperify.sh \
  | sudo tee /usr/local/bin/paperify
sudo chmod +x /usr/local/bin/paperify

paperify -h

Examples
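
A minimal sketch of a direct invocation, assuming `paperify` is installed on the PATH as described under Install. The arguments are copied from the Docker command in the next section, on the assumption that the container passes them to the script unchanged; `build_cox` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: convert a web page into build/cox.pdf.
build_cox() {
  # Create the output directory first; harmless if it already exists.
  mkdir -p build &&
    paperify \
      --from-format html \
      "https://research.swtch.com/bell-labs" \
      build/cox.pdf
}
```

Running `build_cox` from any working directory writes the result to `./build/cox.pdf` relative to that directory.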

Docker

Alternatively, run Paperify from within a Docker container. To run the first example from within Docker and build to ./build/cox.pdf:

docker run \
  --rm \
  -it \
  --volume "$(pwd)/build":/root/build \
  jstrieb/paperify \
    --from-format html \
    "https://research.swtch.com/bell-labs" \
    build/cox.pdf

Usage

usage: paperify [OPTIONS] <URL or path> <output file>

OPTIONS:
  --temp-dir <DIR>            Directory for assets (default: /tmp/paperify)
  --from-format <FORMAT>      Format of input file (default: input suffix)
  --arxiv-category <CAT>      arXiv.org paper category (default: math)
  --num-papers <NUM>          Number of papers to download (default: 100)
  --max-parallelism <PROCS>   Maximum simultaneous processes (default: 32)
  --figure-frequency <N>      Chance of a figure is 1/N per paragraph (default: 25)
  --equation-frequency <N>    Chance of an equation is 1/N per paragraph (default: 25)
  --max-size <BYTES>          Max allowed image size in bytes (default 2500000)
  --min-equation-length <N>   Minimum equation length in characters (default 5)
  --max-equation-length <N>   Maximum equation length in characters (default 120)
  --min-caption-length <N>    Minimum figure caption length in characters (default 20)
  --chatgpt-token <TOKEN>     ChatGPT token to generate paper title, abstract, etc.
  --chatgpt-topic <TOPIC>     Paper topic ChatGPT will generate metadata for
  --quiet                     Don't log statuses
  --skip-downloading          Don't download papers from arXiv.org
  --skip-extracting           Don't extract equations and captions
  --skip-metadata             Don't regenerate metadata
  --skip-filtering            Don't filter out large files or non-diagram images
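
Because each paragraph gets a 1/N chance, the expected number of inserted figures (or equations) is roughly the paragraph count divided by N. The helper below is a hypothetical illustration of that arithmetic, not something Paperify provides; how the script counts paragraphs is not specified here, so treat the result as an estimate.

```shell
# Hypothetical helper: expected insertions = paragraphs / N.
# $1 = number of paragraphs, $2 = N (defaults to 25, the documented default)
expected_inserts() {
  awk -v p="$1" -v n="${2:-25}" 'BEGIN { printf "%.1f\n", p / n }'
}

expected_inserts 500      # 500 / 25 = 20.0
expected_inserts 500 10   # 500 / 10 = 50.0
```

Lowering N with `--figure-frequency` or `--equation-frequency` therefore makes insertions more frequent, not less.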

Note that the --skip-* flags are useful when you have already run the script once and do not want to repeat the process of downloading and extracting data.
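
One way to take advantage of this is a wrapper that adds the flags only when an earlier run has left assets behind. The sketch below is hypothetical: `skip_flags` and `paperify_cached` are not part of Paperify, and it assumes that a non-empty temp directory means downloading and extraction already finished.

```shell
# Print the --skip-* flags to use when the temp dir already holds assets.
# $1 = temp dir (defaults to /tmp/paperify, the documented default)
skip_flags() {
  dir="${1:-/tmp/paperify}"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    printf '%s\n' --skip-downloading --skip-extracting
  fi
}

# Hypothetical wrapper: $1 = input, $2 = output file, $3 = temp dir (optional).
paperify_cached() {
  # The unquoted $(...) is deliberate: each flag becomes its own argument.
  paperify --temp-dir "${3:-/tmp/paperify}" \
    $(skip_flags "${3:-/tmp/paperify}") \
    "$1" "$2"
}
```

The first call with an empty temp directory runs the full pipeline; later calls such as `paperify_cached notes.md build/notes.pdf` reuse what is already there. If a run was interrupted partway, clear the temp directory so the check does not skip unfinished work.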

Known Issues

How to Read the Code

In general, I'm a proponent of reading (or at least skimming) code before you run it, when possible. Usually, my code is written to be read. In this case, not so much.

Apologies in advance to anyone who tries to read the code. It started as four very cursed lines of Bash (without line wrapping) that I attempted to clean up a little. It is now many more than four lines of Bash, most of which remain very cursed. The small Python portion is particularly hard on the eyes, though it may possess a grotesque beauty for true functional programmers.

Everything is in paperify.sh. It can be read top-to-bottom or bottom-to-top, and there is a fat LaTeX template as a heredoc smack in the middle.

Project Status

Strange as it may sound, this project is complete. I want to live in a world where working software doesn't always grow until it becomes a Lovecraftian spaghetti monster.

I have added every feature that I wanted to add. It does what I wanted it to do, as well as I wanted it to do it. No further development required.

As such, I will try to address issues opened on GitHub, but I do not expect to address feature requests. I may merge pull requests.

Even if there are no recent commits, I'm hopeful that this script will continue to work many years from now.

Greetz & Acknowledgments

Greetz to several unnamed friends who offered helpful commentary prior to release.

Special shout out to the friends who suggested, as a follow-up project, making a browser extension to transform the current web page into a scientific paper. Sort of like Firefox reader mode, but for viewing Twitter when someone looking over your shoulder expects you to be doing something else.

Thanks to arXiv.org for hosting tons of papers with LaTeX source to mine.

Greetz to Project Gutenberg, Standard Ebooks, and Alexandra Elbakyan.

Lovingly released on Labor Day 2023; dedicated to procrastinating laborers of knowledge.