DEPRECATION NOTICE

This repository is no longer maintained or supported, largely due to the retirement of the MRAN server, on which the checkpoint package relied.

rddj-template

A template for bootstrapping reproducible RMarkdown documents for data journalistic purposes.

Features

For more information please see the accompanying blog post.

Setup

First, clone the repository and reset its git history:

git clone https://github.com/grssnbchr/rddj-template.git
cd rddj-template
rm -rf .git
git init

If you have a remote repository, you can add it like so:

git remote add origin https://github.com/user/repo.git

How to run

  1. The main document main.Rmd lies in the folder analysis. This is where most of your code resides.

  2. Set config variables in the very first chunk, specifically:

  3. Run the script: The individual R chunks should be run in the interpreter (Code > Run Region > Run All; on Linux/Windows: <kbd>Ctrl</kbd>+<kbd>Alt</kbd>+<kbd>R</kbd>, on Mac: <kbd>Cmd</kbd>+<kbd>Alt</kbd>+<kbd>R</kbd>). Be advised that some packages, like rgdal, need additional third-party libraries installed, so watch out for compiler/installation messages in the R console. You also need to have the knitr and rstudioapi packages installed globally, e.g. via the RStudio package manager. On a Mac, occasional y/n: prompts may show up in the R console during package installation (section "install packages"); just confirm them by pressing y and <kbd>Enter</kbd>. The RMarkdown should not be knitted with RStudio (see below).

WARNING: When starting from scratch, it is recommended to restart R first, i.e. use Session > Restart R and Run All Chunks instead of Run All Chunks. If you don't do that, checkpoint will be re-installed in your local .checkpoint folder, or other errors might occur.

  4. Knitting the RMarkdown: Because of how RStudio and checkpoint work, the use of the "knit" functionality in RStudio is strongly discouraged. It might work, but the preferred way is the knit.sh shell script. Execute it in a terminal like so: ./knit.sh. This makes sure that the rmarkdown package from the specified package date is used, not the globally installed one. knit.sh knits the script into an HTML document, analysis/main.html. Knitting to PDF is currently not supported.

     If you get an error saying that Pandoc could not be found, you need to let your terminal know where the pandoc binary resides by adjusting the PATH variable. This holds true for both Linux and macOS. Pandoc comes with RStudio, and the binary usually resides in /usr/lib/rstudio/bin (Linux) or /Applications/RStudio.app/Contents/MacOS/pandoc (macOS). Add the respective directory to your PATH.

     Workaround without setting the PATH variable: executing knit.sh in the built-in RStudio terminal (not the R console!) always works, because RStudio knows the location of the Pandoc binary.
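The PATH adjustment for Pandoc could look roughly like the sketch below. It only uses the two default locations named above; the helper name rstudio_pandoc_dir is made up for this example, and the paths will differ if RStudio is installed somewhere else.

```shell
# Hypothetical helper: map the output of `uname -s` to the directory
# that usually holds RStudio's bundled pandoc binary (paths as named above).
rstudio_pandoc_dir() {
  case "$1" in
    Darwin) echo "/Applications/RStudio.app/Contents/MacOS/pandoc" ;;
    Linux)  echo "/usr/lib/rstudio/bin" ;;
    *)      return 1 ;;  # other systems: no default known here
  esac
}

# Prepend that directory to PATH for the current terminal session only.
if dir="$(rstudio_pandoc_dir "$(uname -s)")"; then
  PATH="$dir:$PATH"
  export PATH
fi
```

After that, ./knit.sh can be run from the same terminal. To make the change permanent, the export line would have to go into your shell's startup file.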

Branches

There are four branches at the moment:

Use whichever you want.

OS support

☑️: Full functionality (including knitting RMarkdown with knit.sh)

(☑️): Limited functionality (without knit.sh)

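| branch | Ubuntu 16.04 | Ubuntu 18.04 | macOS High Sierra | macOS Mojave | Windows 10 |
| --- | --- | --- | --- | --- | --- |
| master (R-3.6.x) | not tested | ☑️ | not tested | not tested | not tested |
| R-3.5.x | not tested | ☑️ | ☑️ | ☑️ | (☑️) |
| R-3.4.x | ☑️ | ☑️ | ☑️ | ☑️ | (☑️) |
| R-3.3.x | not tested | ☑️<sup>1</sup> | ☑️ | ☑️<sup>3</sup> | (☑️)<sup>2</sup> |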

More about checkpoint

This template uses the checkpoint package by Microsoft for full package reproducibility. With this package, all necessary packages (specified in the Define packages R chunk) are installed from a specific CRAN snapshot, whose date you can set in the very same R chunk (package_date). For each package_date, the necessary source and compiled packages are installed to a local .checkpoint folder that resides in your home directory.

This has two big advantages:

  1. All packages are from the same CRAN snapshot, i.e. are supposed to play nicely together.
  2. If you re-run your script two or three years after its initial creation, exactly those package versions that were used at that point in time, and that work with the code you wrote back then, are loaded and executed. No more deprecated code pieces and weird-looking ggplot2 plots!

In order to make checkpoint work with knitr, this vignette was adapted (it is now archived).
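To see which snapshots have accumulated in the local .checkpoint folder, something like the following sketch may help. It assumes that checkpoint creates one sub-folder per package_date, named YYYY-MM-DD; that layout is an assumption of this example, and list_snapshots is a made-up helper name.

```shell
# Hypothetical helper: print the names of all date-named sub-folders
# (YYYY-MM-DD) of a checkpoint directory, defaulting to ~/.checkpoint.
# Assumption: there is one such sub-folder per package_date.
list_snapshots() {
  for d in "${1:-$HOME/.checkpoint}"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]; do
    # If nothing matches, the pattern stays literal and is skipped here.
    [ -d "$d" ] && basename "$d"
  done
  return 0
}
```

Calling list_snapshots without arguments inspects the folder in your home directory; du -sh on one of the listed folders shows how much disk space that snapshot takes up.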

The downside(s) of checkpoint

With checkpoint, you can only access archived packages from CRAN, i.e. from MRAN. As others have pointed out, GitHub repositories don't fit into this system. I wouldn't consider this a big issue, as you can install specific versions (i.e. releases/tags) from GitHub, and as long as the GitHub repository stays alive, you can access these old versions. This is how the checkpoint package itself is installed in this template, by the way:

# The owner is given as part of the repo string; the separate
# username argument is deprecated in devtools.
devtools::install_github("RevolutionAnalytics/checkpoint",
                         ref = "v0.3.2")

A second possible disadvantage is the reliance on Microsoft's snapshot system. Once these snapshots are down, the whole system is futile. I reckon/hope there will be third party mirrors though once the system gets really popular. Update September 2017: Apparently you can roll your own checkpoint server.

Deployment to GitHub pages

The knitted RMarkdown can be deployed to GitHub Pages. If your repository repo is public, it can then be accessed via https://user.github.io/repo (example: https://grssnbchr.github.io/rddj-template). In order to do that,

  1. Make sure there are no unstaged changes in your working directory. Either git commit them or git stash them before continuing.

  2. Make sure you're in the root folder of your project (the one above analysis)

  3. Then create a local gh-pages branch, check out master again and run the deploy.sh script in the root folder:

git checkout -b gh-pages
git checkout master
./deploy.sh

  4. For further deployments, it is sufficient to re-run ./deploy.sh. Make sure your working directory is clean before that step; otherwise, deployment will not work.
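Since deployment depends on a clean working directory, a small guard in front of deploy.sh can catch the problem early. The sketch below is not part of the template; worktree_is_clean and safe_deploy are made-up names, and the check treats untracked files as "not clean" too, which is stricter than the "no unstaged changes" rule above.

```shell
# Succeeds only when the given `git status --porcelain` output is empty,
# i.e. nothing is staged, modified or untracked.
# Assumption: untracked files should also block a deployment.
worktree_is_clean() {
  [ -z "$1" ]
}

# Hypothetical wrapper: run deploy.sh only from a clean working directory.
safe_deploy() {
  if worktree_is_clean "$(git status --porcelain)"; then
    ./deploy.sh
  else
    echo "Working directory is not clean: commit or stash first." >&2
    return 1
  fi
}
```

Run safe_deploy from the root folder of the project instead of calling ./deploy.sh directly.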

deploy.sh does the following:

Linting / styleguide

Code is automatically linted with lintr, i.e. checked for good style and syntax errors according to the tidyverse style guide. When the document is knitted, the lintr output appears at the very end of the document. When it is run in the interpreter, the lintr output appears in a new Markers pane at the bottom of RStudio. If you want to disable linting, just comment out that last line in main.Rmd.

Other stuff / more features

Versioning of input and output

Input and output files are not ignored by default, i.e. they are versioned. This has the advantage that output can be monitored for change when (subtle) details of the R code are changed.

If you want to ignore (big) input or output files, put them into the respective ignore folders. GitHub only allows a maximum file size of 100MB as of summer 2017.
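To find candidates for the ignore folders before committing, a size check along these lines may be useful. It is only a sketch: find_large_files is a made-up helper, and the default threshold of 102400 KiB corresponds to the 100 MB limit mentioned above.

```shell
# Hypothetical helper: list files below a directory ($1) that are larger
# than a limit in KiB ($2, default 102400 KiB = 100 MiB).
find_large_files() {
  find "$1" -type f -size +"${2:-102400}"k
}
```

For example, find_large_files . lists every file in the project that GitHub would reject, and find_large_files . 51200 uses a more cautious limit of 50 MiB.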

Ability to outsource code to script files

If you want to keep your main.Rmd as tidy and brief as possible, you have the possibility to put separate functions and other code into script files that reside in the scripts folder. An example of this is provided in main.Rmd.

Multiple CPU cores for faster package installation

By default, more than one core is used for package installation, which significantly speeds up the process.

Optimal RStudio settings

It is recommended to disable workspace saving in RStudio, see https://mran.microsoft.com/documents/rro/reproducibility/doc-research/

Installation of older R versions

The idea of this template is that you specify the R version you currently use, and that people trying to reproduce your scripts use that very same R version (or at least one that matches in the first two version numbers, e.g. 3.4.x). This makes it necessary to install old R versions. Here's some advice on how to do that on a couple of OSes.
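Whether an installed R version is close enough to the one a script was written for can be checked by comparing the first two version numbers. The helpers below are a minimal sketch with made-up names (minor_of, same_r_minor), not something the template ships.

```shell
# Reduce a version string like 3.4.4 to its first two numbers (3.4).
minor_of() {
  echo "$1" | cut -d. -f1,2
}

# Succeeds when two R versions agree in their first two version numbers.
same_r_minor() {
  [ "$(minor_of "$1")" = "$(minor_of "$2")" ]
}
```

With R installed, the currently active version can be passed in like so: same_r_minor "$(Rscript -e 'cat(as.character(getRversion()))')" 3.4.4.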

Debian (tested on Ubuntu 16.04 and higher)

Compiled with information from here, here and here.

macOS (tested on High Sierra and higher)

Windows 10