Awesome
rflow - Flexible R Pipelines with Caching
The package is currently under active development, please expect major changes while the API stabilizes.
Motivation
A common problem when processing data as part of a pipeline is avoiding unnecessary calculations. For example, if a function is called over and over with the same arguments, it should not recalculate the result each time but it should provide the cached (pre-computed) result.
While caching of the function output resolves the first problem, a second issue occurs when large data sets are being processed. In this case, hashing of the input arguments each time might take too long. This issue can be solved by hashing the data only once (as output) and then by noticing changes in the hash received by the downstream function. In other words, it is not the data that flows through the pipeline (as is the case with standard function), but hashes of the data.
A third issue is output sub-setting. When working with a pipeline there is often the case (e.g. ETL, Machine Learning) that we need to pass the whole data frame but the function is going to use only a subset (e.g. a CV fold). Since the main data frame has changes, caching of the result is no longer efficient. The solution involves hashing of the subset of interest which can be done by introducing additional intermediate functions in the pipeline. However, there is a loss of efficiency due to excessive rehashing as the main data frame passes through many functions.
The package rflow
addresses these inefficiencies and makes pipelines as easy
to use as in tidyverse.
Installation
# install.packages("devtools")
devtools::install_github("numeract/rflow")
Use
Simple Example
x1 <- 10
x2 <- 0.5
x3 <- 2
f1 <- function(a, b, c = 1) {a * b + c}
f2 <- function(d, e) {d / e}
# passing the results downstream using functions
(o1 <- f1(x1, x2)) # 6
(o2 <- f2(o1, x3)) # 3
# variant 1: declaring flows for each function using default options
ff1 <- make_flow_fn(f1)
ff2 <- make_flow_fn(f2)
# passing to the downstream flow and collecting the results
r1 <- ff1(x1, x2) # does not trigger re-calc
r2 <- ff2(r1, x3) # does not trigger re-calc; first arg. is a flow arg.
collect(r1) # 6
collect(r2) # 3
# variant 2: arguments and functions withing one call
library(dplyr) # makes life easier
flow_fn(x1, x2, fn = f1) %>% # reuses cache created by ff1
flow_fn(x3, fn = f2) %>% # reuses cache created by ff2
collect() # 3, no actual re-calc takes place
Pipelines
- Create your function, e.g.
f <- function(...) {...}
rflow
works best with pure functions, i.e. functions that depend only on their inputs (and not on variables outside the function frame) and do not produce any side effects (e.g. printing, modifying variables in the global environment).
-
"flow" the function:
ff <- make_flow_fn(f))
-
When pipelining
ff
into anotherrflow
function, simply supplyff()
as an argument, for example:ff(x) %>% ff2(y) %>% ff3(z)
-
At the end of the
rflow
pipeline you must usecollect()
to collect the actual data (and not just the cached structure). Alternatively, useflow_ns_sink()
to dump the data into an environment or aShiny::reactiveValues
name space.
Shiny
Shiny from RStudio uses reactive values to know what changes took place and what to recompute. It is thus possible to use a series of reactive elements in Shiny to prevent expensive re-computations from taking place. Example:
rv1 <- reactive({
... input$x ...
})
rv2 <- reactive({
... rv1() .... input$y ...
})
rv3 <- reactive({
... rv2() .... input$z ...
})
The downside is that we need one reactive element for each function in the
pipeline - this makes data processing dependent on UI / Shiny. Using rflow
,
we can separate the UI from the data processing, maintaining the caching
not only for the current state but for all previously computed states.
rv <- reactive({
rf1(input$x, ...) %>%
rf2(input$y, ...) %>%
rf3(input$z, ...) %>%
collect()
})
While a similar workflow can be achieved with package memoise
, it suffers from
several disadvantages (below).
Output Subset
(to be updated)
Other frameworks
Memoise
Package memoise by Hadley Wickham, Jim Hester and others was the main source of inspiration. Memoise is elegant, fast, simple to use, but it suffers from certain limitations that we hope to overcome in this package:
- excessive rehashing of inputs
- only one cache layer (although its cache framework is extensible)
- no input/output sub-setting, it uses the complete set of arguments provided
- no reactivity (yet to be implemented in
rflow
)
Drake
Package drake by Will Landau and others
provides a complete framework for large data sets, including using
files as inputs and outputs. The downside is that it requires additional
overhead to get started and its focus is on the pipeline as a whole. If your
work requires many hours of computations (which increases the value of each
result), the overhead due to the setup has a relatively lower cost - in this
scenario drake
is an excellent choice.
Package rflow
is somewhere between memoise
and drake
:
- one can start using
rflow
right away, with minimal overhead - allows focusing on the data processing (e.g., EDA) and not on the framework
TODO list
- reactivity
- multi-layer cache (with file locking)
- files sinks
- parallel processing