frollapply {data.table} R Documentation

## Rolling user-defined function

### Description

Fast rolling user-defined function (UDF) to calculate on sliding windows. Experimental. Please read, at least, caveats section below.

### Usage

  frollapply(x, n, FUN, ..., by.column=TRUE, fill=NA,
give.names=FALSE, simplify=TRUE)


### Arguments

 x Atomic vector, data.frame or data.table columns over which to calculate the windowed aggregations. When by.column=FALSE then x expects to be data.frame or data.table (optionally also list of equal length vectors), then rolling window subset is applied not for each column but for a dataset as a whole, and then passed to FUN as data.frame or data.table (or list). When by.column=TRUE then x may also be a list, supporting vectorized input, then rolling function is calculated for each element of the list. by.column=FALSE does not support vectorized input. n Integer vector giving rolling window size(s). This is the total number of included values in aggregate function. Adaptive rolling functions also accept a list of integer vectors when applying multiple window sizes, see adaptive argument description for details. In both adaptive cases n may also be a list, supporting vectorized input, then rolling function is calculated for each element of the list. FUN The function to be applied on a subsets of x. ... Extra arguments passed to FUN. by.column Logical. When TRUE (default) then x of types list/data.frame/data.table is treated as vectorized input rather an object to apply rolling window on. Setting to FALSE allows rolling window to be applied on multiple variables, using data.frame or data.table (optionally a list) as a whole. For details see by.column argument section below. fill An object; value to pad by. Defaults to NA. align Character, specifying the "alignment" of the rolling window, defaulting to "right". For details see froll. adaptive Logical, default FALSE. Should the rolling function be calculated adaptively? For details see froll. partial Logical, default FALSE. Should the rolling window size(s) provided in n be trimmed to available observations. For details see froll. give.names Logical, default FALSE. When TRUE, names are automatically generated corresponding to names of x and names of n. If answer is an atomic vector, then the argument is ignored, see examples. simplify Logical or a function. When TRUE (default) then internal simplifylist function is applied on a list storing results of all computations. When FALSE then list is returned without any post-processing. Argument can take a function as well, then the function is applied to a list that would have been returned when simplify=FALSE. If results are not automatically simplified when simplify=TRUE then, for backward compatibility, one should use simplify=FALSE explicitly. See simplify argument section below for details.

### Value

A list except when the input is not vectorized (x is not a list to apply function by column, and n specify single rolling window), in which case a vector is returned, for convenience. Thus, rolling functions can be used conveniently within data.table syntax.

### by.column argument

Setting by.column to FALSE allows to apply function on multiple variables rather than a single vector. Then x expects to be data.table or data.table (or a list of equal length vectors) and window size provided in n refers to number of rows (or length of a vectors in a list). See examples for use cases. For by.column=FALSE function does not support vectorized input, like list of data.frames. In case of getting incorrect number of dimensions error ensure that by.column is set to FALSE.

### simplify argument

One should avoid simplify=TRUE when writing robust code. One reason is performance, as explained in Performance consideration section below. Another is backward compatibility. If results are not automatically simplified when simplify=TRUE then, for backward compatibility, one should use simplify=FALSE explicitly. In future version we may improve internal simplifylist function, then simplify=TRUE may return object of a different type, breaking downstream code. If results are already simplified with simplify=TRUE, then it can be considered backward compatible.

### Caveats

With great power comes great responsibility.

1. An optimization used to avoid repeated allocation of window subsets (explained more deeply in Implementation section below) may, in special cases, return rather surprising results:

setDTthreads(1)
frollapply(c(1, 9), n=1L, FUN=identity) ## unexpected
#[1] 9 9
frollapply(c(1, 9), n=1L, FUN=list) ## unexpected
#      V1
#   <num>
#1:     9
#2:     9
frollapply(c(1, 9), n=1L, FUN=identity) ## good only because threads >= input
#[1] 1 9
frollapply(c(1, 5, 9), n=1L, FUN=identity) ## unexpected again
#[1] 5 5 9


Problem occurs, in rather unlikely scenarios for rolling computations, when objects returned from a function can be its input (i.e. identity), or a reference to it (i.e. list), then one has to add extra copy call:

setDTthreads(1)
frollapply(c(1, 9), n=1L, FUN=function(x) copy(identity(x))) ## only 'copy' would be equivalent here
#[1] 1 9
frollapply(c(1, 9), n=1L, FUN=function(x) copy(list(x)))
#      V1
#   <num>
#1:     1
#2:     9

2. Due to parallel evaluation of FUN calls handled by parallel package:

• Warnings produced inside the function are silently ignored.

• FUN should not use any on-screen devices, GUI elements, tcltk, multithreaded libraries. FUN is internally passed to mcparallel, see its manual for details. Note that setDTthreads(1L) is also passed to forked processes, therefore any data.table code inside FUN will be forced to be single threaded. It is advised to not call setDTthreads inside FUN. frollapply is already parallelized and nested parallelism is rarely a good idea.

• Objects returned from forked processes, FUN, are serialized. This may cause problems for objects that are meant not to be serialized, like data.table. We are handling that for data.table class internally in frollapply whenever FUN is returning data.table (which is checked on the results of first FUN call so assume function is type stable). If data.table is nested in another object returned from FUN then problem may still manifest, then one has to call setDT on objects returned from FUN. This can be also nicely handled via simplify argument when passing a function that calls setDT on nested data.table objects returned from FUN. Anyway, returning data.table from FUN should, in majority of cases, be avoided from the performance reasons, see UDF optimization section for details.

is.ok = function(x) {stopifnot(is.data.table(x)); format(attr(x, ".internal.selfref", TRUE))!="<pointer: (nil)>"}
ans = frollapply(1:2, 2, data.table) ## default: fill=NA
is.ok(ans[[2L]]) ## mismatch of 'fill' type so simplify=TRUE did not run rbindlist but frollapply detected DT and fixed
#[1] TRUE
ans = frollapply(1:2, 2, data.table, fill=data.table(NA)) ## fill type match
is.ok(ans) ## simplify=TRUE did run rbindlist, but frollapply fixed anyway
#[1] TRUE
ans = frollapply(1:2, 2, data.table, fill=data.table(NA), simplify=FALSE)
is.ok(ans[[2L]]) ## detected and fixed by frollapply
#[1] TRUE
ans = frollapply(1:2, 2, function(x) list(data.table(x)), fill=list(data.table(NA)), simplify=FALSE)
is.ok(ans[[2L]][[1L]]) ## not detected and not fixed
#[1] FALSE
set(ans[[2L]][[1L]],, "newcol", 1L)
#Error in set(ans[[2L]][[1L]], , "newcol", 1L) :
#  This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()). Please run setDT() or setalloccol() on it first (to pre-allocate space for new columns) before assigning by reference to it.
ans = lapply(ans, lapply, setDT)
is.ok(ans[[2L]][[1L]]) ## fix after
#[1] TRUE
ans = frollapply(1:2, 2, function(x) list(data.table(x)), fill=list(data.table(NA)), simplify=function(x) lapply(x, lapply, setDT))
is.ok(ans[[2L]][[1L]]) ## fix inside frollapply via simplify
#[1] TRUE
f = function(x) (if (x[1L]==1L) data.frame else data.table)(x)
ans = frollapply(1:3, 2, f, fill=data.table(NA), simplify=FALSE)
is.ok(ans[[3L]]) ## automatic fix may not work for a non-type stable function
#[1] FALSE
ans = frollapply(1:3, 2, f, fill=data.table(NA), simplify=function(x) lapply(x, function(y) if (is.data.table(y)) setDT(y) else y))
is.ok(ans[[3L]]) ## fix inside frollapply via simplify
#[1] TRUE

3. Due to possible future improvements of handling simplification of results returned from rolling function, the default simplify=TRUE may not be backward compatible for functions that produce results that haven't been already automatically simplified. See simplify argument section for details.

4. Function has not been tested with non-atomic data types (default by.column=TRUE actually prohibits those), list columns in data.frame (by.column=FALSE). If one is working with not basic objects (non atomic types, list columns, S4 classes, etc.), it is advised to double check if answer returned is correct. In examples below one can find very basic implementation rollapply, it can be used to compare the results.

### Performance consideration

frollapply is meant to run any UDF function. If one needs to use a common function like mean, sum, max, etc., then we have highly optimized implemented in C, rolling functions described in froll manual.
Most crucial optimizations are the ones to be applied on UDF. Those are discussed in next section UDF optimization below.

• When using by.column=FALSE one can subset dataset before passing it to x to keep only columns relevant for the computation:

x = setDT(lapply(1:100, function(x) as.double(rep.int(x,1e4L))))
f = function(x) sum(x$V1*x$V2)
system.time(frollapply(x, 100, f, by.column=FALSE))
#   user  system elapsed
#  0.157   0.067   0.081
system.time(frollapply(x[, c("V1","V2"), with=FALSE], 100, f, by.column=FALSE))
#   user  system elapsed
#  0.096   0.054   0.054

• Avoid partial, see partial argument section of froll manual.

• Avoid simplify=TRUE and provide a function instead:

x = rnorm(1e5)
system.time(frollapply(x, 2, function(x) 1L, simplify=TRUE))
#   user  system elapsed
#  0.308   0.076   0.196
system.time(frollapply(x, 2, function(x) 1L, simplify=unlist))
#   user  system elapsed
#  0.214   0.080   0.088

• CPU threads utilization in frollapply can be controlled by setDTthreads, which by default uses half of available CPU threads.

• Optimization that avoids repeated allocation of a window subset (see Implementation section for details), in case of adaptive rolling function, depends on R's growable bit. This feature has been added in R 3.4.0. Adaptive frollapply will still work on older versions of R but, due to repeated allocation of window subset, it will be much slower.

• Parallel computation of FUN is handled by parallel package (part of R core since 2.14.0) and its fork mechanism. Fork is not available on Windows OS therefore it will be always single threaded on that platform.

### UDF optimization

FUN will be evaluated many times so should be highly optimized. Tips below are not specific to frollapply and can be applied to any code is meant to run in many iterations.

• It is usually better to return the most lightweight objects from FUN, for example it will be faster to return a list rather a data.table. In the case presented below, simplify=TRUE is calling rbindlist on the results anyway, which makes the results equal:

fun1 = function(x) {tmp=range(x); data.table(min=tmp[1L], max=tmp[2L])}
fun2 = function(x) {tmp=range(x); list(min=tmp[1L], max=tmp[2L])}
fill1 = data.table(min=NA_integer_, max=NA_integer_)
fill2 = list(min=NA_integer_, max=NA_integer_)
system.time(a<-frollapply(1:1e4, 100, fun1, fill=fill1))
#   user  system elapsed
#  2.047   0.337   0.788
system.time(b<-frollapply(1:1e4, 100, fun2, fill=fill2))
#   user  system elapsed
#  0.205   0.125   0.138
all.equal(a,  b)
#[1] TRUE

• Code that is not dependent on rolling window should be taken out as pre or post computation:

x = c(1L,3L)
system.time(for (i in 1:1e6) sum(x+1L))
#   user  system elapsed
#  0.308   0.004   0.312
system.time({y = x+1L; for (i in 1:1e6) sum(y)})
#   user  system elapsed
#  0.203   0.000   0.202

• Being strict about data types removes the need for R to handle them automatically:

x = vector("integer", 1e6)
system.time(for (i in 1:1e6) x[i] = NA)
#   user  system elapsed
#  0.160   0.000   0.161
system.time(for (i in 1:1e6) x[i] = NA_integer_)
#   user  system elapsed
#   0.05    0.00    0.05

• If a function calls another function under the hood, it is usually better to call the latter one directly:

x = matrix(c(1L,2L,3L,4L), c(2L,2L))
system.time(for (i in 1:1e4) colSums(x))
#   user  system elapsed
#  0.051   0.000   0.051
system.time(for (i in 1:1e4) .colSums(x, 2L, 2L))
#   user  system elapsed
#  0.015   0.000   0.015

• There are many functions that may be optimized for scaling up for bigger input, yet for a small input they may carry bigger overhead comparing to their simpler counterparts. One may need to experiment on own data, but low overhead functions are likely be faster when evaluating in many iterations:

## uniqueN
x = c(1L,3L,5L)
system.time(for (i in 1:1e4) uniqueN(x))
#   user  system elapsed
#  0.156   0.004   0.160
system.time(for (i in 1:1e4) length(unique(x)))
#   user  system elapsed
#  0.040   0.004   0.043
## column subset
x = data.table(v1 = c(1L,3L,5L))
system.time(for (i in 1:1e4) x[, "v1"])
#   user  system elapsed
#  0.959   0.024   0.983
system.time(for (i in 1:1e4) x[["v1"]])
#   user  system elapsed
#  0.068   0.000   0.068


### Implementation

Evaluation of UDF comes with very limited capabilities for optimizations, therefore speed improvements in frollapply should not be expected as good as in other data.table fast functions. frollapply is implemented almost exclusively in R, rather than C. Its speed improvement comes from two optimizations that have been applied:

1. No repeated allocation of a rolling window subset.
Object (type of x and size of n) is allocated once (for each CPU thread), and then for each iteration this object is being re-used by copying expected subset of data into it. This means we still have to subset data on each iteration, but we only copy data into pre-allocated window object, instead of allocating in each iteration. Allocation is carrying much bigger overhead than copy. The faster the FUN evaluates the more relative speedup we are getting, because allocation of a subset does not depend on how fast or slow FUN evaluates. See caveats section for possible edge cases caused by this optimization.

2. Parallel evaluation of FUN calls.
Until now (October 2022) all the multithreaded code in data.table was using OpenMP. It can be used only in C language and it has very low overhead. Unfortunately it could not be applied in frollapply because to evaluate UDF from C code one has to call R's C api that is not thread safe (can be run only from single threaded C code). Therefore frollapply uses parallel package to provide parallelism on R language level. It uses fork parallelism, which has low overhead as well, unless results of computation are big in size. Fork is not available on Windows OS. See caveats section for limitations caused by using this optimization.

### Note

Be aware that rolling functions operates on the physical order of input. If the intent is to roll values in a vector by a logical window, for example an hour, or a day, then one has to ensure that there are no gaps in input or use adaptive rolling function to handle gaps by specifying expected window sizes. For details see issue #3241.

froll, shift, data.table, setDTthreads

### Examples

frollapply(1:16, 4, median)
frollapply(1:9, 3, toString)

## vectorized input
x = list(1:10, 10:1)
n = c(3, 4)
frollapply(x, n, sum)
## give names
x = list(data1 = 1:10, data2 = 10:1)
n = c(small = 3, big = 4)
frollapply(x, n, sum, give.names=TRUE)

## by.column=FALSE
x = as.data.table(iris)
flow = function(x) {
v1 = x[[1L]]
v2 = x[[2L]]
(v1[2L] - v1[1L] * (1+v2[2L])) / v1[1L]
}
x[,
"flow" := frollapply(.(Sepal.Length, Sepal.Width), 2L, flow, by.column=FALSE),
by = Species
][]

## rolling regression: by.column=FALSE
f = function(x) coef(lm(v2 ~ v1, data=x))
x = data.table(v1=rnorm(120), v2=rnorm(120))
coef.fill = c("(Intercept)"=NA_real_, "v1"=NA_real_)
frollapply(x, 4, f, by.column=FALSE, fill=coef.fill)

## rollaply, not vectorized, no auto simplify
rollapply = function(x, n, FUN, ..., by.column=TRUE, fill=NA, simplify=identity) {
stopifnot((by.column && is.atomic(x)) || (!by.column && is.data.frame(x)),
is.numeric(n), is.function(FUN), is.function(simplify))
len = if (by.column) length(x) else nrow(x)
ans = vector("list", len)
ans[seq_len(n-1L)] = rep(list(fill), n-1L)
for (i in n:len) {
ans[[i]] = if (by.column)
FUN(x[(i-n+1L):i], ...)
else
FUN(x[(i-n+1L):i,, drop=FALSE], ...)
}
simplify(ans)
}
all.equal(
frollapply(1:16, 4, median),
rollapply(1:16, 4, median, simplify=unlist)
)


[Package data.table version 1.14.3 Index]