3 Deep dive

Many complex queries will be described in both R and SQL syntax. We will examine R's facilities for computing on the language and see how lazy evaluation makes some optimizations possible. We will write C code in an R package, using OpenMP for parallel processing. Finally, we will see how the AutoML feature of h2o can search for the best models and ensembles.

3.1 R language

3.1.1 Computing on the language

3.1.1.1 building a call

fun_name = function(x, y, ...) list(x, y, ...)
var1_name = 1
var2_name = 2
l = list(
  as.name("fun_name"),
  as.name("var1_name"),
  as.name("var2_name")
)
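# turn the list into the call fun_name(var1_name, var2_name)
as.call(l)
# names of the list elements become argument names;
# the name of the first element (the function) is ignored
as.call(setNames(l, c("ignored in as.call","x","y")))
as.call(setNames(l, c("","y","x")))
# evaluate the calls: first by position, then by name
eval(as.call(l))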
eval(as.call(setNames(l, c("","y","x"))))
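
The same call can be assembled in more than one way. The sketch below uses only base R (`as.call`, `call`, `quote`, `eval`, `identical`) and repeats the definitions of `fun_name`, `var1_name` and `var2_name` so that it runs on its own. It checks that a call built from a list and a call built with `call()` are the same object, and that names on the list elements decide how arguments are matched.

```r
fun_name = function(x, y, ...) list(x, y, ...)
var1_name = 1
var2_name = 2

# build the call from a list of symbols
cl1 = as.call(list(as.name("fun_name"), as.name("var1_name"), as.name("var2_name")))
# build the same call with call(); quote() keeps the arguments as symbols
cl2 = call("fun_name", quote(var1_name), quote(var2_name))
same = identical(cl1, cl2)

# names on the list elements become argument names, so matching is by name
cl3 = as.call(list(as.name("fun_name"), y = as.name("var1_name"), x = as.name("var2_name")))

res_positional = eval(cl1)  # x is var1_name, y is var2_name
res_named = eval(cl3)       # x is var2_name, y is var1_name
```

Nothing is evaluated until `eval()` is called, so a call object can be inspected or modified before it is run.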

3.1.2 Lazy evaluation

x = 10:1
y = rnorm(10)
plot(x, y)
asd = y
plot(x, asd)
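
The two plots show the same data and differ only in the y-axis label ("y" versus "asd"). That is possible because an argument arrives in the function as an unevaluated expression that can be inspected before its value is computed. A minimal sketch of the mechanism follows; `label_of` is a hypothetical helper introduced only for illustration, built on base R's `substitute` and `deparse`.

```r
# hypothetical helper, not part of the original text
label_of = function(arg) {
  # substitute() returns the expression supplied for arg without evaluating it
  deparse(substitute(arg))
}

y = rnorm(10)
asd = y

lab1 = label_of(y)    # expression supplied was the symbol y
lab2 = label_of(asd)  # same values, different expression

# the argument is never evaluated, so even an expression that would fail is safe
lab3 = label_of(stop("never evaluated"))
```

Because `arg` is never forced inside `label_of`, the `stop()` call in the last line is not run; only its text is captured.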

3.1.3 Tips

3.1.3.1 Assignment <- vs =

There is no strict rule about which of the two to use:

x <- 1
x = 1
f <- function() 1
f = function() 1

It makes no difference whether you use <- or = in the cases above, but there are cases where the two must be distinguished.

3.1.3.1.1 assigning names to values

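When assigning names to values we have to use the = sign:

c(a=1, b=2)             # named vector
c(a<-1, b<-2)           # unnamed vector; a and b are created as a side effect
data.frame(a=1, b=2)    # columns named a and b
data.frame(a<-1, b<-2)  # column names are derived from the expressions instead

3.1.3.1.2 passing arguments to function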

When passing arguments to functions by name you must use =:

f <- function(x, y) list(x=x, y=y)
f(y=1, x=2)
f(y<-1, x<-2)
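
The difference between the two calls can be checked directly. In the sketch below the result names `res_named` and `res_positional` are introduced only for illustration. With `=` the arguments are matched by name; with `<-` each argument is just an expression, so matching falls back to position, and the assignments take effect in the calling environment once the arguments are evaluated.

```r
f <- function(x, y) list(x=x, y=y)

# matched by name: x is 2, y is 1
res_named <- f(y=1, x=2)

# matched by position: the first expression (y<-1) becomes x,
# the second expression (x<-2) becomes y
res_positional <- f(y<-1, x<-2)

# side effect: y and x now exist in the calling environment
y_after <- y
x_after <- x
```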

Assignment inside a call can nevertheless be useful. Consider these functions:

f <- function(x, y, z) {
  if (isTRUE(x)) {
    cat("f doing branch 1\n")
    list(y, z)
  } else {
    cat("f doing branch 2\n")
    invisible(FALSE)
  }
}
g <- function(x) {
  cat("g doing heavy computation\n")
  Sys.sleep(5)
  x
}
h <- function(x) x+1

and we want to calculate

f(x=TRUE, y=v<-g(1), z=h(v))

Many people would advocate writing it as

v = g(1)
f(x=TRUE, y=v, z=h(v))

which is quite reasonable, but it does not take advantage of lazy evaluation, a feature of the language.

Using the first call

f(x=TRUE, y=v<-g(1), z=h(v))

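Thanks to lazy evaluation, R evaluates a function's arguments only when they are actually used inside the function. The first argument acts as a switch: with x=FALSE, f exits early and never evaluates the time-consuming y=v<-g(1), whereas the two-step version always pays for g(1). With x=TRUE, y is evaluated before z, so v already exists when h(v) runs. We can achieve the same behavior without assignment inside the call by wrapping f in another function that handles the switch: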

ff <- function(x, val) {
  if (isTRUE(x)) {
    v = g(val)
    z = h(v)
  } else {
    v = NULL
    z = NULL
  }
  f(x, v, z)
}
ff(TRUE, 1)
ff(FALSE, 1)

Keep in mind that the above example is simplified.
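
To confirm that the heavy computation is skipped when it is not needed, the sketch below replaces `Sys.sleep` and the `cat` messages with a counter. The counter `calls` and the `res_*` and `calls_after_*` names are hypothetical additions for illustration; `f`, `g` and `h` are trimmed versions of the functions defined earlier.

```r
calls <- 0L  # hypothetical counter: how many times g has actually run

g <- function(x) {
  calls <<- calls + 1L  # record the call instead of sleeping
  x
}
h <- function(x) x + 1
f <- function(x, y, z) {
  if (isTRUE(x)) list(y, z) else invisible(FALSE)
}

# x=FALSE: y and z are never used, so g is never called
res_false <- f(x=FALSE, y=v<-g(1), z=h(v))
calls_after_false <- calls

# x=TRUE: y is forced first (assigning v), then z can use v
res_true <- f(x=TRUE, y=v<-g(1), z=h(v))
calls_after_true <- calls
```

Note that the trick relies on `y` being forced before `z` inside `f`; if `f` used `z` first, `v` would not yet exist when `h(v)` is evaluated.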

3.2 Advanced queries

One of the main tasks of a data scientist or analyst is investigating data. In this chapter I am going to cover advanced queries that can be run on datasets, based on their SQL syntax. SQL has been the standard language for querying data for many years.

3.2.1 top N by group

3.2.2 partition by over group

3.2.3 partition aggregate by over group

3.2.4 update from

3.2.5 temporal join

3.2.6 lateral join

3.2.7 aggregate on join

3.2.8 cross join

3.2.9 distinct

3.2.10 grouping sets

3.2.11 set operators

3.3 Machine Learning

automl

put model into production

3.4 Package development

3.4.1 profiling

3.4.1.1 time

3.4.1.2 memory

3.4.2 src

compiled code, most commonly C or C++

3.4.2.1 R CMD SHLIB

3.4.2.2 structure

init.c?

3.4.2.3 Rtools dependency

3.4.2.4 gcc options

-O0 vs -O3

3.4.2.5 debugging C code

gdb, valgrind

3.4.2.6 multithreading with OpenMP

3.4.2.6.1 #pragma omp parallel for

working example, env var to control

3.4.2.6.2 thread safety

memory allocation (pre-allocate memory and use C pointers from within the parallel region): REAL(), allocVector()