3 Deep dive
Many complex queries will be described in both R and SQL syntax. We will examine computing on the language, a feature of R, and see how lazy evaluation makes certain optimizations possible. We will write C code in an R package, using OpenMP for parallel processing, and check how the AutoML feature of h2o can search for the best models and ensembles.
3.1 R language
3.1.1 Computing on the language
3.1.1.1 building a call
fun_name = function(x, y, ...) list(x, y, ...)
var1_name = 1
var2_name = 2
l = list(
  as.name("fun_name"),
  as.name("var1_name"),
  as.name("var2_name")
)
as.call(l)
as.call(setNames(l, c("ignored in as.call","x","y")))
as.call(setNames(l, c("","y","x")))
eval(as.call(l))
eval(as.call(setNames(l, c("","y","x"))))
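As a minimal sketch of what the calls above look like once built, the block below repeats the construction and inspects the results with deparse(). It also shows that a call can be edited like a list before it is evaluated; swapping the function for base R's sum is only an illustration and not part of the original example.

```r
fun_name = function(x, y, ...) list(x, y, ...)
var1_name = 1
var2_name = 2
l = list(as.name("fun_name"), as.name("var1_name"), as.name("var2_name"))

# an unnamed list gives a call with positional arguments
cl = as.call(l)
deparse(cl)                  # "fun_name(var1_name, var2_name)"
is.call(cl)                  # TRUE

# names on the list become argument names in the call
cl_named = as.call(setNames(l, c("", "y", "x")))
deparse(cl_named)            # "fun_name(y = var1_name, x = var2_name)"
res_named = eval(cl_named)   # matched by name, so x = 2 and y = 1

# a call can be modified like a list (illustrative only)
cl_sum = cl
cl_sum[[1]] = as.name("sum")
res_sum = eval(cl_sum)       # evaluates sum(var1_name, var2_name)
```

Building the call from a list is convenient when the function or the argument names are only known at run time.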
3.1.2 Lazy evaluation
x = 10:1
y = rnorm(10)
plot(x, y)
asd = y
plot(x, asd)
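The two plot() calls above draw the same points; what differs is the y-axis label, which is derived from the unevaluated argument expression. A minimal sketch of that mechanism follows. The helpers label_of() and unused() are hypothetical names introduced here for illustration, and no graphics device is needed.

```r
# hypothetical helper: returns the expression passed as `arg` as text,
# without evaluating it
label_of = function(arg) deparse(substitute(arg))

y = rnorm(10)
asd = y
label_of(y)      # "y"
label_of(asd)    # "asd"

# an argument is evaluated only if the function body actually uses it
unused = function(arg) "arg was never evaluated"
res = unused(stop("this error is never raised"))
```

The second part shows the other side of lazy evaluation: because unused() never touches its argument, the stop() call is never run.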
3.1.3 Tips
3.1.3.1 Assignment <- vs =
There is no strict rule for choosing one over the other:
x <- 1
x = 1
f <- function() 1
f = function() 1
In the cases above it makes no difference whether you use <- or =, but there are cases where it is important to distinguish between them.
3.1.3.1.1 assigning names to values
When assigning names to values we have to use the = sign:
c(a=1, b=2)
c(a<-1, b<-2)
data.frame(a=1, b=2)
data.frame(a<-1, b<-2)
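To make the difference visible, the sketch below compares the names of the results. With <- the values are still combined, but no names are attached and, as a side effect, variables a and b are created in the calling environment. For data.frame() the column names are then generated automatically from the expressions, so the sketch only shows that they are not a and b.

```r
v_eq = c(a=1, b=2)
names(v_eq)             # "a" "b"

v_arrow = c(a<-1, b<-2)
names(v_arrow)          # NULL: no names were assigned
c(a, b)                 # a and b now exist as variables holding 1 and 2

df_eq = data.frame(a=1, b=2)
names(df_eq)            # "a" "b"

df_arrow = data.frame(a<-1, b<-2)
names(df_arrow)         # auto-generated column names, not "a" and "b"
```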
3.1.3.1.2 passing arguments to function
When passing arguments to a function by name you must use the = sign:
f <- function(x, y) list(x=x, y=y)
f(y=1, x=2)
f(y<-1, x<-2)
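A short sketch of what the two calls above return. With = the arguments are matched by name; with <- they are matched by position, and variables y and x are created in the calling environment as a side effect.

```r
f <- function(x, y) list(x=x, y=y)

by_name = f(y=1, x=2)
str(by_name)        # x is 2, y is 1

by_position = f(y<-1, x<-2)
str(by_position)    # x is 1, y is 2: the first value went to x, the second to y

# side effect of <- : the variables were assigned where f was called
c(x=x, y=y)         # x is 2, y is 1
```

Note that the values inside the function and the variables left behind in the calling environment end up swapped relative to each other, which is an easy source of confusion.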
There are cases where using <- inside a call may actually be useful. Consider the functions:
f <- function(x, y, z) {
  if (isTRUE(x)) {
    cat("f doing branch 1\n")
    list(y, z)
  } else {
    cat("f doing branch 2\n")
    invisible(FALSE)
  }
}
g <- function(x) {
  cat("g doing heavy computation\n")
  Sys.sleep(5)
  x
}
h <- function(x) x+1
and we want to calculate
f(x=TRUE, y=v<-g(1), z=h(v))
Many people would advocate writing it as
v = g(1)
f(x=TRUE, y=v, z=h(v))
which is quite reasonable, but it does not take advantage of lazy evaluation.
With the first form of the call
f(x=TRUE, y=v<-g(1), z=h(v))
lazy evaluation makes the function evaluate each argument only when it is actually used inside the function body. The first argument acts as a switch: with x=FALSE the function exits early and the time-consuming y=v<-g(1) is never evaluated.
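A minimal sketch that makes the effect observable without waiting. It reuses f and h as defined above, but replaces Sys.sleep(5) in g with a counter, g_calls, which is an assumption added here only to record whether g was run.

```r
f <- function(x, y, z) {
  if (isTRUE(x)) {
    cat("f doing branch 1\n")
    list(y, z)
  } else {
    cat("f doing branch 2\n")
    invisible(FALSE)
  }
}
g_calls = 0   # hypothetical counter standing in for the heavy computation
g <- function(x) {
  g_calls <<- g_calls + 1
  x
}
h <- function(x) x+1

# branch 2: y and z are never used, so g is never called
res_false = f(x=FALSE, y=v<-g(1), z=h(v))
calls_after_false = g_calls    # still 0

# branch 1: y is used first, which runs g and assigns v; z then uses v
res_true = f(x=TRUE, y=v<-g(1), z=h(v))
calls_after_true = g_calls     # 1
```

This only works because f uses y before z. If the function body touched z first, v would not have been assigned yet, so the pattern relies on knowing the order in which the arguments are used.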
We can achieve the same behavior without <- in the call by wrapping f in another function that handles the switch.
ff <- function(x, val) {
  if (isTRUE(x)) {
    v = g(val)
    z = h(v)
  } else {
    v = NULL
    z = NULL
  }
  f(x, v, z)
}
ff(TRUE, 1)
ff(FALSE, 1)
Keep in mind that the above example is simplified.
3.2 Advanced queries
One of the main tasks of a data scientist or analyst is investigating data. In this chapter I am going to cover advanced queries that can be run on datasets, organized by their SQL syntax. SQL has been the standard language for querying data for many years.
3.2.1 top N by group
3.2.2 partition by over group
3.2.3 partition aggregate by over group
3.2.4 update from
3.2.5 temporal join
3.2.6 lateral join
3.2.7 aggregate on join
3.2.8 cross join
3.2.9 distinct
3.2.10 grouping sets
3.2.11 set operators
3.3 Machine Learning
automl
put model into production
3.4 Package development
3.4.1 profiling
3.4.1.1 time
3.4.1.2 memory
3.4.2 src
compiled code, most commonly C or C++
3.4.2.1 R CMD SHLIB
3.4.2.2 structure
init.c?
3.4.2.3 Rtools dependency
3.4.2.4 gcc options
-O0 vs -O3
3.4.2.5 debugging C code
gdb, valgrind
3.4.2.6 multithreading with OpenMP
3.4.2.6.1 #pragma omp parallel for
working example, env var to control
3.4.2.6.2 thread safety
memory allocation (pre-allocate memory and use C pointers from within parallel region)
REAL()
allocVector()