data.table
R extensions: packages
Ease of learning
Packages quality standards
R community
Packages provide a mechanism for loading optional code, data and documentation as needed. The R distribution itself includes about 30 packages.
Repositories:
Very few restrictions on how to code:
In R, curly brackets { and } have to match; other than that, the user is free to code in any style.
This is very useful when taking first steps in a programming language.
Python, on the other hand, requires the developer to follow strict rules, for example about indentation.
R community is very helpful in learning how to code in R.
Packages submitted to the biggest R package repository, CRAN, undergo regular package checks.
Those checks ensure that packages are compatible with the upcoming R version, as well as with their dependencies. They even check compatibility with the other packages that depend on them.
They also check the quality of compiled code by detecting compiler warnings and errors.
Additionally, package tests are run regularly.
Each public function in a package must be documented.
Any issues are reported to the package maintainer with a request to fix them.
The R ecosystem has a very helpful community, which is priceless when it comes to learning R.
Resources to efficiently learn R:
#rstats tag
data.table package
Me: if not for the syntax and speed of data.table, I would have left for another ecosystem a long time ago. Using R since 2013.
data.table
library(data.table)
DF = iris
DT = as.data.table(iris)
## filter rows
DF[DF$Petal.Width > 2.1,]
subset(DF, Petal.Width > 2.1)
DT[Petal.Width > 2.1]
## select columns
DF[, c("Petal.Width", "Petal.Length", "Species")]
DT[, .(Petal.Width, Petal.Length, Species)]
DT[, c("Petal.Width", "Petal.Length", "Species")]
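When the column names are held in a variable rather than typed literally, data.table offers the `..` prefix and `.SDcols`; a minimal sketch (the variable name `cols` is my own):

```r
library(data.table)
DT = as.data.table(iris)
## hypothetical variable holding the column names we want
cols = c("Petal.Width", "Petal.Length", "Species")
head(DT[, ..cols])               ## '..' prefix: look 'cols' up in the calling scope
head(DT[, .SD, .SDcols = cols])  ## equivalent spelling via .SDcols
```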
single column
aggregate(DF$Petal.Width, by = list(cyl = DF$Species), FUN = mean)
DT[, mean(Petal.Width), by = Species]
multiple columns
aggregate(list(mean_width=DF$Petal.Width, mean_length=DF$Petal.Length),
by = list(cyl = DF$Species), FUN = mean)
DT[, .(mean_width = mean(Petal.Width),
mean_length = mean(Petal.Length)),
by = Species]
A = data.table(id = c(1L, 1L, 1L, 2L, 2L),
code = c("abc", "abc", "abd", "apq", "apq"),
valA = c(0.1, 0.6, 1.5, 0.9, 0.3))
B = data.table(id = c(1L, 2L), code = c("abd", "apq"),
mul = c(2.0, 0.5))
## joins as subset
A[B, on = .(id, code)]
data.table supports various types of joins, like: non-equi, rolling, overlapping.
For more examples see my past talk High-productivity data frame operations with data.table.
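As a minimal sketch of one of those join types, a rolling join matches each row in one table to the most recent row at or before it in the other (the toy `trades`/`quotes` tables below are my own):

```r
library(data.table)
trades = data.table(id = c(1L, 1L, 2L),
                    time = c(5, 20, 10),   ## e.g. seconds since market open
                    price = c(10.1, 10.4, 20.0))
quotes = data.table(id = c(1L, 1L, 2L),
                    time = c(0, 15, 0),
                    bid = c(10.0, 10.3, 19.9))
## roll = TRUE rolls the last join column (time) forward: each trade
## gets the latest quote with quotes.time <= trades.time
quotes[trades, on = .(id, time), roll = TRUE]
```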
DT[i = rows,
j = columns,
by = groups,
...]
i: rows specified by a logical condition, or a table to join to
j: columns, or an expression that evaluates to columns
by: columns, or an expression that evaluates to grouping variable(s)
...: extra parameters, for example:
on: to specify the join condition
env: to parameterize a DT[] query

j
DT[, {
cat("calculating group", format(.BY$Species), "...")
tmp = Sepal.Length^2 + Sepal.Width^2
Sys.sleep(1) ## let's have a nap here
ans = mean(sqrt(tmp))
cat(" done with", .GRP, "/", .NGRP, "group\n")
list(mean.sepal.hypotenuse = ans)
}, by = Species] |> invisible()
#> calculating group setosa ... done with 1 / 3 group
#> calculating group versicolor ... done with 2 / 3 group
#> calculating group virginica ... done with 3 / 3 group
env argument
DT[, {
cat("calculating group", format(.BY$grp), "...")
tmp = var1^2 + var2^2
Sys.sleep(nap)
ans = fun(sqrt(tmp))
cat(" done with", .GRP, "/", .NGRP, "group\n")
list(out_colname = ans)
}, by = grp,
env = list(
grp="Species", var1="Sepal.Length", var2="Sepal.Width",
fun="mean", out_colname="mean.sepal.hypotenuse", nap=1
)
] |> capture.output() |> invisible()
data.table performance
Performance also means how fast a code developer/maintainer can read and write code…
Let's take the mtcars dataset as an example:
base R
aggregate(
mtcars$mpg[mtcars$am==1],
by = list(cyl = mtcars$cyl[mtcars$am==1]),
FUN = mean
)
89 characters
41 repetitive
data.table
mtcars[am==1, mean(mpg), cyl]
27 characters
4 repetitive
Finding order: Fast, stable and scalable true radix sorting
useR!, Aalborg, 02 July 2015, Matt Dowle:
Terdiman, 2000: Radix Sort Revisited
Herf, 2001: Radix Tricks
Arun Srinivasan implemented forder() in data.table entirely in C for integer, character and double
Matt Dowle changed from LSD (backwards) to MSD (forwards) for cache efficiency and to benefit from (partially) sorted data inputs
Later on, forder() was contributed to base R!
Mon, 28 Mar 2016
CHANGES IN R 3.3.0 NEW FEATURES
The radix sort algorithm and implementation from ‘data.table’ (‘forder’) replaces the previous radix (counting) sort and adds a new method for ‘order()’. Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see ‘?sort’).
Moreover, since 2018 forder() can use multiple CPU threads!
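The number of threads data.table uses is under the user's control; a minimal sketch using its thread-control helpers:

```r
library(data.table)
getDTthreads()    ## threads data.table will currently use
setDTthreads(1)   ## e.g. restrict to a single thread on a shared machine
getDTthreads()
setDTthreads(0)   ## 0 means: use all logical CPUs again
```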
Sort
Directions in Statistical Computing, 2 July 2016, Stanford, Matt Dowle:
N = 5e8  ## e.g. 500 million doubles (3.8 GB), matching the first benchmark below
x = runif(N)
ans1 = base::sort(x, method='quick')
ans2 = data.table::fsort(x)
identical(ans1, ans2)
#N=500m 3.8GB 8TH laptop: 65s => 3.9s (16x)
#N=1bn 7.6GB 32TH server: 140s => 3.5s (40x)
#N=10bn 76GB 32TH server: 25m => 48s (32x)
More on that topic in slides Proposal for parallel sort in base R (and Python/Julia), Matt Dowle, DSC Stanford from 2016.
Use of multiple CPU threads can greatly reduce the time required for computation.
We use low-level parallelism provided by the OpenMP API for the C language.
More on that topic in the slides Success with OpenMP in R package data.table, Matt Dowle, JSM Vancouver, from 2018.
Algorithms
Memory-conservative algorithms.
Tricks like memory reuse:
Avoids repeated memory allocations by reusing once-allocated memory.
Reference semantics:
Avoids unnecessary in-memory copies.
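A minimal sketch of reference semantics (the toy table is my own): `:=` adds a column in place, so the table itself is not copied, which can be verified with data.table's address() helper:

```r
library(data.table)
DT = data.table(x = 1:3)          ## toy table
before = address(DT)              ## memory address of the object
DT[, y := x * 2L]                 ## ':=' adds the column by reference
identical(before, address(DT))    ## no copy was made: address is unchanged
```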
Calculate aggregate in-place
DF = data.frame(x = sample(5e7,,TRUE), y = rnorm(5e7))
## 0.55 GB; 63% unique x
## TB and DT below are tibble/data.table copies of DF, made before each run
DF = merge(DF, aggregate(DF$y, by=list(DF$x), FUN=sum), by="x")
## base R: killed OutOfMemory
TB = TB %>% group_by(x) %>% mutate(sum_y_by_x = sum(y))
## dplyr: 10.0 GB; 169s
DT[, sum_y_by_x := sum(y), by = x]
## data.table: 3.0 GB; 5s
Avoiding repeated memory allocations can save a lot of time too.
Group by non-GForce (dogroup.c) allocates memory once, sized for the biggest group, then copies each group's values into it before aggregating, so the data is contiguous in memory and therefore cache efficient. The once-allocated memory is reused for further groups, without repeated per-group allocations. It works for any function.
Group by GForce is an optimization that redirects commonly used functions (sum, mean, etc.) to data.table's highly optimized implementations. It assigns results to many groups at once; it doesn't gather the groups together.
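Whether GForce kicks in for a given query can be inspected with verbose = TRUE; a minimal sketch (the toy data is my own):

```r
library(data.table)
DT = data.table(x = rep(1:3, 4L), y = rnorm(12))  ## toy data, 3 groups
## with verbose = TRUE, data.table reports whether it optimized
## j (here sum(y)) to its internal GForce implementation
res = DT[, sum(y), by = x, verbose = TRUE]
```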
Sort table
DF = as.data.frame(lapply(setNames(nm=letters), function(y) rnorm(1e7)))
## 1.9 GB
## TB and DT below are tibble/data.table copies of DF, made before each run
DF = DF[with(DF, order(a, b, c)),]
## base R: uses 4.1 GB
TB = TB %>% arrange(a, b, c)
## dplyr: uses 4.0 GB
setorder(DT, a, b, c)
## data.table: uses 2.3 GB
Not only the number of rows matters; see data cardinality!
Cardinality of data is a very important factor when measuring how an algorithm scales.
set.seed(108)
N = 1e6 ## 1'000'000 rows
## data of 2 groups
DF1 = data.frame(id1 = sample(2L, N, TRUE), v1 = rnorm(N))
DT1 = as.data.table(DF1)
## data of 500'000 groups
DF2 = data.frame(id1 = sample(N/2, N, TRUE), v1 = rnorm(N))
DT2 = as.data.table(DF2)
system.time( ## 2 groups
aggregate(DF1$v1, by=list(id1=DF1$id1), FUN = mean)
)
#> user system elapsed
#> 0.452 0.043 0.502
system.time( ## 5e5 groups
aggregate(DF2$v1, by=list(id1=DF2$id1), FUN = mean)
)
#> user system elapsed
#> 6.804 0.122 6.936
system.time( ## 2 groups
DT1[, mean(v1), id1]
)
#> user system elapsed
#> 0.133 0.001 0.026
system.time( ## 5e5 groups
DT2[, mean(v1), id1]
)
#> user system elapsed
#> 0.170 0.000 0.022
In software development, the number of dependencies is an important factor in the risk assessment of a system to be delivered.
data.table has zero dependencies. To deploy data.table, all a user needs is a single package tar.gz or a binary of the data.table package. There is no need to worry about extra (possibly recursive) dependencies and their compatibility across different versions.
Dependencies are invitations for other people to break your package. – Josh Ulrich, private communication; Dirk's blog: #17: Dependencies
My slides from the Sevilla R User Group conference cover the dependencies topic in depth: R package dependencies in production: risks and management
mkdir empty
R_LIBS_USER=./empty Rscript -e 'rownames(installed.packages(priority=NA_character_))'
#NULL
R_LIBS_USER=./empty Rscript -e 'install.packages("data.table")'
#* DONE (data.table)
R_LIBS_USER=./empty Rscript -e 'install.packages("dplyr")'
#also installing the dependencies ‘utf8’, ‘fansi’, ‘pkgconfig’, ‘withr’, ‘cli’, ‘generics’, ‘glue’, ‘lifecycle’, ‘magrittr’, ‘pillar’, ‘R6’, ‘rlang’, ‘tibble’, ‘tidyselect’, ‘vctrs’
2014: Grouping: github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping
2021: db-benchmark: h2oai.github.io/db-benchmark
2025: db-benchmark fork: duckdblabs.github.io/db-benchmark
github.com/jangorecki
fosstodon.org/@jangorecki
jangorecki @ protonmail.ch