What makes R strong?

packages! data.table!

Jan Gorecki

2025-06-24

What makes R strong?


R extensions: packages

Packages provide a mechanism for loading optional code, data and documentation as needed. The R distribution itself includes about 30 packages.

Repositories: CRAN, Bioconductor, r-universe.
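
A minimal sketch of the package workflow (CRAN is the default repository):

install.packages("data.table") ## download and install from a repository
library(data.table)            ## load the installed package into the session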


Ease of learning

Very few restrictions on how to code:

In R, the { and } curly brackets have to match; other than that, the user is free to code in any style.
This is very useful when taking first steps in a programming language.
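
For illustration, a toy example (hypothetical functions): both definitions below are valid R despite very different formatting.

f <- function(x) { x + 1 }
g<-function(x)
{
    x+1
}
f(1) == g(1) ## TRUE: same behaviour, different style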

Python, on the other hand, requires the developer to follow strict rules about indentation and whitespace.

The R community is very helpful in learning how to code in R.


R extensions quality

Packages submitted to CRAN, the biggest R package repository, undergo regular package checks.
Those checks ensure that packages are compatible with the upcoming R version as well as with their dependencies. They even check that packages remain compatible with the other packages which depend on them (their reverse dependencies).

They also check the quality of compiled code by detecting compiler warnings and errors.

Additionally, package tests are run regularly.

Each public function in a package must be documented.

Any issues are reported to the package maintainer with a request to fix them.
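
Maintainers can run the same suite of checks locally before submitting; the package file name below is hypothetical:

R CMD check --as-cran mypkg_1.0.tar.gz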


R community

The R ecosystem has a very helpful community, which is priceless when it comes to learning R.

Resources to efficiently learn R:


What makes R strong? The data.table package!

why I still use R

Me: if not for the syntax and speed of data.table, I would have left for another ecosystem a long time ago. Using R since 2013.

underrated and top 7 packages


Introduction to data.table


library(data.table)

DF = iris
DT = as.data.table(iris)

subset rows

DF[DF$Petal.Width > 2.1, ]
subset(DF, Petal.Width > 2.1)

DT[Petal.Width > 2.1]

subset columns

DF[, c("Petal.Width", "Petal.Length", "Species")]

DT[, .(Petal.Width, Petal.Length, Species)]
DT[, c("Petal.Width", "Petal.Length", "Species")]

group by

single column

aggregate(DF$Petal.Width, by = list(Species = DF$Species), FUN = mean)

DT[, mean(Petal.Width), by = Species]

multiple columns

aggregate(list(mean_width=DF$Petal.Width, mean_length=DF$Petal.Length),
  by = list(Species = DF$Species), FUN = mean)

DT[, .(mean_width = mean(Petal.Width),
       mean_length = mean(Petal.Length)),
   by = Species]

join

A = data.table(id = c(1L, 1L, 1L, 2L, 2L),
  code = c("abc", "abc", "abd", "apq", "apq"),
  valA = c(0.1, 0.6, 1.5, 0.9, 0.3))
B = data.table(id = c(1L, 2L), code = c("abd", "apq"),
  mul = c(2.0, 0.5))

## joins as subset
A[B, on = .(id, code)]
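## each row of B finds its matching rows in A: one row for (1, "abd"),
## two rows for (2, "apq"), with B's mul column carried alongside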

data.table supports various other types of joins: non-equi, rolling, overlapping. A minimal sketch of a rolling join follows.
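
Hypothetical trades/quotes data: each quote is matched to the most recent trade at or before its timestamp (roll = TRUE carries the last observation forward on the last join column).

trades = data.table(id = 1L, time = c(10, 20, 30), price = c(1.1, 1.2, 1.3))
quotes = data.table(id = 1L, time = c(12, 25))
trades[quotes, on = .(id, time), roll = TRUE] ## prices 1.1 and 1.2 roll forward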

For more examples see my past talk High-productivity data frame operations with data.table.


General syntax

DT[i = rows,
   j = columns,
   by = groups,
   ...]

Expression in j

DT[, {
  cat("calculating group", format(.BY$Species), "...")
  tmp = Sepal.Length^2 + Sepal.Width^2
  Sys.sleep(1) ## let's have a nap here
  ans = mean(sqrt(tmp))
  cat(" done with", .GRP, "/", .NGRP, "group\n")
  list(mean.sepal.hypotenuse = ans)
}, by = Species] |> invisible()
#> calculating group setosa ... done with 1 / 3 group
#> calculating group versicolor ... done with 2 / 3 group
#> calculating group virginica ... done with 3 / 3 group

Parametrization: env argument

DT[, {
  cat("calculating group", format(.BY$grp), "...")
  tmp = var1^2 + var2^2
  Sys.sleep(nap)
  ans = fun(sqrt(tmp))
  cat(" done with", .GRP, "/", .NGRP, "group\n")
  list(out_colname = ans)
}, by = grp,
  env = list(
    grp="Species", var1="Sepal.Length", var2="Sepal.Width",
    fun="mean", out_colname="mean.sepal.hypotenuse", nap=1
  )
] |> capture.output() |> invisible()

Secrets of data.table performance

syntax

Performance here also means how efficiently a developer/maintainer can read and write the code…

Let’s take the mtcars dataset as an example


base R

aggregate(
  mtcars$mpg[mtcars$am==1],
  by = list(cyl = mtcars$cyl[mtcars$am==1]),
  FUN = mean
)

89 characters
41 repetitive


data.table

## assuming mtcars was converted first: mtcars = as.data.table(mtcars)
mtcars[am==1, mean(mpg), cyl]

27 characters
4 repetitive


Computation time

Algorithms

Finding order: Fast, stable and scalable true radix sorting

useR!, Aalborg, 02 July 2015, Matt Dowle:

Terdiman, 2000: Radix Sort Revisited
Herf, 2001: Radix Tricks
Arun Srinivasan implemented forder() in data.table entirely in C for integer, character and double
Matt Dowle changed from LSD (backwards) to MSD (forwards) for cache efficiency and to benefit from (partially) sorted data inputs


Later on, forder() was contributed to base R!

Mon, 28 Mar 2016
CHANGES IN R 3.3.0 NEW FEATURES

The radix sort algorithm and implementation from ‘data.table’ (‘forder’) replaces the previous radix (counting) sort and adds a new method for ‘order()’. Contributed by Matt Dowle and Arun Srinivasan, the new algorithm supports logical, integer (even with large values), real, and character vectors. It outperforms all other methods, but there are some caveats (see ‘?sort’).
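
Since R 3.3.0 the radix method is available directly in base R, e.g. (a minimal sketch):

x = c(3L, 1L, 2L, 1L)
sort(x, method = "radix")  ## 1 1 2 3
order(x, method = "radix") ## 2 4 3 1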

Moreover, since 2018 forder() can use multiple CPU threads!


Sort

Directions in Statistical Computing, 2 July 2016, Stanford, Matt Dowle:

N = 5e8 ## e.g. 500 million values, matching the first timing below
x = runif(N)
ans1 = base::sort(x, method='quick')
ans2 = data.table::fsort(x)
identical(ans1, ans2)
#N=500m 3.8GB 8TH laptop: 65s => 3.9s (16x)
#N=1bn 7.6GB 32TH server: 140s => 3.5s (40x)
#N=10bn 76GB 32TH server: 25m => 48s (32x)

More on this topic in the slides Proposal for parallel sort in base R (and Python/Julia), Matt Dowle, DSC Stanford, 2016.


Multithreading

Use of multiple CPU threads can greatly reduce time required for computation.

[figure: fread with OpenMP]

We use low-level parallelism provided by the OpenMP API for the C language.
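
The number of threads data.table uses can be inspected and adjusted at runtime:

getDTthreads()  ## how many threads data.table will use
setDTthreads(4) ## e.g. limit data.table to 4 threads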

More on this topic in the slides Success with OpenMP in R package data.table, Matt Dowle, JSM Vancouver, 2018.


Memory optimized


Memory conservative algorithms

Calculate aggregate in-place

DF = data.frame(x = sample(5e7, replace=TRUE), y = rnorm(5e7))
## 0.55 GB; 63% unique x
## for the comparisons below: TB = as_tibble(DF); DT = as.data.table(DF)

DF = merge(DF, aggregate(list(sum_y_by_x=DF$y), by=list(x=DF$x), FUN=sum), by="x")
## base R: killed OutOfMemory

TB = TB %>% group_by(x) %>% mutate(sum_y_by_x = sum(y))
## dplyr: 10.0 GB; 169s

DT[, sum_y_by_x := sum(y), by = x]
## data.table: 3.0 GB; 5s (adds the column by reference)

Memory reuse

Avoiding repeated memory allocations can save a lot of time too.

The non-GForce group by (dogroup.c) allocates memory for the biggest group, then copies each group's values into that buffer before aggregating, so the values are contiguous in memory and therefore more cache efficient. The buffer, allocated once, is then reused for all subsequent groups, avoiding a fresh allocation per group. This works for any function.

The GForce group by is an optimization that redirects commonly used functions (sum, mean, etc.) to data.table's highly optimized implementations. It assigns results for many groups at once; it doesn't gather the groups together.
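
A quick way to see GForce in action is verbose mode (a minimal sketch; the exact wording of the message varies between versions):

d = data.table(x = sample(5L, 1e6, TRUE), y = rnorm(1e6))
d[, sum(y), by = x, verbose = TRUE]
## the verbose output reports that j was optimized by GForce to gsum(y)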


Reference semantics

Sort table

DF = as.data.frame(lapply(setNames(nm=letters), function(y) rnorm(1e7)))
## 1.9 GB
## for the comparisons below: TB = as_tibble(DF); DT = as.data.table(DF)

DF = DF[with(DF, order(a, b, c)),]
## uses 4.1 GB

TB = TB %>% arrange(a, b, c)
## uses 4.0 GB

setorder(DT, a, b, c)
## uses 2.3 GB
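
A minimal sketch of why setorder() is cheaper: it reorders the table in place, so the object keeps the same memory address instead of being copied.

DT = data.table(a = rnorm(10), b = rnorm(10), c = rnorm(10))
before = address(DT)
setorder(DT, a, b, c)
identical(before, address(DT)) ## TRUE: sorted by reference, no copy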

Data complexity

Not only the number of rows matters, see data cardinality!
The cardinality of data (its number of unique values) is a very important factor when measuring how an algorithm scales: below, aggregate() slows down roughly 14x when going from 2 to 500'000 groups, while data.table stays essentially flat.

set.seed(108)
N = 1e6 ## 1'000'000 rows

## data of 2 groups
DF1 = data.frame(id1 = sample(2L, N, TRUE), v1 = rnorm(N))
DT1 = as.data.table(DF1)

## data of 500'000 groups
DF2 = data.frame(id1 = sample(N/2, N, TRUE), v1 = rnorm(N))
DT2 = as.data.table(DF2)

system.time( ## 2 groups
  aggregate(DF1$v1, by=list(id1=DF1$id1), FUN = mean)
)
#>    user  system elapsed 
#>   0.452   0.043   0.502 
system.time( ## 5e5 groups
  aggregate(DF2$v1, by=list(id1=DF2$id1), FUN = mean)
)
#>    user  system elapsed 
#>   6.804   0.122   6.936 

system.time( ## 2 groups
  DT1[, mean(v1), id1]
)
#>    user  system elapsed 
#>   0.133   0.001   0.026 
system.time( ## 5e5 groups
  DT2[, mean(v1), id1]
)
#>    user  system elapsed 
#>   0.170   0.000   0.022 

Reliability

In software development, the number of dependencies is an important factor in the risk assessment of a system to be delivered.

data.table has zero dependencies. To deploy data.table, all a user needs is a single tar.gz or binary of the data.table package. There is no need to worry about extra (possibly recursive) dependencies and their compatibility across different versions.
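
This can be verified against CRAN metadata (requires network access; the result is expected to contain only methods, which ships with R itself):

db = available.packages()
tools::package_dependencies("data.table", db = db, recursive = TRUE)
## expected: only "methods", which is part of base R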

“Dependencies are invitations for other people to break your package.” – Josh Ulrich, private communication, quoted in Dirk Eddelbuettel’s blog post #17: Dependencies

www.tinyverse.org

My slides from the Sevilla R User Group conference cover the dependencies topic in depth: R package dependencies in production: risks and management


mkdir empty
R_LIBS_USER=./empty Rscript -e 'rownames(installed.packages(priority=NA_character_))'
#NULL

R_LIBS_USER=./empty Rscript -e 'install.packages("data.table")'
#* DONE (data.table)

R_LIBS_USER=./empty Rscript -e 'install.packages("dplyr")'
#also installing the dependencies ‘utf8’, ‘fansi’, ‘pkgconfig’, ‘withr’, ‘cli’, ‘generics’, ‘glue’, ‘lifecycle’, ‘magrittr’, ‘pillar’, ‘R6’, ‘rlang’, ‘tibble’, ‘tidyselect’, ‘vctrs’

Benchmarks

2014: Grouping: github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping

2021: db-benchmark: h2oai.github.io/db-benchmark

2025: db-benchmark fork: duckdblabs.github.io/db-benchmark


Questions?

github.com/jangorecki
fosstodon.org/@jangorecki
jangorecki @ protonmail.ch