R package dependencies in production

risks and management

Jan Gorecki

2024-11-07

Software dependencies

What are dependencies?

Dependencies are a way to outsource computation that your software needs to perform to another piece of software.

Types of dependencies

Dependencies are generally classified into two types:

not defined in your software,
defined in software that your software depends on, or…
defined in software that depends on the software that depends … that your software depends on.


Why do we want to use them?

Why do we want to avoid them?


Dependencies in the R ecosystem

Types of dependencies in R





For example...

curl CRAN preview


To install curl in R, we first need to install the required OS package:

## Debian / Ubuntu
sudo apt install libcurl4-openssl-dev
## Fedora / Red Hat Enterprise
sudo dnf install libcurl-devel

And then, in R:

install.packages("curl")

For the curl package it was “simple”,

so let’s try…


pkgdown. CRAN’s pkgdown page says:

SystemRequirements: pandoc

So we proceed with the specified requirement…

sudo apt install pandoc

We might expect that this is enough and that we can now install pkgdown:

install.packages("pkgdown")
#...
#ERROR: dependencies httr2, openssl, ragg, xml2
#  are not available for package pkgdown

What happened?

CRAN mentions only the OS dependencies of pkgdown itself,
but not the OS dependencies of its 58 (including recursive) R dependencies.

In fact, we also need to install 10 other OS packages:

sudo apt install libcurl4-openssl-dev libssl-dev libxml2-dev \
  libfontconfig1-dev libharfbuzz-dev libfribidi-dev \
  libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev

Actually it installs many more than 10 packages
as some of those have recursive dependencies as well!
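
The gap described above (system requirements are listed per package, not for the whole recursive set) can be explored with base R. The following is a minimal sketch, assuming a package index that carries a SystemRequirements column; the toy index and all package names in it are hypothetical. With network access, a data frame such as the one returned by tools::CRAN_package_db() could be passed instead. Note that SystemRequirements is free text, so the result still has to be translated into OS package names by hand.

```r
# Toy stand-in for a repository index; all package names are hypothetical.
db <- data.frame(
  Package = c("app", "web", "xmlparse", "fonts"),
  Depends = NA_character_,
  Imports = c("web, fonts", "xmlparse", NA, NA),
  LinkingTo = NA_character_,
  SystemRequirements = c("pandoc", "libcurl", "libxml2", NA),
  stringsAsFactors = FALSE
)

# Collect SystemRequirements of a package and all of its recursive
# strong dependencies (Depends, Imports, LinkingTo).
system_requirements <- function(pkg, db) {
  deps <- tools::package_dependencies(
    pkg, db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )[[pkg]]
  rows <- db[db$Package %in% c(pkg, deps), c("Package", "SystemRequirements")]
  # keep only packages that declare any system requirement
  rows[!is.na(rows$SystemRequirements), ]
}

sr <- system_requirements("app", db)
print(sr)
```

In the toy index only "app" declares pandoc itself, yet its recursive dependencies contribute further requirements, which mirrors the pkgdown situation.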


In fact… actually…

We can see how tedious a process it is to set up an environment.
And we usually need three: dev, test, prod.

Those who experienced that might have felt they entered "Dependency hell",
but this is just the tip of the iceberg!

There are projects addressing the problem by providing CRAN packages as OS-level software, so that OS dependencies can be resolved automatically: r2u for Ubuntu, cran2copr for Fedora, …

There are also multiple third-party tools to manage dependencies, each in its own way.

It is important to weigh their use carefully, because by using them we address some of the problems by introducing another (deployment-time) dependency.
And that new dependency is no different in terms of the risks discussed before: trust, ability to fix, breaking changes (recursively!).


R packages

Within R packages, different dependency relations can be defined.
Let’s briefly recall them here.


Depends, Imports, LinkingTo

In general, “Depends” should be avoided in favor of “Imports”, as the former pollutes the search path of your package’s users.


Suggests, Enhances


Current state of dependencies in R

ap = available.packages(); tpd = tools::package_dependencies
pd = tpd(rownames(ap), ap, recursive=FALSE, which="strong")
summary(lengths(pd)) ## --- direct dependencies ---
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   0.000   2.000   4.000   5.546   8.000  57.000
pd[order(lengths(pd))] |> tail(n=3) |> lengths()
#> TOmicsVis immunarch    Seurat
#>        43        45        57
pr = tpd(rownames(ap), ap, recursive=TRUE, which="strong")
summary(lengths(pr)) ## --- all dependencies ---
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>    0.00    5.00   20.00   32.95   49.00  260.00
pr[order(lengths(pr))] |> tail(n=3) |> lengths()
#>    wallace      BioM2 TestAnaAPP
#>        240        252        260

Why do we want to avoid them?

Trust and reliability - security risks

your_pkg
depends on your_dependency
and then your_dependency (which is outside of your quality control)
adds a single line to its DESCRIPTION file:

Additional_repositories: https://r-repo.evil

Then dependencies of your_dependency will be resolved from a malicious server.
As smoothly as from CRAN.

And that is already enough.
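
One way to keep an eye on this is to audit which installed packages declare the field at all. This is a minimal sketch using only base R; it merely lists declarations in the local library, and whether a given installer actually honors the field is outside its scope.

```r
# Read the Additional_repositories field from every installed package's
# DESCRIPTION, in addition to the default fields.
ip <- utils::installed.packages(fields = "Additional_repositories")

# Keep only packages that declare at least one extra repository.
has_extra <- !is.na(ip[, "Additional_repositories"])
extra_repos <- ip[has_extra, c("Package", "Additional_repositories"), drop = FALSE]

# A character matrix with one row per declaring package (possibly zero rows).
print(extra_repos)
```

Any URL listed there that is not a repository you already trust deserves a closer look before the package reaches Test or Prod.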


Modify or fix

It may happen that your dependency is no longer actively maintained,
or that its maintainer rejects the modification or fix you proposed.

Then you are left with

  1. forking the repo
  2. fixing your local fork
  3. maintaining it

Breaking changes

Avoid writing production code that depends on pre-1.0.0 versions of your dependencies.

Prefer software that puts effort into backward compatibility.

Submit your usage patterns as new unit tests upstream.

Packages that are pre-1.0.0 can be considered unstable: exported functions or arguments are still very likely to change.
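
The pre-1.0.0 rule is easy to check mechanically. Below is a minimal sketch in base R; the helper name is_pre_1_0 and the package names and versions are made up for illustration.

```r
# TRUE for every version string that is lower than 1.0.0.
is_pre_1_0 <- function(versions) package_version(versions) < "1.0.0"

# Hypothetical dependencies of a project with their versions.
deps <- c(stablepkg = "1.4.2", youngpkg = "0.9.1", devpkg = "0.3.0.9000")

# Names of dependencies that should be treated as unstable.
unstable <- names(deps)[is_pre_1_0(deps)]
print(unstable)
```

The same helper can be applied to the "Version" column of installed.packages() to screen a whole library.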


Operational costs

More dependencies in the majority of cases imply:

Often a package may well be slower than the corresponding base R implementation,
leading to higher use of computing resources and thus higher operational costs.

C code compiles significantly faster than C++,
and does not require a C++ compiler (a heavy OS dependency)!


How to avoid dependencies?

do you really need to avoid them?

There are many situations where avoiding dependencies may not really be important:


examine if you really need them

For a package that is a shiny app or a markdown report, colored console output will not be of much use.


write them yourself

Let’s say you need to bind a list of data.frames

l = list(data.frame(a = 1, b = 2),
         data.frame(a = 4, b = 3))
my_rbindlist = function(l) do.call("rbind", l)
all.equal(
  my_rbindlist(l),
  data.table::rbindlist(l) |> as.data.frame()
)
#> [1] TRUE

Base R will, however, be slower, more memory demanding, and not as feature rich.
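
As an illustration of a missing feature: plain rbind fails when the data.frames do not share the same columns. If you need that case, it can still be written in a few lines of base R. This is only a sketch; my_rbind_fill is a hypothetical helper, not an equivalent of any particular package function.

```r
# Bind data.frames by row, filling columns missing in some inputs with NA.
my_rbind_fill <- function(l) {
  cols <- unique(unlist(lapply(l, names)))
  padded <- lapply(l, function(d) {
    # add every column this data.frame lacks
    for (m in setdiff(cols, names(d))) d[[m]] <- NA
    # put columns in a common order so rbind can match them
    d[cols]
  })
  do.call("rbind", padded)
}

l <- list(data.frame(a = 1, b = 2),
          data.frame(a = 4, c = 3))
res <- my_rbind_fill(l)
print(res)
```

Every such feature added by hand is code you now maintain, which is the trade-off to weigh against the dependency.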


find a more lightweight alternative

Choose a vignette rendering engine that is sufficient for your needs.
A slide from a presentation by Yihui (author of knitr, blogdown, bookdown and many others): markdown vs rmarkdown comparison


Since then, Yihui has developed an even more lightweight alternative to rmarkdown and markdown: litedown


As with the vignette engine,
we have some choice when it comes to testing a package.
The most popular options are testthat and tinytest.

str(tools::package_dependencies(
  c("testthat","tinytest"),
  which="strong", recursive=TRUE
))
#List of 2
# $ testthat: chr [1:36] "brio" "callr" "cli" "desc" ...
# $ tinytest: chr [1:2] "parallel" "utils"

tinytest is good, simple, easy to use, and lightweight.

Still, base R offers more than enough for package testing.


copy source to your package

If the licenses (of your package and of your dependency) are not exactly the same, this will add a licensing burden to your project.
That implies the following changes:

If the licenses are not compatible, you may eventually have to change the license of your project!

Moreover, when you decide to copy the code, all of it should be well understood, as you become the maintainer of the local copy.


Examples


When “Everything” Becomes Too Much

A package that depends on everything in the JavaScript npm ecosystem:

The npm Package Chaos of 2024

Denial of Service (DoS) for anyone who installs it

locked down the ability for authors to unpublish their packages

In R, when we submit a package to CRAN, it undergoes a review.

But what about code review when submitting updates to a package that is already on CRAN?
Other repositories? r-universe.dev


Popular npm package hijacked

Another attack vector: a hijacked repository account

crypto-mining and password-stealing malware embedded in “UAParser.js,”
a popular JavaScript NPM library with over 6 million weekly downloads

UAParser.js’s developer Faisal Salman:

I believe someone was hijacking my NPM account and published some compromised packages (0.7.29, 0.8.0, 1.0.0) which will probably install malware


The event-stream vulnerability

npm again

However I’m sure it’s a dependency or a dependency of a dependency (of a dependency) of a number of packages being used in production by plenty of applications.


npm is-even

npm is even


heaviest objects in the universe

heavy node modules


R examples


DBI introduces new heavy dependency

Before

Depends:
    R (>= 2.15.0),
    methods

After, 25th Feb 2015, DBI 1.2.3.9013: Integrate SQL package into DBI

Depends:
    R (>= 2.15.0),
    methods
LinkingTo: Rcpp
Imports:
    Rcpp

Jan: Rcpp dependency #40

…new heavy dependency which I’m currently not using in multiple production environments …chance to move Rcpp to Suggests or make a non-Rcpp branch?

Hadley:

…most people will have any way (since rcpp is most downloaded package)

Hannes: DBI depending on Rcpp? #82

…Rcpp is a huge dependency …Rcpp is only used to parse SQL strings. IMHO, this is also not really a job for DBI to semantically analyze the query strings.


Hannes: R implementation of sqlParseVariablesImpl #83

Fixes #82.

hannes DBI fix

Hannes Mühleisen - creator of duckdb


RSQLite introduces new heavy dependency

RSQLite v1.0.0 (2014)

Depends:
    R (>= 2.10.0),
    DBI (>= 0.3.1),
    methods

RSQLite v1.1.0 (2016)

Depends:
    R (>= 3.1.0)
Imports:
    DBI (>= 0.4-9),
    memoise,                ## new
    methods,
    Rcpp (>= 0.12.7)        ## new
LinkingTo: Rcpp, BH, plogr  ## new

RSQLite (2024)

Depends:
    R (>= 3.1.0)
Imports:
    bit64,               ## new
    blob (>= 1.2.0),     ## new
    DBI (>= 1.2.0),
    memoise,
    methods,
    pkgconfig,           ## new
    rlang                ## new
LinkingTo:
    plogr (>= 0.2.0),    ## new
    cpp11 (>= 0.4.0)     ## new

is it still SQLite?


Dirk Eddelbuettel / RSQLite
eddelbuettel/RSQLite diff


Fork is made on RSQLite v1.0.0 (2014-10-25), at the time when package was lightweight. No new features are being added here, it works good enough, and light enough. SQLite has been upgraded to 3.31.1 (2020-01-27).


‘evaluate’ package

Reason for increasing the required R version to >= 4.0? evaluate#173
…I see that the required R version has been bumped from 3.0.2 to 4.0
@etiennebacher (R polars maintainer)

…unlikely to have much impact on testthat, since it already depends on brio, fs, glue, lifecycle and waldo which all depend on R 3.6.0 (and will be bumped to 4.0.0
@hadley

A couple of other projects affected by that change joined the reported issue.


RPostgreSQL vs RPostgres

str(tools::package_dependencies(
  c("RPostgreSQL","RPostgres"),
  which="strong", recursive=TRUE
))
#List of 2
# $ RPostgreSQL: chr [1:2] "methods" "DBI"
# $ RPostgres  : chr [1:22] "bit64" "blob" "DBI" "hms" ...

less is more – tiny versus tidy by @eddelbuettel

While there is (considerable) variability (likely stemming from heterogeneous setups at GitHub Action) the tiny approach is on average about twice as fast as the tidy approach


Other

Many others that I was not personally involved in.

Having an identifiable set of package dependencies at any point in time is
a beginning. Its difficult to effectively control developer behaviour, so
there is a risk there, but what makes it into production can in principle
be identified and controlled.


Good practices


Lock dependencies

Mirror all R packages that your project(s) require, including recursive dependencies.

tools::write_PACKAGES()
drat::insertPackages()
tools4pkgs::mirror.packages()

Essential for Test and Prod deployments.
We never want to deploy Prod using the most up-to-date dependencies, which have not yet been tested against our code.
We need the exact set of packages shipped to Test to also be deployed to Prod.
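
The mirroring step can be sketched with base R alone. This is a minimal sketch under stated assumptions: packages_to_mirror and mirror are hypothetical helper names, the toy index is made up, and mirror() needs network access, so it is only defined here and not called.

```r
# Packages shipped with R itself never need mirroring.
base_pkgs <- rownames(utils::installed.packages(priority = "base"))

# Requested packages plus all recursive strong dependencies, minus base R.
packages_to_mirror <- function(pkgs, db) {
  deps <- tools::package_dependencies(
    pkgs, db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  setdiff(unique(c(pkgs, unlist(deps, use.names = FALSE))), base_pkgs)
}

# Download sources into a local CRAN-like repository and index it.
# Requires network access; not executed in this sketch.
mirror <- function(pkgs, repo_dir, repos = getOption("repos")) {
  db <- utils::available.packages(repos = repos)
  src_dir <- file.path(repo_dir, "src", "contrib")
  dir.create(src_dir, recursive = TRUE, showWarnings = FALSE)
  utils::download.packages(packages_to_mirror(pkgs, db), destdir = src_dir,
                           available = db, repos = repos, type = "source")
  tools::write_PACKAGES(src_dir, type = "source")
}

# Offline demonstration on a toy index (hypothetical package names).
toy_db <- data.frame(
  Package = c("app", "helper"),
  Depends = NA_character_,
  Imports = c("helper, methods", NA),
  LinkingTo = NA_character_,
  stringsAsFactors = FALSE
)
to_mirror <- packages_to_mirror("app", toy_db)
print(to_mirror)
```

Test and Prod can then install from the same frozen directory by pointing the repos argument of install.packages() at it with a file:// URL.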


Investigate dependencies

Base R’s tools::package_dependencies is your friend.

str(tools::package_dependencies(
  c("curl","jsonlite","data.table","duckdb"),
  which="strong", recursive=TRUE
))
#List of 4
# $ curl      : chr(0)
# $ jsonlite  : chr "methods"
# $ data.table: chr "methods"
# $ duckdb    : chr [1:3] "DBI" "methods" "utils"
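
The same function can also show which direct dependency is responsible for most of the recursive ones, which helps decide what to drop first. A minimal sketch follows; dependency_weight is a hypothetical helper and the index is a toy one with made-up package names. In practice db would be the result of available.packages().

```r
# For each direct dependency of pkg, count its own recursive dependencies.
dependency_weight <- function(pkg, db,
                              which = c("Depends", "Imports", "LinkingTo")) {
  direct <- tools::package_dependencies(pkg, db, which = which)[[pkg]]
  if (!length(direct)) return(integer())
  rec <- tools::package_dependencies(direct, db, which = which,
                                     recursive = TRUE)
  sort(lengths(rec), decreasing = TRUE)
}

# Toy index: "heavy" pulls in three more packages, "light" none.
db <- data.frame(
  Package = c("app", "heavy", "light", "x", "y", "z"),
  Depends = NA_character_,
  Imports = c("heavy, light", "x, y", NA, "z", NA, NA),
  LinkingTo = NA_character_,
  stringsAsFactors = FALSE
)
w <- dependency_weight("app", db)
print(w)
```

A direct dependency with a large count is the first candidate for replacement, for a move to Suggests, or for a lighter alternative.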

Move to Suggests

Suggested dependencies are not mandatory for installing your package.
They stay completely optional until the user reaches functionality within the package that needs a suggested dependency.
For such cases we escape every call to functionality from a suggested dependency, raising a meaningful error if the package is not installed.

if (requireNamespace("pkg", quietly=TRUE)) {
  pkg::fun()
} else {
  stop("'pkg' is required to run this functionality, retry after:\n",
       "install.packages('pkg')\n")
}
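
When many functions need the same escape, the pattern can be wrapped once. This is a minimal sketch; require_suggested is a hypothetical helper name, not part of any package.

```r
# Raise a meaningful error if a suggested package is not installed.
require_suggested <- function(pkg, purpose = "to run this functionality") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(sprintf("'%s' is required %s, retry after:\ninstall.packages('%s')",
                 pkg, purpose, pkg), call. = FALSE)
  }
  invisible(TRUE)
}

# 'stats' ships with R, so this passes silently.
ok <- require_suggested("stats")

# A package that is surely not installed produces the error message.
msg <- tryCatch(
  require_suggested("surelyNotInstalledPkg123", "to draw plots"),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```

Each function that uses a suggested package then starts with a single call to the helper, followed by pkg::fun() calls.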

‘testthat’ should be escaped

WRE (Writing R Extensions), the bible for R package developers, says:

Note that the recommendation to use suggested packages conditionally in tests does also apply to packages used to manage test suites: a notorious example was testthat which in version 1.0.0 contained illegal C++ code and hence could not be installed on standards-compliant platforms.

The same as every other suggested dependency! File tests/testthat.R:

if (requireNamespace("testthat", quietly=TRUE)) {
  library("testthat")
  test_check("your.pkg")
} else cat("Package tests have been skipped\n")

Use base R for unit test

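# stop with an error unless x and y are exactly identical
expect_identical <- function(x, y) {
  stopifnot(identical(x, y))
}
# stop with an error unless x and y are equal up to numeric tolerance;
# all.equal() returns TRUE or a character description of the differences
expect_equal <- function(x, y, ...) {
  stopifnot(isTRUE(all.equal(x, y, ...)))
}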

You may find it useful to set _R_CHECK_NO_STOP_ON_TEST_ERROR_=true or run R CMD check with --no-stop-on-test-error flag.


vignettes using markdown rather than rmarkdown

markdown is like rmarkdown but has fewer features.

You just need to check if you use any of those extra features.

So, we moved 10 vignettes from engine rmarkdown to engine markdown…


data.table: vignette render with markdown rather than rmarkdown #5773

datatable-markdown

So saving 12 minutes on each test job means saving a lot of CI compute minutes. We have 8 test jobs currently, and likely we will have more. There is also build job which needs to install those. So savings of CI compute minutes are more than 100 min on a single pipeline. Aside from time we can also use lighter image for build (no need for C++ toolchain).


litedown

Developed this year by Yihui Xie (author of knitr, bookdown, blogdown).


avoid tidyverse in favor of dplyr and friends

Avoid using tidyverse as a dependency of your package.

Depends: tidyverse

Instead list explicitly tidyverse packages that you directly depend on.

Imports: dplyr, ggplot2

The same goes for other meta-packages (mlr3verse, etc.).


make OS deps suggested deps!

It turns out this is possible by adjusting compilation flags dynamically, depending on zlib availability.

Therefore, if the zlib OS dependency was not available during data.table installation,
the package still installs correctly, and only the compression feature of fwrite() raises an error about the missing ‘zlib’.

data.table::fwrite(iris, "iris.tar.gz")
#Error in data.table::fwrite(iris, "iris.tar.gz") : 
#  Compression in fwrite uses zlib library. Its header files were not found at
#  the time data.table was compiled. To enable fwrite compression, please
#  reinstall data.table and study the output for further guidance.

More on the topic

Dependencies are invitations for other people to break your package.
– Josh Ulrich, private communication

Dirk’s blog: #17: Dependencies






Do not stop using dependencies.

Use them wisely!





Thanks to