4 Delivery
There is not much R in this chapter; it is mostly about tools that ease the delivery of R projects.
4.1 Tools
We will create an example package to present a fully reproducible workflow using the tools listed below, so that it can easily be adapted to your own project.
Populate the example project:
mkdir -p mypkg/R mypkg/man mypkg/tests
cd mypkg
echo '' > DESCRIPTION
echo 'export(f)' > NAMESPACE
echo 'f <- function(x) x' > R/f.R
echo '' > man/f.Rd
echo 'library(mypkg); x=1; stopifnot(identical(f(x), x))' > tests/test-f.R
We have an example package; before proceeding, let's test it:
cd ..
R CMD check mypkg
4.1.1 make
Add a Makefile to create aliases for common actions required in your workflow. It defines an intermediate layer of abstraction. We add the Makefile in the project directory.
cd mypkg
touch Makefile
- make build
- make test: R CMD check . --ignore-vignettes --no-stop-on-test-error
- make integration: R CMD check . --as-cran
- make deploy
- make all
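A minimal sketch of such a Makefile, created in the same shell style as the rest of the example; the build and all recipes are assumptions, while test and integration use the commands listed above (note that Makefile recipe lines must be indented with a tab):
cat > Makefile <<'EOF'
build:
	R CMD build .
test:
	R CMD check . --ignore-vignettes --no-stop-on-test-error
integration:
	R CMD check . --as-cran
all: build test integration
EOF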
4.1.2 git
Each environment can have its code mirrored in a git branch (assuming the package directory is already a git repository with master, devel and test branches), which eases day-to-day tasks like:
git checkout devel
echo 'g<-function() stop("lets crash")' > R/g.R
echo 'export(g)' >> NAMESPACE
echo 'library(mypkg); stopifnot(inherits(try(g(), silent=TRUE), "try-error"))' > tests/test-g.R
make test
The unit tests pass, but the package structure checks do not: the missing manual entry for g is the problem. The master and test branches are unaffected:
git checkout master
git checkout test
make test
4.1.3 cron
# list jobs
crontab -l
# edit jobs
crontab -e
Test our devel and test branches every day (assuming the mypkg repository lives in the cron user's home directory):
# every day at 2am
0 2 * * * cd mypkg && git checkout devel && make test && make clean && git checkout test && make test
4.1.4 tracking status
You should obviously have as much unit test coverage as possible. This does not necessarily mean a particular score in some online service, but rather that the business requirements are fully satisfied by the defined unit tests. It is easy to apply test-driven development if required. All those tests should live in the tests directory and run each time the package check is run.
You can and should produce logs from your application. Logs help in testing because you can verify which points in the code a unit test reached simply by matching the log output against expected strings.
echo 'f <- function(x=NULL, verbose=FALSE) { if (verbose) cat("doing something 123\n"); x }' > tests/test-verbose.R
echo 'stopifnot(any(grepl("doing something", capture.output(f(verbose=TRUE)))))' >> tests/test-verbose.R
make test
- simulate a failure
echo 'stop("something went wrong")' > tests/fail.R
make test
This kind of error is already tracked in an automated way inside unit tests. Another set of failures may come at the OS level; these require a little maintenance, which ultimately can also be automated. There is a set of command line tools that works really well for this.
- count the number of lines in a file
wc -l data.csv
# we expect the data to have at least 5 rows (6 lines including the header)
test $(wc -l < data.csv) -ge 6
- going through logs
tail log.out
more log.out
less log.out
- log entries that match pattern
grep ERROR log.out
- using pipes (|)
look up a particular entry in the R CMD check help
R CMD check --help | grep test-error
tail the log lines containing WARNING
grep WARNING log.out | tail
- system monitoring
htop
free -h
lscpu | less
- working on remote sessions
ssh-copy-id
byobu
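A typical remote workflow might look like this; the user and host names are only placeholders:
# copy your public key to the server once, then log in without a password
ssh-copy-id user@myserver
ssh user@myserver
# byobu gives a persistent terminal session that survives disconnects
byobu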
4.2 Deployment
An example of a VPS/cloud deployment, covering how to resolve common deployment challenges.
4.2.1 Obtain machine
As an example we will use virtual private server (VPS) on Amazon Web Services (AWS) running spot instances.
Spot instances can be terminated at any time, when your maximum price per hour has been outbid by someone else, but they are cheap.
4.2.2 Configure machine
4.2.2.1 Update OS software
sudo apt update
sudo apt upgrade
sudo apt dist-upgrade
4.2.2.2 Add sources for recent software versions
sudo ...
4.2.2.3 Install software
sudo apt install gcc-8 r-base
4.2.2.4 Configure software
- ~/.R/Makevars: compiler and build flags used when installing packages from source
- r-setup
- .Rprofile: R code run at the start of every session (e.g. options)
- .Renviron: environment variables set for R sessions
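A few example entries, assuming the gcc-8 compiler installed above and a default CRAN mirror; adapt them to your setup:
mkdir -p ~/.R
echo 'CC=gcc-8' >> ~/.R/Makevars
echo 'options(repos=c(CRAN="https://cloud.r-project.org"))' >> ~/.Rprofile
echo 'R_LIBS_USER=~/R/library' >> ~/.Renviron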
This tutorial does not go into OS user management and security.
4.2.3 Clone dependencies
Clone the dependencies required by your package into a CRAN-like subset of CRAN, to freeze the versions of all dependencies.
# helper script; this can also be done by hand or with another helper tool, like the miniCRAN package
source("https://raw.githubusercontent.com/Rdatatable/data.table/master/ci.R")
cran <- mirror.packages()
ap <- available.packages(cran)
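Once the mirror is populated, packages can be installed from it instead of from the internet; data.table is used here only as an example dependency and the mirror path is an assumption:
# install a dependency from the frozen local CRAN-like mirror (path is hypothetical)
Rscript -e 'install.packages("data.table", repos="file:///opt/cran-mirror")'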
4.2.4 Clone your code into dev/test/prod paths, to be used with .libPaths() when needed
- It always starts with the dev environment, doing a “proof of concept” at the beginning.
- When it is mature enough, we add a test environment, which has to have a complete set of tests.
- Extended integration tests should be placed in a new integration environment, where such tests are run.
- You might also want to run benchmark tests; you can easily create a benchmark environment and upgrade it for benchmarking when required, or on a schedule.
- The integration and benchmark environments should wait for the test environment to succeed.
A sketch of such per-environment library paths is shown below.
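A minimal sketch of such a layout; the paths are chosen only for illustration, and R_LIBS (or .libPaths() inside R) points a given run at the desired environment:
mkdir -p ~/mypkg-envs/dev/library ~/mypkg-envs/test/library ~/mypkg-envs/prod/library
# run the test suite against the test environment's library
R_LIBS=~/mypkg-envs/test/library make test
# the same can be done from inside R with .libPaths("~/mypkg-envs/test/library")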
4.2.5 Before upgrading R or packages, run the tests in the new, upgraded environment to confirm that all tests still pass
- Migration must be a smooth process so that it can be easily automated, reducing the room for human error. There are helper tools for this, sometimes big projects, but the core functionality required to manage environments can easily be expressed in base R, avoiding the use of “package managers” and stripping an extra dependency.
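A sketch of such an upgrade check, reusing the per-environment paths from above (paths are assumptions):
# upgrade packages in a copy of the production library, then re-run the tests against it
cp -r ~/mypkg-envs/prod/library ~/mypkg-envs/upgrade-library
R_LIBS=~/mypkg-envs/upgrade-library Rscript -e 'update.packages(lib.loc=.libPaths()[1], ask=FALSE)'
R_LIBS=~/mypkg-envs/upgrade-library make test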
4.2.6 Production environment
The production environment should be stripped to the minimum needed to do exactly what it has to do, so no development tools in production. Exclude any non-essential files, applications, configuration files, logs, and user profiles. Ideally the production environment should be easily reproducible, and it should be upgraded only after the pending upgrades have been tested.
4.2.7 Automate deployment
You may want to retain ultimate control over deployment to production but automate everything else.
In the production environment you point at the R package repository path that was prepared for deployment to production.
You can automate the deployment of tags, and you can get a long way with just git checkout mybranch.
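For example, a deployment driven by a git tag; the tag name below is just an illustration:
git fetch --tags
git checkout 1.0.0
make deploy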
4.2.8 Delivery
It is important to clearly define what has to be delivered. One structure that can serve as the delivery unit is an R package repository. How production is finally rolled out can be highly customized. There is much you can achieve using bash scripts, grep, cron, git and R. Those are highly standardized tools, but they require a different type of maintenance than automated CI/delivery platforms like GitLab Pipelines. Using just Linux tools makes the whole chain more lightweight. Ideally you should always snapshot the version of the software you depend on, to avoid breaking changes in future versions of a dependency; that includes your CI pipelines. Because GitLab is an open source project this is not a challenging task, but most CI platforms are not open source.
4.3 CI pipelines
Optional, but recommended.
- register free account at https://gitlab.com
- create project repository
- a note on GitLab vs GitHub: there are two very important features where GitLab beats GitHub:
- in GitLab private repositories come for free, so it is easy to have many workspaces
- you can self-host it and reduce the external dependency if needed
Define the CI in YAML format, in a file named .gitlab-ci.yml. A minimal skeleton could look as follows; the job scripts are placeholders to adapt to your own Makefile targets:
build:
  script: make build
test:
  script: make test
deploy:
  script: make deploy
pages:
  script: make build && mkdir -p public && cp mypkg_*.tar.gz public
  artifacts:
    paths: [public]
- push your code to run automated workflow
git checkout gl-ci
git add .gitlab-ci.yml
git commit -m 'defined basic GL CI workflow'
git push origin gl-ci
- check the pipeline status at https://gitlab.com/_user/_project/pipelines
If we are satisfied with the change, we can merge it into the head of our working branch.
git checkout master
git merge gl-ci
git push origin master
When needed, push to the other environment branches:
git checkout test
...
Publishing the CRAN-like repository:
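A sketch of how the pages job could build and publish such a repository under public/, using base R's tools::write_PACKAGES; the paths and the final URL are illustrations:
R CMD build .
mkdir -p public/src/contrib
cp mypkg_*.tar.gz public/src/contrib
Rscript -e 'tools::write_PACKAGES("public/src/contrib", type="source")'
# clients could then install from it, e.g.
# install.packages("mypkg", repos="https://_user.gitlab.io/_project")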