1 Crash course

The course goes through many topics very briefly, just to present a simple example of a Data Science workflow.

1.1 Basics of R language

1.1.1 R language

1.1.1.1 What is R?

  • Programming language and environment for statistical computing
  • Modern implementation of the S language, which first appeared in 1976
  • First released in 1993
  • As statistical software it is difficult to use
  • As a programming language it is easy to use

1.1.1.2 Features

  • Vector as atomic data type
  • Open source, free to use and extend without asking for permission
  • Interactive - no compiler
  • Extensible - thousands of R packages (CRAN, Bioconductor, GitHub, others)
  • Visualisation (graphics, lattice; contributed packages: ggplot2, rgl, others)
  • Native support for missing values (NA)
  • Computing on the language (metaprogramming)
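Native support for missing values, for example, means that NA propagates through computations unless explicitly removed:

```r
x <- c(1, 2, NA, 4)
x + 1                  # NA stays NA in element-wise arithmetic
sum(x)                 # NA, because one element is missing
sum(x, na.rm = TRUE)   # 7, after removing the missing value
is.na(x)               # locate the missing values
```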

1.1.1.3 Limitations

  • By default single threaded
  • By default in-memory processing
  • Lack of security features
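The single-threaded default can be worked around: the parallel package (shipped with every R distribution) runs computations on multiple worker processes. A minimal sketch:

```r
library(parallel)

# start a small cluster of 2 local worker processes
cl <- makeCluster(2)

# apply a function over 1:4 in parallel, collecting a numeric vector
res <- parSapply(cl, 1:4, function(i) i^2)
res   # 1 4 9 16

# always stop the cluster to free the workers
stopCluster(cl)
```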

1.1.1.4 Capabilities of R extensions (packages)

  • Import and export data from/to various file formats and databases
  • Efficient data cleansing and transformation
  • Plotting multidimensional data using multi panel charts or 3D graphs
  • Statistical modeling
  • Signal processing
  • Distributed parallel computing
  • Machine learning
  • Time series data support
  • Spatial data support
  • much more…

1.1.1.5 Install R

1.1.1.6 Start R

  • Windows: C:\Program Files\R\R-3.5.0\bin\x64\Rgui.exe
  • MacOSX: R
  • Linux: R

1.1.2 Basic syntax and data structures

1.1.2.1 Assignment and basic vector examples

We can assign a value to a variable using the <- and = operators.

x <- 1
y = 5
sum(x, y)
x + y
length(x)
z <- c(1, 5)
length(z)
sum(x, z)
x + z     # element wise with recycling
sum(z, z)
z + z     # element wise

1.1.2.2 Atomic data types

# integer
1L
# real (numeric)
1.5
# string (character)
"a"
# categorical (factor)
as.factor("a")
# logical
TRUE
# complex (imaginary numbers)
1i
# raw (binary type)
as.raw(10)

1.1.2.3 Vectors

# integer
x <- c(1, 2, 3, 4, 5)    # technically numeric, not integer
x <- seq(1, 5) 
y <- 6:10
x * y
# numeric
x <- c(1, 1.5, 2, 2.5, 3)
x <- seq(1, 3, by = 0.5)
# logical
lgc <- c(TRUE, FALSE, TRUE)
# character
chr <- letters   # R built-in, same as c("a", "b", ..., "z")

1.1.2.4 Operations on vectors

-x

lgc <- x < 2
!x < 2          # negation
x >= 2

lgc & !lgc      # AND
lgc | !lgc      # OR

chr <-  c("a", "b", "c", "d", "e")
paste(x, chr)   # element wise

1.1.2.5 Subsetting using integer type

x[1]
x[2:4]
x[-(2:4)]
x[2:10]
x[c(1, 2, 3, 5, 4)]

lgc[2:4]
lgc[-5]

chr[1:3]
chr[10]
chr[c(1, 2, 3, 2, 1)]

1.1.2.6 Subsetting using logical type

x > 2
x[x > 2]
x[c(FALSE, TRUE, TRUE, FALSE, FALSE)]
x[c(FALSE, TRUE, TRUE)]   # unexpected result due to recycling
x[x < 2 | x >= 3]

# %in% operator
chr %in% c("d", "e", "f")
chr[chr %in% c("d", "e", "f")]
chr[chr >= "d"]           # OS locale specific!

1.1.2.7 Names and subsetting using character type

x
chr
names(x) <- chr
x
x <- c(a = 1.0, b = 1.5, c = 2.0, d = 2.5, e = 3.0)

x["a"]
x[c("d", "e")]
x[c("d", "f")]

1.1.2.8 Modify elements in vector (sub-assign)

y[1] <- 100
y
y[c(8, 10)] <- c(5, 6)
length(y)
sum(y)
is.na(y)
y[!is.na(y)]
y <- c(y, 7)
y[y > 50] <- NA 
y <- y[!is.na(y)]

1.1.2.9 Matrices and arrays

mx <- matrix(1:25, 5, 5)
mx
mx <- 1:25
dim(mx) <- c(5, 5)
mx
mx[1:2, 1:3]

ar <- 1:27
dim(ar) <- c(3, 3, 3)
ar
str(ar)
ar[1:2, 1:3, 2:3]
ar[1, 1:3, 2:3, drop = FALSE]

1.1.2.10 Lists and data frames

lst <- list(1:5, letters[1:3], c(TRUE, FALSE))
lst
str(lst)
names(lst) <- c("a1", "a2", "a3")
lst$a1

df <- data.frame(c1 = 1:5, c2 = letters[1:5],
                 c3 = c(TRUE, FALSE, TRUE, TRUE, TRUE))
df <- rbind(df, df, df, df, df)
df
str(df)
head(df)

1.1.2.11 Base plot

x <- rnorm(50)
y <- rnorm(50)
plot(x, y)

class(mtcars)
head(mtcars)
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg ~ wt))
title("Regression of MPG on Weight")
detach(mtcars)

1.1.2.12 Getting help

  • Function manuals, use question mark in front of function name: ?sum, ?"["
  • R manuals: R-intro, Manuals
  • R packages vignettes (tutorials)
  • Post question on stackoverflow.com, use R tag, provide reproducible example
  • Read examples in blog posts - R blogs aggregator: r-bloggers.com

1.2 Exploratory Data Analysis

Using base R

1.2.1 Why Exploratory Data Analysis?

  • better understand data
  • detect data issues and bad data
  • find patterns in data
  • iterate exploration process to get better insight

1.2.2 Investigate variables

head(airquality)
str(airquality)
summary(airquality$Temp)
hist(airquality$Temp)

# count observations by month
table(airquality$Month)

# check missing values
table(is.na(airquality$Ozone))
barplot(table(is.na(airquality$Ozone)), main = "Ozone NA")

1.2.3 Investigate variables in batch

vars <- c("Ozone","Solar.R","Wind","Temp")
# use 2x2 plot grid as single plot
par(mfrow = c(2, 2))
# loop over variables and plot histogram
for (var in vars) {
  hist(airquality[[var]], main = var, xlab = "")
}
# restore default plot grid
par(mfrow = c(1, 1))

# count missing values in each column
sapply(airquality, function(x) sum(is.na(x)))

1.2.4 Cleaning bad data

# we assume Ozone sensors have an upper bound at 125 ppb
plot(airquality$Ozone)
abline(h = 125)

# we define 'bad data' for Ozone > 125
airquality[airquality$Ozone > 125 & !is.na(airquality$Ozone), ]

# we define fix for bad data as value 125
airquality[airquality$Ozone > 125 & !is.na(airquality$Ozone), "Ozone"] <- 125

1.2.5 Data transformation

# convert temperature from Fahrenheit to Celsius
FtoC <- function(x) (x - 32) * (5/9)
airquality$Temp <- FtoC(airquality$Temp)

# add month names
month.name
airquality$Month.Name <- month.name[airquality$Month]
# keep month name as factor type to get proper order on plot
airquality$Month.Name <- factor(airquality$Month.Name, levels = month.name)
airquality$Month.Name <- droplevels(airquality$Month.Name)

1.2.6 Data insight: seasonality

# mean solar radiation by month
aggregate(Solar.R ~ Month.Name, data = airquality, FUN = mean)
boxplot(Solar.R ~ Month.Name, data = airquality, xlab = "Month", ylab = "Solar radiation (lang)")

# mean temperature by month
aggregate(Temp ~ Month.Name, data = airquality, FUN = mean)
boxplot(Temp ~ Month.Name, data = airquality, xlab = "Month", ylab = "Temperature (C)")

1.2.7 Data insight: Ozone variable

plot(airquality$Temp, airquality$Ozone)
plot(airquality$Temp, airquality$Ozone, xlab="Temp (C)", ylab="Ozone (ppb)", main="NYC May-Sep '73 air quality")
ozone_temp_lm <- lm(airquality$Ozone ~ airquality$Temp)
summary(ozone_temp_lm)
abline(ozone_temp_lm, col = "red")

# check other variables
pairs(Ozone ~ Temp + Wind + Solar.R, data = airquality)

1.2.8 Data insight: Ozone variable in batch + lm

# add linear model to pairs plot
pointslm <- function(x, y, ...){
  model <- lm(y ~ x)
  points(x, y)
  abline(a = model$coefficients[1], b = model$coefficients[2], col = "red")
}
pairs(Ozone ~ Temp + Wind + Solar.R + Month, data = airquality, lower.panel = pointslm, upper.panel = NULL)

1.2.9 Homework: read about and play with

  • sapply and lapply functions to hide loops and make code cleaner
  • subsetting data frames by integer and logical types
  • plot and pairs functions
  • graphic options like par(mfrow = c(2, 1))
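The first homework item can be sketched right away: lapply always returns a list, while sapply tries to simplify the result to a vector or matrix.

```r
# lapply returns a list, one element per input column
lapply(airquality[, c("Ozone", "Temp")], mean, na.rm = TRUE)

# sapply simplifies the same result to a named numeric vector
sapply(airquality[, c("Ozone", "Temp")], mean, na.rm = TRUE)
```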

1.3 Efficient Exploratory Data Analysis

Using data.table, ggplot2, lattice

1.3.1 Install and load R packages

# install R packages
pkgs <- c("data.table","ggplot2","lattice","bikeshare14")
install.packages(pkgs)

# check version of packages
packageVersion("data.table")
sapply(pkgs, packageVersion, simplify = FALSE)

# load packages
require("data.table")
sapply(pkgs, require, character.only = TRUE)

1.3.2 Investigate dataset

head(batrips)
?batrips
str(batrips)

summary(batrips)
boxplot(batrips$duration/60, log="y", ylab="duration (min) - log scale")

# proceed with batrips as data.frame/data.table objects
DF <- batrips
DT <- as.data.table(batrips)

1.3.3 Cleaning data

# investigate trips longer than 12 hours
# data.frame way
DF[DF$duration > 60*60*12, ]
# data.table way
DT[duration > 60*60*12]

# exclude trips longer than 12h
DF <- DF[DF$duration <= 60*60*12, ]
DT <- DT[duration <= 60*60*12]

1.3.4 data.table syntax
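The general form of a data.table query is DT[i, j, by]: subset rows with i, compute j, grouped by by. A minimal sketch on toy data (assuming the data.table package is installed):

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), v = c(1, 2, 3))

# i: subset rows
dt[v > 1]
# j: compute on columns
dt[, sum(v)]
# by: compute j per group
dt[, .(total = sum(v)), by = g]
```

The examples in the following sections are all instances of this i/j/by pattern.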

1.3.5 Trips by subscription_type

# count trips
table(DF$subscription_type)
aggregate(duration ~ subscription_type, data=DF, FUN=length)
DT[, length(duration), subscription_type]
DT[, .N, subscription_type]
# mean trips duration
aggregate(duration ~ subscription_type, data=DF, FUN=mean)
DT[, mean(duration), subscription_type]

# count, mean and sum
DT[, .(count=.N, mean_s=mean(duration), sum_d=sum(duration)/(60*60*24)), subscription_type]

1.3.6 Plotting using ggplot2 and lattice

The ggplot2 package implements “The Grammar of Graphics” designed by Leland Wilkinson. ggplot2 uses the plus sign + to add new elements to an existing plot. It can also store a plot in a variable and render the graphic when the print method is invoked on that variable.
The lattice package provides powerful multi-panel “trellis” plots using the formula interface commonly used in R: y ~ x | z
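Both packages return plot objects that render only when printed. A minimal lattice sketch of the y ~ x | z formula interface, using the built-in mtcars data:

```r
library(lattice)

# one panel of mpg vs wt per number of cylinders
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            main = "mpg vs weight by cylinders")

# nothing is drawn yet; the plot renders when printed
print(p)
```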

1.3.7 Trips by subscription_type and hour of the day

# count by hour and subscription type
histogram(duration ~ hour(start_date) | subscription_type, data=DT)
# median duration in minutes
ans <- DT[, .(median_m=median(duration)/60), .(subscription_type, start_hour=hour(start_date))]

xyplot(median_m ~ start_hour | subscription_type, data=ans, main="median trip duration")

ggplot(ans, aes(start_hour, median_m)) + geom_point() + facet_wrap(~ subscription_type)

1.3.8 Common trip routes

ans <- DT[, .N, .(start_station, end_station)]
# order by count in descending order, take top 10
ans[order(-N)][1:10]
# by subscription_type, top 5
ans <- DT[, .N, .(subscription_type, start_station, end_station)]
ans[order(-N), head(.SD, 5), subscription_type]
# by weekday, top 2
DT[, weekday := weekdays(start_date)]
ans <- DT[, .N, .(weekday, start_station, end_station)]
ans[order(-N), head(.SD, 2), .(weekday)]

1.3.9 Trips by weekday and subscription_type

# set order for weekday factor
days_in_week <- DT[order(wday(start_date)), unique(weekday)]
DT[, weekday := factor(weekday, levels=days_in_week)]
histogram(duration ~ weekday | subscription_type, data=DT, main="trip count")

# median duration
ans <- DT[, .(.N, duration=median(duration)/60), .(subscription_type, weekday)]
barchart(duration ~ weekday | subscription_type, data=ans, ylab="duration (min)")

1.3.10 Import US holidays data by pasting into fread

HD <- fread("
date,holiday
2014-01-01,New Year Day
2014-01-20,Martin Luther King Jr. Day
2014-02-17,Presidents Day (Washingtons Birthday)
2014-05-26,Memorial Day
2014-07-04,Independence Day
2014-09-01,Labor Day
2014-10-13,Columbus Day
2014-11-11,Veterans Day
2014-11-27,Thanksgiving Day
2014-12-25,Christmas Day
")

1.3.11 Investigate holidays impact - merge datasets

head(HD)
str(HD)
str(DT)
DT[, date := as.Date(start_date, tz="America/Los_Angeles")]
HD[, date := as.Date(date, tz="America/Los_Angeles")]
# add holiday column from HD to DT
DT[HD, holiday := i.holiday, on = "date"]
head(DT)
# feature engineering
DT[, holiday := factor(ifelse(!is.na(holiday), "holiday", "non-holiday"))]

1.3.12 Investigate holidays impact

# for non-english weekdays set locale
# Sys.setlocale("LC_TIME", "English")
DT[, workday := factor(ifelse(weekday %in% c("Saturday","Sunday"), "weekend", "workday"))]
ans <- DT[, .(N_trips=.N, N_days=uniqueN(date)), .(subscription_type, workday, holiday)]
ans[, trips_day := N_trips/N_days]
ans[order(subscription_type, workday, holiday)]
barchart(~ trips_day | subscription_type, groups=holiday, data=ans, auto.key=TRUE, subset=workday=="workday", origin=0, scales=list(x="free"), main="mean trips per workday (Mon-Fri)")

1.3.13 Ultimate goal of EDA: feature engineering

Feature engineering allows a data scientist to prepare data so that it can be better processed by machine learning algorithms, in order to build better models.
By adding “holidays” to our dataset we allow machine learning algorithms to predict our measure more precisely - “trips per day” in the last example. Not having the “holiday” flag in our dataset is not just a matter of being unable to use holiday information to predict trip counts. Holidays not flagged as holidays will actually generate “noise” due to higher variation. This happens because automated algorithms will cluster dates into Mon-Fri or Sat-Sun groups, so holidays will be outliers within those clusters, reducing model precision.
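The effect described above can be sketched on toy data: flagging the holiday separates the outlying day and lowers the variation within the remaining group (the dates and counts below are made up for illustration):

```r
# hypothetical daily trip counts over twelve days; 2014-07-04 is a holiday
trips <- data.frame(
  date  = as.Date("2014-06-30") + 0:11,
  count = c(100, 98, 102, 99, 10, 101, 97, 103, 100, 99, 98, 102)
)
trips$holiday <- trips$date == as.Date("2014-07-04")   # Independence Day

# variance of counts with and without the holiday flagged
var(trips$count)                    # inflated by the holiday outlier
var(trips$count[!trips$holiday])    # much lower once the holiday is separated
```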

1.3.14 Homework: read about and play with

1.4 Machine Learning

Using H2O

1.4.1 What is H2O?

  • machine learning and predictive analytics platform
  • open source
  • free to use
  • distributed and parallel
  • in-memory
  • fast and scalable
  • easy to productionize
  • core written in Java
  • interfaces from webUI, REST API, R, Python, Scala, Hadoop, Spark

1.4.2 Install and start H2O

SystemRequirements: Java (>= 1.7)

  • OpenJDK: openjdk.java.net
  • Oracle Java SE: www.oracle.com/technetwork/java/javase

# install h2o
install.packages("h2o")

# load h2o
library(h2o)

# initialize h2o instance
h2o.init()

1.4.3 seeds data set: wheat kernels properties

seeds data set: measurements of geometrical properties of kernels belonging to three different varieties of wheat.

# import dataset using base R
df <- read.csv("seeds_dataset.txt", sep="\t", header=FALSE)
# import dataset using data.table
dt <- data.table::fread("seeds_dataset.txt")
# import dataset using H2O
hf <- h2o.uploadFile("seeds_dataset.txt")

1.4.4 base R k-means

head(df)
# exclude actual cluster info from dataset
dat <- df[, paste0("V", 1:7)]
# run k-means using 3 clusters
km1 <- kmeans(dat, centers = 3)
# preview model info
km1
# plot all variable pairs
plot(dat, col = km1$cluster + 1, 
     main = "base R k-means",
     pch = 20, cex = 2)

1.4.5 H2O k-means

hf
# run k-means on C1-C7 variables, using 3 clusters
km2 <- h2o.kmeans(hf, x = paste0("C", 1:7), k = 3)
# preview model info
km2
# run prediction, exclude actual cluster info
p <- h2o.predict(km2, hf[, -8])
# plot all variable pairs
plot(as.data.frame(hf[, -8]), col = as.vector(p) + 2, 
     main = "H2O k-means",
     pch = 20, cex = 2)

1.4.6 k-means: unify cluster labels

# inspect labels
df[, "V8"]   # actuals
km1$cluster  # base R
as.vector(p) # h2o
# decode base R cluster id
base_r_cluster_id <- km1$cluster
base_r_cluster <- ifelse(base_r_cluster_id==3, 1, ifelse(base_r_cluster_id==1, 2, 3))
# decode H2O cluster id
h2o_cluster_id <- as.vector(p)
h2o_cluster <- ifelse(h2o_cluster_id==2, 1, ifelse(h2o_cluster_id==0, 2, 3))

1.4.7 k-means metric: root mean square error

# root mean square error
rmse <- function(predicted, actuals) {
  sqrt( mean( (predicted-actuals)^2, na.rm=TRUE ) )
}

# base R k-means rmse
rmse(base_r_cluster, df[, "V8"])

# H2O k-means rmse
rmse(h2o_cluster, as.vector(hf[, "C8"]))
h2o.make_metrics(as.h2o(h2o_cluster), hf[, "C8"])

1.4.8 k-means metric: confusion matrix

# confusion matrix
table(base_r_cluster, df[, "V8"])
table(h2o_cluster, as.vector(hf[, "C8"]))

# correct classification ratio
base_r_correct <- sum(base_r_cluster == df[, "V8"])
h2o_correct <- sum(h2o_cluster == as.vector(hf[, "C8"]))
sprintf("base R k-means correct ratio: %.2f%%",
        base_r_correct * 100 / nrow(df))
sprintf("H2O k-means correct ratio: %.2f%%",
        h2o_correct * 100 / nrow(df))

1.4.9 cluster dendrogram

# calculate euclidean distance matrix
d <- dist(dat, method = "euclidean")

# hierarchical cluster analysis
hc <- hclust(d, method = "ward.D")

# plot hclust object 
plot(hc)

# draw rectangles
rect.hclust(hc, k = 3, border = "red")

1.4.10 bivariate cluster plot

library(cluster)
# combine two plots horizontally
par(mfrow = c(1,2))
# plot base R k-means clusters
clusplot(dat, km1$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, main="base R k-means")
# plot H2O k-means clusters
clusplot(dat, as.vector(p)+1, color=TRUE, shade=TRUE, labels=2, lines=0, main="H2O k-means")
# reset plot option to default
par(mfrow = c(1,1))

1.4.11 k-means summary

K-means is a clustering algorithm. It measures the distance between observations in a dataset and groups them into partitions (clusters). It is commonly applied in various industries:

  • biology, computational biology and bioinformatics
  • medicine
  • business and marketing
  • world wide web
  • computer science
  • social science
  • others

more on wikipedia: Cluster analysis

1.4.12 H2O algorithms

Supervised Learning:

  • Generalized Linear Modeling (GLM)
  • Gradient Boosting Machine (GBM)
  • Deep Learning
  • Distributed Random Forest
  • Naive Bayes
  • Stacked Ensembles

Unsupervised Learning:

  • Generalized Low Rank Models (GLRM)
  • K-Means Clustering
  • Principal Components Analysis (PCA)

more on docs.h2o.ai

1.4.13 H2O webUI: Flow

Once you start an H2O instance, for example using the R command h2o.init(), you can open a web browser and point it to the localhost:54321 address to access the H2O webUI.
H2O Flow is a graphical user interface. Users can import data and perform basic data science tasks from a web browser. Flow is designed to be used as a notebook, like Jupyter Notebook (IPython) and similar tools, so no programming skills are required to use it. Multiple interfaces can be used at the same time, for example loading and pre-processing data using R and the h2o package, then performing statistical modeling in a web browser with H2O Flow. All interfaces share the same H2O instance.

1.4.14 Homework