1 Crash course

The course goes through many topics very briefly, just to present a simple example of a Data Science workflow.

1.1 Basics of R language

1.1.1 R language

1.1.1.1 What is R?

  • Programming language and environment for statistical computing
  • Modern implementation of the S language, which first appeared in 1976
  • First released in 1993
  • As statistical software it is difficult to use
  • As a programming language it is easy to use

1.1.1.2 Features

  • Vector as atomic data type
  • Open source, free to use and extend without asking for permission
  • Interactive - no compiler
  • Extensible - thousands of R packages (CRAN, Bioconductor, GitHub, others)
  • Visualisation (graphics, lattice; contributed packages: ggplot2, rgl, others)
  • Native support for missing values (NA)
  • Computing on the language (metaprogramming)
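Native support for missing values, for example, means that NA propagates through computations unless explicitly removed:

```r
x <- c(1, 2, NA, 4)
x + 1                  # NA stays NA in element-wise arithmetic
sum(x)                 # NA, because one element is missing
sum(x, na.rm = TRUE)   # 7, after removing the missing value
is.na(x)               # locate the missing values
```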

1.1.1.3 Limitations

  • By default single threaded
  • By default in-memory processing
  • Lack of security features
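The single-threaded default can be worked around: the parallel package (shipped with every R distribution) runs computations on multiple worker processes. A minimal sketch:

```r
library(parallel)

# start a small cluster of 2 local worker processes
cl <- makeCluster(2)

# apply a function over 1:4 in parallel, collecting a numeric vector
res <- parSapply(cl, 1:4, function(i) i^2)
res   # 1 4 9 16

# always stop the cluster to free the workers
stopCluster(cl)
```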

1.1.1.4 Capabilities of R extensions (packages)

  • Import and export data from/to various file formats and databases
  • Efficient data cleansing and transformation
  • Plotting multidimensional data using multi panel charts or 3D graphs
  • Statistical modeling
  • Signal processing
  • Distributed parallel computing
  • Machine learning
  • Time series data support
  • Spatial data support
  • much more…

1.1.1.5 Install R

1.1.1.6 Start R

  • Windows: C:\Program Files\R\R-3.5.0\bin\x64\Rgui.exe
  • MacOSX: R
  • Linux: R

1.1.2 Basic syntax and data structures

1.1.2.1 Assignment and basic vector examples

We can assign a value to a variable using the <- and = operators.

x <- 1
y = 5
sum(x, y)
x + y
length(x)
z <- c(1, 5)
length(z)
sum(x, z)
x + z     # element wise with recycling
sum(z, z)
z + z     # element wise

1.1.2.2 Atomic data types

# integer
1L
# real (numeric)
1.5
# string (character)
"a"
# categorical (factor)
as.factor("a")
# logical
TRUE
# complex (imaginary numbers)
1i
# raw (binary type)
as.raw(10)

1.1.2.3 Vectors

# integer
x <- c(1, 2, 3, 4, 5)    # technically numeric, not integer
x <- seq(1, 5) 
y <- 6:10
x * y
# numeric
x <- c(1, 1.5, 2, 2.5, 3)
x <- seq(1, 3, by = 0.5)
# logical
lgc <- c(TRUE, FALSE, TRUE)
# character
chr <- letters   # R built-in, same as c("a", "b", ..., "z")

1.1.2.4 Operations on vectors

-x

lgc <- x < 2
!x < 2          # negation
x >= 2

lgc & !lgc      # AND
lgc | !lgc      # OR

chr <-  c("a", "b", "c", "d", "e")
paste(x, chr)   # element wise

1.1.2.5 Subsetting using integer type

x[1]
x[2:4]
x[-(2:4)]
x[2:10]
x[c(1, 2, 3, 5, 4)]

lgc[2:4]
lgc[-5]

chr[1:3]
chr[10]
chr[c(1, 2, 3, 2, 1)]

1.1.2.6 Subsetting using logical type

x > 2
x[x > 2]
x[c(FALSE, TRUE, TRUE, FALSE, FALSE)]
x[c(FALSE, TRUE, TRUE)]   # unexpected result due to recycling
x[x < 2 | x >= 3]

# %in% operator
chr %in% c("d", "e", "f")
chr[chr %in% c("d", "e", "f")]
chr[chr >= "d"]           # OS locale specific!

1.1.2.7 Names and subsetting using character type

x
chr
names(x) <- chr
x
x <- c(a = 1.0, b = 1.5, c = 2.0, d = 2.5, e = 3.0)

x["a"]
x[c("d", "e")]
x[c("d", "f")]

1.1.2.8 Modify elements in vector (sub-assign)

y[1] <- 100
y
y[c(8, 10)] <- c(5, 6)
length(y)
sum(y)
is.na(y)
y[!is.na(y)]
y <- c(y, 7)
y[y > 50] <- NA 
y <- y[!is.na(y)]

1.1.2.9 Matrices and arrays

mx <- matrix(1:25, 5, 5)
mx
mx <- 1:25
dim(mx) <- c(5, 5)
mx
mx[1:2, 1:3]

ar <- 1:27
dim(ar) <- c(3, 3, 3)
ar
str(ar)
ar[1:2, 1:3, 2:3]
ar[1, 1:3, 2:3, drop = FALSE]

1.1.2.10 Lists and data frames

lst <- list(1:5, letters[1:3], c(TRUE, FALSE))
lst
str(lst)
names(lst) <- c("a1", "a2", "a3")
lst$a1

df <- data.frame(c1 = 1:5, c2 = letters[1:5],
                 c3 = c(TRUE, FALSE, TRUE, TRUE, TRUE))
df <- rbind(df, df, df, df, df)
df
str(df)
head(df)

1.1.2.11 Base plot

x <- rnorm(50)
y <- rnorm(50)
plot(x, y)

class(mtcars)
head(mtcars)
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg ~ wt))
title("Regression of MPG on Weight")
detach(mtcars)

1.1.2.12 Getting help

  • Function manuals, use question mark in front of function name: ?sum, ?"["
  • R manuals: R-intro, Manuals
  • R packages vignettes (tutorials)
  • Post question on stackoverflow.com, use R tag, provide reproducible example
  • Read examples in blog posts - R blogs aggregator: r-bloggers.com

1.2 Exploratory Data Analysis

Using base R

1.2.1 Why Exploratory Data Analysis?

  • better understand data
  • detect data issues and bad data
  • find patterns in data
  • iterate exploration process to get better insight

1.2.2 Investigate variables

head(airquality)
str(airquality)
summary(airquality$Temp)
hist(airquality$Temp)

# count observations by month
table(airquality$Month)

# check missing values
table(is.na(airquality$Ozone))
barplot(table(is.na(airquality$Ozone)), main = "Ozone NA")

1.2.3 Investigate variables in batch

vars <- c("Ozone","Solar.R","Wind","Temp")
# use 2x2 plot grid as single plot
par(mfrow = c(2, 2))
# loop over variables and plot histogram
for (var in vars) {
  hist(airquality[[var]], main = var, xlab = "")
}
# restore default plot grid
par(mfrow = c(1, 1))

# count missing values in each column
sapply(airquality, function(x) sum(is.na(x)))

1.2.4 Cleaning bad data

# we assume Ozone sensors have an upper bound at 125 ppb
plot(airquality$Ozone)
abline(h = 125)

# we define 'bad data' for Ozone > 125
airquality[airquality$Ozone > 125 & !is.na(airquality$Ozone), ]

# we define fix for bad data as value 125
airquality[airquality$Ozone > 125 & !is.na(airquality$Ozone), "Ozone"] <- 125

1.2.5 Data transformation

# convert temperature from Fahrenheit to Celsius
FtoC <- function(x) (x - 32) * (5/9)
airquality$Temp <- FtoC(airquality$Temp)

# add month names
month.name
airquality$Month.Name <- month.name[airquality$Month]
# keep month name as factor type to get proper order on plot
airquality$Month.Name <- factor(airquality$Month.Name, levels = month.name)
airquality$Month.Name <- droplevels(airquality$Month.Name)

1.2.6 Data insight: seasonality

# mean solar radiation by month
aggregate(Solar.R ~ Month.Name, data = airquality, FUN = mean)
boxplot(Solar.R ~ Month.Name, data = airquality, xlab = "Month", ylab = "Solar radiation (lang)")

# mean temperature by month
aggregate(Temp ~ Month.Name, data = airquality, FUN = mean)
boxplot(Temp ~ Month.Name, data = airquality, xlab = "Month", ylab = "Temperature (C)")

1.2.7 Data insight: Ozone variable

plot(airquality$Temp, airquality$Ozone)
plot(airquality$Temp, airquality$Ozone, xlab="Temp (C)", ylab="Ozone (ppb)", main="NYC May-Sep '73 air quality")
ozone_temp_lm <- lm(airquality$Ozone ~ airquality$Temp)
summary(ozone_temp_lm)
abline(ozone_temp_lm, col = "red")

# check other variables
pairs(Ozone ~ Temp + Wind + Solar.R, data = airquality)

1.2.8 Data insight: Ozone variable in batch + lm

# add linear model to pairs plot
pointslm <- function(x, y, ...){
  model <- lm(y ~ x)
  points(x, y)
  abline(a = model$coefficients[1], b = model$coefficients[2], col = "red")
}
pairs(Ozone ~ Temp + Wind + Solar.R + Month, data = airquality, lower.panel = pointslm, upper.panel = NULL)

1.2.9 Homework: read about and play with

  • sapply and lapply functions to hide loops and make code cleaner
  • subsetting data frames by integer and logical types
  • plot and pairs functions
  • graphic options like par(mfrow = c(2, 1))
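The first homework item can be sketched right away: lapply always returns a list, while sapply tries to simplify the result to a vector or matrix.

```r
# lapply returns a list, one element per input column
lapply(airquality[, c("Ozone", "Temp")], mean, na.rm = TRUE)

# sapply simplifies the same result to a named numeric vector
sapply(airquality[, c("Ozone", "Temp")], mean, na.rm = TRUE)
```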

1.3 Efficient Exploratory Data Analysis

Using data.table, ggplot2, lattice

1.3.1 Install and load R packages

# install R packages
pkgs <- c("data.table","ggplot2","lattice","bikeshare14")
install.packages(pkgs)

# check version of packages
packageVersion("data.table")
sapply(pkgs, packageVersion, simplify = FALSE)

# load packages
require("data.table")
sapply(pkgs, require, character.only = TRUE)

1.3.2 Investigate dataset

head(batrips)
?batrips
str(batrips)

summary(batrips)
boxplot(batrips$duration/60, log="y", ylab="duration (min) - log scale")

# proceed with batrips as data.frame/data.table objects
DF <- batrips
DT <- as.data.table(batrips)

1.3.3 Cleaning data

# investigate trips longer than 12 hours
# data.frame way
DF[DF$duration > 60*60*12, ]
# data.table way
DT[duration > 60*60*12]

# exclude trips longer than 12h
DF <- DF[DF$duration <= 60*60*12, ]
DT <- DT[duration <= 60*60*12]

1.3.4 data.table syntax
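The general form of a data.table query is DT[i, j, by]: subset rows with i, compute j, grouped by by. A minimal sketch on toy data (assuming the data.table package is installed):

```r
library(data.table)

dt <- data.table(g = c("a", "a", "b"), v = c(1, 2, 3))

# i: subset rows
dt[v > 1]
# j: compute on columns
dt[, sum(v)]
# by: compute j per group
dt[, .(total = sum(v)), by = g]
```

The examples in the following sections are all instances of this i/j/by pattern.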

1.3.5 Trips by subscription_type

# count trips
table(DF$subscription_type)
aggregate(duration ~ subscription_type, data=DF, FUN=length)
DT[, length(duration), subscription_type]
DT[, .N, subscription_type]
# mean trips duration
aggregate(duration ~ subscription_type, data=DF, FUN=mean)
DT[, mean(duration), subscription_type]

# count, mean and sum
DT[, .(count=.N, mean_s=mean(duration), sum_d=sum(duration)/(60*60*24)), subscription_type]

1.3.6 Plotting using ggplot2 and lattice

The ggplot2 package implements “The Grammar of Graphics” designed by Leland Wilkinson. ggplot2 uses the plus sign + to add new elements to an existing plot. It can also store a plot in a variable and render the graphic when the print method is invoked on that variable.
The lattice package provides powerful multi-panel “trellis” plots using the formula interface commonly used in R: y ~ x | z
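Both packages return plot objects that render only when printed. A minimal lattice sketch of the y ~ x | z formula interface, using the built-in mtcars data:

```r
library(lattice)

# one panel of mpg vs wt per number of cylinders
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            main = "mpg vs weight by cylinders")

# nothing is drawn yet; the plot renders when printed
print(p)
```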

1.3.7 Trips by subscription_type and hour of the day

# count by hour and subscription type
histogram(duration ~ hour(start_date) | subscription_type, data=DT)
# median duration in minutes
ans <- DT[, .(median_m=median(duration)/60), .(subscription_type, start_hour=hour(start_date))]

xyplot(median_m ~ start_hour | subscription_type, data=ans, main="median trip duration")

ggplot(ans, aes(start_hour, median_m)) + geom_point() + facet_wrap(~ subscription_type)

1.3.8 Common trip routes

ans <- DT[, .N, .(start_station, end_station)]
# order by count in descending order, take top 10
ans[order(-N)][1:10]
# by subscription_type, top 5
ans <- DT[, .N, .(subscription_type, start_station, end_station)]
ans[order(-N), head(.SD, 5), subscription_type]
# by weekday, top 2
DT[, weekday := weekdays(start_date)]
ans <- DT[, .N, .(weekday, start_station, end_station)]
ans[order(-N), head(.SD, 2), .(weekday)]

1.3.9 Trips by weekday and subscription_type

# set order for weekday factor
days_in_week <- DT[order(wday(start_date)), unique(weekday)]
DT[, weekday := factor(weekday, levels=days_in_week)]
histogram(duration ~ weekday | subscription_type, data=DT, main="trip count")

# median duration
ans <- DT[, .(.N, duration=median(duration)/60), .(subscription_type, weekday)]
barchart(duration ~ weekday | subscription_type, data=ans, ylab="duration (min)")

1.3.10 Import US holidays data by pasting into fread

HD <- fread("
date,holiday
2014-01-01,New Year Day
2014-01-20,Martin Luther King Jr. Day
2014-02-17,Presidents Day (Washingtons Birthday)
2014-05-26,Memorial Day
2014-07-04,Independence Day
2014-09-01,Labor Day
2014-10-13,Columbus Day
2014-11-11,Veterans Day
2014-11-27,Thanksgiving Day
2014-12-25,Christmas Day
")

1.3.11 Investigate holidays impact - merge datasets

head(HD)
str(HD)
str(DT)
DT[, date := as.Date(start_date, tz="America/Los_Angeles")]
HD[, date := as.Date(date, tz="America/Los_Angeles")]
# add holiday column from HD to DT
DT[HD, holiday := i.holiday, on = "date"]
head(DT)
# feature engineering
DT[, holiday := factor(ifelse(!is.na(holiday), "holiday", "non-holiday"))]

1.3.12 Investigate holidays impact

# for non-english weekdays set locale
# Sys.setlocale("LC_TIME", "English")
DT[, workday := factor(ifelse(weekday %in% c("Saturday","Sunday"), "weekend", "workday"))]
ans <- DT[, .(N_trips=.N, N_days=uniqueN(date)), .(subscription_type, workday, holiday)]
ans[, trips_day := N_trips/N_days]
ans[order(subscription_type, workday, holiday)]
barchart(~ trips_day | subscription_type, groups=holiday, data=ans, auto.key=TRUE, subset=workday=="workday", origin=0, scales=list(x="free"), main="mean trips per workday (Mon-Fri)")

1.3.13 Ultimate goal of EDA: feature engineering

Feature engineering allows a data scientist to prepare data so that it can be better processed by machine learning algorithms, in order to build better models.
By adding “holidays” to our dataset we allow machine learning algorithms to predict our measure more precisely - “trips per day” in the last example. Not having the “holiday” flag in our dataset is not just a matter of being unable to use holiday information to predict trip counts. Holidays not flagged as holidays will actually generate “noise” due to higher variation. This happens because automated algorithms will cluster dates into Mon-Fri or Sat-Sun groups, so holidays will be outliers within those clusters, reducing model precision.
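The effect described above can be sketched on toy data: flagging the holiday separates the outlying day and lowers the variation within the remaining group (the dates and counts below are made up for illustration):

```r
# hypothetical daily trip counts over twelve days; 2014-07-04 is a holiday
trips <- data.frame(
  date  = as.Date("2014-06-30") + 0:11,
  count = c(100, 98, 102, 99, 10, 101, 97, 103, 100, 99, 98, 102)
)
trips$holiday <- trips$date == as.Date("2014-07-04")   # Independence Day

# variance of counts with and without the holiday flagged
var(trips$count)                    # inflated by the holiday outlier
var(trips$count[!trips$holiday])    # much lower once the holiday is separated
```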

1.3.14 Homework: read about and play with

1.4 Machine Learning

Using H2O

1.4.1 What is H2O?

  • machine learning and predictive analytics platform
  • open source
  • free to use
  • distributed and parallel
  • in-memory
  • fast and scalable
  • easy to productionize
  • core written in Java
  • interfaces from webUI, REST API, R, Python, Scala, Hadoop, Spark

1.4.2 Install and start H2O

SystemRequirements: Java (>= 1.7)

  • OpenJDK: openjdk.java.net
  • Oracle Java SE: www.oracle.com/technetwork/java/javase

# install h2o
install.packages("h2o")

# load h2o
library(h2o)

# initialize h2o instance
h2o.init()

1.4.3 seeds data set: wheat kernels properties

seeds data set: measurements of geometrical properties of kernels belonging to three different varieties of wheat.

# import dataset using base R
df <- read.csv("seeds_dataset.txt", sep="\t", header=FALSE)
# import dataset using data.table
dt <- data.table::fread("seeds_dataset.txt")
# import dataset using H2O
hf <- h2o.uploadFile("seeds_dataset.txt")

1.4.4 base R k-means

head(df)
# exclude actual cluster info from dataset
dat <- df[, paste0("V", 1:7)]
# run k-means using 3 clusters
km1 <- kmeans(dat, centers = 3)
# preview model info
km1
# plot all variable pairs
plot(dat, col = km1$cluster + 1, 
     main = "base R k-means",
     pch = 20, cex = 2)

1.4.5 H2O k-means

hf
# run k-means on C1-C7 variables, using 3 clusters
km2 <- h2o.kmeans(hf, x = paste0("C", 1:7), k = 3)
# preview model info
km2
# run prediction, exclude actual cluster info
p <- h2o.predict(km2, hf[, -8])
# plot all variable pairs
plot(as.data.frame(hf[, -8]), col = as.vector(p) + 2, 
     main = "H2O k-means",
     pch = 20, cex = 2)

1.4.6 k-means: unify cluster labels

# inspect labels
df[, "V8"]   # actuals
km1$cluster  # base R
as.vector(p) # h2o
# decode base R cluster id
base_r_cluster_id <- km1$cluster
base_r_cluster <- ifelse(base_r_cluster_id==3, 1, ifelse(base_r_cluster_id==1, 2, 3))
# decode H2O cluster id
h2o_cluster_id <- as.vector(p)
h2o_cluster <- ifelse(h2o_cluster_id==2, 1, ifelse(h2o_cluster_id==0, 2, 3))

1.4.7 k-means metric: root mean square error

# root mean square error
rmse <- function(predicted, actuals) {
  sqrt( mean( (predicted-actuals)^2, na.rm=TRUE ) )
}

# base R k-means rmse
rmse(base_r_cluster, df[, "V8"])

# H2O k-means rmse
rmse(h2o_cluster, as.vector(hf[, "C8"]))
h2o.make_metrics(as.h2o(h2o_cluster), hf[, "C8"])

1.4.8 k-means metric: confusion matrix

# confusion matrix
table(base_r_cluster, df[, "V8"])
table(h2o_cluster, as.vector(hf[, "C8"]))

# correct classification ratio
base_r_correct <- sum(base_r_cluster == df[, "V8"])
h2o_correct <- sum(h2o_cluster == as.vector(hf[, "C8"]))
sprintf("base R k-means correct ratio: %.2f%%",
        base_r_correct * 100 / nrow(df))
sprintf("H2O k-means correct ratio: %.2f%%",
        h2o_correct * 100 / nrow(df))

1.4.9 cluster dendrogram

# calculate euclidean distance matrix
d <- dist(dat, method = "euclidean")

# hierarchical cluster analysis
hc <- hclust(d, method = "ward.D")

# plot hclust object 
plot(hc)

# draw rectangles
rect.hclust(hc, k = 3, border = "red")

1.4.10 bivariate cluster plot

library(cluster)
# combine two plots horizontally
par(mfrow = c(1,2))
# plot base R k-means clusters
clusplot(dat, km1$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, main="base R k-means")
# plot H2O k-means clusters
clusplot(dat, as.vector(p)+1, color=TRUE, shade=TRUE, labels=2, lines=0, main="H2O k-means")
# reset plot option to default
par(mfrow = c(1,1))

1.4.11 k-means summary

K-means is a clustering algorithm. It measures the distance between observations in a dataset and groups them into partitions (clusters). It is commonly applied in various industries:

  • biology, computational biology and bioinformatics
  • medicine
  • business and marketing
  • world wide web
  • computer science
  • social science
  • others

more on wikipedia: Cluster analysis

1.4.12 H2O algorithms

Supervised Learning:

  • Generalized Linear Modeling (GLM)
  • Gradient Boosting Machine (GBM)
  • Deep Learning
  • Distributed Random Forest
  • Naive Bayes
  • Stacked Ensembles

Unsupervised Learning:

  • Generalized Low Rank Models (GLRM)
  • K-Means Clustering
  • Principal Components Analysis (PCA)

more on docs.h2o.ai

1.4.13 H2O webUI: Flow

Once you start an H2O instance, for example using the R command h2o.init(), you can open a web browser and point it to the localhost:54321 address to access the H2O webUI.
H2O Flow is a graphical user interface. Users can import data and perform basic data science tasks from a web browser. Flow is designed to be used as a notebook, like Jupyter Notebook (IPython) and similar tools, so no programming skills are required to use it. Multiple interfaces can be used at the same time, for example loading and pre-processing data using R and the h2o package, then performing statistical modeling in a web browser with H2O Flow. All interfaces share the same H2O instance.

1.4.14 Homework