1 Crash course
The course goes through many topics very briefly, just to present a simple example of a Data Science workflow.
1.1 Basics of R language
1.1.1 R language
1.1.1.1 What is R?
- Programming language and environment for statistical computing
- Modern implementation of the S language, which first appeared in 1976
- First released in 1993
- As statistical software it is difficult to use
- As a programming language it is easy to use
1.1.1.2 Features
- Vector as atomic data type
- Open source, free to use and extend without asking for permission
- Interactive - no compiler
- Extensible - thousands of R packages (CRAN, Bioconductor, GitHub, others)
- Visualisation (graphics, lattice; contributed packages: ggplot2, rgl, others)
- Native support for missing values (NA)
- Computing on the language (metaprogramming)
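A minimal sketch of the last three features, assuming a plain R session:
# vectors are the basic data type: arithmetic is vectorised
c(1, 2, 3) * 2
# native missing values propagate unless explicitly removed
mean(c(1, 2, NA))
mean(c(1, 2, NA), na.rm = TRUE)
# computing on the language: capture an expression, then evaluate it
e <- quote(1 + 2)
e
eval(e)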
1.1.1.3 Limitations
- By default single threaded
- By default in-memory processing
- Lack of security features
1.1.1.4 R extensions (packages) capabilities
- Import and export data from/to various file formats and databases
- Efficient data cleansing and transformation
- Plotting multidimensional data using multi panel charts or 3D graphs
- Statistical modeling
- Signal processing
- Distributed parallel computing
- Machine learning
- Time series data support
- Spatial data support
- much more…
1.1.1.6 Start R
- Windows: C:\Program Files\R\R-3.5.0\bin\x64\Rgui.exe
- macOS: R
- Linux: R
1.1.2 Basic syntax and data structures
1.1.2.1 Assignment and basic vector examples
We can assign a value to a variable using the <- and = operators.
x <- 1
y = 5
sum(x, y)
x + y
length(x)
z <- c(1, 5)
length(z)
sum(x, z)
x + z # element wise with recycling
sum(z, z)
z + z # element wise
1.1.2.2 Atomic data types
# integer
1L
# real (numeric)
1.5
# string (character)
"a"
# categorical (factor)
as.factor("a")
# logical
TRUE
# complex (imaginary numbers)
1i
# raw (binary type)
as.raw(10)
1.1.2.3 Vectors
# integer
x <- c(1, 2, 3, 4, 5) # technically it is numeric, not integer
x <- seq(1, 5)
y <- 6:10
x * y
# numeric
x <- c(1, 1.5, 2, 2.5, 3)
x <- seq(1, 3, by = 0.5)
# logical
lgc <- c(TRUE, FALSE, TRUE)
# character
chr <- letters # R built-in, same as c("a", "b", ..., "z")
1.1.2.4 Operations on vectors
-x
lgc <- x < 2
!x < 2 # negation
x >= 2
lgc & !lgc # AND
lgc | !lgc # OR
chr <- c("a", "b", "c", "d", "e")
paste(x, chr) # element wise
1.1.2.5 Subsetting using integer type
x[1] # first element
x[2:4] # elements 2 to 4
x[-(2:4)] # negative indices drop elements
x[2:10] # indices beyond the length give NA
x[c(1, 2, 3, 5, 4)] # any order, repeats allowed
lgc[2:4]
lgc[-5]
chr[1:3]
chr[10] # chr has 5 elements, so NA
chr[c(1, 2, 3, 2, 1)]
1.1.2.6 Subsetting using logical type
x > 2
x[x > 2]
x[c(FALSE, TRUE, TRUE, FALSE, FALSE)]
x[c(FALSE, TRUE, TRUE)] # unexpected result due to recycling
x[x < 2 | x >= 3]
# %in% operator
chr %in% c("d", "e", "f")
chr[chr %in% c("d", "e", "f")]
chr[chr >= "d"] # OS locale specific!
1.1.2.7 Names and subsetting using character type
x
chr
names(x) <- chr
x
x <- c(a = 1.0, b = 1.5, c = 2.0, d = 2.5, e = 3.0)
x["a"]
x[c("d", "e")]
x[c("d", "f")]
1.1.2.8 Modify elements in vector (sub-assign)
y[1] <- 100
y
y[c(8, 10)] <- c(5, 6) # assigning past the end grows the vector, gaps become NA
length(y)
sum(y) # NA because y now contains missing values
is.na(y)
y[!is.na(y)]
y <- c(y, 7)
y[y > 50] <- NA
y <- y[!is.na(y)] # drop missing values
1.1.2.9 Matrices and arrays
mx <- matrix(1:25, 5, 5)
mx
mx <- 1:25
dim(mx) <- c(5, 5)
mx
mx[1:2, 1:3]
ar <- 1:27
dim(ar) <- c(3, 3, 3)
ar
str(ar)
ar[1:2, 1:3, 2:3]
ar[1, 1:3, 2:3, drop = FALSE] # drop = FALSE keeps array dimensions
1.1.2.10 Lists and data frames
lst <- list(1:5, letters[1:3], c(TRUE, FALSE))
lst
str(lst)
names(lst) <- c("a1", "a2", "a3")
lst$a1
df <- data.frame(c1 = 1:5, c2 = letters[1:5],
c3 = c(TRUE, FALSE, TRUE, TRUE, TRUE))
df <- rbind(df, df, df, df, df)
df
str(df)
head(df)
1.1.2.11 Base plot
x <- rnorm(50)
y <- rnorm(50)
plot(x, y)
class(mtcars)
head(mtcars)
attach(mtcars)
plot(wt, mpg)
abline(lm(mpg ~ wt))
title("Regression of MPG on Weight")
detach(mtcars)
1.1.2.12 Getting help
- Function manuals, use question mark in front of function name: ?sum, ?"["
- R manuals: R-intro, Manuals
- R packages vignettes (tutorials)
- Post question on stackoverflow.com, use R tag, provide reproducible example
- Read examples in blog posts - R blogs aggregator: r-bloggers.com
1.2 Exploratory Data Analysis
Using base R
1.2.1 Why Exploratory Data Analysis?
- better understand data
- detect data issues and bad data
- find patterns in data
- iterate exploration process to get better insight
1.2.2 Investigate variables
head(airquality)
str(airquality)
summary(airquality$Temp)
hist(airquality$Temp)
# count observations by month
table(airquality$Month)
# check missing values
table(is.na(airquality$Ozone))
barplot(table(is.na(airquality$Ozone)), main = "Ozone NA")
1.2.3 Investigate variables in batch
vars <- c("Ozone","Solar.R","Wind","Temp")
# use 2x2 plot grid as single plot
par(mfrow = c(2, 2))
# loop over variables and plot histogram
for (var in vars) {
  hist(airquality[[var]], main = var, xlab = "")
}
# restore default plot grid
par(mfrow = c(1, 1))
# count missing values in each column
sapply(airquality, function(x) sum(is.na(x)))
1.2.4 Cleaning bad data
# we assume Ozone sensors have an upper bound at 125 ppb
plot(airquality$Ozone)
abline(h = 125)
# we define 'bad data' for Ozone > 125
airquality[airquality$Ozone > 125 & !is.na(airquality$Ozone), ]
# we define fix for bad data as value 125
airquality[airquality$Ozone > 125 & !is.na(airquality$Ozone), "Ozone"] <- 125
1.2.5 Data transformation
# convert temperature from Fahrenheit to Celsius
FtoC <- function(x) (x - 32) * (5/9)
airquality$Temp <- FtoC(airquality$Temp)
# add month names
month.name
airquality$Month.Name <- month.name[airquality$Month]
# keep month name as factor type to get proper order on plot
airquality$Month.Name <- factor(airquality$Month.Name, levels = month.name)
airquality$Month.Name <- droplevels(airquality$Month.Name)
1.2.6 Data insight: seasonality
# mean solar radiation by month
aggregate(Solar.R ~ Month.Name, data = airquality, FUN = mean)
boxplot(Solar.R ~ Month.Name, data = airquality, xlab = "Month", ylab = "Solar radiation (lang)")
# mean temperature by month
aggregate(Temp ~ Month.Name, data = airquality, FUN = mean)
boxplot(Temp ~ Month.Name, data = airquality, xlab = "Month", ylab = "Temperature (C)")
1.2.7 Data insight: Ozone variable
plot(airquality$Temp, airquality$Ozone)
plot(airquality$Temp, airquality$Ozone, xlab="Temp (C)", ylab="Ozone (ppb)", main="NYC May-Sep '73 air quality")
ozone_temp_lm <- lm(airquality$Ozone ~ airquality$Temp)
summary(ozone_temp_lm)
abline(ozone_temp_lm, col = "red")
# check other variables
pairs(Ozone ~ Temp + Wind + Solar.R, data = airquality)
1.2.8 Data insight: Ozone variable in batch + lm
# add linear model to pairs plot
pointslm <- function(x, y, ...) {
  model <- lm(y ~ x)
  points(x, y)
  abline(a = model$coefficients[1], b = model$coefficients[2], col = "red")
}
pairs(Ozone ~ Temp + Wind + Solar.R + Month, data = airquality, lower.panel = pointslm, upper.panel = NULL)
1.2.9 Homework: read about and play with
- sapply and lapply functions to hide loops and make code cleaner (see the sketch below)
- subsetting data frames by integer and logical types
- plot and pairs functions
- graphic options like par(mfrow = c(2, 1))
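A minimal sketch of the idea, rewriting the histogram loop from section 1.2.3 with lapply (assuming the airquality dataset used there):
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
par(mfrow = c(2, 2))
# lapply applies a function over a vector; invisible() hides the returned list
invisible(lapply(vars, function(var) hist(airquality[[var]], main = var, xlab = "")))
par(mfrow = c(1, 1))
# sapply simplifies the result where possible, e.g. to a named vector
sapply(airquality[vars], mean, na.rm = TRUE)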
1.3 Efficient Exploratory Data Analysis
Using data.table, ggplot2, lattice
1.3.1 Install and load R packages
# install R packages
pkgs <- c("data.table","ggplot2","lattice","bikeshare14")
install.packages(pkgs)
# check version of packages
packageVersion("data.table")
sapply(pkgs, packageVersion, simplify = FALSE)
# load packages
require("data.table")
sapply(pkgs, require, character.only = TRUE)
1.3.2 Investigate dataset
head(batrips)
?batrips
str(batrips)
summary(batrips)
boxplot(batrips$duration/60, log="y", ylab="duration (min) - log scale")
# proceed with batrips as data.frame/data.table objects
DF <- batrips
DT <- as.data.table(batrips)
1.3.3 Cleaning data
# investigate trips longer than 12 hours
# data.frame way
DF[DF$duration > 60*60*12, ]
# data.table way
DT[duration > 60*60*12]
# exclude trips longer than 12h
DF <- DF[DF$duration <= 60*60*12, ]
DT <- DT[duration <= 60*60*12]
1.3.5 Trips by subscription_type
# count trips
table(DF$subscription_type)
aggregate(duration ~ subscription_type, data=DF, FUN=length)
DT[, length(duration), subscription_type]
DT[, .N, subscription_type]
# mean trips duration
aggregate(duration ~ subscription_type, data=DF, FUN=mean)
DT[, mean(duration), subscription_type]
# count, mean and sum
DT[, .(count=.N, mean_s=mean(duration), sum_d=sum(duration)/(60*60*24)), subscription_type]
1.3.6 Plotting using ggplot2 and lattice
The ggplot2 package implements “The Grammar of Graphics” designed by Leland Wilkinson. ggplot2 uses the plus sign + to add new elements to an existing plot. A plot can also be stored in a variable; the graphic is rendered when the print method is invoked on that variable.
The lattice package provides powerful multi-panel “trellis” plots using the formula interface commonly used in R: y ~ x | z
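A minimal sketch of both interfaces, using the built-in mtcars data rather than the course dataset:
library(ggplot2)
library(lattice)
# ggplot2: store the plot in a variable, extend it with +, render by printing
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p # print method renders the plot
p + geom_smooth(method = "lm") # add a regression layer
# lattice: one panel per group via the y ~ x | z formula
xyplot(mpg ~ wt | factor(cyl), data = mtcars)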
1.3.7 Trips by subscription_type
and hour of the day
# count by hour and subscription type
histogram(~ hour(start_date) | subscription_type, data=DT)
# median duration in minutes
ans <- DT[, .(median_m=median(duration)/60), .(subscription_type, start_hour=hour(start_date))]
xyplot(median_m ~ start_hour | subscription_type, data=ans, main="median trip duration")
ggplot(ans, aes(start_hour, median_m)) + geom_point() + facet_wrap(~ subscription_type)
1.3.8 Common trip routes
ans <- DT[, .N, .(start_station, end_station)]
# order by count in descending order, take top 10
ans[order(-N)][1:10]
# by subscription_type, top 5
ans <- DT[, .N, .(subscription_type, start_station, end_station)]
ans[order(-N), head(.SD, 5), subscription_type]
# by weekday, top 2
DT[, weekday := weekdays(start_date)]
ans <- DT[, .N, .(weekday, start_station, end_station)]
ans[order(-N), head(.SD, 2), .(weekday)]
1.3.9 Trips by weekday and subscription_type
# set order for weekday factor
days_in_week <- DT[order(wday(start_date)), unique(weekday)]
DT[, weekday := factor(weekday, levels=days_in_week)]
histogram(~ weekday | subscription_type, data=DT, main="trip count")
# median duration
ans <- DT[, .(.N, duration=median(duration)/60), .(subscription_type, weekday)]
barchart(duration ~ weekday | subscription_type, data=ans, ylab="duration (min)")
1.3.10 Import US holidays data by pasting into fread
HD <- fread("
date,holiday
2014-01-01,New Year Day
2014-01-20,Martin Luther King Jr. Day
2014-02-17,Presidents Day (Washingtons Birthday)
2014-05-26,Memorial Day
2014-07-04,Independence Day
2014-09-01,Labor Day
2014-10-13,Columbus Day
2014-11-11,Veterans Day
2014-11-27,Thanksgiving Day
2014-12-25,Christmas Day
")
1.3.11 Investigate holidays impact - merge datasets
head(HD)
str(HD)
str(DT)
DT[, date := as.Date(start_date, tz="America/Los_Angeles")]
HD[, date := as.Date(date, tz="America/Los_Angeles")]
# add holiday column from HD to DT
DT[HD, holiday := i.holiday, on = "date"]
head(DT)
# feature engineering
DT[, holiday := factor(ifelse(!is.na(holiday), "holiday", "non-holiday"))]
1.3.12 Investigate holidays impact
# for non-english weekdays set locale
# Sys.setlocale("LC_TIME", "English")
DT[, workday := factor(ifelse(weekday %in% c("Saturday","Sunday"), "weekend", "workday"))]
ans <- DT[, .(N_trips=.N, N_days=uniqueN(date)), .(subscription_type, workday, holiday)]
ans[, trips_day := N_trips/N_days]
ans[order(subscription_type, workday, holiday)]
barchart(~ trips_day | subscription_type, groups=holiday, data=ans, auto.key=TRUE, subset=workday=="workday", origin=0, scales=list(x="free"), main="mean trips per workday (Mon-Fri)")
1.3.13 Ultimate goal of EDA: feature engineering
Feature engineering allows a data scientist to prepare data so that machine learning algorithms can process it automatically and build better models.
By adding “holidays” to our dataset we allow machine learning algorithms to predict our measures more precisely - “trips per day” in the last example.
Not having a “holiday” flag in the dataset is not just a matter of being unable to use holiday information to predict trip counts. Holidays that are not flagged in the data will actually generate “noise” through higher variation. This happens because automated algorithms will cluster dates into Mon-Fri or Sat-Sun groups, so holidays become outliers within those clusters and reduce model precision.
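One rough way to see this effect in our data (a sketch, assuming the DT object with the date, workday and holiday columns created above): compare the spread of daily trip counts within workday groups before and after separating holidays.
daily <- DT[, .(trips = .N), .(date, workday, holiday)]
# variation when holidays are hidden inside the workday group
daily[, .(sd_trips = sd(trips)), workday]
# variation once holidays are separated out
daily[, .(sd_trips = sd(trips)), .(workday, holiday)]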
1.3.14 Homework: read about and play with
- read data.table homepage r-datatable.com and introduction vignette
- know when and how to use .N and .SD special symbols
- read ggplot2 homepage ggplot2.tidyverse.org
- read lattice introduction vignette
- understand order and sort functions (see the sketch below)
- read about feature engineering
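A minimal sketch of the difference between order and sort:
x <- c(3, 1, 2)
sort(x) # returns the sorted values: 1 2 3
order(x) # returns the permutation that sorts x: 2 3 1
x[order(x)] # same as sort(x)
# order is what you use to reorder a whole data.frame by a column
head(mtcars[order(mtcars$mpg), ])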
1.4 Machine Learning
Using H2O
1.4.1 What is H2O?
- machine learning and predictive analytics platform
- open source
- free to use
- distributed and parallel
- in-memory
- fast and scalable
- easy to productionize
- core written in Java
- interfaces from webUI, REST API, R, Python, Scala, Hadoop, Spark
1.4.2 Install and start H2O
System requirements: Java (>= 1.7)
- OpenJDK: openjdk.java.net
- Oracle Java SE: www.oracle.com/technetwork/java/javase
# install h2o
install.packages("h2o")
# load h2o
library(h2o)
# initialize h2o instance
h2o.init()
1.4.3 seeds data set: wheat kernels properties
The seeds data set contains measurements of geometrical properties of kernels belonging to three different varieties of wheat.
# import dataset using base R
df <- read.csv("seeds_dataset.txt", sep="\t", header=FALSE)
# import dataset using data.table
dt <- data.table::fread("seeds_dataset.txt")
# import dataset using H2O
hf <- h2o.uploadFile("seeds_dataset.txt")
1.4.4 base R k-means
head(df)
# exclude actual cluster info from dataset
dat <- df[, paste0("V", 1:7)]
# run k-means using 3 clusters
km1 <- kmeans(dat, centers = 3)
# preview model info
km1
# plot all variable pairs
plot(dat, col = km1$cluster + 1,
main = "base R k-means",
pch = 20, cex = 2)
1.4.5 H2O k-means
hf
# run k-means on C1-C7 variables, using 3 clusters
km2 <- h2o.kmeans(hf, x = paste0("C", 1:7), k = 3)
# preview model info
km2
# run prediction, exclude actual cluster info
p <- h2o.predict(km2, hf[, -8])
# plot all variable pairs
plot(as.data.frame(hf[, -8]), col = as.vector(p) + 2,
main = "H2O k-means",
pch = 20, cex = 2)
1.4.6 k-means: unify cluster labels
# inspect labels
df[, "V8"] # actuals
km1$cluster # base R
as.vector(p) # h2o
# decode base R cluster id
# note: k-means cluster ids are arbitrary and differ between runs;
# the mappings below match one particular run
base_r_cluster_id <- km1$cluster
base_r_cluster <- ifelse(base_r_cluster_id==3, 1, ifelse(base_r_cluster_id==1, 2, 3))
# decode H2O cluster id
h2o_cluster_id <- as.vector(p)
h2o_cluster <- ifelse(h2o_cluster_id==2, 1, ifelse(h2o_cluster_id==0, 2, 3))
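Hand-coding these mappings only works for one particular run. A more general sketch (a hypothetical helper, not part of the course code) derives the mapping by majority vote against the actual labels:
decode <- function(cluster_id, actuals) {
  tab <- table(cluster_id, actuals)
  # for each cluster id pick the actual label it overlaps most
  map <- setNames(colnames(tab)[apply(tab, 1, which.max)], rownames(tab))
  as.integer(map[as.character(cluster_id)])
}
base_r_cluster <- decode(km1$cluster, df[, "V8"])
h2o_cluster <- decode(as.vector(p), as.vector(hf[, "C8"]))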
1.4.7 k-means metric: root mean square error
# root mean square error
rmse <- function(predicted, actuals) {
  sqrt(mean((predicted - actuals)^2, na.rm = TRUE))
}
# base R k-means rmse
rmse(base_r_cluster, df[, "V8"])
# H2O k-means rmse
rmse(h2o_cluster, as.vector(hf[, "C8"]))
h2o.make_metrics(as.h2o(h2o_cluster), hf[, "C8"])
1.4.8 k-means metric: confusion matrix
# confusion matrix
table(base_r_cluster, df[, "V8"])
table(h2o_cluster, as.vector(hf[, "C8"]))
# correct classification ratio
base_r_correct <- sum(base_r_cluster == df[, "V8"])
h2o_correct <- sum(h2o_cluster == as.vector(hf[, "C8"]))
sprintf("base R k-means correct ratio: %.2f%%",
base_r_correct * 100 / nrow(df))
sprintf("H2O k-means correct ratio: %.2f%%",
h2o_correct * 100 / nrow(df))
1.4.9 cluster dendrogram
# calculate euclidean distance matrix
d <- dist(dat, method = "euclidean")
# hierarchical cluster analysis
hc <- hclust(d, method = "ward.D")
# plot hclust object
plot(hc)
# draw rectangles
rect.hclust(hc, k = 3, border = "red")
1.4.10 bivariate cluster plot
library(cluster)
# combine two plots horizontally
par(mfrow = c(1,2))
# plot base R k-means clusters
clusplot(dat, km1$cluster, color=TRUE, shade=TRUE, labels=2, lines=0, main="base R k-means")
# plot H2O k-means clusters
clusplot(dat, as.vector(p)+1, color=TRUE, shade=TRUE, labels=2, lines=0, main="H2O k-means")
# reset plot option to default
par(mfrow = c(1,1))
1.4.11 k-means summary
K-means is a clustering algorithm. It measures the distance between observations in a dataset and groups them into partitions (clusters). It is commonly applied in various fields:
- biology, computational biology and bioinformatics
- medicine
- business and marketing
- world wide web
- computer science
- social science
- others
more on wikipedia: Cluster analysis
1.4.12 H2O algorithms
Supervised Learning:
- Generalized Linear Modeling (GLM)
- Gradient Boosting Machine (GBM)
- Deep Learning
- Distributed Random Forest
- Naive Bayes
- Stacked Ensembles
Unsupervised Learning:
- Generalized Low Rank Models (GLRM)
- K-Means Clustering
- Principal Components Analysis (PCA)
more on docs.h2o.ai
1.4.13 H2O webUI: Flow
Once you start an H2O instance, for example using the R command h2o.init(), you can open a web browser and point it to the localhost:54321 address to access the H2O webUI.
H2O Flow is a graphical user interface: users can import data and perform basic data science tasks from the web browser. Flow is designed to be used as a notebook, like Jupyter Notebook (IPython) and similar tools, so no programming skills are required to use it. Multiple interfaces can be used at the same time, for example load and pre-process data using R and the h2o package, then perform statistical modeling in the web browser with H2O Flow. All interfaces connect to the same H2O instance.
1.4.14 Homework
- understand Euclidean distance and its application in K-means clustering
- preview H2O website www.h2o.ai
- play with H2O Flow