Sorting preliminaries

Check the documentation for ?sort and ?order, then write code to output the values of Y sorted according to the order of X

library(ggplot2)   # for qplot

X <- runif(10)
Y <- X + rnorm(10)
qplot(X, Y)

Base R solution
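One possible base R approach: order(X) returns the permutation of indices that sorts X, and that permutation can be used to index Y directly

Y[order(X)]   # Y values arranged by increasing X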

Check with a tidyverse solution

library(dplyr)

egdf <- data.frame(X = X, Y = Y)
egdf %>%
  arrange(X)
##             X          Y
## 1  0.05571544  2.6500709
## 2  0.08926600  0.5578119
## 3  0.16266617  0.2374518
## 4  0.17869520  0.2483979
## 5  0.25586316 -2.3250955
## 6  0.45478863 -0.2310837
## 7  0.82033903  1.1852317
## 8  0.84673601 -1.2461330
## 9  0.89994850  3.1002198
## 10 0.94981537  0.2989222

Within-leaf averages

Below is some code that computes the average values of Y above and below a given split point

Base R

x_split <- 0.5
c(mean(Y[X <= x_split]),   # average Y in the left leaf
  mean(Y[X > x_split]))    # average Y in the right leaf
## [1] 0.1895922 0.8345602

tidyverse

egdf %>%
  group_by(X <= x_split) %>%
  summarize(avg_Y = mean(Y))
## # A tibble: 2 x 2
##   `X <= x_split` avg_Y
##   <lgl>          <dbl>
## 1 FALSE          0.835
## 2 TRUE           0.190

How much computation is required if we change the value of x_split?

Write code that sorts the data on X only once and then, taking each X value in turn as the split point, computes the average Y values above and below that split point while minimizing unnecessary computation
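One approach, sketched with the X and Y vectors above: sort once, then keep cumulative sums so each candidate split costs constant time instead of a full pass over the data

ord <- order(X)
Y_sorted <- Y[ord]
n <- length(Y_sorted)
csum <- cumsum(Y_sorted)   # running totals of Y in sorted-X order
i <- seq_len(n - 1)        # split after the i-th sorted observation
left_avg  <- csum[i] / i
right_avg <- (csum[n] - csum[i]) / (n - i)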

How can we use this to decide the split point for a regression tree?

How could we change this code to guarantee a minimum number of observations in each leaf?

Numeric predictor

Write a function that takes a single numeric predictor and an outcome as inputs, and outputs the split point that achieves the lowest RSS (residual sum of squares)
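A minimal sketch of such a function (the name best_split and the min_obs argument are our choices, anticipating the minimum-observations question above). It reuses the cumulative-sum idea, together with the identity that a leaf's RSS equals sum(y^2) - (sum(y))^2 / n_leaf

best_split <- function(x, y, min_obs = 1) {
  ord <- order(x)
  x_s <- x[ord]
  y_s <- y[ord]
  n <- length(y_s)
  cs  <- cumsum(y_s)     # cumulative sums of y
  cs2 <- cumsum(y_s^2)   # cumulative sums of y^2
  i <- seq(min_obs, n - min_obs)   # candidates leaving >= min_obs per leaf
  rss_left  <- cs2[i] - cs[i]^2 / i
  rss_right <- (cs2[n] - cs2[i]) - (cs[n] - cs[i])^2 / (n - i)
  best <- i[which.min(rss_left + rss_right)]
  (x_s[best] + x_s[best + 1]) / 2  # split halfway between adjacent x values
}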

Example data

n <- 1000
mixture_ids <- rbinom(n, 1, .5)   # latent cluster labels
x <- rnorm(n) + 3*mixture_ids
y <- rnorm(n) + 3*mixture_ids
qplot(x, y)

Apply your function to the example data. Try different values of minimum observations per leaf
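For example, with the best_split sketch above

best_split(x, y)                 # unconstrained
best_split(x, y, min_obs = 50)   # at least 50 observations per leaf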

Change the data-generating process to make the example easier or harder, and repeat the initial split above

Compare the initial tree split to a Bayes decision boundary in one of the above examples. Increase n and repeat

Test the function on some gapminder data, plot the initial split point
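For instance, assuming the gapminder package and the best_split sketch above, one could split life expectancy on log GDP per capita in a single year

library(gapminder)

gm2007 <- gapminder %>% filter(year == 2007)
split_pt <- best_split(log10(gm2007$gdpPercap), gm2007$lifeExp, min_obs = 10)
gm2007 %>%
  ggplot(aes(log10(gdpPercap), lifeExp)) +
  geom_point() +
  geom_vline(xintercept = split_pt, linetype = "dashed")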

Multiple splits

n <- 1000
mixture_ids <- rbinom(n, 1, .5)
x <- rnorm(n) + 3*mixture_ids
y <- rnorm(n) + 3*mixture_ids
# append a third cluster centered at (-2, 5)
x <- c(x, rnorm(n/2, mean = -2))
y <- c(y, rnorm(n/2, mean = 5))
egdf <- data.frame(x = x, y = y)
egplot <- egdf %>%
  ggplot(aes(x, y)) +
  geom_point()
egplot

Use your single split function once, then split the data and apply the function again on each subset

Plot the data and each split point
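A sketch, assuming the best_split function from above (the choice of min_obs is arbitrary)

s0 <- best_split(egdf$x, egdf$y, min_obs = 10)   # first split on the full data
left  <- egdf %>% filter(x <= s0)
right <- egdf %>% filter(x > s0)
s_left  <- best_split(left$x, left$y, min_obs = 10)
s_right <- best_split(right$x, right$y, min_obs = 10)

egplot +
  geom_vline(xintercept = s0) +
  # draw each child split as a dashed segment within its own half
  annotate("segment", x = s_left, xend = s_left,
           y = min(left$y), yend = max(left$y), linetype = "dashed") +
  annotate("segment", x = s_right, xend = s_right,
           y = min(right$y), yend = max(right$y), linetype = "dashed")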

Extra practice

Recursive splits

Write a function that takes data and a maximum number of splits as inputs and outputs a decision tree
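A minimal recursive sketch (grow_tree is our name; we interpret the limit as a maximum depth per branch and return the tree as a nested list), reusing best_split from above

grow_tree <- function(x, y, max_depth, min_obs = 10) {
  # stop when out of depth or too few observations to split legally
  if (max_depth == 0 || length(y) < 2 * min_obs) {
    return(list(leaf = TRUE, prediction = mean(y)))
  }
  s <- best_split(x, y, min_obs = min_obs)
  in_left <- x <= s
  list(leaf = FALSE, split = s,
       left  = grow_tree(x[in_left], y[in_left], max_depth - 1, min_obs),
       right = grow_tree(x[!in_left], y[!in_left], max_depth - 1, min_obs))
}

tree <- grow_tree(egdf$x, egdf$y, max_depth = 2)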

How does computation scale with the number of levels (tree depth)?

Categorical outcomes

Write a function for the case where the outcome variable is categorical, using a splitting rule based on, e.g., the Gini index or entropy
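A sketch of a Gini-based analogue of best_split (the names gini and best_split_gini are ours). This naive version rescans both candidate leaves at every split, so it is quadratic in n; cumulative class counts would reduce it to a single pass

gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  1 - sum(p^2)
}

best_split_gini <- function(x, labels, min_obs = 1) {
  ord <- order(x)
  x_s <- x[ord]
  lab_s <- labels[ord]
  n <- length(lab_s)
  idx <- seq(min_obs, n - min_obs)
  # size-weighted Gini impurity of the two children for each candidate split
  impurity <- sapply(idx, function(i) {
    (i * gini(lab_s[1:i]) + (n - i) * gini(lab_s[(i + 1):n])) / n
  })
  best <- idx[which.min(impurity)]
  (x_s[best] + x_s[best + 1]) / 2
}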