Sorting preliminaries

Check the documentation for ?sort and ?order and then write code to output the values of Y sorted according to the order of X

X <- runif(10)
Y <- X + rnorm(10)
qplot(X, Y)

Base R solution

Check with tidyverse solution

egdf <- data.frame(X = X, Y = Y)
egdf %>%
  arrange(X)
##             X          Y
## 1  0.05571544  2.6500709
## 2  0.08926600  0.5578119
## 3  0.16266617  0.2374518
## 4  0.17869520  0.2483979
## 5  0.25586316 -2.3250955
## 6  0.45478863 -0.2310837
## 7  0.82033903  1.1852317
## 8  0.84673601 -1.2461330
## 9  0.89994850  3.1002198
## 10 0.94981537  0.2989222

Within-leaf averages

Below is some code that computes the average values of Y above and below a given split point

Base R

x_split <- 0.5
c(mean(Y[X <= x_split]),
  mean(Y[X > x_split]))
## [1] 0.1895922 0.8345602

tidyverse

egdf %>%
  group_by(X <= x_split) %>%
  summarize(avg_Y = mean(Y))
## # A tibble: 2 x 2
##   `X <= x_split` avg_Y
##   <lgl>          <dbl>
## 1 FALSE          0.835
## 2 TRUE           0.190

How much computation is required if we change the value of x_split?

Write code that sorts the data on X only once, and then, taking each X value as a split point consecutively, computes the average Y values above and below that split point while minimizing unnecessary computation

How can we use this to decide the split point for a regression tree?

How could we change this code to guarantee a minimum number of observations in each leaf?

Numeric predictor

Write a function that inputs a single numeric predictor and outcome, and outputs a splitting point that achieves the lowest RSS

Example data

n <- 1000
mixture_ids <- rbinom(n, 1, .5)
x <- rnorm(n) + 3*mixture_ids
y <- rnorm(n) + 3*mixture_ids
qplot(x,y)

Apply your function to the example data. Try different values of minimum observations per leaf

Change the data generating process to make the example easier/harder and repeat the above initial split

Compare the initial tree split to a Bayes decision boundary in one of the above examples. Increase n and repeat

Test the function on some gapminder data, plot the initial split point

Multiple splits

n <- 1000
mixture_ids <- rbinom(n, 1, .5)
x <- rnorm(n) + 3*mixture_ids
y <- rnorm(n) + 3*mixture_ids
x <- c(x, rnorm(n/2, mean = -2))
y <- c(y, rnorm(n/2, mean = 5))
egdf <- data.frame(x = x, y = y)
egplot <- egdf %>%
  ggplot(aes(x, y)) +
  geom_point()
egplot