Check the documentation for ?sort and ?order, and then write code to output the values of Y sorted according to the order of X.
library(tidyverse)  # provides qplot (ggplot2), %>% and arrange (dplyr)
X <- runif(10)
Y <- X + rnorm(10)
qplot(X, Y)
Base R solution
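One possible base R approach, sketched under the assumption that X and Y are numeric vectors like the ones simulated above: order(X) returns the index permutation that sorts X, and indexing Y by that permutation lists Y in the order of X. The set.seed call and the names ord and Y_by_X are additions for illustration and are not part of the original.

```r
# Illustrative data; set.seed is an added assumption for reproducibility
set.seed(1)
X <- runif(10)
Y <- X + rnorm(10)

# order(X) gives the indices that would sort X in increasing order
ord <- order(X)

# Indexing Y by those indices lists Y in the order of X
Y_by_X <- Y[ord]
Y_by_X

# sort(X) returns the sorted values themselves, the same as X[order(X)]
sort(X)
```

Note that sort(Y) would not answer the question: it orders Y by its own values and discards the pairing with X.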
tidyverse solution
egdf <- data.frame(X = X, Y = Y)
egdf %>%
  arrange(X)
## X Y
## 1 0.05571544 2.6500709
## 2 0.08926600 0.5578119
## 3 0.16266617 0.2374518
## 4 0.17869520 0.2483979
## 5 0.25586316 -2.3250955
## 6 0.45478863 -0.2310837
## 7 0.82033903 1.1852317
## 8 0.84673601 -1.2461330
## 9 0.89994850 3.1002198
## 10 0.94981537 0.2989222
Below is some code that computes the average values of Y above and below a given split point.
Base R
x_split <- 0.5
c(mean(Y[X <= x_split]),
mean(Y[X > x_split]))
## [1] 0.1895922 0.8345602
tidyverse
egdf %>%
  group_by(X <= x_split) %>%
  summarize(avg_Y = mean(Y))
## # A tibble: 2 x 2
## `X <= x_split` avg_Y
## <lgl> <dbl>
## 1 FALSE 0.835
## 2 TRUE 0.190
Instead of a single fixed x_split, write code that sorts X only once, and then, taking each X value as a split point consecutively, computes the average Y values above and below that split point while minimizing unnecessary computation.
Write a function that inputs a single numeric predictor and outcome, and outputs a splitting point that achieves the lowest RSS (residual sum of squares).
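A minimal sketch of one way to approach both tasks, not a definitive solution. The function name best_split and its return format are hypothetical. It assumes x and y are numeric vectors of equal length with no missing values and at least two distinct x values. It sorts once, uses cumulative sums so that each candidate split costs constant extra work, and follows the X <= x_split convention used above.

```r
# Hypothetical helper: best single split of y on x by residual sum of squares
best_split <- function(x, y) {
  n <- length(x)
  ord <- order(x)               # sort only once
  xs <- x[ord]
  ys <- y[ord]

  # Candidate k: the k smallest x values go left, the remaining n - k go right
  k <- seq_len(n - 1)
  left_sum   <- cumsum(ys)[k]
  right_sum  <- sum(ys) - left_sum
  left_mean  <- left_sum / k
  right_mean <- right_sum / (n - k)

  # Within-group RSS via sum(y^2) - n_left * mean_left^2 - n_right * mean_right^2
  # (algebraically exact; can lose precision when y has a very large mean)
  rss <- sum(ys^2) - k * left_mean^2 - (n - k) * right_mean^2

  # A split cannot separate tied x values, so rule those positions out
  rss[xs[k] >= xs[k + 1]] <- Inf

  best <- which.min(rss)
  list(split = xs[best],
       left_mean = left_mean[best],
       right_mean = right_mean[best],
       rss = rss[best])
}

# Toy check: y jumps from 0 to 5 between x = 3 and x = 10
x_toy <- c(1, 2, 3, 10, 11, 12)
y_toy <- c(0, 0, 0, 5, 5, 5)
best_split(x_toy, y_toy)$split  # 3
```

The vectors left_mean and right_mean inside the function are the running averages asked for in the first task. Recomputing mean(Y[X <= s]) for every candidate s would instead take time proportional to n for each split.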
n <- 1000
mixture_ids <- rbinom(n, 1, .5)
x <- rnorm(n) + 3*mixture_ids
y <- rnorm(n) + 3*mixture_ids
qplot(x,y)
Change n and repeat.
Using the gapminder data, plot the initial split point.
n <- 1000
mixture_ids <- rbinom(n, 1, .5)
x <- rnorm(n) + 3*mixture_ids
y <- rnorm(n) + 3*mixture_ids
x <- c(x, rnorm(n/2, mean = -2))
y <- c(y, rnorm(n/2, mean = 5))
egdf <- data.frame(x = x, y = y)
egplot <- egdf %>%
  ggplot(aes(x, y)) +
  geom_point()
egplot
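To show what plotting an initial split point could look like on this simulated data, here is a rough sketch. It regenerates similar data (set.seed is an added assumption), finds the lowest-RSS split by brute force, and overlays it with geom_vline. The names rss_at, candidates and initial_split are hypothetical. The brute-force search is chosen for clarity rather than speed.

```r
library(ggplot2)

# Regenerate data like the example above; set.seed is an added assumption
set.seed(1)
n <- 1000
mixture_ids <- rbinom(n, 1, .5)
x <- c(rnorm(n) + 3 * mixture_ids, rnorm(n / 2, mean = -2))
y <- c(rnorm(n) + 3 * mixture_ids, rnorm(n / 2, mean = 5))
egdf <- data.frame(x = x, y = y)

# RSS when each side of the split s is predicted by its own mean of y
rss_at <- function(s, x, y) {
  left  <- y[x <= s]
  right <- y[x > s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

# Candidate splits: every distinct x except the largest,
# so that both sides are always non-empty
candidates <- sort(unique(x))
candidates <- candidates[-length(candidates)]
rss <- sapply(candidates, function(s) rss_at(s, x, y))
initial_split <- candidates[which.min(rss)]

# Overlay the split as a dashed vertical line on the scatterplot
egplot <- ggplot(egdf, aes(x, y)) +
  geom_point() +
  geom_vline(xintercept = initial_split, linetype = "dashed")
egplot
```

Because the extra cluster has low x and high y, the relationship is no longer monotone, so it is worth checking visually whether a single vertical split captures the structure you expect.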