Let’s start easy by just repeating some steps but with data from a different year.
gapminder scatterplot using data from the year 2002
gm2002 <- gapminder %>%
  filter(year == 2002)
gm_scatterplot <-
  ggplot(gm2002,
         aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
gm_scatterplot
model_lm <- lm(lifeExp ~ gdpPercap, data = gm2002)
predictions_lm <- augment(model_lm)
model_loess <- loess(lifeExp ~ gdpPercap, span = .75, data = gm2002)
predictions_loess <- augment(model_loess)
# Overlay both fits; linewidth replaces size for lines in ggplot2 3.4.0 and later
gm_scatterplot +
  geom_line(data = predictions_lm, linewidth = 1,
            color = "blue",
            linetype = "dashed",
            aes(y = .fitted)) +
  geom_line(data = predictions_loess, linewidth = 1,
            color = "green",
            aes(y = .fitted))
mean(residuals(model_lm)^2)
## [1] 80.11716
mean(residuals(model_loess)^2)
## [1] 45.21583
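The two numbers above are in-sample mean squared errors. Written out, with $y_i$ the observed lifeExp and $\hat{y}_i$ the fitted value for country $i$ out of $n$ (notation introduced here for illustration, not taken from the original), the quantity computed by mean(residuals(model)^2) is:

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

A lower value means the fitted curve sits closer to the observed points on average.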
Models are supposed to capture/use structure in the data that corresponds to structure in the real world. And if the real world isn’t misbehaving, that structure should be somewhat stable.
For example, suppose the relationship changed dramatically from one time period to another. A model fit on data from one period would then be less useful/interesting, because the same model might fit data from a different period poorly.
Let’s explore this with our gapminder models
Create datasets for the desired years
gm2007 <- gapminder %>% filter(year == 2007)
gm1997 <- gapminder %>% filter(year == 1997)
Predict using the newdata argument, then pull the residuals from the resulting data.frame
lm_resid2007 <- augment(model_lm, newdata = gm2007) %>%
  pull(.resid)
lm_resid1997 <- augment(model_lm, newdata = gm1997) %>%
  pull(.resid)
loess_resid2007 <- augment(model_loess, newdata = gm2007) %>%
  pull(.resid)
loess_resid1997 <- augment(model_loess, newdata = gm1997) %>%
  pull(.resid)
mean(lm_resid1997^2)
## [1] 67.23804
mean(loess_resid1997^2)
## [1] 31.56738
mean(lm_resid2007^2)
## [1] 80.1876
mean(loess_resid2007^2, na.rm = TRUE)
## [1] 49.54716
One trade-off we see here: the loess function does not have any default way of extrapolating to observations outside the range of the original data (here, values of gdpPercap in 2007 that are larger than the maximum in 2002). Those predictions come back as NA, which is why the last mean() call above needs na.rm = TRUE.
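The missing predictions behind na.rm = TRUE can be inspected directly. The following is a minimal sketch that uses base R's subset() and predict() rather than filter() and augment(); the names pred_2007 and mse_2007 are introduced here for illustration only.

```r
library(gapminder)

# Same subsets and loess fit as in the text, written with base R only
gm2002 <- subset(gapminder, year == 2002)
gm2007 <- subset(gapminder, year == 2007)
model_loess <- loess(lifeExp ~ gdpPercap, span = .75, data = gm2002)

# Range of the predictor in the training year vs. the test year
range(gm2002$gdpPercap)
range(gm2007$gdpPercap)

# With the default settings, predict() returns NA wherever
# gdpPercap falls outside the range seen in 2002
pred_2007 <- predict(model_loess, newdata = gm2007)
sum(is.na(pred_2007))

# Test MSE over the countries that could be predicted
mse_2007 <- mean((gm2007$lifeExp - pred_2007)^2, na.rm = TRUE)
mse_2007
```

Note that dropping the NA cases means the loess and linear test errors for 2007 are averaged over slightly different sets of countries.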
The more complex loess model performs better than the linear model even when tested on data from 5 years earlier or later.
Sometimes a more complex model really is better!
Question: Can we break it? Let’s change the span parameter in the loess function to make the model even more complex and see if we keep reaching the same conclusion.
Answer: Even after decreasing the span to 0.1, the more complex model was still better!
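One way to run this experiment is to refit loess over several span values and compare each test MSE with the linear model's. This is a sketch using base R's predict() instead of augment(); the vector spans and the choice of 1997 as the test year are illustrative assumptions, not part of the original code.

```r
library(gapminder)

gm2002 <- subset(gapminder, year == 2002)
gm1997 <- subset(gapminder, year == 1997)

# Baseline: linear model fit on 2002, evaluated on 1997
model_lm <- lm(lifeExp ~ gdpPercap, data = gm2002)
mse_lm <- mean((gm1997$lifeExp - predict(model_lm, newdata = gm1997))^2)

# Candidate span values (chosen here for illustration);
# a smaller span gives a more flexible, more complex fit
spans <- c(0.75, 0.5, 0.25, 0.1)

# For each span, fit loess on 2002 and compute the MSE on 1997
mse_loess <- sapply(spans, function(s) {
  fit <- loess(lifeExp ~ gdpPercap, span = s, data = gm2002)
  pred <- predict(fit, newdata = gm1997)
  mean((gm1997$lifeExp - pred)^2, na.rm = TRUE)
})

# Side-by-side comparison of each loess fit with the linear baseline
data.frame(span = spans, mse_loess = mse_loess, mse_lm = mse_lm)
```

Swapping gm1997 for a 2007 subset repeats the comparison in the other direction.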
Question: How can we change this setup so that the linear model isn’t always worse?
Answer: Try a logarithmic transformation of the gdpPercap variable to improve the fit of the linear model first. Then the linear model may do better on data from a different year than loess with a low span value.
gm2002 <- gapminder %>%
  filter(year == 2002) %>%
  mutate(log_gdpPercap = log10(gdpPercap))
gm2007 <- gapminder %>%
  filter(year == 2007) %>%
  mutate(log_gdpPercap = log10(gdpPercap))
gm1997 <- gapminder %>%
  filter(year == 1997) %>%
  mutate(log_gdpPercap = log10(gdpPercap))
Now repeat the other code, changing gdpPercap to log_gdpPercap.
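As a rough sketch of what that repetition might look like, the code below refits both models on the transformed variable and computes the test MSE for each year. It uses base R in place of the dplyr and broom calls; the helpers add_log() and test_mse() are hypothetical names introduced here, and span = 0.1 is one possible "low span" choice.

```r
library(gapminder)

# Hypothetical helper: add the log10 column (base R version of the mutate() calls)
add_log <- function(d) {
  d$log_gdpPercap <- log10(d$gdpPercap)
  d
}
gm2002 <- add_log(subset(gapminder, year == 2002))
gm1997 <- add_log(subset(gapminder, year == 1997))
gm2007 <- add_log(subset(gapminder, year == 2007))

# Refit both models on 2002 using the transformed predictor
model_lm <- lm(lifeExp ~ log_gdpPercap, data = gm2002)
model_loess <- loess(lifeExp ~ log_gdpPercap, span = 0.1, data = gm2002)

# Hypothetical helper: MSE of a fitted model on a new data set,
# skipping any observations the model cannot predict (NA)
test_mse <- function(model, newdata) {
  pred <- predict(model, newdata = newdata)
  mean((newdata$lifeExp - pred)^2, na.rm = TRUE)
}

# Test MSE for each model and year
results <- c(
  lm_1997    = test_mse(model_lm, gm1997),
  loess_1997 = test_mse(model_loess, gm1997),
  lm_2007    = test_mse(model_lm, gm2007),
  loess_2007 = test_mse(model_loess, gm2007)
)
results
```

Comparing the lm and loess entries within each year shows whether the transformation changes which model does better out of sample.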
It seems that: