Random forest to classify penguins

pg <- penguins %>%
  # not interested in classifying by time/island
  dplyr::select(-island, -year, -sex) %>%
  drop_na()

First fit a single tree

fit_tree <- tree(species ~ .,
                 split = "gini",
                 control = tree.control(nrow(pg), mincut = 40),
                 data = pg)
plot(fit_tree, type = "uniform")
text(fit_tree, pretty = 0, cex = 1.1)

Experiment with changing the control parameters

Fit a random forest

fit_rf <- randomForest(species ~ ., ntree = 100, mtry = 2,
                  minsize = 20, data = pg)
fit_rf
## 
## Call:
##  randomForest(formula = species ~ ., data = pg, ntree = 100, mtry = 2,      minsize = 20) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 2.92%
## Confusion matrix:
##           Adelie Chinstrap Gentoo class.error
## Adelie       146         4      1 0.033112583
## Chinstrap      4        64      0 0.058823529
## Gentoo         0         1    122 0.008130081

Variable importance plot

Read ?importance to learn more

varImpPlot(fit_rf)

Experiment with changing the control parameters

Partial dependence plots

pred_rf <- Predictor$new(fit_rf)
pdp_rf <- FeatureEffects$new(pred_rf,
      features = c("bill_length_mm",
                   "flipper_length_mm"),
      method = "pdp+ice")
plot(pdp_rf) 
## Warning: UNRELIABLE VALUE: Future ('future_lapply-1') unexpectedly generated
## random numbers without specifying argument 'future.seed'. There is a risk that
## those random numbers are not statistically sound and the overall results might
## be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper,
## parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To
## disable this check, use 'future.seed=NULL', or set option 'future.rng.onMisuse'
## to "ignore".

Check other variables

Interaction plots

int_rf <- Interaction$new(pred_rf)
plot(int_rf)

For each variable, how much does the model’s predictions depend on the interaction of that variable with others or by the variable alone?

Hint: read ?Interaction

If the measure is close to 0 then the predictions depend mostly on the variable alone, if it’s close to 1 then the predictions depend more on interactions between that variable and other predictors

Fit a gam and interpret the interaction plot for the fitted model

Note that the iml and gam packages are slightly less compatible, e.g. it seems that iml plots will automatically show results for only 1 class

fit_gam <- gam(species ~ ., family = "binomial",  data = pg)
fit_gam
## Call:
## gam(formula = species ~ ., family = "binomial", data = pg)
## 
## Degrees of Freedom: 341 total; 337 Residual
## Residual Deviance: 9.44917
pred_gam <- Predictor$new(fit_gam)#, type = "response")
pdp_gam <- FeatureEffects$new(pred_gam,
      features = c("bill_length_mm",
                   "flipper_length_mm"),
      method = "pdp+ice")
plot(pdp_gam) 
## Warning: UNRELIABLE VALUE: Future ('future_lapply-1') unexpectedly generated
## random numbers without specifying argument 'future.seed'. There is a risk that
## those random numbers are not statistically sound and the overall results might
## be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper,
## parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To
## disable this check, use 'future.seed=NULL', or set option 'future.rng.onMisuse'
## to "ignore".

int_gam <- Interaction$new(pred_gam)
plot(int_gam) + xlim(c(0,1))
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

Is this what we expect?

Note that the additivity of the model is on the log-odds scale. Change the prediction type to “response” and check the interaction plot again