class: left, bottom, title-slide

.title[
# Machine learning
]
.subtitle[
## The big picture
]
.author[
### Joshua Loftus
]

---
class: center, middle, inverse

<style type="text/css">
.remark-slide-content {
  font-size: 1.2rem;
  padding: 1em 4em 1em 4em;
}
</style>

# What is "Machine Learning"?

## Or rather, *why* is it?

---
class: inverse

## Machine learning applications

Can you think of an example? (Write it down)

--

- Electronic health records to predict which patients will require more care

--

- Genome sequence data from tissue samples to detect different kinds of cancer

--

- Text scraped from social media to predict events of social unrest, or track the spread of misinformation

--

- Tech platform user data to target relevant content, or detect policy/regulation violations

--

- Learning adaptive control of robotic prostheses

etc...

---
class: inverse

# Machine learning, proper

These *application* examples help motivate the value of ML

(Actually, much of the value comes from work specific to the application, like the creation/gathering/processing of the data, and the real-world *actions* taken based on the output of ML)

--

We'll use "ML" to refer to the *theory and general methods*

(Skills like gathering and cleaning data are very useful--and we'll practice them a little--but they're not the main focus of this course)

---
class: inverse

## What is "artificial intelligence"?

(Don't tell anyone I said this, but)

It's a collection of computational tools that people use to create mathematically structured data out of non-mathematically structured data

e.g. a (possibly randomized) function from `\(\{ \text{some image file type} \} \to \mathbb R^d\)` for some `\(d\)`.

.pull-left[
e.g. [word embedding](https://en.wikipedia.org/wiki/Word_embedding) for text data. ([Image credit](https://corpling.hypotheses.org/495))

*We'll usually assume our data is already mathematically structured*
]
.pull-right[
<img src="https://f-origin.hypotheses.org/wp-content/blogs.dir/4190/files/2018/04/3dplot-768x586.jpg" width="99%" />
]

---

# Abstraction and notation

Along came some data, which someone formatted as a collection of `\(p\)` distinct variables

$$X = (X_1, X_2, \ldots, X_p) \in \mathbb R^p$$

We assume **each observation is a point in a vector space** (which we also implicitly assume is finite-dimensional, and that's OK by any practical standard)

--

### Question: is there a `\(Y\)` variable?

Think about your application example (the one you wrote down)

---

# Categories of ML tasks

**Supervised learning** (most of the term)

Often we focus on one variable, name it `\(Y\)`, and give it the special status of being an "outcome"/"response"

**Unsupervised learning** (a bit of this)

If there is no obvious choice of an outcome variable, we may just wish to "find structure" in the `\(X\)` variables. Clustering, dimension reduction

**Other** tasks (probably not these)

Ranking, anomaly detection, network data, embeddings, correspondence, recsys, multi-armed bandits, etc...

---

# Supervised ML sub-categories

If `\(Y\)` is numeric: **regression**

- Concentration levels of a protein (disease status/severity)
- Selling price of a house

If `\(Y\)` is categorical: **classification**

- Should this item be flagged for (human) review? yes/no
- Identify the type of cancer: lymphoma, sarcoma, neuroblastoma, etc.

Special cases

- `\(Y\)` is binary with rare cases, e.g. anomaly detection
- `\(Y\)` is a time to event: survival analysis
- Multi-class, hierarchical classes, etc.

---

# Focus on regression

.pull-left[
- Simpler math (orthogonal projection, Euclidean geometry)
- Intuition pump for other cases
- Often underlies other cases
]
.pull-right[
![](https://i.imgflip.com/4unpnl.jpg)
]

- e.g. binary classification by thresholding a numeric **score**, or ranking (ordinal outcome) / set selection (select items with `\(\text{top-}k\)` scores)

---

# How to predict `\(Y\)` from `\(X\)`?

- It would be sweet if `\(\exists f\)` such that the graph of the function `\(y = f(x)\)` fit the data perfectly

- **Problem**: what if `\((x_1, y_1) = (1, 0)\)` and `\((x_2, y_2) = (1, 1)\)`?

- **Problem**: even our most tested and verified physical laws won't fit data *[perfectly](https://en.wikipedia.org/wiki/Measurement_uncertainty)*

--

#### Solution: applied mathematics

For any function `\(f\)` we can always write `\(\varepsilon \equiv y - f(x)\)`. Look for an `\(f\)` which makes these "errors" "small" for the observed data

---

### Uncertainty opens the door for probability

- Assume a probability distribution (adequately) models the data/errors

Define a good function as one that minimizes

$$\mathbb E[\varepsilon^2] = \mathbb E\\{[Y - f(X)]^2\\}$$

--

- Assume the data/errors are sampled independently

This motivates the **plug-in principle**: compute an estimate `\(\hat f\)` of the good function `\(f\)` by solving the corresponding problem on the dataset, i.e.

$$\text{minimize} \sum_{i=1}^n \left[y_i - \hat f(x_i)\right]^2$$
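
---

## Sketch: the plug-in principle in code

A minimal sketch of that last minimization in R. The simulated data and the linear function class here are assumptions made for illustration, not one of the course examples:

```r
# Plug-in principle sketch: choose fhat from a simple (here linear)
# class by minimizing the sum of squared errors on the observed data
set.seed(1)
n <- 100
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)  # simulated "truth" plus noise

# Sum of squared errors for a candidate f(x) = b[1] + b[2] * x
sse <- function(b) sum((y - (b[1] + b[2] * x))^2)

# Numerically minimize over the two coefficients
optim(c(0, 0), sse)$par

# Agrees (up to numerical error) with least squares via lm()
coef(lm(y ~ x))
```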
---
class: inverse

# Very useful assumptions!

## The *why* of machine learning

### "it works"

- Squared error `\(\rightarrow\)` simpler math (we'll come back to this and consider other loss functions)

- i.i.d. sampling `\(\rightarrow\)` simpler estimation, justifies generalisation (we'll come back to this too)

---
class: inverse, center

Minimizing expected squared error also gives us...

# One of the most powerful ideas in all of statistics

--

`$$\mathbb E\{[Y - \hat f(X)]^2\} = \text{Var}(\hat f) + \text{Bias}(\hat f)^2 + \text{const.}$$`

## the bias-variance trade-off

Are the errors systematic (bias) or not (variance)?

---

# With model complexity:

Typically, more complex models have lower bias and higher variance

And typically, there is a "right amount" of complexity

- Too low? Little variance, but overwhelming bias
- Too high? Little bias, but overwhelming variance
- Just Right: [insert happy statistician meme]

---
class: inverse, bottom, center
background-image: url("../../../files/theme/LSE/LSE_stats_graduation.jpg")
background-size: contain

statisticians celebrate finding the right model complexity

---

# gapminder example

.pull-left[
<img src="01-2-foundations_files/figure-html/gapminder-lm-1.png" width="504" />
]
.pull-right[
<img src="01-2-foundations_files/figure-html/gapminder-loess-1.png" width="504" />
]

---

# candy ranking example

.pull-left[
<img src="01-2-foundations_files/figure-html/candy_lm-1.png" width="504" />
]
.pull-right[
<img src="01-2-foundations_files/figure-html/candy_lm_multiple-1.png" width="504" />
]

---

## Evaluation: mean squared error

`gapminder` models

```r
c(mean(residuals(gm_simple)^2),
  mean(residuals(gm_complex)^2))
```

```
## [1] 54.47218 41.08507
```

`candy_rankings` models

```r
c(mean(residuals(candy_simple)^2),
  mean(residuals(candy_complex)^2))
```

```
## [1] 188.4498 127.1098
```

### A victory for machine learning!

... or is it? What did you learn in the first seminar?
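
---

## ... or is it? A held-out check

In-sample MSE tends to favor the more complex model no matter what. A minimal sketch of a fairer comparison, assuming a data frame `gm` of the gapminder data and hypothetical model formulas (the actual `gm_simple`/`gm_complex` specifications are defined in the course code, not here):

```r
# Hypothetical held-out comparison: fit on one subset, evaluate on another
set.seed(1)
train_rows <- sample(nrow(gm), floor(0.8 * nrow(gm)))
train <- gm[train_rows, ]
test  <- gm[-train_rows, ]

fit_simple  <- lm(lifeExp ~ gdpPercap, data = train)            # assumed formula
fit_complex <- lm(lifeExp ~ poly(gdpPercap, 10), data = train)  # assumed formula

# MSE on data the models never saw
mse <- function(fit, newdata) mean((newdata$lifeExp - predict(fit, newdata))^2)
c(mse(fit_simple, test), mse(fit_complex, test))
```

On held-out data the complex model's advantage can shrink or reverse: that is the variance side of the trade-off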