class: left, middle, inverse, title-slide

# Statistical Thinking for Machine Learning: Lecture 4

### Reda Mastouri
UChicago MastersTrack: Coursera
Thank you to Gregory Bernstein for parts of these slides
---
class: text-slide, title-inv-7, center, animated, slideInDown
count: FALSE

# What have we covered?

- .left[Distributions, CDF, PDF, Method of Moments]
- .left[ANOVA, Simple Regression]
- .left[Hypothesis testing, Multiple Regression]

---
class: text-slide, title-inv-7, center
count: FALSE

# Agenda

--

- .left[Logistic Regression]
- .left[Why use Logistic Regression?]
- .left[Forming the Logistic Regression]
- .left[The Link Function]
- .left[Interpreting coefficients]
- .left[Determining the effect size]

---
class: text-slide, main-slide, center, middle, hide-count

# Why use Logistic Regression?

---
class: text-slide

# Why use Logistic Regression?

.left-column[
<table>
<thead>
<tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> hours </th> <th style="text-align:right;"> pass </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 0 </td> </tr>
</tbody>
</table>
]

.right-column[
Why don't we use Linear Regression (i.e., the Linear Probability Model [LPM])?<br>
`\(\text{Pass test}=\beta_0+\beta_1\text{hours studied}\)`<br>

- Model output is unbounded: `\((-\infty, \infty)\)`
- The predicted probability changes by a constant amount for every 1 unit increase in `\(\text{X}\)`
- Residual variance is not constant across values of `\(\text{X}\)`
- Residuals can be large (outliers)
]

---
class: text-slide

# Large outliers, Non-constant variance

.pull-left[

```
## 
## Call:
## lm(formula = pass ~ hours, data = study_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79696 -0.31379 -0.02389  0.29967  0.78284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.02389    0.15988   0.149  0.88254   
## hours        0.09663    0.02866   3.372  0.00263 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4261 on 23 degrees of freedom
## Multiple R-squared:  0.3308, Adjusted R-squared:  0.3017 
## F-statistic: 11.37 on 1 and 23 DF,  p-value: 0.002633
```

LPM: If we study 500 hours, the model predicts a 4834% probability of passing, which is impossible.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-3-1.png" width="504" />
]

---
class: text-slide

# Logit more interpretable

.pull-left[

```
## 
## Call:
## glm(formula = pass ~ hours, family = "binomial", data = study_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8852  -0.7913  -0.3866   0.7670   1.8532  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -2.5563     1.1255  -2.271   0.0231 *
## hours         0.5185     0.2099   2.470   0.0135 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.617  on 24  degrees of freedom
## Residual deviance: 25.161  on 23  degrees of freedom
## AIC: 29.161
## 
## Number of Fisher Scoring iterations: 4
```

Logit: If we study 500 hours, the predicted probability of passing is essentially 100%.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="504" />
]
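---
class: text-slide

# Reproducing the two fits in R

The two summaries above come from `lm()` and `glm()`, as their `Call:` lines show. A minimal sketch, assuming a data frame `study_df` with the `hours` and `pass` columns from the table; the 500-hour predictions are for illustration:

```r
# Linear probability model: predictions are unbounded
lpm <- lm(pass ~ hours, data = study_df)

# Logistic regression: predictions stay within [0, 1]
logit <- glm(pass ~ hours, family = "binomial", data = study_df)

# Predicted probability of passing after 500 hours of study
predict(lpm, newdata = data.frame(hours = 500))                       # ~48.3, i.e., 4834%
predict(logit, newdata = data.frame(hours = 500), type = "response")  # ~1
```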
---
class: text-slide, center

# Why use Logistic Regression?

.center[
<img src="img/why_logistic.jpg" width="75%"/>
<div class="my-footer"><span>https://www.statlearning.com</span></div>
]

---
class: text-slide

# Reasons to use Logistic Regression

.pull-left[
- Model output is bounded between `\([0, 1]\)`
- Each incremental unit increase does not necessarily change the probability by the same amount

## Logistic formula:

- Logistic regression is a linear classifier
- We need a smooth function, without discontinuities, that maps into `\([0, 1]\)`
- We will use the standard logistic sigmoid function: `\(y=\frac{1}{1 + e^{-x}}\)`
]

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="504" />

---
class: text-slide, main-slide, center, middle, hide-count

# Forming the Logistic Regression

---
class: text-slide

# Forming the Logistic Regression

- In a linear model, both `\(X\)` and `\(Y\)` have a range of `\((-\infty, \infty)\)`
- If we have a categorical dependent variable, `\(Y\)` now has a range of `\([0, 1]\)` while `\(X\)` still has a range of `\((-\infty, \infty)\)`
- We must transform `\(Y\)` so that it has the same range as `\(X\)`, which gives us a linear predictor

Start with the probability:

$$p(Y) \in [0, 1]$$

Convert the probability to odds:

$$\frac{p(Y)}{1-p(Y)} \in [0, \infty)$$

Convert the odds to log odds:

$$\text{log odds}=\text{log}\left(\frac{p(Y)}{1-p(Y)}\right) \in (-\infty, \infty)$$

---
class: text-slide

# Forming the Logistic Regression

Linear model after conversion: `\(\text{log}\left(\frac{p(Y)}{1-p(Y)}\right)=\beta X_i\)`

Calculating probability:

`\begin{align}
\frac{p(Y)}{1-p(Y)} & = e^{\beta X_i} \\
p(Y) & = (1-p(Y))e^{\beta X_i} \\
p(Y) & = e^{\beta X_i}-p(Y)e^{\beta X_i} \\
p(Y) + p(Y)e^{\beta X_i} & = e^{\beta X_i} \\
p(Y)(1+e^{\beta X_i}) & = e^{\beta X_i} \\
p(Y) & = \frac{e^{\beta X_i}}{1+e^{\beta X_i}}
\end{align}`
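---
class: text-slide

# Checking the conversions in R

A quick numerical check of the probability to odds to log odds chain and its inverse. A minimal sketch using base R, where `plogis()` is the logistic sigmoid and `qlogis()` is the log odds (logit) function:

```r
p <- 0.75                            # a probability in [0, 1]
odds <- p / (1 - p)                  # odds: 3
log_odds <- log(odds)                # log odds: ~1.0986, same as qlogis(p)

# The sigmoid inverts the chain, mapping log odds back into [0, 1]
exp(log_odds) / (1 + exp(log_odds))  # 0.75
plogis(log_odds)                     # 0.75, equivalent
```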
---
class: text-slide

# Link Function

- Here is our logistic regression model: `\(p(Y=1|X_i)=\frac{1}{1+e^{-\beta X_i}}\)`
- Compare this to linear regression: `\(Y=\beta X_i\)`
- For logistic regression, the desired output `\(y\)` is the probability of success
- There is always a link function between the predictors and the output. For linear regression, it is simply the identity function. For logistic regression, we use the logit link function
- Linear regression is linear between `\(X\)` and `\(Y\)`. Logistic regression is linear between the log odds and `\(X\)`
- We invert the link function to transform the log odds back into probabilities, which are much easier to interpret

---
class: text-slide

# Estimating Coefficients

- We will not use the sum of squared errors to evaluate the accuracy of this model, since that objective has multiple local minima here
- Instead, we will use the log-likelihood of each observation (the logistic loss is its negative): `$$y\text{log}(p)+(1-y)\text{log}(1-p)$$`
- The betas will be estimated using maximum likelihood estimation
- **Maximum likelihood:** Given a sample, which parameter value has the highest probability of having generated the observed data? In other words, which parameter has the maximum likelihood of being correct?

---
class: text-slide

# Interpretation of Coefficients and Output

- A 1 unit increase in `\(X_i\)` increases the log odds by `\(\beta\)`
- If the log odds increase, the odds increase, and the probability increases
- If we just want to quickly classify observations, we can call any positive output (log odds) from the model a success and any negative output a failure
- Why? A probability of 0.5 corresponds to log odds of zero: `$$\text{log}\frac{0.5}{1-0.5}=0$$`

---
class: text-slide, main-slide, center, middle, hide-count

# Partial Effects

---
class: text-slide

# An alternative link: Probit

- Apply the inverse CDF of the standard normal distribution to the probability: `\(\Phi^{-1}(p(Y)) = \beta_0 + \beta_1X_1\)`
- The link function is the inverse normal CDF `\(\Phi^{-1}\)` (the probit); the normal CDF `\(\Phi\)` maps the linear predictor back to a probability
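---
class: text-slide

# Fitting logit and probit in R

Both links are available through `glm()`'s `binomial()` family. A minimal sketch, reusing the hypothetical `study_df` from earlier; the coefficient scales differ across the two links, but the fitted probabilities are typically very close:

```r
logit_fit  <- glm(pass ~ hours, family = binomial(link = "logit"),  data = study_df)
probit_fit <- glm(pass ~ hours, family = binomial(link = "probit"), data = study_df)

# Compare fitted probabilities from the two link functions
cbind(logit = fitted(logit_fit), probit = fitted(probit_fit))
```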