class: left, middle, inverse, title-slide

# Statistical Thinking for Machine Learning: Lecture 4

### Reda Mastouri
UChicago MastersTrack: Coursera
Thank you to Gregory Bernstein for parts of these slides
---
class: text-slide, title-inv-7, center, animated, slideInDown
count: FALSE

# What have we covered?

- .left[Distributions, CDF, PDF, Method of Moments]
- .left[ANOVA, Simple Regression]
- .left[Hypothesis testing, Multiple Regression]

---
class: text-slide, title-inv-7, center
count: FALSE

# Agenda

--

- .left[Logistic Regression]
- .left[Why use Logistic Regression?]
- .left[Forming the Logistic Regression]
- .left[The Link Function]
- .left[Interpreting coefficients]
- .left[Determining the effect size]

---
class: text-slide, main-slide, center, middle, hide-count

# Why use Logistic Regression?

---
class: text-slide

# Why use Logistic Regression?

.left-column[
<table>
<thead>
<tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> hours </th> <th style="text-align:right;"> pass </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 0 </td> </tr>
<tr> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 0 </td> </tr>
</tbody>
</table>
]

.right-column[
Why don't we use Linear Regression (i.e., the Linear Probability Model [LPM])?<br>
`\(\text{Pass test}=\beta_0+\beta_1\text{hours studied}\)`<br>

- Model output is unbounded: `\((-\infty, \infty)\)`
- The predicted probability changes by a constant amount for every 1 unit increase in `\(\text{X}\)`
- Residual variance is not constant across values of `\(\text{X}\)`
- Residuals can be large (outliers)
]

---
class: text-slide

# Large outliers, Non-constant variance

.pull-left[

```
## 
## Call:
## lm(formula = pass ~ hours, data = study_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79696 -0.31379 -0.02389  0.29967  0.78284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.02389    0.15988   0.149  0.88254   
## hours        0.09663    0.02866   3.372  0.00263 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4261 on 23 degrees of freedom
## Multiple R-squared:  0.3308, Adjusted R-squared:  0.3017 
## F-statistic: 11.37 on 1 and 23 DF,  p-value: 0.002633
```

LPM: If we study 500 hours, the model predicts a 4834% probability of passing, which is impossible.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-3-1.png" width="504" />
]

---
class: text-slide

# Logit more interpretable

.pull-left[

```
## 
## Call:
## glm(formula = pass ~ hours, family = "binomial", data = study_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8852  -0.7913  -0.3866   0.7670   1.8532  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -2.5563     1.1255  -2.271   0.0231 *
## hours         0.5185     0.2099   2.470   0.0135 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.617  on 24  degrees of freedom
## Residual deviance: 25.161  on 23  degrees of freedom
## AIC: 29.161
## 
## Number of Fisher Scoring iterations: 4
```

Logit: If we study 500 hours, the predicted probability of passing is essentially 100%.
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="504" />
]
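---
class: text-slide

# Reproducing the two fits in R

The two summaries above come from `lm()` and `glm()`, as their `Call:` lines show. A minimal sketch, assuming a data frame `study_df` with the `hours` and `pass` columns from the table; the 500-hour predictions are for illustration:

```r
# Linear probability model: predictions are unbounded
lpm <- lm(pass ~ hours, data = study_df)

# Logistic regression: predictions stay within [0, 1]
logit <- glm(pass ~ hours, family = "binomial", data = study_df)

# Predicted probability of passing after 500 hours of study
predict(lpm, newdata = data.frame(hours = 500))                       # ~48.3, i.e., 4834%
predict(logit, newdata = data.frame(hours = 500), type = "response")  # ~1
```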
---
class: text-slide, center

# Why use Logistic Regression?

.center[
<img src="img/why_logistic.jpg" width="75%"/>
<div class="my-footer"><span>https://www.statlearning.com</span></div>
]

---
class: text-slide

# Reasons to use Logistic Regression

.pull-left[
- Model output is bounded between `\([0, 1]\)`
- Each incremental unit increase does not necessarily change the probability by the same amount

## Logistic formula:

- Logistic regression is a linear classifier
- We need a smooth function, without discontinuities, that maps into `\([0, 1]\)`
- We will use the standard logistic sigmoid function: `\(y=\frac{1}{1 + e^{-x}}\)`
]

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="504" />

---
class: text-slide, main-slide, center, middle, hide-count

# Forming the Logistic Regression

---
class: text-slide

# Forming the Logistic Regression

- In a linear model, both `\(X\)` and `\(Y\)` have a range of `\((-\infty, \infty)\)`
- If we have a categorical dependent variable, `\(Y\)` now has a range of `\([0, 1]\)` while `\(X\)` still has a range of `\((-\infty, \infty)\)`
- We must transform `\(Y\)` so that it has the same range as `\(X\)`, which gives us a linear predictor

Start with the probability:

$$p(Y) \in [0, 1]$$

Convert the probability to odds:

$$\frac{p(Y)}{1-p(Y)} \in [0, \infty)$$

Convert the odds to log odds:

$$\text{log odds}=\text{log}\left(\frac{p(Y)}{1-p(Y)}\right) \in (-\infty, \infty)$$

---
class: text-slide

# Forming the Logistic Regression

Linear model after conversion: `\(\text{log}\left(\frac{p(Y)}{1-p(Y)}\right)=\beta X_i\)`

Calculating probability:

`\begin{align}
\frac{p(Y)}{1-p(Y)} & = e^{\beta X_i} \\
p(Y) & = (1-p(Y))e^{\beta X_i} \\
p(Y) & = e^{\beta X_i}-p(Y)e^{\beta X_i} \\
p(Y) + p(Y)e^{\beta X_i} & = e^{\beta X_i} \\
p(Y)(1+e^{\beta X_i}) & = e^{\beta X_i} \\
p(Y) & = \frac{e^{\beta X_i}}{1+e^{\beta X_i}}
\end{align}`
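---
class: text-slide

# Checking the conversions in R

A quick numerical check of the probability to odds to log odds chain and its inverse. A minimal sketch using base R, where `plogis()` is the logistic sigmoid and `qlogis()` is the log odds (logit) function:

```r
p <- 0.75                            # a probability in [0, 1]
odds <- p / (1 - p)                  # odds: 3
log_odds <- log(odds)                # log odds: ~1.0986, same as qlogis(p)

# The sigmoid inverts the chain, mapping log odds back into [0, 1]
exp(log_odds) / (1 + exp(log_odds))  # 0.75
plogis(log_odds)                     # 0.75, equivalent
```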
---
class: text-slide

# Link Function

- Here is our logistic regression model: `\(p(Y=1|X_i)=\frac{1}{1+e^{-\beta X_i}}\)`
- Compare this to linear regression: `\(Y=\beta X_i\)`
- For logistic regression, the desired output `\(y\)` is the probability of success
- There is always a link function between the predictors and the output. For linear regression, it is simply the identity function. For logistic regression, we use the logit link function
- Linear regression is linear between `\(X\)` and `\(Y\)`. Logistic regression is linear between the log odds and `\(X\)`
- We invert the link function to transform the log odds back into probabilities, which are much easier to interpret

---
class: text-slide

# Estimating Coefficients

- We will not use the sum of squared errors to evaluate the accuracy of this model, since that objective has multiple local minima here
- Instead, we will use the log-likelihood of each observation (the logistic loss is its negative): `$$y\text{log}(p)+(1-y)\text{log}(1-p)$$`
- The betas will be estimated using maximum likelihood estimation
- **Maximum likelihood:** Given a sample, which parameter value has the highest probability of having generated the observed data? In other words, which parameter has the maximum likelihood of being correct?

---
class: text-slide

# Interpretation of Coefficients and Output

- A 1 unit increase in `\(X_i\)` increases the log odds by `\(\beta\)`
- If the log odds increase, the odds increase, and the probability increases
- If we just want to quickly classify observations, we can call any positive output (log odds) from the model a success and any negative output a failure
- Why? A probability of 0.5 corresponds to log odds of zero: `$$\text{log}\frac{0.5}{1-0.5}=0$$`

---
class: text-slide, main-slide, center, middle, hide-count

# Partial Effects

---
class: text-slide

# An alternative link: Probit

- Apply the inverse CDF of the standard normal distribution to the probability: `\(\Phi^{-1}(p(Y)) = \beta_0 + \beta_1X_1\)`
- The link function is the inverse normal CDF `\(\Phi^{-1}\)` (the probit); the normal CDF `\(\Phi\)` maps the linear predictor back to a probability
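---
class: text-slide

# Fitting logit and probit in R

Both links are available through `glm()`'s `binomial()` family. A minimal sketch, reusing the hypothetical `study_df` from earlier; the coefficient scales differ across the two links, but the fitted probabilities are typically very close:

```r
logit_fit  <- glm(pass ~ hours, family = binomial(link = "logit"),  data = study_df)
probit_fit <- glm(pass ~ hours, family = binomial(link = "probit"), data = study_df)

# Compare fitted probabilities from the two link functions
cbind(logit = fitted(logit_fit), probit = fitted(probit_fit))
```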