Understanding data and different data types
Distributions, PDF, CDF
Sampling and descriptive statistics
Hypothesis testing to evaluate a single parameter
Bivariate linear model
Correlation vs. Causation
Multivariate linear regression
Hypothesis testing
Variable transformations — interpreting results
\(y=\beta_0+\beta_1x_1+\beta_2x_2+\epsilon\)
How do we interpret \(\beta_1\), \(\beta_2\)?
$$\hat{\beta_1}=\frac{\hat{\text{Cov}}(X, Y)}{\hat{\text{Var}}(X)}$$
$$ \begin{split} \hat{\text{Cov}}(\text{educ}, \text{wages}) & = \hat{\text{Cov}}(\text{educ}, \beta_1\text{educ}+\beta_2\text{exp}+\epsilon) \\ & = \beta_1\hat{\text{Var}}(\text{educ}) + \beta_2\hat{\text{Cov}}(\text{educ}, \text{exp}) + \hat{\text{Cov}}(\text{educ}, \epsilon) \\ & = \beta_1\hat{\text{Var}}(\text{educ}) + \beta_2\hat{\text{Cov}}(\text{educ}, \text{exp}) \end{split} $$
$$ \text{Omitted variable bias: } \hat{\beta_1}=\beta_1+\beta_2\frac{\hat{\text{Cov}}(\text{educ}, \text{exp})}{\hat{\text{Var}}(\text{educ})} $$
Population model (true relationship): \(y=\beta_0+\beta_1x_1+\beta_2x_2+\nu\)
Our model: \(y=\hat{\beta}_0+\hat{\beta}_1x_1+\upsilon\)
Auxiliary model: \(x_2=\delta_0+\delta_1x_1+\epsilon\)
$$\text{E}(\hat{\beta_1})=\beta_1+\beta_2\delta_1$$
Equivalently, the bias is: \(\text{E}(\hat{\beta_1})-\beta_1=\beta_2\delta_1\)
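The bias formula above can be checked with a small simulation. This is a minimal sketch (all variable names and parameter values are made up for illustration): we generate data from the true three-variable model, fit the short regression of \(y\) on \(x_1\) alone using the covariance/variance formula, and compare the slope to \(\beta_1+\beta_2\delta_1\).

```python
# Sketch: simulating omitted variable bias (illustrative parameter values).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

beta0, beta1, beta2 = 1.0, 2.0, 3.0   # true population coefficients
delta0, delta1 = 0.5, 0.8             # auxiliary model: x2 = delta0 + delta1*x1 + e

x1 = rng.normal(size=n)
x2 = delta0 + delta1 * x1 + rng.normal(size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# Short regression of y on x1 alone: slope = Cov(x1, y) / Var(x1)
beta1_hat = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

# beta1_hat lands near beta1 + beta2*delta1 = 2 + 3*0.8 = 4.4, not near beta1 = 2
print(beta1_hat)
```

With a large sample, the short-regression slope converges to the biased value, matching the formula term by term.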
 | A and B are positively correlated | A and B are negatively correlated |
---|---|---|
B is positively correlated with y | Positive bias | Negative bias |
B is negatively correlated with y | Negative bias | Positive bias |
variable | description |
---|---|
CRIM | per capita crime rate by town |
ZN | proportion of residential land zoned for lots over 25,000 sq. ft. |
INDUS | proportion of non-retail business acres per town |
CHAS | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
NOX | nitric oxides concentration (parts per 10 million) |
RM | average number of rooms per dwelling |
AGE | proportion of owner-occupied units built prior to 1940 |
DIS | weighted distances to five Boston employment centres |
RAD | index of accessibility to radial highways |
TAX | full-value property-tax rate per $10,000 |
PTRATIO | pupil-teacher ratio by town |
B | 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town |
LSTAT | % lower status of the population |
MEDV | median value of owner-occupied homes in $1000's |
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | NA | 36.2 |
0.02985 | 0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
Situation | Action |
---|---|
z is correlated with both x and y | Probably best to include z but be wary of multicollinearity |
z is correlated with x but not y | Do not include z — no benefit |
z is correlated with y but not x | Include z — new explanatory power |
z is correlated with neither x nor y | Including z should have little effect on the fit, but it costs degrees of freedom and can affect hypothesis testing — no real benefit |
Affine transformations of a variable (rescaling and shifting) do not affect the fit of a linear model. The most common example is a scaling transformation.
This is why scaling variables is not necessary for linear regression, but knowing the scale of your variables is important for interpretation.
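The scale-invariance claim above can be sketched numerically (the data here is simulated, not from the Boston dataset): rescaling a regressor shrinks its coefficient by the scale factor, but the fitted values are unchanged.

```python
# Sketch: rescaling a regressor changes its coefficient, not the model's fit.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=500)

def ols_fit(x, y):
    """Bivariate OLS: slope = Cov(x, y)/Var(x), intercept from the means."""
    b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

b0, b1 = ols_fit(x, y)
b0s, b1s = ols_fit(100 * x, y)        # e.g. meters -> centimeters

print(b1, b1s * 100)                  # coefficient shrinks by exactly the scale factor

fitted = b0 + b1 * x
fitted_s = b0s + b1s * (100 * x)
print(np.allclose(fitted, fitted_s))  # fitted values (and hence R^2) are identical
```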
$$\hat{\text{wage}}=3.12+0.447\text{exp}-0.007\text{exp}^2$$
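With a quadratic term, the effect of experience is no longer a single number: differentiating the fitted equation above gives a marginal effect of \(0.447 - 2(0.007)\text{exp}\), which eventually turns negative. A quick sketch of that arithmetic:

```python
# Marginal effect of experience from the fitted quadratic wage equation:
# d(wage)/d(exp) = 0.447 - 2 * 0.007 * exp
def marginal_effect(exp):
    return 0.447 - 2 * 0.007 * exp

# The marginal effect hits zero at the turning point of the parabola
turning_point = 0.447 / (2 * 0.007)

print(marginal_effect(5))   # 0.377: early on, each extra year adds ~0.38 to predicted wage
print(turning_point)        # ~31.9 years, after which predicted wage declines
```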
Recall that the natural logarithm is the inverse of the exponential function, so \(\text{ln}(e^x)=x\), and:
\(\text{ln}(1)=0\)
\(\lim_{x \to 0^+}\text{ln}(x)=-\infty\)
\(\text{ln}(ax)=\text{ln}(a)+\text{ln}(x)\)
\(\text{ln}(x^a)=a\text{ln}(x)\)
\(\text{ln}(\frac{1}{x})=-\text{ln}(x)\)
\(\text{ln}(\frac{x}{a})=\text{ln}(x) - \text{ln}(a)\)
\(\frac{d\text{ln}(x)}{dx}=\frac{1}{x}\)
Level-log: \(y=5+0.2\text{ln}(x)\)
Log-level: \(\text{ln}(y)=5+0.2(x)\)
Log-log: \(\text{ln}(y)=5+0.2\text{ln}(x)\)
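The standard interpretations of the three specifications above can be verified numerically (the equations are the example fits from the slides; the approximation statements are the usual small-change rules):

```python
# Sketch: interpreting log transformations with the three example equations above.
import math

# Level-log: y = 5 + 0.2*ln(x). A 1% increase in x raises y by ~0.2/100 = 0.002 units.
y = lambda x: 5 + 0.2 * math.log(x)
print(y(101) - y(100))                 # ~0.002

# Log-level: ln(y) = 5 + 0.2*x. A one-unit increase in x multiplies y by e^0.2,
# i.e. raises it by ~22% (the ~20% approximation holds for small coefficients).
print(math.exp(0.2) - 1)               # ~0.2214

# Log-log: ln(y) = 5 + 0.2*ln(x). The coefficient is an elasticity:
# a 1% increase in x raises y by ~0.2%.
ly = lambda x: 5 + 0.2 * math.log(x)
print(math.exp(ly(101)) / math.exp(ly(100)) - 1)   # ~0.002
```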