class: left, middle, inverse, title-slide # Statistical Thinking for Machine Learning: Lecture 1 ### Reda Mastouri
UChicago MastersTrack: Coursera
Thank you to Gregory Bernstein for parts of these slides
---
class: text-slide, title-inv-7, center, animated, slideInDown
count: FALSE

# Statistics and machine learning

- .left[Statistics and probability theory are the foundations of modeling and machine learning]
- .left[Statistical modeling is more concerned with inference: understanding an effect we hope to observe]
- .left[Machine learning is more concerned with prediction: estimating an unknown value from a sample we observed. A representative sample and generalizable results are crucial]
- .left[Fit the data to the desired output and *learn* from the data!]

---
class: text-slide, title-inv-7, center
count: FALSE

# Agenda

--

- .left[Comprehending data]
  - .left[Data in the grand scheme of things]
  - .left[Understanding your data frame: Rows and columns]
  - .left[Different types of data]

--

- .left[Distributions]
  - .left[Data types and distributions]
  - .left[PDF and CDF]
  - .left[Parameters and Method of Moments]
  - .left[The Normal Distribution and the Central Limit Theorem]

--

- .left[Sampling]
  - .left[How we can use sampling to achieve our goals]
  - .left[Descriptive statistics]

---
class: text-slide, main-slide, center, middle, hide-count

# Data structure and data types

---
class: text-slide

# Where does data fit in the big picture?
- Defining a problem statement
- Collecting and storing data: recording data, downloading data sets, web scraping or using an API, data structures and data lakes
- Data
- Sampling and modeling
- .left[Insights, inference, and visualization]

---
class: text-slide

# Data types

.center[
|Data|Description|
|:----------|:-------------|
| Binary | a value of 0/1, true/false |
| Categorical | a discrete value with a limited set of possibilities |
| Continuous | a numerical value that can take any of infinitely many values within a range |
]

--

.center[
<img src="img/data-types.jpg" width="75%"/>
<div class="my-footer"><span>https://www.statlearning.com</span></div>
]

---
class: text-slide

# Rows and columns

.center[
<img src="img/tidy.png" width="100%"/>
<div class="my-footer"><span>https://r4ds.had.co.nz</span></div>
]

---
class: text-slide

# Rows and columns: Diamond Data Set

- Two use-cases: supervised and unsupervised learning

<table> <thead> <tr> <th style="text-align:right;"> carat </th> <th style="text-align:left;"> cut </th> <th style="text-align:left;"> color </th> <th style="text-align:left;"> clarity </th> <th style="text-align:right;"> depth </th> <th style="text-align:right;"> table </th> <th style="text-align:right;"> price </th> <th style="text-align:right;"> x </th> <th style="text-align:right;"> y </th> <th style="text-align:right;"> z </th> <th style="text-align:right;"> id </th> <th style="text-align:left;"> purchased </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1.54 </td> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> J </td> <td style="text-align:left;"> VS1 </td> <td style="text-align:right;"> 62.2 </td> <td style="text-align:right;"> 59 </td> <td style="text-align:right;"> 8848 </td> <td style="text-align:right;"> 7.34 </td> <td style="text-align:right;"> 7.38 </td> <td style="text-align:right;"> 4.58 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> TRUE </td> </tr> <tr> <td style="text-align:right;">
0.91 </td> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> E </td> <td style="text-align:left;"> SI2 </td> <td style="text-align:right;"> 61.5 </td> <td style="text-align:right;"> 56 </td> <td style="text-align:right;"> 3968 </td> <td style="text-align:right;"> 6.20 </td> <td style="text-align:right;"> 6.23 </td> <td style="text-align:right;"> 3.82 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:right;"> 2.01 </td> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> H </td> <td style="text-align:left;"> SI2 </td> <td style="text-align:right;"> 63.1 </td> <td style="text-align:right;"> 55 </td> <td style="text-align:right;"> 13849 </td> <td style="text-align:right;"> 7.99 </td> <td style="text-align:right;"> 8.09 </td> <td style="text-align:right;"> 5.07 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:right;"> 1.01 </td> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> E </td> <td style="text-align:left;"> SI1 </td> <td style="text-align:right;"> 64.1 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> 4480 </td> <td style="text-align:right;"> 6.26 </td> <td style="text-align:right;"> 6.19 </td> <td style="text-align:right;"> 3.99 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:left;"> TRUE </td> </tr> <tr> <td style="text-align:right;"> 1.52 </td> <td style="text-align:left;"> Good </td> <td style="text-align:left;"> F </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:right;"> 57.8 </td> <td style="text-align:right;"> 59 </td> <td style="text-align:right;"> 12283 </td> <td style="text-align:right;"> 7.58 </td> <td style="text-align:right;"> 7.50 </td> <td style="text-align:right;"> 4.36 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> TRUE </td> </tr> <tr> <td 
style="text-align:right;"> 1.61 </td> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> D </td> <td style="text-align:left;"> SI1 </td> <td style="text-align:right;"> 61.4 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 13582 </td> <td style="text-align:right;"> 7.56 </td> <td style="text-align:right;"> 7.51 </td> <td style="text-align:right;"> 4.63 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:right;"> 0.33 </td> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> G </td> <td style="text-align:left;"> VS1 </td> <td style="text-align:right;"> 61.5 </td> <td style="text-align:right;"> 56 </td> <td style="text-align:right;"> 699 </td> <td style="text-align:right;"> 4.45 </td> <td style="text-align:right;"> 4.48 </td> <td style="text-align:right;"> 2.74 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:right;"> 1.17 </td> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> F </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:right;"> 61.8 </td> <td style="text-align:right;"> 55 </td> <td style="text-align:right;"> 8072 </td> <td style="text-align:right;"> 6.81 </td> <td style="text-align:right;"> 6.74 </td> <td style="text-align:right;"> 4.19 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> FALSE </td> </tr> <tr> <td style="text-align:right;"> 0.30 </td> <td style="text-align:left;"> Premium </td> <td style="text-align:left;"> H </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:right;"> 62.6 </td> <td style="text-align:right;"> 58 </td> <td style="text-align:right;"> 608 </td> <td style="text-align:right;"> 4.28 </td> <td style="text-align:right;"> 4.22 </td> <td style="text-align:right;"> 2.66 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:left;"> TRUE 
</td> </tr> <tr> <td style="text-align:right;"> 0.70 </td> <td style="text-align:left;"> Ideal </td> <td style="text-align:left;"> G </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:right;"> 61.8 </td> <td style="text-align:right;"> 57 </td> <td style="text-align:right;"> 2929 </td> <td style="text-align:right;"> 5.68 </td> <td style="text-align:right;"> 5.71 </td> <td style="text-align:right;"> 3.52 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:left;"> TRUE </td> </tr> </tbody> </table>

--

- **Independent variables** alone can be used for unsupervised modeling
- **Dependent variables** and independent variables are used together for supervised modeling

---
class: text-slide

# Rows and columns

- Rows are linked: some rows belong to the same group
- Are the group means the same?

.center[
|city_type |population_mil |rainfall_inches|
|:----------|-------------:|-------------:|
| urban | 1.2 | 38 |
| urban | 0.75 | 6 |
| suburban | 0.5 | 14 |
| suburban | 0.5 | 18 |
| rural | 0.5 | 32 |
| rural | 0.5 | 12 |
]

- Group label, and a categorical independent variable that can be used in a model: `city_type`
- Potential dependent variables: `population_mil`, `rainfall_inches`

---
class: text-slide

# Categorical variables and dummy encoding

.left-column[
<table> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> price </th> <th style="text-align:left;"> clarity </th> <th style="text-align:left;"> cut </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 8848 </td> <td style="text-align:left;"> VS1 </td> <td style="text-align:left;"> Ideal </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3968 </td> <td style="text-align:left;"> SI2 </td> <td style="text-align:left;"> Ideal </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 13849 </td> <td style="text-align:left;"> SI2 </td> <td
style="text-align:left;"> Good </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4480 </td> <td style="text-align:left;"> SI1 </td> <td style="text-align:left;"> Good </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 12283 </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:left;"> Good </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 13582 </td> <td style="text-align:left;"> SI1 </td> <td style="text-align:left;"> Premium </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 699 </td> <td style="text-align:left;"> VS1 </td> <td style="text-align:left;"> Ideal </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8072 </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:left;"> Ideal </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 608 </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:left;"> Premium </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 2929 </td> <td style="text-align:left;"> VS2 </td> <td style="text-align:left;"> Ideal </td> </tr> </tbody> </table> ] -- <table> <thead> <tr> <th style="text-align:right;"> id </th> <th style="text-align:right;"> price </th> <th style="text-align:right;"> clarity_1 </th> <th style="text-align:right;"> clarity_2 </th> <th style="text-align:right;"> clarity_3 </th> <th style="text-align:right;"> clarity_4 </th> <th style="text-align:right;"> clarity_5 </th> <th style="text-align:right;"> clarity_6 </th> <th style="text-align:right;"> clarity_7 </th> <th style="text-align:right;"> clarity_8 </th> <th style="text-align:right;"> cut_1 </th> <th style="text-align:right;"> cut_2 </th> <th style="text-align:right;"> cut_3 </th> <th style="text-align:right;"> cut_4 </th> <th style="text-align:right;"> cut_5 </th> </tr> </thead> 
<tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 8848 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3968 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 13849 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4480 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td 
style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 12283 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 13582 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 699 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td 
style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 8072 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 608 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 2929 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td 
style="text-align:right;"> 1 </td> </tr> </tbody> </table>

---
class: text-slide, main-slide, center, middle, hide-count

# Distributions

---
class: text-slide

# Distributions depend on the data

--

.pull-left[
- A random variable can follow a certain distribution, depending on the data type
- A normal distribution describes continuous, numeric data
- A Bernoulli distribution describes a single binary outcome; a Binomial distribution describes the number of successes in `\(n\)` binary trials
- A uniform distribution can handle discrete or continuous data
]

--

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto 0 auto auto;" />

--

<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto 0 auto auto;" />

---
class: text-slide

# Terminology of distributions

- **Moments:** Summary quantities, such as the mean and variance, that we estimate from our sample to help us understand our distribution
- **Parameters:** The values that define our distribution
  - Normal: `\(\mu\)`, `\(\sigma\)`
  - Binomial: `\(n\)`, `\(p\)`
  - Uniform: `\(a\)`, `\(b\)`
- **Probability Density (Mass) Function (PDF)/(PMF):** A description of how likely certain outcomes are in a distribution
- **Cumulative Distribution Function (CDF):** The probability that a value falls at or below a certain point

---
class: text-slide

# Moments of random variables

--

- Moments describe a distribution with a set of attributes

--

- `\(\text{E}[X^n]\)`: general raw moment

--

- `\(\text{E}[X]\)`: first moment, `\(\mu=\frac{\sum{x}}{n}\)`

--

- `\(\text{E}[X^2]\)`: second raw moment (uncentered, so it still contains first-moment information)

--

- `\(\text{E}[X^2] - \text{E}[X]^2\)`: second central moment, `\(\sigma^2=\frac{\sum{(x-\mu)^2}}{n}\)`

--

- For continuous data, we can write the second central moment as an integral:

--

`$$\int_{-\infty}^{\infty}f(x)x^2dx - \text{E}[X]^2$$`

where `\(f(x)\)` is the PDF

---
class: text-slide

# Uniform PMF/PDF, CDF

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-8-1.png"
width="504" />
<img src="index_files/figure-html/unnamed-chunk-9-1.png" width="504" />
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-10-1.png" width="504" />
<img src="index_files/figure-html/unnamed-chunk-11-1.png" width="504" />
]

---
class: text-slide,centered

# Normal PDF, CDF

<iframe src="https://goldbergdata.shinyapps.io/distributions/" width=1000px, height=500px></iframe>

---
class: text-slide, main-slide, center, middle, hide-count

# Method of Moments Calculation: Normal, Uniform

---
class: text-slide

# Simulated data: Normal

.pull-left[

```python
import random, numpy, math
import statistics as stats, matplotlib.pyplot as plt
import seaborn as sns, pandas as pd

random.seed(50390)  # seeds Python's `random` module only; NumPy draws are not seeded by this call
sample_df = numpy.random.normal(loc=0, scale=2, size=1000)  # Normal with mu=0, sigma=2
sample_df[0:3]
```

```
## array([-0.42501843, -0.93878938, -2.88702722])
```

```python
stats.mean(sample_df)  # first sample moment estimates mu
```

```
## -0.051744188640231566
```

```python
# Solving for sigma using sample variance
math.sqrt(stats.variance(sample_df))
```

```
## 1.9919425634418428
```

]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-14-1.png" width="504" />
]

---
class: text-slide

# Mystery data: Uniform

```python
my_nums = pd.read_csv("my_sample.csv")  # sample is uniform
my_nums = my_nums["x"].tolist()
my_nums[0:5]
```

```
## [8.74912820779718, 9.16765952832066, 6.329065448371691, 4.0551836041268, 8.59842579229735]
```

```
## Data length: 10000
```

```python
stats.mean(my_nums)  # Can we use sample moments to find the parameters?
```

```
## 5.513906104243407
```

```python
stats.variance(my_nums)
```

```
## 6.673169412967319
```

---
class: text-slide

# Calculating Method of Moments: Uniform

.left-column[
`\(\bar{\text{X}}=5.51\)`
`\(\text{s}^2=6.67\)`<br>
`\(X \sim \text{U}(a, b)\)`
]

`\begin{align}
\text{E}[X] & =\int_{a}^{b}\frac{x}{b-a}\,dx \\
& = \left[\frac{x^2}{2(b-a)}\right]_a^b \\
& = \frac{b^2}{2(b-a)}-\frac{a^2}{2(b-a)} \\
& = \frac{b^2-a^2}{2(b-a)} \\
& = \frac{(b-a)(b+a)}{2(b-a)} \\
\text{E}[X] & = \frac{a+b}{2}
\end{align}`

---
class: text-slide

# Calculating Method of Moments: Uniform

--

.left-column[
`\(\bar{\text{X}}=5.51\)`
`\(\text{s}^2=6.67\)` <br>
`\(X \sim \text{U}(a, b)\)` <br><br>
`\(\text{1) } \text{E}[X]=5.51=\frac{a+b}{2}\)`
`\(\text{2) } \text{V}[X]=6.67=\frac{(b-a)^2}{12}\)`
]

--

.right-column[
`\begin{align}
\text{Solve eq. 1 for b: } a+b & =11.02 \\
b & =11.02 - a
\end{align}`

`\begin{align}
\text{Plug b into eq. 2: } \frac{(11.02-a-a)^2}{12} & =6.67 \\
(11.02-2a)^2 & =6.67\times 12 \\
11.02-2a & =\sqrt{6.67\times 12} \\
-2a& =\sqrt{6.67\times 12}-11.02 \\
a& =\frac{\sqrt{6.67\times 12}-11.02}{-2} \\
a& =1.04
\end{align}`

`\begin{align}
\text{Complete eq. 1 for b: } b & =11.02 - 1.04 \\
b & =9.98
\end{align}`
]

---
class: text-slide

# Checking Method of Moments with the data

```python
print(f"min = {min(my_nums)}, max = {max(my_nums)}")
```

```
## min = 1.0011236150749, max = 9.999872184358539
```

.center[
<img src="index_files/figure-html/unnamed-chunk-20-1.png" width="504" />
]

---
class: text-slide, main-slide, center, middle, hide-count

# Sampling and descriptive statistics

---
class: text-slide

# Choosing a representative sample

--

- Our goal when modeling is to use a sample to draw conclusions about a larger population

--

- It is important that our sample is representative of our population; a sufficient sample size helps accomplish this goal

--

- Sample Moments = Theoretical Moments!
If the sample is not representative, this equation will mislead us

--

- Repeated sampling is also useful: by the Central Limit Theorem, the means of repeated samples are approximately normally distributed

---
class: text-slide

# Central Limit Theorem

```python
dice = [1, 2, 3, 4, 5, 6]  # each face equally likely: discrete uniform
```

```python
# Central Limit Theorem: means of repeated draws from a uniform
# distribution are approximately normally distributed.
# `random` and `stats` were imported on the "Simulated data: Normal" slide.
rolls = 6  # here 6 is both the number of samples and the number of dice per sample
dice_means = [0] * rolls
for i in range(rolls):
    this_roll = random.choices(dice, k=rolls)  # roll `rolls` dice, with replacement
    dice_means[i] = stats.mean(this_roll)      # store the mean of this sample
```

---
class: text-slide

# Using descriptive statistics from a sample

- Measures of central tendency:
  - Mean: arithmetic average
  - Median: middle value of the sorted data (the 50th percentile)
  - Mode: most common value
- Variance
- Range, interquartile range (IQR)
- Correlation: the foundation of linear regression
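
---
class: text-slide

# Descriptive statistics: a Python sketch

A minimal sketch of the measures listed on the previous slide, computed with the standard library on the ten diamond rows shown earlier in the deck. The helper `pearson_r` and the variable names are illustrative assumptions, not part of the original slides. `stats.quantiles` is used with its default quartile method, so other tools may report a slightly different IQR.

```python
import math
import statistics as stats

# The ten diamond rows shown earlier in the deck (carat, price, cut).
carat = [1.54, 0.91, 2.01, 1.01, 1.52, 1.61, 0.33, 1.17, 0.30, 0.70]
price = [8848, 3968, 13849, 4480, 12283, 13582, 699, 8072, 608, 2929]
cut = ["Ideal", "Ideal", "Good", "Good", "Good",
       "Premium", "Ideal", "Ideal", "Premium", "Ideal"]


def pearson_r(xs, ys):
    """Pearson correlation coefficient (hypothetical helper for illustration)."""
    mean_x, mean_y = stats.mean(xs), stats.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


# Measures of central tendency
mean_price = stats.mean(price)
median_price = stats.median(price)
mode_cut = stats.mode(cut)             # the mode also works for categorical data

# Measures of spread
var_price = stats.variance(price)      # sample variance (divides by n - 1)
price_range = max(price) - min(price)
q1, _, q3 = stats.quantiles(price, n=4)
iqr_price = q3 - q1                    # interquartile range

# Linear association between carat and price
r_carat_price = pearson_r(carat, price)

print(f"mean={mean_price}, median={median_price}, mode={mode_cut}")
print(f"variance={var_price:.1f}, range={price_range}, IQR={iqr_price}")
print(f"correlation(carat, price)={r_carat_price:.3f}")
```

On this small sample the correlation between `carat` and `price` comes out strongly positive, which is the kind of linear relationship that regression builds on.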