Sampling and Sampling Distributions Part II

class: center, middle, inverse, title-slide

.title[
# Sampling and Sampling Distributions Part II
]
.subtitle[
## EDP 613
]
.author[
### Week 7
]

---

<div>
<style type="text/css">.xaringan-extra-logo {
width: 110px;
height: 128px;
z-index: 0;
background-image: url(/Users/skynet/Documents/WVU/Teaching/GitHub.nosync/edp613/static/img/course_hex_alpha.png);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:1em;right:1em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.title-slide):not(.inverse):not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('a')
          logo.classList = 'xaringan-extra-logo'
          logo.href = 'https://edp613.asocialdatascientist.com'
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

# A Note About The Slides

Currently the equations do not show up properly in Firefox. Other browsers such as Chrome and Safari do work.

---

# A Note About Probability

We're going to touch on this now but come back to more of it later in the term when talking about Bayesian Statistics.

---

# For Now

- An **event** `$E$` is a set of outcomes of an experiment.

- The **probability** `$P$` of an event describes how likely it will occur.

-  A **sample space** contains all possible outcomes.

-  A **probability distribution** gives a probability for each value in a sample space.

---

# Example

> What is the sample space and probability distribution created by tossing a fair quarter?

>   >- Sample space: {*Heads*, *Tails*}

>   >- Probability distribution: `$\left\{\dfrac{1}{2}, \dfrac{1}{2}\right\}$`

---

## Notions

>- The probability of an event is ALWAYS between 0 and 1.

>- Assuming all outcomes are likely, the probability `$P$` of an event `$E$` can be found
`$$P(E) = \frac{\mathrm{Number\,\,of\,\,times\,\,an\,\,event\,\,will\,\,happen}}{\mathrm{Total\,\,number\,\,of\,\,events}}$$`
---

## Example

>- Assume that a standard fair six sided die is rolled. Find the

>>- (a) sample space and then

>>- (b) the probability that someone will roll a 2

>>- (a) The sample space of event `$E=$` six sided dice is rolled is `$P(E)=\{1,2,3,4,5,6\}$`

>   >- (b) The probability that someone will roll a 2 is `$P(2)$` which can be found by `$$P(2)=\dfrac{1}{6}$$`

---

## On Your Own (More of a Challenge!)

>- Assume that a standard fair six sided die is rolled. Find the (a) the probability that someone will roll a 7 and (b) the probability that someone will roll less than a 3

>>- (a) The probability that someone will roll a 7 is `$P(7)$` which can be found by `$$P(7)=\dfrac{0}{6}$$` since the sample space is `$P(E)=\{1,2,3,4,5,6\}$`

---

>>- (b) The probability that someone will roll less than a 3 is `$P(<3)$` which can be found by 
`\begin{aligned}
P(<3)&=P(1)+P(2)\\\\
&=\dfrac{1}{6} + \dfrac{1}{6}\\\\
&= \dfrac{2}{6} \\\\
&= \dfrac{1}{3}
\end{aligned}`

---

### Rule: *Always Reduce Fractions*

- But why?

- `$\dfrac{2}{6} = \dfrac{1}{3}$` but what do you lose by reducing?

- The sample size information which seems sort of important!

---

### New Rule: *Don't Reduce Fractions Unless it Makes Sense!*

---

## Sampling Distributions

- If several samples are drawn from a population, they are likely to have different values for for the mean `$\overline{Y}$`

- The probability distribution of those means (aka all of the `$\overline{Y}\mathrm{s}$`) is called the **sampling distribution**

---

## Sampling Distributions: Words and Notation - The Mean

The mean is calculated the ***exact same way*** as always but

- is called the *mean of the sampling distribution*

- has special variables:

- represented by `$\mu_{\overline{Y}}$`

- sample size is specifically for probabilities and represented by `$M$`
        
--

- given by the formula:
`$$\mu_{\overline{Y}}=\dfrac{\overline{Y}}{M}$$`
  
---

## Sampling Distributions: Words and Notation - The Standard Deviation

- The standard deviation is calculated the ***exact same way*** as always but

- is called the *standard error of the mean*

- has special variables:

- represented by `$\sigma_{\overline{Y}}$`

- sample size is specifically for probabilities and represented by `$N$`

- given by the formula: `$$\sigma_{\overline{Y}}=\dfrac{\sigma}{\sqrt{N}}$$`
  
---

## Central Limit Theorem (CLT)

>- *Officially*: If `$\overline{Y}$` is the mean of a large SRS ( `$N>30$` ) from a population with mean `$\mu$` and standard deviation `$\sigma$`, as `$M$` increases, the distribution becomes normal

>- *Better*: As you take more samples, especially big ones, your graph of the sample means will look more like a normal distribution

>- Implications

>>- If you add up the means from all of your samples and find the average, that number will be your *actual population mean*.<br><br>
    
--

>>- If you add up the standard deviations from all of your samples and find the average, that number will be your *actual population standard deviation*.<br><br>
    
--

>>- Helps you predict characteristics of a population

---

## Procedure for Calculating the CLT

1. Be sure `$N>30$`

2. Find `$\mu_{\overline{Y}}$` and  `$\sigma_{\overline{Y}}$`

3. Sketch a normal curve and shade in the area to be found

4. Find the area using The Standard Normal Table (Appendix B)

---

## Example

According to the Nielsen Company, the mean number of TV sets in a U.S. household in 2008 was 2.83. Assume the standard deviation is 1.2. A sample of 85 households is drawn. What is the probability that the sample mean number of TV sets is between 2.5 and 3?

1. `$85>30$` so this is probably normal

2. We have `$$\mu_{\overline{Y}}=2.83$$` with

`\begin{aligned}
\sigma_{\overline{Y}}&=\dfrac{1.2}{\sqrt{85}}\\\\
&\approx 0.130158
\end{aligned}`

---

</li>
</ol>

<ol start="4">
<li>We have `z`-scores<br>

.pull-left[
`\begin{aligned}
z&=\dfrac{3-2.83}{0.130158}\\\\
&\approx 1.31
\end{aligned}`
]

.pull-right[
`\begin{aligned}
z&=\dfrac{2.5-2.83}{0.130158}\\\\
&\approx -2.54
\end{aligned}`
]

</li>
</ol>

---

- The Standard Normal Table tells us that this is `$0.8994$`

- So there was about a 90% chance that a random household had between 2.5 and 3 TVs in 2008.

---

## Example

It is estimated that the mean number of TV sets in a U.S. household in 2020 is 2.00. Assume the standard deviation is 0.8. A sample of 180 households is drawn. What is the probability that the sample mean number of TV sets is still between 2.5 and 3?

1. `$180>30$` so this is probably normal

2. We have `$$\mu_{\overline{Y}}=2.00$$` with

`\begin{aligned}
\sigma_{\overline{Y}}&=\dfrac{0.8}{\sqrt{180}}\\\\
&\approx 0.059628
\end{aligned}`

---

</li>
<br>
<li>We have `z`-scores<br>

.pull-left[
`\begin{aligned}
z&=\dfrac{3-2.00}{0.059628}\\\\
&\approx 16.77
\end{aligned}`
]

.pull-right[
`\begin{aligned}
z&=\dfrac{2.5-2.00}{0.059628}\\\\
&\approx 8.38
\end{aligned}`
]
</li>
</ol>

---

- The Standard Normal Table tells us that this is essentially `$0$`

- So there is nearly a 0% chance that a random household has between 2.5 and 3 TVs in 2020.

---

## That's it. Let's take a break before working in R.