6 - Inferential Statistics

### Computations
Consider a population with $μ = 50$ , and $σ = 12$ that is randomly sampled with $n = 36$

Determine the following probabilities:
We're using SDSM, so we calculate
$μ_{\bar{x}} = μ = 50$ and $σ_{\bar{x}} = \frac{σ}{\sqrt{n}} = \frac{12}{\sqrt{36}} = 2$
- $P (\bar{x} - 1 < μ < \bar{x} + 1) = norm.dist (51, 50, 2, 1) - norm.dist (49, 50, 2, 1) = 0.3829$
- $P (\bar{x} - 3 < μ < \bar{x} + 3) = norm.dist (53, 50, 2, 1) - norm.dist (47, 50, 2, 1) = 0.8664$
- $P (\bar{x} - 5 < μ < \bar{x} + 5) = norm.dist (55, 50, 2, 1) - norm.dist (45, 50, 2, 1) = 0.9876$
  For each probability, find the appropriate $d > 0$
- $P (\bar{x} - d < μ < \bar{x} + d) = 0.68259$ -> $d = μ - norm.inv (\frac{1 - 0.68259}{2}, μ, σ_{\bar{x}}) \approx 2$
- $P (\bar{x} - d < μ < \bar{x} + d) = 0.9545$ -> $d = μ - norm.inv (\frac{1 - 0.9545}{2}, μ, σ_{\bar{x}}) \approx 4$
- $P (\bar{x} - d < μ < \bar{x} + d) = 0.99$ -> $d = μ - norm.inv (\frac{1 - 0.99}{2}, μ, σ_{\bar{x}}) = 5.152$
  We could use a shortcut: probability $0.68259$ corresponds to one standard deviation from the mean, so $d = 1 \cdot σ_{\bar{x}} = 2$ ; probability $0.9545$ corresponds to two standard deviations, so $d = 2 \cdot σ_{\bar{x}} = 4$ ; in general, $d = norm.s.inv (\frac{1 + P}{2}) \cdot σ_{\bar{x}}$ for probability $P$
Other expressions equivalent to $\bar{x} - d < μ < \bar{x} + d$ :
- $\bar{x} \in (μ - d, μ + d)$
- $| \bar{x} - μ | < d$
- $d = (\text{number of standard deviations}) \cdot σ_{\bar{x}}$
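
The computations above can be reproduced outside Excel. The following is a minimal sketch in Python, assuming the standard library's `statistics.NormalDist` as a stand-in for `norm.dist` and `norm.inv`; the helper names `prob_within` and `distance_for` are made up for illustration.

```python
from statistics import NormalDist

mu, sigma, n = 50, 12, 36
# Sampling distribution of sample means: mean mu, standard deviation sigma / sqrt(n) = 2
sdsm = NormalDist(mu=mu, sigma=sigma / n ** 0.5)

def prob_within(d):
    """P(|x_bar - mu| < d), like norm.dist(mu + d, ...) - norm.dist(mu - d, ...)."""
    return sdsm.cdf(mu + d) - sdsm.cdf(mu - d)

def distance_for(prob):
    """d > 0 with P(|x_bar - mu| < d) = prob, like mu - norm.inv((1 - prob) / 2, ...)."""
    return mu - sdsm.inv_cdf((1 - prob) / 2)

for d in (1, 3, 5):
    print(d, round(prob_within(d), 4))
for p in (0.68259, 0.9545, 0.99):
    print(p, round(distance_for(p), 3))
```

Note that the standard deviation passed to the distribution is $σ_{\bar{x}} = 2$, not the population value $σ = 12$.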

Confidence Intervals

We do not know the value of the population parameter
We do not expect the value of our sample statistic to equal the population parameter
We want to use the sample statistic value to construct an interval that is likely to contain the population parameter
This likelihood, called the confidence level (CL), can be understood as the success rate of our construction process
For means and proportions, the interval is centered at the value of the computed statistic from the collected random sample

$(\bar{x} - M E, \bar{x} + M E)$ for means
$(\hat{p} - M E, \hat{p} + M E)$ for proportions
We set the margin of error (ME) in order to achieve the CL

To obtain the interval $(\bar{x} - d, \bar{x} + d)$ , we first have to set the confidence level.
We select the confidence level, understood as the probability that, for the given distance $d$ , a randomly selected sample will produce a confidence interval containing the parameter.

We introduce $α = 1 - C L$ as the failure rate (the probability that the constructed interval does not contain the parameter)

We don't want to choose $C L = 100 \%$ , because that would mean that $C I = (- \infty, + \infty)$

$z_{\frac{α}{2}} = norm.s.inv (C L + \frac{α}{2}) = - norm.s.inv (\frac{α}{2})$
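
As a quick check of the critical value formula, here is a small Python sketch, assuming `statistics.NormalDist().inv_cdf` plays the role of Excel's `norm.s.inv`; the function name `critical_z` is hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def critical_z(cl):
    """Critical value z_{alpha/2}: the point with area CL + alpha/2 to its left."""
    alpha = 1 - cl
    return STANDARD_NORMAL.inv_cdf(cl + alpha / 2)

for cl in (0.92, 0.95, 0.99):
    alpha = 1 - cl
    # By symmetry, the same value is the negative of the point with area alpha/2 to its left
    print(cl, round(critical_z(cl), 5), round(-STANDARD_NORMAL.inv_cdf(alpha / 2), 5))
```

Both columns agree, which illustrates why the two expressions for $z_{\frac{α}{2}}$ are interchangeable.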

Determining the Margin of Error

The SDSM and SDSP are both approximately normal under certain conditions. These conditions will need to be met for our construction method

To calculate the area between two points under a normal distribution, it is sufficient to know how many standard deviations each point is away from the mean

This fact allows us to determine how many standard deviations away from the mean we need to go in order to achieve a given CL
The number of standard deviations away from the mean we need to go in order to achieve the CL is called the critical value

Sampling Distributions

SDSM - iid random sampling with size n
$μ_{\bar{x}} = μ$ , $σ_{\bar{x}} = \frac{σ}{\sqrt{n}}$

SDSP - iid random sampling size n - shape always scaled binomial
$μ_{\hat{p}} = p$ , $σ_{\hat{p}} = \sqrt{\frac{p q}{n}}$

Definitions and Interpretations

Confidence Level (CL)

Probability of selecting a random sample that will produce a confidence interval containing the parameter
Success rate of construction process
Area between negative and positive critical values

$$\begin{aligned} \text{For Means:} \quad & (\bar{x} - M E, \bar{x} + M E), \quad M E = \frac{σ}{\sqrt{n}} \cdot z_{\frac{α}{2}} \\ \text{For Proportions:} \quad & (\hat{p} - M E, \hat{p} + M E), \quad M E = \sqrt{\frac{p q}{n}} \cdot z_{\frac{α}{2}} \end{aligned}$$

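Set the confidence level (the success rate of the construction)
Use the CL to determine the ME
Find the number of standard deviations necessary to attain that confidence (the critical value)
Then, $M E = z_{\frac{α}{2}} \cdot σ_{\bar{x}}$ for means, or $M E = z_{\frac{α}{2}} \cdot σ_{\hat{p}}$ for proportions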

The problem with proportions is that the goal of inferential statistics is to estimate the parameter $p$ , yet the formula for the margin of error uses that very value. The best we can do, once we have obtained our sample, is to use $\hat{p}$ and $\hat{q}$ instead:

$M E = \sqrt{\frac{\hat{p} \hat{q}}{n}} \cdot z_{\frac{α}{2}}$

This is why it's important to meet the normal distribution conditions, so that the normal distribution can be used instead of the binomial one.

Alpha Level $α = 1 - C L$

Probability of selecting a random sample that will not produce a confidence interval containing the parameter
Failure rate of construction process
Total area in the tails (evenly split between left and right tails)

Critical Values from Standard Normal Distribution $z_{\frac{α}{2}}$

Number of standard deviations away from the mean we must go, in both directions, in order to achieve the confidence level

Confidence Intervals for Proportions

Requirement - SDSP approximately normal
Random sample of size $n$ such that $n \hat{p} > 5$ and $n \hat{q} > 5$

Exercises

A QC manager randomly selects 144 light sensors each day of production. Company policy mandates manufacturing overhauls if the company is confident that the percentage of defective light sensors produced in a day is larger than 5% at the 92% confidence level. Twelve of the randomly sampled sensors are found to be defective. Construct the confidence interval and make a recommendation to the manager
$C L = 92 \%$ , $α = 8 \%$
First check that the SDSP is approximately normal:
$n = 144$ , $x = 12$
$n \hat{p} = 12 > 5$ (we consider defective as success)
$n \hat{q} = 132 > 5$
So we can construct the CI
$(\hat{p} - z_{\frac{α}{2}} \cdot \sqrt{\frac{\hat{p} \hat{q}}{n}}, \hat{p} + z_{\frac{α}{2}} \cdot \sqrt{\frac{\hat{p} \hat{q}}{n}})$
$\hat{p} = \frac{12}{144} = \frac{1}{12}$ , $\hat{q} = \frac{11}{12}$
$z_{\frac{α}{2}} = norm.s.inv (0.96) = 1.75068$ (norm.s.inv returns the z-score directly, because the standard normal distribution has $μ = 0$ and $σ = 1$)
Lower bound - $L B = \hat{p} - z_{\frac{α}{2}} \cdot σ_{\hat{p}} = \frac{1}{12} - 1.75068 \cdot \sqrt{\frac{\frac{1}{12} \cdot \frac{11}{12}}{144}} = 0.04301$
Upper bound - $U B = \hat{p} + z_{\frac{α}{2}} \cdot σ_{\hat{p}} = \frac{1}{12} + 1.75068 \cdot \sqrt{\frac{\frac{1}{12} \cdot \frac{11}{12}}{144}} = 0.12366$
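so the CI is $(4.301 \%, 12.366 \%)$
At the 92% confidence level, the population proportion of defective sensors is between 4.301% and 12.366%. Because the lower bound is below 5%, the company cannot be confident that the defect rate is larger than 5%, so this sample does not justify an overhaul.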
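
The sensor interval can be double-checked with a short Python sketch. It assumes the standard library's `statistics.NormalDist` in place of `norm.s.inv`; the function name `proportion_ci` is made up for this example.

```python
from statistics import NormalDist

def proportion_ci(x, n, cl):
    """Confidence interval for a proportion from x successes in n trials."""
    p_hat = x / n
    q_hat = 1 - p_hat
    # Normality requirement from the notes: n * p_hat > 5 and n * q_hat > 5
    if n * p_hat <= 5 or n * q_hat <= 5:
        raise ValueError("SDSP is not approximately normal for this sample")
    z = NormalDist().inv_cdf(cl + (1 - cl) / 2)   # critical value z_{alpha/2}
    me = z * (p_hat * q_hat / n) ** 0.5           # margin of error
    return p_hat - me, p_hat + me

lb, ub = proportion_ci(x=12, n=144, cl=0.92)
print(round(lb, 5), round(ub, 5))
print("interval lies entirely above 5%:", lb > 0.05)
```

The last line encodes the decision rule: an overhaul would only be indicated if the whole interval sat above 5%.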
The QC manager would like to update the daily sample size so that the margin of error is less than 2%. What would that sample size be?
$z_{\frac{α}{2}} \cdot \sqrt{\frac{\hat{p} \hat{q}}{n}} < 0.02$
${(\frac{z_{\frac{α}{2}} \cdot \sqrt{\hat{p} \hat{q}}}{0.02})}^{2} < n$
We see that the required $n$ is largest when $\hat{p} \hat{q}$ is largest.
$\hat{p} \hat{q} = \hat{p} (1 - \hat{p}) = \hat{p} - {\hat{p}}^{2}$ is maximized at $\hat{p} = \frac{1}{2}$ , which gives $n > {(\frac{1.75068 \cdot 0.5}{0.02})}^{2} \approx 1915.6$ , so $n = 1916$

$\hat{p} = \frac{1}{2}$ gives the largest possible margin of error for a given $n$ , so it is the most conservative choice for the question 'how large must the sample be to guarantee a given margin of error?' It works for every question of that kind.
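
A small Python sketch of the sample-size calculation, assuming `statistics.NormalDist` for the critical value; the function name `sample_size_for_proportion` is hypothetical. It compares the conservative choice $\hat{p} = \frac{1}{2}$ with the value $\hat{p} = \frac{1}{12}$ observed in the sensor sample.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(me, cl, p=0.5):
    """Smallest n with z * sqrt(p * (1 - p) / n) strictly below me."""
    z = NormalDist().inv_cdf(cl + (1 - cl) / 2)
    bound = (z / me) ** 2 * p * (1 - p)   # n must exceed this bound
    return math.floor(bound) + 1

conservative = sample_size_for_proportion(me=0.02, cl=0.92)           # p_hat = 1/2
from_sample = sample_size_for_proportion(me=0.02, cl=0.92, p=1 / 12)  # p_hat = 1/12
print(conservative, from_sample)
```

The conservative answer is larger, which is the point: it guarantees the margin of error no matter what $\hat{p}$ the next sample produces.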

You are running for president of SGA at FHSU. Your campaign team randomly selects 100 students to see who they are voting for, and only 37 of them say they are voting for you. Construct a 99% confidence interval. A simple majority is needed to win; should you be concerned?
$C L = 99 \%$ , $α = 1 \%$ , $n = 100$ , $x = 37$ , $\hat{p} = \frac{37}{100} = 0.37$ , $\hat{q} = \frac{63}{100} = 0.63$
$n \hat{p} = 100 \cdot 0.37 = 37 > 5$ and $n \hat{q} = 100 \cdot 0.63 = 63 > 5$ so the SDSP is approximately normal
$z_{\frac{α}{2}} = norm.s.inv (C L + \frac{α}{2}) = norm.s.inv (0.99 + 0.005) = norm.s.inv (0.995) = 2.57583$
$σ_{\hat{p}} = 0.04828$
$L B = 0.37 - 2.57583 \cdot 0.04828 = 0.24564$ and $U B = 0.37 + 2.57583 \cdot 0.04828 = 0.49436$

Three types of solutions

$(0.24, 0.49)$ - most likely going to lose
$(0.40, 0.60)$ - loosely 50/50
$(0.51, 0.72)$ - most likely going to win
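
The three cases can be expressed as a simple rule that compares the interval with the 0.5 majority threshold. The sketch below is illustrative only; the function name `majority_verdict` and the returned strings are assumptions that mirror the wording above.

```python
def majority_verdict(lb, ub, threshold=0.5):
    """Classify a confidence interval for a vote share against a majority threshold."""
    if ub < threshold:
        return "most likely going to lose"   # whole interval below the threshold
    if lb > threshold:
        return "most likely going to win"    # whole interval above the threshold
    return "loosely 50/50"                   # threshold lies inside the interval

for interval in [(0.24, 0.49), (0.40, 0.60), (0.51, 0.72)]:
    print(interval, "->", majority_verdict(*interval))
```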

A journal published $(0.11, 0.14)$ as the $95 \%$ CI for the proportion of people who regularly attend the movie theater. What can you deduce about the sample data?
$M E = \frac{0.14 - 0.11}{2} = 0.015$ (half the width of the interval)
$\hat{p} = \frac{0.11 + 0.14}{2} = 0.125$ , $\hat{q} = 0.875$
$M E = z_{\frac{α}{2}} \cdot \sqrt{\frac{\hat{p} \cdot \hat{q}}{n}} = norm.s.inv(0.975) \cdot \sqrt{\frac{0.125 \cdot 0.875}{n}} \to n \approx 1868$
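
Working backwards from a published interval can be sketched in Python as follows, assuming `statistics.NormalDist` for the critical value. The centre of the interval gives $\hat{p}$, half its width gives the margin of error, and solving the margin-of-error formula for $n$ gives the approximate sample size.

```python
from statistics import NormalDist

lb, ub, cl = 0.11, 0.14, 0.95

p_hat = (lb + ub) / 2          # centre of the interval
me = (ub - lb) / 2             # half the width of the interval
z = NormalDist().inv_cdf(cl + (1 - cl) / 2)

# ME = z * sqrt(p_hat * q_hat / n)  =>  n = (z / ME)^2 * p_hat * q_hat
n = (z / me) ** 2 * p_hat * (1 - p_hat)
print(round(p_hat, 3), round(me, 3), round(n, 1))
```

Because the published bounds are rounded to two decimals, the recovered $n$ is only an estimate of the sample size actually used.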

Confidence Intervals for Means

Important

$M E = z_{\frac{α}{2}} \cdot \frac{σ}{\sqrt{n}}$ , $C I = (\bar{x} - M E, \bar{x} + M E)$

Requirement: the SDSM is approximately normal, which holds if $n > 30$ , or if the population is normal and the sample is random

Sampling Distribution of Sample Variances

The distribution of sample variances is skewed to the right, so

$P (s^{2} < μ_{s^{2}}) > \frac{1}{2}$

and, knowing that the mean of the sample variances is equal to the population variance,

$P (s^{2} < σ^{2}) > \frac{1}{2}$
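
The skew can be seen in a quick simulation. This is a sketch under assumed settings that are not from the notes (a standard normal population, samples of size 5, 20000 repetitions, a fixed random seed); it estimates how often $s^{2}$ falls below $σ^{2}$ and checks that the sample variances still average out to $σ^{2}$.

```python
import random
from statistics import mean, variance

random.seed(0)                      # fixed seed so the run is repeatable
sigma, n, trials = 1.0, 5, 20_000   # assumed population sd, sample size, repetitions

# statistics.variance is the sample variance s^2 (divides by n - 1)
sample_variances = [
    variance([random.gauss(0, sigma) for _ in range(n)])
    for _ in range(trials)
]

share_below = sum(v < sigma ** 2 for v in sample_variances) / trials
print("share of samples with s^2 < sigma^2:", round(share_below, 3))
print("mean of the sample variances:", round(mean(sample_variances), 3))
```

The share comes out noticeably above one half even though the average of the sample variances stays close to $σ^{2}$.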

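Because $s^{2}$ falls below $σ^{2}$ more often than not, the estimated standard error $\frac{s}{\sqrt{n}}$ tends to be too small, so we compensate by making the critical value (CV) bigger
We create a new distribution, called the t-distribution, which gives us a usable critical value. In Excel: =t.dist(x, d.f=n-1, 1) and =t.inv()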

$t.dist (x, d.f = n - 1, 1) = P (t < x)$ , the area to the left of $x$

$t.inv (\text{area}, d.f = n - 1) = x$ , the $x$ that has the given area to its left

The t-distribution has mean 0, is symmetric about 0, and is bell-shaped

The tails of the t-distribution are fatter than those of the z-distribution

$t_{\frac{α}{2}} = t.inv (C L + \frac{α}{2}, n - 1)$

The t-distribution should be used when working on means with $σ$ unknown

Exercises

Many semi-trucks can haul up to 48000 pounds of cargo legally. A company says that their trailer can haul 35 cows. The consumer questions whether the trailer can actually do that. Assume the weights of cows are normally distributed. The consumer randomly samples 10 cows and finds the average weight to be 1368 pounds, with a standard deviation of 49 lbs. Construct a 95% CI for the population mean weight of cows.

Not Knowing t-distribution

$σ = 49 \text{ lbs}$
$\bar{x} = 1368 \text{ lbs}$
$n = 10$
$z_{\frac{α}{2}} = norm.s.inv (0.975) = 1.95996$
$L B = \bar{x} - z_{\frac{α}{2}} \cdot \frac{σ}{\sqrt{n}} = 1337.63$
$U B = \bar{x} + z_{\frac{α}{2}} \cdot \frac{σ}{\sqrt{n}} = 1398.37$
so in the worst scenario, 35 cows will weigh $35 \cdot 1398.37 \text{ lbs} = 48942.95 \text{ lbs}$ , which is a little over the 48000 lb legal limit. For the trailer to haul 35 cows, the average weight of a cow would have to be at most about 1371 lbs.

Determine how many fully grown cows would need to be sampled so that the confidence interval would contain at most 1 whole number

$$\begin{aligned} 2 M E = \text{Length} & < 1 \\ M E & < 0.5 \\ 1.95996 \cdot \frac{49}{\sqrt{n}} & < 0.5 \\ n & \geq 36894 \end{aligned}$$
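
The z-based interval and the sample-size bound can be verified with a short Python sketch, assuming `statistics.NormalDist` in place of `norm.s.inv` and treating 49 lbs as the known population standard deviation, as this part of the exercise does.

```python
import math
from statistics import NormalDist

x_bar, sigma, n, cl = 1368, 49, 10, 0.95

z = NormalDist().inv_cdf(cl + (1 - cl) / 2)   # critical value, about 1.96
me = z * sigma / math.sqrt(n)                 # margin of error
lb, ub = x_bar - me, x_bar + me
print(round(lb, 2), round(ub, 2))

# Worst case for the trailer: 35 cows at the upper bound versus the 48000 lb limit
print(round(35 * ub, 2), 35 * ub > 48000)

# Smallest n with ME < 0.5, i.e. n > (z * sigma / 0.5)^2
n_needed = math.floor((z * sigma / 0.5) ** 2) + 1
print(n_needed)
```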

Knowing t-distribution

$\bar{x} = 1368$
$s = 49$
$t_{\frac{α}{2}, n - 1} = t.inv (0.975, 9) = 2.26$
$L B = \bar{x} - t_{\frac{α}{2}, n - 1} \cdot \frac{s}{\sqrt{n}} = 1332.95$
$U B = \bar{x} + t_{\frac{α}{2}, n - 1} \cdot \frac{s}{\sqrt{n}} = 1403.05$
This interval is wider because we used the same standard deviation but a larger critical value (due to the use of the t-distribution)
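
Python's standard library has no t-distribution, so the sketch below swaps in a numerical approach: it integrates the t density with Simpson's rule and inverts the result by bisection. This is only an illustration of what Excel's `t.dist` and `t.inv` compute, not how Excel implements them; the names `t_pdf`, `t_cdf` and `t_inv` are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """Area to the left of x: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half

def t_inv(area, df):
    """The x that has the given area to its left, found by bisection."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x_bar, s, n = 1368, 49, 10
t_crit = t_inv(0.975, n - 1)          # plays the role of t.inv(0.975, 9)
me = t_crit * s / math.sqrt(n)
lb, ub = x_bar - me, x_bar + me
print(round(t_crit, 3), round(lb, 2), round(ub, 2))
```

In practice a statistics package would supply the t quantile directly; the hand-rolled version is used here only to keep the example free of external dependencies.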
