5 - Sampling Distributions

Sampling Distribution of Sample Means

Interest in some quantitative variable such that every member of the population is associated with a particular numerical value
The sampling distribution of sample means is the probability distribution for the random variable resulting from:
Random experiment - iid sampling with a fixed sample size n
Numerical assignment - computing the sample mean

Central Limit Theorem

Definition

Given any infinite population with population mean $\mu$ and non-zero population variance $\sigma^2$, as the sample size $n$ increases, the sampling distribution of sample means approaches a normal distribution with mean $\mu_{\bar{x}} = \mu$ and variance $\sigma_{\bar{x}}^2 = \sigma^2 / n$

Also - $\sigma_{\bar{x}} = \sigma / \sqrt{n}$

Practically infinite populations: n<0.05N
The distributions produced by SRS and iid are good approximations for one another for practically infinite populations
For the most common populations, n>30 is sufficiently large for the sampling distribution of sample means to be approximated well by a normal distribution
If population is normal, all sampling distributions of sample means, regardless of the sample size, are normal as well
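The statements above can be checked empirically. The following is a minimal simulation sketch, not part of the original notes: it assumes a uniform population on [10, 35] (the one used in the exercises), a sample size of 32, and an arbitrary number of repetitions and seed. It compares the mean and standard deviation of the simulated sample means with $\mu$ and $\sigma / \sqrt{n}$.

```python
import random
import statistics
from math import sqrt

random.seed(0)  # arbitrary fixed seed so repeated runs give the same numbers

LOW, HIGH = 10, 35      # assumed uniform population
N = 32                  # sample size
NUM_SAMPLES = 20_000    # number of simulated samples (arbitrary choice)

pop_mean = (LOW + HIGH) / 2          # mu = 22.5
pop_sd = (HIGH - LOW) / sqrt(12)     # sigma = 25 / sqrt(12)

# Random experiment: iid sampling with fixed n.
# Numerical assignment: the sample mean of each sample.
sample_means = [
    sum(random.uniform(LOW, HIGH) for _ in range(N)) / N
    for _ in range(NUM_SAMPLES)
]

sim_mean = statistics.fmean(sample_means)
sim_sd = statistics.stdev(sample_means)

print(f"mean of sample means: {sim_mean:.3f} (theory: {pop_mean:.3f})")
print(f"sd of sample means:   {sim_sd:.3f} (theory: {pop_sd / sqrt(N):.3f})")
```

The simulated values should land close to the theoretical ones; a histogram of `sample_means` would show the bell shape even though the population itself is flat.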

Key Takeaways

$\mu_{\bar{x}} = \mu$, $\sigma_{\bar{x}} = \sigma / \sqrt{n}$


for n<0.05N -> SRS ≈ iid


if n>30, SDSM normal
if population is normal, all SDSM = normal

Exercises

A population is uniformly distributed taking on values from 10 to 35. The expected value of the population is 22.5, and the stdev is $25 / \sqrt{12}$

  1. Determine the probability that a randomly selected member of the population has a value larger than 20
    $P(x > 20) = \frac{1}{25} \cdot 15 = \frac{3}{5} = 0.6 = 60\%$
  2. Determine the probability that a random sample of size 32 taken from the population has an average value larger than 20
    Now we're in the world of sampling distributions of sample means.
    $P(\bar{x} > 20) = ?$
    n=32>30, so CLT applies (always write out the words 'so CLT applies')
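The notes leave $P(\bar{x} > 20)$ open. As a hedged sketch of how it could be evaluated, the snippet below applies the normal approximation justified by the CLT, using Python's `statistics.NormalDist` in place of the spreadsheet `norm.dist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

LOW, HIGH = 10, 35
mu = (LOW + HIGH) / 2            # 22.5
sigma = (HIGH - LOW) / sqrt(12)  # 25 / sqrt(12)
n = 32

# Part 1: a single member of a uniform population, so no CLT is involved.
p_single = (HIGH - 20) / (HIGH - LOW)

# Part 2: n = 32 > 30, so CLT applies and the SDSM is approximately normal.
se = sigma / sqrt(n)                      # standard deviation of the SDSM
p_mean = 1 - NormalDist(mu, se).cdf(20)

print(f"P(x > 20)     = {p_single:.3f}")
print(f"P(x_bar > 20) = {p_mean:.3f}")
```

Under these assumptions the sample-mean probability comes out at roughly 0.975, much larger than the 0.6 for a single member, because the SDSM is far narrower than the population.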

The average value of a house in a particular city is $175000 with a standard deviation of $10000

  1. Determine the probability that a randomly selected house in this city has a value larger than $180000
    We cannot answer this question, because it asks us for a single home, and we don't know the parent distribution
  2. Determine the probability that a random sample of size 49 has an average value larger than $180000
    $\mu_{\bar{x}} = 175000$
    $\sigma_{\bar{x}} = \sigma / \sqrt{n} = 10000 / 7 = 1428.57$
    $P(\bar{x} > 180000)$ = 1 - norm.dist(180000, 175000, 10000/7, 1) = 0.000233
  3. Determine the probability that a random sample of size 49 taken from the population has an average within 2 standard deviations of the mean
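    $x_{top} = 175000 + 2 \cdot \frac{10000}{7}$
    $x_{bot} = 175000 - 2 \cdot \frac{10000}{7}$
    $P(\mu - 2\sigma_{\bar{x}} \le \bar{x} \le \mu + 2\sigma_{\bar{x}})$ = norm.s.dist(2, 1) - norm.s.dist(-2, 1) = 0.9545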
  4. Determine the probability that a random sample of size 64 taken from the population has an average within 2 standard deviations of the mean
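    $x_{top} = 175000 + 2 \cdot \frac{10000}{8}$
    $x_{bot} = 175000 - 2 \cdot \frac{10000}{8}$
    $P(\mu - 2\sigma_{\bar{x}} \le \bar{x} \le \mu + 2\sigma_{\bar{x}})$ = norm.s.dist(2, 1) - norm.s.dist(-2, 1) = 0.9545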
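The house-value computations above can be reproduced outside a spreadsheet. This is a small sketch using `statistics.NormalDist` as a stand-in for `norm.dist` and `norm.s.dist`; the helper names `sdsm` and `within_two_se` are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 175_000, 10_000

def sdsm(n):
    """Normal model of the sampling distribution of sample means for size n."""
    return NormalDist(MU, SIGMA / sqrt(n))

def within_two_se(n):
    """P(sample mean lies within 2 standard deviations of the SDSM mean)."""
    dist = sdsm(n)
    return dist.cdf(MU + 2 * dist.stdev) - dist.cdf(MU - 2 * dist.stdev)

# Part 2: sample of size 49, average above $180000
p_above = 1 - sdsm(49).cdf(180_000)

print(f"P(x_bar > 180000), n=49: {p_above:.6f}")
print(f"within 2 sd, n=49: {within_two_se(49):.4f}")
print(f"within 2 sd, n=64: {within_two_se(64):.4f}")
```

Parts 3 and 4 give the same probability because the interval is defined in units of $\sigma_{\bar{x}}$: changing n changes the width of the interval in dollars, not its width in standard deviations.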

IQ scores are normally distributed with a mean of 100 and a standard deviation of 10. Suppose a random sample of size 4 is taken from the population. Determine the following probabilities:

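  1. $P(|x - \mu| > 3)$
    Population distribution (which is normal), with $\mu = 100$ and $\sigma = 10$
    $P(|x - \mu| > 3)$ = norm.dist(97, 100, 10, 1) + (1 - norm.dist(103, 100, 10, 1)) = 0.76
  2. $P(|\bar{x} - \mu| > 3)$
    Because we have $\bar{x}$, we work with the SDSM
    $\sigma_{\bar{x}} = \sigma / \sqrt{n} = 10 / 2 = 5$
    $\mu_{\bar{x}} = \mu = 100$
    $P(|\bar{x} - \mu| > 3)$ = norm.dist(97, 100, 5, 1) + 1 - norm.dist(103, 100, 5, 1) = 0.55
  3. $P(|\bar{x} - 115| < 1)$
    The $\sigma_{\bar{x}}$ and $\mu_{\bar{x}}$ are the same as in 2.
    $P(|\bar{x} - 115| < 1)$ = norm.dist(116, 100, 5, 1) - norm.dist(114, 100, 5, 1) = 0.0019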
    Now, for the sample size of 16, 64, 100, and 150, we can see that
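The three IQ probabilities can be checked with the sketch below, which uses `statistics.NormalDist` instead of `norm.dist`. The closing loop is an assumption about what the last note was heading toward: it repeats the part 2 probability for the listed sample sizes, since the notes do not record which quantity was compared. The helper name `two_sided_tail` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 100, 10

def two_sided_tail(dist, center, width):
    """P(|X - center| > width) for a normal distribution `dist`."""
    return dist.cdf(center - width) + (1 - dist.cdf(center + width))

population = NormalDist(MU, SIGMA)
sdsm_4 = NormalDist(MU, SIGMA / sqrt(4))   # normal population, so SDSM is normal for any n

p1 = two_sided_tail(population, MU, 3)     # part 1: single observation
p2 = two_sided_tail(sdsm_4, MU, 3)         # part 2: sample mean, n = 4
p3 = sdsm_4.cdf(116) - sdsm_4.cdf(114)     # part 3: sample mean within 1 of 115

print(f"{p1:.2f} {p2:.2f} {p3:.4f}")

# Assumed comparison: the part 2 probability for larger sample sizes.
for n in (16, 64, 100, 150):
    sdsm_n = NormalDist(MU, SIGMA / sqrt(n))
    print(n, f"{two_sided_tail(sdsm_n, MU, 3):.4f}")
```

As n grows, $\sigma_{\bar{x}}$ shrinks, so the chance that the sample mean misses $\mu$ by more than 3 drops quickly.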

Sampling Distribution of Sample Proportions

Taking interest in some characteristic such that every member of the population either has or does not have the characteristic
The sampling distribution of sample proportions is the probability distribution for the random variable resulting from:

Constructing Sampling Distribution of Sample Proportions

Consider a population of size N and a characteristic with occurrence rate p, and a sample formed using iid sampling with a sample size n

The distributions produced by SRS and iid are good approximations for one another for practically infinite populations

Construction

Determine all the possible sample proportions
for n=6, we can have $\{0, \frac{1}{6}, \frac{2}{6}, \frac{3}{6}, \frac{4}{6}, \frac{5}{6}, 1\}$, so n+1 possible proportions
Show how the SDSP is related to a binomial variable
our numerator is a binomial variable with n=n, p=p, and q=1-p
Determine the expected value and variance of the SDSP
We can use really useful formulas:
$\mu = np$ and $\sigma^2 = npq$
And to bring those formulas into SDSP world, we just have to multiply the variable by $\frac{1}{n}$ (so the mean scales by $\frac{1}{n}$ and the variance by $\frac{1}{n^2}$)

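$\sum x P(x) = np$
$\frac{1}{n} \sum x P(x) = np \cdot \frac{1}{n}$, going into SDSP world
$\sum \hat{p} P(x) = p$
and
$\sum (x - np)^2 P(x) = npq$, going into SDSP world
$\frac{1}{n^2} \sum (x - np)^2 P(x) = npq \cdot \frac{1}{n^2}$
$\sum \left(\frac{x}{n} - \frac{np}{n}\right)^2 P(x) = \frac{pq}{n}$
$\sum (\hat{p} - p)^2 P(x) = \frac{pq}{n}$

So the expected value $\mu_{\hat{p}} = p$, and standard deviation $\sigma_{\hat{p}} = \sqrt{pq / n}$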
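The two results can be verified numerically by summing over the binomial probabilities directly. The sketch below assumes n = 6 and an arbitrary occurrence rate p = 0.4 chosen only for illustration; `binom_pmf` is a hand-written equivalent of `binom.dist(k, n, p, 0)`.

```python
from math import comb, isclose, sqrt

def binom_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 6, 0.4          # p = 0.4 is an arbitrary illustrative value
q = 1 - p

p_hats = [k / n for k in range(n + 1)]            # the n + 1 possible proportions
probs = [binom_pmf(k, n, p) for k in range(n + 1)]

mean_p_hat = sum(ph * pr for ph, pr in zip(p_hats, probs))
var_p_hat = sum((ph - p) ** 2 * pr for ph, pr in zip(p_hats, probs))

print(isclose(mean_p_hat, p))              # expected value equals p
print(isclose(var_p_hat, p * q / n))       # variance equals pq/n
print(f"sigma_p_hat = {sqrt(var_p_hat):.4f}")
```

Both comparisons should print `True`, matching the derivation term by term.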

Creating the probability table for possible sample proportions

$\hat{p}$    $P(\hat{p})$
0      binom.dist(0, 6, p, 0)
1/6    binom.dist(1, 6, p, 0)
2/6    ...
3/6    ...
4/6    ...
5/6    ...
1      ...

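A table like this can also be generated programmatically. This is a sketch under assumptions: the occurrence rate p = 0.4 is an arbitrary illustrative value (the notes keep p symbolic), and `sdsp_table` is a hypothetical helper whose probabilities correspond to `binom.dist(k, n, p, 0)`.

```python
from math import comb

def sdsp_table(n, p):
    """Return (label, probability) rows for every possible sample proportion k/n."""
    q = 1 - p
    return [(f"{k}/{n}", comb(n, k) * p**k * q ** (n - k)) for k in range(n + 1)]

table = sdsp_table(6, 0.4)   # p = 0.4 is assumed for illustration only

print("p_hat  P(p_hat)")
for label, prob in table:
    print(f"{label:>5}  {prob:.4f}")
```

The labels are left unreduced (for example 2/6 rather than 1/3) to match the table in the notes.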
SDSP Computations

If np>5 and nq>5, then use of the normal approximation for computations is permissible
A more conservative (cautious) criterion is $np \ge 10$ and $nq \ge 10$
If these conditions are not met, do not use normal approximation - use binomial computations

For exams, use the less conservative (>5) criteria
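The rule of thumb can be written as a small check. The function below is a hypothetical helper, not something defined in the notes; the `>=` in the conservative branch is an assumption about the comparison intended there.

```python
def normal_approx_ok(n: int, p: float, conservative: bool = False) -> bool:
    """Decide whether the normal approximation to the SDSP is permissible."""
    q = 1 - p
    if conservative:
        # cautious criterion (comparison operator assumed to be >=)
        return n * p >= 10 and n * q >= 10
    # less conservative criterion used for exams
    return n * p > 5 and n * q > 5

print(normal_approx_ok(20, 0.4))                     # True: np = 8, nq = 12
print(normal_approx_ok(20, 0.4, conservative=True))  # False: np = 8 is below 10
print(normal_approx_ok(6, 0.4))                      # False: np = 2.4
```

When the check fails, the exact binomial computation should be used instead of the normal approximation.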

Sampling Distribution of Sample Proportions

Useful formulas:

$\mu_{\hat{p}} = p$, $\sigma_{\hat{p}} = \sqrt{pq / n}$

For the distribution to be approximately normal - np>5 and nq>5

Examples
  1. A politician has 40% of the electorate voting for him. Determine the probability that a random sample of size 20 will indicate that he will win the popular vote (>50%)
    n=20, p=40%
    checking if the normal approximation applies - $np = 20 \cdot 0.4 = 8 > 5$, and $nq = 20 \cdot 0.6 = 12 > 5$, so the normal approximation is appropriate
    $\mu_{\hat{p}} = p = 0.4$
    $\sigma_{\hat{p}} = \sqrt{pq / n} = \sqrt{0.4 \cdot 0.6 / 20} = 0.1095$
    $P(\hat{p} > 0.5)$ = 1 - norm.dist(0.5, 0.4, 0.1095, 1) = 0.18
    The =norm.dist is an approximation, whereas =binom.dist gives the exact value. But for distributions that can be approximated by a normal, we should use =norm.dist

  2. The MTHFR gene mutation has a prevalence rate of 40% among Caucasians.

    • How large a sample would be necessary to get the standard deviation of the sampling distribution of sample proportions to be less than 0.01?
      $\sqrt{pq / n} = \sigma_{\hat{p}} < 0.01$
      $\frac{0.4 \cdot 0.6}{0.01^2} < n$
      $2400 < n$
      We should take at least n=2401
    • Determine $P(|\hat{p} - p| > 0.005)$ for this smallest sample
      1 - (norm.dist(μ+0.005, μ, σ, 1) - norm.dist(μ-0.005, μ, σ, 1)) = 0.617
    • Determine $P(|\hat{p} - p| < 0.015)$ for this smallest sample
      norm.dist(μ+0.015, μ, σ, 1) - norm.dist(μ-0.015, μ, σ, 1) = 0.866
    • Determine k such that $P(|\hat{p} - p| < k) = 0.5$ for this smallest sample
      We can use the fact that if the part we want is the middle 50%, then the left border is the 25th percentile: norm.inv(0.25, μ, σ) = 0.393. Use this value to calculate k: 0.393 = 0.4 - k, so k = 0.007
      The second way is to use the z-distribution. It moves the mean to 0 and allows us to use z = norm.s.inv(0.25) = -0.67 = $(\hat{p} - p) / \sigma_{\hat{p}}$, so $0.67 \cdot \sigma_{\hat{p}} = k$
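Both examples can be reproduced with the sketch below, which swaps the spreadsheet functions for `statistics.NormalDist` (for `norm.dist`, `norm.inv`, `norm.s.inv`) and a direct binomial sum (for `binom.dist`). Variable names are illustrative, and the sample size 2401 is taken from the notes rather than recomputed.

```python
from math import comb, sqrt
from statistics import NormalDist

# Example 1: politician with p = 0.4 and a sample of size 20
n1, p1 = 20, 0.4
se1 = sqrt(p1 * (1 - p1) / n1)
p_normal = 1 - NormalDist(p1, se1).cdf(0.5)
# p_hat > 0.5 means more than 10 of the 20 sampled voters, i.e. 11 or more
p_exact = sum(comb(n1, k) * p1**k * (1 - p1) ** (n1 - k) for k in range(11, n1 + 1))

# Example 2: MTHFR with p = 0.4 and the smallest sample size from the notes
n2, p2 = 2401, 0.4
se2 = sqrt(p2 * (1 - p2) / n2)
sdsp = NormalDist(p2, se2)

p_outside = 1 - (sdsp.cdf(p2 + 0.005) - sdsp.cdf(p2 - 0.005))
p_inside = sdsp.cdf(p2 + 0.015) - sdsp.cdf(p2 - 0.015)

# k for the middle 50%: via the 25th percentile, and via the z-distribution
left_border = sdsp.inv_cdf(0.25)
k_from_border = p2 - left_border
k_from_z = -NormalDist().inv_cdf(0.25) * se2

print(f"normal approx: {p_normal:.3f}, exact binomial: {p_exact:.3f}")
print(f"se2 = {se2:.5f}, outside = {p_outside:.3f}, inside = {p_inside:.3f}")
print(f"k = {k_from_border:.4f} (border) / {k_from_z:.4f} (z)")
```

For the politician, the normal approximation (about 0.18) and the exact binomial sum (about 0.13) differ noticeably, which illustrates the remark that `norm.dist` is only an approximation when n is small. The two routes to k agree because they are the same calculation expressed in different units.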