2.8, 2.9 - Measuring grouped data

Measures of Median and Mean on Grouped Data

Section 2.7 (Connecting measures together) focused on calculating descriptive statistics on raw data, where we see every single sample value. Now we'll focus on calculating descriptive statistics on grouped data.

Mean and median of data in a Frequency Table

Student Score	Frequency
3	1
4	1
5	3
6	5
7	5
8	7
9	5
10	3

We could recreate the original data set $\{3, 4, 5, 5, 5, \dots\}$, but for large data sets that is slow and error-prone. Instead, we weight each value $x_{i}$ by its frequency $f_{i}$.

Sample mean

\bar{x} = \frac{\sum x_{i}}{n} = \frac{\sum (x_{i} \cdot f_{i})}{\sum f_{i}}

Population mean

\mu = \frac{\sum x_{i}}{N} = \frac{\sum (x_{i} \cdot f_{i})}{\sum f_{i}}
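As a minimal sketch of the frequency-weighted mean, the Python snippet below works directly on the score table above without expanding it into raw data. The median helper is an illustrative assumption (the name `grouped_median` is hypothetical): it walks the cumulative frequencies to locate the middle position(s).

```python
# Frequency table from the text: score -> frequency
freq = {3: 1, 4: 1, 5: 3, 6: 5, 7: 5, 8: 7, 9: 5, 10: 3}

# Mean: sum of (value * frequency) divided by the sum of frequencies
n = sum(freq.values())
mean = sum(x * f for x, f in freq.items()) / n


def grouped_median(freq):
    """Median of a frequency table, found via cumulative frequencies."""
    n = sum(freq.values())
    # 1-based positions of the middle value(s); they coincide when n is odd
    lo_pos, hi_pos = (n + 1) // 2, n // 2 + 1
    cum = 0
    lo_val = None
    for x in sorted(freq):
        cum += freq[x]
        if lo_val is None and cum >= lo_pos:
            lo_val = x
        if cum >= hi_pos:
            return (lo_val + x) / 2
    raise ValueError("empty frequency table")


med = grouped_median(freq)
print(f"n = {n}, mean = {mean:.4f}, median = {med}")
```

With 30 scores, the median is the average of the 15th and 16th ordered values, which the cumulative counts place at 7 and 8.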

Mean and median of data in a relative frequency table

Sample mean

\bar{x} = \frac{\sum x_{i}}{n} = \sum \frac{x_{i} \cdot f_{i}}{n} \overset{\frac{f_{i}}{n} = P (x_{i})}{=} \sum (x_{i} \cdot P (x_{i}))

Population mean

\mu = \sum (x_{i} \cdot P (x_{i}))
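The equivalence between the two forms can be checked numerically. This sketch converts the score table's frequencies into relative frequencies $P(x_{i}) = f_{i} / n$ and sums $x_{i} \cdot P(x_{i})$; the variable names are illustrative.

```python
# Frequency table from the text: score -> frequency
freq = {3: 1, 4: 1, 5: 3, 6: 5, 7: 5, 8: 7, 9: 5, 10: 3}
n = sum(freq.values())

# Relative frequency of each score: P(x_i) = f_i / n
rel = {x: f / n for x, f in freq.items()}

# Mean from relative frequencies: no further division is needed,
# because the relative frequencies already sum to 1
mean_rel = sum(x * p for x, p in rel.items())

# Mean from raw frequencies, for comparison
mean_freq = sum(x * f for x, f in freq.items()) / n

print(f"{mean_rel:.4f} vs {mean_freq:.4f}")
```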

Weighted measures

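Weighted measures work the same way as the relative-frequency formulas, with the weight $w_{i}$ taking the place of the relative frequency $P (x_{i})$. If the weights do not sum to 1, divide by their total: $\bar{x}_{w} = \frac{\sum (x_{i} \cdot w_{i})}{\sum w_{i}}$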
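A short sketch of a weighted mean follows. The scores and weights are hypothetical, made up purely for illustration, and the function name `weighted_mean` is an assumption. Dividing by the total weight lets the same function handle weights that do not already sum to 1.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(value * weight) / sum(weight)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical example: three component scores with weights 2 : 3 : 5
scores = [80, 90, 70]
weights = [2, 3, 5]
result = weighted_mean(scores, weights)
print(result)
```

When the weights are relative frequencies, the total is 1 and the function reduces to $\sum (x_{i} \cdot P (x_{i}))$.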

Central measures on grouped data with loss of information

Previously, we only worked with data whose exact values were known to us. Now we'll focus on situations where we are given only intervals of data.

Age Intervals	Frequency	Relative Frequency
[20, 30)	1441	$\frac{1441}{25466} = 5.66\%$
[30, 40)	2477	$9.73\%$
[40, 50)	4971	$19.52\%$
[50, 60)	7438	$29.21\%$
[60, 70)	6367	$25.00\%$
[70, 80)	2314	$9.09\%$
80+	458	$1.80\%$
TOTAL	25466	$100\%$

We cannot compute the exact mean or median, because we do not know the exact values inside each age interval. We can only estimate them.

Median

We find the interval in which the $50^{\text{th}}$ percentile lands, and then take the midpoint $m_{i}$ of that interval as the estimate. This is a crude measure that can miss the true median by a fair margin, but without the individual values we can only approximate.
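The procedure can be sketched in Python on the age table. This is a minimal illustration, assuming each age band covers ten years (so ages 20 through 29 are the half-open interval [20, 30)); the open-ended 80+ band is stored with an upper bound of `None`. The function name `median_interval` is hypothetical.

```python
# (lower bound, upper bound, frequency); None marks the open-ended 80+ band.
# Assumption: each band is a half-open ten-year interval, e.g. [20, 30).
intervals = [
    (20, 30, 1441),
    (30, 40, 2477),
    (40, 50, 4971),
    (50, 60, 7438),
    (60, 70, 6367),
    (70, 80, 2314),
    (80, None, 458),
]


def median_interval(intervals):
    """Return the (lower, upper) bounds of the interval holding the 50th percentile."""
    n = sum(f for _, _, f in intervals)
    cum = 0
    for lower, upper, f in intervals:
        cum += f
        if cum >= n / 2:
            return lower, upper
    raise ValueError("empty table")


lower, upper = median_interval(intervals)
# Midpoint estimate; only defined when the interval has an upper bound
median_estimate = (lower + upper) / 2 if upper is not None else None
print(lower, upper, median_estimate)
```

Here the cumulative frequency first reaches half of 25466 inside the [50, 60) band, so the midpoint estimate of the median is 55.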

Mean

To calculate the mean, we use the midpoint $m_{i}$ of each interval, multiply it by the relative frequency $P (m_{i})$, and sum those products.

Sample mean

\bar{x} \approx \frac{\sum (m_{i} \cdot f_{i})}{\sum f_{i}} = \sum (m_{i} \cdot P (m_{i}))

Population mean

\mu \approx \sum (m_{i} \cdot P (m_{i}))
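The midpoint estimate of the mean can be sketched as follows. Two assumptions are made here that are not in the table itself: each band is treated as a half-open ten-year interval (midpoints 25, 35, ...), and the open-ended 80+ band is given a midpoint of 85, as if it were [80, 90). A different choice for the last band would shift the estimate slightly.

```python
# Midpoints m_i and frequencies f_i for the age table.
# Assumptions: ten-year half-open bands; 80+ treated as [80, 90) -> midpoint 85.
midpoints = [25, 35, 45, 55, 65, 75, 85]
freqs = [1441, 2477, 4971, 7438, 6367, 2314, 458]

n = sum(freqs)

# Form 1: sum(m_i * f_i) / sum(f_i)
mean_from_freq = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Form 2: sum(m_i * P(m_i)), with P(m_i) = f_i / n
rel = [f / n for f in freqs]
mean_from_rel = sum(m * p for m, p in zip(midpoints, rel))

print(f"{mean_from_freq:.2f} {mean_from_rel:.2f}")
```

Both forms give the same estimate up to floating-point rounding, which is the equality stated in the formula above.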

Variance and Standard Deviation on Grouped Data

Measures of spread

When we take a sample from a population, the variance and standard deviation calculated with the population formulas (dividing by $n$) tend to underestimate the actual population values. For this reason, the sample formulas divide by $n - 1$ instead.
For grouped data, we again use the midpoint $m_{i}$ of each interval to calculate the measures.

Variance for sample data

s^{2} = \frac{\sum [(m_{i} - \bar{x})^{2} \cdot f_{i}]}{\sum (f_{i}) - 1}

Standard deviation for sample data

s = \sqrt{\frac{\sum [(m_{i} - \bar{x})^{2} \cdot f_{i}]}{\sum (f_{i}) - 1}} = \sqrt{s^{2}}
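A minimal Python sketch of these two formulas, applied to the age table, is shown below. As before, the midpoints assume ten-year half-open bands and treat the open-ended 80+ band as [80, 90); the function name `grouped_sample_variance` is hypothetical.

```python
from math import sqrt


def grouped_sample_variance(midpoints, freqs):
    """Sample variance from interval midpoints and frequencies (n - 1 denominator)."""
    n = sum(freqs)
    if n < 2:
        raise ValueError("need at least two observations")
    # Grouped mean estimate: sum(m_i * f_i) / sum(f_i)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    # Squared deviations from the mean, weighted by frequency
    ss = sum((m - mean) ** 2 * f for m, f in zip(midpoints, freqs))
    return ss / (n - 1)


# Assumptions: ten-year half-open bands; 80+ treated as [80, 90) -> midpoint 85
midpoints = [25, 35, 45, 55, 65, 75, 85]
freqs = [1441, 2477, 4971, 7438, 6367, 2314, 458]

s2 = grouped_sample_variance(midpoints, freqs)
s = sqrt(s2)
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
```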

These formulas can also be rewritten for relative frequency tables, but the rewrite is clean only for populations. For samples, the $- 1$ in the denominator means that relative frequencies alone are not enough: the sample size $n$ is still needed.
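
As a short worked derivation (a sketch using the same symbols as above, with $N = \sum f_{i}$ and $P (m_{i}) = \frac{f_{i}}{N}$), the population variance on grouped data can be written entirely in terms of relative frequencies:

\sigma^{2} \approx \frac{\sum [(m_{i} - \mu)^{2} \cdot f_{i}]}{N} = \sum [(m_{i} - \mu)^{2} \cdot P (m_{i})]

For a sample, with $n = \sum f_{i}$, the same substitution leaves a factor that still depends on $n$:

s^{2} = \frac{\sum [(m_{i} - \bar{x})^{2} \cdot f_{i}]}{n - 1} = \frac{n}{n - 1} \sum [(m_{i} - \bar{x})^{2} \cdot P (m_{i})]

So a relative frequency table is sufficient for the population formula, while the sample formula additionally requires knowing $n$.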