2.6 - Measures of dispersion

Dispersion refers to how varied the data in a data set are, or how spread out their distribution is.
There are many ways to measure dispersion.

Range

The most straightforward measure to calculate.

range = maximum - minimum

Because we are subtracting values, the data must be quantitative (interval or ratio).

Interquartile Range (IQR)

The IQR is the range of the middle $50\%$ of the values in a distribution.
It is equal to the length of the box in a box plot.

IQR = 75^{\text{th}}\ \text{percentile} - 25^{\text{th}}\ \text{percentile} = Q_{3} - Q_{1}
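As a rough illustration, the range and the IQR can be computed with Python's standard library. The data set below is chosen only as an example. Quartile conventions differ between textbooks and software, so this sketch assumes the `inclusive` method of `statistics.quantiles`; other conventions may give slightly different quartiles.

```python
import statistics

# Example data, chosen only for illustration
data = [2, 3, 4, 9, 17]

# Range: maximum minus minimum
data_range = max(data) - min(data)

# n=4 returns the three cut points [Q1, Q2, Q3].
# Assumption: the "inclusive" method; other conventions can give other quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range)  # 15
print(iqr)         # 6.0
```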

Deviation

The previous measures take only some of the values into account (min and max for the range, $Q_{1}$ and $Q_{3}$ for the IQR).
A way to include all of the data is to measure how far each value $x$ is from some specific value $v$.
The most logical and common choice for $v$ is the mean.
After calculating all of the deviations, we usually take their average.

\frac{\sum (x_{i} - \bar{x})}{n}
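Using the definition of the mean, $\bar{x} = \frac{1}{n} \sum x_{i}$, this average can be simplified. The short derivation below holds for any data set:

\frac{\sum (x_{i} - \bar{x})}{n} = \frac{\sum x_{i} - n \bar{x}}{n} = \frac{n \bar{x} - n \bar{x}}{n} = 0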

Mean Absolute Deviation (MAD)

As the Example below shows, the average deviation from the mean is always zero, because the positive and negative deviations cancel. But the set doesn't have zero spread, does it?
To fix that, we usually use the mean absolute deviation (MAD). It is the average distance of the values from the central value $v$.

MAD = \frac{1}{n} \sum | x_{i} - v |

Minimization of MAD

The MAD is minimized if we take the median as the value $v$.

The difference between deviation and absolute deviation is like the difference between displacement and distance: deviations carry a sign, absolute deviations do not.

Example

Data set $\{2, 3, 4, 9, 17\}$
mean = $7$
deviations from the mean:
$2 - 7 = - 5$
$3 - 7 = - 4$
$4 - 7 = - 3$
$9 - 7 = 2$
$17 - 7 = 10$
average deviation = $\frac{- 5 + (- 4) + (- 3) + 2 + 10}{5} = \frac{0}{5} = 0$
mean absolute deviation = $\frac{5 + 4 + 3 + 2 + 10}{5} = \frac{24}{5} = 4.8$
We can also calculate the deviations from the median.
median = $4$
deviations from the median:
$2 - 4 = - 2$
$3 - 4 = - 1$
$4 - 4 = 0$
$9 - 4 = 5$
$17 - 4 = 13$
average deviation = $\frac{- 2 + (- 1) + 0 + 5 + 13}{5} = \frac{15}{5} = 3$
mean absolute deviation = $\frac{2 + 1 + 0 + 5 + 13}{5} = \frac{21}{5} = 4.2$
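The arithmetic in this example can be reproduced with a short Python sketch. The helper names `average_deviation` and `mean_absolute_deviation` are introduced here only for illustration. The final grid search is a numerical spot check on this one data set, not a proof that the median minimizes the MAD.

```python
import statistics

data = [2, 3, 4, 9, 17]

def average_deviation(values, v):
    """Average signed deviation of the values from the reference value v."""
    return sum(x - v for x in values) / len(values)

def mean_absolute_deviation(values, v):
    """Average absolute deviation (distance) of the values from v."""
    return sum(abs(x - v) for x in values) / len(values)

mean = statistics.mean(data)      # 7
median = statistics.median(data)  # 4

avg_dev_mean = average_deviation(data, mean)          # 0.0
mad_mean = mean_absolute_deviation(data, mean)        # 4.8
avg_dev_median = average_deviation(data, median)      # 3.0
mad_median = mean_absolute_deviation(data, median)    # 4.2

# Spot check: try v = 0.0, 0.1, ..., 20.0 and keep the value with the smallest MAD
candidates = [k / 10 for k in range(0, 201)]
best_v = min(candidates, key=lambda v: mean_absolute_deviation(data, v))

print(avg_dev_mean, mad_mean, avg_dev_median, mad_median)
print(best_v)  # 4.0, the median
```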

Variance

The distance between two points on a plane is equal to

d = \sqrt{(x_{1} - x_{2})^{2} + (y_{1} - y_{2})^{2}}

In the same way, the distance between a central value $v$ and the whole data set is

d = \sqrt{(x_{1} - v)^{2} + (x_{2} - v)^{2} + \dots + (x_{n} - v)^{2}}

The expression under the square root is the sum of the squared deviations from $v$.

For the MAD, the median minimizes the sum of the absolute deviations. This is not the case for squared deviations, where the mean gives the smallest value.
For this reason, we define the variance of a population as the average of the squared deviations from the mean.

σ^{2} = \frac{\sum (x - μ)^{2}}{N}

$σ^{2}$ is the population variance, $μ$ is the population mean, and $N$ is the population size.
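One way to see why the mean gives the smallest sum of squared deviations is a short calculus argument. This is a sketch for a data set of $n$ values with mean $\bar{x}$, where $S(v)$ is a label introduced here only for that sum:

S(v) = \sum (x_{i} - v)^{2}

\frac{dS}{dv} = -2 \sum (x_{i} - v) = 0 \implies \sum x_{i} = n v \implies v = \frac{\sum x_{i}}{n} = \bar{x}

Since $\frac{d^{2}S}{dv^{2}} = 2n > 0$, this stationary point is a minimum.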

Variance in large populations

For large populations, we usually cannot measure every member, so we work with sample statistics.
When we average the squared deviations from the sample mean, the result tends to underestimate the variance of the larger population, most noticeably for small samples.
Because of this, the sample variance formula divides by $n - 1$ instead of $n$ to account for that difference.

s^{2} = \frac{\sum (x - \bar{x})^{2}}{n - 1}
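The effect of the two denominators can be seen on a small sample. The sketch below uses an example data set chosen only for illustration and compares a manual calculation with `statistics.pvariance` (divides by $n$) and `statistics.variance` (divides by $n - 1$) from Python's standard library.

```python
import statistics

# Example sample, chosen only for illustration
sample = [2, 3, 4, 9, 17]
n = len(sample)
x_bar = sum(sample) / n

# Sum of squared deviations from the sample mean
ss = sum((x - x_bar) ** 2 for x in sample)

divide_by_n = ss / n                 # population-style formula: 30.8
divide_by_n_minus_1 = ss / (n - 1)   # sample variance s^2: 38.5

# The standard library implements both conventions
assert abs(statistics.pvariance(sample) - divide_by_n) < 1e-9
assert abs(statistics.variance(sample) - divide_by_n_minus_1) < 1e-9

print(divide_by_n, divide_by_n_minus_1)
```

Dividing by $n - 1$ always gives the larger of the two values, which is the direction of the correction described above.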

Standard deviation

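The variance is a powerful measure, but it is hard to interpret because it is expressed in squared units. For that reason, we introduce the standard deviation, which is the square root of the variance and has the same units as the data.
We write $σ$ for the population standard deviation and $s$ for the sample standard deviation.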
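A minimal sketch of this relationship, again on an example data set chosen only for illustration: taking the square root of each variance gives the same result as the standard library's `statistics.pstdev` and `statistics.stdev`.

```python
import math
import statistics

# Example data, chosen only for illustration
data = [2, 3, 4, 9, 17]

sigma = math.sqrt(statistics.pvariance(data))  # population standard deviation
s = math.sqrt(statistics.variance(data))       # sample standard deviation

print(round(sigma, 2))  # 5.55
print(round(s, 2))      # 6.2
```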

End remark

It is worth noting that all of the measures discussed (range, IQR, MAD, variance, standard deviation) are always non-negative.
Additionally, they are all equal to 0 for constant data sets such as $\{1, 1, 1, 1\}$.
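The constant data set can be checked directly. This sketch assumes the `inclusive` quartile method of `statistics.quantiles` for the IQR and takes the MAD about the mean; for a constant set the choice of convention does not change the result.

```python
import statistics

constant = [1, 1, 1, 1]

data_range = max(constant) - min(constant)
q1, _, q3 = statistics.quantiles(constant, n=4, method="inclusive")
iqr = q3 - q1
center = statistics.mean(constant)
mad = sum(abs(x - center) for x in constant) / len(constant)
variance = statistics.pvariance(constant)
std_dev = statistics.pstdev(constant)

measures = [data_range, iqr, mad, variance, std_dev]
print(all(m == 0 for m in measures))  # True
```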

Visualization of mean and standard deviation

For the data set

\{4, 7, 13, 21, 28, 33, 42, 61\}

$μ = 26.125$ and $σ \approx 17.93$
What should we do to the set if we want to keep the mean constant but decrease the standard deviation?
We move values on opposite sides of the mean closer to it by equal amounts, so the sum of the data (and therefore the mean) does not change.

\{4_{+3}, 7_{+5}, 13, 21, 28, 33, 42_{-5}, 61_{-3}\}

$μ = 26.125$ and $σ \approx 15.6$, so the goal was met.
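These figures can be verified with a short sketch using the population standard deviation (`statistics.pstdev`). The `shifts` list is a name introduced here to encode the adjustments from the example.

```python
import statistics

original = [4, 7, 13, 21, 28, 33, 42, 61]

# Adjustments from the example: +3 and +5 below the mean, -5 and -3 above it.
# They sum to zero, so the mean cannot change.
shifts = [3, 5, 0, 0, 0, 0, -5, -3]
adjusted = [x + d for x, d in zip(original, shifts)]

mean_before = statistics.mean(original)     # 26.125
mean_after = statistics.mean(adjusted)      # 26.125
sigma_before = statistics.pstdev(original)  # about 17.93
sigma_after = statistics.pstdev(adjusted)   # about 15.61

print(adjusted)
print(mean_before, mean_after)
print(round(sigma_before, 2), round(sigma_after, 2))
```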