Chapter 2.1 - 2.4

Reading mathematical expressions

Common expressions

Difference
$|x - 5| \le 6$ - the distance between x and 5 is less than or equal to 6
$x \in [-1, 11]$
$|x + 5| > 6$ - the distance between x and -5 is greater than 6
$x \in (-\infty, -11) \cup (1, +\infty)$
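
Reading an absolute value as a distance can be checked numerically. A minimal sketch; the helper name `within` is made up for illustration:

```python
def within(x, center, radius):
    """True when the distance between x and center is at most radius."""
    return abs(x - center) <= radius

# |x - 5| <= 6 holds exactly on [-1, 11]
print(within(-1, 5, 6), within(11, 5, 6), within(11.5, 5, 6))  # True True False

# |x + 5| > 6 means the distance from -5 exceeds 6: x < -11 or x > 1
print(not within(-12, -5, 6), not within(1, -5, 6), not within(2, -5, 6))  # True False True
```
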
Summation
$\sum_{i=1}^{n} t_i = t_1 + t_2 + \dots + t_{n-1} + t_n$
$\sum_{i=2}^{5} i^2 = 2^2 + 3^2 + 4^2 + 5^2 = 4 + 9 + 16 + 25 = 54$
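
The second sum can be reproduced in a few lines of Python, as a quick check of how the index runs from 2 to 5 inclusive:

```python
squares = [i**2 for i in range(2, 6)]  # range(2, 6) yields i = 2, 3, 4, 5
print(squares)       # [4, 9, 16, 25]
print(sum(squares))  # 54
```
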

Variables

n - sample size
N - population size
μ - population mean
$\bar{x}$ - sample mean

Proportions

A proportion is the fraction of observations/data values that have a certain characteristic, often expressed as a percentage

A sample statistic: $\hat{p} = \frac{\#\text{ observations}}{n}$, where n is the sample size
A population parameter: $p = \frac{\#\text{ observations}}{N}$, where N is the population size
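
A minimal sketch of the sample proportion, assuming "# observations" means the count of observations with the characteristic. The function name and the survey answers are made up for illustration:

```python
def proportion(values, has_characteristic):
    """Fraction of values for which has_characteristic(value) is True."""
    count = sum(1 for v in values if has_characteristic(v))
    return count / len(values)

# Hypothetical sample of n = 8 survey answers (made-up data).
sample = ["grad", "undergrad", "grad", "none", "undergrad", "grad", "none", "grad"]
p_hat = proportion(sample, lambda answer: answer == "grad")
print(p_hat)  # 0.5
```

The same function gives the population parameter p when it is applied to the whole population instead of a sample.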

Proportions of types of variables

Nominal - how many people are in grad school
Ordinal - how many people got first place
Interval - values between X and Y
Ratio - is 1/3rd

Variable classification

Identify the variables for which proportions are calculated
Classify the level of measurement for proportion
Caution: care must be taken when making comparisons with proportions across different data sets

Distributions

Descriptive statistics summarize without painting a full picture

Distributions and visualizations show us a glimpse of the data as a whole, but often fail to paint a full picture

Best practice

When analyzing data, create distributions and visualizations to develop a sense of the data

Distribution components

A distribution segments all of the data into classes that are exhaustive, mutually exclusive, and of equal length

A distribution provides information about the number of data values in each class.
Frequency distributions provide the raw count in each class
Relative frequency distributions provide the proportion of values in each class
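
A sketch of both kinds of distribution, assuming equal-width classes of the form [lower, upper). The function name and class boundaries are assumptions; the data values are borrowed from the quartile example in these notes:

```python
from collections import Counter

def frequency_distribution(values, start, width, n_classes):
    """Count values in n_classes equal-width classes [start + k*width, start + (k+1)*width)."""
    # Each value falls into exactly one class index, so the classes are mutually exclusive.
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in range(n_classes)
    }

# The classes must cover every value (exhaustive); values outside them would be dropped.
values = [17, 52, 76, 76, 81, 90]
freq = frequency_distribution(values, start=0, width=25, n_classes=4)
rel = {cls: count / len(values) for cls, count in freq.items()}
print(freq)  # {(0, 25): 1, (25, 50): 0, (50, 75): 1, (75, 100): 4}
```

The relative frequencies in `rel` are the raw counts divided by n, so they add up to 1.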

Example drawing

Pasted image 20260127123001.png|300
Classes - 4, each for one person

Important to note

The vertical axis starts at 180, so the ratios between bars look much bigger than they really are

!!!Comparing bar height is different than comparing sales performance!!!
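
A small sketch of why a truncated axis distorts ratios. Only the axis start of 180 comes from the notes; the two sales figures are hypothetical and are not read from the figure:

```python
def bar_height_ratio(a, b, axis_start):
    """Ratio of drawn bar heights when the vertical axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)

# Hypothetical sales figures, chosen only to illustrate the effect.
sales_a, sales_b = 200, 190
print(bar_height_ratio(sales_a, sales_b, axis_start=180))  # 2.0
print(bar_height_ratio(sales_a, sales_b, axis_start=0))    # about 1.05
```

With the axis starting at 180, one bar is drawn twice as tall as the other even though the underlying values differ by only about 5%.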

Fixed bar graph
Pasted image 20260127123406.png|300

Naming

Space in between bars - bar graph
No space in between bars - histogram

Norms for visualization

Use histograms for continuous quantitative variables
Use bar graphs for qualitative and discrete variables

For interval variables, indicate the level and be cautious about making ratio comparisons
For ratio variables, have the vertical axis start at zero
if it does not start at zero, the variable is rendered as interval
Communicate clearly

Box plots

Define four classes so that each class has approximately the same proportion of observations
it might not be possible to split the data into 4 equal classes when the number of observations is not divisible by 4

Quartiles

First quartile $Q_1$ - a value at which approximately 25% of data values are less than or equal to the value
Second quartile $Q_2$ - a value at which approximately 50% of data values are less than or equal to the value
Third quartile $Q_3$ - a value at which approximately 75% of data values are less than or equal to the value

To have four classes, we need 5 boundary points. The other two are:

  1. Sample minimum
  2. Sample maximum
Quartile computation
  1. Sort data values in ascending order
  2. Determine how many observations will need to be less than or equal to the value. We call this the rank: $R = p \cdot n$, where p is the fraction (0.25, 0.5, or 0.75) and n is the sample size
    If the rank is a whole number, we use $Q = \frac{x_R + x_{R+1}}{2}$
  3. Otherwise, round the rank up to the next whole number and take the data value at that position

Example
data set - {17,52,76,76,81,90}
size - n=6
quartile 1:
rank - $R_1 = 0.25 \cdot 6 = 1.5 \to 2$
$Q_1 = 52$
quartile 2:
rank - $R_2 = 0.5 \cdot 6 = 3$
$Q_2 = \frac{76 + 76}{2} = 76$
the rank is a whole number, so we average the 3rd and 4th values, which are both 76. Four of the six values (about 67%) are then less than or equal to $Q_2$, which is the reason for "approximately" in the definition
quartile 3:
rank - $R_3 = 0.75 \cdot 6 = 4.5 \to 5$
$Q_3 = 81$
min: 17
max: 90
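
The rank rule and the worked example above can be sketched in Python. This is one possible reading of the rule in these notes, not a standard library routine. The function names are made up, and the rank is kept as a whole part plus remainder so that no floating-point rounding is involved:

```python
def quartile(sorted_values, k):
    """k-th quartile (k = 1, 2, 3) of an ascending list, using the rank rule from these notes."""
    n = len(sorted_values)
    # Rank R = (k / 4) * n, split into whole part and remainder.
    whole, rest = divmod(k * n, 4)
    if rest == 0:
        # Whole rank: average x_R and x_{R+1} (1-based positions).
        return (sorted_values[whole - 1] + sorted_values[whole]) / 2
    # Fractional rank: round up; 1-based position whole + 1 is 0-based index whole.
    return sorted_values[whole]


def five_number_summary(values):
    xs = sorted(values)
    return xs[0], quartile(xs, 1), quartile(xs, 2), quartile(xs, 3), xs[-1]


summary = five_number_summary([17, 52, 76, 76, 81, 90])
print(summary)  # (17, 52, 76.0, 81, 90)
```

Other tools (spreadsheets, statistics libraries) often use different interpolation rules, so their quartiles may differ slightly from the ones computed here.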
Example
Pasted image 20260127130044.png
Which set has the greatest proportion of values greater than 50?
Data set II
Which set has the largest value? - Data set III
Which set has the smallest box? - Data set II
Size of the box - inter-quartile range
Which set has the greatest number of values above 80?
we cannot say, as the boxes show percentages, and not where the values are on the plot
Which set has the most tightly packed data - we can't say
Which set is the most symmetric - we can't say

Percentiles

We use percentiles to generalize quartiles
The third quartile is a value such that approximately 75% of data values are less than or equal to it. The same goes for the other quartiles

Percentiles allow us to specify any percentage between 0 and 100

The percentile is calculated exactly like the quartile
81st percentile:
Rank - $R = 0.81 \cdot 6 = 4.86 \to 5$
81st percentile = 81 (which is equal to $Q_3$)

Examples

1 - 81st percentile of the first 200 positive even numbers
$R = 0.81 \cdot 200 = 162$
162nd value - 324
163rd value - 326
81st percentile $= \frac{324 + 326}{2} = 325$
2 - construct a data set with indicated size and Five Number Summary
a) n = 6, min = 1, Q1=4, Q2=5, Q3=7, max = 10
$R_1 = 0.25 \cdot 6 = 1.5 \to 2$
$R_2 = 0.5 \cdot 6 = 3$
$R_3 = 0.75 \cdot 6 = 4.5 \to 5$
{1, 4, 4, 6, 7, 10}
b) n = 7, min = 100, Q1=108, Q2=122, Q3=125, max = 126
$R_1 = 0.25 \cdot 7 = 1.75 \to 2$
$R_2 = 0.5 \cdot 7 = 3.5 \to 4$
$R_3 = 0.75 \cdot 7 = 5.25 \to 6$
{100, 108, 110, 122, 124, 125, 126}
c) n = 8, min = -1, Q1=0, Q2=5, Q3=5, max = 15
$R_1 = 0.25 \cdot 8 = 2$
$R_2 = 0.5 \cdot 8 = 4$
$R_3 = 0.75 \cdot 8 = 6$
{-1, 0, 0, 5, 5, 5, 5, 15}
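3 - construct a data set with 15 entries so that the 34th percentile is 50 and the 73rd percentile is 84
$R_{34} = 0.34 \cdot 15 = 5.1 \to 6$, so the 6th value is 50
$R_{73} = 0.73 \cdot 15 = 10.95 \to 11$, so the 11th value is 84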
{1, 2, 3, 4, 5, 50, 51, 52, 53, 54, 84, 85, 86, 87, 88}
4 - What is the smallest size of a set so that it has different values for each whole number percentile between the 31st and the 40th?
Using Excel - 93
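
The answers to examples 1 to 3 can be checked with a small percentile function. This is a sketch of the rank rule as used in these notes, assuming a whole-number percentile p with 0 < p < 100. The function name is made up, and spreadsheet functions may use a different rule:

```python
def percentile(values, p):
    """p-th percentile for a whole-number p with 0 < p < 100, using the rank rule from these notes."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(values)
    # Rank R = (p / 100) * n, kept as whole part and remainder so no float rounding occurs.
    whole, rest = divmod(p * len(xs), 100)
    if rest == 0:
        return (xs[whole - 1] + xs[whole]) / 2   # whole rank: average x_R and x_{R+1}
    return xs[whole]                             # fractional rank: round up


evens = [2 * i for i in range(1, 201)]           # first 200 positive even numbers
print(percentile(evens, 81))                     # 325.0

set_a = [1, 4, 4, 6, 7, 10]
print([percentile(set_a, p) for p in (25, 50, 75)])   # [4, 5.0, 7]

set_3 = [1, 2, 3, 4, 5, 50, 51, 52, 53, 54, 84, 85, 86, 87, 88]
print(percentile(set_3, 34), percentile(set_3, 73))   # 50 84
```
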