8 - Bivariate Data and Correlation

Multivariate Data

In life, most decisions involve multiple factors. We weigh these factors by importance as we discern. There are often relationships between the various factors.

How do we decide how to weigh each factor?
How do we identify relationships between factors?

One answer is to collect data and take measurements for each factor
In other words, we need to collect and analyze multivariate data

Bivariate data can be shown as scatter plots

Example scatter plot

Positive association - goes upward over the entire plot
Negative association - goes downward over the entire plot

We will focus only on linear relationships

Definitions

If there is a relationship between the variables in a bivariate dataset, there is an association between the variables

Positive association between two variables is when

as one variable increases, the other increases
as one variable decreases, the other decreases

Negative association between two variables is when

as one variable increases, the other decreases

Two variables are correlated if there is a positive or negative association between them

If an association is linear, then the variables are linearly correlated either positively or negatively

When $x = \text{const}$, there is no correlation: $x$ does not vary, so there is no relationship between $x$ and $y$

The more packed the scatter points are around the line, the stronger the correlation

Measuring Linear Correlation


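A point with $(x - \bar{x}) (y - \bar{y}) > 0$ lies above and to the right, or below and to the left, of $(\bar{x}, \bar{y})$ and contributes positively to the sum

r = \frac{1}{n - 1} \sum \frac{x - \bar{x}}{s_{x}} \frac{y - \bar{y}}{s_{y}}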

$r$ is the correlation coefficient

for $r < 0$ - negative association
for $r > 0$ - positive association

Pearson's Correlation Coefficient

Pearson's Correlation Coefficient $r$ is a measure of the strength of a linear association

r = \frac{1}{n - 1} \sum \frac{(x - \bar{x})}{s_{x}} \frac{(y - \bar{y})}{s_{y}}

The calculation assumes there is a linear association

Do not compute $r$ without evidence that a linear association is reasonable (for example, by inspecting the scatter plot)!
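The formula for $r$ can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pearson_r` and the hours/scores data are hypothetical, and the sample standard deviations use the $n - 1$ divisor to match the formula above.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r as the average product of standardized deviations (n - 1 divisor)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        # A constant variable has no spread, so r is undefined
        raise ValueError("r is undefined when one variable is constant")
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
r = pearson_r(hours, scores)
print(round(r, 3))  # 0.981
```

The value is close to 1, which agrees with the upward, tightly packed trend these made-up points would show on a scatter plot.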

Key properties of $r$ :
Independent of the units of measurements in either variable
$(x, y)$ produces the same value as $(y, x)$
$r \in [- 1, 1]$
If $r \approx 0$ there is no linear association
$| r | = 1$ iff the data is perfectly linear
The sign of $r$ indicates the type of association (positive for $r > 0$ , and negative for $r < 0$ )

As a working rule, we treat $| r | < 0.3$ as no linear association
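Several of the key properties of $r$ can be checked numerically. The sketch below uses hypothetical height and weight measurements and a helper named `pearson_r` (an assumed name, written in a form algebraically equal to the $n - 1$ formula) to illustrate unit independence, symmetry, and $| r | = 1$ for perfectly linear data.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r; the (n - 1) factors cancel, leaving sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 88]

# Same data in different units (inches, pounds)
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.20462 for w in weights_kg]

r_metric = pearson_r(heights_cm, weights_kg)
r_imperial = pearson_r(heights_in, weights_lb)   # unit change leaves r unchanged
r_swapped = pearson_r(weights_kg, heights_cm)    # swapping x and y leaves r unchanged

# Perfectly linear data gives |r| = 1, with the sign matching the slope
r_up = pearson_r(heights_cm, [2 * h + 1 for h in heights_cm])
r_down = pearson_r(heights_cm, [-0.5 * h + 300 for h in heights_cm])

print(math.isclose(r_metric, r_imperial), math.isclose(r_metric, r_swapped))
print(round(r_up, 6), round(r_down, 6))
```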

Exercises

If the original bivariate data set was corrupted, and only $x - \bar{x}$ and $y - \bar{y}$ tables remained:

Can the original data set be reconstructed - NO
Can $r$ still be computed - YES
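Both answers can be illustrated with a short sketch. The deviation tables below are hypothetical, and `r_from_deviations` is an assumed helper name. It works because $s_{x}$ and $s_{y}$ depend only on the deviations, so $r$ never needs the means themselves. The two made-up "originals" show why reconstruction fails: any choice of means gives the same deviation tables.

```python
import math

def r_from_deviations(dx, dy):
    """Pearson's r from the tables dx = x - x_bar and dy = y - y_bar alone."""
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    return sxy / math.sqrt(sxx * syy)

def deviations(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Hypothetical surviving tables
dx = [-2, -1, 0, 1, 2]
dy = [-12, -4, -3, 6, 13]
r = r_from_deviations(dx, dy)

# Two different candidate originals (means chosen arbitrarily) with identical deviations
xs_a, ys_a = [d + 3 for d in dx], [d + 64 for d in dy]
xs_b, ys_b = [d + 100 for d in dx], [d + 20 for d in dy]

print(round(r, 3))                             # 0.981
print(deviations(xs_a) == deviations(xs_b))    # True
```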

Linear Regression

Linear Equation Formula

$\hat{y} = \hat{m} x + \hat{b}$

The hats denote predicted (estimated) values

Assessing Linear Models

For most data points there is a disparity between $y_{i}$ (the measured value) and ${\hat{y}}_{i}$ (the predicted value)
This disparity is called the error at $x_{i}$ : $e_{i} = y_{i} - {\hat{y}}_{i}$
A natural choice for the best line would be the line that has the least error among all the possible lines
We want a measure that incorporates all of the errors without cancellation
For a single best line to exist, that measure must be minimized by exactly one line
The sum of the squared errors is such a measure

SSE = \sum e_{i}^{2} = \sum (y_{i} - {\hat{y}}_{i})^{2} = \sum {(y_{i} - (\hat{m} x_{i} + \hat{b}))}^{2}
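As a small illustration of why the errors are squared, the sketch below evaluates two candidate lines on hypothetical data. The function name `sse` and the slope/intercept values are assumptions chosen for the example. For the second candidate the raw errors cancel to zero even though the line misses several points, while the SSE still registers the misses.

```python
def errors(xs, ys, m, b):
    """Error e_i = y_i - (m * x_i + b) at each data point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

def sse(xs, ys, m, b):
    """Sum of squared errors for the candidate line y = m * x + b."""
    return sum(e ** 2 for e in errors(xs, ys, m, b))

# Hypothetical data: hours studied vs. exam score
xs = [1, 2, 3, 4, 5]
ys = [52, 60, 61, 70, 77]

sse_first = sse(xs, ys, 5.0, 50.0)               # candidate line y = 5x + 50
sse_second = sse(xs, ys, 6.0, 46.0)              # candidate line y = 6x + 46
raw_sum_second = sum(errors(xs, ys, 6.0, 46.0))  # positive and negative errors cancel

print(sse_first, sse_second, raw_sum_second)  # 29.0 14.0 0.0
```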

Selecting the Best Line

For a given data set, the SSE can be understood as a function of $\hat{m}$ and $\hat{b}$ that can be minimized
Such optimization problems can be solved using calculus or linear algebra
We state the solution here without derivation
For a given data set, the line of best fit is a unique line that minimizes the SSE. The line is given by

\hat{y} = r \frac{s_{y}}{s_{x}} x + (\bar{y} - r \frac{s_{y}}{s_{x}} \bar{x})

The slope and y-intercept can be calculated in Excel using =linest()

Every line of best fit goes through

(\bar{x}, \bar{y})
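The slope $r \frac{s_{y}}{s_{x}}$ and the matching intercept can be computed directly. The following is a sketch on hypothetical data using only Python's standard library; the variable names are assumptions. It also checks that the fitted line passes through $(\bar{x}, \bar{y})$ and that nudging the slope or intercept does not lower the SSE.

```python
import statistics

# Hypothetical data: hours studied vs. exam score
xs = [1, 2, 3, 4, 5]
ys = [52, 60, 61, 70, 77]

n = len(xs)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs (n - 1)

r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

m_hat = r * s_y / s_x            # slope of the line of best fit
b_hat = y_bar - m_hat * x_bar    # intercept, so the line passes through the means

def sse(m, b):
    """Sum of squared errors for the line y = m * x + b on the data above."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

print(round(m_hat, 6), round(b_hat, 6))   # 6.0 46.0
print(round(m_hat * x_bar + b_hat, 6))    # 64.0, which equals y_bar
print(sse(m_hat, b_hat) <= sse(m_hat + 0.1, b_hat))  # True
```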

Coefficient of Determination

Just because the line is the best line does not mean it is good
We need a measure of how well the line fits the data
How much of the variation in the dependent variable is explained by the line of best fit?
We measure the total variation of $y$

Total Variation = \sum (y_{i} - \bar{y})^{2}

We measure the explained variation

Explained Variation = Total Variation - SSE

The coefficient of determination $R^{2}$ is the proportion of variation in the dependent variable that is explained by the line of best fit

R^{2} = 1 - \frac{SSE}{Total Variation} = 1 - \frac{\sum (y_{i} - {\hat{y}}_{i})^{2}}{\sum (y_{i} - \bar{y})^{2}}, R^{2} \in [0, 1]

If $R^{2}$ is close to zero, the line explains almost none of the variation in $y$ , so the model is poor
If $R^{2}$ is close to one, the line explains nearly all of the variation in $y$ , so the model is very good
$R^{2} = 1$ means that the data set is perfectly linear

R^{2} = r^{2}
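The identity $R^{2} = r^{2}$ can be checked numerically for the line of best fit. The sketch below uses hypothetical data and assumed variable names; the slope is written as $S_{xy} / S_{xx}$, which is algebraically the same as $r \frac{s_{y}}{s_{x}}$.

```python
import math

# Hypothetical data: hours studied vs. exam score
xs = [1, 2, 3, 4, 5]
ys = [52, 60, 61, 70, 77]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Line of best fit
m_hat = sxy / sxx
b_hat = y_bar - m_hat * x_bar

sse = sum((y - (m_hat * x + b_hat)) ** 2 for x, y in zip(xs, ys))
total_variation = syy                     # sum of (y_i - y_bar)^2
explained_variation = total_variation - sse

r_squared = 1 - sse / total_variation
r = sxy / math.sqrt(sxx * syy)

print(round(r_squared, 4), round(r ** 2, 4))  # 0.9626 0.9626
```

Here about 96% of the variation in the made-up scores is explained by the line.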

Exercise

Given the coefficient of determination and a scatter plot of the data, can you determine the correlation coefficient?
Yes. Since $R^{2} = r^{2}$ , we have $r = \pm \sqrt{R^{2}}$ , and the scatter plot shows whether the association is positive or negative, which fixes the sign of $r$
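The reasoning in this exercise amounts to a square root plus a sign read off the plot. A minimal sketch, where the function name `r_from_r_squared` and the boolean `trend_is_upward` (standing in for what one sees on the scatter plot) are hypothetical:

```python
import math

def r_from_r_squared(r_squared, trend_is_upward):
    """Recover r from R^2, taking the sign from the direction seen on the scatter plot."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie in [0, 1]")
    magnitude = math.sqrt(r_squared)
    return magnitude if trend_is_upward else -magnitude

# Hypothetical case: R^2 = 0.81 and the points trend downward
r = r_from_r_squared(0.81, trend_is_upward=False)
print(round(r, 2))  # -0.9
```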

Use the historic MATH250 exam score data set to predict the following grades in this course

Unit 4 exam using the average of the first three exams
Final grade using the average of the unit exams
Discuss the quality of these predictions and the nuances of using them

Linear Regression

m = \frac{n \sum x y - \sum x \cdot \sum y}{n \sum x^{2} - {(\sum x)}^{2}}

b = \frac{\sum y - m \sum x}{n}
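These sum-based formulas translate directly into code. The sketch below is an illustration on hypothetical data, with `least_squares_sums` as an assumed function name. On this data it gives the same slope and intercept as $r \frac{s_{y}}{s_{x}}$ and $\bar{y} - r \frac{s_{y}}{s_{x}} \bar{x}$ would.

```python
def least_squares_sums(xs, ys):
    """Slope m and intercept b of the line of best fit from raw sums."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        # All x values are equal, so no slope can be determined
        raise ValueError("slope is undefined when x is constant")
    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n
    return m, b

# Hypothetical data: hours studied vs. exam score
xs = [1, 2, 3, 4, 5]
ys = [52, 60, 61, 70, 77]

m, b = least_squares_sums(xs, ys)
print(m, b)  # 6.0 46.0
```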