Data where we have two variables measured for each unit of observation. For example, (Height, Weight) for 50 students, or (Ad Spend, Sales) for 12 months.
A graph used to visualize bivariate data. Each (x, y) pair is plotted as a single point on a 2D graph.
It is the first step in analyzing bivariate data, as it visually shows the form, direction, and strength of a relationship.
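As a minimal sketch of drawing a scatter diagram, the following uses matplotlib with made-up (Height, Weight) values (the numbers are illustrative, not real measurements):

```python
# Sketch: scatter diagram of made-up (Height, Weight) bivariate data.
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

heights = [150, 155, 160, 165, 170, 175, 180]   # cm (illustrative)
weights = [50, 54, 57, 62, 66, 71, 75]          # kg (illustrative)

# Each (x, y) pair becomes one point on the 2D graph.
plt.scatter(heights, weights)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatter diagram")
plt.savefig("scatter.png")
```

The upward drift of the points already suggests the direction (positive) and rough strength of the relationship before any number is computed.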
Correlation is a statistical measure that describes the degree and direction of the relationship between two variables. This unit focuses on linear correlation.
Also called the Product-Moment Correlation Coefficient. It measures the strength and direction of the linear relationship between two quantitative variables.
r = Cov(x, y) / (σₓσᵧ)
where Cov(x, y) = E[(x-μₓ)(y-μᵧ)], σₓ = std. dev. of x, σᵧ = std. dev. of y.
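A minimal Python sketch of Pearson's r computed directly from this definition, r = Cov(x, y)/(σₓσᵧ), using population moments (the data values are illustrative):

```python
# Sketch: Pearson's r from the definition r = Cov(x, y) / (sigma_x * sigma_y).
import math

def pearson_r(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Population covariance and standard deviations.
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return cov / (sx * sy)

# A perfectly linear increasing relation gives r = 1.
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # → 1.0
```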
This is a non-parametric measure of correlation. It assesses the strength of a monotonic relationship (a relationship that is consistently increasing or decreasing, but not necessarily in a straight line).
It is used when:
- the data are ordinal (ranks rather than precise measurements),
- the relationship is monotonic but not necessarily linear, or
- the data contain outliers that would distort Pearson's r.
R = 1 - (6Σd²) / (n(n² - 1)), where d = difference between the ranks of each pair and n = number of pairs.
If two values are tied, assign the average rank to both (e.g., if the 4th and 5th values are tied, both get rank 4.5). When ties are present, you must use Karl Pearson's formula (from 3.3) on the ranks themselves, not the simplified formula above.
Interpretation: 'R' has the same properties as 'r' (i.e., it ranges from -1 to +1).
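The tie-handling procedure above can be sketched in Python: assign average ranks to tied values, then apply Pearson's formula to the ranks (function names are my own):

```python
# Sketch: Spearman's R via average ranks, then Pearson's r on the ranks.
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole tied group.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # e.g. positions 4 and 5 tied -> rank 4.5
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman_R(x, y):
    return pearson_r(average_ranks(x), average_ranks(y))

# A monotonic increasing relation gives R = 1.
print(round(spearman_R([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 6))  # → 1.0
```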
If correlation shows that two variables are related, regression gives us an equation to describe that relationship. This equation allows us to predict the value of one variable (Dependent Variable, Y) based on the value of another (Independent Variable, X).
In linear regression, we assume the relationship is a straight line. There are two regression lines:
- Y on X (used to predict Y from X): y - ȳ = b_yx(x - x̄), where b_yx = r(σᵧ/σₓ)
- X on Y (used to predict X from Y): x - x̄ = b_xy(y - ȳ), where b_xy = r(σₓ/σᵧ)
If θ (theta) is the angle between the two lines, its tangent is given by:
tan θ = |(1 - r²)/r| · (σₓσᵧ)/(σₓ² + σᵧ²)
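A small Python sketch of both regression slopes and the angle between the lines, using the standard relations b_yx = rσᵧ/σₓ, b_xy = rσₓ/σᵧ and tan θ = |(1 - r²)/r| · σₓσᵧ/(σₓ² + σᵧ²) (function names are my own):

```python
# Sketch: regression coefficients and the angle between the two lines.
import math

def regression_coefficients(r, sx, sy):
    b_yx = r * sy / sx   # slope of the Y-on-X line
    b_xy = r * sx / sy   # coefficient of the X-on-Y line (x on y form)
    return b_yx, b_xy

def angle_between_lines(r, sx, sy):
    # tan(theta) = |(1 - r^2) / r| * (sx * sy) / (sx^2 + sy^2)
    tan_theta = abs((1 - r ** 2) / r) * (sx * sy) / (sx ** 2 + sy ** 2)
    return math.degrees(math.atan(tan_theta))

# With r = 1 the two regression lines coincide, so the angle is 0.
print(angle_between_lines(1.0, 2.0, 3.0))  # → 0.0
```

Note the two extremes: when r = ±1 the lines coincide (θ = 0), and as r → 0 the lines approach perpendicularity.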
How do we find the "best" line? The Principle of Least-Squares states that the best-fit line is the one that minimizes the sum of the squared vertical distances (residuals) between the observed data points (y) and the values predicted by the line (y-hat).
We use this principle to find the "normal equations" to solve for the parameters of the best-fit curve.
For a straight line y = a + bx, the normal equations are a system of 2 linear equations for 'a' and 'b':
Σy = na + bΣx
Σxy = aΣx + bΣx²
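Treating the two straight-line normal equations (Σy = na + bΣx and Σxy = aΣx + bΣx²) as a 2×2 system, a minimal Python sketch with illustrative data:

```python
# Sketch: fitting y = a + bx by solving the two normal equations
#   sum(y)  = n*a + b*sum(x)
#   sum(xy) = a*sum(x) + b*sum(x^2)
def fit_line(x, y):
    n = len(x)
    Sx, Sy = sum(x), sum(y)
    Sxx = sum(xi * xi for xi in x)
    Sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Solve the 2x2 system by Cramer's rule.
    det = n * Sxx - Sx * Sx
    a = (Sy * Sxx - Sx * Sxy) / det
    b = (n * Sxy - Sx * Sy) / det
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies on y = 1 + 2x
print(a, b)  # → 1.0 2.0
```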
For a parabola y = a + bx + cx², the normal equations are a system of 3 linear equations for 'a', 'b', and 'c':
Σy = na + bΣx + cΣx²
Σxy = aΣx + bΣx² + cΣx³
Σx²y = aΣx² + bΣx³ + cΣx⁴
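For the parabola, the three normal equations form a 3×3 linear system; a sketch that solves it with plain Gaussian elimination (no external libraries, data illustrative):

```python
# Sketch: fitting y = a + b*x + c*x^2 by solving the three normal equations.
def fit_parabola(x, y):
    n = len(x)
    S = lambda p: sum(xi ** p for xi in x)                     # sum of x^p
    Sy = lambda p: sum((xi ** p) * yi for xi, yi in zip(x, y)) # sum of x^p * y
    # Augmented matrix [coefficients | right-hand side] of the system.
    M = [
        [n,    S(1), S(2), Sy(0)],
        [S(1), S(2), S(3), Sy(1)],
        [S(2), S(3), S(4), Sy(2)],
    ]
    # Gauss-Jordan elimination (assumes non-zero pivots, fine for real data).
    for i in range(3):
        pivot = M[i][i]
        M[i] = [v / pivot for v in M[i]]
        for j in range(3):
            if j != i:
                M[j] = [vj - M[j][i] * vi for vj, vi in zip(M[j], M[i])]
    a, b, c = M[0][3], M[1][3], M[2][3]
    return a, b, c

# Data generated from y = 2 + x^2, so the fit recovers a ≈ 2, b ≈ 0, c ≈ 1.
a, b, c = fit_parabola([0, 1, 2, 3], [2, 3, 6, 11])
```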
The exponential model y = a·bˣ is non-linear. We transform it into a linear one by taking the logarithm (log base 10 or natural log) of both sides:
log(y) = log(a) + x * log(b)
Let Y = log(y), A = log(a), and B = log(b). The model becomes:
Y = A + Bx
This is now a linear model. We use the normal equations for a straight line on the (x, Y) data:
ΣY = nA + BΣx
ΣxY = AΣx + BΣx²
After solving for A and B, we find the original parameters by: a = antilog(A) and b = antilog(B).
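The whole transform-fit-antilog procedure can be sketched in a few lines of Python (base-10 logs, illustrative data generated from y = 3·2ˣ):

```python
# Sketch: fitting y = a * b^x by fitting the straight line
# log(y) = log(a) + x*log(b), then taking antilogs.
import math

def fit_exponential(x, y):
    Y = [math.log10(yi) for yi in y]      # Y = log(y)
    n = len(x)
    Sx, SY = sum(x), sum(Y)
    Sxx = sum(xi * xi for xi in x)
    SxY = sum(xi * Yi for xi, Yi in zip(x, Y))
    det = n * Sxx - Sx * Sx
    A = (SY * Sxx - Sx * SxY) / det       # A = log(a)
    B = (n * SxY - Sx * SY) / det         # B = log(b)
    return 10 ** A, 10 ** B               # antilogs recover a and b

# Data from y = 3 * 2^x, so the fit recovers a ≈ 3, b ≈ 2.
a, b = fit_exponential([0, 1, 2, 3], [3, 6, 12, 24])
print(round(a, 6), round(b, 6))  # → 3.0 2.0
```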
This is simply the square of the correlation coefficient (r). It has a very important interpretation.
Interpretation: r² represents the proportion (or percentage) of the total variation in the dependent variable (Y) that can be explained by the linear relationship with the independent variable (X).
r² is a key measure of how good a regression model is. A value near 1 is a good fit; a value near 0 is a poor fit.
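The "explained variation" reading of r² can be made concrete: for a fitted line it equals 1 minus the ratio of residual variation to total variation. A sketch (function name is my own):

```python
# Sketch: r^2 as explained variation / total variation for a line y = a + b*x.
def r_squared(x, y, a, b):
    my = sum(y) / len(y)
    y_hat = [a + b * xi for xi in x]                            # predicted values
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained
    ss_tot = sum((yi - my) ** 2 for yi in y)                    # total
    return 1 - ss_res / ss_tot

# A perfect fit explains all of the variation: r^2 = 1.
print(r_squared([0, 1, 2], [1, 3, 5], a=1, b=2))  # → 1.0
```

An r² of, say, 0.81 would mean 81% of the variation in Y is accounted for by the linear relationship with X; the remaining 19% is unexplained.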