Unit 3: Bivariate Data, Correlation, and Regression

Table of Contents

3.1 Bivariate Data and Scatter Diagram
3.2 Correlation
3.3 Karl Pearson's Coefficient of Correlation (r)
3.4 Spearman's Rank Correlation Coefficient (R or ρ)
3.5 Regression
3.6 Lines of Regression and Properties
3.7 Principle of Least-Squares and Curve Fitting
3.8 Coefficient of Determination (r²)

3.1 Bivariate Data and Scatter Diagram

Bivariate Data

Data where we have two variables measured for each unit of observation. For example, (Height, Weight) for 50 students, or (Ad Spend, Sales) for 12 months.

Scatter Diagram (or Scatter Plot)

A graph used to visualize bivariate data. Each (x, y) pair is plotted as a single point on a 2D graph.

It is the first step in analyzing bivariate data, as it visually shows the form, direction, and strength of a relationship.

3.2 Correlation

Correlation is a statistical measure that describes the degree and direction of the relationship between two variables. This unit focuses on linear correlation.

3.3 Karl Pearson's Coefficient of Correlation (r)

Also called the Product-Moment Correlation Coefficient. It measures the strength and direction of the linear relationship between two quantitative variables.

Properties of 'r'

  1. r always lies between -1 and +1 (-1 ≤ r ≤ +1).
  2. r = +1 indicates a perfect positive linear relationship; r = -1 a perfect negative one; r = 0 indicates no linear relationship.
  3. r is a pure number, independent of the units of measurement of x and y.
  4. r is unaffected by a change of origin or scale (provided both scale factors have the same sign).
  5. r is symmetric: r(x, y) = r(y, x).

Formulas for 'r'

  1. Covariance Formula:
    r = Cov(x, y) / (σₓ * σᵧ)

    Where Cov(x, y) = E[(x-μₓ)(y-μᵧ)], σₓ = std. dev. of x, σᵧ = std. dev. of y.

  2. Computational Formula (for raw data):
    r = [ nΣxy - (Σx)(Σy) ] / sqrt( [nΣx² - (Σx)²] * [nΣy² - (Σy)²] )
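
The computational formula can be translated directly into code. A minimal pure-Python sketch (the function name and sample data are illustrative, not from the text):

```python
from math import sqrt

def pearson_r(x, y):
    """Karl Pearson's r via the computational (raw-data) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))    # Σxy
    sxx = sum(a * a for a in x)               # Σx²
    syy = sum(b * b for b in y)               # Σy²
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# Perfectly linear data should give r = +1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```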

3.4 Spearman's Rank Correlation Coefficient (R or ρ)

This is a non-parametric measure of correlation. It assesses the strength of a monotonic relationship (a relationship that is consistently increasing or decreasing, but not necessarily in a straight line).

It is used when:

  1. The data is ordinal (ranked), e.g., "rank of 10 students in Math vs. Physics."
  2. The data is quantitative but does not meet the assumptions of Pearson's r (e.g., it's not linear, or has extreme outliers).

Procedure

  1. Assign ranks (Rₓ) to the x-values from 1 to n.
  2. Assign ranks (Rᵧ) to the y-values from 1 to n.
  3. Calculate the difference in ranks for each pair: d = Rₓ - Rᵧ.
  4. Calculate the sum of squared differences: Σd².

Formulas for 'R'

For n pairs with no tied ranks:

R = 1 - (6Σd²) / (n(n² - 1))

(If ranks are tied, each tied observation receives the average of the ranks it would occupy, and a correction factor of Σ(m³ - m)/12 is added to Σd² for each group of m tied values.)

Interpretation: 'R' has the same properties as 'r' (i.e., it ranges from -1 to +1).
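
The four-step procedure above, together with the standard untied-ranks formula R = 1 - 6Σd²/(n(n² - 1)), can be sketched as follows (this version assumes no tied values):

```python
def spearman_R(x, y):
    """Spearman's rank correlation, assuming no tied values."""
    n = len(x)

    def ranks(v):
        # Rank 1 = smallest value.
        order = sorted(range(n), key=lambda i: v[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)                      # steps 1 and 2
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))   # steps 3 and 4: Σd²
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

print(spearman_R([10, 20, 30], [1, 2, 3]))  # 1.0 -- same ordering
```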

3.5 Regression

If correlation shows that two variables are related, regression gives us an equation to describe that relationship. This equation allows us to predict the value of one variable (Dependent Variable, Y) based on the value of another (Independent Variable, X).

3.6 Lines of Regression and Properties

In linear regression, we assume the relationship is a straight line. There are two regression lines:

1. Regression Line of Y on X (used to predict Y from X):

   y - ȳ = bᵧₓ(x - x̄), where bᵧₓ = r(σᵧ/σₓ) is the regression coefficient of Y on X.

2. Regression Line of X on Y (used to predict X from Y):

   x - x̄ = bₓᵧ(y - ȳ), where bₓᵧ = r(σₓ/σᵧ) is the regression coefficient of X on Y.
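
Assuming the standard textbook equations y - ȳ = r(σᵧ/σₓ)(x - x̄) and x - x̄ = r(σₓ/σᵧ)(y - ȳ), both lines can be built from summary statistics alone (the values below are illustrative):

```python
def regression_lines(xbar, ybar, sx, sy, r):
    """Return prediction functions for both regression lines."""
    byx = r * sy / sx  # regression coefficient of Y on X
    bxy = r * sx / sy  # regression coefficient of X on Y
    y_on_x = lambda x: ybar + byx * (x - xbar)
    x_on_y = lambda y: xbar + bxy * (y - ybar)
    return y_on_x, x_on_y

y_on_x, x_on_y = regression_lines(xbar=5, ybar=10, sx=2, sy=4, r=0.8)
print(y_on_x(7))   # 13.2
print(x_on_y(10))  # 5.0 -- both lines pass through (x̄, ȳ)
```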

Properties of Regression Coefficients

  1. Their product equals r²: bᵧₓ · bₓᵧ = r², so r = ±√(bᵧₓ · bₓᵧ), with the sign of the coefficients.
  2. Both coefficients have the same sign as r.
  3. If one coefficient exceeds 1 in absolute value, the other must be less than 1, since their product r² ≤ 1.
  4. Both regression lines pass through the point of means (x̄, ȳ).

Angle Between Two Regression Lines

If θ (theta) is the acute angle between the two lines, its tangent is given by:

tan(θ) = [ (1 - r²) / |r| ] * [ (σₓσᵧ) / (σₓ² + σᵧ²) ]

When r = 0 the two lines are perpendicular (θ = 90°); when r = ±1 they coincide (θ = 0°).
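
A quick numerical check of the angle formula (taking |r| in the denominator to get the acute angle, and returning 90° directly for r = 0):

```python
from math import atan, degrees

def angle_between_lines(r, sx, sy):
    """Acute angle (in degrees) between the two regression lines."""
    if r == 0:
        return 90.0  # lines are perpendicular when r = 0
    tan_theta = ((1 - r ** 2) / abs(r)) * (sx * sy) / (sx ** 2 + sy ** 2)
    return degrees(atan(tan_theta))

print(angle_between_lines(1.0, 2, 3))  # 0.0 -- lines coincide when r = ±1
print(angle_between_lines(0, 2, 3))    # 90.0
```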

3.7 Principle of Least-Squares and Curve Fitting

Principle of Least-Squares

How do we find the "best" line? The Principle of Least-Squares states that the best-fit line is the one that minimizes the sum of the squared vertical distances (residuals) between the observed data points (y) and the values predicted by the line (y-hat).

Minimize Σ(y - y-hat)² = Minimize Σ(y - (a + bx))²

Curve Fitting

We use this principle to find the "normal equations" to solve for the parameters of the best-fit curve.

1. Fitting a Straight Line (y = a + bx)

The normal equations are a system of 2 linear equations for 'a' and 'b':

(I) Σy = na + b(Σx)
(II) Σxy = a(Σx) + b(Σx²)
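
Eliminating 'a' between equations (I) and (II) gives b = [nΣxy - (Σx)(Σy)] / [nΣx² - (Σx)²], after which (I) yields a = (Σy - bΣx)/n. A small sketch of that solution (sample data illustrative):

```python
def fit_line(x, y):
    """Solve the two normal equations for a and b in y = a + bx."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)                    # Σx²
    sxy = sum(u * v for u, v in zip(x, y))         # Σxy
    # Eliminate a between (I) and (II) to get b, then back-substitute in (I).
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 1 + 2x
print(a, b)  # 1.0 2.0
```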

2. Fitting a Parabola (Polynomial: y = a + bx + cx²)

The normal equations are a system of 3 linear equations for 'a', 'b', and 'c':

(I) Σy = na + b(Σx) + c(Σx²)
(II) Σxy = a(Σx) + b(Σx²) + c(Σx³)
(III) Σx²y = a(Σx²) + b(Σx³) + c(Σx⁴)
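
The 3×3 system (I)-(III) can be solved by ordinary elimination. A minimal sketch using Gauss-Jordan elimination without pivoting (adequate for well-behaved data; the sample points are illustrative):

```python
def fit_parabola(x, y):
    """Solve the three normal equations for a, b, c in y = a + bx + cx²."""
    n = len(x)
    S = lambda p: sum(v ** p for v in x)                     # Σx^p
    T = lambda p: sum((u ** p) * v for u, v in zip(x, y))    # Σ(x^p · y)
    # Augmented matrix of the normal equations (I)-(III).
    M = [
        [n,    S(1), S(2), T(0)],
        [S(1), S(2), S(3), T(1)],
        [S(2), S(3), S(4), T(2)],
    ]
    # Gauss-Jordan elimination (no pivoting).
    for i in range(3):
        M[i] = [v / M[i][i] for v in M[i]]
        for j in range(3):
            if j != i:
                M[j] = [vj - M[j][i] * vi for vj, vi in zip(M[j], M[i])]
    return M[0][3], M[1][3], M[2][3]  # a, b, c

print(fit_parabola([0, 1, 2, 3], [1, 2, 5, 10]))  # data on y = 1 + x²
```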

3. Fitting an Exponential Curve (y = abˣ)

This is a non-linear model. We transform it into a linear one by taking the logarithm (log base 10 or natural log) of both sides.

log(y) = log(a) + x * log(b)

Let Y = log(y), A = log(a), and B = log(b). The model becomes:

Y = A + Bx

This is now a linear model. We use the normal equations for a straight line on (x, Y) data:

(I) ΣY = nA + B(Σx)
(II) ΣxY = A(Σx) + B(Σx²)

After solving for A and B, we find the original parameters by: a = antilog(A) and b = antilog(B).
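
The whole transform-fit-antilog procedure can be sketched in a few lines, here using log base 10 (so the antilog is 10^A); the sample data is illustrative:

```python
from math import log10

def fit_exponential(x, y):
    """Fit y = a · bˣ by least squares on the transformed data (x, log₁₀ y)."""
    n = len(x)
    Y = [log10(v) for v in y]                      # Y = log y
    sx, sY = sum(x), sum(Y)
    sxx = sum(v * v for v in x)
    sxY = sum(u * v for u, v in zip(x, Y))
    # Straight-line normal equations on (x, Y).
    B = (n * sxY - sx * sY) / (n * sxx - sx ** 2)
    A = (sY - B * sx) / n
    return 10 ** A, 10 ** B                        # a = antilog(A), b = antilog(B)

a, b = fit_exponential([0, 1, 2, 3], [3, 6, 12, 24])  # data on y = 3 · 2ˣ
print(round(a, 6), round(b, 6))  # 3.0 2.0
```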

3.8 Coefficient of Determination (r²)

This is simply the square of the correlation coefficient (r). It has a very important interpretation.

Interpretation: r² represents the proportion (or percentage) of the total variation in the dependent variable (Y) that can be explained by the linear relationship with the independent variable (X).

r² is a key measure of how good a regression model is. A value near 1 is a good fit; a value near 0 is a poor fit.
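
This interpretation can also be computed directly from a fitted line, using the equivalent form r² = 1 - Σ(y - ŷ)² / Σ(y - ȳ)² (unexplained variation over total variation). A sketch, with illustrative data:

```python
def r_squared(x, y):
    """r² = 1 - SSE/SST for the least-squares line of Y on X."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    b = (n * sum(u * v for u, v in zip(x, y)) - sx * sy) / \
        (n * sum(u * u for u in x) - sx ** 2)
    a = (sy - b * sx) / n
    ybar = sy / n
    sse = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))  # unexplained variation
    sst = sum((v - ybar) ** 2 for v in y)                    # total variation
    return 1 - sse / sst

print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 -- all variation explained
```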