Data where we have two variables measured for each unit of observation. For example, (Height, Weight) for 50 students, or (Ad Spend, Sales) for 12 months.
A graph used to visualize bivariate data. Each (x, y) pair is plotted as a single point on a 2D graph.
It is the first step in analyzing bivariate data, as it visually shows the form, direction, and strength of a relationship.
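As a minimal sketch of drawing a scatter diagram, the following uses matplotlib with made-up (Height, Weight) values (the numbers are illustrative, not real measurements):

```python
# Sketch: scatter diagram of made-up (Height, Weight) bivariate data.
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

heights = [150, 155, 160, 165, 170, 175, 180]   # cm (illustrative)
weights = [50, 54, 57, 62, 66, 71, 75]          # kg (illustrative)

# Each (x, y) pair becomes one point on the 2D graph.
plt.scatter(heights, weights)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatter diagram")
plt.savefig("scatter.png")
```

The upward drift of the points already suggests the direction (positive) and rough strength of the relationship before any number is computed.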
Correlation is a statistical measure that describes the degree and direction of the relationship between two variables. This unit focuses on linear correlation.
Also called the Product-Moment Correlation Coefficient. It measures the strength and direction of the linear relationship between two quantitative variables.
r = Cov(x, y) / (σₓσᵧ)
where Cov(x, y) = E[(x-μₓ)(y-μᵧ)], σₓ = std. dev. of x, σᵧ = std. dev. of y.
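A minimal Python sketch of Pearson's r computed directly from this definition, r = Cov(x, y)/(σₓσᵧ), using population moments (the data values are illustrative):

```python
# Sketch: Pearson's r from the definition r = Cov(x, y) / (sigma_x * sigma_y).
import math

def pearson_r(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Population covariance and standard deviations.
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return cov / (sx * sy)

# A perfectly linear increasing relation gives r = 1.
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # → 1.0
```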
This is a non-parametric measure of correlation. It assesses the strength of a monotonic relationship (a relationship that is consistently increasing or decreasing, but not necessarily in a straight line).
It is used when:
- the data are ordinal (ranks rather than precise measurements),
- the relationship is monotonic but not necessarily linear, or
- the data contain outliers that would distort Pearson's r.
R = 1 - (6Σd²) / (n(n² - 1)), where d = difference between the ranks of each pair and n = number of pairs.
If two values are tied, assign the average rank to both (e.g., if the 4th and 5th values are tied, both get rank 4.5). When ties are present, you must use Karl Pearson's formula (from 3.3) on the ranks themselves, not the simplified formula above.
Interpretation: 'R' has the same properties as 'r' (i.e., it ranges from -1 to +1).
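The tie-handling procedure above can be sketched in Python: assign average ranks to tied values, then apply Pearson's formula to the ranks (function names are my own):

```python
# Sketch: Spearman's R via average ranks, then Pearson's r on the ranks.
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole tied group.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # e.g. positions 4 and 5 tied -> rank 4.5
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman_R(x, y):
    return pearson_r(average_ranks(x), average_ranks(y))

# A monotonic increasing relation gives R = 1.
print(round(spearman_R([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 6))  # → 1.0
```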
If correlation shows that two variables are related, regression gives us an equation to describe that relationship. This equation allows us to predict the value of one variable (Dependent Variable, Y) based on the value of another (Independent Variable, X).
In linear regression, we assume the relationship is a straight line. There are two regression lines:
- Y on X (used to predict Y from X): y - ȳ = b_yx(x - x̄), where b_yx = r(σᵧ/σₓ)
- X on Y (used to predict X from Y): x - x̄ = b_xy(y - ȳ), where b_xy = r(σₓ/σᵧ)
If θ (theta) is the angle between the two lines, its tangent is given by:
tan θ = |(1 - r²)/r| · (σₓσᵧ)/(σₓ² + σᵧ²)
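A small Python sketch of both regression slopes and the angle between the lines, using the standard relations b_yx = rσᵧ/σₓ, b_xy = rσₓ/σᵧ and tan θ = |(1 - r²)/r| · σₓσᵧ/(σₓ² + σᵧ²) (function names are my own):

```python
# Sketch: regression coefficients and the angle between the two lines.
import math

def regression_coefficients(r, sx, sy):
    b_yx = r * sy / sx   # slope of the Y-on-X line
    b_xy = r * sx / sy   # coefficient of the X-on-Y line (x on y form)
    return b_yx, b_xy

def angle_between_lines(r, sx, sy):
    # tan(theta) = |(1 - r^2) / r| * (sx * sy) / (sx^2 + sy^2)
    tan_theta = abs((1 - r ** 2) / r) * (sx * sy) / (sx ** 2 + sy ** 2)
    return math.degrees(math.atan(tan_theta))

# With r = 1 the two regression lines coincide, so the angle is 0.
print(angle_between_lines(1.0, 2.0, 3.0))  # → 0.0
```

Note the two extremes: when r = ±1 the lines coincide (θ = 0), and as r → 0 the lines approach perpendicularity.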
How do we find the "best" line? The Principle of Least-Squares states that the best-fit line is the one that minimizes the sum of the squared vertical distances (residuals) between the observed data points (y) and the values predicted by the line (y-hat).
We use this principle to find the "normal equations" to solve for the parameters of the best-fit curve.
For a straight line y = a + bx, the normal equations are a system of 2 linear equations for 'a' and 'b':
Σy = na + bΣx
Σxy = aΣx + bΣx²
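Treating the two straight-line normal equations (Σy = na + bΣx and Σxy = aΣx + bΣx²) as a 2×2 system, a minimal Python sketch with illustrative data:

```python
# Sketch: fitting y = a + bx by solving the two normal equations
#   sum(y)  = n*a + b*sum(x)
#   sum(xy) = a*sum(x) + b*sum(x^2)
def fit_line(x, y):
    n = len(x)
    Sx, Sy = sum(x), sum(y)
    Sxx = sum(xi * xi for xi in x)
    Sxy = sum(xi * yi for xi, yi in zip(x, y))
    # Solve the 2x2 system by Cramer's rule.
    det = n * Sxx - Sx * Sx
    a = (Sy * Sxx - Sx * Sxy) / det
    b = (n * Sxy - Sx * Sy) / det
    return a, b

a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lies on y = 1 + 2x
print(a, b)  # → 1.0 2.0
```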
For a parabola y = a + bx + cx², the normal equations are a system of 3 linear equations for 'a', 'b', and 'c':
Σy = na + bΣx + cΣx²
Σxy = aΣx + bΣx² + cΣx³
Σx²y = aΣx² + bΣx³ + cΣx⁴
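For the parabola, the three normal equations form a 3×3 linear system; a sketch that solves it with plain Gaussian elimination (no external libraries, data illustrative):

```python
# Sketch: fitting y = a + b*x + c*x^2 by solving the three normal equations.
def fit_parabola(x, y):
    n = len(x)
    S = lambda p: sum(xi ** p for xi in x)                     # sum of x^p
    Sy = lambda p: sum((xi ** p) * yi for xi, yi in zip(x, y)) # sum of x^p * y
    # Augmented matrix [coefficients | right-hand side] of the system.
    M = [
        [n,    S(1), S(2), Sy(0)],
        [S(1), S(2), S(3), Sy(1)],
        [S(2), S(3), S(4), Sy(2)],
    ]
    # Gauss-Jordan elimination (assumes non-zero pivots, fine for real data).
    for i in range(3):
        pivot = M[i][i]
        M[i] = [v / pivot for v in M[i]]
        for j in range(3):
            if j != i:
                M[j] = [vj - M[j][i] * vi for vj, vi in zip(M[j], M[i])]
    a, b, c = M[0][3], M[1][3], M[2][3]
    return a, b, c

# Data generated from y = 2 + x^2, so the fit recovers a ≈ 2, b ≈ 0, c ≈ 1.
a, b, c = fit_parabola([0, 1, 2, 3], [2, 3, 6, 11])
```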
The exponential model y = a·bˣ is non-linear. We transform it into a linear one by taking the logarithm (log base 10 or natural log) of both sides:
log(y) = log(a) + x * log(b)
Let Y = log(y), A = log(a), and B = log(b). The model becomes:
Y = A + Bx
This is now a linear model. We use the normal equations for a straight line on the (x, Y) data:
ΣY = nA + BΣx
ΣxY = AΣx + BΣx²
After solving for A and B, we find the original parameters by: a = antilog(A) and b = antilog(B).
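The whole transform-fit-antilog procedure can be sketched in a few lines of Python (base-10 logs, illustrative data generated from y = 3·2ˣ):

```python
# Sketch: fitting y = a * b^x by fitting the straight line
# log(y) = log(a) + x*log(b), then taking antilogs.
import math

def fit_exponential(x, y):
    Y = [math.log10(yi) for yi in y]      # Y = log(y)
    n = len(x)
    Sx, SY = sum(x), sum(Y)
    Sxx = sum(xi * xi for xi in x)
    SxY = sum(xi * Yi for xi, Yi in zip(x, Y))
    det = n * Sxx - Sx * Sx
    A = (SY * Sxx - Sx * SxY) / det       # A = log(a)
    B = (n * SxY - Sx * SY) / det         # B = log(b)
    return 10 ** A, 10 ** B               # antilogs recover a and b

# Data from y = 3 * 2^x, so the fit recovers a ≈ 3, b ≈ 2.
a, b = fit_exponential([0, 1, 2, 3], [3, 6, 12, 24])
print(round(a, 6), round(b, 6))  # → 3.0 2.0
```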
This is simply the square of the correlation coefficient (r). It has a very important interpretation.
Interpretation: r² represents the proportion (or percentage) of the total variation in the dependent variable (Y) that can be explained by the linear relationship with the independent variable (X).
r² is a key measure of how good a regression model is. A value near 1 is a good fit; a value near 0 is a poor fit.
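The "explained variation" reading of r² can be made concrete: for a fitted line it equals 1 minus the ratio of residual variation to total variation. A sketch (function name is my own):

```python
# Sketch: r^2 as explained variation / total variation for a line y = a + b*x.
def r_squared(x, y, a, b):
    my = sum(y) / len(y)
    y_hat = [a + b * xi for xi in x]                            # predicted values
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained
    ss_tot = sum((yi - my) ** 2 for yi in y)                    # total
    return 1 - ss_res / ss_tot

# A perfect fit explains all of the variation: r^2 = 1.
print(r_squared([0, 1, 2], [1, 3, 5], a=1, b=2))  # → 1.0
```

An r² of, say, 0.81 would mean 81% of the variation in Y is accounted for by the linear relationship with X; the remaining 19% is unexplained.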