Unit 4: Regression and Curve Fitting
1. Regression: Types of Regression (Lines)
If correlation shows a relationship exists, regression describes that relationship with an equation. This equation can be used for prediction.
A "line of regression" is the line of best fit for the data. In bivariate analysis, there are two main types (lines) of linear regression.
1. Regression Line of Y on X
This line is used to predict the value of Y, given a value of X.
Equation: (Y - ȳ) = byx * (X - x̄)
Here, byx is the regression coefficient of Y on X (the slope). It represents the average change in Y for a one-unit change in X.
2. Regression Line of X on Y
This line is used to predict the value of X, given a value of Y.
Equation: (X - x̄) = bxy * (Y - ȳ)
Here, bxy is the regression coefficient of X on Y. It represents the average change in X for a one-unit change in Y.
2. Regression Coefficients and their Properties
The coefficients byx and bxy are the slopes of the two regression lines.
Formulas for Coefficients:
byx = Cov(x, y) / σx² = r * (σy / σx)
bxy = Cov(x, y) / σy² = r * (σx / σy)
byx = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ]
bxy = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σy²) - (Σy)² ]
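The formulas above can be checked numerically. The following is a minimal sketch in plain Python; the function names and the five-point data set are illustrative assumptions, not part of the unit. It computes both coefficients from the raw sums and again from the covariance form (using byx = Cov(x, y) / σx² as the counterpart of the bxy formula), and the two routes should agree.

```python
def regression_coefficients(xs, ys):
    """Return (byx, bxy) from the raw-sum formulas."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    numerator = n * sxy - sx * sy          # shared by both coefficients
    byx = numerator / (n * sxx - sx ** 2)
    bxy = numerator / (n * syy - sy ** 2)
    return byx, bxy


def regression_coefficients_from_moments(xs, ys):
    """Return (byx, bxy) as Cov(x, y) divided by the variance of x or of y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / var_x, cov / var_y


# Illustrative data (an assumption, chosen so the sums are easy to verify by hand).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

byx, bxy = regression_coefficients(xs, ys)
byx_m, bxy_m = regression_coefficients_from_moments(xs, ys)
print(round(byx, 6), round(bxy, 6))      # 0.6 1.0
print(round(byx_m, 6), round(bxy_m, 6))  # 0.6 1.0
```

For this data Σx = 15, Σy = 20, Σxy = 66, Σx² = 55 and Σy² = 86, so byx = 30/50 and bxy = 30/30.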
Properties of Regression Coefficients:
- Geometric Mean: The correlation coefficient 'r' is the geometric mean of the two regression coefficients. r² = byx * bxy => r = ± sqrt(byx * bxy), where the sign is the common sign of byx and bxy.
- Sign: 'r', byx, and bxy all have the same sign.
- Magnitude: If one regression coefficient is greater than 1 in absolute value, the other *must* be less than 1 in absolute value (as their product, r², cannot exceed 1).
- Example: Can the two regression coefficients be 1.6 and 0.9? No. r² = 1.6 * 0.9 = 1.44, which is > 1. This is impossible.
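The properties above can be turned into a small validity check. This is a sketch only; the function name and the choice to raise ValueError are assumptions made for illustration. It recovers r from a pair of coefficients and rejects pairs that violate the sign or magnitude property.

```python
import math


def correlation_from_coefficients(byx, bxy):
    """Return r implied by (byx, bxy); raise ValueError for an impossible pair."""
    product = byx * bxy
    if product < 0:
        # Sign property: both coefficients must share the sign of r.
        raise ValueError("coefficients must have the same sign")
    if product > 1:
        # Magnitude property: the product is r squared, which cannot exceed 1.
        raise ValueError(f"r^2 = {product:.2f} exceeds 1")
    sign = -1.0 if byx < 0 else 1.0
    return sign * math.sqrt(product)


print(round(correlation_from_coefficients(-0.8, -0.45), 4))  # -0.6

try:
    correlation_from_coefficients(1.6, 0.9)
except ValueError as err:
    print("impossible pair:", err)  # impossible pair: r^2 = 1.44 exceeds 1
```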
3. Angle Between Two Regression Lines
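The two regression lines intersect at (x̄, ȳ). The angle (θ) between them indicates the strength of the correlation: the smaller the angle, the stronger the correlation.
Formula: tan(θ) = [ (1 - r²) / |r| ] * [ σx σy / (σx² + σy²) ]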
Key Insights:
- If r = 0: tan(θ) = ∞, so θ = 90°. The lines are perpendicular. The variables are uncorrelated.
- If r = +1 or -1: tan(θ) = 0, so θ = 0°. The two lines are coincident (they become the same line). This means perfect correlation.
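The two limiting cases can be reproduced numerically. The sketch below assumes the standard textbook formula tan(θ) = [(1 - r²) / |r|] * [σx σy / (σx² + σy²)] for the acute angle; the function name is hypothetical. Using atan2 instead of dividing avoids a division by zero when r = 0.

```python
import math


def angle_between_regression_lines(r, sigma_x, sigma_y):
    """Acute angle, in degrees, between the two regression lines."""
    numerator = (1 - r ** 2) * sigma_x * sigma_y
    denominator = abs(r) * (sigma_x ** 2 + sigma_y ** 2)
    # atan2 handles denominator == 0 (r = 0) and returns 90 degrees there.
    return math.degrees(math.atan2(numerator, denominator))


# Standard deviations of 2 and 3 are arbitrary illustrative values.
for r in (0.0, 0.5, 0.9, 1.0, -1.0):
    print(r, round(angle_between_regression_lines(r, 2.0, 3.0), 2))
```

The printed angle shrinks from 90° toward 0° as |r| grows from 0 to 1.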
4. Principle of Least Squares
This is the fundamental method used to find the "best-fit" line (the regression line) for a set of data points.
Principle: The line of best fit is the one that minimizes the sum of the squares of the vertical errors (residuals).
- Residual (Error): ei = (Observed yi) - (Predicted ŷi)
- Goal: Minimize the Sum of Squared Errors, SSE = Σ ei² = Σ (yi - ŷi)².
For a straight line ŷ = a + bx, we use calculus (partial derivatives w.r.t. 'a' and 'b') to find the values that minimize SSE. This process generates the "Normal Equations."
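To make the principle concrete, the sketch below evaluates SSE for a candidate line ŷ = a + bx and compares the least-squares line with slightly shifted lines. The data set and the values a = 2.2, b = 0.6 (the least-squares solution for this data) are illustrative assumptions.

```python
def sse(a, b, xs, ys):
    """Sum of squared vertical residuals for the line y_hat = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data; its least-squares line is y_hat = 2.2 + 0.6 * x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

best = sse(2.2, 0.6, xs, ys)
print(round(best, 6))  # 2.4

# Any small change to the intercept or the slope increases SSE.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    print(da, db, round(sse(2.2 + da, 0.6 + db, xs, ys), 6))
```

Every perturbed line gives a larger SSE than 2.4, which is what it means for (a, b) to minimize the sum of squared residuals.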
5. Fitting of Linear, Polynomials, and Exponential Curves
Using the Principle of Least Squares, we can derive the Normal Equations needed to fit specific curves to data.
1. Fitting a Linear Equation (Straight Line)
Equation: y = a + bx
Normal Equations:
- Σy = n*a + b*(Σx)
- Σxy = a*(Σx) + b*(Σx²)
Solve these two simultaneous equations for 'a' and 'b'.
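As a sketch of that last step, the two normal equations can be solved in closed form by elimination. The function name and the sample data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Solve the two normal equations for y = a + b*x and return (a, b)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx ** 2
    if det == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b = (n * sxy - sx * sy) / det   # eliminate 'a' between the two equations
    a = (sy - b * sx) / n           # back-substitute into the first equation
    return a, b


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(round(a, 6), round(b, 6))  # 2.2 0.6
```

Substituting back confirms both equations: 5(2.2) + 0.6(15) = 20 = Σy and 2.2(15) + 0.6(55) = 66 = Σxy.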
2. Fitting a Polynomial (Parabola / Quadratic)
Equation: y = a + bx + cx²
Normal Equations:
- Σy = n*a + b*(Σx) + c*(Σx²)
- Σxy = a*(Σx) + b*(Σx²) + c*(Σx³)
- Σx²y = a*(Σx²) + b*(Σx³) + c*(Σx⁴)
Solve these three simultaneous equations for 'a', 'b', and 'c'.
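One way to solve the three equations is Gaussian elimination. The sketch below uses only the standard library; the helper names are hypothetical, and the data is generated from y = 1 + 2x + 3x² so the expected answer is known in advance.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * unknowns = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution on the upper-triangular system.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_parabola(xs, ys):
    """Build and solve the normal equations for y = a + b*x + c*x^2."""
    s = [sum(x ** p for x in xs) for p in range(5)]                     # s[p] = sum of x^p, s[0] = n
    t = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(3)]   # sums of y, x*y, x^2*y
    matrix = [[s[0], s[1], s[2]],
              [s[1], s[2], s[3]],
              [s[2], s[3], s[4]]]
    return solve_linear_system(matrix, t)


xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]   # exact quadratic, for checking
a, b, c = fit_parabola(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```

Because the data lies exactly on a parabola, the fitted coefficients should come out as approximately 1, 2 and 3.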
3. Fitting an Exponential Curve
Equation: y = a * b^x
This is not linear in a and b. Taking the logarithm of both sides (which requires y > 0) gives: log(y) = log(a) + x * log(b)
Now, let Y = log(y), A = log(a), and B = log(b).
The equation becomes a straight line: Y = A + Bx
We use the normal equations for a straight line, but with Y instead of y:
Normal Equations (Exponential):
- ΣY = n*A + B*(Σx) => Σ(log y) = n*log(a) + log(b)*(Σx)
- ΣxY = A*(Σx) + B*(Σx²) => Σ(x log y) = log(a)*(Σx) + log(b)*(Σx²)
Solve for A and B, then find a = antilog(A) and b = antilog(B).
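The whole procedure can be sketched in a few lines. This example uses natural logarithms, so the antilog is math.exp; any base works as long as the same base is used in both directions. The function name and the data (generated from y = 2 * 3^x) are illustrative assumptions.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * b**x by a straight-line fit of log(y) against x."""
    if any(y <= 0 for y in ys):
        raise ValueError("the log transform needs strictly positive y values")
    logs = [math.log(y) for y in ys]          # Y = log(y)
    n = len(xs)
    sx, s_log = sum(xs), sum(logs)
    sxx = sum(x * x for x in xs)
    sx_log = sum(x * Y for x, Y in zip(xs, logs))
    # Straight-line normal equations with Y in place of y.
    B = (n * sx_log - sx * s_log) / (n * sxx - sx ** 2)
    A = (s_log - B * sx) / n
    return math.exp(A), math.exp(B)           # a = antilog(A), b = antilog(B)


xs = [0, 1, 2, 3, 4]
ys = [2, 6, 18, 54, 162]   # generated from y = 2 * 3**x
a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))
```

Note that this minimizes squared error in log(y), not in y itself, so on noisy data the result can differ from a direct non-linear least-squares fit.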
6. Coefficient of Determination (r²)
Coefficient of Determination (r²): The square of the correlation coefficient (r). It represents the proportion of the total variance in the dependent variable (Y) that is explained or accounted for by the linear relationship with the independent variable (X).
- Range: 0 ≤ r² ≤ 1 (a square cannot be negative, and |r| ≤ 1).
- Example: If r = 0.9, then r² = 0.81.
- Interpretation: This means 81% of the variation in Y can be explained by X. The remaining 19% (1 - r²) is unexplained variation, due to other factors or random error.
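The "explained variation" reading can be verified directly: for a least-squares straight line, 1 - SSE/SST equals r². The sketch below assumes that decomposition (SST is the total sum of squares of Y about its mean); the function name and data are illustrative.

```python
def coefficient_of_determination(xs, ys):
    """Return 1 - SSE/SST for the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sst = sum((y - mean_y) ** 2 for y in ys)      # total variation in y
    if sxx == 0 or sst == 0:
        raise ValueError("x and y must each take at least two distinct values")
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    return 1 - sse / sst


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r_squared = coefficient_of_determination(xs, ys)
print(round(r_squared, 4))  # 0.6
```

Here SST = 6 and SSE = 2.4, so 60% of the variation in Y is explained by the line and 40% is left unexplained.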