Unit 3: Theory of Attributes & Curve Fitting
1. Theory of Attributes: Introduction
The "Theory of Attributes" deals with qualitative data (see Unit 1), which cannot be measured numerically but can be classified based on the presence or absence of a characteristic (an "attribute").
- Examples: Literacy, gender (Male/Female), blindness, employment (Employed/Unemployed).
Notation
- Positive Attributes (Presence): Represented by capital letters.
  - A = Literacy, B = Employment
- Negative Attributes (Absence): Represented by Greek letters (or lowercase).
  - α (alpha) = Illiteracy (Not-A)
  - β (beta) = Unemployment (Not-B)
- Combinations (Classes):
  - (A) = Number of people who are Literate.
  - (β) = Number of people who are Unemployed.
  - (AB) = Number of people who are both Literate AND Employed.
  - (Aβ) = Number of people who are Literate AND Unemployed.
- N: The total number of observations (the "universe").
Classes and Class Frequencies
- Order 0: N (the total)
- Order 1: (A), (B), (α), (β)
- Order 2: (AB), (Aβ), (αB), (αβ)
These frequencies are related: every class frequency is the sum of the two higher-order frequencies it splits into. For example:
N = (A) + (α)
N = (B) + (β)
(A) = (AB) + (Aβ)
(B) = (AB) + (αB)
N = (AB) + (Aβ) + (αB) + (αβ)
Contingency Table (2x2)
This is the easiest way to organize the frequencies for two attributes.
| Attribute | B (Employed) | β (Unemployed) | Total |
|---|---|---|---|
| A (Literate) | (AB) | (Aβ) | (A) |
| α (Illiterate) | (αB) | (αβ) | (α) |
| Total | (B) | (β) | N |
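As a small illustration, the whole table can be filled in from just N, (A), (B) and (AB) using the relations above. The sketch below is only one way to do it; the function name `contingency_table`, the ASCII dictionary keys, and the figures are hypothetical and not part of the original notes.

```python
def contingency_table(n, a, b, ab):
    """Fill in every cell of the 2x2 table from N, (A), (B) and (AB)."""
    a_beta = a - ab                # (Aβ) = (A) - (AB)
    alpha_b = b - ab               # (αB) = (B) - (AB)
    alpha = n - a                  # (α)  = N - (A)
    beta = n - b                   # (β)  = N - (B)
    alpha_beta = alpha - alpha_b   # (αβ) = (α) - (αB)
    return {
        "AB": ab, "A_beta": a_beta, "A": a,
        "alpha_B": alpha_b, "alpha_beta": alpha_beta, "alpha": alpha,
        "B": b, "beta": beta, "N": n,
    }

# Hypothetical figures: 100 people, 60 literate, 50 employed, 40 both.
table = contingency_table(n=100, a=60, b=50, ab=40)
for name, value in table.items():
    print(name, value)
```

The four inner cells should always add up to N, which is a quick check on the arithmetic.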
2. Consistency of Attributes
Consistency: A set of data (class frequencies) is said to be consistent if no class frequency is negative.
Since a frequency represents a count of items, it cannot be less than zero. If any calculation results in a negative frequency (e.g., (AB) < 0), the data is inconsistent and likely contains errors in collection or transcription.
Condition for Consistency:
All frequencies of the highest order must be non-negative. For two attributes, this means:
- (AB) ≥ 0
- (Aβ) ≥ 0
- (αB) ≥ 0
- (αβ) ≥ 0
When only N, (A), (B), and (AB) are given, the other three frequencies are calculated as:
- (Aβ) = (A) - (AB)
- (αB) = (B) - (AB)
- (αβ) = N - (A) - (B) + (AB) or (αβ) = (α) - (αB)
If (AB) or any of these three calculated frequencies is negative, the data is inconsistent.
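The check can be sketched in a few lines of Python. This is a minimal illustration, assuming the data is given as N, (A), (B) and (AB); the function names and the figures are hypothetical.

```python
def ultimate_frequencies(n, a, b, ab):
    """Return the four second-order frequencies for two attributes."""
    return {
        "AB": ab,
        "A_beta": a - ab,              # (Aβ) = (A) - (AB)
        "alpha_B": b - ab,             # (αB) = (B) - (AB)
        "alpha_beta": n - a - b + ab,  # (αβ) = N - (A) - (B) + (AB)
    }

def is_consistent(n, a, b, ab):
    """Data is consistent when no second-order frequency is negative."""
    return all(v >= 0 for v in ultimate_frequencies(n, a, b, ab).values())

# Hypothetical figures: N = 100, (A) = 60, (B) = 50.
print(is_consistent(100, 60, 50, 40))  # all four frequencies are non-negative
print(is_consistent(100, 60, 50, 5))   # (αβ) = 100 - 60 - 50 + 5 = -5
```

In the second call, 60 literate and 50 employed people cannot fit into a universe of 100 with an overlap of only 5, which is what the negative (αβ) signals.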
3. Independence of Attributes
Independence: Two attributes A and B are independent if there is no relationship between them. The presence or absence of A has no effect on the presence or absence of B.
If they are independent, the proportion of 'A's among 'B's should be the same as the proportion of 'A's in the whole population.
Proportion of A's among B's = (AB) / (B)
Proportion of A's in total = (A) / N
Condition for Independence:
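Setting the two proportions equal, (AB) / (B) = (A) / N, and multiplying both sides by (B) gives the condition. A and B are independent if: (AB) = (A) * (B) / N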
The observed frequency (AB) is compared to the expected frequency (A)*(B)/N.
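The comparison can be sketched as follows. The function names and the figures are hypothetical, and the exact-equality test is an assumption made for illustration.

```python
import math

def expected_ab(n, a, b):
    """Expected (AB) under independence: (A) * (B) / N."""
    return a * b / n

def is_independent(n, a, b, ab):
    """True when the observed (AB) equals the expected frequency."""
    return math.isclose(ab, expected_ab(n, a, b))

# Hypothetical figures: N = 100, (A) = 60, (B) = 50, (AB) = 30.
n, a, b, ab = 100, 60, 50, 30
print(ab / b, a / n)                 # the two proportions being compared
print(expected_ab(n, a, b))          # expected (AB) if A and B were independent
print(is_independent(n, a, b, ab))
```

Observed counts from real data will rarely match the expected frequency exactly, so in practice the size of the gap matters more than a strict equality test.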
4. Association of Attributes
If attributes are not independent, they are associated.
- Positive Association: A and B are positively associated if they tend to appear together.
  - Condition: Observed (AB) > Expected ( (A)*(B)/N )
  - Example: Literacy and Employment.
- Negative Association (or Disassociation): A and B are negatively associated if the presence of one discourages the presence of the other.
  - Condition: Observed (AB) < Expected ( (A)*(B)/N )
  - Example: Vaccination and Sickness.
Coefficients of Association
These are numerical measures of the *strength* and *direction* of the association.
1. Yule's Coefficient of Association (Q)
This is the most common measure. It ranges from -1 to +1.
Q = [ (AB)(αβ) - (Aβ)(αB) ] / [ (AB)(αβ) + (Aβ)(αB) ]
- If Q = +1: Perfect Positive Association.
- If Q = -1: Perfect Negative Association.
- If Q = 0: A and B are Independent.
- The closer |Q| is to 1, the stronger the association.
2. Yule's Coefficient of Colligation (Y)
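Another measure, also ranging from -1 to +1. It is related to Q.
Y = [ sqrt((AB)(αβ)) - sqrt((Aβ)(αB)) ] / [ sqrt((AB)(αβ)) + sqrt((Aβ)(αB)) ]
Relationship between Q and Y:
Q = 2Y / (1 + Y²), or equivalently Y = (1 - sqrt(1 - Q²)) / Q for Q ≠ 0 (Y = 0 when Q = 0)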
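Both coefficients can be computed from the four second-order frequencies. The sketch below assumes the standard definitions, Q = [(AB)(αβ) - (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)] and the same expression with square roots of the two products for Y. The function names and the figures are hypothetical.

```python
import math

def yule_q(ab, a_beta, alpha_b, alpha_beta):
    """Yule's coefficient of association from the four cell counts."""
    same = ab * alpha_beta      # (AB)(αβ)
    cross = a_beta * alpha_b    # (Aβ)(αB)
    return (same - cross) / (same + cross)  # undefined if both products are 0

def yule_y(ab, a_beta, alpha_b, alpha_beta):
    """Yule's coefficient of colligation from the four cell counts."""
    same = math.sqrt(ab * alpha_beta)
    cross = math.sqrt(a_beta * alpha_b)
    return (same - cross) / (same + cross)

def y_from_q(q):
    """Convert Q to Y using Y = (1 - sqrt(1 - Q^2)) / Q."""
    if q == 0:
        return 0.0              # independence: both coefficients are zero
    return (1 - math.sqrt(1 - q * q)) / q

# Hypothetical table: (AB) = 40, (Aβ) = 20, (αB) = 10, (αβ) = 30.
q = yule_q(40, 20, 10, 30)
y = yule_y(40, 20, 10, 30)
print(q, y, y_from_q(q))
```

For this table Q is positive, so the attributes are positively associated, and the value of Y obtained from Q should agree with the one computed directly from the cells.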
5. Principle of Least Squares
The "Principle of Least Squares" is a fundamental method used in regression and curve fitting. It helps us find the "best-fit" line or curve for a set of data points (x, y).
Principle: The best-fit curve is the one that minimizes the sum of the squares of the vertical errors (residuals) between the observed data points (y) and the values predicted by the curve (ŷ).
- Data point: (xi, yi)
- Fitted curve: ŷ = f(x)
- Error (Residual): ei = yi - ŷi
- Goal: Minimize the Sum of Squared Errors: SSE = Σ ei² = Σ (yi - ŷi)²
We use calculus to find the parameters (e.g., 'a' and 'b' in a line) that make this sum as small as possible: setting the partial derivative of SSE with respect to each parameter to zero yields one equation per parameter. These equations are called the "Normal Equations."
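To make the criterion concrete, the sketch below computes the sum of squared errors for two candidate lines on the same data. The data points, the candidate lines, and the function name `sse` are hypothetical and chosen only for illustration.

```python
def sse(xs, ys, predict):
    """Sum of squared vertical errors between observed y and predicted y."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

def line_1(x):
    return 2 * x      # candidate line y = 2x

def line_2(x):
    return 1 + x      # candidate line y = 1 + x

sse_1 = sse(xs, ys, line_1)
sse_2 = sse(xs, ys, line_2)
print(sse_1, sse_2)
```

The line with the smaller sum fits better under this criterion. Least squares goes one step further and finds the parameters that give the smallest possible sum among all lines.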
6. Fitting of Polynomials and Exponential Curves
Using the Principle of Least Squares, we can derive the Normal Equations needed to fit specific curves to data.
1. Fitting a Straight Line (Linear Regression)
Equation: y = a + bx
Here, 'a' is the y-intercept and 'b' is the slope. We need to find the values of 'a' and 'b' that minimize Σ(y - (a + bx))².
Normal Equations for a Straight Line:
- Σy = n*a + b*(Σx)
- Σxy = a*(Σx) + b*(Σx²)
How to solve:
1. From your data, calculate: n, Σx, Σy, Σxy, Σx²
2. Plug these 5 values into the two normal equations.
3. You now have two simultaneous linear equations with two unknowns (a, b). Solve for 'a' and 'b'.
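These steps can be sketched in Python. The two normal equations are solved here with Cramer's rule, which is one of several ways to solve a 2x2 system; the function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by solving the two normal equations."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Normal equations:
    #   sy  = n*a  + b*sx
    #   sxy = a*sx + b*sxx
    det = n * sxx - sx * sx          # zero when all x values are equal
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b

# Hypothetical data points.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
a, b = fit_line(xs, ys)
print(a, b)
```

For this data the sums are n = 4, Σx = 10, Σy = 19, Σxy = 57 and Σx² = 30, so the fitted line works out to roughly y = 0 + 1.9x.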
2. Fitting a Polynomial (Parabola / Quadratic)
Equation: y = a + bx + cx²
We need to find 'a', 'b', and 'c'.
Normal Equations for a Parabola:
- Σy = n*a + b*(Σx) + c*(Σx²)
- Σxy = a*(Σx) + b*(Σx²) + c*(Σx³)
- Σx²y = a*(Σx²) + b*(Σx³) + c*(Σx⁴)
How to solve:
1. Calculate n, Σx, Σy, Σx², Σxy, Σx³, Σx²y, Σx⁴.
2. Plug these values in to get three simultaneous equations.
3. Solve for 'a', 'b', and 'c'.
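A minimal sketch of the same procedure in Python follows. The three normal equations are solved with Cramer's rule on 3x3 determinants, chosen here only to keep the example free of external libraries; the helper names `det3` and `fit_parabola` and the data are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def fit_parabola(xs, ys):
    """Fit y = a + b*x + c*x^2 by solving the three normal equations."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x ** 2 * y for x, y in zip(xs, ys))
    coeffs = [[n, sx, sx2], [sx, sx2, sx3], [sx2, sx3, sx4]]
    rhs = [sy, sxy, sx2y]
    d = det3(coeffs)
    solution = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        m = [row[:] for row in coeffs]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return tuple(solution)

# Hypothetical data generated from y = 1 + 2x + 3x^2.
xs = [0, 1, 2, 3, 4]
ys = [1, 6, 17, 34, 57]
a, b, c = fit_parabola(xs, ys)
print(a, b, c)
```

Because the data lies exactly on a parabola, the fit should recover the coefficients used to generate it.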
3. Fitting an Exponential Curve
Equation: y = a * b^x
This is not a linear equation, so we can't use the normal equations directly. We must transform it into a linear form by taking the logarithm of both sides.
log(y) = log(a * b^x)
log(y) = log(a) + log(b^x)
log(y) = log(a) + x * log(b)
Now, let Y = log(y), A = log(a), and B = log(b).
The equation becomes: Y = A + Bx
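This is the equation of a straight line in x and Y, so we can use the normal equations for a straight line, replacing 'y' with 'Y' (i.e., log(y)), 'a' with 'A', and 'b' with 'B'. Note that this requires every y to be positive, and that the fit minimizes the squared errors in log(y) rather than in y itself.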
Normal Equations for Exponential Curve:
- Σ(log y) = n*A + B*(Σx)
- Σ(x * log y) = A*(Σx) + B*(Σx²)
How to solve:
1. Create new columns in your data for Y = log(y) and x*log(y).
2. Calculate: n, Σx, Σx², Σ(log y), Σ(x * log y).
3. Solve the two normal equations for A and B.
4. Convert A and B back to 'a' and 'b':
- a = antilog(A) (or a = 10^A if using log base 10)
- b = antilog(B) (or b = 10^B if using log base 10)
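The whole procedure can be sketched in Python using base-10 logarithms. The function name `fit_exponential` and the data are hypothetical, and the sketch assumes every y value is positive.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * b**x by fitting a straight line to (x, log10(y))."""
    log_ys = [math.log10(y) for y in ys]   # Y = log(y); requires y > 0
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    s_log = sum(log_ys)
    s_xlog = sum(x * ly for x, ly in zip(xs, log_ys))
    # Normal equations in A = log(a), B = log(b):
    #   s_log  = n*A  + B*sx
    #   s_xlog = A*sx + B*sxx
    det = n * sxx - sx * sx
    big_a = (s_log * sxx - sx * s_xlog) / det
    big_b = (n * s_xlog - sx * s_log) / det
    # Convert back: a = 10^A, b = 10^B.
    return 10 ** big_a, 10 ** big_b

# Hypothetical data generated from y = 3 * 2^x.
xs = [0, 1, 2, 3]
ys = [3, 6, 12, 24]
a, b = fit_exponential(xs, ys)
print(a, b)
```

Since the data follows y = 3 * 2^x exactly, the recovered values should be very close to a = 3 and b = 2, up to floating-point rounding.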