Unit 3: Theory of Attributes & Curve Fitting

1. Theory of Attributes: Introduction
2. Consistency of Attributes
3. Independence of Attributes
4. Association of Attributes
5. Principle of Least Squares
6. Fitting of Polynomials and Exponential Curves

1. Theory of Attributes: Introduction

The "Theory of Attributes" deals with qualitative data (see Unit 1), which cannot be measured numerically but can be classified based on the presence or absence of a characteristic (an "attribute").

Examples: Literacy, gender (Male/Female), blindness, employment (Employed/Unemployed).

Notation

Positive Attributes (Presence): Represented by capital letters.
- A = Literacy, B = Employment
Negative Attributes (Absence): Represented by Greek letters (or lowercase).
- α (alpha) = Illiteracy (Not-A)
- β (beta) = Unemployment (Not-B)
Combinations (Classes):
- (A) = Number of people who are Literate.
- (β) = Number of people who are Unemployed.
- (AB) = Number of people who are both Literate AND Employed.
- (Aβ) = Number of people who are Literate AND Unemployed.
N: The total number of observations (the "universe").

Classes and Class Frequencies

Order 0: N (the total)
Order 1: (A), (B), (α), (β)
Order 2: (AB), (Aβ), (αB), (αβ)

These frequencies are related. For example:

N = (A) + (α)
N = (B) + (β)
(A) = (AB) + (Aβ)
(B) = (AB) + (αB)
N = (AB) + (Aβ) + (αB) + (αβ)

Contingency Table (2x2)

This is the easiest way to organize the frequencies for two attributes.

2x2 Contingency Table
Attribute	B (Employed)	β (Unemployed)	Total
A (Literate)	(AB)	(Aβ)	(A)
α (Illiterate)	(αB)	(αβ)	(α)
Total	(B)	(β)	N

2. Consistency of Attributes

Consistency: A set of data (class frequencies) is said to be consistent if no class frequency is negative.

Since a frequency represents a count of items, it cannot be less than zero. If any calculation results in a negative frequency (e.g., (AB) < 0), the data is inconsistent and likely contains errors in collection or transcription.

Condition for Consistency:

All frequencies of the highest order must be non-negative. For two attributes, this means:

(AB) ≥ 0
(Aβ) ≥ 0
(αB) ≥ 0
(αβ) ≥ 0

Exam Tip: You will typically be given N, (A), (B), and (AB). To check consistency, calculate the other three order-2 frequencies and see whether any is negative.
- (Aβ) = (A) - (AB)
- (αB) = (B) - (AB)
- (αβ) = N - (A) - (B) + (AB) or (αβ) = (α) - (αB)
If any of these 3 are negative, the data is inconsistent.
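
The consistency check above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names and the sample figures are made up for the example.

```python
def second_order_frequencies(n, a, b, ab):
    """Return the four order-2 class frequencies from N, (A), (B), (AB)."""
    a_beta = a - ab              # (Aβ) = (A) - (AB)
    alpha_b = b - ab             # (αB) = (B) - (AB)
    alpha_beta = n - a - b + ab  # (αβ) = N - (A) - (B) + (AB)
    return {"AB": ab, "Aβ": a_beta, "αB": alpha_b, "αβ": alpha_beta}


def is_consistent(n, a, b, ab):
    """Data are consistent when no order-2 frequency is negative."""
    return all(f >= 0 for f in second_order_frequencies(n, a, b, ab).values())


# Hypothetical figures, for illustration only.
print(second_order_frequencies(1000, 600, 500, 350))
# {'AB': 350, 'Aβ': 250, 'αB': 150, 'αβ': 250}
print(is_consistent(1000, 600, 500, 350))  # True
print(is_consistent(1000, 600, 500, 50))   # False, because (αβ) = -50
```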

3. Independence of Attributes

Independence: Two attributes A and B are independent if there is no relationship between them. The presence or absence of A has no effect on the presence or absence of B.

If they are independent, the proportion of 'A's among 'B's should be the same as the proportion of 'A's in the whole population.

Proportion of A's among B's = (AB) / (B)

Proportion of A's in total = (A) / N

Condition for Independence:

A and B are independent if: (AB) = (A) * (B) / N

The observed frequency (AB) is compared to the expected frequency (A)*(B)/N.

4. Association of Attributes

If attributes are not independent, they are associated.

Positive Association: A and B are positively associated if they tend to appear together.
- Condition: Observed (AB) > Expected ( (A)*(B)/N )
- Example: Literacy and Employment.
Negative Association (or Disassociation): A and B are negatively associated if the presence of one discourages the presence of the other.
- Condition: Observed (AB) < Expected ( (A)*(B)/N )
- Example: Vaccination and Sickness.
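
As a rough sketch, the comparison of observed and expected (AB) can be coded as follows. The function names and the sample figures are assumptions made for this example.

```python
def expected_ab(n, a, b):
    """Expected (AB) if A and B were independent: (A)*(B)/N."""
    return a * b / n


def association_type(n, a, b, ab):
    """Classify the association by comparing observed and expected (AB)."""
    # Cross-multiplying by N avoids floating-point division:
    # (AB) > (A)(B)/N is the same as (AB)*N > (A)*(B) when N > 0.
    observed, expected = ab * n, a * b
    if observed > expected:
        return "positive"
    if observed < expected:
        return "negative"
    return "independent"


# Hypothetical figures: N = 1000, (A) = 600, (B) = 500.
print(expected_ab(1000, 600, 500))            # 300.0
print(association_type(1000, 600, 500, 350))  # positive
print(association_type(1000, 600, 500, 300))  # independent
print(association_type(1000, 600, 500, 250))  # negative
```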

Coefficients of Association

These are numerical measures of the *strength* and *direction* of the association.

1. Yule's Coefficient of Association (Q)

This is the most common measure. It ranges from -1 to +1.

Q = [ (AB)(αβ) - (Aβ)(αB) ] / [ (AB)(αβ) + (Aβ)(αB) ]

If Q = +1: Perfect Positive Association.
If Q = -1: Perfect Negative Association.
If Q = 0: A and B are Independent.
The closer to |1|, the stronger the association.
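
A minimal sketch of the Q formula, taking the four order-2 frequencies as inputs. The function name and the sample frequencies are illustrative assumptions.

```python
def yules_q(ab, a_beta, alpha_b, alpha_beta):
    """Yule's Q from the four cells of a 2x2 contingency table."""
    same = ab * alpha_beta    # (AB)(αβ)
    cross = a_beta * alpha_b  # (Aβ)(αB)
    denom = same + cross
    if denom == 0:
        raise ValueError("Q is undefined when both products are zero")
    return (same - cross) / denom


# Hypothetical table: (AB)=350, (Aβ)=250, (αB)=150, (αβ)=250.
print(yules_q(350, 250, 150, 250))  # 0.4, a moderate positive association
print(yules_q(300, 300, 200, 200))  # 0.0, independent
print(yules_q(400, 0, 100, 500))    # 1.0, perfect positive association
```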

2. Yule's Coefficient of Colligation (Y)

Another measure, also ranging from -1 to +1. It is related to Q.

Y = [ sqrt((AB)(αβ)) - sqrt((Aβ)(αB)) ] / [ sqrt((AB)(αβ)) + sqrt((Aβ)(αB)) ]

Relationship between Q and Y:

Q = (2 * Y) / (1 + Y²)
Y = (1 - sqrt(1 - Q²)) / Q   (for Q ≠ 0; when Q = 0, Y = 0)

Note: In absolute value, Y is always smaller than Q, i.e. |Y| < |Q| (unless Q = 0 or Q = ±1, where the two coincide). This means Q tends to overstate the strength of association compared to Y.
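
The following sketch computes Y and checks the relationship Q = 2Y / (1 + Y²) numerically. The function names and the sample frequencies are assumptions for illustration.

```python
import math


def yules_y(ab, a_beta, alpha_b, alpha_beta):
    """Yule's coefficient of colligation from the four 2x2 cells."""
    root_same = math.sqrt(ab * alpha_beta)    # sqrt((AB)(αβ))
    root_cross = math.sqrt(a_beta * alpha_b)  # sqrt((Aβ)(αB))
    denom = root_same + root_cross
    if denom == 0:
        raise ValueError("Y is undefined when both products are zero")
    return (root_same - root_cross) / denom


def q_from_y(y):
    """Recover Q from Y using Q = 2Y / (1 + Y²)."""
    return 2 * y / (1 + y * y)


# Hypothetical table: (AB)=350, (Aβ)=250, (αB)=150, (αβ)=250, for which Q = 0.4.
y = yules_y(350, 250, 150, 250)
print(round(y, 3))            # 0.209, smaller in magnitude than Q = 0.4
print(round(q_from_y(y), 6))  # 0.4
```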

5. Principle of Least Squares

The "Principle of Least Squares" is a fundamental method used in regression and curve fitting. It helps us find the "best-fit" line or curve for a set of data points (x, y).

Principle: The best-fit curve is the one that minimizes the sum of the squares of the vertical errors (residuals) between the observed data points (y) and the values predicted by the curve (ŷ).

Data point: (x_i, y_i)
Fitted curve: ŷ = f(x)
Error (Residual): e_i = y_i - ŷ_i
Goal: Minimize the Sum of Squared Errors (SSE).

Minimize: SSE = Σ (e_i)² = Σ (y_i - ŷ_i)²

We use calculus (partial derivatives) to find the parameters (e.g., 'a' and 'b' in a line) that make this sum as small as possible. This process generates a set of "Normal Equations."
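
To make the quantity being minimized concrete, here is a small sketch that computes the SSE for two candidate lines on the same data. The data and both candidate lines are made up for illustration; neither line is claimed to be the least-squares fit.

```python
def sse(xs, ys, predict):
    """Sum of squared vertical residuals between the data and a curve."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data points.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]


def line_1(x):
    return 2 * x        # candidate: y = 2x


def line_2(x):
    return 1 + 1.5 * x  # candidate: y = 1 + 1.5x


print(sse(xs, ys, line_1))  # 1
print(sse(xs, ys, line_2))  # 1.5
```

Of these two candidates, y = 2x has the smaller SSE and so fits better in the least-squares sense. The normal equations find the parameters with the smallest SSE among all possible lines.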

6. Fitting of Polynomials and Exponential Curves

Using the Principle of Least Squares, we can derive the Normal Equations needed to fit specific curves to data.

1. Fitting a Straight Line (Linear Regression)

Equation: y = a + bx

Here, 'a' is the y-intercept and 'b' is the slope. We need to find the values of 'a' and 'b' that minimize Σ(y - (a + bx))².

Normal Equations for a Straight Line:

Σy = n*a + b*(Σx)

Σxy = a*(Σx) + b*(Σx²)

How to solve:
1. From your data, calculate: n, Σx, Σy, Σxy, Σx².
2. Plug these 5 values into the two normal equations.
3. You now have two simultaneous linear equations with two unknowns (a, b). Solve for 'a' and 'b'.
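
The procedure for a straight line can be sketched as below. The function name and the sample data are assumptions; the two normal equations are solved by elimination.

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by solving the two normal equations."""
    n = len(xs)
    sx = sum(xs)                              # Σx
    sy = sum(ys)                              # Σy
    sxy = sum(x * y for x, y in zip(xs, ys))  # Σxy
    sxx = sum(x * x for x in xs)              # Σx²

    # Eliminating 'a' from the two normal equations gives 'b';
    # 'a' then follows from Σy = n*a + b*Σx.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the line is not determined")
    b = (n * sxy - sx * sy) / det
    a = (sy - b * sx) / n
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
print(fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11]))  # (1.0, 2.0)
```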

2. Fitting a Polynomial (Parabola / Quadratic)

Equation: y = a + bx + cx²

We need to find 'a', 'b', and 'c'.

Normal Equations for a Parabola:

Σy = n*a + b*(Σx) + c*(Σx²)

Σxy = a*(Σx) + b*(Σx²) + c*(Σx³)

Σx²y = a*(Σx²) + b*(Σx³) + c*(Σx⁴)

How to solve:
1. Calculate n, Σx, Σy, Σx², Σxy, Σx³, Σx²y, Σx⁴.
2. Plug these values in to get three simultaneous equations.
3. Solve for 'a', 'b', and 'c'.
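
A sketch of the same procedure for a parabola. Solving the three normal equations by Cramer's rule is a choice made for this example (any method for simultaneous equations works); the function names and data are illustrative.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_parabola(xs, ys):
    """Fit y = a + b*x + c*x² by solving the three normal equations."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))

    # Coefficient matrix and right-hand side of the normal equations.
    coeffs = [[n, sx, sx2], [sx, sx2, sx3], [sx2, sx3, sx4]]
    rhs = [sy, sxy, sx2y]

    d = det3(coeffs)
    if d == 0:
        raise ValueError("normal equations have no unique solution")

    # Cramer's rule: replace one column at a time with the right-hand side.
    solution = []
    for col in range(3):
        m = [row[:] for row in coeffs]
        for r in range(3):
            m[r][col] = rhs[r]
        solution.append(det3(m) / d)
    return tuple(solution)  # (a, b, c)


# Hypothetical data lying exactly on y = 1 + 2x + 3x².
print(fit_parabola([0, 1, 2, 3, 4], [1, 6, 17, 34, 57]))  # (1.0, 2.0, 3.0)
```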

3. Fitting an Exponential Curve

Equation: y = a * b^x

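This equation is not linear in the parameters 'a' and 'b', so the straight-line normal equations cannot be applied to it directly. We first transform it into a linear form by taking the logarithm of both sides (this requires all y values, and 'a' and 'b', to be positive).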

log(y) = log(a * b^x)
log(y) = log(a) + log(b^x)
log(y) = log(a) + x * log(b)

Now, let Y = log(y), A = log(a), and B = log(b).

The equation becomes: Y = A + Bx

This is the equation of a straight line in x and Y. We can therefore use the normal equations for a straight line, with 'y' replaced by 'Y' (i.e., log(y)), 'a' by 'A', and 'b' by 'B'.

Normal Equations for Exponential Curve:

Σ(log y) = n*A + B*(Σx)

Σ(x * log y) = A*(Σx) + B*(Σx²)

How to solve:
1. Create new columns in your data for Y = log(y) and x*log(y).
2. Calculate: n, Σx, Σx², Σ(log y), Σ(x * log y).
3. Solve the two normal equations for A and B.
4. Convert A and B back to 'a' and 'b':
- a = antilog(A) (or a = 10^A if using log base 10)
- b = antilog(B) (or b = 10^B if using log base 10)
5. The final fitted curve is y = a * b^x.
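
The steps for the exponential curve can be sketched as follows, using log base 10 throughout. The function name and the sample data are assumptions for illustration.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * b**x by fitting a straight line to (x, log10(y))."""
    if any(y <= 0 for y in ys):
        raise ValueError("log(y) needs every y to be positive")

    log_ys = [math.log10(y) for y in ys]         # Y = log(y)
    n = len(xs)
    sx = sum(xs)                                 # Σx
    sxx = sum(x * x for x in xs)                 # Σx²
    s_log = sum(log_ys)                          # Σ(log y)
    sx_log = sum(x * ly for x, ly in zip(xs, log_ys))  # Σ(x * log y)

    # Solve the two normal equations for A and B.
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the fit is not determined")
    big_b = (n * sx_log - sx * s_log) / det
    big_a = (s_log - big_b * sx) / n

    # Convert back with the antilog (base 10 here).
    return 10 ** big_a, 10 ** big_b


# Hypothetical data lying exactly on y = 3 * 2^x.
a, b = fit_exponential([0, 1, 2, 3], [3, 6, 12, 24])
print(round(a, 6), round(b, 6))  # 3.0 2.0
```

Note that this approach minimizes the squared errors in log(y), not in y itself, so the result can differ slightly from a direct least-squares fit of the exponential curve to the original data.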

Exam Tip: Curve fitting problems are very common. The key is to correctly calculate all the "sum" (Σ) values from the given data table. Be very careful with your calculations.
