Unit 2: Measures of Central Tendency and Dispersion

2.1 Measures of Central Tendency
2.2 Partition Values
2.3 Measures of Dispersion
2.4 Relative Measures of Dispersion
2.5 Moments
2.6 Measures of Skewness and Kurtosis

This unit focuses on summarizing data using numerical values. We look at measures for the "center" of the data and measures for the "spread" of the data.

2.1 Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set.

Arithmetic Mean (A.M.)

The familiar "average": the sum of all values divided by the number of values.

Ungrouped Data: x-bar = (Σx) / n
Grouped Data: x-bar = (Σfx) / (Σf), where 'x' is the class midpoint and 'f' is the frequency.
Pros: Easy to calculate, uses all data points.
Cons: Highly sensitive to outliers (extreme values).
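
As a minimal sketch of the two formulas above in Python (the function names and all numbers are made up for illustration):

```python
def mean_ungrouped(values):
    """x-bar = (sum of x) / n for raw observations."""
    return sum(values) / len(values)


def mean_grouped(midpoints, freqs):
    """x-bar = (sum of f*x) / (sum of f), where x is the class midpoint."""
    return sum(f * x for x, f in zip(midpoints, freqs)) / sum(freqs)


marks = [4, 8, 6, 5, 7]                      # hypothetical raw data
print(mean_ungrouped(marks))                 # 6.0

# Hypothetical classes 0-10, 10-20, 20-30 with midpoints 5, 15, 25
print(mean_grouped([5, 15, 25], [2, 5, 3]))  # 16.0

# One extreme value pulls the mean far from the bulk of the data
print(mean_ungrouped(marks + [60]))          # 15.0
```

The last line illustrates the "Cons" entry: adding a single value of 60 moves the mean from 6 to 15.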

Median

The middle value of a dataset that has been sorted in order of magnitude.

Ungrouped Data: If 'n' is odd, the median is the (n+1)/2-th value. If 'n' is even, it is the average of the two middle values (the n/2-th and the (n/2 + 1)-th).
Grouped Data (Formula):
Median = L + [ (N/2 - C) / f ] * h
- L = Lower boundary of the median class (the class containing the N/2-th item).
- N = Total frequency (Σf).
- C = Cumulative frequency of the class preceding the median class.
- f = Frequency of the median class.
- h = Class width of the median class.
Pros: Not affected by outliers (it is "robust").
Cons: Does not use all data values.
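
The grouped-data formula can be sketched in Python as follows. This assumes equal-width classes described by their lower boundaries. The function name and the frequency table are hypothetical.

```python
def grouped_median(lower_bounds, width, freqs):
    """Median = L + ((N/2 - C) / f) * h for equal-width classes."""
    n_total = sum(freqs)                 # N
    if n_total == 0:
        raise ValueError("frequency table is empty")
    half = n_total / 2
    cum = 0                              # C: cumulative frequency so far
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum + f >= half:    # first class reaching the N/2-th item
            return lower + (half - cum) / f * width
        cum += f
    raise ValueError("median class not found")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40
# N = 30, N/2 = 15, median class is 20-30 (C = 13, f = 12)
print(grouped_median([0, 10, 20, 30], 10, [5, 8, 12, 5]))  # 20 + (2/12)*10, about 21.67
```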

Mode

The value that appears most frequently in the dataset.

Ungrouped Data: Find the value with the highest frequency. A dataset can be unimodal, bimodal, or multimodal.
Grouped Data (Formula):
Mode = L + [ (f₁ - f₀) / (2f₁ - f₀ - f₂) ] * h
- L = Lower boundary of the modal class (the class with the highest frequency).
- f₁ = Frequency of the modal class.
- f₀ = Frequency of the class preceding the modal class.
- f₂ = Frequency of the class following the modal class.
- h = Class width of the modal class.
Pros: Easy to understand, can be used for categorical data.
Cons: May not exist or may not be unique.
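
A rough Python sketch of both cases follows. `statistics.multimode` from the standard library handles ungrouped data. The `grouped_mode` helper is hypothetical and assumes equal-width classes, and it treats a missing neighbouring class as having frequency 0, which is a convention chosen here rather than something fixed by the formula.

```python
import statistics


def grouped_mode(lower_bounds, width, freqs):
    """Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h for equal-width classes."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])  # first modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                   # assumption: 0 if no class before
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0      # assumption: 0 if no class after
    denom = 2 * f1 - f0 - f2
    if denom == 0:
        raise ValueError("formula is undefined: neighbouring classes tie with the modal class")
    return lower_bounds[i] + (f1 - f0) / denom * width


# Ungrouped: a bimodal dataset has two modes
print(statistics.multimode([2, 3, 3, 5, 5, 7]))          # [3, 5]

# Grouped: hypothetical classes 0-10, 10-20, 20-30, 30-40
print(grouped_mode([0, 10, 20, 30], 10, [5, 8, 12, 5]))  # 20 + (4/11)*10, about 23.64
```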

Empirical Relationship (for unimodal, moderately skewed distributions):

Mean - Mode ≈ 3 * (Mean - Median)

For a positively (right) skewed distribution: Mean > Median > Mode

For a negatively (left) skewed distribution: Mean < Median < Mode
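
Rearranging the empirical relationship gives Mode ≈ 3 * Median - 2 * Mean, which is handy when the mode is hard to locate directly. A small sketch with made-up values (the function name is hypothetical):

```python
def estimate_mode(mean, median):
    """Mean - Mode ≈ 3 * (Mean - Median)  =>  Mode ≈ 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


mean, median = 50, 48            # hypothetical summary values
mode = estimate_mode(mean, median)
print(mode)                      # 44
print(mean > median > mode)      # True: consistent with a positive (right) skew
```

The result is only an approximation, and only for unimodal, moderately skewed distributions.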

Geometric Mean (G.M.)

The n-th root of the product of n values. It is suitable for averaging ratios, percentages, or growth rates.

G.M. = (x₁ * x₂ * ... * xₙ)^(1/n)

Cannot be used if any data value is zero or negative.
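
A minimal Python sketch of the formula, using made-up yearly growth factors. The helper is hypothetical; the standard library also offers `statistics.geometric_mean`.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n strictly positive values."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.prod(values) ** (1 / len(values))


print(geometric_mean([2, 8]))                 # 4.0

# Hypothetical growth factors: +10%, +20%, -10% over three years
growth = [1.10, 1.20, 0.90]
print(round(geometric_mean(growth), 3))       # about 1.059, i.e. roughly 5.9% per year
```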

Harmonic Mean (H.M.)

The reciprocal of the arithmetic mean of the reciprocals. It is suitable for averaging rates and speeds.

H.M. = n / ( Σ(1/x) )

Cannot be used if any data value is zero.
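
A classic illustration is the average speed over a round trip with equal distances each way. The sketch below uses made-up speeds and a hypothetical helper name.

```python
def harmonic_mean(values):
    """H.M. = n / sum(1/x); undefined if any value is zero."""
    if any(x == 0 for x in values):
        raise ValueError("harmonic mean is undefined when a value is zero")
    return len(values) / sum(1 / x for x in values)


speeds = [40, 60]                        # km/h out and back over the same distance
print(round(harmonic_mean(speeds), 2))   # 48.0, the true average speed
print(sum(speeds) / len(speeds))         # 50.0, the arithmetic mean overstates it
```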

2.2 Partition Values

These are values that divide a sorted dataset into equal parts.

Median: Divides the data into 2 equal parts. (It is the 2nd Quartile).
Quartiles: Divide the data into 4 equal parts.
- Q₁ (First Quartile): The 25th percentile. 25% of data is below it.
- Q₂ (Second Quartile): The 50th percentile. This is the Median.
- Q₃ (Third Quartile): The 75th percentile. 75% of data is below it.
Deciles: Divide the data into 10 equal parts (D₁, D₂, ..., D₉).
Percentiles: Divide the data into 100 equal parts (P₁, P₂, ..., P₉₉).

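The formula for any partition value in grouped data is a generalization of the median formula: replace N/2 with the position of the required value (kN/4 for Qₖ, kN/10 for Dₖ, kN/100 for Pₖ) and take L, C, f, and h from the class that contains that position. For example, to find Q₁: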

Q₁ = L + [ (N/4 - C) / f ] * h
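
One way to sketch this generalization in Python is a single helper that takes the partition index k and the number of parts. The function name and frequency table are hypothetical, and equal-width classes are assumed.

```python
def grouped_partition(lower_bounds, width, freqs, k, parts):
    """Value = L + ((k*N/parts - C) / f) * h for equal-width classes."""
    n_total = sum(freqs)
    if n_total == 0:
        raise ValueError("frequency table is empty")
    position = k * n_total / parts       # e.g. N/4 for Q1, 9N/10 for D9
    cum = 0                              # C: cumulative frequency so far
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum + f >= position:
            return lower + (position - cum) / f * width
        cum += f
    raise ValueError("position lies beyond the last class")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with N = 30
table = ([0, 10, 20, 30], 10, [5, 8, 12, 5])
print(grouped_partition(*table, k=1, parts=4))    # Q1 = 13.125
print(grouped_partition(*table, k=3, parts=4))    # Q3, about 27.92
print(grouped_partition(*table, k=9, parts=10))   # D9 = 34.0
```

Calling it with k=2 and parts=4 reproduces the grouped median formula.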

2.3 Measures of Dispersion

These measures describe the spread, variability, or scatter of the data. A low value means the data is clustered tightly around the center, while a high value means it is spread out.

Range

The simplest measure. Range = Maximum Value - Minimum Value.

Pros: Very easy to calculate.
Cons: Highly unstable as it depends only on two extreme values.

Quartile Deviation (Q.D.)

Also called the Semi-Interquartile Range. It measures the spread of the middle 50% of the data.

Q.D. = (Q₃ - Q₁) / 2

Pros: Not affected by outliers. More stable than the Range because it does not depend on the two extreme values.
Cons: Ignores 50% of the data (the top 25% and bottom 25%).
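
The contrast between Range and Q.D. can be sketched in Python. Quartiles of ungrouped data have several competing conventions; this example uses the default method of the standard library's `statistics.quantiles`, so other textbooks or tools may give slightly different quartiles. The data are made up.

```python
import statistics


def quartile_deviation(values):
    """Q.D. = (Q3 - Q1) / 2, with quartiles from statistics.quantiles."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 2


data = [2, 4, 6, 8, 10, 12, 14]
with_outlier = [2, 4, 6, 8, 10, 12, 140]   # largest value replaced by an outlier

print(max(data) - min(data), quartile_deviation(data))                          # 12 4.0
print(max(with_outlier) - min(with_outlier), quartile_deviation(with_outlier))  # 138 4.0
```

The range jumps from 12 to 138, while the quartile deviation stays at 4.0.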

Mean Deviation (M.D.)

The average of the absolute differences between each data point and the mean (or median).

M.D. (about mean) = ( Σ |x - x-bar| ) / n

Pros: Uses all data values.
Cons: Taking absolute values discards the signs of the deviations, which makes the measure awkward for further algebraic treatment.
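
A short Python sketch of the mean deviation about the mean and about the median, on made-up data (the helper name is hypothetical):

```python
import statistics


def mean_deviation(values, about):
    """Average absolute deviation of the values from the point `about`."""
    return sum(abs(x - about) for x in values) / len(values)


data = [1, 2, 3, 4, 10]
print(mean_deviation(data, statistics.mean(data)))    # 2.4 (about the mean, 4)
print(mean_deviation(data, statistics.median(data)))  # 2.2 (about the median, 3)
```

The value about the median is the smaller of the two here, which reflects a general property: the mean deviation is smallest when measured from the median.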

Variance (σ²) and Standard Deviation (σ)

The most important and widely used measures of dispersion.

Variance (σ²): The average of the squared differences from the Mean. Squaring the deviations avoids the issue of positive and negative deviations canceling out.
- Population Variance (σ²): σ² = ( Σ(x - μ)² ) / N
- Sample Variance (s²): s² = ( Σ(x - x-bar)² ) / (n-1) (Note: We divide by n-1 for samples to get an unbiased estimator of σ².)
Standard Deviation (σ): The square root of the variance.
- σ = sqrt(Variance)
- Pros: Has the same units as the original data, making it more interpretable than variance. Uses all data.
- Cons: Is sensitive to outliers (due to squaring).

Computational Formula for Variance:

σ² = [ (Σfx²) / N ] - (x-bar)²

(Average of the squares) - (Square of the average)
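
The definitions and the computational formula can be checked against each other with a small Python sketch. The data are made up and every value has frequency 1, so the Σfx² form reduces to Σx². Function names are hypothetical.

```python
import math


def population_variance(values):
    """sigma^2 = sum((x - mu)^2) / N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """s^2 = sum((x - x_bar)^2) / (n - 1)."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


def variance_shortcut(values):
    """(average of the squares) - (square of the average)."""
    n = len(values)
    return sum(x * x for x in values) / n - (sum(values) / n) ** 2


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))             # 4.0
print(math.sqrt(population_variance(data)))  # 2.0 (standard deviation)
print(variance_shortcut(data))               # 4.0, same as the definition
print(round(sample_variance(data), 3))       # 4.571 (divides by n - 1 = 7)
```

In floating-point arithmetic the shortcut form can lose precision when the mean is large relative to the spread, so the definitional form is usually the safer one to program.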

2.4 Relative Measures of Dispersion

The measures above (Range, Q.D., M.D., S.D.) are absolute: they are expressed in the same units as the data. To compare the variability of two different datasets (e.g., heights in cm vs. weights in kg), we need relative measures (unit-free coefficients).

Coefficient of Variation (C.V.)

The most important relative measure. It expresses the standard deviation as a percentage of the mean.

C.V. = (Standard Deviation / |Mean|) * 100

Uses of C.V.:

Comparison: To compare the variability of two or more series with different units or different means.
Consistency: A series with a lower C.V. is said to be more consistent, more stable, or less variable. A series with a higher C.V. is less consistent or more variable.
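
A minimal Python sketch of such a comparison follows. The means and standard deviations are hypothetical summary figures, not real measurements.

```python
def coefficient_of_variation(mean, sd):
    """C.V. = (sd / |mean|) * 100, as a percentage."""
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is zero")
    return sd / abs(mean) * 100


cv_height = coefficient_of_variation(mean=170, sd=8.5)   # heights in cm
cv_weight = coefficient_of_variation(mean=65, sd=6.5)    # weights in kg

print(round(cv_height, 2))    # 5.0
print(round(cv_weight, 2))    # 10.0
print("heights" if cv_height < cv_weight else "weights", "are more consistent")
```

Although 8.5 cm and 6.5 kg cannot be compared directly, the unit-free percentages can: in this made-up example the heights vary less relative to their mean.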

Other Coefficients:

Coefficient of Range: (Max - Min) / (Max + Min)
Coefficient of Quartile Deviation: (Q₃ - Q₁) / (Q₃ + Q₁)
Coefficient of Mean Deviation: (M.D.) / (Mean or Median)

2.5 Moments

Moments are statistical measures that describe the characteristics of a distribution's shape.

Raw Moments (μ'ᵣ)

Moments about the origin (zero). The r-th raw moment is E[Xʳ].

μ'₁ = (Σfx) / N = Mean (x-bar)
μ'₂ = (Σfx²) / N
μ'₃ = (Σfx³) / N
μ'₄ = (Σfx⁴) / N

Central Moments (μᵣ)

Moments about the mean. The r-th central moment is E[(X - μ)ʳ].

μ₁ = ( Σf(x - x-bar) ) / N = 0 (always, because deviations from the mean sum to zero).
μ₂ = ( Σf(x - x-bar)² ) / N = Variance (σ²)
μ₃ = ( Σf(x - x-bar)³ ) / N (Measures skewness)
μ₄ = ( Σf(x - x-bar)⁴ ) / N (Measures kurtosis)
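
Both families of moments can be computed from a frequency table with a few lines of Python. The table below is made up and the helper names are hypothetical.

```python
def raw_moment(xs, fs, r):
    """r-th moment about the origin: sum(f * x^r) / N."""
    return sum(f * x ** r for x, f in zip(xs, fs)) / sum(fs)


def central_moment(xs, fs, r):
    """r-th moment about the mean: sum(f * (x - x_bar)^r) / N."""
    x_bar = raw_moment(xs, fs, 1)
    return sum(f * (x - x_bar) ** r for x, f in zip(xs, fs)) / sum(fs)


xs, fs = [1, 2, 3], [1, 2, 1]      # hypothetical values and frequencies

print([raw_moment(xs, fs, r) for r in (1, 2)])            # [2.0, 4.5]
print([central_moment(xs, fs, r) for r in (1, 2, 3, 4)])  # [0.0, 0.5, 0.0, 0.5]

# Variance as (average of the squares) - (square of the average)
print(raw_moment(xs, fs, 2) - raw_moment(xs, fs, 1) ** 2)  # 0.5
```

The first central moment is 0 and the third is 0 as well, because this small table is symmetric about its mean.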

2.6 Measures of Skewness and Kurtosis

Skewness (Asymmetry)

Skewness measures the lack of symmetry in a distribution.

[Figure: three distributions: negatively skewed (left tail), symmetric (bell curve), and positively skewed (right tail)]

Positively Skewed (Right Skew): The tail on the right side is longer. Mean > Median > Mode.
Negatively Skewed (Left Skew): The tail on the left side is longer. Mean < Median < Mode.
Symmetric: The distribution is identical on both sides of the center. Mean = Median = Mode.

Measures of Skewness:

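Karl Pearson's Coefficient (Sk): Sk = (Mean - Mode) / σ. When the mode is ill-defined, the empirical relationship gives the alternative form Sk = 3 * (Mean - Median) / σ.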
Moment Coefficient (β₁ "beta-one"):
β₁ = μ₃² / μ₂³

For a symmetric distribution, μ₃ = 0, so β₁ = 0. Because μ₃ is squared, β₁ is never negative; the direction of the skew is given by the sign of μ₃.
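
As an illustration, the sketch below computes both measures for a small made-up dataset with a longer right tail. Function names are hypothetical, and the population (divide by N) form of the moments is used.

```python
import math
import statistics


def central_moment(values, r):
    """r-th central moment of raw data, dividing by N."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)


def beta1(values):
    """Moment coefficient of skewness: mu3^2 / mu2^3."""
    return central_moment(values, 3) ** 2 / central_moment(values, 2) ** 3


def pearson_skewness(values):
    """Karl Pearson's coefficient: (Mean - Mode) / sigma."""
    sigma = math.sqrt(central_moment(values, 2))
    return (statistics.mean(values) - statistics.mode(values)) / sigma


data = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4]       # made-up, right-tailed data

print(round(central_moment(data, 3), 3))    # 0.6 (positive, so right skew)
print(round(beta1(data), 3))                # 0.36
print(round(pearson_skewness(data), 3))     # 1.0
print(beta1([1, 2, 2, 3]))                  # 0.0 for a symmetric dataset
```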

Kurtosis (Peakedness)

Kurtosis measures the "tailedness" and "peakedness" of a distribution compared to a Normal (bell-shaped) distribution.

Mesokurtic: A normal distribution (medium peak, normal tails).
Leptokurtic: A "skinny" distribution (high peak, fat tails). More outliers.
Platykurtic: A "flat" distribution (low peak, thin tails). Fewer outliers.

Measure of Kurtosis:

Moment Coefficient (β₂ "beta-two"):
β₂ = μ₄ / μ₂² = μ₄ / σ⁴
- If β₂ = 3, the curve is Mesokurtic (like the Normal distribution).
- If β₂ > 3, the curve is Leptokurtic (peaked).
- If β₂ < 3, the curve is Platykurtic (flat).
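
The classification can be sketched in Python as follows. The three tiny datasets are made up purely so that β₂ is easy to verify by hand, and the tolerance used to decide "equal to 3" is an assumption of this sketch, since exact equality is fragile in floating-point arithmetic.

```python
def central_moment(values, r):
    """r-th central moment of raw data, dividing by N."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)


def beta2(values):
    """Moment coefficient of kurtosis: mu4 / mu2^2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def classify(b2, tol=1e-9):
    """Label a beta2 value; `tol` is an arbitrary tolerance around 3."""
    if abs(b2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"


flat = [1, 2, 2, 3]                   # mu2 = 0.5, mu4 = 0.5, beta2 = 2
middle = [-1, 0, 0, 0, 0, 1]          # mu2 = 1/3, mu4 = 1/3, beta2 = 3
heavy = [-3] + [0] * 16 + [3]         # mu2 = 1,   mu4 = 9,   beta2 = 9

for name, data in [("flat", flat), ("middle", middle), ("heavy", heavy)]:
    b2 = beta2(data)
    print(name, round(b2, 3), classify(b2))
```

The "heavy" dataset gets its large β₂ from two far-out values among many central ones, which matches the description of leptokurtic curves as having more outliers.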