Unit 2: Measures of Central Tendency and Dispersion
This unit focuses on summarizing data using numerical values. We look at measures for the "center" of the data and measures for the "spread" of the data.
2.1 Measures of Central Tendency
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set.
Arithmetic Mean (A.M.)
The "average" you are most familiar with. It's the sum of all values divided by the number of values.
- Ungrouped Data: x-bar = (Σx) / n
- Grouped Data: x-bar = (Σfx) / (Σf), where 'x' is the class midpoint and 'f' is the frequency.
- Pros: Easy to calculate, uses all data points.
- Cons: Highly sensitive to outliers (extreme values).
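As a minimal sketch of the two formulas above, the following Python snippet computes the mean for raw and for grouped data. The function names and all numbers are made up for illustration; only the formulas come from these notes.

```python
def mean_ungrouped(values):
    """Arithmetic mean of raw observations: (sum of x) / n."""
    return sum(values) / len(values)


def mean_grouped(midpoints, freqs):
    """Arithmetic mean of grouped data: (sum of f*x) / (sum of f)."""
    return sum(f * x for x, f in zip(midpoints, freqs)) / sum(freqs)


marks = [4, 8, 15, 16, 23, 42]            # made-up raw data
print(mean_ungrouped(marks))              # 18.0

# Made-up classes 0-10, 10-20, 20-30 with midpoints 5, 15, 25
print(mean_grouped([5, 15, 25], [2, 5, 3]))   # 16.0

# Sensitivity to outliers: one extreme value pulls the mean far upward
print(mean_ungrouped(marks + [500]))
```

The last line illustrates the "Cons" point: adding a single extreme observation moves the mean well away from where most of the data lies.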
Median
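The middle value of a dataset that has been sorted in order of magnitude. With an odd number of values it is the single middle value; with an even number of values it is the mean of the two middle values.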
Mode
The value that appears most frequently in the dataset. A dataset may have one mode, several modes, or no mode at all if no value repeats.
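The median and mode can be checked with Python's standard `statistics` module. This is only a sketch with made-up data; `multimode` assumes Python 3.8 or newer.

```python
import statistics

odd = [7, 3, 9, 1, 5]          # sorted: 1, 3, 5, 7, 9
even = [7, 3, 9, 1, 5, 11]     # sorted: 1, 3, 5, 7, 9, 11

print(statistics.median(odd))    # 5   (single middle value)
print(statistics.median(even))   # 6.0 (mean of the two middle values, 5 and 7)

scores = [2, 3, 3, 5, 7, 3, 5]
print(statistics.mode(scores))   # 3   (appears three times)

# When several values tie for most frequent, multimode returns all of them
print(statistics.multimode([1, 1, 2, 2, 3]))   # [1, 2]
```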
Empirical Relationship (for unimodal, moderately skewed distributions):
Mean - Mode ≈ 3 * (Mean - Median)
For a positively (right) skewed distribution: Mean > Median > Mode
For a negatively (left) skewed distribution: Mean < Median < Mode
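Rearranging the empirical relationship gives Mode ≈ 3 * Median - 2 * Mean, which can be used to estimate the mode when only the mean and median are known. The sketch below uses a hypothetical helper name and made-up values, and the result is an approximation that only makes sense for unimodal, moderately skewed data.

```python
def empirical_mode(mean, median):
    """Estimate the mode from Mean - Mode ≈ 3 * (Mean - Median)."""
    return 3 * median - 2 * mean


mean, median = 52.0, 50.0                  # made-up summary values
mode_est = empirical_mode(mean, median)    # 3*50 - 2*52
print(mode_est)                            # 46.0

# Mean > Median > Mode, the ordering stated for a positively skewed distribution
print(mean > median > mode_est)            # True
```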
Geometric Mean (G.M.)
The n-th root of the product of n values. It is suitable for averaging ratios, percentages, or growth rates.
G.M. = (x₁ * x₂ * ... * xₙ)^(1/n)
Cannot be used if any data value is zero or negative.
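To illustrate why the G.M. suits growth rates, here is a short sketch with made-up yearly growth factors (+10%, +20%, -10%). The helper name is hypothetical; `math.prod` and `statistics.geometric_mean` assume Python 3.8 or newer.

```python
import math
import statistics


def geometric_mean(values):
    """n-th root of the product of n strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.prod(values) ** (1 / len(values))


growth_factors = [1.10, 1.20, 0.90]        # made-up yearly growth factors
gm = geometric_mean(growth_factors)

# Growing by the factor gm every year reproduces the same total growth
print(math.isclose(gm ** 3, 1.10 * 1.20 * 0.90))                    # True

# The standard library gives the same value
print(math.isclose(gm, statistics.geometric_mean(growth_factors)))  # True

# Average yearly growth in percent (a little under 6% for these numbers)
print(round((gm - 1) * 100, 2))
```

The arithmetic mean of the same factors is larger than the G.M. and would overstate the average growth.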
Harmonic Mean (H.M.)
The reciprocal of the arithmetic mean of the reciprocals. It is suitable for averaging rates, such as speeds over equal distances.
H.M. = n / ( Σ(1/x) )
Cannot be used if any data value is zero.
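A classic use case, sketched with made-up numbers: a trip covers the same distance at 60 km/h on the way out and 40 km/h on the way back. If each leg is d km, the total time is d/60 + d/40 hours, so the average speed is 2d / (d/60 + d/40), which is exactly the H.M. of the two speeds. The helper name is hypothetical.

```python
import math
import statistics


def harmonic_mean(values):
    """n divided by the sum of reciprocals; undefined if any value is 0."""
    if any(v == 0 for v in values):
        raise ValueError("harmonic mean is undefined when a value is 0")
    return len(values) / sum(1 / v for v in values)


speeds = [60, 40]                  # km/h over two equal distances (made up)
hm = harmonic_mean(speeds)

print(math.isclose(hm, 48))                                 # True
print(math.isclose(hm, statistics.harmonic_mean(speeds)))   # True
print(sum(speeds) / len(speeds))                            # 50.0, the A.M. overstates it
```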
2.2 Partition Values
These are values that divide a sorted dataset into equal parts.
- Median: Divides the data into 2 equal parts. (It is the 2nd Quartile).
- Quartiles: Divide the data into 4 equal parts.
  - Q₁ (First Quartile): The 25th percentile. 25% of data is below it.
  - Q₂ (Second Quartile): The 50th percentile. This is the Median.
  - Q₃ (Third Quartile): The 75th percentile. 75% of data is below it.
- Deciles: Divide the data into 10 equal parts (D₁, D₂, ..., D₉).
- Percentiles: Divide the data into 100 equal parts (P₁, P₂, ..., P₉₉).
The formula for any partition value in grouped data is a generalization of the median formula. For example, to find Q₁:
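Q₁ = L + [ (N/4 - C) / f ] * h
where L is the lower boundary of the class containing Q₁, N is the total frequency, C is the cumulative frequency of the classes before the Q₁ class, f is the frequency of the Q₁ class, and h is the class width. For other partition values, replace N/4 by the corresponding position (e.g. N/2 for the median, 3N/4 for Q₃).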
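The grouped-data formula can be sketched in Python as follows. The symbols are read in the usual way, which is an assumption about these notes: L is the lower boundary of the class containing the partition value, N the total frequency, C the cumulative frequency before that class, f its frequency, and h the class width (taken to be equal for all classes). The function name and the frequency table are made up.

```python
import math


def grouped_partition(lower_bounds, width, freqs, fraction):
    """Value below which `fraction` of a grouped dataset lies.

    Implements L + ((fraction * N - C) / f) * h, so fraction=0.25 gives Q1,
    0.5 the median, 0.75 gives Q3, 0.1 gives D1, and so on.
    """
    n_total = sum(freqs)
    target = fraction * n_total          # e.g. N/4 for Q1
    cum_before = 0                       # C: cumulative frequency so far
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cum_before + f >= target:
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("fraction must lie between 0 and 1")


# Made-up classes 0-10, 10-20, 20-30, 30-40 with N = 20
lower_bounds = [0, 10, 20, 30]
freqs = [4, 6, 8, 2]

q1 = grouped_partition(lower_bounds, 10, freqs, 0.25)
q2 = grouped_partition(lower_bounds, 10, freqs, 0.50)
q3 = grouped_partition(lower_bounds, 10, freqs, 0.75)

print(math.isclose(q1, 10 + (5 - 4) / 6 * 10))   # True, about 11.67
print(q2)                                        # 20.0
print(q3)                                        # 26.25
```

For Q₁ the target position is N/4 = 5, which falls in the 10-20 class (C = 4, f = 6), giving 10 + (1/6) * 10.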
2.3 Measures of Dispersion
These measures describe the spread, variability, or scatter of the data. A low value means the data is clustered tightly around the center, while a high value means it is spread out.
Range
The simplest measure. Range = Maximum Value - Minimum Value.
- Pros: Very easy to calculate.
- Cons: Highly unstable as it depends only on two extreme values.
Quartile Deviation (Q.D.)
Also called the Semi-Interquartile Range. It measures the spread of the middle 50% of the data.
Q.D. = (Q₃ - Q₁) / 2
- Pros: Not affected by extreme values in the top and bottom 25% of the data, so it is more stable than the Range.
- Cons: Ignores 50% of the data (the top 25% and bottom 25%).
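The contrast between Range and Q.D. can be sketched with made-up ungrouped data. Note the assumption: textbooks and software use several slightly different quartile conventions; `statistics.quantiles` (Python 3.8+) is used here with its default "exclusive" method, so other conventions may give slightly different quartiles.

```python
import statistics


def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


def quartile_deviation(values):
    """Q.D. = (Q3 - Q1) / 2, using statistics.quantiles' default method."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / 2


data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 13, 15]   # made-up data, n = 11
with_outlier = data[:-1] + [150]                # replace the maximum by an outlier

print(data_range(data), quartile_deviation(data))                   # 13 4.0
print(data_range(with_outlier), quartile_deviation(with_outlier))   # 148 4.0
```

The single extreme value changes the Range from 13 to 148 while the Q.D. stays at 4.0, because Q₁ and Q₃ do not depend on the largest observation.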
Mean Deviation (M.D.)
The average of the absolute differences between each data point and the mean (or median).
M.D. (about mean) = ( Σ |x - x-bar| ) / n
- Pros: Uses all data values.
- Cons: Taking absolute values discards the signs of the deviations, which makes M.D. awkward for further algebraic treatment.
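A minimal sketch of the M.D. formula with made-up data, computed once about the mean and once about the median. The helper name is hypothetical.

```python
import statistics


def mean_deviation(values, center):
    """Average absolute deviation of the values from `center`."""
    return sum(abs(x - center) for x in values) / len(values)


data = [1, 2, 3, 4, 10]                                     # made-up data
md_mean = mean_deviation(data, statistics.mean(data))       # about the mean (4)
md_median = mean_deviation(data, statistics.median(data))   # about the median (3)

print(md_mean)     # 2.4  -> (3 + 2 + 1 + 0 + 6) / 5
print(md_median)   # 2.2  -> (2 + 1 + 0 + 1 + 7) / 5
```

In this example the deviation about the median is the smaller of the two, which is why the median is often preferred as the center for M.D.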
Variance (σ²) and Standard Deviation (σ)
The most important and widely used measures of dispersion.
- Variance (σ²): The average of the squared differences from the mean. Squaring the deviations avoids the issue of positive and negative deviations canceling out.
  - Population Variance (σ²): σ² = ( Σ(x - μ)² ) / N
  - Sample Variance (s²): s² = ( Σ(x - x-bar)² ) / (n-1) (Note: We divide by n-1 for samples to get an unbiased estimator of σ².)
- Standard Deviation (σ): The square root of the variance.
  - σ = sqrt(Variance)
  - Pros: Has the same units as the original data, making it more interpretable than variance. Uses all data.
  - Cons: Is sensitive to outliers (due to squaring).
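The population and sample formulas differ only in the divisor. The sketch below writes both out by hand on made-up data; the function names are hypothetical, and the standard library's `statistics.pvariance` and `statistics.variance` compute the same two quantities.

```python
import math
import statistics


def population_variance(values):
    """sigma^2 = sum((x - mu)^2) / N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """s^2 = sum((x - x_bar)^2) / (n - 1)."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]    # made-up data, mean = 5

print(population_variance(data))               # 4.0  (32 / 8)
print(math.sqrt(population_variance(data)))    # 2.0  (population SD)
print(sample_variance(data))                   # 32 / 7, slightly larger

# Cross-check against the standard library
print(math.isclose(sample_variance(data), statistics.variance(data)))   # True
```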
2.4 Relative Measures of Dispersion
The measures above (Range, SD) are absolute and are in the same units as the data. To compare the variability of two different datasets (e.g., heights in cm vs. weights in kg), we need relative measures (unit-free coefficients).
Coefficient of Variation (C.V.)
The most important relative measure. It expresses the standard deviation as a percentage of the mean.
C.V. = (Standard Deviation / |Mean|) * 100
Uses of C.V.:
- Comparison: To compare the variability of two or more series with different units or different means.
- Consistency: A series with a lower C.V. is said to be more consistent, more stable, or less variable. A series with a higher C.V. is less consistent or more variable.
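As a sketch of the comparison use case, the snippet below computes the C.V. for made-up heights (cm) and weights (kg). It assumes the population standard deviation (`statistics.pstdev`); the notes do not say which SD to use, and the helper name is hypothetical.

```python
import math
import statistics


def coefficient_of_variation(values):
    """C.V. = (SD / |mean|) * 100, using the population SD (an assumption)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("C.V. is undefined when the mean is 0")
    return statistics.pstdev(values) / abs(mean) * 100


heights_cm = [160, 165, 170, 175, 180]   # made-up data
weights_kg = [50, 60, 70, 80, 90]        # made-up data

cv_h = coefficient_of_variation(heights_cm)
cv_w = coefficient_of_variation(weights_kg)

print(round(cv_h, 2), round(cv_w, 2))
print(cv_h < cv_w)    # True: heights are the more consistent series here
```

Because the C.V. is unit-free, the two series can be compared directly even though one is measured in cm and the other in kg.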
Other Coefficients:
- Coefficient of Range: (Max - Min) / (Max + Min)
- Coefficient of Quartile Deviation: (Q₃ - Q₁) / (Q₃ + Q₁)
- Coefficient of Mean Deviation: (M.D.) / (Mean or Median)
2.5 Moments
Moments are statistical measures that describe the characteristics of a distribution's shape.
Raw Moments (μ'ᵣ)
Moments about the origin (zero). The r-th raw moment is E[Xʳ].
- μ'₁ = (Σfx) / N = Mean (x-bar)
- μ'₂ = (Σfx²) / N
- μ'₃ = (Σfx³) / N
- μ'₄ = (Σfx⁴) / N
Central Moments (μᵣ)
Moments about the mean. The r-th central moment is E[(X - μ)ʳ].
- μ₁ = ( Σf(x - x-bar) ) / N = 0 (by definition of the mean).
- μ₂ = ( Σf(x - x-bar)² ) / N = Variance (σ²)
- μ₃ = ( Σf(x - x-bar)³ ) / N (Measures skewness)
- μ₄ = ( Σf(x - x-bar)⁴ ) / N (Measures kurtosis)
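The raw and central moment formulas above can be sketched for a small, made-up frequency distribution. The function names are hypothetical. The final check uses the standard identity μ₂ = μ'₂ - (μ'₁)², which is not stated in these notes but follows from expanding the square.

```python
import math


def raw_moment(xs, fs, r):
    """r-th raw moment of a frequency distribution: sum(f * x^r) / N."""
    n_total = sum(fs)
    return sum(f * x ** r for x, f in zip(xs, fs)) / n_total


def central_moment(xs, fs, r):
    """r-th central moment: sum(f * (x - mean)^r) / N."""
    n_total = sum(fs)
    mean = raw_moment(xs, fs, 1)
    return sum(f * (x - mean) ** r for x, f in zip(xs, fs)) / n_total


xs = [1, 2, 3, 4, 5]    # made-up values
fs = [1, 2, 4, 2, 1]    # made-up frequencies, N = 10

print(raw_moment(xs, fs, 1))        # 3.0   (the mean)
print(raw_moment(xs, fs, 2))        # 10.2
print(central_moment(xs, fs, 1))    # 0.0
print(central_moment(xs, fs, 2))    # 1.2   (the variance)
print(central_moment(xs, fs, 3))    # 0.0   (symmetric distribution)
print(central_moment(xs, fs, 4))    # 3.6

# Identity linking raw and central moments: mu_2 = mu'_2 - (mu'_1)^2
print(math.isclose(central_moment(xs, fs, 2),
                   raw_moment(xs, fs, 2) - raw_moment(xs, fs, 1) ** 2))   # True
```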
2.6 Measures of Skewness and Kurtosis
Skewness (Asymmetry)
Skewness measures the lack of symmetry in a distribution.
[Image of three distributions: negatively skewed (left tail), symmetric (bell curve), and positively skewed (right tail)]
- Positively Skewed (Right Skew): The tail on the right side is longer. Mean > Median > Mode.
- Negatively Skewed (Left Skew): The tail on the left side is longer. Mean < Median < Mode.
- Symmetric: The distribution is identical on both sides of the center. Mean = Median = Mode.
Measures of Skewness:
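The notes do not list specific measures at this point. As a hedged sketch, one widely used measure (an assumption here, not taken from these notes) is the moment coefficient of skewness, γ₁ = μ₃ / μ₂^(3/2), built from the central moments of section 2.5. It is positive for a right-skewed distribution, negative for a left-skewed one, and 0 for a symmetric one. The data and function names are made up.

```python
def central_moment(values, r):
    """r-th central moment of ungrouped data: sum((x - mean)^r) / n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** r for x in values) / len(values)


def moment_skewness(values):
    """gamma_1 = mu_3 / mu_2^(3/2); an assumed measure, see the note above."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


right_skewed = [1, 2, 2, 3, 10]             # made-up data, long right tail
left_skewed = [-x for x in right_skewed]    # mirror image, long left tail
symmetric = [1, 2, 3, 4, 5]

print(moment_skewness(right_skewed) > 0)    # True
print(moment_skewness(left_skewed) < 0)     # True
print(moment_skewness(symmetric))           # 0.0
```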
Kurtosis (Peakedness)
Kurtosis measures the "tailedness" and "peakedness" of a distribution compared to a Normal (bell-shaped) distribution.
- Mesokurtic: A normal distribution (medium peak, normal tails).
- Leptokurtic: A "skinny" distribution (high peak, fat tails). More outliers.
- Platykurtic: A "flat" distribution (low peak, thin tails). Fewer outliers.
Measure of Kurtosis:
- Moment Coefficient (β₂ "beta-two"):
β₂ = μ₄ / μ₂² = μ₄ / σ⁴
- If β₂ = 3, the curve is Mesokurtic (like the Normal distribution).
- If β₂ > 3, the curve is Leptokurtic (peaked).
- If β₂ < 3, the curve is Platykurtic (flat).
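The β₂ rule can be sketched on ungrouped data as follows. The data, the function names, and the small tolerance used to decide "equal to 3" are all illustrative assumptions; with real data β₂ is almost never exactly 3, so in practice one judges whether it is close to 3.

```python
import math


def beta_2(values):
    """Moment coefficient of kurtosis: mu_4 / mu_2^2."""
    n = len(values)
    mean = sum(values) / n
    mu_2 = sum((x - mean) ** 2 for x in values) / n
    mu_4 = sum((x - mean) ** 4 for x in values) / n
    return mu_4 / mu_2 ** 2


def classify(b2, tol=1e-9):
    """Label a beta_2 value by comparing it with 3 (tolerance is arbitrary)."""
    if abs(b2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if b2 > 3 else "platykurtic"


flat = [1, 2, 3, 4, 5]               # evenly spread, made-up data
heavy_tailed = [0] * 8 + [-5, 5]     # mostly at the center plus two extremes

print(beta_2(flat), classify(beta_2(flat)))                   # 1.7 platykurtic
print(beta_2(heavy_tailed), classify(beta_2(heavy_tailed)))   # 5.0 leptokurtic
```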