DSC-152 LAB: Descriptive Statistics and Probability Distributions
Table of Contents
- 1. Course Details
- 2. Learning Objectives
- 3. Learning Outcomes
- 4. List of Practicals (with Notes)
- Practical 1: Graphical Representation
- Practical 2: Measures of Central Tendency
- Practical 3: Measures of Dispersion
- Practical 4: Combined Mean and Variance
- Practical 5: Moments, Skewness & Kurtosis
- Practical 6: Fitting of Curves
- Practical 7: Karl Pearson's Correlation
- Practical 8: Bivariate Correlation
- Practical 9: Lines of Regression
- Practical 10: Spearman Rank Correlation
- Practicals 11-14: Fitting of Discrete Distributions
- Practicals 15-16: Applications & Fitting of Normal Distribution
1. Course Details
- Course Code: DSC-152 LAB
- Title: Descriptive Statistics and Probability Distributions
- Credits: 03
- Contact Hours: 90 Hours
- Full Marks: 100 (End Semester Exam: 70, Internal: 30)
- Pass Marks: 40 (End Semester Exam: 28, Internal: 12)
2. Learning Objectives
The main goals of this practical course are:
- To develop skills in the graphical representation of data.
- To compute measures of central tendency, dispersion, moments, and correlation coefficients.
- To gain proficiency in fitting curves, such as polynomials and exponential curves, to data.
3. Learning Outcomes
After successfully completing this lab course, you will be able to:
- Interpret graphs of visually represented data.
- Interpret measures of central tendency, dispersion, moments, and correlation coefficients.
- Identify the best-fitted model (curve or distribution) for a given set of data.
4. List of Practicals (with Notes)
This section details the 16 required practicals for the course.
Practical 1: Graphical representation of data.
Focus: This practical involves creating various statistical graphs to visualize data distributions.
Charts to Master:
- Histogram: Used for continuous frequency distributions. Bars are adjacent. Class intervals may be equal or unequal; with unequal intervals, plot frequency density (frequency / class width) instead of frequency.
- Frequency Polygon: A line graph connecting the midpoints of the tops of histogram bars. Can also be drawn without a histogram.
- Bar Diagram (or Bar Chart): Used for discrete or categorical data. Bars are separated.
- Pie Chart: Shows the proportion of different categories in a circle. Calculate the angle for each component as (Component Value / Total Value) * 360°.
- Ogive Curves:
- Less Than Ogive: Plots cumulative frequency (less than type) against the upper class boundary. It is a rising curve.
- More Than Ogive: Plots cumulative frequency (more than type) against the lower class boundary. It is a falling curve.
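The pie-chart angles and the two ogives can be tabulated before any drawing is done. The sketch below is a minimal illustration in Python; the category values, class boundaries, and frequencies are invented for the example and are not course data.

```python
# Hypothetical data, invented purely for illustration.
categories = {"Food": 40, "Rent": 30, "Travel": 20, "Other": 10}

total = sum(categories.values())
# Angle for each sector: (component value / total value) * 360 degrees.
angles = {name: value / total * 360 for name, value in categories.items()}

# Hypothetical grouped data: classes 0-10, 10-20, 20-30, 30-40.
boundaries = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 5]

# Less-than cumulative frequencies, one per upper class boundary.
less_than = []
running = 0
for f in freqs:
    running += f
    less_than.append(running)

# More-than cumulative frequencies, one per lower class boundary.
n_total = sum(freqs)
more_than = [n_total - c + f for c, f in zip(less_than, freqs)]

# Points to plot: (boundary, cumulative frequency).
less_than_points = list(zip(boundaries[1:], less_than))
more_than_points = list(zip(boundaries[:-1], more_than))

print(angles)
print(less_than_points)
print(more_than_points)
```

The less-than points rise from 5 to the total frequency 30, while the more-than points fall from 30 to 5, matching the rising and falling shapes described above.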
Practical 2: Problems based on measures of central tendency.
Focus: Calculating the "center" or "average" of a dataset using different methods.
Formulas to Apply:
- Arithmetic Mean (μ or x-bar):
- Ungrouped Data: Σx / n
- Grouped Data: Σ(f * x) / N, where x = midpoint and N = Σf
- Median:
- Ungrouped Data: The ((n+1)/2)-th value of the sorted data. When n is even, take the average of the (n/2)-th and (n/2 + 1)-th values.
- Grouped Data: L + [ (N/2 - C) / f ] * h
- L = Lower boundary of median class
- N = Total frequency
- C = Cumulative frequency of the class before the median class
- f = Frequency of the median class
- h = Class width
- Mode:
- Ungrouped Data: The most frequent value.
- Grouped Data: L + [ (f₁ - f₀) / (2f₁ - f₀ - f₂) ] * h
- L = Lower boundary of modal class
- f₁ = Frequency of modal class
- f₀ = Frequency of class before modal class
- f₂ = Frequency of class after modal class
- Geometric Mean (G.M.) and Harmonic Mean (H.M.) for ungrouped data: G.M. = (x₁ * x₂ * ... * xₙ)^(1/n) and H.M. = n / Σ(1/x), both for positive values of x.
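The grouped-data formulas above can be checked on a small table. This is a minimal sketch, assuming equal class widths; the class boundaries, frequencies, and the ungrouped values are invented for illustration.

```python
import statistics

# Hypothetical grouped data: classes 0-10, 10-20, ..., 40-50.
lower = [0, 10, 20, 30, 40]   # lower class boundaries
h = 10                        # common class width
freqs = [5, 8, 12, 10, 5]
N = sum(freqs)

# Mean: sum(f * x) / N, with x the class midpoint.
mids = [lo + h / 2 for lo in lower]
mean = sum(f * x for f, x in zip(freqs, mids)) / N

# Median: L + ((N/2 - C) / f) * h, using the first class whose
# cumulative frequency reaches N/2.
cum = 0
for lo, f in zip(lower, freqs):
    if cum + f >= N / 2:
        median = lo + (N / 2 - cum) / f * h
        break
    cum += f

# Mode: L + ((f1 - f0) / (2*f1 - f0 - f2)) * h for the modal class.
m = freqs.index(max(freqs))
f1 = freqs[m]
f0 = freqs[m - 1] if m > 0 else 0
f2 = freqs[m + 1] if m < len(freqs) - 1 else 0
mode = lower[m] + (f1 - f0) / (2 * f1 - f0 - f2) * h

# Ungrouped G.M. and H.M. from the standard library (Python 3.8+).
values = [2, 4, 8]
gm = statistics.geometric_mean(values)
hm = statistics.harmonic_mean(values)

print(mean, median, mode, gm, hm)
```

For this table the median class and the modal class are both 20-30, so L = 20 in both formulas.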
Practical 3: Problems based on measures of dispersion.
Focus: Calculating the "spread" or "variability" of a dataset.
Formulas to Apply:
- Range: Highest Value - Lowest Value
- Quartile Deviation (Semi-Interquartile Range): (Q₃ - Q₁) / 2
- First calculate the first quartile (Q₁) and third quartile (Q₃) with the median formula, replacing N/2 by N/4 for Q₁ and by 3N/4 for Q₃, and using the corresponding quartile class.
- Mean Deviation:
- About Mean: Σ |x - mean| / n
- About Median: Σ |x - median| / n
- Standard Deviation (σ): The square root of Variance.
- Variance (σ²):
- Ungrouped Data: [ Σ(x - mean)² ] / n OR [ Σx² / n ] - (mean)²
- Grouped Data: [ Σf(x - mean)² ] / N OR [ Σ(f * x²) / N ] - (mean)²
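The dispersion measures above can be computed side by side to confirm that the two variance formulas agree. The sketch below is an illustration only: the data are invented, and `grouped_quantile` is a hypothetical helper that applies the median-style formula with N/2 replaced by p * N.

```python
import math

# Hypothetical ungrouped data, invented for illustration.
data = sorted([2, 4, 4, 4, 5, 5, 7, 9])
n = len(data)
mean = sum(data) / n
# Median for even n: average of the two middle values.
median = (data[n // 2 - 1] + data[n // 2]) / 2 if n % 2 == 0 else data[n // 2]

value_range = data[-1] - data[0]
md_mean = sum(abs(x - mean) for x in data) / n
md_median = sum(abs(x - median) for x in data) / n

# Variance by the definition and by the shortcut formula.
var_direct = sum((x - mean) ** 2 for x in data) / n
var_short = sum(x * x for x in data) / n - mean ** 2
sd = math.sqrt(var_direct)


def grouped_quantile(lower, freqs, h, p):
    """Median-style formula with N/2 replaced by p * N (hypothetical helper)."""
    N = sum(freqs)
    cum = 0
    for lo, f in zip(lower, freqs):
        if cum + f >= p * N:
            return lo + (p * N - cum) / f * h
        cum += f
    raise ValueError("p must lie between 0 and 1")


# Hypothetical grouped data: classes 0-10, 10-20, ..., 40-50.
lower = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 10, 5]
q1 = grouped_quantile(lower, freqs, 10, 0.25)
q3 = grouped_quantile(lower, freqs, 10, 0.75)
quartile_deviation = (q3 - q1) / 2

print(value_range, md_mean, md_median, var_direct, var_short, sd)
print(q1, q3, quartile_deviation)
```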
Practical 4: Problems based on combined mean and variance and coefficient of variation.
Focus: Combining statistics from two or more groups and comparing their variability.
Formulas to Apply:
- Combined Mean (x-bar₁₂): x-bar₁₂ = (n₁ * x-bar₁ + n₂ * x-bar₂) / (n₁ + n₂)
- Combined Variance (σ²₁₂): σ²₁₂ = [ n₁(σ₁² + d₁²) + n₂(σ₂² + d₂²) ] / (n₁ + n₂)
- d₁ = x-bar₁ - x-bar₁₂
- d₂ = x-bar₂ - x-bar₁₂
- Coefficient of Variation (C.V.): C.V. = (σ / |mean|) * 100
- This is a relative measure of dispersion, expressed as a percentage. It is used to compare the variability of two datasets, even if their means are different.
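A short numerical check of the combined formulas may help. The sketch below uses invented summary statistics for two groups; the function name `coefficient_of_variation` is a hypothetical helper, not part of any library.

```python
import math

# Hypothetical summary statistics for two groups, invented for illustration.
n1, mean1, var1 = 40, 50.0, 16.0
n2, mean2, var2 = 60, 60.0, 25.0

# Combined mean: weighted average of the group means.
combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)

# Deviations of each group mean from the combined mean.
d1 = mean1 - combined_mean
d2 = mean2 - combined_mean

# Combined variance: [n1(var1 + d1^2) + n2(var2 + d2^2)] / (n1 + n2).
combined_var = (n1 * (var1 + d1 ** 2) + n2 * (var2 + d2 ** 2)) / (n1 + n2)


def coefficient_of_variation(sd, mean):
    """C.V. = (sd / |mean|) * 100, a percentage."""
    return sd / abs(mean) * 100


cv1 = coefficient_of_variation(math.sqrt(var1), mean1)
cv2 = coefficient_of_variation(math.sqrt(var2), mean2)

print(combined_mean, combined_var, cv1, cv2)
```

Here the combined mean is 56 and the combined variance is 45.4, which is larger than either group variance because the group means differ. The second group has the higher C.V., so it is relatively more variable.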
Practical 5: Problems based on moments, skewness and kurtosis.
Focus: Describing the shape of the distribution.
Formulas to Apply:
- Raw Moments (μ'ᵣ): (Moments about origin)
- In general, μ'ᵣ = Σ(f * xʳ) / N
- μ'₁ = Σ(f * x) / N = Mean
- μ'₂ = Σ(f * x²) / N
- Central Moments (μᵣ): (Moments about mean)
- μ₁ = 0 (always)
- μ₂ = μ'₂ - (μ'₁)² = Variance
- μ₃ = μ'₃ - 3μ'₂μ'₁ + 2(μ'₁)³
- μ₄ = μ'₄ - 4μ'₃μ'₁ + 6μ'₂(μ'₁)² - 3(μ'₁)⁴
- Karl Pearson's Coefficient of Skewness (Sk): Sk = (Mean - Mode) / σ OR Sk = 3 * (Mean - Median) / σ
- If Sk > 0, positively skewed (right tail).
- If Sk < 0, negatively skewed (left tail).
- If Sk = 0, symmetric.
- Moment Coefficient of Skewness (β₁): β₁ = μ₃² / μ₂³
- β₁ = 0 for a symmetric distribution. Because β₁ is never negative, read the direction of skewness from the sign of μ₃.
- Moment Coefficient of Kurtosis (β₂): β₂ = μ₄ / μ₂²
- If β₂ = 3, Mesokurtic (Normal curve).
- If β₂ > 3, Leptokurtic (More peaked than normal).
- If β₂ < 3, Platykurtic (Flatter than normal).
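The conversion from raw to central moments can be traced on a small symmetric distribution. This is a minimal sketch with invented values and frequencies; `raw_moment` is a hypothetical helper.

```python
# Hypothetical frequency distribution, invented for illustration.
xs = [1, 2, 3, 4, 5]
fs = [1, 4, 6, 4, 1]
N = sum(fs)


def raw_moment(r):
    """r-th moment about the origin: sum(f * x**r) / N."""
    return sum(f * x ** r for f, x in zip(fs, xs)) / N


m1, m2, m3, m4 = (raw_moment(r) for r in (1, 2, 3, 4))

# Central moments from raw moments, using the relations listed above.
mu2 = m2 - m1 ** 2
mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4

beta1 = mu3 ** 2 / mu2 ** 3   # moment coefficient of skewness
beta2 = mu4 / mu2 ** 2        # moment coefficient of kurtosis

print(mu2, mu3, mu4, beta1, beta2)
```

For this data μ₃ = 0, so β₁ = 0 (symmetric), and β₂ = 2.5, which is below 3 (platykurtic).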
Practical 6: Fitting of polynomials, exponential curves.
Focus: Using the Principle of Least Squares to find the "best fit" curve for a set of (x, y) data points.
Curves to Fit:
- Straight Line (Polynomial of degree 1): y = a + bx
- Normal Equations:
Σy = na + bΣx
Σxy = aΣx + bΣx²
- Solve this 2x2 system of equations for 'a' and 'b'.
- Parabola (Polynomial of degree 2): y = a + bx + cx²
- Normal Equations:
Σy = na + bΣx + cΣx²
Σxy = aΣx + bΣx² + cΣx³
Σx²y = aΣx² + bΣx³ + cΣx⁴
- Solve this 3x3 system for 'a', 'b', and 'c'.
- Exponential Curve: y = abˣ
- Transform first: Take log on both sides.
log(y) = log(a) + x * log(b)
- Let Y = log(y), A = log(a), B = log(b).
- The equation becomes Y = A + Bx, which is a straight line.
- Fit this line using the normal equations for a straight line:
ΣY = nA + BΣx
ΣxY = AΣx + BΣx²
- Solve for A and B, then find the original parameters: a = antilog(A) and b = antilog(B).
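The straight-line and exponential fits can be sketched in a few lines. The code below is an illustration under stated assumptions: the data points are invented, `fit_line` and `fit_exponential` are hypothetical helper names, and base-10 logs are assumed for the transformation (any base works as long as the antilog uses the same base). The parabola is not covered here; it needs a 3x3 solver.

```python
import math


def fit_line(xs, ys):
    """Solve the two normal equations for y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Eliminating 'a' from the normal equations gives b, then a follows.
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b


def fit_exponential(xs, ys):
    """Fit y = a * b**x via a line through (x, log10 y); needs every y > 0."""
    A, B = fit_line(xs, [math.log10(y) for y in ys])
    return 10 ** A, 10 ** B   # antilog of A and B


# Hypothetical data for the straight line.
a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Hypothetical data lying exactly on y = 2 * 3**x.
a_exp, b_exp = fit_exponential([0, 1, 2, 3], [2, 6, 18, 54])

print(a, b)
print(a_exp, b_exp)
```

For the first dataset the fitted line is y = 2.2 + 0.6x. For the second, the fit recovers a = 2 and b = 3 up to rounding, because the points lie exactly on the curve.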
Practical 7: Karl Pearson's correlation coefficient.
Focus: Calculating the strength and direction of the linear relationship between two quantitative variables (x, y).
Formula (r): r = Cov(x, y) / (σₓ * σᵧ)
Computational Formula: r = [ nΣxy - (Σx)(Σy) ] / √{ [ nΣx² - (Σx)² ] * [ nΣy² - (Σy)² ] }
Properties of 'r':
- -1 ≤ r ≤ +1
- r = +1: Perfect positive linear correlation.
- r = -1: Perfect negative linear correlation.
- r = 0: No linear correlation.
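As a quick check of these properties, r can be computed from the raw sums Σx, Σy, Σxy, Σx², and Σy². The sketch below uses the standard sum-based form of Karl Pearson's coefficient; the data are invented and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """r = [n*Sxy - Sx*Sy] / sqrt([n*Sxx - Sx^2] * [n*Syy - Sy^2])."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )


# Hypothetical paired data, invented for illustration.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Exactly linear data give r = +1 or r = -1.
r_up = pearson_r([1, 2, 3], [10, 20, 30])
r_down = pearson_r([1, 2, 3], [30, 20, 10])

print(r, r_up, r_down)
```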
Practical 8: Correlation coefficient for a bivariate frequency distribution.
Focus: Calculating Karl Pearson's 'r' when the data is given in a bivariate frequency table (a "correlation table").
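Formula: r = [ NΣ(f * u * v) - (Σfᵤu)(Σfᵥv) ] / √{ [ NΣfᵤu² - (Σfᵤu)² ] * [ NΣfᵥv² - (Σfᵥv)² ] }
- This is the same formula as Practical 7, but using coded values (u, v) and frequencies (f), with N the total frequency. Since 'r' is unaffected by a change of origin and scale, the value obtained from (u, v) is also the correlation between x and y.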
- u = (x - A) / h (step-deviation for x)
- v = (y - B) / k (step-deviation for y)
- fᵤ and fᵥ are the marginal frequencies.
- Σ(f * u * v) is the sum of the product of frequencies and their respective u,v values from the cells of the table.
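The bookkeeping for a correlation table is easier to follow on a tiny example. The sketch below is an illustration only: the 2 x 3 table, the assumed origins A and B, and the widths h and k are all invented, and the formula used is the frequency-weighted version of the sum-based formula for 'r'.

```python
import math

# Hypothetical correlation table, invented for illustration.
x_mids = [10, 20, 30]          # class midpoints of x (columns)
y_mids = [5, 15]               # class midpoints of y (rows)
table = [
    [4, 2, 0],                 # cell frequencies for y = 5
    [0, 2, 4],                 # cell frequencies for y = 15
]
A, h = 20, 10                  # assumed origin and width for x
B, k = 5, 10                   # assumed origin and width for y

u = [(x - A) / h for x in x_mids]
v = [(y - B) / k for y in y_mids]

# Marginal frequencies: column totals for u, row totals for v.
f_u = [sum(row[j] for row in table) for j in range(len(x_mids))]
f_v = [sum(row) for row in table]
N = sum(f_v)

sum_fu = sum(f * ui for f, ui in zip(f_u, u))
sum_fu2 = sum(f * ui ** 2 for f, ui in zip(f_u, u))
sum_fv = sum(f * vi for f, vi in zip(f_v, v))
sum_fv2 = sum(f * vi ** 2 for f, vi in zip(f_v, v))

# Sum over cells of (cell frequency) * u * v.
sum_fuv = sum(
    table[i][j] * u[j] * v[i]
    for i in range(len(v))
    for j in range(len(u))
)

r = (N * sum_fuv - sum_fu * sum_fv) / math.sqrt(
    (N * sum_fu2 - sum_fu ** 2) * (N * sum_fv2 - sum_fv ** 2)
)
print(r)
```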
Practical 9: Fitting of lines of regression.
Focus: Finding the "best fit" straight line (y = a + bx) to predict one variable from another.
Two Lines of Regression:
- Regression Line of Y on X: (Used to predict Y if X is known) (y - y-bar) = bᵧₓ * (x - x-bar)
- bᵧₓ is the regression coefficient of Y on X.
- bᵧₓ = Cov(x, y) / σₓ² = r * (σᵧ / σₓ)
- Regression Line of X on Y: (Used to predict X if Y is known) (x - x-bar) = bₓᵧ * (y - y-bar)
- bₓᵧ is the regression coefficient of X on Y.
- bₓᵧ = Cov(x, y) / σᵧ² = r * (σₓ / σᵧ)
- The two lines intersect at the point (x-bar, y-bar).
- r² = bᵧₓ * bₓᵧ, so r = ±√(bᵧₓ * bₓᵧ): 'r' is the geometric mean of the two coefficients, taken with their sign.
- Both coefficients must have the same sign.
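Both regression lines can be obtained from the same set of sums. This is a minimal sketch with invented data; `predict_y` and `predict_x` are hypothetical helper names for the two lines.

```python
import math

# Hypothetical paired data, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
cov = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
var_x = sum(x * x for x in xs) / n - x_bar ** 2
var_y = sum(y * y for y in ys) / n - y_bar ** 2

b_yx = cov / var_x   # regression coefficient of Y on X
b_xy = cov / var_y   # regression coefficient of X on Y


def predict_y(x):
    """Line of Y on X: y - y_bar = b_yx * (x - x_bar)."""
    return y_bar + b_yx * (x - x_bar)


def predict_x(y):
    """Line of X on Y: x - x_bar = b_xy * (y - y_bar)."""
    return x_bar + b_xy * (y - y_bar)


# r is the geometric mean of the coefficients, with their common sign.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

print(b_yx, b_xy, r)
print(predict_y(6), predict_x(6))
```

Both lines pass through (x-bar, y-bar) = (3, 4), which is easy to confirm by calling `predict_y(3)` and `predict_x(4)`.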
Practical 10: Spearman rank correlation with and without ties.
Focus: Calculating the correlation between two variables when the data is ranked (ordinal). It measures the strength of a monotonic relationship (not just linear).
Formulas:
- Case 1: No Ties in Ranks R = 1 - [ (6 * Σd²) / (n³ - n) ]
- d = R₁ - R₂ (Difference between the ranks for each observation)
- n = Number of pairs
- Case 2: With Ties in Ranks
- If a tie occurs, assign the average rank to all tied items.
- Calculate Correction Factors (C.F.) for each tie:
C.F. = (m³ - m) / 12, where 'm' is the number of items in a tie.
- Calculate Σ(C.F.ₓ) for all ties in the x-variable and Σ(C.F.ᵧ) for all ties in the y-variable.
- Use the general formula (which is just Karl Pearson's 'r' applied to ranks).
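Both cases can be sketched in code. The block below is an illustration under stated assumptions: the data are invented, `average_ranks`, `spearman_no_ties`, and `pearson_r` are hypothetical helpers, and the tied case is handled by applying Karl Pearson's 'r' directly to average ranks, as the last step describes, rather than through the correction factors.

```python
import math


def average_ranks(values):
    """Rank values in ascending order, giving tied items their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1   # positions are 0-based, ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman_no_ties(r1, r2):
    """R = 1 - 6 * sum(d^2) / (n^3 - n), valid when no ranks are tied."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


def pearson_r(xs, ys):
    """Karl Pearson's r from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Case 1: hypothetical ranks with no ties.
r_no_ties = spearman_no_ties([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])

# Case 2: hypothetical scores with one tie in x (the two 20s).
x_ranks = average_ranks([10, 20, 20, 30])
y_ranks = average_ranks([1, 2, 3, 4])
r_ties = pearson_r(x_ranks, y_ranks)

print(r_no_ties, x_ranks, r_ties)
```

In the tied example the two items scoring 20 would occupy ranks 2 and 3, so each receives the average rank 2.5.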
Practicals 11-14: Fitting of Discrete Distributions
(Practical 11: Binomial, 12: Poisson, 13: Negative Binomial, 14: Suitable distribution)
Focus: Given an observed frequency distribution (Oᵢ), find the expected frequencies (Eᵢ) according to a theoretical distribution (e.g., Binomial, Poisson).
General Procedure:
- Estimate Parameters:
- Binomial (n, p): 'n' is usually given. Estimate 'p' by setting the observed mean (x-bar) equal to the theoretical mean (np).
x-bar = np => p = x-bar / n
- Poisson (λ): Estimate 'λ' by setting the observed mean (x-bar) equal to the theoretical mean (λ).
λ = x-bar
- Calculate Probabilities:
- Binomial: Use the p.m.f. P(x) = C(n, x)pˣ(1-p)ⁿ⁻ˣ to find P(0), P(1), P(2), ... P(n).
- Poisson: Use the p.m.f. P(x) = (e^(-λ) * λˣ) / x! to find P(0), P(1), P(2), ...
- Tip for Poisson: Use the recurrence relation: P(x) = P(x-1) * (λ / x). First, calculate P(0) = e^(-λ), then P(1) = P(0)*(λ/1), P(2) = P(1)*(λ/2), etc.
- Calculate Expected Frequencies (Eᵢ): Eᵢ = N * P(x)
- Where N is the total observed frequency (N = ΣOᵢ).
- Compare: Create a table of Observed Frequencies (Oᵢ) and Expected Frequencies (Eᵢ) to see how good the fit is. (Later, a Chi-Square Goodness-of-Fit test is used).
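- Choosing a Suitable Distribution (Practical 14): Calculate the observed mean (x-bar) and variance (s²), then compare them.
- If x-bar ≈ s², a Poisson distribution is likely a good fit.
- If s² < x-bar, a Binomial distribution is likely a good fit.
- If s² > x-bar, a Negative Binomial distribution is likely a good fit.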
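The fitting procedure can be sketched for the Poisson and Binomial cases. The observed frequencies below are invented for illustration, and n = 4 for the Binomial is an assumption made only so that the example has a value of 'n' to work with.

```python
import math

# Hypothetical observed frequencies for x = 0, 1, 2, 3, 4.
xs = [0, 1, 2, 3, 4]
observed = [110, 65, 20, 4, 1]
N = sum(observed)

# Observed mean and variance.
x_bar = sum(f * x for f, x in zip(observed, xs)) / N
s2 = sum(f * x * x for f, x in zip(observed, xs)) / N - x_bar ** 2

# Poisson: lambda = x-bar, then the recurrence P(x) = P(x-1) * lambda / x,
# starting from P(0) = e^(-lambda).
lam = x_bar
poisson_probs = [math.exp(-lam)]
for x in xs[1:]:
    poisson_probs.append(poisson_probs[-1] * lam / x)
poisson_expected = [N * p for p in poisson_probs]

# Binomial with n assumed given (n = 4 here): p = x-bar / n.
n = 4
p = x_bar / n
binomial_probs = [math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in xs]
binomial_expected = [N * q for q in binomial_probs]

print("mean", x_bar, "variance", s2)
for x, o, e_pois, e_bin in zip(xs, observed, poisson_expected, binomial_expected):
    print(x, o, round(e_pois, 2), round(e_bin, 2))
```

For this data the mean (0.605) and the variance (about 0.619) are close, which points towards the Poisson fit. The Poisson expected frequencies for x = 0 to 4 sum to slightly less than N because the distribution also assigns probability to x ≥ 5.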
Practicals 15-16: Applications & Fitting of Normal Distribution
(Practical 15: Applications of Normal distribution, 16: Fitting of Normal distribution)
Focus: Using the properties of the Normal curve to find probabilities and fit it to data.
Practical 15: Applications
- This involves word problems like "Given X ~ N(μ, σ²), find P(a < X < b)."
- Procedure:
- Standardize the values 'a' and 'b' using the Z-formula:
Z = (X - μ) / σ
- Z₁ = (a - μ) / σ
- Z₂ = (b - μ) / σ
- Find P(Z₁ < Z < Z₂) by looking up the areas for Z₁ and Z₂ in the Standard Normal (Z) table and subtracting them.
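The standardize-and-subtract procedure for P(a < X < b) can be checked numerically. In the sketch below the Z-table lookup is replaced by the error function from Python's `math` module, which gives the same cumulative area; the parameters μ = 50 and σ = 10 are invented, and `phi` and `normal_prob` are hypothetical helper names.

```python
import math


def phi(z):
    """Standard normal cumulative area P(Z <= z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def normal_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2)."""
    z1 = (a - mu) / sigma   # standardize the lower limit
    z2 = (b - mu) / sigma   # standardize the upper limit
    return phi(z2) - phi(z1)


# Hypothetical problem: X ~ N(50, 10^2), find P(40 < X < 60).
p = normal_prob(40, 60, mu=50, sigma=10)
print(round(p, 4))
```

Here Z₁ = -1 and Z₂ = 1, so the result is the familiar area of about 0.6827 within one standard deviation of the mean.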
Practical 16: Fitting
- This is similar to fitting discrete distributions, but for continuous data.
- Procedure:
- Calculate the observed mean (x-bar) and standard deviation (σ) from the given frequency distribution. These are your estimates for μ and σ.
- For each class boundary (x), calculate the Z-score:
Z = (x - μ) / σ
- Using the Z-table, find the cumulative area from -∞ up to each Z-score (P(Z ≤ z)).
- Find the area (probability) within each class interval by subtracting the cumulative areas of its boundaries.
P(a < X < b) = P(Z < Z₂) - P(Z < Z₁)
- Calculate the Expected Frequency for that class:
Eᵢ = N * P(a < X < b)
- Compare the Oᵢ and Eᵢ values.
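Fitting a Normal distribution to a grouped table follows the same pattern as the probability calculation: estimate μ and σ, find the cumulative area at each class boundary, and difference the areas. The sketch below is an illustration only; the class boundaries and observed frequencies are invented, and `phi` is a hypothetical helper that stands in for the Z-table.

```python
import math


def phi(z):
    """Standard normal cumulative area P(Z <= z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Hypothetical grouped data: classes 0-10, 10-20, ..., 40-50.
boundaries = [0, 10, 20, 30, 40, 50]
observed = [5, 8, 12, 10, 5]
N = sum(observed)

# Estimate mu and sigma from the class midpoints.
mids = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]
mean = sum(f * x for f, x in zip(observed, mids)) / N
variance = sum(f * x * x for f, x in zip(observed, mids)) / N - mean ** 2
sigma = math.sqrt(variance)

# Cumulative area up to each class boundary.
cumulative = [phi((b - mean) / sigma) for b in boundaries]

# Expected frequency per class: N * (area between its two boundaries).
expected = [N * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]

for o, e in zip(observed, expected):
    print(o, round(e, 2))
```

The expected frequencies sum to less than N here because some of the Normal curve's area lies below 0 and above 50, outside the tabulated classes.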