DSC-152 LAB: Descriptive Statistics and Probability Distributions

1. Course Details
2. Learning Objectives
3. Learning Outcomes
4. List of Practicals (with Notes)

1. Course Details

Course Code: DSC-152 LAB
Title: Descriptive Statistics and Probability Distributions
Credits: 03
Contact Hours: 90 Hours
Full Marks: 100
- End Semester Exam: 70
- Internal: 30
Pass Marks: 40
- End Semester Exam: 28
- Internal: 12

2. Learning Objectives

The main goals of this practical course are:

To develop skills in the graphical representation of data.
To compute measures of central tendency, dispersion, moments, and correlation coefficients.
To gain proficiency in fitting curves, such as polynomials and exponential curves, to data.

3. Learning Outcomes

After successfully completing this lab course, you will be able to:

Interpret graphical representations of data.
Interpret measures of central tendency, dispersion, moments, and correlation coefficients.
Identify the best-fitted model (curve or distribution) for a given set of data.

4. List of Practicals (with Notes)

This section details the 16 required practicals for the course.

Practical 1: Graphical representation of data.

Focus: This practical involves creating various statistical graphs to visualize data distributions.

Charts to Master:

Histogram: Used for continuous frequency distributions. Bars are adjacent. Class intervals may be equal or unequal; with unequal intervals, plot frequency density (frequency / class width) instead of frequency.
Frequency Polygon: A line graph connecting the midpoints of the tops of histogram bars. Can also be drawn without a histogram.
Bar Diagram (or Bar Chart): Used for discrete or categorical data. Bars are separated.
Pie Chart: Shows the proportion of different categories in a circle. Calculate the angle for each component as (Component Value / Total Value) * 360°.
Ogive Curves:
- Less Than Ogive: Plots cumulative frequency (less than type) against the upper class boundary. It is a rising curve.
- More Than Ogive: Plots cumulative frequency (more than type) against the lower class boundary. It is a falling curve.

Exam Tip: The x-coordinate of the point where the "Less Than" and "More Than" Ogive curves intersect is the Median of the distribution.
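The pie-chart angle rule and the two ogive tables can be sketched in a few lines of Python. The category totals and the grouped frequencies here are hypothetical, chosen only to illustrate the arithmetic.

```python
from itertools import accumulate

# Hypothetical category totals for a pie chart (illustrative data only).
expenses = {"Food": 40, "Rent": 30, "Travel": 20, "Other": 10}
total = sum(expenses.values())

# Angle for each sector = (component value / total value) * 360 degrees.
angles = {name: value * 360 / total for name, value in expenses.items()}

# Hypothetical grouped data with class boundaries 0-10, 10-20, 20-30, 30-40.
boundaries = [0, 10, 20, 30, 40]
freqs = [5, 8, 12, 5]
n_total = sum(freqs)

# Less-than ogive: cumulative frequency against each upper class boundary.
less_than = list(zip(boundaries[1:], accumulate(freqs)))

# More-than ogive: frequency still remaining at each lower class boundary.
below = [0, *accumulate(freqs)]
more_than = [(lb, n_total - b) for lb, b in zip(boundaries[:-1], below)]
```

Plotting `less_than` and `more_than` as (x, y) points gives the rising and falling curves described above.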

Practical 2: Problems based on measures of central tendency.

Focus: Calculating the "center" or "average" of a dataset using different methods.

Formulas to Apply:

Arithmetic Mean (μ or x-bar):
- Ungrouped Data: Σx / n
- Grouped Data: Σ(f * x) / N, where x = midpoint and N = Σf
Median:
- Ungrouped Data: The (n+1)/2-th value of the sorted data. When n is even, this is the average of the two middle values.
- Grouped Data: L + [ (N/2 - C) / f ] * h
  - L = Lower boundary of median class
  - N = Total frequency
  - C = Cumulative frequency of the class before the median class
  - f = Frequency of the median class
  - h = Class width
Mode:
- Ungrouped Data: The most frequent value.
- Grouped Data: L + [ (f₁ - f₀) / (2f₁ - f₀ - f₂) ] * h
  - L = Lower boundary of modal class
  - f₁ = Frequency of modal class
  - f₀ = Frequency of class before modal class
  - f₂ = Frequency of class after modal class
Geometric Mean (G.M.) and Harmonic Mean (H.M.) for ungrouped data:
- G.M. = (x₁ * x₂ * ... * xₙ)^(1/n), or equivalently antilog( Σ log(x) / n )
- H.M. = n / Σ(1/x)
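A minimal sketch of the grouped-data formulas for mean, median, and mode follows. The class boundaries and frequencies are hypothetical, and treating a missing neighbour of the modal class as frequency 0 is an assumption of this sketch.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency).
classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 10), (40, 50, 5)]
N = sum(f for _, _, f in classes)

# Mean: sum of f * midpoint, divided by N.
mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / N

# Median: the median class is the first whose cumulative frequency reaches N/2.
cum = 0
for lo, hi, f in classes:
    if cum + f >= N / 2:
        median = lo + (N / 2 - cum) / f * (hi - lo)
        break
    cum += f

# Mode: the modal class has the highest frequency; f0 and f2 are its neighbours
# (assumed 0 when the modal class is the first or last class).
i = max(range(len(classes)), key=lambda k: classes[k][2])
lo, hi, f1 = classes[i]
f0 = classes[i - 1][2] if i > 0 else 0
f2 = classes[i + 1][2] if i < len(classes) - 1 else 0
mode = lo + (f1 - f0) / (2 * f1 - f0 - f2) * (hi - lo)
```

For this data the median class and the modal class are both 20-30, so L = 20 and h = 10 in both formulas.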

Practical 3: Problems based on measures of dispersion.

Focus: Calculating the "spread" or "variability" of a dataset.

Formulas to Apply:

Range: Highest Value - Lowest Value
Quartile Deviation (Semi-Interquartile Range): (Q₃ - Q₁) / 2
- You must first calculate the first quartile (Q₁) and third quartile (Q₃) using a method similar to the median.
Mean Deviation:
- About Mean: Σ |x - mean| / n
- About Median: Σ |x - median| / n
Standard Deviation (σ): The square root of Variance.
Variance (σ²):
- Ungrouped Data: [ Σ(x - mean)² ] / n OR [ Σx² / n ] - (mean)²
- Grouped Data: [ Σf(x - mean)² ] / N OR [ Σ(f * x²) / N ] - (mean)²
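The ungrouped dispersion formulas can be checked against each other with a short script. The observations are hypothetical. Quartile deviation is left out because the quartile convention varies between textbooks.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical ungrouped observations
n = len(data)
mean = sum(data) / n
median = statistics.median(data)

data_range = max(data) - min(data)
md_mean = sum(abs(x - mean) for x in data) / n      # mean deviation about mean
md_median = sum(abs(x - median) for x in data) / n  # mean deviation about median

# Two equivalent forms of the variance, both dividing by n.
var_definition = sum((x - mean) ** 2 for x in data) / n
var_shortcut = sum(x * x for x in data) / n - mean ** 2
sd = math.sqrt(var_definition)
```

Both variance forms divide by n, so they match `statistics.pvariance` rather than `statistics.variance`, which divides by n - 1.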

Practical 4: Problems based on combined mean and variance and coefficient of variation.

Focus: Combining statistics from two or more groups and comparing their variability.

Formulas to Apply:

Combined Mean (x-bar₁₂):
x-bar₁₂ = (n₁ * x-bar₁ + n₂ * x-bar₂) / (n₁ + n₂)
Combined Variance (σ²₁₂):
σ²₁₂ = [ n₁(σ₁² + d₁²) + n₂(σ₂² + d₂²) ] / (n₁ + n₂)
- d₁ = x-bar₁ - x-bar₁₂
- d₂ = x-bar₂ - x-bar₁₂
Coefficient of Variation (C.V.):
C.V. = (σ / |mean|) * 100
- This is a relative measure of dispersion, expressed as a percentage. It is used to compare the variability of two datasets, even if their means are different.

A dataset with a lower C.V. is considered more consistent or less variable.
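As a sketch, the combined mean, combined variance, and C.V. formulas translate directly into two small functions. The function names and the group summaries are hypothetical.

```python
def combined_mean_variance(n1, mean1, var1, n2, mean2, var2):
    """Combine the means and variances (divisor n) of two groups."""
    mean12 = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    d1, d2 = mean1 - mean12, mean2 - mean12
    var12 = (n1 * (var1 + d1 ** 2) + n2 * (var2 + d2 ** 2)) / (n1 + n2)
    return mean12, var12

def coefficient_of_variation(sd, mean):
    """Relative dispersion as a percentage."""
    return sd / abs(mean) * 100

# Hypothetical summaries: group 1 (n=50, mean=60, sd=8), group 2 (n=100, mean=45, sd=9).
mean12, var12 = combined_mean_variance(50, 60, 8 ** 2, 100, 45, 9 ** 2)
cv1 = coefficient_of_variation(8, 60)
cv2 = coefficient_of_variation(9, 45)
more_consistent = "group 1" if cv1 < cv2 else "group 2"
```

Here group 1 has the smaller C.V., so it is the more consistent group even though the two standard deviations are close.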

Practical 5: Problems based on moments, skewness and kurtosis.

Focus: Describing the shape of the distribution.

Formulas to Apply:

Raw Moments (μ'ᵣ): (Moments about origin)
- In general: μ'ᵣ = Σ(f * xʳ) / N
- μ'₁ = Σ(f * x) / N = Mean
- μ'₂ = Σ(f * x²) / N
Central Moments (μᵣ): (Moments about mean)
- μ₁ = 0 (always)
- μ₂ = μ'₂ - (μ'₁)² = Variance
- μ₃ = μ'₃ - 3μ'₂μ'₁ + 2(μ'₁)³
- μ₄ = μ'₄ - 4μ'₃μ'₁ + 6μ'₂(μ'₁)² - 3(μ'₁)⁴
Karl Pearson's Coefficient of Skewness (Sk):
Sk = (Mean - Mode) / σ OR Sk = 3 * (Mean - Median) / σ
- If Sk > 0, positively skewed (right tail).
- If Sk < 0, negatively skewed (left tail).
- If Sk = 0, symmetric.
Moment Coefficient of Skewness (β₁):
β₁ = μ₃² / μ₂³
- β₁ = 0 for a symmetric distribution.
- β₁ is never negative, so the direction of skewness is read from the sign of μ₃.
Moment Coefficient of Kurtosis (β₂):
β₂ = μ₄ / μ₂²
- If β₂ = 3, Mesokurtic (Normal curve).
- If β₂ > 3, Leptokurtic (More peaked than normal).
- If β₂ < 3, Platykurtic (Flatter than normal).
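The conversion from raw moments to central moments, and then to β₁ and β₂, can be sketched as below. The values and frequencies are hypothetical, and `raw_moment` is an illustrative helper name.

```python
def raw_moment(xs, fs, r):
    """r-th moment about the origin: sum(f * x**r) / N."""
    return sum(f * x ** r for x, f in zip(xs, fs)) / sum(fs)

xs = [1, 2, 3, 4, 5]  # hypothetical values
fs = [1, 2, 4, 2, 1]  # hypothetical frequencies (symmetric about 3)

m1, m2, m3, m4 = (raw_moment(xs, fs, r) for r in (1, 2, 3, 4))

# Central moments expressed through raw moments.
mu2 = m2 - m1 ** 2
mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4

beta1 = mu3 ** 2 / mu2 ** 3  # moment coefficient of skewness
beta2 = mu4 / mu2 ** 2       # moment coefficient of kurtosis
```

For this symmetric data μ₃ is zero up to rounding, so β₁ is 0. β₂ comes out below 3, so the distribution is platykurtic.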

Practical 6: Fitting of polynomials, exponential curves.

Focus: Using the Principle of Least Squares to find the "best fit" curve for a set of (x, y) data points.

Curves to Fit:

Straight Line (Polynomial of degree 1): y = a + bx
- Normal Equations:
  Σy = na + bΣx
  Σxy = aΣx + bΣx²
- Solve this 2x2 system of equations for 'a' and 'b'.
Parabola (Polynomial of degree 2): y = a + bx + cx²
- Normal Equations:
  Σy = na + bΣx + cΣx²
  Σxy = aΣx + bΣx² + cΣx³
  Σx²y = aΣx² + bΣx³ + cΣx⁴
- Solve this 3x3 system for 'a', 'b', and 'c'.
Exponential Curve: y = abˣ
- Transform first: Take log on both sides.
  log(y) = log(a) + x * log(b)
- Let Y = log(y), A = log(a), B = log(b).
- The equation becomes Y = A + Bx, which is a straight line.
- Fit this line using the normal equations for a straight line:
  ΣY = nA + BΣx
  ΣxY = AΣx + BΣx²
- Solve for A and B, then find the original parameters: a = antilog(A) and b = antilog(B).
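A minimal sketch of the straight-line and exponential fits follows. The function names and the data points are hypothetical, and base-10 logarithms are assumed so that antilog(A) = 10^A. The parabola, which needs a 3x3 solve, is not covered.

```python
import math

def fit_line(xs, ys):
    """Solve the two normal equations for y = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b

def fit_exponential(xs, ys):
    """Fit y = a * b**x by fitting a straight line to (x, log10 y)."""
    A, B = fit_line(xs, [math.log10(y) for y in ys])
    return 10 ** A, 10 ** B  # antilog of A and B

xs = [0, 1, 2, 3, 4]
line_a, line_b = fit_line(xs, [1, 3, 5, 7, 9])          # points on y = 1 + 2x
exp_a, exp_b = fit_exponential(xs, [3, 6, 12, 24, 48])  # points on y = 3 * 2**x
```

Because the sample points lie exactly on the stated curves, the fitted parameters recover them up to rounding. With real data the fit minimizes squared error in log(y), not in y itself.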

Practical 7: Karl Pearson's correlation coefficient.

Focus: Calculating the strength and direction of the linear relationship between two quantitative variables (x, y).

Formula (r):

r = Cov(x, y) / (σₓ * σᵧ)

Computational Formula:

r = [ nΣxy - (Σx)(Σy) ] / sqrt( [nΣx² - (Σx)²] * [nΣy² - (Σy)²] )

Properties of 'r':

-1 ≤ r ≤ +1
r = +1: Perfect positive linear correlation.
r = -1: Perfect negative linear correlation.
r = 0: No linear correlation.
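The computational formula for r can be sketched as a single function. The name `pearson_r` and the sample pairs are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's r from the sums n, Σx, Σy, Σx², Σy², Σxy."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )

xs = [1, 2, 3, 4, 5]   # hypothetical paired observations
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
```

Passing `ys` that are an exact increasing or decreasing linear function of `xs` returns +1 or -1 (up to rounding), matching the properties listed above.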

Practical 8: Correlation coefficient for a bivariate frequency distribution.

Focus: Calculating Karl Pearson's 'r' when the data is given in a bivariate frequency table (a "correlation table").

Formula:

r = [ NΣ(f * u * v) - (Σfᵤu)(Σfᵥv) ] / sqrt( [NΣ(fᵤu²) - (Σfᵤu)²] * [NΣ(fᵥv²) - (Σfᵥv)²] )

This is the same formula as Practical 7, but using coded values (u, v) and frequencies (f).
u = (x - A) / h (step-deviation for x)
v = (y - B) / k (step-deviation for y)
fᵤ and fᵥ are the marginal frequencies.
Σ(f * u * v) is the sum of the product of frequencies and their respective u,v values from the cells of the table.

Correlation is independent of change of origin and scale. This is why we can use the (u, v) coded values to simplify calculations, and the 'r' value will be the same as for the original (x, y) data.
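For a correlation table, the same calculation runs over marginal totals and cell frequencies. The 3x3 table below is hypothetical, with rows indexed by v and columns by u.

```python
import math

u_vals = [-1, 0, 1]   # coded x values, u = (x - A) / h
v_vals = [-1, 0, 1]   # coded y values, v = (y - B) / k
# Hypothetical cell frequencies: rows follow v_vals, columns follow u_vals.
table = [
    [3, 2, 0],
    [2, 6, 2],
    [0, 2, 3],
]

N = sum(map(sum, table))
f_u = [sum(col) for col in zip(*table)]  # marginal frequencies of u (column totals)
f_v = [sum(row) for row in table]        # marginal frequencies of v (row totals)

s_u = sum(f * u for f, u in zip(f_u, u_vals))
s_v = sum(f * v for f, v in zip(f_v, v_vals))
s_uu = sum(f * u * u for f, u in zip(f_u, u_vals))
s_vv = sum(f * v * v for f, v in zip(f_v, v_vals))
# Sum of f * u * v over every cell of the table.
s_uv = sum(table[i][j] * u * v
           for i, v in enumerate(v_vals)
           for j, u in enumerate(u_vals))

r = (N * s_uv - s_u * s_v) / math.sqrt(
    (N * s_uu - s_u ** 2) * (N * s_vv - s_v ** 2)
)
```

Only the cells where both u and v are non-zero contribute to Σ(f * u * v), which is why coding around a central class keeps hand calculation short.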

Practical 9: Fitting of lines of regression.

Focus: Finding the "best fit" straight line (y = a + bx) to predict one variable from another.

Two Lines of Regression:

Regression Line of Y on X: (Used to predict Y if X is known)
(y - y-bar) = bᵧₓ * (x - x-bar)
- bᵧₓ is the regression coefficient of Y on X.
- bᵧₓ = Cov(x, y) / σₓ² = r * (σᵧ / σₓ)
Regression Line of X on Y: (Used to predict X if Y is known)
(x - x-bar) = bₓᵧ * (y - y-bar)
- bₓᵧ is the regression coefficient of X on Y.
- bₓᵧ = Cov(x, y) / σᵧ² = r * (σₓ / σᵧ)

Properties:

The two lines intersect at the point (x-bar, y-bar).
r² = bᵧₓ * bₓᵧ, so r = ±sqrt(bᵧₓ * bₓᵧ): 'r' is the geometric mean of the two regression coefficients, taken with their common sign.
Both coefficients must have the same sign.
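Both regression lines and the listed properties can be sketched from the same sums. The data are hypothetical, and `predict_y` and `predict_x` are illustrative names.

```python
import math

xs = [1, 2, 3, 4, 5]   # hypothetical paired observations
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar
var_x = sum(x * x for x in xs) / n - x_bar ** 2
var_y = sum(y * y for y in ys) / n - y_bar ** 2

b_yx = cov_xy / var_x   # regression coefficient of Y on X
b_xy = cov_xy / var_y   # regression coefficient of X on Y

def predict_y(x):
    """Line of Y on X: y - y_bar = b_yx * (x - x_bar)."""
    return y_bar + b_yx * (x - x_bar)

def predict_x(y):
    """Line of X on Y: x - x_bar = b_xy * (y - y_bar)."""
    return x_bar + b_xy * (y - y_bar)

# r is the geometric mean of the coefficients, carrying their common sign.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
```

Both lines pass through (x-bar, y-bar): `predict_y(x_bar)` returns `y_bar` and `predict_x(y_bar)` returns `x_bar`.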

Practical 10: Spearman rank correlation with and without ties.

Focus: Calculating the correlation between two variables when the data is ranked (ordinal). It measures the strength of a monotonic relationship (not just linear).

Formulas:

Case 1: No Ties in Ranks
R = 1 - [ (6 * Σd²) / (n³ - n) ]
- d = R₁ - R₂ (Difference between the ranks for each observation)
- n = Number of pairs
Case 2: With Ties in Ranks
- If a tie occurs, assign the average rank to all tied items.
- Calculate Correction Factors (C.F.) for each tie:
  C.F. = (m³ - m) / 12, where 'm' is the number of items in a tie.
- Calculate Σ(C.F.ₓ) for all ties in the x-variable and Σ(C.F.ᵧ) for all ties in the y-variable.
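- Add the correction factors to Σd² and use R = 1 - [ 6 * (Σd² + Σ(C.F.ₓ) + Σ(C.F.ᵧ)) / (n³ - n) ]. This is the usual textbook adjustment; applying Karl Pearson's 'r' directly to the average ranks is the general definition.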
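The ranking and tie-correction steps can be sketched as follows, assuming the common textbook adjustment in which the correction factors are added to Σd². The function names and the example scores are hypothetical, and rank 1 is given to the largest value.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied items share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1   # average of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def tie_correction(ranks):
    """Sum of (m**3 - m) / 12 over every group of m tied ranks."""
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    return sum((m ** 3 - m) / 12 for m in counts.values())

def spearman(xs, ys):
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Assumed textbook adjustment: add correction factors to the sum of d squared.
    adjusted = sum_d2 + tie_correction(rx) + tie_correction(ry)
    return 1 - 6 * adjusted / (n ** 3 - n)

R_tied = spearman([15, 20, 20, 25, 30], [40, 30, 50, 30, 20])
```

With no ties every correction factor is zero, so the same function reduces to the Case 1 formula.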

Practicals 11-14: Fitting of Discrete Distributions

(Practical 11: Binomial, 12: Poisson, 13: Negative Binomial, 14: Suitable distribution)

Focus: Given an observed frequency distribution (Oᵢ), find the expected frequencies (Eᵢ) according to a theoretical distribution (e.g., Binomial, Poisson).

General Procedure:

Estimate Parameters:
- Binomial (n, p): 'n' is usually given. Estimate 'p' by setting the observed mean (x-bar) equal to the theoretical mean (np).
  x-bar = np => p = x-bar / n
- Poisson (λ): Estimate 'λ' by setting the observed mean (x-bar) equal to the theoretical mean (λ).
  λ = x-bar
Calculate Probabilities:
- Binomial: Use the p.m.f. P(x) = C(n, x)pˣ(1-p)ⁿ⁻ˣ to find P(0), P(1), P(2), ... P(n).
- Poisson: Use the p.m.f. P(x) = (e^(-λ) * λˣ) / x! to find P(0), P(1), P(2), ...
- Tip for Poisson: Use the recurrence relation: P(x) = P(x-1) * (λ / x). First, calculate P(0) = e^(-λ), then P(1) = P(0)*(λ/1), P(2) = P(1)*(λ/2), etc.
Calculate Expected Frequencies (Eᵢ):
Eᵢ = N * P(x)
- Where N is the total observed frequency (N = ΣOᵢ).
Compare: Create a table of Observed Frequencies (Oᵢ) and Expected Frequencies (Eᵢ) to see how good the fit is. (Later, a Chi-Square Goodness-of-Fit test is used).
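The four steps for a Poisson fit can be sketched as below. The observed frequencies are illustrative only.

```python
import math

x_values = [0, 1, 2, 3, 4]
observed = [109, 65, 22, 3, 1]   # illustrative observed frequencies
N = sum(observed)

# Step 1: estimate lambda by the observed mean.
lam = sum(x * o for x, o in zip(x_values, observed)) / N

# Step 2: probabilities by the recurrence P(x) = P(x-1) * lam / x,
# starting from P(0) = exp(-lam).
probs = [math.exp(-lam)]
for x in x_values[1:]:
    probs.append(probs[-1] * lam / x)

# Step 3: expected frequencies E = N * P(x).
expected = [N * p for p in probs]

# Step 4: observed and expected frequencies side by side.
for x, o, e in zip(x_values, observed, expected):
    print(f"x={x}  O={o:3d}  E={e:7.2f}")
```

The expected frequencies sum to slightly less than N because the Poisson tail beyond the last listed x is not included. One common convention treats the last class as "4 or more" to absorb that remainder.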

"Fitting a suitable distribution" means you must first decide which one is appropriate.

Calculate the observed mean (x-bar) and variance (s²).
If x-bar ≈ s², a Poisson distribution is likely a good fit.
If s² < x-bar, a Binomial distribution is likely a good fit.
If s² > x-bar, a Negative Binomial distribution is likely a good fit.
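A rough sketch of the mean-versus-variance check, followed by a Binomial fit, is shown below. The `tolerance` cut-off for "approximately equal" is a hypothetical choice, and the Negative Binomial branch for s² > x-bar is an assumption based on that distribution's variance exceeding its mean. The observed frequencies are hypothetical.

```python
import math

def suggest_distribution(mean, variance, tolerance=0.1):
    """Rule of thumb: compare the variance with the mean.

    `tolerance` is a hypothetical relative cut-off for "approximately equal".
    """
    if abs(variance - mean) <= tolerance * mean:
        return "Poisson"
    return "Binomial" if variance < mean else "Negative Binomial"

n_trials = 4
x_values = [0, 1, 2, 3, 4]
observed = [8, 30, 40, 18, 4]   # hypothetical observed frequencies
N = sum(observed)

mean = sum(x * o for x, o in zip(x_values, observed)) / N
variance = sum(x * x * o for x, o in zip(x_values, observed)) / N - mean ** 2
choice = suggest_distribution(mean, variance)

# Binomial fit: p = x-bar / n, then E = N * C(n, x) * p**x * (1 - p)**(n - x).
p = mean / n_trials
expected = [
    N * math.comb(n_trials, x) * p ** x * (1 - p) ** (n_trials - x)
    for x in x_values
]
```

Because the Binomial probabilities for x = 0 to n sum to 1, the expected frequencies here add up to N.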

Practicals 15-16: Applications & Fitting of Normal Distribution

(Practical 15: Applications of Normal distribution, 16: Fitting of Normal distribution)

Focus: Using the properties of the Normal curve to find probabilities and fit it to data.

Practical 15: Applications

This involves word problems like "Given X ~ N(μ, σ²), find P(a < X < b)."
Procedure:
1. Standardize the values 'a' and 'b' using the Z-formula Z = (X - μ) / σ:
  Z₁ = (a - μ) / σ
  Z₂ = (b - μ) / σ
2. Find P(Z₁ < Z < Z₂) by looking up the cumulative areas for Z₁ and Z₂ in the Standard Normal (Z) table and subtracting: P(Z₁ < Z < Z₂) = P(Z < Z₂) - P(Z < Z₁).
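Where a Z-table is not at hand, the cumulative area can be computed from the error function. This is a sketch: `phi` and `normal_prob_between` are illustrative names, and the example parameters are hypothetical.

```python
import math

def phi(z):
    """Standard normal cumulative area P(Z <= z), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def normal_prob_between(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma**2), by standardizing a and b."""
    z1 = (a - mu) / sigma
    z2 = (b - mu) / sigma
    return phi(z2) - phi(z1)

# Hypothetical example: X ~ N(50, 10**2), find P(40 < X < 60).
p = normal_prob_between(40, 60, mu=50, sigma=10)
```

Here Z₁ = -1 and Z₂ = +1, so the result is the area within one standard deviation of the mean, about 0.68.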

Practical 16: Fitting

This is similar to fitting discrete distributions, but for continuous data.
Procedure:
1. Calculate the observed mean (x-bar) and standard deviation (σ) from the given frequency distribution. These are your estimates for μ and σ.
2. For each class boundary (x), calculate the Z-score:
  Z = (x - μ) / σ
3. Using the Z-table, find the cumulative area from -∞ up to each Z-score (P(Z ≤ z)).
4. Find the area (probability) within each class interval by subtracting the cumulative areas of its boundaries.
  P(a < X < b) = P(Z < Z₂) - P(Z < Z₁)
5. Calculate the Expected Frequency for that class:
  Eᵢ = N * P(a < X < b)
6. Compare the Oᵢ and Eᵢ values.
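The six fitting steps can be sketched as below. The grouped data are hypothetical, `phi` is an illustrative helper, and the error function is used in place of a Z-table.

```python
import math

def phi(z):
    """Standard normal cumulative area P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

boundaries = [0, 10, 20, 30, 40, 50]   # hypothetical class boundaries
observed = [5, 8, 12, 10, 5]           # hypothetical observed frequencies
N = sum(observed)
mids = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

# Step 1: estimate mu and sigma from the grouped data (divisor N).
mu = sum(f * x for f, x in zip(observed, mids)) / N
sigma = math.sqrt(sum(f * x * x for f, x in zip(observed, mids)) / N - mu ** 2)

# Steps 2-3: cumulative area up to each class boundary.
cum_area = [phi((b - mu) / sigma) for b in boundaries]

# Step 4: area inside each class = difference of cumulative areas.
class_prob = [hi - lo for lo, hi in zip(cum_area, cum_area[1:])]

# Step 5: expected frequencies.
expected = [N * p for p in class_prob]

# Step 6: observed and expected frequencies side by side.
for (lo, hi), o, e in zip(zip(boundaries, boundaries[1:]), observed, expected):
    print(f"{lo:2d}-{hi:2d}  O={o:2d}  E={e:6.2f}")
```

The expected frequencies sum to less than N because the areas below the first boundary and above the last are left out. One common convention extends the first and last classes to -∞ and +∞ to absorb them.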
