SEC-151L: Statistical Data Analysis using R (Lab)

This lab course provides hands-on practice for the concepts learned in SEC-151T. You will use R and RStudio to perform data analysis.

List of Practicals

1. Plotting simple graphs in R

Commands:

# Sample data
x <- rnorm(100, mean = 50, sd = 10)   # 100 normal random numbers
y_counts <- table(sample(c("A", "B", "C"), 100, replace = TRUE))

# Histogram (for continuous data)
hist(x)

# Bar diagram (for categorical/discrete data)
barplot(y_counts)

# Pie diagram
pie(y_counts)

# Boxplot
boxplot(x)

# Stem-and-leaf plot (displayed in the console)
stem(x)

# Ogive (cumulative frequency plot)
# Create a cumulative frequency table
cum_freq <- cumsum(table(cut(x, breaks = 10)))
# Plot the ogive
plot(cum_freq, type = "b", xlab = "Class Intervals", ylab = "Cumulative Frequency")

2. Adding labels, titles, and legends to plots. Saving and exporting plots.

Commands:

# 1. Open the file device
png("my_beautiful_plot.png")

# 2. Create the plot with all customizations
hist(x,
     main   = "Histogram of X",
     xlab   = "Value of X",
     ylab   = "Frequency",
     col    = "lightblue",
     border = "darkblue")

# Add a legend (more common on scatter plots, but shown here as an example)
legend("topright", legend = "Sample Data", fill = "lightblue")

# 3. Close the device to save the file
dev.off()

3. Tabulation of raw data in R

Commands:

# One-way frequency table
my_vector <- c("A", "B", "A", "A", "C", "B")
table(my_vector)

# Two-way cross-tabulation
df <- data.frame(
  gender = c("M", "F", "M", "F", "F"),
  smokes = c("Yes", "No", "No", "Yes", "No")
)
table(df$gender, df$smokes)

4. To compute mean, median and mode for a grouped frequency data in R

Note: R works best with raw data. For grouped data, calculations are more manual or require special packages. Base R has no built-in mode function, so you must write one yourself.

# Function to find the mode
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Assume 'x' is your vector of raw data
mean(x)
median(x)
get_mode(x)
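For genuinely grouped frequency data, the note above applies: you work from class midpoints and frequencies by hand. A minimal sketch, assuming a hypothetical frequency table (the midpoints and frequencies below are made-up illustration values):

```r
# Hypothetical grouped frequency table: class midpoints and frequencies
midpoints   <- c(5, 15, 25, 35, 45)
frequencies <- c(4, 10, 18, 12, 6)

# Mean of grouped data: sum(f * x) / sum(f)
grouped_mean <- sum(frequencies * midpoints) / sum(frequencies)
print(grouped_mean)   # 26.2

# Modal class: the class with the largest frequency
modal_midpoint <- midpoints[which.max(frequencies)]
print(modal_midpoint)   # 25
```

The grouped median can be found the same way, by locating the class containing the (n/2)-th observation and interpolating within it.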

5. To compute Geometric mean and Harmonic mean.

Note: These are not in base R and require simple custom functions.

# Sample data (must be positive)
x <- c(4, 5, 8, 10, 12)

# Geometric mean: G.M. = (x1 * x2 * ... * xn)^(1/n)
# Use logs to avoid overflow: exp(mean(log(x)))
gm <- exp(mean(log(x)))
print(gm)

# Harmonic mean: H.M. = n / (1/x1 + 1/x2 + ... + 1/xn)
hm <- length(x) / sum(1/x)
print(hm)

6. To compute mean, median, variance, covariance, standard deviation in R.

Commands:

x <- c(10, 12, 15, 18, 20)
y <- c(5, 8, 7, 12, 10)

mean(x)
median(x)
var(x)      # Sample variance (divides by n - 1)
sd(x)       # Sample standard deviation
cov(x, y)   # Covariance between x and y

7. Computation of partition values, skewness and kurtosis in R.

Commands:

# 1. Partition values (quantiles)
x <- rnorm(100)
quantile(x)                        # 0%, 25%, 50%, 75%, 100%
quantile(x, probs = c(0.1, 0.9))   # 10th and 90th percentiles

# 2. Skewness and kurtosis (require the 'e1071' package)
# install.packages("e1071")   # Run this once
library(e1071)
skewness(x)
kurtosis(x)

8. To compute correlation and lines of regression in R.

Commands:

x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

# Correlation (Pearson is the default)
cor(x, y)

# Correlation (Spearman)
cor(x, y, method = "spearman")

# Fit the regression line (y = a + bx)
model <- lm(y ~ x)

# See the coefficients (a and b)
print(model)

# See the full analysis (R-squared, p-values, etc.)
summary(model)

9. Random number generation from different distributions in R.

Commands: (The 'r' prefix stands for 'random')

# 10 random numbers from a Normal distribution
# (mean = 0, sd = 1 by default)
rnorm(10, mean = 50, sd = 10)

# 10 random numbers from a Uniform distribution
# (min = 0, max = 1 by default)
runif(10, min = 1, max = 6)   # Simulates a die roll (sort of)

# 10 random numbers from a Binomial distribution
# (size = number of trials, prob = success probability)
rbinom(10, size = 5, prob = 0.5)   # Simulates 10 people flipping 5 coins each

# 10 random numbers from a Poisson distribution
# (lambda = average rate)
rpois(10, lambda = 3)   # e.g., number of customers in 10 different minutes

10. Fitting of simple linear regression in R and its interpretation.

This involves using lm() and summary() as in practical 8.

# 1. Fit the model
model <- lm(y ~ x)

# 2. Get the summary for interpretation
model_summary <- summary(model)
print(model_summary)

Interpretation: In the summary output, the "Estimate" column gives the intercept (a) and slope (b) of the fitted line; the p-value on the slope tests whether x has a statistically significant effect on y (commonly judged at p < 0.05); and Multiple R-squared gives the proportion of variation in y explained by x.

11. Fitting of polynomials and exponential curves in R.

Commands:

# 1. Polynomial (e.g., y = a + b1*x + b2*x^2)
# Use I() to treat x^2 "as is"
poly_model <- lm(y ~ x + I(x^2))
summary(poly_model)

# 2. Exponential (y = a*e^(b*x))
# Linearize by taking the log of y: log(y) = log(a) + b*x
# Note: y must be positive
exp_model <- lm(log(y) ~ x)
summary(exp_model)
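Because the exponential model is fitted on log(y), its coefficients and predictions come back on the log scale and must be exponentiated to recover the original curve. A minimal sketch, using made-up data that roughly follows y = e^x:

```r
# Hypothetical positive data following an approximately exponential trend
x <- c(1, 2, 3, 4, 5)
y <- c(2.7, 7.4, 20.1, 54.6, 148.4)   # approximately e^x

# Fit on the log scale: log(y) = log(a) + b*x
exp_model <- lm(log(y) ~ x)

# Recover a and b on the original scale
a <- exp(coef(exp_model)[1])
b <- coef(exp_model)[2]

# Predictions must also be exponentiated back
y_hat <- exp(predict(exp_model, newdata = data.frame(x = x)))
print(round(b, 2))   # the slope b should be close to 1 for this data
```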

12. Fitting of Binomial and Poisson distribution in R.

This usually means performing a Chi-Square Goodness-of-Fit Test to see if observed data "fits" a theoretical distribution.

# Example: Does a die roll fit a fair uniform distribution?
observed_counts <- c(18, 22, 19, 21, 20, 20)   # Total: 120 rolls
expected_probs  <- c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
chisq.test(x = observed_counts, p = expected_probs)

# The p-value indicates the goodness of fit:
# p-value > 0.05: good fit (do not reject the null hypothesis)
# p-value < 0.05: poor fit (reject the null hypothesis)
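The same idea extends to the Poisson distribution named in the title: estimate lambda from the data, compute expected class probabilities with dpois(), and run the chi-square test. A minimal sketch with made-up observed frequencies:

```r
# Hypothetical observed frequencies of 0, 1, 2, 3, and 4 events
counts   <- c(0, 1, 2, 3, 4)
observed <- c(30, 35, 20, 10, 5)   # made-up frequencies; total 100

# Estimate lambda as the sample mean
lambda_hat <- sum(counts * observed) / sum(observed)
print(lambda_hat)   # 1.25

# Expected Poisson probabilities; lump the upper tail into the last class
# so the probabilities sum to 1
expected_probs <- dpois(0:3, lambda_hat)
expected_probs <- c(expected_probs, 1 - sum(expected_probs))

# Chi-square goodness-of-fit test
# (R may warn about small expected counts; merge classes if so)
chisq.test(x = observed, p = expected_probs)
```

Fitting a Binomial distribution follows the same pattern, with dbinom() supplying the expected probabilities.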

13. Problems based on selecting random sample in R (with and without replacement).

Command: sample()

population <- 1:100   # A population of the numbers 1 to 100

# 1. Sampling WITHOUT replacement
sample_without <- sample(population, size = 10, replace = FALSE)
print(sample_without)

# 2. Sampling WITH replacement
sample_with <- sample(population, size = 10, replace = TRUE)
print(sample_with)

14. Problems based on plotting normal probability plot in R (P-P plot and Q-Q plot).

The most common is the Q-Q (Quantile-Quantile) Plot, used to check if data is normally distributed.

Commands:

x <- rnorm(100)     # Data that IS normal
# x <- runif(100)   # Data that is NOT normal

# 1. Create the Q-Q plot
qqnorm(x)

# 2. Add the theoretical line
qqline(x, col = "red")

Interpretation: If the points fall closely along the straight red line, the data is considered normally distributed.