Unit 2 Note - SEC 151 Statistics Semester 2

1. Plotting simple graphs in R

Commands:

# Sample data
x <- rnorm(100, mean = 50, sd = 10)  # 100 normal random numbers
y_counts <- table(sample(c("A", "B", "C"), 100, replace = TRUE))

# Histogram (for continuous data)
hist(x)

# Bar Diagram (for categorical/discrete data)
barplot(y_counts)

# Pie Diagram
pie(y_counts)

# Boxplot
boxplot(x)

# Stem-and-leaf plot (displays in console)
stem(x)

# Ogives (Cumulative Frequency Plot)
# Create a cumulative frequency table
cum_freq <- cumsum(table(cut(x, breaks = 10)))

# Plot the ogive
plot(cum_freq, type = "b",
     xlab = "Class Intervals",
     ylab = "Cumulative Frequency")
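The ogive above is plotted against the class index. A minimal sketch of a "less than" ogive plotted against the actual class boundaries is given here; it borrows the breaks that hist() computes, and the seed and variable names are only illustrative assumptions.

```r
set.seed(42)                           # assumed seed, only for reproducibility
x <- rnorm(100, mean = 50, sd = 10)

h <- hist(x, plot = FALSE)             # compute class breaks and counts without drawing
boundaries <- h$breaks                 # class boundaries
cum_freq <- c(0, cumsum(h$counts))     # cumulative frequency up to each boundary

plot(boundaries, cum_freq, type = "b",
     main = "Less-than Ogive",
     xlab = "Class Boundary",
     ylab = "Cumulative Frequency")
```

The last cumulative frequency should equal the number of observations.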

2. Adding labels, titles, and legends to plots. Saving and exporting plots.

Commands:

# 1. Open the file device
png("my_beautiful_plot.png")

# 2. Create the plot with all customizations
hist(x,
     main = "Histogram of X",
     xlab = "Value of X",
     ylab = "Frequency",
     col = "lightblue",
     border = "darkblue")

# Add a legend (more common on scatter plots, but for example)
legend("topright", legend = "Sample Data", fill = "lightblue")

# 3. Close the device to save the file
dev.off()
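Since legends are more typical on scatter plots, here is a small sketch of a two-group scatter plot with a legend, exported as a PDF instead of a PNG. The data, the group labels, and the output file name are hypothetical.

```r
# Hypothetical data: six points in two groups
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 3, 5, 4, 6, 7)
group <- c("A", "A", "A", "B", "B", "B")
point_cols <- ifelse(group == "A", "blue", "red")

# Write to a temporary folder; replace with any path you like
out_file <- file.path(tempdir(), "scatter_example.pdf")

pdf(out_file, width = 7, height = 5)   # 1. open the PDF device (size in inches)
plot(x, y, col = point_cols, pch = 19,
     main = "Scatter Plot by Group",
     xlab = "X values", ylab = "Y values")
legend("topleft", legend = c("Group A", "Group B"),
       col = c("blue", "red"), pch = 19)
dev.off()                              # 2. close the device to save the file
```

The same pattern works with png() or jpeg(); only the device function changes.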

3. Tabulation of raw data in R

Commands:

# One-way frequency table
my_vector <- c("A", "B", "A", "A", "C", "B")
table(my_vector)

# Two-way cross-tabulation
df <- data.frame(
  gender = c("M", "F", "M", "F", "F"),
  smokes = c("Yes", "No", "No", "Yes", "No")
)
table(df$gender, df$smokes)  # use $ to access the columns of df
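Raw numeric data is usually tabulated into class intervals first. A minimal sketch using cut(), with hypothetical marks and class limits chosen only for illustration:

```r
# Hypothetical raw marks of 10 students
marks <- c(12, 25, 37, 41, 18, 29, 33, 45, 22, 38)

# Classes [10,20), [20,30), [30,40), [40,50); right = FALSE closes the left end
classes <- cut(marks, breaks = seq(10, 50, by = 10), right = FALSE)

freq_table <- table(classes)          # frequency of each class
rel_freq <- prop.table(freq_table)    # relative frequencies (sum to 1)

print(freq_table)
print(rel_freq)
```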

4. To compute mean, median and mode for a grouped frequency data in R

Note: R works best with raw data. For grouped data, calculations are more manual or require special packages. For the mode, you must create a function.

# Function to find the mode
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Assume 'x' is your vector of raw data
mean(x)
median(x)
get_mode(x)
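The commands above work on raw data. For a grouped frequency table, one manual approach is to apply the usual textbook formulas directly. The sketch below assumes equal class widths; the class limits and frequencies are hypothetical.

```r
# Hypothetical grouped data: classes 0-10, 10-20, ..., 40-50
lower <- c(0, 10, 20, 30, 40)   # lower class limits
width <- 10                     # common class width (assumed equal)
freq  <- c(5, 8, 12, 9, 6)      # class frequencies

mid <- lower + width / 2        # class mid-points
n   <- sum(freq)

# Mean = sum(f * mid) / N
grouped_mean <- sum(freq * mid) / n

# Median = L + ((N/2 - cf) / f) * h, using the first class whose
# cumulative frequency reaches N/2
cum_freq  <- cumsum(freq)
m         <- which(cum_freq >= n / 2)[1]
cf_before <- if (m == 1) 0 else cum_freq[m - 1]
grouped_median <- lower[m] + ((n / 2 - cf_before) / freq[m]) * width

# Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h, using the modal class
k  <- which.max(freq)
f1 <- freq[k]
f0 <- if (k == 1) 0 else freq[k - 1]
f2 <- if (k == length(freq)) 0 else freq[k + 1]
grouped_mode <- lower[k] + ((f1 - f0) / (2 * f1 - f0 - f2)) * width

print(c(mean = grouped_mean, median = grouped_median, mode = grouped_mode))
```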

5. To compute Geometric mean and Harmonic mean.

Note: These are not in base R and require simple custom functions.

# Sample data (must be positive)
x <- c(4, 5, 8, 10, 12)

# Geometric Mean
# G.M. = (x1 * x2 * ... * xn)^(1/n)
# We use logs to avoid overflow: exp(mean(log(x)))
gm <- exp(mean(log(x)))
print(gm)

# Harmonic Mean
# H.M. = n / (1/x1 + 1/x2 + ... + 1/xn)
hm <- length(x) / sum(1/x)
print(hm)
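A quick sanity check for these functions is the standard inequality H.M. <= G.M. <= A.M. for positive data. A small sketch on the same sample values:

```r
x <- c(4, 5, 8, 10, 12)

am <- mean(x)                    # arithmetic mean
gm <- exp(mean(log(x)))          # geometric mean
hm <- length(x) / sum(1 / x)     # harmonic mean

print(c(HM = hm, GM = gm, AM = am))

# Should be TRUE for any vector of positive numbers
hm <= gm && gm <= am
```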

6. To compute mean, median, variance, covariance, standard deviation in R.

Commands:

x <- c(10, 12, 15, 18, 20)
y <- c(5, 8, 7, 12, 10)

mean(x)
median(x)
var(x)     # Sample variance (divides by n-1)
sd(x)      # Sample standard deviation
cov(x, y)  # Covariance between x and y
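Because var() and sd() divide by n-1, a textbook "population" variance (dividing by n) has to be computed by hand. A minimal sketch using the same x:

```r
x <- c(10, 12, 15, 18, 20)
n <- length(x)

sample_var <- var(x)                    # divides by n - 1
pop_var    <- sum((x - mean(x))^2) / n  # divides by n
pop_sd     <- sqrt(pop_var)

# The two variances differ only by the factor (n - 1) / n
all.equal(pop_var, sample_var * (n - 1) / n)
```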

7. Computation of partition values, skewness and kurtosis in R.

Commands:

# 1. Partition values (Quantiles)
x <- rnorm(100)
quantile(x)                         # Gives 0%, 25%, 50%, 75%, 100%
quantile(x, probs = c(0.1, 0.9))    # Gives 10th and 90th percentiles

# 2. Skewness and Kurtosis (requires 'e1071' package)
# install.packages("e1071")  # Run this once
library(e1071)
skewness(x)
kurtosis(x)
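If the package cannot be installed, moment-based coefficients can be computed in base R. The helper names below are hypothetical, and the results may differ slightly from the package output, which can apply a small-sample adjustment depending on its settings.

```r
# Moment coefficient of skewness: m3 / m2^(3/2)
moment_skewness <- function(v) {
  d  <- v - mean(v)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

# Excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)
moment_kurtosis <- function(v) {
  d  <- v - mean(v)
  m2 <- mean(d^2)
  m4 <- mean(d^4)
  m4 / m2^2 - 3
}

v <- c(1, 2, 3, 4, 5)   # symmetric data, so skewness should be 0
moment_skewness(v)
moment_kurtosis(v)

# Deciles (D1 to D9) as partition values
quantile(v, probs = seq(0.1, 0.9, by = 0.1))
```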

8. To compute correlation and lines of regression in R.

Commands:

# Assume 'x' and 'y' vectors exist
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

# Correlation (Pearson is default)
cor(x, y)

# Correlation (Spearman)
cor(x, y, method = "spearman")

# Fit the regression line (y = a + bx)
model <- lm(y ~ x)

# See the coefficients (a and b)
print(model)

# See the full analysis (R-squared, p-values, etc.)
summary(model)
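The practical title refers to "lines" of regression, so a sketch of both lines (y on x and x on y) may be useful. It also checks the standard result that the product of the two regression coefficients equals r squared. Variable names are illustrative.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

model_yx <- lm(y ~ x)   # regression line of y on x
model_xy <- lm(x ~ y)   # regression line of x on y

b_yx <- unname(coef(model_yx)["x"])   # slope of y on x
b_xy <- unname(coef(model_xy)["y"])   # slope of x on y
r    <- cor(x, y)

# b_yx * b_xy should equal r^2
all.equal(b_yx * b_xy, r^2)
```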

9. Random number generation from different distributions in R.

Commands: (The 'r' prefix stands for 'random')

# 10 random numbers from a Normal distribution
# (mean = 0, sd = 1 is default)
rnorm(10, mean = 50, sd = 10)

# 10 random numbers from a Uniform distribution
# (min = 0, max = 1 is default)
runif(10, min = 1, max = 6)  # Continuous values between 1 and 6, so only loosely like a die roll

# 10 random numbers from a Binomial distribution
# (size = number of trials, prob = probability of success)
rbinom(10, size = 5, prob = 0.5)  # Simulates 10 people flipping 5 coins each

# 10 random numbers from a Poisson distribution
# (lambda = average rate)
rpois(10, lambda = 3)  # e.g., number of customers in 10 different minutes
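Random draws change on every run. A small sketch of how set.seed() makes them repeatable, together with a discrete die roll using sample(); the seed value 123 is an arbitrary assumption.

```r
set.seed(123)                                   # fix the random number stream
first_rolls <- sample(1:6, 10, replace = TRUE)  # 10 rolls of a fair die

set.seed(123)                                   # same seed again
second_rolls <- sample(1:6, 10, replace = TRUE)

# Same seed, same call, so the two results match
identical(first_rolls, second_rolls)
```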

10. Fitting of simple linear regression in R and its interpretation.

This involves using lm() and summary() as in practical 8.

# 1. Fit the model
model <- lm(y ~ x)

# 2. Get the summary for interpretation
model_summary <- summary(model)
print(model_summary)

Interpretation:

Coefficients (Estimate): The intercept (a) and slope (b).
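Coefficients (Pr(>|t|)): The p-value for each coefficient. If it is < 0.05, the variable is a statistically significant predictor at the 5% level.
Multiple R-squared: The proportion of the variance in 'y' explained by 'x' (a value between 0 and 1; multiply by 100 for a percentage).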
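The quantities listed above can also be pulled out of the fitted model individually. A minimal sketch using the x and y values from practical 8; the new x value used for prediction is hypothetical.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
model <- lm(y ~ x)
model_summary <- summary(model)

intercept_a <- unname(coef(model)[1])     # a in y = a + b*x
slope_b     <- unname(coef(model)[2])     # b in y = a + b*x
r_squared   <- model_summary$r.squared    # Multiple R-squared
p_values    <- model_summary$coefficients[, "Pr(>|t|)"]

# Predicted y for a new x (x = 6 is an illustrative value)
predicted <- unname(predict(model, newdata = data.frame(x = 6)))

print(c(a = intercept_a, b = slope_b, R2 = r_squared, y_at_6 = predicted))
```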

11. Fitting of polynomials and exponential curves in R.

Commands:

# 1. Polynomial (e.g., y = a + b1*x + b2*x^2)
# Use I() to treat x^2 "as is"
poly_model <- lm(y ~ x + I(x^2))
summary(poly_model)

# 2. Exponential (y = a*e^(b*x))
# We linearize by taking the log of y: log(y) = log(a) + b*x
# Note: y must be positive
exp_model <- lm(log(y) ~ x)
summary(exp_model)
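The exponential model is fitted on the log scale, so the intercept it reports is log(a), not a. A sketch of the back-transformation, using hypothetical data generated exactly from y = 2*e^(0.3x) so the recovered values are easy to check:

```r
# Hypothetical data lying exactly on y = 2 * exp(0.3 * x)
x <- 1:5
y <- 2 * exp(0.3 * x)

exp_model <- lm(log(y) ~ x)

a_hat <- unname(exp(coef(exp_model)[1]))   # back-transform the intercept: a = exp(log(a))
b_hat <- unname(coef(exp_model)[2])        # slope is b directly

# Fitted curve on the original scale
fitted_y <- a_hat * exp(b_hat * x)

print(c(a = a_hat, b = b_hat))
```

With real, noisy data the recovered a and b will only approximate the underlying values.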

12. Fitting of Binomial and Poisson distribution in R.

This usually means performing a Chi-Square Goodness-of-Fit Test to see if observed data "fits" a theoretical distribution.

# Example: Does a die roll fit a fair uniform distribution?
observed_counts <- c(18, 22, 19, 21, 20, 20)  # Total 120 rolls
expected_probs <- c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)

chisq.test(x = observed_counts, p = expected_probs)

# The p-value tells you the goodness of fit.
# p-value > 0.05: Good fit (Do not reject null hypothesis)
# p-value < 0.05: Poor fit (Reject null hypothesis)
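The die example tests a uniform distribution. A rough sketch of the same idea for a Poisson fit follows: estimate lambda from the data, compute expected frequencies with dpois(), and compare them with the observed ones. The observed frequencies are hypothetical, and the last class is treated as "4 or more" (counted as 4 when estimating lambda). The chi-square statistic is computed by hand so that one degree of freedom can be deducted for the estimated parameter.

```r
# Hypothetical observed frequencies for x = 0, 1, 2, 3, 4 or more
x_vals   <- 0:4
observed <- c(30, 36, 22, 9, 3)
n        <- sum(observed)

# Estimate lambda by the sample mean
lambda_hat <- sum(x_vals * observed) / n

# Poisson probabilities; the last class collects the whole upper tail
probs    <- dpois(0:3, lambda = lambda_hat)
probs    <- c(probs, 1 - sum(probs))
expected <- n * probs

# Chi-square goodness of fit, df = classes - 1 - (parameters estimated)
chi_sq  <- sum((observed - expected)^2 / expected)
df      <- length(observed) - 1 - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(round(expected, 2))
print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

A binomial fit follows the same steps with dbinom() in place of dpois(). Classes with very small expected frequencies are usually pooled before testing.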

13. Problems based on selecting random sample in R (with and without replacement).

Command: sample()

population <- 1:100  # A population of numbers 1 to 100

# 1. Sampling WITHOUT replacement
sample_without <- sample(population, size = 10, replace = FALSE)
print(sample_without)

# 2. Sampling WITH replacement
sample_with <- sample(population, size = 10, replace = TRUE)
print(sample_with)
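In practice the population is often a data frame, and the sample is drawn by row number. A minimal sketch with a hypothetical data frame of 20 students; the seed, column names, and values are illustrative only.

```r
set.seed(2024)   # assumed seed, only for reproducibility

# Hypothetical population of 20 students
students <- data.frame(roll_no = 1:20, marks = 41:60)

# Simple random sample of 5 rows WITHOUT replacement
srs_rows <- sample(nrow(students), size = 5, replace = FALSE)
srs <- students[srs_rows, ]
print(srs)

# Without replacement, no row can be chosen twice
anyDuplicated(srs_rows) == 0

# A sample larger than the population is only possible WITH replacement
big_rows <- sample(nrow(students), size = 25, replace = TRUE)
length(big_rows)
```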

14. Problems based on plotting normal probability plot in R (P-P plot and Q-Q plot).

The most common is the Q-Q (Quantile-Quantile) Plot, used to check if data is normally distributed.

Commands:

x <- rnorm(100)    # Data that IS normal
# x <- runif(100)  # Data that is NOT normal

# 1. Create the Q-Q plot
qqnorm(x)

# 2. Add the theoretical line
qqline(x, col = "red")

Interpretation: If the points fall closely along the straight red line, the data is consistent with a normal distribution. Systematic curvature or heavy departures at the ends suggest non-normality.
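Base R has no single command for a P-P plot, but one can be built by plotting empirical cumulative probabilities against the probabilities given by a fitted normal distribution. The sketch below uses the plotting position (i - 0.5)/n, which is one common choice among several; the seed is an assumption.

```r
set.seed(1)                 # assumed seed, only for reproducibility
x <- rnorm(100)

x_sorted <- sort(x)
n <- length(x_sorted)

# Empirical cumulative probabilities (plotting positions)
emp_p <- (1:n - 0.5) / n

# Theoretical cumulative probabilities from a normal distribution
# with the sample mean and standard deviation
theo_p <- pnorm(x_sorted, mean = mean(x), sd = sd(x))

plot(theo_p, emp_p,
     main = "Normal P-P Plot",
     xlab = "Theoretical Cumulative Probability",
     ylab = "Empirical Cumulative Probability")
abline(0, 1, col = "red")   # reference line y = x
```

As with the Q-Q plot, points lying close to the red line are consistent with normality.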


SEC-151L: Statistical Data Analysis using R (Lab)

List of Practicals

1. Plotting simple graphs in R

2. Adding labels, titles, and legends to plots. Saving and exporting plots.

3. Tabulation of raw data in R

4. To compute mean, median and mode for a grouped frequency data in R

5. To compute Geometric mean and Harmonic mean.

6. To compute mean, median, variance, covariance, standard deviation in R.

7. Computation of partition values, skewness and kurtosis in R.

8. To compute correlation and lines of regression in R.

9. Random number generation from different distributions in R.

10. Fitting of simple linear regression in R and its interpretation.

11. Fitting of polynomials and exponential curves in R.

12. Fitting of Binomial and Poisson distribution in R.

13. Problems based on selecting random sample in R (with and without replacement).

14. Problems based on plotting normal probability plot in R (P-P plot and Q-Q plot).