SEC-151L: Statistical Data Analysis using R (Lab)
This lab course provides hands-on practice for the concepts covered in SEC-151T. You will use R and RStudio to perform data analysis.
List of Practicals
1. Plotting simple graphs in R
Commands:
# Sample data
x <- rnorm(100, mean = 50, sd = 10) # 100 normal random numbers
y_counts <- table(sample(c("A", "B", "C"), 100, replace = TRUE))
# Histogram (for continuous data)
hist(x)
# Bar Diagram (for categorical/discrete data)
barplot(y_counts)
# Pie Diagram
pie(y_counts)
# Boxplot
boxplot(x)
# Stem-and-leaf plot (displays in console)
stem(x)
# Ogives (Cumulative Frequency Plot)
# Create a cumulative frequency table
cum_freq <- cumsum(table(cut(x, breaks=10)))
# Plot the ogive
plot(cum_freq, type = "b",
     xlab = "Class Intervals", ylab = "Cumulative Frequency")
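Note: The plot above places the classes at positions 1, 2, 3, ... on the x-axis. A less-than ogive is usually drawn against the upper class boundaries. The following is a minimal sketch of that approach, assuming the class boundaries chosen by hist() are acceptable; the seed and the variable names h and upper_bounds are illustrative only.

```r
# Illustrative data; the seed only makes the sketch reproducible
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)

# Let hist() choose class boundaries without drawing a histogram
h <- hist(x, plot = FALSE)
upper_bounds <- h$breaks[-1]   # upper boundary of each class
cum_freq <- cumsum(h$counts)   # "less than" cumulative frequencies

# Less-than ogive: starts at frequency 0 at the lowest boundary
plot(c(h$breaks[1], upper_bounds), c(0, cum_freq), type = "b",
     main = "Less-than Ogive",
     xlab = "Upper Class Boundary", ylab = "Cumulative Frequency")
```

The last cumulative frequency should equal the number of observations, which is a quick check that no value fell outside the classes.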
2. Adding labels, titles, and legends to plots. Saving and exporting plots.
Commands:
# 1. Open the file device
png("my_beautiful_plot.png")
# 2. Create the plot with all customizations
hist(x,
     main = "Histogram of X",
     xlab = "Value of X",
     ylab = "Frequency",
     col = "lightblue",
     border = "darkblue"
)
# Add a legend (more common on scatter plots; shown here for illustration)
legend("topright", legend = "Sample Data", fill = "lightblue")
# 3. Close the device to save the file
dev.off()
3. Tabulation of raw data in R
Commands:
# One-way frequency table
my_vector <- c("A", "B", "A", "A", "C", "B")
table(my_vector)
# Two-way cross-tabulation
df <- data.frame(
  gender = c("M", "F", "M", "F", "F"),
  smokes = c("Yes", "No", "No", "Yes", "No")
)
table(df$gender, df$smokes)
4. To compute mean, median and mode for a grouped frequency data in R
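Note: R's built-in functions work on raw data. For grouped data, the calculations must be done manually or with an add-on package. Base R has no function for the statistical mode (the built-in mode() returns an object's storage type), so you must write one.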
# Function to find the mode
get_mode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Assume 'x' is your vector of raw data
mean(x)
median(x)
get_mode(x) # Most frequent value; meaningful only when values repeat
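Note: The commands above work on raw data. For a grouped frequency table, the usual textbook formulas can be applied directly. The following is a minimal sketch, assuming continuous classes of equal width; the class limits and frequencies are hypothetical.

```r
# Hypothetical grouped data: classes 0-10, 10-20, ..., 40-50
lower <- c(0, 10, 20, 30, 40)
upper <- c(10, 20, 30, 40, 50)
f     <- c(5, 8, 15, 16, 6)

mid <- (lower + upper) / 2   # class mid-points
h   <- upper - lower         # class widths
N   <- sum(f)                # total frequency

# Mean = sum(f * mid-point) / N
grouped_mean <- sum(f * mid) / N

# Median = L + ((N/2 - cf) / f_m) * h, using the first class
# whose cumulative frequency reaches N/2
cf <- cumsum(f)
m  <- which(cf >= N / 2)[1]
cf_before <- if (m == 1) 0 else cf[m - 1]
grouped_median <- lower[m] + (N / 2 - cf_before) / f[m] * h[m]

# Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h, using the class
# with the highest frequency (the modal class)
k  <- which.max(f)
f0 <- if (k == 1) 0 else f[k - 1]
f2 <- if (k == length(f)) 0 else f[k + 1]
grouped_mode <- lower[k] + (f[k] - f0) / (2 * f[k] - f0 - f2) * h[k]

cat("Mean:", grouped_mean, " Median:", grouped_median,
    " Mode:", grouped_mode, "\n")
```

For these frequencies the mean is 27 and the median is 28; the mode lies in the 30-40 class.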
5. To compute Geometric mean and Harmonic mean.
Note: These are not in base R and require simple custom functions.
# Sample data (must be positive)
x <- c(4, 5, 8, 10, 12)
# Geometric Mean
# G.M. = (x1 * x2 * ... * xn)^(1/n)
# We use logs to avoid overflow: exp(mean(log(x)))
gm <- exp(mean(log(x)))
print(gm)
# Harmonic Mean
# H.M. = n / (1/x1 + 1/x2 + ... + 1/xn)
hm <- length(x) / sum(1/x)
print(hm)
6. To compute mean, median, variance, covariance, standard deviation in R.
Commands:
x <- c(10, 12, 15, 18, 20)
y <- c(5, 8, 7, 12, 10)
mean(x)
median(x)
var(x) # Sample variance (divides by n-1)
sd(x) # Sample standard deviation
cov(x, y) # Covariance between x and y
7. Computation of partition values, skewness and kurtosis in R.
Commands:
# 1. Partition values (Quantiles)
x <- rnorm(100)
quantile(x) # Gives 0%, 25%, 50%, 75%, 100%
quantile(x, probs = c(0.1, 0.9)) # Gives 10th and 90th percentiles
# 2. Skewness and Kurtosis (requires 'e1071' package)
# install.packages("e1071") # Run this once
library(e1071)
skewness(x) # Close to 0 for symmetric data
kurtosis(x) # Excess kurtosis: close to 0 for normal data
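Note: If the e1071 package is not available, the moment-based coefficients can be computed in base R. The following is a minimal sketch using central moments with divisor n; the function names moment_skewness and moment_kurtosis are illustrative, and the results should correspond to type = 1 in e1071 rather than its default type.

```r
# Moment coefficient of skewness: m3 / m2^(3/2)
moment_skewness <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment (divisor n)
  m3 <- mean(d^3)   # third central moment
  m3 / m2^1.5
}

# Excess kurtosis: m4 / m2^2 - 3
moment_kurtosis <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m4 <- mean(d^4)   # fourth central moment
  m4 / m2^2 - 3
}

x <- c(2, 4, 4, 5, 7, 9, 12)   # illustrative data
moment_skewness(x)
moment_kurtosis(x)
```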
8. To compute correlation and lines of regression in R.
Commands:
# Sample paired data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
# Correlation (Pearson is default)
cor(x, y)
# Correlation (Spearman)
cor(x, y, method = "spearman")
# Fit the regression line (y = a + bx)
model <- lm(y ~ x)
# See the coefficients (a and b)
print(model)
# See the full analysis (R-squared, p-values, etc.)
summary(model)
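Note: "Lines of regression" normally refers to two lines: y on x and x on y. The commands above fit only y on x. The following is a minimal sketch that fits both, using the same sample data; the names fit_yx and fit_xy are illustrative.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

fit_yx <- lm(y ~ x)   # regression of y on x: y = a + b*x
fit_xy <- lm(x ~ y)   # regression of x on y: x = a2 + b2*y

b_yx <- unname(coef(fit_yx)["x"])
b_xy <- unname(coef(fit_xy)["y"])
a_xy <- unname(coef(fit_xy)["(Intercept)"])

# The product of the two slopes equals the squared correlation
r <- cor(x, y)
print(c(b_yx = b_yx, b_xy = b_xy, r_squared = r^2))

# Draw both lines on one scatter plot
plot(x, y, main = "Two Lines of Regression")
abline(fit_yx, col = "blue")
# x = a2 + b2*y rewritten as y = -a2/b2 + (1/b2)*x
abline(a = -a_xy / b_xy, b = 1 / b_xy, col = "red", lty = 2)
legend("topleft", legend = c("y on x", "x on y"),
       col = c("blue", "red"), lty = c(1, 2))
```

Both lines pass through the point of means (mean(x), mean(y)), and they coincide only when the correlation is exactly +1 or -1.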
9. Random number generation from different distributions in R.
Commands: (The 'r' prefix stands for 'random')
# 10 random numbers from a Normal distribution
# (mean=0, sd=1 is default)
rnorm(10, mean = 50, sd = 10)
# 10 random numbers from a Uniform distribution
# (min=0, max=1 is default)
runif(10, min = 1, max = 6) # Continuous values between 1 and 6 (not whole-number die faces)
# 10 random numbers from a Binomial distribution
# (n=trials, p=probability)
rbinom(10, size = 5, prob = 0.5) # Number of heads when each of 10 people flips 5 fair coins
# 10 random numbers from a Poisson distribution
# (lambda=average rate)
rpois(10, lambda = 3) # e.g., number of customers arriving in each of 10 one-minute intervals
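Note: Random draws change on every run. The following is a minimal sketch of two common needs: making results reproducible with set.seed(), and simulating whole-number die rolls with sample(). The seed value 42 is an arbitrary choice.

```r
# set.seed() fixes the generator state, so the same numbers come back
set.seed(42)
first_draw <- rnorm(5, mean = 50, sd = 10)

set.seed(42)
second_draw <- rnorm(5, mean = 50, sd = 10)

identical(first_draw, second_draw)   # TRUE: same seed, same numbers

# Discrete die rolls: whole numbers 1 to 6, drawn with replacement
die_rolls <- sample(1:6, size = 10, replace = TRUE)
print(die_rolls)
```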
10. Fitting of simple linear regression in R and its interpretation.
This uses lm() and summary() as in practical 8, with the same x and y vectors.
# 1. Fit the model
model <- lm(y ~ x)
# 2. Get the summary for interpretation
model_summary <- summary(model)
print(model_summary)
Interpretation:
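- Coefficients (Estimate): The estimated intercept (a) and slope (b) of the fitted line y = a + bx.
- Coefficients (Pr(>|t|)): The p-value for testing whether the coefficient is zero. If it is below 0.05, the variable is a statistically significant predictor at the 5% level.
- Multiple R-squared: The proportion of the variance in 'y' explained by 'x' (multiply by 100 for a percentage).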
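Note: The quantities listed in the interpretation can also be pulled out of the summary object as numbers, which helps when writing up results. The following is a minimal sketch, assuming the x and y vectors from practical 8; the variable names coef_table, slope_p and r_squared are illustrative.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

model <- lm(y ~ x)
model_summary <- summary(model)

# Coefficient table: columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(model_summary)

intercept <- coef_table["(Intercept)", "Estimate"]   # a
slope     <- coef_table["x", "Estimate"]             # b
slope_p   <- coef_table["x", "Pr(>|t|)"]             # p-value for the slope
r_squared <- model_summary$r.squared

cat("Fitted line: y =", intercept, "+", slope, "* x\n")
cat("p-value for slope:", slope_p, "\n")
cat("R-squared:", r_squared, "\n")
```

For this data the fitted line is y = 1.3 + 0.9x with R-squared 0.81, so about 81% of the variation in y is explained by x.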
11. Fitting of polynomials and exponential curves in R.
Commands:
# 1. Polynomial (e.g., y = a + b1*x + b2*x^2)
# Use I() to treat x^2 "as is"
poly_model <- lm(y ~ x + I(x^2))
summary(poly_model)
# 2. Exponential (y = a*e^(b*x))
# We linearize by taking the log of y: log(y) = log(a) + b*x
# Note: y must be positive
exp_model <- lm(log(y) ~ x)
summary(exp_model)
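Note: The coefficients of the log-linear fit are log(a) and b, not a and b. The following is a minimal sketch of converting them back to the original scale and drawing the fitted curve, assuming the x and y vectors from practical 8; the names a, b and fitted_y are illustrative.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)   # must be positive for log(y)

exp_model <- lm(log(y) ~ x)

# Back-transform: intercept is log(a), slope is b
a <- exp(unname(coef(exp_model)[1]))
b <- unname(coef(exp_model)[2])

# Fitted values on the original scale: y = a * e^(b*x)
fitted_y <- a * exp(b * x)

# Plot the data and overlay the fitted exponential curve
plot(x, y, main = "Exponential Curve Fit")
curve(a * exp(b * x), add = TRUE, col = "red")
```

Because the fit minimises squared errors in log(y), it is not identical to a direct least-squares fit of the exponential curve to y.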
12. Fitting of Binomial and Poisson distribution in R.
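This usually means estimating the parameters of the distribution from the data and then performing a Chi-Square Goodness-of-Fit Test to see whether the observed frequencies fit the theoretical distribution. The example below shows the test itself on a simpler case.
# Example: Do observed die rolls fit a discrete uniform (fair die) distribution?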
observed_counts <- c(18, 22, 19, 21, 20, 20) # Total 120 rolls
expected_probs <- c(1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
chisq.test(x = observed_counts, p = expected_probs)
# The p-value measures how compatible the data are with the hypothesised distribution.
# p-value > 0.05: No evidence against the fit (do not reject the null hypothesis)
# p-value < 0.05: Poor fit (reject the null hypothesis at the 5% level)
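Note: The die example tests a fully specified distribution. When fitting a Poisson distribution, lambda is estimated from the data, so one extra degree of freedom is lost. The following is a minimal sketch with hypothetical frequencies; the statistic is computed by hand because chisq.test() does not adjust the degrees of freedom for estimated parameters.

```r
# Hypothetical data: number of events per interval and observed frequency
k        <- 0:4
observed <- c(60, 72, 44, 18, 6)   # 200 intervals in total
N        <- sum(observed)

# Estimate lambda by the sample mean
lambda_hat <- sum(k * observed) / N

# Poisson probabilities; the last class is treated as "4 or more"
# so that the probabilities sum to 1
p <- dpois(k, lambda = lambda_hat)
p[length(p)] <- 1 - sum(p[-length(p)])
expected <- N * p

# Chi-square statistic and p-value
chi_sq <- sum((observed - expected)^2 / expected)
deg_f  <- length(k) - 1 - 1   # classes - 1 - one estimated parameter
p_value <- pchisq(chi_sq, df = deg_f, lower.tail = FALSE)

print(round(data.frame(k, observed, expected), 2))
cat("lambda:", lambda_hat, " chi-square:", chi_sq,
    " df:", deg_f, " p-value:", p_value, "\n")
```

A Binomial fit follows the same pattern: with n trials per unit, estimate p as the sample mean divided by n and replace dpois() with dbinom(k, size = n, prob = p_hat). Classes with very small expected frequencies are usually pooled before computing the statistic.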
13. Problems based on selecting random sample in R (with and without replacement).
Command: sample()
population <- 1:100 # A population of numbers 1 to 100
# 1. Sampling WITHOUT replacement
sample_without <- sample(population, size = 10, replace = FALSE)
print(sample_without)
# 2. Sampling WITH replacement
sample_with <- sample(population, size = 10, replace = TRUE)
print(sample_with)
14. Problems based on plotting normal probability plot in R (P-P plot and Q-Q plot).
The most common is the Q-Q (Quantile-Quantile) Plot, used to check whether data are approximately normally distributed.
Commands:
x <- rnorm(100) # Data that IS normal
# x <- runif(100) # Data that is NOT normal
# 1. Create the Q-Q plot
qqnorm(x)
# 2. Add the theoretical line
qqline(x, col = "red")
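Interpretation: If the points fall close to the straight red line, the data can be treated as approximately normally distributed. Systematic curvature or points far from the line at the ends suggest skewness or heavy tails.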
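Note: Base R has no dedicated P-P plot function, but one can be built from pnorm(). The following is a minimal sketch, assuming the normal mean and sd are estimated from the sample and using plotting positions (i - 0.5)/n, which is one common convention; the variable names are illustrative.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
x <- rnorm(100)

x_sorted <- sort(x)
n <- length(x)

# Empirical cumulative probabilities (plotting positions)
empirical_p <- (1:n - 0.5) / n

# Theoretical cumulative probabilities under a fitted normal
theoretical_p <- pnorm(x_sorted, mean = mean(x), sd = sd(x))

# P-P plot: points near the 45-degree line indicate a good fit
plot(theoretical_p, empirical_p,
     main = "Normal P-P Plot",
     xlab = "Theoretical Cumulative Probability",
     ylab = "Empirical Cumulative Probability")
abline(0, 1, col = "red")
```

Unlike the Q-Q plot, both axes run from 0 to 1, so the P-P plot is more sensitive to departures in the middle of the distribution than in the tails.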