SEC-151T: Statistical Data Analysis using R (Theory)
Unit 1: Introduction to R
Introduction to R and its features
- What is R? R is a powerful, open-source programming language and software environment used for statistical computing and graphics.
- Features:
- Free and Open Source: Anyone can use it for free.
- Cross-Platform: Works on Windows, Mac, and Linux.
- Comprehensive: Has a vast collection of tools for data manipulation, analysis, and visualization.
- Packages: Its main strength. R has thousands of user-contributed "packages" (like add-ons) for specialized tasks (e.g., genetics, finance, machine learning).
- Vectorized Operations: R is optimized to operate on entire vectors (ordered sequences of values of one type) at once, which makes code shorter and usually faster than an explicit loop.
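As a small illustration of vectorized operations, the sketch below converts a few temperatures without writing a loop. The name temps_c and its values are made up for this example.

```r
# Hypothetical temperatures in Celsius (illustrative values)
temps_c <- c(20, 25, 30)

# One expression converts every element; no loop is needed
temps_f <- temps_c * 9 / 5 + 32

print(temps_f)  # 68 77 86
```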
Installing R and R-Studio
- Install R: First, you must install the base R system from CRAN (Comprehensive R Archive Network). Go to `cran.r-project.org` and download the version for your operating system.
- Install R-Studio: R-Studio is an Integrated Development Environment (IDE) that provides a much more user-friendly interface for R. It includes a text editor, a console, plot viewers, and more. Download the free R-Studio Desktop from `rstudio.com` (the company is now called Posit).
Basic R syntax and data types
In R, everything you work with is an object. You create objects and give them names using the assignment operator <- (= also works, but <- is the usual convention).
x <- 10
y <- "hello"
Basic Data Types (Atomic Vectors)
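- numeric: Real numbers, stored as double-precision values (e.g., 10.5, 55, 78.3). A number typed without a decimal point, such as 55, is still numeric unless it carries the L suffix.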
- integer: Whole numbers (e.g., 1L, 25L, -2L). The 'L' tells R to store it as an integer.
- logical: Boolean values, `TRUE` or `FALSE` (or `T`/`F`).
- character: Text strings, enclosed in quotes (e.g., "hello", "data", "2024").
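To check which type R has assigned to a value, class() can be used. The short sketch below applies it to the kinds of values listed above; the variable name mixed is only for illustration.

```r
# class() reports the type R assigned to each value
class(10.5)     # "numeric"
class(55)       # "numeric" (no L suffix, so not an integer)
class(25L)      # "integer"
class(TRUE)     # "logical"
class("2024")   # "character" (the quotes make it text, not a number)

# Mixing types in c() coerces every element to one common type
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"
```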
Key Data Structures
- Vector: The most basic data structure. A 1D sequence of elements of the same type. Created with `c()` (combine).
  v <- c(10, 20, 30, 40, 50)
- Data Frame: The most important structure for statistics. A 2D table, like an Excel spreadsheet. Columns can be of different types, but all must have the same length.
  df <- data.frame(
    name = c("Alice", "Bob"),
    age = c(25, 30),
    is_student = c(TRUE, FALSE)
  )
- List: A very flexible container that can hold anything, including other lists or data frames.
- Matrix: A 2D array where all elements are of the same type.
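The four structures can be compared side by side. The following is a minimal sketch; the names m and info and all values are illustrative.

```r
# Data frame: columns of different types, all the same length
df <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30),
  is_student = c(TRUE, FALSE)
)
df$age               # one column as a vector: 25 30
df[df$age > 26, ]    # rows where age exceeds 26 (Bob's row)
nrow(df)             # 2

# Matrix: 2 rows x 3 columns of one type, filled column by column
m <- matrix(1:6, nrow = 2)
m[2, 3]              # 6
dim(m)               # 2 3

# List: elements may differ in type and length
info <- list(id = 1L, scores = c(90, 85), data = df)
info$scores          # 90 85
```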
Importing data into R from various file formats
- CSV (Comma Separated Values): my_data <- read.csv("file_path/my_file.csv")
- Text Files (e.g., .txt): my_data <- read.table("file_path/my_file.txt", header = TRUE)
- Excel Files: You must first install and load a package, like `readxl`.
  install.packages("readxl") # Do this once
  library(readxl) # Do this every session
  my_data <- read_excel("file_path/my_file.xlsx", sheet = "Sheet1")
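To try an import without having a file at hand, one can write a small CSV to a temporary location and read it back. This is only a sketch: the file is created by the example itself, and the column names are made up.

```r
# Write a tiny data frame to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(name = c("Alice", "Bob"), age = c(25, 30)),
          tmp, row.names = FALSE)

# Read it back and inspect the result
my_data <- read.csv(tmp)
str(my_data)    # each column's name and type
head(my_data)   # the first rows
```

Checking str() right after an import is a quick way to confirm that numbers were not read in as text.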
Problems faced in case of missing data
In R, missing values are represented by the special marker NA (Not Available).
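Many summary functions (such as mean(), sum(), and sd()) return NA if their input contains even one NA value, because the true result cannot be known.
x <- c(1, 2, 3, NA, 5) # Example vector with one missing value
mean(x) # This will return NA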
Data cleaning techniques
- Identifying NAs: Use the `is.na()` function to find them.
  is.na(x) # Returns: FALSE FALSE FALSE TRUE FALSE
- Handling NAs in functions: Many functions have an argument `na.rm = TRUE` (NA remove) to tell them to ignore missing values.
  mean(x, na.rm = TRUE) # Returns 2.75
- Removing NA rows: The `na.omit()` function returns a copy of a data frame without any row that contains at least one NA value.
  clean_data <- na.omit(my_data_frame)
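Before removing rows it helps to see how many values are missing and where. The sketch below uses a made-up data frame named my_data_frame.

```r
# Hypothetical data frame with two missing values
my_data_frame <- data.frame(
  age = c(25, NA, 40),
  salary = c(50000, 62000, NA)
)

colSums(is.na(my_data_frame))    # NAs per column: age 1, salary 1
complete.cases(my_data_frame)    # TRUE FALSE FALSE

clean_data <- na.omit(my_data_frame)
nrow(clean_data)                 # 1 row survives
```

Here two of three rows are dropped, which shows why na.omit() should be used with care on small datasets.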
Unit 2: Data Visualization in R
Introduction to data visualization in R
R is famous for its high-quality, customizable graphics. Visualization is crucial for exploring data, identifying patterns, checking assumptions, and communicating results.
Basic plotting functions in R
R's "base graphics" system provides powerful and quick functions for creating plots.
Creating and customizing various types of plots
- Histograms: For visualizing the distribution of a single continuous variable.
  hist(my_data$age)
- Bar charts: For visualizing the frequency of a single categorical variable.
  gender_counts <- table(my_data$gender)
  barplot(gender_counts)
- Box plots (Box-and-Whisker): For visualizing the "five-number summary" (min, Q1, median, Q3, max) and identifying outliers.
  boxplot(my_data$salary)
- Pie charts: For visualizing proportions. (Note: Pie charts are generally discouraged by statisticians as bar charts are easier to read).
  pie(gender_counts)
- Frequency polygons and curves: A histogram shows bars; a frequency polygon shows the same information as a line. A frequency curve is a smoothed version (a density plot).
  # For a density curve (frequency curve)
  plot(density(my_data$age))
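To run these plotting calls, a data frame with the right columns is needed. The sketch below builds a small stand-in for my_data; every value in it is invented for illustration.

```r
# Hypothetical data standing in for my_data
my_data <- data.frame(
  age = c(23, 35, 31, 42, 28, 39, 51, 26),
  gender = c("F", "M", "F", "F", "M", "M", "F", "M"),
  salary = c(30, 45, 41, 58, 36, 52, 90, 33)
)

hist(my_data$age)                        # distribution of age
gender_counts <- table(my_data$gender)   # F: 4, M: 4
barplot(gender_counts)                   # one bar per category
boxplot(my_data$salary)                  # five-number summary
plot(density(my_data$age))               # smoothed frequency curve
```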
Adding labels, titles, and legends to plots
You can add these as arguments directly inside the plot() command:
plot(x, y, # x and y stand for any two numeric vectors
  main = "My Plot Title", # Main title
  xlab = "X-Axis Label", # X-axis label
  ylab = "Y-Axis Label", # Y-axis label
  col = "blue", # Color of points
  pch = 19 # Point character (19 is a solid circle)
)
You can also use the legend() function to add a legend after the plot is created.
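A legend is most useful when a plot shows more than one group. The following sketch draws two made-up series and labels them; the group names, colors, and values are assumptions for illustration.

```r
# Hypothetical values for two groups
x <- 1:5
y_a <- c(2, 4, 6, 8, 10)
y_b <- c(1, 3, 4, 7, 9)

plot(x, y_a,
  main = "Two Groups", xlab = "X-Axis Label", ylab = "Y-Axis Label",
  col = "blue", pch = 19
)
points(x, y_b, col = "red", pch = 17)   # add the second group

# legend() is called after the plot exists
legend("topleft",
  legend = c("Group A", "Group B"),
  col = c("blue", "red"), pch = c(19, 17)
)
```

The col and pch vectors passed to legend() should repeat the values used when drawing, since legend() does not read them from the plot.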
Saving and exporting plots
You can save a plot to a file (like PNG, PDF, or JPEG) in three steps:
- Open a graphics device: e.g., `png("my_plot.png")`
- Create your plot: e.g., `hist(my_data$age)`
- Close the device: `dev.off()` (This saves the file).
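The steps can be sketched end to end as follows. This example uses pdf() because it is available on every R installation; png() and jpeg() follow the same open, draw, close pattern. The output path and the plotted values are illustrative.

```r
# Hypothetical output path inside R's temporary directory
out_file <- file.path(tempdir(), "my_plot.pdf")

pdf(out_file, width = 7, height = 5)       # 1. open the device (size in inches)
hist(c(23, 35, 31, 42, 28, 39, 51, 26))    # 2. draw the plot (made-up ages)
dev.off()                                  # 3. close the device; the file is written

file.exists(out_file)                      # confirms the file was created
```

If dev.off() is forgotten, the file stays open and later plots keep going into it instead of appearing on screen.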
Unit 3: Descriptive Statistics in R (Part 1)
Measures of central tendency
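- Mean: mean(my_data$salary, na.rm = TRUE)
- Median: median(my_data$salary, na.rm = TRUE)
- Mode: R's built-in `mode()` function returns an object's storage type, not the statistical mode. You must find the most frequent value using `table()`.
  counts <- table(my_data$category)
  mode_value <- names(counts)[which.max(counts)] # On ties, returns the first of the most frequent values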
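The table() approach can be wrapped in a small helper that also reports ties. The function name stat_mode is hypothetical; it is not part of base R.

```r
# Hypothetical helper; stat_mode is not a base R function
stat_mode <- function(x) {
  counts <- table(x)                     # frequency of each distinct value
  names(counts)[counts == max(counts)]   # every value tied for the top count
}

stat_mode(c("a", "b", "b", "c"))   # "b"
stat_mode(c(1, 1, 2, 2, 3))        # "1" "2" (returned as text)
```

Because the result comes from names(), numeric modes come back as character strings and may need as.numeric().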
Weighted mean
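Used when some values are more important than others. Here values stands for a numeric vector with one entry per weight.
weights <- c(0.5, 0.3, 0.2) # One weight per value; weighted.mean() rescales them, so they need not sum to 1
weighted.mean(values, weights)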
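A worked example makes the formula concrete. The scores below are invented; the weights are the ones used above.

```r
# Hypothetical exam scores and their weights
values <- c(80, 90, 70)
weights <- c(0.5, 0.3, 0.2)

weighted.mean(values, weights)          # 81
sum(values * weights) / sum(weights)    # the same formula written out

# Weights that do not sum to 1 give the same answer
weighted.mean(values, c(5, 3, 2))       # 81
```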
Geometric mean
Used for averaging growth factors (1 + growth rate). Base R has no built-in function, but for strictly positive x it can be computed from logarithms:
gm <- exp(mean(log(x)))
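The sketch below applies this formula to three made-up yearly growth rates and checks that the result compounds back to the actual total growth. The names rates, factors, and avg_rate are illustrative.

```r
# Hypothetical yearly growth rates: +10%, +20%, -5%
rates <- c(0.10, 0.20, -0.05)
factors <- 1 + rates                 # growth factors: 1.10 1.20 0.95

gm <- exp(mean(log(factors)))        # geometric mean of the factors
avg_rate <- gm - 1                   # average growth rate per year

# Compounding avg_rate for 3 years matches the actual total growth
all.equal(prod(factors), (1 + avg_rate)^3)   # TRUE
```

The arithmetic mean of the factors would overstate the average growth, which is why the geometric mean is preferred here.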
Measures of dispersion
- Range: range(my_data$salary) # Returns (min, max)
- Variance: var(my_data$salary, na.rm = TRUE)
- Standard deviation: sd(my_data$salary, na.rm = TRUE)
Coefficient of variation (C.V.)
A relative measure of dispersion, useful for comparing variability of datasets with different means. C.V. = (SD / Mean) * 100.
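Base R has no dedicated C.V. function, so the formula can be written as a one-line helper. The name cv and the two datasets are assumptions made for this sketch.

```r
# Hypothetical helper implementing C.V. = (SD / Mean) * 100
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100

heights_cm <- c(150, 160, 170, 180, 190)   # illustrative values
weights_kg <- c(50, 60, 70, 80, 90)

cv(heights_cm)   # about 9.3
cv(weights_kg)   # about 22.6
```

Both vectors have the same standard deviation, yet the second is more variable relative to its mean, which is exactly what the C.V. is meant to show.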
Percentiles and quartiles
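The quantile() function returns any percentile you ask for. Without a probs argument it gives the minimum, the three quartiles, and the maximum.
# Get the default quantiles (0%, 25%, 50%, 75%, 100%)
quantile(my_data$salary)
# Get the 90th percentile
quantile(my_data$salary, probs = 0.90)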
Interquartile range (IQR)
The distance between the 75th (Q3) and 25th (Q1) percentiles. It is a robust measure of spread because extreme values do not affect it.
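The IQR can be computed from quantile() or with the built-in IQR() function. The sketch below uses invented salaries and also shows the robustness to an extreme value.

```r
# Hypothetical salaries, in thousands
salary <- c(30, 33, 36, 41, 45, 52, 58, 90)

q <- quantile(salary, probs = c(0.25, 0.75))
unname(q[2] - q[1])    # Q3 - Q1
IQR(salary)            # built-in shortcut, same result

# Replacing the largest value changes the range a lot but not the IQR
salary2 <- replace(salary, 8, 900)
IQR(salary2) == IQR(salary)                   # TRUE
diff(range(salary2)) > diff(range(salary))    # TRUE
```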
Unit 4: Descriptive Statistics in R (Part 2)
Measures of skewness and kurtosis
These are not in base R. One option is the e1071 package, which must be installed once with install.packages("e1071").
library(e1071)
# Skewness:
skewness(my_data$salary)
# Kurtosis:
kurtosis(my_data$salary)
- Skewness Interpretation:
  - ~ 0: Roughly symmetric.
  - > 0: Positively (right) skewed.
  - < 0: Negatively (left) skewed.
- Kurtosis Interpretation: (Note: the `kurtosis()` function in e1071 reports "excess" kurtosis, where 0 corresponds to a normal distribution).
  - ~ 0: Mesokurtic (normal peak/tails).
  - > 0: Leptokurtic (high peak, fat tails).
  - < 0: Platykurtic (flat peak, thin tails).
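To see what these numbers measure, the plain moment-based definitions can be written in base R. The two helper functions below are hypothetical and are not the package's implementation; package functions may apply small-sample adjustments, so their results can differ slightly.

```r
# Hypothetical helpers using the plain moment definitions
moment_skewness <- function(x) {
  d <- x - mean(x)               # deviations from the mean
  mean(d^3) / mean(d^2)^1.5      # third moment over (second moment)^(3/2)
}
moment_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3    # subtract 3 so a normal distribution gives 0
}

symmetric <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 2, 2, 3, 3, 3, 20)   # one large value on the right

moment_skewness(symmetric)           # 0: symmetric
moment_skewness(right_tail) > 0      # TRUE: right-skewed
moment_excess_kurtosis(symmetric)    # -1.3: platykurtic
```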
Introduction to bivariate analysis
Analyzing two variables at the same time.
- Cross-tabulation: Creates a frequency table for two categorical variables.
  table(my_data$gender, my_data$department)
- Scatter plots: The best way to visualize the relationship between two continuous variables.
  plot(x = my_data$advertising_spend, y = my_data$sales)
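A cross-tabulation is easier to read with a concrete case. The sketch below builds a five-row stand-in for my_data with invented values and also converts the counts to row proportions with prop.table().

```r
# Hypothetical data standing in for my_data
my_data <- data.frame(
  gender = c("F", "M", "F", "M", "F"),
  department = c("HR", "IT", "IT", "IT", "HR")
)

tab <- table(my_data$gender, my_data$department)
tab                            # counts: F/HR 2, F/IT 1, M/HR 0, M/IT 2
prop.table(tab, margin = 1)    # share of each department within a gender
```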
Pearson's and Spearman's correlation coefficient
The cor() function calculates the correlation.
- Pearson's (r): Measures the strength of the linear relationship. (Default)
- Spearman's (rho): Measures the strength of the monotonic relationship (non-parametric, based on ranks).
# Pearson correlation (default)
cor(my_data$x, my_data$y, method = "pearson")
# Spearman rank correlation
cor(my_data$x, my_data$y, method = "spearman")
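The difference between the two coefficients shows up when a relationship is monotonic but not linear. The following sketch uses made-up data in which y always rises with x, but along a curve.

```r
# Illustrative data: y increases with x, but not in a straight line
x <- 1:10
y <- x^3

cor(x, y, method = "spearman")   # 1: the ranks agree perfectly
cor(x, y, method = "pearson")    # below 1: the relationship is not linear
```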
Unit 5: Regression in R
Simple linear regression
A statistical method to model the linear relationship between one independent variable (x) and one dependent variable (y). The model is: Y = β₀ + β₁X + ε.
Fitting of regression models in R
We use the lm() (linear model) function. The formula y ~ x is read as "y is modeled by x".
model <- lm(sales ~ advertising_spend, data = my_data)
Least Squares method
The lm() function automatically uses the Method of Ordinary Least Squares (OLS). This is the method that finds the line (i.e., the values for β₀ and β₁) that minimizes the sum of the squared residuals (the vertical distances from each point to the line).
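For a single predictor, the least squares estimates have a closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) minus the slope times mean(x). The sketch below checks this against lm() on invented data; the names b0 and b1 are illustrative.

```r
# Hypothetical data
advertising_spend <- c(100, 200, 300, 400, 500)
sales <- c(25, 41, 62, 79, 103)

# Closed-form OLS estimates for one predictor
b1 <- cov(advertising_spend, sales) / var(advertising_spend)   # slope: 0.194
b0 <- mean(sales) - b1 * mean(advertising_spend)               # intercept: 3.8

model <- lm(sales ~ advertising_spend)
coef(model)    # matches c(b0, b1)
```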
Evaluating and interpreting regression results
The summary() function is the most important tool for evaluating a model.
summary(model)
This will output:
- Coefficients:
  - `(Intercept)` (β₀): The predicted value of y when x is 0.
  - `advertising_spend` (β₁): The slope. For every 1-unit increase in x, y is predicted to change by this amount.
- Std. Error, t value, Pr(>|t|): These are used for hypothesis testing. The `Pr(>|t|)` (or p-value) tests the null hypothesis that the coefficient is zero, so a small value indicates that the variable is statistically significant. A p-value < 0.05 is usually considered significant.
- Multiple R-squared: The "Coefficient of Determination" (R²). This tells you what proportion of the variation in y is explained by x. A value of 0.75 means 75% of the variation in sales is explained by advertising spend.
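The pieces of the summary can also be extracted by name instead of being read off the printed output. The sketch below fits a model to invented data and pulls out the coefficient table, the slope's p-value, and R-squared.

```r
# Hypothetical data
advertising_spend <- c(100, 200, 300, 400, 500)
sales <- c(25, 41, 62, 79, 103)
model <- lm(sales ~ advertising_spend)

s <- summary(model)
s$coefficients    # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared       # Multiple R-squared

# p-value for the slope
s$coefficients["advertising_spend", "Pr(>|t|)"]

# With one predictor, R-squared equals the squared Pearson correlation
all.equal(s$r.squared, cor(advertising_spend, sales)^2)   # TRUE
```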
Prediction using fitted regression model
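You can use the predict() function to make predictions for new data. The new data frame must use the same column name as the predictor in the model.
# 1. Create a data frame of new x-values
new_data <- data.frame(advertising_spend = c(1000, 1500, 2000))
# 2. Use predict() to find the corresponding y-values
predicted_sales <- predict(model, newdata = new_data)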
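Putting the pieces together, the sketch below fits a model on invented data and confirms that predict() simply evaluates the fitted line at the new x-values. The training values are assumptions for illustration.

```r
# Hypothetical training data
my_data <- data.frame(
  advertising_spend = c(100, 200, 300, 400, 500),
  sales = c(25, 41, 62, 79, 103)
)
model <- lm(sales ~ advertising_spend, data = my_data)

new_data <- data.frame(advertising_spend = c(1000, 1500, 2000))
predicted_sales <- predict(model, newdata = new_data)

# The same numbers from the fitted equation b0 + b1 * x
manual <- coef(model)[1] + coef(model)[2] * new_data$advertising_spend
all.equal(unname(predicted_sales), unname(manual))   # TRUE
```

In this sketch the new values lie far outside the observed range of 100 to 500, so the predictions are extrapolations and should be treated with caution.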