SEC-151T: Statistical Data Analysis using R (Theory)
Unit 1: Introduction to R
Introduction to R and its features
- What is R? R is a powerful, open-source programming language and software environment used for statistical computing and graphics.
- Features:
- Free and Open Source: Anyone can use it for free.
- Cross-Platform: Works on Windows, Mac, and Linux.
- Comprehensive: Has a vast collection of tools for data manipulation, analysis, and visualization.
- Packages: Its main strength. R has thousands of user-contributed "packages" (like add-ons) for specialized tasks (e.g., genetics, finance, machine learning).
- Vectorized Operations: R is designed to operate on entire vectors (ordered sequences of values) at once, which often makes code shorter and faster than an explicit loop.
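To make the idea of vectorized operations concrete, here is a small sketch. The vector names (prices, quantities) and their values are made up for illustration.

```r
# Hypothetical example vectors (not from any real dataset)
prices     <- c(10, 20, 30)
quantities <- c(2, 1, 4)

# Element-wise multiplication: no explicit loop is needed
totals <- prices * quantities
totals            # 20 20 120

# A single number is recycled across the whole vector
discounted <- prices * 0.9

sum(totals)       # 160
```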
Installing R and R-Studio
- Install R: First, you must install the base R system from CRAN (Comprehensive R Archive Network). Go to `cran.r-project.org` and download the version for your operating system.
- Install RStudio: RStudio is an Integrated Development Environment (IDE) that provides a much more user-friendly interface for R. It includes a text editor, a console, plot viewers, and more. Download the free RStudio Desktop from `rstudio.com` (the company is now called Posit).
Basic R syntax and data types
In R, everything you work with is stored as an "object". You create objects and give them names using the assignment operator <- (= also works, but <- is the usual convention).
# 'x' is the object name, '<-' is the assignment operator, 10 is the value
x <- 10
y <- "hello"
Basic Data Types (Atomic Vectors)
- numeric: Real numbers, stored with decimal precision by default even when the value is whole (e.g., 10.5, 55, 78.3).
- integer: Whole numbers (e.g., 1L, 25L, -2L). The 'L' tells R to store it as an integer.
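- logical: Boolean values, TRUE or FALSE. (The shorthands T and F also work, but they are ordinary variables that can be overwritten, so the full names are safer.)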
- character: Text strings, enclosed in quotes (e.g., "hello", "data", "2024").
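The types above can be checked directly in the console. A minimal sketch using class(), is.integer(), and as.numeric(), all of which are base R functions:

```r
# Inspect the type of a few example values
class(10.5)      # "numeric"
class(25L)       # "integer"
class(TRUE)      # "logical"
class("2024")    # "character"

# A number written without L is stored as a double, even if it is whole
is.integer(55)   # FALSE
is.integer(55L)  # TRUE

# Text that looks like a number must be converted before doing arithmetic
as.numeric("2024") + 1   # 2025
```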
Key Data Structures
- Vector: The most basic data structure. A one-dimensional sequence of elements of the same type. Created with c() (combine).
v <- c(10, 20, 30, 40, 50)
- Data Frame: The most important structure for statistics. A 2D table, like an Excel spreadsheet. Columns can be of different types, but all must have the same length.
df <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30),
  is_student = c(TRUE, FALSE)
)
- List: A very flexible container that can hold anything, including other lists or data frames.
- Matrix: A 2D array where all elements are of the same type.
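Lists and matrices can be sketched in the same way as the vector and data frame above. The object names (info, m) and their contents are made up for illustration.

```r
# A list can mix types and lengths
info <- list(name = "Alice", scores = c(90, 85, 77), passed = TRUE)
info$scores        # access an element by name
info[["name"]]     # equivalent double-bracket access

# A matrix holds a single type, arranged in rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
dim(m)             # 2 3
m[2, 3]            # element in row 2, column 3, which is 6
```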
Importing data into R from various file formats
- CSV (Comma Separated Values):
my_data <- read.csv("file_path/my_file.csv")
- Text Files (e.g., .txt):
my_data <- read.table("file_path/my_file.txt", header = TRUE)
- Excel Files: You must first install and load a package, like readxl.
install.packages("readxl") # Do this once
library(readxl) # Do this every session
my_data <- read_excel("file_path/my_file.xlsx", sheet = "Sheet1")
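The paths above are placeholders. To try an import without any existing file, the sketch below writes a small made-up table to a temporary CSV file and reads it back; the names tmp and example are illustrative only.

```r
# Write a small example table to a temporary CSV file, then read it back
tmp <- tempfile(fileext = ".csv")
example <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
write.csv(example, tmp, row.names = FALSE)

my_data <- read.csv(tmp)

# Always inspect the result of an import
str(my_data)     # column names and types
head(my_data)    # first rows
nrow(my_data)    # 2
```

Checking str() right after an import is a quick way to spot columns that were read with an unexpected type.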
Problems faced in case of missing data
In R, missing values are represented by the special marker NA (Not Available).
Many summary functions, such as mean(), sum(), and sd(), return NA by default if their input contains even one NA; other functions may stop with an error.
x <- c(1, 2, 3, NA, 5)
mean(x) # This will return NA
Data cleaning techniques
- Identifying NAs: Use the is.na() function to find them.
is.na(x) # Returns: FALSE FALSE FALSE TRUE FALSE
- Handling NAs in functions: Many functions have an argument na.rm = TRUE (NA remove) to tell them to ignore missing values.
mean(x, na.rm = TRUE) # Returns 2.75
- Removing NA rows: The na.omit() function will delete any row from a data frame that contains at least one NA value.
clean_data <- na.omit(my_data_frame)
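Before removing rows it helps to see how many values are missing and where. A minimal sketch on a made-up data frame (the column names and values are assumptions for illustration):

```r
# Hypothetical data frame with missing values
my_data_frame <- data.frame(
  age    = c(25, NA, 40, 31),
  salary = c(50000, 62000, NA, 58000)
)

colSums(is.na(my_data_frame))     # number of NAs per column: age 1, salary 1
complete.cases(my_data_frame)     # TRUE FALSE FALSE TRUE

clean_data <- na.omit(my_data_frame)
nrow(clean_data)                  # 2 rows remain
```

Here half of the rows are lost even though each column has only one NA, so it is worth comparing nrow() before and after na.omit().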
Unit 2: Data Visualization in R
Introduction to data visualization in R
R is famous for its high-quality, customizable graphics. Visualization is crucial for exploring data, identifying patterns, checking assumptions, and communicating results.
Basic plotting functions in R
R's "base graphics" system provides powerful and quick functions for creating plots.
Creating and customizing various types of plots
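A few commonly used base-graphics functions, sketched on made-up data. The vectors ages, dept, and salary are assumptions for illustration; hist(), boxplot(), barplot(), and plot() are base R functions.

```r
# Made-up data
ages   <- c(23, 25, 31, 35, 35, 41, 47, 52)
dept   <- c("HR", "IT", "IT", "Sales", "IT", "HR", "Sales", "IT")
salary <- c(30, 32, 41, 45, 44, 52, 58, 61)

# Histogram: distribution of one numeric variable
hist(ages, main = "Histogram of Age", xlab = "Age", col = "lightblue")

# Box plot: a numeric variable split by a grouping variable
boxplot(salary ~ dept, main = "Salary by Department", ylab = "Salary")

# Bar plot: counts of a categorical variable
barplot(table(dept), main = "Employees per Department", col = "grey")

# Scatter plot with points joined by lines (type = "b")
plot(ages, salary, type = "b", main = "Salary vs Age")
```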
Adding labels, titles, and legends to plots
You can add these as arguments directly inside the plot() command:
plot(my_data$x, my_data$y,
     main = "My Plot Title", # Main title
     xlab = "X-Axis Label",  # X-axis label
     ylab = "Y-Axis Label",  # Y-axis label
     col = "blue",           # Color of points
     pch = 19                # Point character (19 is a solid circle)
)
You can also use the legend() function to add a legend after the plot is created.
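A minimal sketch of legend() on made-up data with two groups. The variable names and values are assumptions; the point to notice is that the legend is not linked to the plot automatically.

```r
# Made-up data for two groups
x  <- 1:10
y1 <- x * 2
y2 <- x * 3

plot(x, y1, col = "blue", pch = 19, ylim = range(c(y1, y2)),
     main = "Two Groups", xlab = "X", ylab = "Y")
points(x, y2, col = "red", pch = 17)   # add the second group to the same plot

# The legend entries must be matched by hand to the colours and symbols used
legend("topleft", legend = c("Group 1", "Group 2"),
       col = c("blue", "red"), pch = c(19, 17))
```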
Saving and exporting plots
You can save a plot to a file (like PNG, PDF, or JPEG) in three steps:
- Open a graphics device: e.g., png("my_plot.png")
- Create your plot: e.g., hist(my_data$age)
- Close the device: dev.off() (This writes the file to disk.)
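The same open, draw, close sequence as a runnable sketch. It uses pdf() and a temporary file so that it works without touching the working directory; png() and jpeg() follow the same pattern. The vector ages is made up.

```r
# Save a histogram to a temporary PDF file
out_file <- tempfile(fileext = ".pdf")
ages <- c(23, 25, 31, 35, 35, 41, 47, 52)   # made-up values

pdf(out_file, width = 7, height = 5)                  # 1. open the device (size in inches)
hist(ages, main = "Age Distribution", xlab = "Age")   # 2. draw the plot
dev.off()                                             # 3. close the device and write the file

file.exists(out_file)   # TRUE once the device has been closed
```

If dev.off() is forgotten, the file stays incomplete and later plots keep going to the same device.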
Unit 3: Descriptive Statistics in R (Part 1)
Measures of central tendency
- Mean:
mean(my_data$salary, na.rm = TRUE)
- Median:
median(my_data$salary, na.rm = TRUE)
- Mode: R's built-in mode() function returns an object's storage mode, not the statistical mode. To find the most frequent value, count the values with table():
counts <- table(my_data$category)
mode_value <- names(counts)[which.max(counts)]
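Note that which.max() returns only the first maximum, so ties are silently ignored. A sketch of a small helper that returns every tied value; stat_mode is a hypothetical name, not a base R function.

```r
# Hypothetical helper: returns every value tied for the highest frequency
stat_mode <- function(x) {
  counts <- table(x)                      # table() drops NAs by default
  names(counts)[counts == max(counts)]    # result is a character vector
}

stat_mode(c("HR", "IT", "IT", "Sales"))   # "IT"
stat_mode(c("a", "a", "b", "b", "c"))     # "a" "b"
```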
Weighted mean
Used when some values are more important than others.
values <- c(10, 20, 30)
weights <- c(0.5, 0.3, 0.2) # Need not sum to 1; weighted.mean() divides by sum(weights)
weighted.mean(values, weights)
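The weighted mean is sum(values * weights) / sum(weights). A short check of that formula against weighted.mean(), using weights with the same proportions as above but not summing to 1:

```r
values  <- c(10, 20, 30)
weights <- c(5, 3, 2)     # same proportions as 0.5, 0.3, 0.2

manual  <- sum(values * weights) / sum(weights)
builtin <- weighted.mean(values, weights)

manual                        # 17
all.equal(manual, builtin)    # TRUE
```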
Geometric mean
Used for averaging growth rates. R has no built-in function.
# Assumes x contains positive numbers
gm <- exp(mean(log(x)))
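For growth rates, the geometric mean is applied to the growth factors (1 + rate), not to the rates themselves. A sketch with made-up yearly rates:

```r
# Made-up yearly growth rates: +10%, +20%, -5%
growth_rates   <- c(0.10, 0.20, -0.05)
growth_factors <- 1 + growth_rates        # 1.10, 1.20, 0.95

gm <- exp(mean(log(growth_factors)))      # geometric mean of the factors
avg_growth <- gm - 1                      # average growth rate per year

# Check: compounding at the average rate reproduces the total growth
all.equal(gm^3, prod(growth_factors))     # TRUE
```

The arithmetic mean of the rates would overstate the average growth, which is why the geometric mean is preferred here.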
Measures of dispersion
- Range:
range(my_data$salary) # Returns c(min, max)
- Variance:
var(my_data$salary, na.rm = TRUE)
- Standard deviation:
sd(my_data$salary, na.rm = TRUE)
Coefficient of variation (C.V.)
A relative measure of dispersion, useful for comparing variability of datasets with different means. C.V. = (SD / Mean) * 100.
cv <- (sd(my_data$salary) / mean(my_data$salary)) * 100
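A worked example of these measures on a small made-up vector, which also shows that var() and sd() use the sample formulas (dividing by n - 1):

```r
salary <- c(40, 50, 60, 70, 80)   # made-up salaries, in thousands

diff(range(salary))               # 40: the range as a single number (max - min)

# Sample variance computed by hand
n <- length(salary)
manual_var <- sum((salary - mean(salary))^2) / (n - 1)
manual_var                                # 250
all.equal(manual_var, var(salary))        # TRUE
all.equal(sqrt(manual_var), sd(salary))   # TRUE

cv <- sd(salary) / mean(salary) * 100     # about 26.35 (percent)
```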
Percentiles and quartiles
The quantile() function is very powerful.
# Get quartiles (0%, 25%, 50%, 75%, 100%)
quantile(my_data$salary)
# Get the 90th percentile
quantile(my_data$salary, probs = 0.90)
Interquartile range (IQR)
The distance between the 75th (Q3) and 25th (Q1) percentile. It's a robust measure of spread.
IQR(my_data$salary)
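A sketch of what "robust" means here, on made-up values: replacing the largest observation with an extreme one leaves the IQR unchanged but inflates the standard deviation. The results shown assume quantile()'s default method.

```r
x <- c(2, 4, 6, 8, 10, 12, 14, 16, 18)     # made-up values
q <- quantile(x, probs = c(0.25, 0.75))    # Q1 = 6, Q3 = 14
unname(q[2] - q[1])                        # 8, the same as IQR(x)

# Replace the largest value with an extreme one
x_out <- c(2, 4, 6, 8, 10, 12, 14, 16, 180)
IQR(x_out)            # still 8
sd(x_out) > sd(x)     # TRUE: the standard deviation is strongly affected
```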
Unit 4: Descriptive Statistics in R (Part 2)
Measures of skewness and kurtosis
These are not in base R. A commonly used option is the e1071 package.
install.packages("e1071")
library(e1071)
# Skewness:
skewness(my_data$salary)
# Kurtosis:
kurtosis(my_data$salary)
- Skewness Interpretation:
  - Close to 0: Roughly symmetric.
  - Greater than 0: Positively (right) skewed.
  - Less than 0: Negatively (left) skewed.
- Kurtosis Interpretation (note: the kurtosis() function in e1071 reports "excess" kurtosis, for which a normal distribution scores 0):
  - Close to 0: Mesokurtic (normal peak/tails).
  - Greater than 0: Leptokurtic (high peak, fat tails).
  - Less than 0: Platykurtic (flat peak, thin tails).
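To see what these numbers measure, the sketch below computes skewness and excess kurtosis from central moments in base R. This is one of several conventions in use; package functions may apply small-sample adjustments, so their output can differ slightly. The data are made up.

```r
# Moment-based formulas (one of several conventions)
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 12)   # made-up, with a long right tail

m2 <- mean((x - mean(x))^2)   # second central moment
m3 <- mean((x - mean(x))^3)   # third central moment
m4 <- mean((x - mean(x))^4)   # fourth central moment

skew_manual <- m3 / m2^(3/2)  # positive here: right-skewed
kurt_excess <- m4 / m2^2 - 3  # excess kurtosis (0 for a normal distribution)

skew_manual > 0   # TRUE
kurt_excess > 0   # TRUE: the single large value produces a heavy tail
```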
Introduction to bivariate analysis
Analyzing two variables at the same time.
- Cross-tabulation: Creates a frequency table for two categorical variables.
table(my_data$gender, my_data$department)
- Scatter plots: The best way to visualize the relationship between two continuous variables.
plot(x = my_data$advertising_spend, y = my_data$sales)
Pearson's and Spearman's correlation coefficient
The cor() function calculates the correlation.
- Pearson's (r): Measures the strength of the linear relationship. (Default)
- Spearman's (rho): Measures the strength of the monotonic relationship (non-parametric, based on ranks).
# Pearson correlation (default)
cor(my_data$x, my_data$y, method = "pearson")
# Spearman rank correlation
cor(my_data$x, my_data$y, method = "spearman")
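The difference between the two coefficients shows up on data that are monotonic but not linear. A sketch on made-up values, which also checks that Spearman's rho is Pearson's r computed on the ranks:

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- x^3                         # monotonic but not linear

cor(x, y, method = "pearson")    # less than 1
cor(x, y, method = "spearman")   # 1: the ranks agree perfectly

# Spearman's rho equals Pearson's r computed on the ranks
all.equal(cor(rank(x), rank(y)), cor(x, y, method = "spearman"))   # TRUE
```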
Unit 5: Regression in R
Simple linear regression
A statistical method to model the linear relationship between one independent variable (x) and one dependent variable (y). The model is: Y = β₀ + β₁X + ε.
Fitting of regression models in R
We use the lm() (linear model) function. The formula y ~ x is read as "y is modeled by x".
# Fit the model
model <- lm(sales ~ advertising_spend, data = my_data)
Least Squares method
The lm() function automatically uses the Method of Ordinary Least Squares (OLS). This is the method that finds the line (i.e., the values for β₀ and β₁) that minimizes the sum of the squared residuals (the vertical distances from each point to the line).
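For one predictor, the least-squares estimates have a closed form: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). A sketch on made-up data, compared with lm():

```r
# Made-up data
advertising_spend <- c(100, 200, 300, 400, 500)
sales             <- c(25, 41, 62, 79, 103)

# Closed-form least-squares estimates for one predictor
b1 <- cov(advertising_spend, sales) / var(advertising_spend)   # slope: 0.194
b0 <- mean(sales) - b1 * mean(advertising_spend)               # intercept: 3.8

# lm() gives the same values
fit <- lm(sales ~ advertising_spend)
all.equal(unname(coef(fit)), c(b0, b1))   # TRUE
```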
Evaluating and interpreting regression results
The summary() function is the most important tool for evaluating a model.
summary(model)
This will output:
- Coefficients:
  - (Intercept) (β₀): The predicted value of y when x is 0.
  - advertising_spend (β₁): The slope. For every 1-unit increase in x, the predicted y changes by this amount (an increase if positive, a decrease if negative).
- Std. Error, t value, Pr(>|t|): These are used for hypothesis testing. Pr(>|t|) is the p-value for the null hypothesis that the coefficient is zero. A p-value < 0.05 is conventionally treated as statistically significant.
- Multiple R-squared: The "Coefficient of Determination" (R²). This is the proportion of the variation in y that is explained by x. A value of 0.75 means 75% of the variation in sales is explained by advertising spend.
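The pieces of the summary can also be extracted as numbers for further use. A sketch on a made-up data frame standing in for my_data:

```r
# Made-up data standing in for my_data
my_data <- data.frame(
  advertising_spend = c(100, 200, 300, 400, 500),
  sales             = c(25, 41, 62, 79, 103)
)
model <- lm(sales ~ advertising_spend, data = my_data)
s <- summary(model)

coef(model)        # intercept and slope
s$coefficients     # estimates, standard errors, t values, p-values
s$r.squared        # Multiple R-squared
confint(model)     # 95% confidence intervals for the coefficients

# With one predictor, R-squared is the squared Pearson correlation
all.equal(s$r.squared, cor(my_data$advertising_spend, my_data$sales)^2)   # TRUE
```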
Prediction using fitted regression model
You can use the predict() function to make predictions for new data.
# 1. Create a data frame with the new x-values
new_data <- data.frame(advertising_spend = c(1000, 1500, 2000))
# 2. Use predict() to find the corresponding y-values
predicted_sales <- predict(model, newdata = new_data)
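A complete sketch of prediction on made-up data. The column name in new_data must match the predictor name used in the formula. The interval argument of predict() additionally returns lower and upper bounds.

```r
# Made-up data standing in for my_data
my_data <- data.frame(
  advertising_spend = c(100, 200, 300, 400, 500),
  sales             = c(25, 41, 62, 79, 103)
)
model <- lm(sales ~ advertising_spend, data = my_data)

# New x-values inside the observed range
new_data <- data.frame(advertising_spend = c(250, 350))

predicted_sales <- predict(model, newdata = new_data)
predicted_sales    # 52.3 and 71.7

# Point predictions with 95% prediction intervals (columns fit, lwr, upr)
with_interval <- predict(model, newdata = new_data,
                         interval = "prediction", level = 0.95)
```

Predictions far outside the range of x used to fit the model rely on the linear relationship continuing to hold there, which the data cannot confirm.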