SEC-151T: Statistical Data Analysis using R (Theory)

Unit 1: Introduction to R
Unit 2: Data Visualization in R
Unit 3: Descriptive Statistics in R (Part 1)
Unit 4: Descriptive Statistics in R (Part 2)
Unit 5: Regression in R

Unit 1: Introduction to R

Introduction to R and its features

What is R? R is a powerful, open-source programming language and software environment used for statistical computing and graphics.
Features:
- Free and Open Source: Anyone can use it for free.
- Cross-Platform: Works on Windows, Mac, and Linux.
- Comprehensive: Has a vast collection of tools for data manipulation, analysis, and visualization.
- Packages: Its main strength. R has thousands of user-contributed "packages" (like add-ons) for specialized tasks (e.g., genetics, finance, machine learning).
- Vectorized Operations: R is designed to work on entire vectors (ordered sequences of values) at once, which makes code shorter and usually faster than an explicit loop.
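
As a small illustration of vectorized operations, the sketch below converts a whole vector of temperatures in one expression. The variable names and values are made up for the example.

```r
# Hypothetical temperatures in Celsius (illustrative values)
celsius <- c(0, 25, 37, 100)

# One expression converts every element; no loop is needed
fahrenheit <- celsius * 9 / 5 + 32
print(fahrenheit)      # 32.0  77.0  98.6 212.0

# Comparisons are vectorized too and return a logical vector
print(celsius > 30)    # FALSE FALSE  TRUE  TRUE
```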

Installing R and R-Studio

Install R: First, you must install the base R system from CRAN (Comprehensive R Archive Network). Go to `cran.r-project.org` and download the version for your operating system.
Install R-Studio: R-Studio is an Integrated Development Environment (IDE) that provides a far more user-friendly interface for R. It includes a text editor, a console, a plot viewer, and more. Download the free R-Studio Desktop from `rstudio.com` (now Posit).

Basic R syntax and data types

In R, everything you work with is an object. You create objects and give them names using the assignment operator <- (= also works, but <- is the usual convention).

# 'x' is the object name, '<-' is the assignment operator, 10 is the value
x <- 10
y <- "hello"

Basic Data Types (Atomic Vectors)

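numeric: Real numbers, stored with decimal precision (e.g., 10.5, 55, 78.3). A number typed without a decimal point, such as 55, is still numeric unless it carries the 'L' suffix.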
integer: Whole numbers (e.g., 1L, 25L, -2L). The 'L' tells R to store it as an integer.
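logical: Boolean values, TRUE or FALSE. The shortcuts T and F also work, but they are ordinary variables that can be overwritten, so the full words are safer.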
character: Text strings, enclosed in quotes (e.g., "hello", "data", "2024").
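
A quick way to check these types is the class() function. The sketch below uses illustrative values only, and also shows what happens when types are mixed in one vector.

```r
# class() reports the type of each value
class(10.5)     # "numeric"
class(55)       # "numeric": a plain number is not stored as an integer
class(25L)      # "integer"
class(TRUE)     # "logical"
class("2024")   # "character": the quotes make it text, not a number

# Mixing types in one vector coerces everything to the most general type
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"
```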

Key Data Structures

Vector: The most basic data structure. A one-dimensional sequence of elements of the same type. Created with c() (combine).
v <- c(10, 20, 30, 40, 50)
Data Frame: The most important structure for statistics. A 2D table, like an Excel spreadsheet. Columns can be of different types, but all must have the same length.
df <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30),
  is_student = c(TRUE, FALSE)
)
List: A very flexible container that can hold anything, including other lists or data frames.
Matrix: A 2D array where all elements are of the same type.
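
Lists and matrices can be created as in the minimal sketch below. All names and values are illustrative.

```r
# A list can mix types and lengths
person <- list(name = "Alice", scores = c(90, 85, 77), enrolled = TRUE)
person$scores        # access an element by name with $
person[["name"]]     # or with double brackets

# A matrix holds one type only; values fill column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)               # 2 3
m[2, 3]              # element in row 2, column 3: 6
```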

Importing data into R from various file formats

CSV (Comma Separated Values):
my_data <- read.csv("file_path/my_file.csv")
Text Files (e.g., .txt):
my_data <- read.table("file_path/my_file.txt", header = TRUE)
Excel Files: You must first install and load a package, like readxl.
install.packages("readxl") # Do this once
library(readxl) # Do this every session
my_data <- read_excel("file_path/my_file.xlsx", sheet = "Sheet1")
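
To try an import without needing an existing file, the sketch below first writes a small, made-up data frame to a temporary CSV and then reads it back. The data and the temporary path are assumptions for illustration; in practice you would pass the path of your own file.

```r
# Build a small data frame (illustrative values) and write it to a temporary CSV
students <- data.frame(name = c("Alice", "Bob"), age = c(25, 30))
csv_path <- tempfile(fileext = ".csv")
write.csv(students, csv_path, row.names = FALSE)

# Read it back and inspect what R imported
my_data <- read.csv(csv_path)
str(my_data)    # column names and types
head(my_data)   # first rows
```

Running str() and head() right after an import is a cheap way to catch columns that were read with an unexpected type.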

Problems faced in case of missing data

In R, missing values are represented by the special marker NA (Not Available).

Many R functions return NA, and some stop with an error, if their input contains an NA value, because a calculation on an unknown value has an unknown result.

x <- c(1, 2, 3, NA, 5)
mean(x) # This will return NA

Data cleaning techniques

Identifying NAs: Use the is.na() function to find them.
is.na(x) # Returns: FALSE FALSE FALSE TRUE FALSE
Handling NAs in functions: Many functions have an argument na.rm = TRUE (NA remove) to tell them to ignore missing values.
mean(x, na.rm = TRUE) # Returns 2.75
Removing NA rows: The na.omit() function will delete any row from a data frame that contains at least one NA value.
clean_data <- na.omit(my_data_frame)
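
The sketch below applies these techniques to a small, made-up data frame. It also shows the trade-off: na.omit() discards whole rows, while na.rm = TRUE keeps every available value for the column being summarized.

```r
# Hypothetical data frame with missing values (illustrative)
survey <- data.frame(
  age    = c(25, NA, 40, 31),
  salary = c(50000, 42000, NA, 61000)
)

colSums(is.na(survey))            # number of NAs in each column
complete_rows <- na.omit(survey)  # keeps only rows with no NA at all
nrow(complete_rows)               # 2

# Alternative: keep every row and skip NAs column by column
mean(survey$age, na.rm = TRUE)    # 32
```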

Unit 2: Data Visualization in R

Introduction to data visualization in R

R is famous for its high-quality, customizable graphics. Visualization is crucial for exploring data, identifying patterns, checking assumptions, and communicating results.

Basic plotting functions in R

R's "base graphics" system provides powerful and quick functions for creating plots.

Creating and customizing various types of plots

Histograms: For visualizing the distribution of a single continuous variable.
hist(my_data$age)
Bar charts: For visualizing the frequency of a single categorical variable.
gender_counts <- table(my_data$gender)
barplot(gender_counts)
Box plots (Box-and-Whisker): For visualizing the "five-number summary" (min, Q1, median, Q3, max) and identifying outliers.
boxplot(my_data$salary)
Pie charts: For visualizing proportions. (Note: Pie charts are generally discouraged by statisticians as bar charts are easier to read).
pie(gender_counts)
Frequency polygons and curves: A histogram shows bars; a frequency polygon shows the same information as a line. A frequency curve is a smoothed version (a density plot).
# For a density curve (frequency curve)
plot(density(my_data$age))
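
Base R has no dedicated frequency polygon function, so one possible approach is to compute the histogram bins without drawing them and then join the bin midpoints with a line. The sketch below uses simulated ages, which are an assumption for illustration.

```r
# Simulated ages, for illustration only
set.seed(1)
ages <- round(rnorm(200, mean = 35, sd = 8))

# Compute the histogram bins without drawing them
h <- hist(ages, plot = FALSE)

# Frequency polygon: join the bin midpoints at the height of each bin count
plot(h$mids, h$counts, type = "b",
     main = "Frequency Polygon of Age", xlab = "Age", ylab = "Frequency")
```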

Adding labels, titles, and legends to plots

You can add these as arguments directly inside the plot() command:

plot(my_data$x, my_data$y,
  main = "My Plot Title", # Main title
  xlab = "X-Axis Label", # X-axis label
  ylab = "Y-Axis Label", # Y-axis label
  col = "blue", # Color of points
  pch = 19 # Point character (19 is a solid circle)
)

You can also use the legend() function to add a legend after the plot is created.
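
A minimal sketch of legend() follows, using two made-up groups. The legend does not read anything from the plot, so its colors and symbols have to repeat the ones used when drawing.

```r
# Illustrative data for two groups
x <- 1:10
group_a <- x * 2
group_b <- x * 3

plot(x, group_a, col = "blue", pch = 19, ylim = range(c(group_a, group_b)),
     main = "Two Groups", xlab = "X", ylab = "Y")
points(x, group_b, col = "red", pch = 17)   # add the second group

# The legend entries must repeat the colors and symbols used above
legend("topleft", legend = c("Group A", "Group B"),
       col = c("blue", "red"), pch = c(19, 17))
```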

Saving and exporting plots

You can save a plot to a file (like PNG, PDF, or JPEG) in three steps:

Open a graphics device: e.g., png("my_plot.png")
Create your plot: e.g., hist(my_data$age)
Close the device: dev.off() (This saves the file).
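
The three steps can be sketched as follows. This example uses the pdf() device and a temporary file so that it runs anywhere; the ages are made up, and png() or jpeg() follow the same open, draw, close pattern.

```r
# Write to a temporary file so the example leaves nothing behind
plot_path <- tempfile(fileext = ".pdf")

pdf(plot_path, width = 6, height = 4)   # 1. open the device (size in inches)
hist(c(21, 25, 25, 30, 34, 41, 48),     # 2. draw the plot (illustrative ages)
     main = "Age Distribution", xlab = "Age")
dev.off()                               # 3. close the device to write the file

file.exists(plot_path)                  # TRUE once the device is closed
```

If the saved file is empty or will not open, the usual cause is a missing dev.off().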

Unit 3: Descriptive Statistics in R (Part 1)

Measures of central tendency

Mean:
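mean(my_data$salary, na.rm = TRUE)
Median:
median(my_data$salary, na.rm = TRUE)
Mode: R has no built-in function for the statistical mode (the existing mode() function reports how an object is stored, not its most frequent value). You must find it using table().
counts <- table(my_data$category)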
mode_value <- names(counts)[which.max(counts)]
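
Note that which.max() returns only the first value when several are tied. A small helper that returns every tied value could look like the sketch below; the function name stat_mode is hypothetical, not part of R.

```r
# Hypothetical helper: returns every value tied for the highest frequency
stat_mode <- function(v) {
  counts <- table(v)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # all values with the top count
}

stat_mode(c("HR", "IT", "IT", "Sales"))        # "IT"
stat_mode(c("HR", "HR", "IT", "IT", "Sales"))  # "HR" "IT" (a tie)
```

The result is always a character vector, because it is taken from the names of the table; wrap it in as.numeric() for numeric data.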

Weighted mean

Used when some values are more important than others.

values <- c(10, 20, 30)
weights <- c(0.5, 0.3, 0.2) # Weights need not sum to 1; weighted.mean() divides by their sum
weighted.mean(values, weights)
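
Because weighted.mean() divides by the sum of the weights, raw weights such as credit hours can be used without rescaling. The sketch below checks the result against the formula; the marks and credits are made up.

```r
# Illustrative marks and credit hours; these weights sum to 8, not 1
marks   <- c(80, 90, 70)
credits <- c(4, 3, 1)

weighted.mean(marks, credits)        # 82.5
sum(marks * credits) / sum(credits)  # same result from the formula
```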

Geometric mean

Used for averaging growth rates and other quantities that combine by multiplication. Base R has no built-in function, but it is easy to compute from logs.

# Assumes x contains positive numbers
gm <- exp(mean(log(x)))
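
As a worked example, the sketch below averages three hypothetical yearly growth factors (a factor of 1.10 means +10%). It also checks the defining property: growing at the geometric mean every year gives the same total growth.

```r
# Hypothetical yearly growth factors: +10%, +20%, -5%
growth <- c(1.10, 1.20, 0.95)

gm <- exp(mean(log(growth)))
gm                                   # average growth factor per year
prod(growth)^(1 / length(growth))    # equivalent direct formula

# Growing at the rate gm every year reproduces the total growth
all.equal(gm^3, prod(growth))        # TRUE
```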

Measures of dispersion

Range:
range(my_data$salary) # Returns (min, max)
Variance:
var(my_data$salary, na.rm = TRUE)
Standard deviation:
sd(my_data$salary, na.rm = TRUE)

Coefficient of variation (C.V.)

A relative measure of dispersion, useful for comparing variability of datasets with different means. C.V. = (SD / Mean) * 100.

cv <- (sd(my_data$salary) / mean(my_data$salary)) * 100
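
The sketch below shows why the C.V. is useful: it compares the variability of two variables measured in different units. The data and the helper name cv_pct are made up for illustration.

```r
# Illustrative data: heights in cm and weights in kg for the same five people
height <- c(160, 165, 170, 175, 180)
weight <- c(55, 62, 70, 81, 95)

# Hypothetical helper that skips missing values
cv_pct <- function(v) sd(v, na.rm = TRUE) / mean(v, na.rm = TRUE) * 100

cv_pct(height)   # about 4.65
cv_pct(weight)   # larger: weight varies more relative to its mean
```

The C.V. is only meaningful for data measured on a scale with a true zero and a positive mean.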

Percentiles and quartiles

The quantile() function is very powerful.

# Get the minimum, the three quartiles, and the maximum (0%, 25%, 50%, 75%, 100%)
quantile(my_data$salary)
# Get the 90th percentile
quantile(my_data$salary, probs = 0.90)

Interquartile range (IQR)

The distance between the 75th (Q3) and 25th (Q1) percentiles. It is a robust measure of spread because it ignores the extreme values in the top and bottom quarters of the data.

IQR(my_data$salary)
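
One common use of the IQR is the 1.5 × IQR rule of thumb for flagging possible outliers, which is also the rule a box plot uses by default to draw individual points. The sketch below applies it to made-up salaries.

```r
# Illustrative salaries (in thousands) with one extreme value
salary <- c(30, 32, 35, 36, 38, 40, 41, 45, 120)

q   <- quantile(salary, probs = c(0.25, 0.75))   # Q1 = 35, Q3 = 41
iqr <- IQR(salary)                               # Q3 - Q1 = 6

# Rule of thumb: flag points more than 1.5 * IQR beyond the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
salary[salary < lower | salary > upper]          # 120
```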

Unit 4: Descriptive Statistics in R (Part 2)

Measures of skewness and kurtosis

These are not in base R. One common option is the e1071 package.

install.packages("e1071")
library(e1071)
# Skewness:
skewness(my_data$salary)
# Kurtosis:
kurtosis(my_data$salary)

Skewness Interpretation:
- ~ 0: Roughly symmetric.
- > 0: Positively (right) skewed.
- < 0: Negatively (left) skewed.
Kurtosis Interpretation: (Note: the kurtosis() function in e1071 reports "excess" kurtosis, i.e. kurtosis minus 3, so a normal distribution scores 0).
- ~ 0: Mesokurtic (normal peak/tails).
- > 0: Leptokurtic (high peak, fat tails).
- < 0: Platykurtic (flat peak, thin tails).
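
To see what these measures compute, the sketch below derives the moment-based coefficients by hand in base R on made-up data. This is one of several estimators; the e1071 functions take a type argument that selects among related formulas, so their default output may differ slightly from these values.

```r
# Illustrative right-skewed data
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6, 15)

n  <- length(x)
d  <- x - mean(x)       # deviations from the mean
m2 <- sum(d^2) / n      # second central moment
m3 <- sum(d^3) / n      # third central moment
m4 <- sum(d^4) / n      # fourth central moment

skew_g1 <- m3 / m2^(3/2)    # moment coefficient of skewness
kurt_g2 <- m4 / m2^2 - 3    # excess kurtosis (0 for a normal distribution)
skew_g1 > 0                 # TRUE: the long tail is on the right
```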

Introduction to bivariate analysis

Analyzing two variables at the same time.

Cross-tabulation: Creates a frequency table for two categorical variables.
table(my_data$gender, my_data$department)
Scatter plots: The best way to visualize the relationship between two continuous variables.
plot(x = my_data$advertising_spend, y = my_data$sales)

Pearson's and Spearman's correlation coefficient

The cor() function calculates the correlation.

Pearson's (r): Measures the strength of the linear relationship. (Default)
Spearman's (rho): Measures the strength of the monotonic relationship (non-parametric, based on ranks).

# Pearson correlation (default)
cor(my_data$x, my_data$y, method = "pearson")
# Spearman rank correlation
cor(my_data$x, my_data$y, method = "spearman")
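
The difference between the two coefficients shows up when a relationship is monotonic but curved. The sketch below uses made-up data, and also shows the use argument, which cor() needs when the data contain missing values.

```r
# Illustrative data: y always increases with x, but along a curve
x <- 1:10
y <- x^3

cor(x, y, method = "pearson")    # below 1: the relationship is not a straight line
cor(x, y, method = "spearman")   # 1: the ranks agree perfectly

# Missing values make cor() return NA unless 'use' is set
y_na <- replace(y, 3, NA)
cor(x, y_na)                         # NA
cor(x, y_na, use = "complete.obs")   # computed from the complete pairs only
```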

Unit 5: Regression in R

Simple linear regression

A statistical method to model the linear relationship between one independent variable (x) and one dependent variable (y). The model is: Y = β₀ + β₁X + ε, where β₀ is the intercept, β₁ is the slope, and ε is the random error term.

Fitting of regression models in R

We use the lm() (linear model) function. The formula y ~ x is read as "y is modeled by x".

# Fit the model
model <- lm(sales ~ advertising_spend, data = my_data)

Least Squares method

The lm() function automatically uses the Method of Ordinary Least Squares (OLS). This is the method that finds the line (i.e., the values for β₀ and β₁) that minimizes the sum of the squared residuals (the vertical distances from each point to the line).
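
For a single predictor, the least-squares estimates have a closed form: β₁ = cov(x, y) / var(x) and β₀ = mean(y) − β₁ · mean(x). The sketch below computes them by hand on made-up data and confirms that lm() gives the same values.

```r
# Illustrative data (made-up values)
my_data <- data.frame(
  advertising_spend = c(100, 200, 300, 400, 500),
  sales             = c(25, 41, 58, 82, 95)
)
x <- my_data$advertising_spend
y <- my_data$sales

# Closed-form least-squares estimates for one predictor
b1 <- cov(x, y) / var(x)        # slope: 0.181
b0 <- mean(y) - b1 * mean(x)    # intercept: 5.9

# lm() arrives at the same values
model <- lm(sales ~ advertising_spend, data = my_data)
all.equal(unname(coef(model)), c(b0, b1))   # TRUE
```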

Evaluating and interpreting regression results

The summary() function is the most important tool for evaluating a model.

summary(model)

This will output:

Coefficients:
- (Intercept) (β₀): The predicted value of y when x is 0.
- advertising_spend (β₁): The slope. For every 1-unit increase in x, y is predicted to change by this amount (a negative value means a decrease).
Std. Error, t value, Pr(>|t|): These are used for hypothesis testing. The Pr(>|t|) column (the p-value) tests the null hypothesis that the coefficient is zero. A p-value < 0.05 is usually considered statistically significant.
Multiple R-squared: The "Coefficient of Determination" (R²). This tells you what proportion of the variation in y is explained by x. A value of 0.75 means 75% of the variation in sales is explained by advertising spend.
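
The same quantities can be pulled out of the summary object for use in later code. The sketch below fits a model on made-up data and extracts them.

```r
# Illustrative data (made-up values)
my_data <- data.frame(
  advertising_spend = c(100, 200, 300, 400, 500),
  sales             = c(25, 41, 58, 82, 95)
)
model <- lm(sales ~ advertising_spend, data = my_data)
s <- summary(model)

s$coefficients    # matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared       # Multiple R-squared
s$coefficients["advertising_spend", "Pr(>|t|)"]   # p-value of the slope

# With a single predictor, R-squared equals the squared Pearson correlation
all.equal(s$r.squared, cor(my_data$sales, my_data$advertising_spend)^2)   # TRUE
```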

Prediction using fitted regression model

You can use the predict() function to make predictions for new data.

# 1. Create a data frame with the new x-values
new_data <- data.frame(advertising_spend = c(1000, 1500, 2000))
# 2. Use predict() to find the corresponding y-values
predicted_sales <- predict(model, newdata = new_data)
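
A self-contained sketch of prediction on made-up data follows. The column name in the new data frame must match the predictor name used in the formula, and interval = "prediction" adds lower and upper limits (95% by default) around each predicted value.

```r
# Illustrative data (made-up values)
my_data <- data.frame(
  advertising_spend = c(100, 200, 300, 400, 500),
  sales             = c(25, 41, 58, 82, 95)
)
model <- lm(sales ~ advertising_spend, data = my_data)

# New x-values inside the observed range of 100 to 500
new_data <- data.frame(advertising_spend = c(250, 350))

predict(model, newdata = new_data)                            # 51.15 69.25
predict(model, newdata = new_data, interval = "prediction")   # fit, lwr, upr
```

Predictions far outside the range of the observed x-values rely on the straight-line pattern continuing, which the data cannot confirm.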