SEC-151T: Statistical Data Analysis using R (Theory)

Unit 1: Introduction to R

Introduction to R and its features

Installing R and R-Studio

  1. Install R: First, you must install the base R system from CRAN (Comprehensive R Archive Network). Go to `cran.r-project.org` and download the version for your operating system.
  2. Install R-Studio: R-Studio is an Integrated Development Environment (IDE) that provides a much more user-friendly interface for R. It includes a text editor, a console, a plot viewer, and more. Download the free R-Studio Desktop from `rstudio.com` (the company is now called Posit).

Basic R syntax and data types

R is an object-oriented language. You create "objects" and give them names using the assignment operator <- (or =).

# 'x' is the object name, '<-' is the assignment operator, 10 is the value
x <- 10
y <- "hello"
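
As a small illustrative sketch (the object names z and n are our own, not from the notes), class() and typeof() can be used to check what kind of object an assignment created:

```r
x <- 10        # numeric (stored as a double)
y <- "hello"   # character
z = TRUE       # '=' also works for assignment at the top level
n <- 10L       # the L suffix creates an integer

class(x)   # "numeric"
class(y)   # "character"
class(z)   # "logical"
typeof(x)  # "double"
typeof(n)  # "integer"
```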

Basic Data Types (Atomic Vectors)

Key Data Structures

Importing data into R from various file formats

Problems faced in case of missing data

In R, missing values are represented by the special marker NA (Not Available).

Many R functions, such as mean(), sum(), and sd(), return NA if their input contains an NA value, unless you tell them to skip missing values with na.rm = TRUE.

x <- c(1, 2, 3, NA, 5)
mean(x) # This will return NA
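
A minimal sketch of how the missing value in the vector above can be detected and skipped. The names n_missing, mean_no_na, and x_complete are our own; is.na() and the na.rm argument are standard base R.

```r
x <- c(1, 2, 3, NA, 5)

# Locate and count the missing values
is.na(x)                      # TRUE only at the NA position
n_missing <- sum(is.na(x))    # number of NA values

# Ask mean() to ignore NA values
mean_no_na <- mean(x, na.rm = TRUE)

# Or drop the NA values explicitly before analysis
x_complete <- x[!is.na(x)]
```

Whether dropping missing values is appropriate depends on why they are missing, so this is a starting point rather than a general rule.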

Data cleaning techniques

Unit 2: Data Visualization in R

Introduction to data visualization in R

R is famous for its high-quality, customizable graphics. Visualization is crucial for exploring data, identifying patterns, checking assumptions, and communicating results.

Basic plotting functions in R

R's "base graphics" system provides powerful and quick functions for creating plots.

Creating and customizing various types of plots

Adding labels, titles, and legends to plots

You can add these as arguments directly inside the plot() command:

plot(my_data$x, my_data$y,
  main = "My Plot Title", # Main title
  xlab = "X-Axis Label", # X-axis label
  ylab = "Y-Axis Label", # Y-axis label
  col = "blue", # Color of points
  pch = 19 # Point character (19 is a solid circle)
)

You can also use the legend() function to add a legend after the plot is created.
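
A possible sketch of legend() in use. The data frame, its group column, and the colour mapping group_cols are hypothetical and exist only to give the legend something to describe.

```r
# Hypothetical example data (not from the course material)
set.seed(1)
my_data <- data.frame(
  x = 1:20,
  y = rnorm(20),
  group = rep(c("A", "B"), each = 10)
)

# One colour per group; the names are used as legend labels
group_cols <- c(A = "blue", B = "red")
point_cols <- group_cols[as.character(my_data$group)]

plot(my_data$x, my_data$y,
     main = "My Plot Title", xlab = "X-Axis Label", ylab = "Y-Axis Label",
     col = point_cols, pch = 19)

# legend() draws on top of the plot that already exists
legend("topright", legend = names(group_cols), col = group_cols,
       pch = 19, title = "Group")
```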

Saving and exporting plots

You can save a plot to a file (like PNG, PDF, or JPEG) in three steps:

  1. Open a graphics device: e.g., png("my_plot.png")
  2. Create your plot: e.g., hist(my_data$age)
  3. Close the device: dev.off() (This saves the file).
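
The steps above can be sketched as follows. The data and the age column are hypothetical, and pdf() is used here only because it needs no extra system libraries; png() and jpeg() follow the same open, draw, close pattern.

```r
# Hypothetical data; 'age' is an assumed column name
my_data <- data.frame(age = c(23, 35, 41, 29, 52, 38, 44, 31))

# Write to a temporary folder so the working directory stays clean
out_file <- file.path(tempdir(), "my_plot.pdf")

pdf(out_file, width = 7, height = 5)                        # 1. open the device
hist(my_data$age, main = "Age distribution", xlab = "Age")  # 2. create the plot
dev.off()                                                   # 3. close the device

file.exists(out_file)  # check that the file was written
```

If dev.off() is forgotten, the file stays open and may be empty or incomplete.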

Unit 3: Descriptive Statistics in R (Part 1)

Measures of central tendency

Weighted mean

Used when some values are more important than others.

values <- c(10, 20, 30)
weights <- c(0.5, 0.3, 0.2) # Relative weights; weighted.mean() divides by their sum, so they need not sum to 1
weighted.mean(values, weights)
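
As a quick check of what weighted.mean() computes, the sketch below compares it with the formula sum(value * weight) / sum(weight). The weights here are deliberately on a different scale (5, 3, 2) to show that only their proportions matter; the names manual and builtin are our own.

```r
values  <- c(10, 20, 30)
weights <- c(5, 3, 2)   # same proportions as 0.5, 0.3, 0.2

# Manual computation: weighted sum divided by total weight
manual <- sum(values * weights) / sum(weights)

# Built-in function
builtin <- weighted.mean(values, weights)

manual
builtin
```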

Geometric mean

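Used for averaging growth rates and other multiplicative quantities. Base R has no built-in function for it, but it can be computed as the exponential of the mean of the logs.

# Assumes x contains only positive numbers and no NA values
gm <- exp(mean(log(x)))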
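
A short sketch with hypothetical yearly growth factors, showing that the log-based formula agrees with the direct definition (the n-th root of the product). The vector growth and the name gm_direct are illustrative only.

```r
# Hypothetical growth factors: +10%, +20%, -5%
growth <- c(1.10, 1.20, 0.95)

# Geometric mean via logs (all values must be positive)
gm <- exp(mean(log(growth)))

# Direct definition: n-th root of the product
gm_direct <- prod(growth)^(1 / length(growth))

gm
gm_direct
mean(growth)  # the arithmetic mean is larger and overstates average growth
```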

Measures of dispersion

Coefficient of variation (C.V.)

A relative measure of dispersion, useful for comparing variability of datasets with different means. C.V. = (SD / Mean) * 100.

cv <- (sd(my_data$salary) / mean(my_data$salary)) * 100
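
To illustrate the comparison the C.V. is meant for, here is a sketch with two hypothetical salary samples that have very different means. The helper coef_var() is our own small function, not part of base R.

```r
# Hypothetical salaries in two departments (illustrative numbers only)
salary_a <- c(30000, 32000, 31000, 29000, 33000)
salary_b <- c(60000, 90000, 45000, 120000, 75000)

# Our own helper: standard deviation as a percentage of the mean
coef_var <- function(x) sd(x) / mean(x) * 100

cv_a <- coef_var(salary_a)
cv_b <- coef_var(salary_b)

cv_a
cv_b  # larger value: salaries in department B vary more relative to their mean
```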

Percentiles and quartiles

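The quantile() function returns sample quantiles for any probabilities passed in probs. With no probs argument it returns the minimum, the three quartiles, and the maximum.

# Get quartiles (0%, 25%, 50%, 75%, 100%)
quantile(my_data$salary)
# Get the 90th percentile
quantile(my_data$salary, probs = 0.90)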

Interquartile range (IQR)

The distance between the 75th percentile (Q3) and the 25th percentile (Q1). Because it depends only on the middle 50% of the data, it is a robust measure of spread.

IQR(my_data$salary)
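
A sketch of what "robust" means here, using hypothetical salary figures (in thousands). Replacing the largest value with an extreme one leaves the IQR unchanged but inflates the standard deviation.

```r
# Hypothetical salaries, in thousands
salary     <- c(25, 30, 32, 35, 40, 45, 50, 60, 70)
salary_out <- c(salary[-9], 700)   # same data, largest value replaced by an outlier

# IQR is Q3 minus Q1
q <- quantile(salary, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

IQR(salary)
IQR(salary_out)   # unaffected by the outlier

sd(salary)
sd(salary_out)    # much larger because of the outlier
```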

Unit 4: Descriptive Statistics in R (Part 2)

Measures of skewness and kurtosis

These are not in base R. One common option is the e1071 package, which must be installed once and then loaded in each session.

install.packages("e1071")
library(e1071)
# Skewness:
skewness(my_data$salary)
# Kurtosis (e1071 returns excess kurtosis, which is 0 for a normal distribution):
kurtosis(my_data$salary)

Introduction to bivariate analysis

Analyzing two variables at the same time.

Pearson's and Spearman's correlation coefficient

The cor() function calculates the correlation.

# Pearson correlation (default)
cor(my_data$x, my_data$y, method = "pearson")
# Spearman rank correlation
cor(my_data$x, my_data$y, method = "spearman")
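
The difference between the two coefficients shows up when a relationship is monotonic but not linear. In this sketch (made-up data), y always increases with x, so the rank-based Spearman coefficient is 1, while the Pearson coefficient, which measures linear association, is below 1.

```r
# Made-up data: y grows with x, but along a curve rather than a line
x <- c(1, 2, 3, 4, 5, 6)
y <- x^3

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

pearson    # high, but less than 1
spearman   # 1, because the ranks of x and y agree perfectly
```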

Unit 5: Regression in R

Simple linear regression

A statistical method for modeling the linear relationship between one independent variable (X) and one dependent variable (Y). The model is: Y = β₀ + β₁X + ε, where β₀ is the intercept, β₁ is the slope, and ε is the random error term.

Fitting of regression models in R

We use the lm() (linear model) function. The formula y ~ x is read as "y is modeled by x".

# Fit the model
model <- lm(sales ~ advertising_spend, data = my_data)

Least Squares method

The lm() function automatically uses the Method of Ordinary Least Squares (OLS). This is the method that finds the line (i.e., the values for β₀ and β₁) that minimizes the sum of the squared residuals (the vertical distances from each point to the line).
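
For simple linear regression the least squares estimates have a closed form: the slope is Sxy / Sxx and the intercept is mean(y) minus slope times mean(x). The sketch below computes them by hand and compares them with lm(). The data values are hypothetical; only the column names follow the lm() call used in these notes.

```r
# Hypothetical data (illustrative values only)
my_data <- data.frame(
  advertising_spend = c(500, 800, 1000, 1200, 1500, 1800),
  sales             = c(5200, 7100, 8300, 9800, 11900, 13800)
)

x <- my_data$advertising_spend
y <- my_data$sales

# Closed-form OLS estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Same estimates from lm()
model <- lm(sales ~ advertising_spend, data = my_data)

c(b0, b1)
coef(model)
```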

Evaluating and interpreting regression results

The summary() function is the most important tool for evaluating a model.

summary(model)

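This prints the model call, a summary of the residuals, the coefficient table (estimate, standard error, t value, and p-value for each term), the residual standard error, R-squared and adjusted R-squared, and the overall F-statistic.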
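
Individual parts of the summary can also be pulled out by name, which is handy when reporting results. A minimal sketch on hypothetical data (the values are made up; the component names coefficients, r.squared, adj.r.squared, and sigma are those of the object returned by summary() for an lm fit):

```r
# Hypothetical data (illustrative values only)
my_data <- data.frame(
  advertising_spend = c(500, 800, 1000, 1200, 1500, 1800, 2200),
  sales             = c(5200, 7100, 8300, 9800, 11900, 13800, 16500)
)
model <- lm(sales ~ advertising_spend, data = my_data)

s <- summary(model)

s$coefficients    # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared       # proportion of variance in sales explained by the model
s$adj.r.squared   # R-squared adjusted for the number of predictors
s$sigma           # residual standard error

# p-value for the slope
slope_p <- s$coefficients["advertising_spend", "Pr(>|t|)"]
```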

Prediction using fitted regression model

You can use the predict() function to make predictions for new data.

# 1. Create a data frame with the new x-values
new_data <- data.frame(advertising_spend = c(1000, 1500, 2000))
# 2. Use predict() to find the corresponding y-values
predicted_sales <- predict(model, newdata = new_data)
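
predict() can also report the uncertainty around each prediction. The sketch below uses hypothetical data and the interval = "prediction" argument of predict() for lm models, which returns the fitted value together with lower and upper bounds.

```r
# Hypothetical data (illustrative values only)
my_data <- data.frame(
  advertising_spend = c(500, 800, 1000, 1200, 1500, 1800, 2200),
  sales             = c(5200, 7100, 8300, 9800, 11900, 13800, 16500)
)
model <- lm(sales ~ advertising_spend, data = my_data)

new_data <- data.frame(advertising_spend = c(1000, 1500, 2000))

# Matrix with columns fit (prediction), lwr and upr (95% prediction interval)
pred <- predict(model, newdata = new_data,
                interval = "prediction", level = 0.95)
pred
```

The column name in new_data must match the predictor name used in the model formula, and predictions far outside the range of the original x-values should be treated with caution.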