Unit 1: Introduction to Statistics and Data Presentation
1.1 Definition, Scope, and Limitations of Statistics
Definition of Statistics
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It provides tools and methods to find patterns, make decisions, and deal with uncertainty.
Scope of Statistics
Statistics is used in almost every field:
- Government: For policy-making, census data, and economic planning.
- Business & Economics: For market research, quality control, and financial forecasting.
- Science: For designing experiments, testing hypotheses, and analyzing results (e.g., in medicine, biology, physics).
- Social Sciences: For analyzing survey data, demographic trends, and psychological studies.
- Daily Life: For understanding weather forecasts, sports analytics, and news reports.
Limitations of Statistics
- Deals with aggregates, not individuals: Statistics describes group behavior, not the story of a single person or item.
- Deals with quantitative data: It primarily works with numbers. Qualitative data (like honesty, beauty, or attitude) must be converted to a numerical scale to be analyzed.
- Results are true "on average": Statistical laws are not as exact as laws of physics; they are probabilistic.
- Can be misused: Data can be manipulated to support a biased conclusion. "There are three kinds of lies: lies, damned lies, and statistics."
1.2 Population and Sample
- Population: The entire collection of individuals or items about which we want to draw a conclusion. For example, "all students at Assam University" or "all light bulbs produced by a factory."
- Sample: A subset of the population that is selected for study. We study the sample to make inferences about the whole population. For example, "500 students surveyed at Assam University."
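The relationship between a population and a sample can be illustrated with a short sketch. The data here is synthetic and the names (`population`, `sample`) are illustrative assumptions, not part of the notes; the point is only that a summary computed from a random sample tends to be close to the same summary for the whole population.

```python
import random
import statistics

# Hypothetical population: exam scores for 5,000 students (synthetic data).
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(0, 100) for _ in range(5000)]

# Draw a simple random sample of 500 students, without replacement.
sample = rng.sample(population, k=500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population size: {len(population)}, mean = {population_mean:.2f}")
print(f"Sample size:     {len(sample)}, mean = {sample_mean:.2f}")
```

The two means will usually differ slightly; that difference is the sampling error that statistical inference has to account for.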
1.3 Types of Data
By Source
- Primary Data: Data collected first-hand by the researcher for a specific purpose (e.g., through surveys, experiments, or direct observation).
- Secondary Data: Data that has already been collected by someone else and is available for use (e.g., government census data, company records, academic journals).
By Nature
- Quantitative Data: Represents amounts or quantities (numerical). Can be measured or counted.
  - Discrete Data: Can only take specific, countable values (often integers). There are "gaps" between values. Examples: Number of children in a family (0, 1, 2...), number of cars sold.
  - Continuous Data: Can take any value within a given range. It is measured, not counted. Examples: Height (170.1 cm, 170.11 cm...), temperature, time.
- Qualitative Data (or Categorical Data): Represents qualities or characteristics. Data is placed into categories.
  - Nominal Data: Categories with no natural order. Examples: Gender (Male, Female), Eye Color (Blue, Brown, Green), Religion.
  - Ordinal Data: Categories that have a meaningful order or rank, but the distance between categories is not defined or not necessarily equal. Examples: Education Level (High School, Bachelor's, Master's), Customer Rating (Poor, Good, Excellent).
By Time
- Cross-Sectional Data: Data collected at a single point in time across multiple subjects. Example: A survey of 100 people's income in 2024.
- Time-Series Data: Data collected for a single subject over multiple time periods. Example: A company's monthly sales from 2020 to 2024.
1.4 Presentation of Data (Tables and Diagrams)
Tabulation
Organizing data into rows and columns in a table. A good table has a clear title, labeled columns and rows, and (if needed) a source note.
Diagrams
Visual representations of data. The choice of diagram depends on the type of data.
- Bar Diagram: Uses the height of bars to represent the frequency or value of discrete/categorical data. Bars are separated by gaps.
- Pie Chart: A circle divided into "slices" to show the proportion of different categories.
- Pictogram: Uses icons or pictures to represent data.
1.5 Frequency Distributions
A frequency distribution is a table that organizes raw data by summarizing the number of times (frequency) each value or group of values (class interval) occurs.
For Discrete Data
Simply list the values and their frequencies.
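As a minimal sketch of a discrete frequency distribution, the snippet below counts how often each value occurs. The raw data is hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical raw data: number of children in 12 families.
children = [0, 2, 1, 2, 3, 1, 2, 0, 1, 2, 4, 1]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(children)

print("Value  Frequency")
for value in sorted(frequency):
    print(f"{value:>5}  {frequency[value]:>9}")
```

The frequencies always add up to the number of observations, which is a quick check that nothing was missed.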
For Continuous Data
Data is grouped into class intervals.
- Class Limits: The stated boundaries (e.g., 10-19, 20-29).
- Class Boundaries: The "true" boundaries that leave no gaps (e.g., 9.5-19.5, 19.5-29.5).
- Class Width (h): Upper boundary - Lower boundary.
- Mid-point (x): (Upper boundary + Lower boundary) / 2.
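The definitions above can be sketched in a few lines. This assumes the class limits are stated for whole-number data, so that each boundary sits halfway across the gap between adjacent classes; the third class (30-39) is added here only to extend the example.

```python
# Class limits as stated in the table (inclusive limits).
class_limits = [(10, 19), (20, 29), (30, 39)]

# Gap between one class's upper limit and the next class's lower limit.
gap = class_limits[1][0] - class_limits[0][1]  # 20 - 19 = 1

rows = []
for lower, upper in class_limits:
    lower_boundary = lower - gap / 2
    upper_boundary = upper + gap / 2
    width = upper_boundary - lower_boundary            # class width h
    midpoint = (upper_boundary + lower_boundary) / 2   # mid-point x
    rows.append((lower_boundary, upper_boundary, width, midpoint))
    print(f"{lower}-{upper}: boundaries {lower_boundary}-{upper_boundary}, "
          f"h = {width}, x = {midpoint}")
```

For the class 10-19 this gives boundaries 9.5-19.5, a width of 10, and a mid-point of 14.5.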
1.6 Graphical Representation of Frequency Distributions
Histogram
The most common graph for a continuous frequency distribution. It uses bars to represent frequency, but unlike a bar chart, the bars are adjacent (touching) to represent the continuous nature of the data. The x-axis is marked with class boundaries.
Unequal Class Intervals: If class widths are unequal, you must plot Frequency Density on the y-axis, not frequency.
Frequency Density = Frequency / Class Width
In this case, the area of the bar represents the frequency.
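A short sketch of the frequency density calculation follows. The class boundaries and frequencies are hypothetical values picked to show unequal widths.

```python
# Each entry: (lower boundary, upper boundary, frequency). Hypothetical data.
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 70, 9)]

densities = []
for lower, upper, frequency in classes:
    width = upper - lower
    density = frequency / width  # Frequency Density = Frequency / Class Width
    densities.append(density)
    print(f"{lower}-{upper}: width = {width}, frequency = {frequency}, "
          f"density = {density:.2f}")
```

Note that the class 20-40 has the largest frequency (16) but not the tallest bar, because its width is twice that of the class 10-20. Multiplying each density by its class width recovers the frequency, which is why the area of the bar, not its height, represents the frequency.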
Frequency Polygon
A line graph formed by connecting the midpoints of the tops of the bars in a histogram. It gives a clearer picture of the shape of the distribution.
Cumulative Frequency Curves (Ogive)
An ogive (or cumulative frequency curve) is a graph of a cumulative frequency distribution. It is very useful for finding the median and other partition values (like quartiles).
- "Less Than" Ogive:
  - Plots: "Less than" cumulative frequency against the upper class boundaries.
  - Shape: A rising curve, starting from 0.
- "More Than" Ogive:
  - Plots: "More than" cumulative frequency against the lower class boundaries.
  - Shape: A falling curve, ending at 0.
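Finding the Median: The median is the x-value (on the horizontal axis) corresponding to the intersection point of the "Less Than" and "More Than" Ogives. At every boundary the two cumulative frequencies add up to the total frequency N, so the curves cross where each equals N/2.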
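The ogive construction and the median reading can be sketched numerically. This is a minimal illustration, assuming hypothetical class boundaries and frequencies and assuming the plotted points are joined by straight lines; `median_from_ogive` is an illustrative helper, not a standard library function.

```python
from itertools import accumulate

# Hypothetical grouped data: class boundaries and the frequency of each class.
boundaries = [9.5, 19.5, 29.5, 39.5, 49.5]
frequencies = [4, 10, 12, 4]
total = sum(frequencies)

# "Less than" cumulative frequency at each boundary (starts at 0).
less_than = [0] + list(accumulate(frequencies))
# "More than" cumulative frequency at each boundary (ends at 0).
more_than = [total - c for c in less_than]


def median_from_ogive(boundaries, less_than):
    """Read off the x-value where the less-than ogive reaches N/2,
    using straight-line interpolation between plotted points."""
    n = less_than[-1]
    if n <= 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    for i in range(1, len(boundaries)):
        if less_than[i] >= half:
            x0, x1 = boundaries[i - 1], boundaries[i]
            y0, y1 = less_than[i - 1], less_than[i]
            return x0 + (half - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("cumulative frequencies never reach N/2")


median = median_from_ogive(boundaries, less_than)
print("Less than:", less_than)
print("More than:", more_than)
print(f"Median is approximately {median:.2f}")
```

Reading the less-than ogive at N/2 gives the same x-value as the intersection of the two ogives, since the two curves sum to N at every point.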