
What is statistics?
Statistics is the science of learning from data. It includes:
- collecting data
- organizing and summarizing data
- analyzing patterns
- interpreting results to make decisions
Major branches of statistics
Descriptive statistics
Descriptive statistics focuses on organizing, summarizing, and presenting data. Examples:
- class average marks
- highest and lowest temperature this week
- bar chart of students by branch
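As a minimal sketch of the first two examples, the snippet below summarizes a small list of marks. The values and variable names are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical marks for one class (illustrative values only).
marks = [72, 85, 64, 90, 58, 77]

class_average = mean(marks)  # a single number summarizing the whole list
highest = max(marks)
lowest = min(marks)

print(f"average={class_average:.1f}, highest={highest}, lowest={lowest}")
```

Nothing here goes beyond the observed data: the code only describes the marks it was given.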
Inferential statistics
Inferential statistics uses sample data to draw conclusions about a population. Because inference involves uncertainty, probability is essential. Examples:
- surveying 200 voters to estimate support in a city
- testing a sample of bulbs to estimate the defect rate of the full production run
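The voter example can be sketched as follows. The count of 112 supporters is a made-up figure, and the interval uses the common normal approximation for a proportion, which is one assumed way (not the only one) to express the uncertainty.

```python
import math

# Hypothetical survey result: 112 of 200 sampled voters support the candidate.
n = 200
supporters = 112

p_hat = supporters / n                          # sample proportion
std_error = math.sqrt(p_hat * (1 - p_hat) / n)  # approximate standard error
z = 1.96                                        # approximate 95% normal quantile

ci_low = p_hat - z * std_error
ci_high = p_hat + z * std_error

print(f"estimate={p_hat:.3f}, approx. 95% interval=({ci_low:.3f}, {ci_high:.3f})")
```

The estimate comes with a range rather than a single exact value, which is the sense in which inference relies on probability.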
Population and sample
- Population: the complete set of all units of interest.
- Sample: a subset of the population used for study.
Example:
- Population: all students in your college
- Sample: 120 students chosen for a survey
Census vs sample survey
- Census: data from every unit in the population
- Sample survey: data from part of the population
Parameter vs statistic
- Parameter: a numerical summary of a population (usually unknown)
- Statistic: a numerical summary of a sample (used to estimate the parameter)
Example:
- Population mean height = parameter
- Sample mean height = statistic
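The distinction can be illustrated with a small simulation. The population below is generated artificially (5,000 heights from an assumed normal distribution), so the parameter is known here only because the data are simulated; in a real study it would usually be unknown.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated heights (cm) of 5,000 students.
population = [random.gauss(165, 8) for _ in range(5000)]

# Simple random sample of 120 students, drawn without replacement.
sample = random.sample(population, 120)

population_mean = mean(population)  # parameter
sample_mean = mean(sample)          # statistic, used to estimate the parameter

print(f"parameter={population_mean:.2f}, statistic={sample_mean:.2f}")
```

The two means are typically close but not identical, and a different sample would give a slightly different statistic.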
Purpose of statistical analysis
- If the goal is to describe and summarize observed information, the study is descriptive.
- If the goal is to use a sample and make conclusions about a population, the study is inferential.
- A descriptive study can be done on a sample or a population.
What is data?
Data are facts and figures collected for analysis, presentation, and interpretation. Data can be:
- numbers (exam marks, income)
- labels (department, blood group)
- text or categories (feedback type)
Why do we collect data?
We collect data to understand characteristics of groups such as people, places, or things. Typical goals:
- comparison (which section performed better)
- prediction (next month sales)
- decision-making (admit/reject, pass/fail)
Data collection
- Primary data: collected first-hand for the current study
- Secondary data: already collected by someone else (reports, government data, publications)
Typical primary sources:
- surveys
- experiments
- observations
Typical secondary sources:
- census reports
- institutional records
- research articles
Cases and variables
- Case (observation): a unit from which data are collected.
- Variable: a characteristic or attribute that can vary across cases.
Example (student records):
- Cases: each student.
- Variables: name, date of birth, marks, board, etc.
In a data table:
- Rows represent cases.
- Columns represent variables.
Note: a missing value and 0 are not the same.
Also note:
- variable names should be clear (attendance_percent, final_marks)
- keep units consistent in one column
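A small table of this kind might be represented as below. This is only a sketch: the student names and values are hypothetical, and a list of dictionaries is just one convenient way to hold rows and columns.

```python
# Each dict is one case (row); each key is one variable (column).
# All names and values are hypothetical.
students = [
    {"name": "Asha", "board": "CBSE", "attendance_percent": 92.0, "final_marks": 81},
    {"name": "Ravi", "board": "State", "attendance_percent": 78.5, "final_marks": 67},
    {"name": "Meena", "board": "ICSE", "attendance_percent": 88.0, "final_marks": 74},
]

n_cases = len(students)               # number of rows
variables = list(students[0].keys())  # column names

# Reading one column means taking the same variable from every case.
final_marks = [s["final_marks"] for s in students]

print(n_cases, variables, final_marks)
```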
Categorical and numerical variables
Categorical data (qualitative)
- Represents labels or groups.
Examples:
- gender
- branch (CSE, ECE, ME)
- grade (A, B, C)
Types:
- Nominal: categories with no natural order (blood group)
- Ordinal: categories with order (poor < average < good)
Numerical data (quantitative)
- Describes numerical properties of cases.
- Uses measured units.
Examples:
- age, height, weight
- number of siblings
- salary
Types:
- Discrete: countable values (0, 1, 2, ...)
- Continuous: measurable values on a scale (height, time, temperature)
Measurement units
The unit gives meaning to numerical values (for example, kilograms for weight, rupees for price, centimeters for height). Values in a numerical variable should be recorded in a common unit. Bad practice: mixing cm and m in one height column without conversion.
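One way to repair a mixed-unit column is to convert every value to a single unit before analysis. The records and the `TO_CM` lookup below are hypothetical, and the sketch assumes each value was stored together with its unit.

```python
# Hypothetical raw records: (value, unit) pairs with mixed units.
raw_heights = [(172, "cm"), (1.65, "m"), (158, "cm"), (1.80, "m")]

# Conversion factors from each unit to centimeters.
TO_CM = {"cm": 1.0, "m": 100.0}

# Convert everything to centimeters; round to tidy floating-point noise.
heights_cm = [round(value * TO_CM[unit], 1) for value, unit in raw_heights]

print(heights_cm)
```

After conversion, all values can be compared or averaged directly.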
Scales of measurement (important)
- Nominal: labels only
- Ordinal: rank/order, gaps between ranks not necessarily equal
- Interval: equal gaps, no true zero (temperature in Celsius)
- Ratio: equal gaps, true zero (height, weight, income)
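The practical difference between interval and ratio scales shows up when taking ratios. The sketch below uses made-up temperatures and weights. The Kelvin conversion is added here only to show a temperature scale that does have a true zero.

```python
# Celsius is an interval scale: differences are meaningful, ratios are not.
temp_a_c, temp_b_c = 10.0, 20.0

difference_c = temp_b_c - temp_a_c  # meaningful: b is 10 degrees warmer than a
naive_ratio = temp_b_c / temp_a_c   # computes to 2.0, but b is not "twice as hot"

# Kelvin has a true zero, so a ratio of Kelvin temperatures is meaningful.
kelvin_ratio = (temp_b_c + 273.15) / (temp_a_c + 273.15)

# Weight is a ratio scale: 80 kg really is twice 40 kg.
weight_ratio = 80.0 / 40.0

print(difference_c, naive_ratio, round(kelvin_ratio, 3), weight_ratio)
```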
Data classification
- Categorical
- Numerical
  - Discrete
  - Continuous
Cross-sectional and time-series data
- Cross-sectional data: data observed at one point in time across cases.
- Time-series data: data recorded over time for one case or unit.
- Time plot: a graph of time-series values in chronological order.
Examples:
- Cross-sectional: income of 50 households in March 2026
- Time-series: monthly electricity bill of one hostel from Jan-Dec 2025
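The two shapes of data can be sketched as follows. All household IDs and amounts are hypothetical, and only three entries are shown for each to keep the example short.

```python
# Cross-sectional: many cases at one point in time
# (hypothetical household incomes for a single month).
household_income = {"H01": 42000, "H02": 55500, "H03": 38750}

# Time-series: one unit observed at many time points
# (hypothetical monthly bills, deliberately stored out of order).
hostel_bill = [("2025-03", 19100), ("2025-01", 18200), ("2025-02", 17650)]

# A time plot needs chronological order; "YYYY-MM" strings sort by date.
chronological = sorted(hostel_bill)
months = [month for month, _ in chronological]
bills = [amount for _, amount in chronological]

print(len(household_income), months, bills)
```

In the cross-sectional case the order of households carries no meaning, while in the time-series case the order of months is essential.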
Common mistakes to avoid
- confusing population with sample
- treating category labels as numbers for arithmetic
- ignoring units while comparing values
- assuming sample results hold exactly for the population, ignoring uncertainty
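The second mistake in the list can be illustrated with a short sketch. The numeric coding of branches below (1 = CSE, 2 = ECE, 3 = ME) is a hypothetical coding scheme. The point is that arithmetic on such codes runs without error but produces a number with no interpretation.

```python
from collections import Counter
from statistics import mean

# Hypothetical coding of branch labels: 1 = CSE, 2 = ECE, 3 = ME.
branch_codes = [1, 1, 2, 3, 3, 3]
code_to_branch = {1: "CSE", 2: "ECE", 3: "ME"}

# This computes, but an "average branch" of about 2.17 means nothing.
meaningless_mean = mean(branch_codes)

# Counting categories is an appropriate summary for nominal data.
branch_counts = Counter(code_to_branch[code] for code in branch_codes)
most_common_branch, count = branch_counts.most_common(1)[0]

print(branch_counts, most_common_branch, count)
```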
