Learning outcomes
- explain what data means in statistics
- distinguish primary and secondary data
- understand why data is collected
- identify units, consistency, and missing values in a dataset
What is data in practice?
- Data are recorded observations about people, objects, or events.
- Data becomes useful only when it is:
- relevant
- accurate
- organized
- interpreted correctly
Why do we collect data?
- to compare groups
- to monitor performance
- to detect patterns
- to make decisions
- to predict or estimate future values
- Examples:
- comparing marks across sections
- monitoring monthly expenses
- analyzing defect rates in production
Sources of data
Primary data
- Collected first-hand for the current study.
- Methods:
- survey
- interview
- observation
- experiment
- Advantages:
- directly relevant
- controlled collection process
- Limitations:
- time-consuming
- costly
Secondary data
- Already collected by someone else.
- Sources:
- government reports
- research articles
- company records
- census tables
- Advantages:
- fast to obtain
- cheaper
- Limitations:
- may not match the current purpose exactly
- quality depends on the original source
Cases, values, and missing data
- Each case has one recorded value for each variable.
- Sometimes a value is missing.
- A missing value means the information was not recorded.
- 0 is a real recorded value if the quantity truly equals zero.
- Example:
- siblings = 0 means no siblings
- blank entry means unknown or not recorded
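The difference between a blank and a zero can be made concrete with a minimal Python sketch. The sibling counts below are made-up illustration data, and using None to stand for a blank entry is an assumption of this example, not a rule from the notes.

```python
# Hypothetical survey answers: number of siblings per student.
# None stands for a blank entry (not recorded); 0 is a real answer.
siblings = [2, 0, None, 1, 0, None, 3]

# Keep only the values that were actually recorded.
recorded = [v for v in siblings if v is not None]

missing_count = len(siblings) - len(recorded)  # blanks
zero_count = recorded.count(0)                 # genuine zeros

# Mean over recorded values only.
mean_recorded = sum(recorded) / len(recorded)

# Mistake to avoid: replacing blanks with 0 adds answers nobody gave
# and pulls the mean down.
as_zero = [0 if v is None else v for v in siblings]
mean_blanks_as_zero = sum(as_zero) / len(as_zero)

print("missing:", missing_count, "zeros:", zero_count)
print("mean (recorded only):", mean_recorded)
print("mean (blanks as 0):", round(mean_blanks_as_zero, 2))
```

Counting blanks and zeros separately keeps the two meanings apart before any summary is calculated.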
Units and consistency
- Numerical data must be recorded in a common unit.
- Bad practice:
- height mixed as 160 cm, 1.72 m, 170 cm in the same column
- Effect:
- comparison becomes misleading
- averages become invalid without conversion
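A short Python sketch shows why conversion has to come before averaging. It reuses the three heights from the bad-practice example. The helper name to_cm and the "value unit" text format are assumptions made for illustration.

```python
# Raw entries with mixed units, as in the bad-practice example.
raw_heights = ["160 cm", "1.72 m", "170 cm"]

# Conversion factors to the common unit (centimetres).
TO_CM = {"cm": 1.0, "m": 100.0}


def to_cm(entry: str) -> float:
    """Convert a 'value unit' string such as '1.72 m' to centimetres."""
    value, unit = entry.split()
    return float(value) * TO_CM[unit]


heights_cm = [to_cm(e) for e in raw_heights]
mean_cm = sum(heights_cm) / len(heights_cm)  # roughly 167.3 cm

# Mistake to avoid: averaging the bare numbers and ignoring the units.
bare_numbers = [float(e.split()[0]) for e in raw_heights]
naive_mean = sum(bare_numbers) / len(bare_numbers)  # roughly 110.6, meaningless

print(round(mean_cm, 1), round(naive_mean, 1))
```

The naive average is far below every actual height, which is the kind of misleading summary that mixed units produce.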
Good dataset habits
- clear variable names
- one kind of information per column
- same unit throughout
- consistent coding rules
- Example column names: height_cm, attendance_percent, final_marks
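These habits can also be checked mechanically. The following Python sketch assumes a dataset held as a list of dictionaries and uses None for missing values. The rows and the function name find_problems are hypothetical, chosen only to illustrate the idea.

```python
# Column names that carry the unit or meaning in the name itself.
COLUMNS = ("height_cm", "attendance_percent", "final_marks")

# Hypothetical rows; the third one breaks the "same unit, numbers only" habit.
rows = [
    {"height_cm": 160.0, "attendance_percent": 92.0, "final_marks": 71},
    {"height_cm": 172.0, "attendance_percent": None, "final_marks": 64},
    {"height_cm": "170 cm", "attendance_percent": 88.0, "final_marks": 80},
]


def find_problems(rows):
    """Return (row index, column, offending value) for each inconsistency."""
    problems = []
    for i, row in enumerate(rows):
        # Every row must have exactly the agreed columns.
        if set(row) != set(COLUMNS):
            problems.append((i, "columns", sorted(set(row) ^ set(COLUMNS))))
            continue
        for name in COLUMNS:
            value = row[name]
            # None marks a missing value; anything else must be a plain number.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if value is not None and not is_number:
                problems.append((i, name, value))
    return problems


problems = find_problems(rows)
print(problems)
```

Here the missing attendance value is accepted as missing, while the text entry "170 cm" is flagged because it mixes a unit into a numeric column.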
Exam hints and traps
- Primary does not mean “better” in every case; it means first-hand.
- Secondary data is still useful if the source is reliable.
- Missing value and zero must never be treated as the same.
- Data quality matters before calculation begins.
Quick practice
- Classify as primary or secondary:
- government census report
- marks collected from your own class survey
- Why is mixing cm and m in one column a problem?
- Explain the difference between blank attendance and 0 attendance.
Answer key
- Classification:
- census report: secondary
- own class survey: primary
- Units become inconsistent; direct comparison and summary are unreliable.
- Blank means not recorded; 0 means the recorded value is zero.
