Learning outcomes

  • explain what data means in statistics
  • distinguish primary and secondary data
  • understand why data is collected
  • identify units, consistency, and missing values in a dataset

What is data in practice?

  • Data consists of recorded observations about people, objects, or events.
  • Data becomes useful only when it is:
    • relevant
    • accurate
    • organized
    • interpreted correctly

Why do we collect data?

  • to compare groups
  • to monitor performance
  • to detect patterns
  • to make decisions
  • to predict or estimate future values
Examples:
  • comparing marks across sections
  • monitoring monthly expenses
  • analyzing defect rates in production

Sources of data

Primary data

  • Collected first-hand for the current study.
  • Methods:
    • survey
    • interview
    • observation
    • experiment
Advantages:
  • directly relevant
  • controlled collection process
Limitations:
  • time-consuming
  • costly

Secondary data

  • Already collected by someone else.
  • Sources:
    • government reports
    • research articles
    • company records
    • census tables
Advantages:
  • fast to obtain
  • cheaper
Limitations:
  • may not match the current purpose exactly
  • quality depends on original source

Cases, values, and missing data

  • Each case should have one recorded value for each variable.
  • Sometimes a value is missing, so the case has no recorded value for that variable.
Important distinction:
  • missing value means information was not recorded
  • 0 is a real recorded value if the quantity truly equals zero
Example:
  • siblings = 0 means no siblings
  • blank entry means unknown or not recorded
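
The distinction can be checked with a small calculation. The sketch below is illustrative only: the sibling counts are made-up values, and using None to stand for a blank entry is an assumption of this example, not a rule from the notes.

```python
from statistics import mean

# Hypothetical sibling counts for five students (invented for illustration).
# None stands for a blank entry: the value was not recorded.
siblings = [2, 0, None, 1, 0]

# Correct handling: leave out the missing entry, but keep the genuine zeros.
recorded = [s for s in siblings if s is not None]
mean_recorded = mean(recorded)

# Incorrect handling: replacing the blank with 0 invents an observation
# and pulls the mean down.
blank_as_zero = [0 if s is None else s for s in siblings]
mean_blank_as_zero = mean(blank_as_zero)

print(len(recorded), mean_recorded)            # 4 recorded cases
print(len(blank_as_zero), mean_blank_as_zero)  # 5 cases, one of them invented
```

With these values the mean over the four recorded cases is 0.75, while treating the blank as zero gives 0.6 over five cases.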

Units and consistency

  • All values of a numerical variable must be recorded in the same unit.
  • Bad practice:
    • height mixed as 160 cm, 1.72 m, 170 cm in the same column
Why this matters:
  • comparison becomes misleading
  • averages become invalid without conversion
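
A minimal sketch of the conversion step, using the three heights from the bad-practice example. Storing each entry as a (value, unit) pair and the helper name to_cm are assumptions made for this illustration.

```python
from statistics import mean

# The mixed column from the example: 160 cm, 1.72 m, 170 cm.
# Each entry is stored as a (value, unit) pair for this sketch.
heights = [(160, "cm"), (1.72, "m"), (170, "cm")]

# Conversion factors to centimetres.
TO_CM = {"cm": 1, "m": 100}

def to_cm(value, unit):
    """Convert a height to centimetres; raises KeyError for an unknown unit."""
    return value * TO_CM[unit]

# Averaging the raw numbers ignores the units and gives a meaningless result.
raw_mean = mean(value for value, _ in heights)

# Converting first puts every value on the same scale.
heights_cm = [to_cm(value, unit) for value, unit in heights]
cm_mean = mean(heights_cm)

print(round(raw_mean, 1), round(cm_mean, 1))
```

The raw mean of about 110.6 is not a height in any unit; after conversion the mean is about 167.3 cm.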

Good dataset habits

  • clear variable names
  • one kind of information per column
  • same unit throughout
  • consistent coding rules
Examples:
  • height_cm
  • attendance_percent
  • final_marks
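
These habits can be turned into a simple check before any calculation. The sketch below is one possible approach, not a standard procedure: the records are made-up, and the function name check_record and the 0 to 100 range rule for a percentage are assumptions of this example.

```python
# Hypothetical records using the variable names from the examples above.
records = [
    {"height_cm": 160, "attendance_percent": 92, "final_marks": 71},
    {"height_cm": 172, "attendance_percent": None, "final_marks": 64},
    {"height_cm": 170, "attendance_percent": 105, "final_marks": 58},
]

EXPECTED_COLUMNS = {"height_cm", "attendance_percent", "final_marks"}

def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    if set(record) != EXPECTED_COLUMNS:
        problems.append("unexpected or missing columns")
    attendance = record.get("attendance_percent")
    if attendance is None:
        # None means not recorded; it is reported, not replaced by 0.
        problems.append("attendance_percent missing")
    elif not 0 <= attendance <= 100:
        # A percentage outside 0..100 breaks the coding rule.
        problems.append("attendance_percent out of range")
    return problems

report = [check_record(r) for r in records]
for row_number, problems in enumerate(report, start=1):
    print(row_number, problems)
```

The first record passes, the second is flagged as missing attendance, and the third is flagged because 105 is not a valid percentage.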

Exam hints and traps

  • Primary does not mean “better” in every case; it means first-hand.
  • Secondary data is still useful if the source is reliable.
  • A missing value and zero must never be treated as the same thing.
  • Check data quality (units, missing values, coding) before any calculation begins.

Quick practice

  1. Classify as primary or secondary:
    • government census report
    • marks collected from your own class survey
  2. Why is mixing cm and m in one column a problem?
  3. Explain the difference between blank attendance and 0 attendance.

Answer key

  1. Classification:
    • census report: secondary
    • own class survey: primary
  2. Units become inconsistent; direct comparison and summary are unreliable.
  3. Blank means not recorded; 0 means recorded value is zero.