ML Fundamentals

Handling Missing Data (Imputation Techniques)

Learning Outcome

1. Identify non-standard missing values (like ?) in raw datasets.

2. Differentiate between standardizing data and imputing data.

3. Apply Mean Imputation for continuous variables (e.g., normalized-losses).

4. Apply Mode Imputation for categorical variables (e.g., num-of-doors).

5. Judge when to drop data versus when to fix it.


The "Hidden" Missing Data

In datasets, some values are missing.

Normally, missing values look like this:

  • NaN → This is the standard missing value.
  • Libraries like Pandas automatically recognize NaN as missing data.

But in some datasets, missing values are written differently, like:

  • ? (question mark)
  • Pandas reads this as normal text, not a missing value, unless we tell it otherwise.

Key Idea:

In the file "missing_value_dataset.csv", the missing values are written as ?

So Python will not understand these as missing data by default.

We need to convert them into NaN so Python can handle them properly.
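The conversion can be sketched in Pandas as follows. This is a minimal illustration, not the course's own notebook: the CSV text is a made-up sample held in memory (the real "missing_value_dataset.csv" is not shown here), and the column names are borrowed from the learning outcomes.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample standing in for a raw CSV file; "?" marks unknown values.
raw_csv = """normalized-losses,num-of-doors,price
164,four,13950
?,two,16500
158,?,17710
?,four,?
"""

# Read with no hint: "?" is kept as ordinary text, so nothing counts as missing.
df_raw = pd.read_csv(io.StringIO(raw_csv))
print(df_raw.isna().sum().sum())   # number of missing values Pandas can see

# Option 1: tell Pandas at load time that "?" means missing.
df = pd.read_csv(io.StringIO(raw_csv), na_values=["?"])

# Option 2: convert after loading, then restore the numeric type.
df_fixed = df_raw.replace("?", np.nan)
df_fixed["price"] = pd.to_numeric(df_fixed["price"])

print(df.isna().sum())             # missing values per column after the fix
```

Option 1 is usually the simpler choice because numeric columns are parsed as numbers straight away. With Option 2, a column that contained "?" stays as text until it is converted.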

The Strategy Map

When we find missing values in a dataset, we must decide:

What should we do with the empty or unknown data?

There are two main options.

Drop Data (The Nuclear Option)

Replace Data (Imputation)

Drop Data (The Nuclear Option)

This means removing data completely.

Two ways to drop:

Drop the whole row

  • If a row has missing values, delete that entire row.
  • Problem: You lose the useful information in that row's other columns.

Drop the whole column

  • If a column has many missing values, remove that column.
  • Problem: You may lose an important feature.

Example:

If a dataset has 1,000 rows and you drop 300 rows, you lose 30% of your data.
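Both ways of dropping can be tried on a small table. The sketch below uses made-up rows, and the 50% cut-off for "many missing values" is an arbitrary assumption chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "make": ["Audi", "BMW", "Honda", "Toyota", "Mazda"],
    "normalized-losses": [164.0, np.nan, np.nan, np.nan, 115.0],
    "price": [13950.0, 16500.0, 7295.0, np.nan, 6795.0],
})

# Drop the whole row: any row with at least one NaN is removed.
rows_dropped = df.dropna(axis=0)
print(len(df), "rows ->", len(rows_dropped), "rows")

# Drop the whole column: any column with at least one NaN is removed.
cols_dropped = df.dropna(axis=1)
print(list(cols_dropped.columns))

# Gentler variant: drop only columns where more than half the values are missing.
missing_share = df.isna().mean()          # fraction of NaN per column
kept = df.loc[:, missing_share <= 0.5]
print(list(kept.columns))
```

Checking the share of missing values per column before dropping shows how much data each choice would cost.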

Replace Data (Imputation)

This means filling the missing values with something reasonable.

Common ways to replace:

Replace with Mean (Average)

Used for numeric data.

Example:

Prices: 10, 20, 30, ?
Mean = (10 + 20 + 30) / 3 = 20
Replace ? with 20

Replace with Mode (Most Frequent)

Used for categorical data (like color, fuel type).

Example:

Fuel type: Petrol, Diesel, Petrol, ?
Mode = Petrol
Replace ? with Petrol
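The choice between mean and mode can be written as one small helper. The function name impute_column is hypothetical (it is not part of Pandas), and the sketch assumes the ? markers have already been converted to NaN.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def impute_column(col: pd.Series) -> pd.Series:
    """Fill NaN with the mean for numeric columns, otherwise with the mode."""
    if is_numeric_dtype(col):
        fill_value = col.mean()           # NaN values are skipped by default
    else:
        fill_value = col.mode().iloc[0]   # first value is used if several tie
    return col.fillna(fill_value)


# The two examples from the slide, with ? already converted to NaN.
prices = pd.Series([10, 20, 30, np.nan])
fuel = pd.Series(["Petrol", "Diesel", "Petrol", np.nan])

print(impute_column(prices).tolist())   # [10.0, 20.0, 30.0, 20.0]
print(impute_column(fuel).tolist())     # ['Petrol', 'Diesel', 'Petrol', 'Petrol']
```

The helper would fail on a text column with no known values at all, since there is no mode to pick. Such a column is a candidate for dropping instead.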

The Golden Rule:

If the Target (what we predict) is missing → DROP.

If a Feature (what we use to predict) is missing → IMPUTE.

Imputing with Mean (Continuous Data)

The Price column in the dataset has missing values.

Some cells are empty (NaN).

This column contains numbers (car prices) that can take any value within a range, not text.

So it is called continuous data.

Instead of deleting the rows, we:

  • Calculate the average of the existing values.
  • Fill the missing values with that average.

Why do we use the average?

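The average represents the typical value of the column.

It is a neutral guess: filling with the mean leaves the column's average unchanged.

So the overall level of the data stays balanced, although the filled values make the column look slightly less spread out than it really is.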
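Here is a minimal sketch of mean imputation on a DataFrame column. It uses made-up values and the normalized-losses column named in the learning outcomes. The same pattern would apply to Price only when price is a feature and not the target.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame(
    {"normalized-losses": [164.0, np.nan, 158.0, np.nan, 192.0, 115.0]}
)
col = "normalized-losses"

# Step 1: calculate the average of the existing values (NaN is ignored).
mean_before = df[col].mean()    # (164 + 158 + 192 + 115) / 4 = 157.25
std_before = df[col].std()

# Step 2: fill the missing values with that average.
df[col] = df[col].fillna(mean_before)

print(df[col].tolist())
print("mean:", mean_before, "->", df[col].mean())   # unchanged
print("std: ", std_before, "->", df[col].std())     # smaller after filling
```

The printed standard deviation shrinks because every filled cell sits exactly on the average.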

Imputing with Mode (Categorical Data)

The column Fuel_Type has missing values.

This column contains text categories, like:

  • Petrol
  • Diesel

So it is called categorical data.

Why can't we use the average?

  • For numbers, we can calculate the average.
  • But fuel type is text, not numbers, so there is nothing to add or divide:

  • (Petrol + Diesel) ÷ 2 ❌
  • Petrol-Diesel mix ❌

So the average does not make sense for categorical data.

Instead of the average, we use the mode.

Mode = the most common value in the column.

Example:

Petrol , ? , Diesel , Petrol , Diesel , ? , Diesel

Counts:

  • Petrol → 2 times
  • Diesel → 3 times

Mode = Diesel

So we replace both missing values with Diesel.
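The Fuel_Type example can be reproduced in Pandas. This is a minimal sketch that assumes the ? markers have already been converted to NaN.

```python
import numpy as np
import pandas as pd

# The seven values from the example, with ? replaced by NaN.
fuel = pd.Series(
    ["Petrol", np.nan, "Diesel", "Petrol", "Diesel", np.nan, "Diesel"],
    name="Fuel_Type",
)

# Count each category; NaN is left out of the counts by default.
counts = fuel.value_counts()
print(counts)

# mode() returns a Series because ties are possible; take the first entry.
mode_value = fuel.mode().iloc[0]

filled = fuel.fillna(mode_value)
print(filled.tolist())
```

With these values the counts are Diesel 3 and Petrol 2, so both gaps are filled with Diesel.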

When to Drop

Target = the value we want the model to predict.

The price column has some missing values.

But price is the target variable.

If we want to build a model to predict car price, then price is the answer.

If the price is missing, we have two choices:

  • Fill it with an average or guess.
  • Remove that row.

If we fill it with a guess, we are giving the model a fake answer.

That is like:

  • A teacher giving students the wrong answer key.
  • Students learn the wrong thing.

So the model will learn incorrect patterns.

Never guess the target value.

So:

  • If target is missing → drop the row.

Example: if Honda's price is missing and we fill it with an average, it becomes a fake value.

So we remove that row.
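The rule can be sketched with a small made-up table in which the Honda row has no price. The column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows; price is the target we want to predict.
df = pd.DataFrame({
    "make": ["Audi", "Honda", "BMW", "Toyota"],
    "Fuel_Type": ["Petrol", "Diesel", np.nan, "Petrol"],
    "price": [13950.0, np.nan, 16500.0, 6795.0],
})
target = "price"

# Target missing -> drop the row (only the target column is checked).
cleaned = df.dropna(subset=[target]).reset_index(drop=True)

# Feature missing -> impute, using only the rows that were kept.
fuel_mode = cleaned["Fuel_Type"].mode().iloc[0]
cleaned["Fuel_Type"] = cleaned["Fuel_Type"].fillna(fuel_mode)

print(cleaned)
```

Using subset keeps rows whose only gap is in a feature, so the BMW row survives and gets its fuel type imputed.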

Summary

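1. Raw datasets can hide missing values behind markers like ?; convert them to NaN first.

2. Dropping rows or columns is simple, but it throws information away.

3. Fill missing values in continuous features with the mean.

4. Fill missing values in categorical features with the mode.

5. If the target is missing, drop the row; never guess the target.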


Handling Missing Data (Imputation Techniques)

By Content ITV
