Mastering Data Cleaning in Pandas: A Step-by-Step Guide with Code Examples


Data cleaning is the most fundamental part of data analytics. Every data professional gets their hands dirty with raw data first, and only then dives deeper into analysis and conclusions.

In the entire data handling process, data cleaning can take up more than 60 percent of your effort.

In this blog post we will learn different techniques for data cleaning using Python's Pandas library, covering everything from the basics to advanced topics with code examples.

You can practice with any Python code editor, or use Google Colaboratory: just copy the code from here, paste it into Colaboratory, and execute it. So, without further ado, let's enter the world of data cleaning.


INTRODUCTION

Cleaning data is one of the most crucial steps in any data analysis process. If your data is messy, incomplete, or inconsistent, your results will not be reliable. Luckily, Pandas provides powerful tools to clean and preprocess your data effectively.

In this guide, we will explore different techniques of data cleaning in Pandas, along with detailed explanations and Python code examples. By the end, you will be well-equipped to handle real-world datasets like a pro! πŸš€


πŸ”Ή Why is Data Cleaning Important?

Data cleaning ensures that your dataset is:

  • βœ… Accurate – Free from incorrect values.
  • βœ… Complete – No missing data where it matters.
  • βœ… Consistent – Standardized formats and structures.
  • βœ… Relevant – Includes only necessary data.

Now, let’s dive into the most common data cleaning techniques using Pandas!


1️⃣ Handling Missing Values

πŸ“Œ Real-Life Example:

Imagine you have a customer dataset, but some email addresses and ages are missing. You need to decide whether to fill, drop, or impute those missing values.

βœ… Code Example:

import pandas as pd

# Creating a sample DataFrame
data = {
    'Customer_ID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Email': ['alice@example.com', 'bob@example.com', None, 'david@example.com', None],
    'Age': [25, 30, None, 40, 22]
}
df = pd.DataFrame(data)

# Checking for missing values per column
print(df.isnull().sum())

# Filling missing values (assignment is preferred over inplace=True on a
# column selection, which is deprecated in recent pandas versions)
df['Email'] = df['Email'].fillna('Not Provided')
df['Age'] = df['Age'].fillna(df['Age'].mean())

print(df)

πŸ” Output:

   Customer_ID     Name              Email    Age
0          101    Alice  alice@example.com  25.00
1          102      Bob    bob@example.com  30.00
2          103  Charlie       Not Provided  29.25
3          104    David  david@example.com  40.00
4          105      Eve       Not Provided  22.00

πŸ”₯ Takeaway:

  • fillna() replaces missing values with specified values.
  • You can use the mean, median, or mode for numerical columns.
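The takeaway mentions dropping rows and using the median or mode as alternatives; here is a minimal sketch of those strategies on made-up data (the column names and values are illustrative, not from the dataset above):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, None, 40, 22],
    'City': ['Pune', 'Delhi', 'Delhi', None, 'Delhi'],
})

# Option 1: drop every row that contains any missing value
dropped = df.dropna()

# Option 2: the median is more robust to outliers than the mean
df['Age'] = df['Age'].fillna(df['Age'].median())

# Option 3: the mode (most frequent value) suits categorical columns
df['City'] = df['City'].fillna(df['City'].mode()[0])

print(df)
```

Which strategy is right depends on the column: dropping rows loses information, so imputation is usually preferred when only a small fraction of values is missing.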


2️⃣ Removing Duplicates

πŸ“Œ Real-Life Example:

Imagine a sales dataset where some transactions were recorded twice. You need to remove duplicates to avoid skewed reports.

βœ… Code Example:

# Creating a sample DataFrame
data = {
    'Order_ID': [1, 2, 2, 3, 4, 4, 5],
    'Product': ['Laptop', 'Phone', 'Phone', 'Tablet', 'Monitor', 'Monitor', 'Keyboard'],
    'Amount': [1000, 500, 500, 300, 200, 200, 150]
}
df = pd.DataFrame(data)

# Removing duplicates
df.drop_duplicates(inplace=True)

print(df)

πŸ” Output:

   Order_ID   Product  Amount
0         1    Laptop    1000
1         2     Phone     500
3         3    Tablet     300
4         4   Monitor     200
6         5  Keyboard     150

πŸ”₯ Takeaway:

  • Use drop_duplicates() to remove duplicate rows.
  • inplace=True applies the changes to the DataFrame directly instead of returning a modified copy.
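By default, rows only count as duplicates when every column matches. drop_duplicates() also accepts subset and keep parameters for finer control; a quick sketch with hypothetical data where the same order was logged twice with different amounts:

```python
import pandas as pd

df = pd.DataFrame({
    'Order_ID': [1, 2, 2, 3],
    'Product': ['Laptop', 'Phone', 'Phone', 'Tablet'],
    'Amount': [1000, 500, 450, 300],
})

# subset= defines which columns identify a duplicate;
# keep='last' retains the most recent record for each Order_ID
deduped = df.drop_duplicates(subset=['Order_ID'], keep='last')
print(deduped)
```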


3️⃣ Standardizing Data Formats

πŸ“Œ Real-Life Example:

Your dataset contains inconsistent date formats like YYYY-MM-DD, DD/MM/YYYY, and Month Day, Year. Standardizing these formats ensures smooth processing.

βœ… Code Example:

# Creating a sample DataFrame
data = {
    'Date': ['2024-02-26', '26/02/2024', 'Feb 26, 2024', '2024/02/26']
}
df = pd.DataFrame(data)

# Converting all dates to a standard datetime format
# (pandas 2.x raises on mixed formats unless format='mixed' is passed)
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
print(df)

πŸ” Output:

         Date
0 2024-02-26
1 2024-02-26
2 2024-02-26
3 2024-02-26

πŸ”₯ Takeaway:

  • pd.to_datetime() converts many different formats to a standard datetime type; when a column mixes formats, recent pandas versions require format='mixed'.
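Real-world data also tends to contain strings that are not dates at all. A small sketch (with a made-up invalid entry) of the errors='coerce' option, which converts unparseable values to NaT instead of raising an exception:

```python
import pandas as pd

dates = pd.Series(['2024-02-26', 'not a date', '2024-03-01'])

# errors='coerce' turns unparseable entries into NaT (missing timestamps)
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed)
```

The resulting NaT values can then be handled with the missing-value techniques from section 1.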


4️⃣ Dealing with Outliers

πŸ“Œ Real-Life Example:

In a salary dataset, you notice an entry of $1,000,000, an obvious outlier that may distort the analysis.

βœ… Code Example:

# Creating a sample DataFrame
data = {
    'Employee': ['John', 'Jane', 'Mike', 'Anna', 'Tom'],
    'Salary': [50000, 55000, 60000, 1000000, 52000]
}
df = pd.DataFrame(data)

# Identifying outliers using IQR (Interquartile Range)
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1

# Defining bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filtering out outliers
df = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]
print(df)

πŸ” Output:

  Employee  Salary
0     John   50000
1     Jane   55000
2     Mike   60000
4      Tom   52000

πŸ”₯ Takeaway:

  • The IQR method helps identify extreme values.
  • Filtering outliers keeps your dataset accurate.
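Dropping rows is not the only option: when you want to keep every record, you can instead cap extreme values at the IQR bounds (often called winsorizing). A sketch using clip() on the same salary values:

```python
import pandas as pd

salaries = pd.Series([50000, 55000, 60000, 1000000, 52000])

# IQR bounds, computed the same way as in the example above
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() caps values at the bounds instead of dropping the rows
capped = salaries.clip(lower=lower, upper=upper)
print(capped)
```

Here the $1,000,000 entry is pulled down to the upper bound while the row itself (and the rest of its data) is preserved.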


🎯 Conclusion

Mastering data cleaning in Pandas will save you time and ensure high-quality analyses. Here’s a quick recap:

  • Handling Missing Data β†’ Use fillna() to replace missing values.
  • Removing Duplicates β†’ Use drop_duplicates() to avoid redundant records.
  • Standardizing Formats β†’ Convert dates and text to consistent formats.
  • Dealing with Outliers β†’ Use statistical techniques like IQR.
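The recap's point about consistent text formats was not demonstrated above; a minimal sketch with made-up city names, using the pandas string accessor:

```python
import pandas as pd

df = pd.DataFrame({'City': ['  Pune', 'pune ', 'PUNE', 'Delhi']})

# Strip stray whitespace and normalize casing so identical values compare equal
df['City'] = df['City'].str.strip().str.title()
print(df['City'].unique())
```

Without this step, groupby or value_counts would treat '  Pune', 'pune ', and 'PUNE' as three different cities.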

Next time you work with raw data, apply these techniques to clean it efficiently! πŸš€


πŸ“’ In this blog post you have learned some of the most popular data cleaning techniques. What's your favorite data cleaning trick? Let me know in the comments.
