Mastering Data Cleaning in Pandas: A Step-by-Step Guide with Code Examples
Data cleaning is the most fundamental part of data analytics. Every data professional gets their hands dirty with raw data first, and only then dives deeper into further analysis and conclusions.
In the entire data-handling process, data cleaning can easily consume over 60 percent of your effort.
In this blog post we will learn different techniques for data cleaning using Python's Pandas library. I will cover everything from the basics to advanced techniques, with code examples.
You can practice with any Python code editor, or take help of Google Colaboratory: just copy the code from here, paste it into Colab, and execute it. So, without further ado, let's start with the world of data cleaning.
INTRODUCTION
Cleaning data is one of the most crucial steps in any data analysis process. If your data is messy, incomplete, or inconsistent, your results will not be reliable. Luckily, Pandas provides powerful tools to clean and preprocess your data effectively.
In this guide, we will explore different techniques of data cleaning in Pandas, along with detailed explanations and Python code examples. By the end, you will be well-equipped to handle real-world datasets like a pro!
Why is Data Cleaning Important?
Data cleaning ensures that your dataset is:
- Accurate: free from incorrect values.
- Complete: no missing data where it matters.
- Consistent: standardized formats and structures.
- Relevant: includes only necessary data.
Now, let's dive into the most common data cleaning techniques using Pandas!
1. Handling Missing Values
Real-Life Example:
Imagine you have a customer dataset, but some email addresses and ages are missing. You need to decide whether to fill, drop, or impute those missing values.
Code Example:
import pandas as pd
# Creating a sample DataFrame
data = {
'Customer_ID': [101, 102, 103, 104, 105],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Email': ['alice@example.com', 'bob@example.com', None, 'david@example.com', None],
'Age': [25, 30, None, 40, 22]
}
df = pd.DataFrame(data)
# Checking for missing values
df.isnull().sum()
# Filling missing values (assign back rather than calling inplace=True on a
# single column, which triggers chained-assignment warnings in newer pandas)
df['Email'] = df['Email'].fillna('Not Provided')
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
Output:
Customer_ID Name Email Age
0 101 Alice alice@example.com 25.0
1 102 Bob bob@example.com 30.0
2 103 Charlie Not Provided 29.25
3 104 David david@example.com 40.0
4 105 Eve Not Provided 22.0
Takeaway:
- fillna() replaces missing values with specified values.
- You can use the mean, median, or mode for numerical columns.
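To extend the takeaway above, here is a minimal sketch (with a made-up dataset) of the other two imputation choices: the median, which resists outliers better than the mean, and the mode, which suits categorical columns.

```python
import pandas as pd

# Hypothetical dataset with missing numeric and categorical values
df = pd.DataFrame({
    'Age': [25, 30, None, 40, 22],
    'City': ['Pune', 'Delhi', None, 'Pune', 'Pune']
})

# Median is more robust than the mean when extreme values are present
df['Age'] = df['Age'].fillna(df['Age'].median())

# Mode works for categorical columns; mode() returns a Series, so take [0]
df['City'] = df['City'].fillna(df['City'].mode()[0])

print(df)
```

Here the missing age becomes 27.5 (the median of 25, 30, 40, 22) and the missing city becomes 'Pune', the most frequent value.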
2. Removing Duplicates
Real-Life Example:
Imagine a sales dataset where some transactions were recorded twice. You need to remove duplicates to avoid skewed reports.
Code Example:
# Creating a sample DataFrame
data = {
'Order_ID': [1, 2, 2, 3, 4, 4, 5],
'Product': ['Laptop', 'Phone', 'Phone', 'Tablet', 'Monitor', 'Monitor', 'Keyboard'],
'Amount': [1000, 500, 500, 300, 200, 200, 150]
}
df = pd.DataFrame(data)
# Removing duplicates
df.drop_duplicates(inplace=True)
print(df)
Output:
Order_ID Product Amount
0 1 Laptop 1000
1 2 Phone 500
3 3 Tablet 300
4 4 Monitor 200
6 5 Keyboard 150
Takeaway:
- Use drop_duplicates() to remove duplicate rows.
- The inplace=True ensures changes are applied directly.
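drop_duplicates() also accepts subset and keep parameters, which matter when duplicate rows differ in some columns. A small sketch with hypothetical orders, where keep='last' retains the most recent amount recorded for each Order_ID:

```python
import pandas as pd

# Hypothetical orders where Order_ID 2 appears twice with an updated amount
df = pd.DataFrame({
    'Order_ID': [1, 2, 2, 3],
    'Amount':   [1000, 500, 550, 300]
})

# subset= restricts the duplicate check to the chosen columns;
# keep='last' keeps the last occurrence of each Order_ID
deduped = df.drop_duplicates(subset=['Order_ID'], keep='last')
print(deduped)
```

With the default keep='first' the 500 row would survive instead; choose based on whether earlier or later records are authoritative.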
3. Standardizing Data Formats
Real-Life Example:
Your dataset contains inconsistent date formats like YYYY-MM-DD, DD/MM/YYYY, and Month Day, Year. Standardizing these formats ensures smooth processing.
Code Example:
# Creating a sample DataFrame
data = {
'Date': ['2024-02-26', '26/02/2024', 'Feb 26, 2024', '2024/02/26']
}
df = pd.DataFrame(data)
# Converting all dates to a standard datetime format; format='mixed'
# (pandas >= 2.0) lets each entry be parsed with its own inferred format
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
print(df)
Output:
Date
0 2024-02-26
1 2024-02-26
2 2024-02-26
3 2024-02-26
Takeaway:
- pd.to_datetime() automatically converts different formats to a standard one.
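Two related options are worth knowing. When some entries cannot be parsed at all, errors='coerce' turns them into NaT instead of raising an exception; and when the layout is known in advance, passing an explicit format string is faster and safer. A small illustration with made-up values:

```python
import pandas as pd

# Hypothetical column mixing a valid date with an unparseable entry
dates = pd.Series(['2024-02-26', 'not a date'])

# errors='coerce' converts unparseable strings to NaT instead of raising
parsed = pd.to_datetime(dates, errors='coerce')
print(parsed)

# An explicit format avoids any ambiguity (here, day-first dates)
strict = pd.to_datetime(pd.Series(['26/02/2024']), format='%d/%m/%Y')
print(strict)
```

After coercion you can count the NaT values with parsed.isna().sum() to see how many entries failed to parse.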
4. Dealing with Outliers
Real-Life Example:
In a salary dataset, you notice an entry with $1,000,000, an obvious outlier that may distort the analysis.
Code Example:
# Creating a sample DataFrame
data = {
'Employee': ['John', 'Jane', 'Mike', 'Anna', 'Tom'],
'Salary': [50000, 55000, 60000, 1000000, 52000]
}
df = pd.DataFrame(data)
# Identifying outliers using IQR (Interquartile Range)
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
# Defining bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filtering out outliers
df = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]
print(df)
Output:
Employee Salary
0 John 50000
1 Jane 55000
2 Mike 60000
4 Tom 52000
Takeaway:
- The IQR method helps identify extreme values.
- Filtering outliers keeps your dataset accurate.
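Dropping rows is not the only option: you can also cap ("winsorize") extreme values at the IQR bounds, which keeps the row while taming the outlier. A minimal sketch reusing the same salary numbers:

```python
import pandas as pd

# Hypothetical salaries containing one extreme value
s = pd.Series([50000, 55000, 60000, 1000000, 52000])

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1

# clip() replaces values outside the bounds with the bound itself,
# instead of removing the row entirely
capped = s.clip(lower=Q1 - 1.5 * IQR, upper=Q3 + 1.5 * IQR)
print(capped)
```

For these values the upper bound is 72,000, so the $1,000,000 salary is capped at 72,000 while all other entries pass through unchanged. Capping is preferable when every row carries information you cannot afford to discard.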
Conclusion
Mastering data cleaning in Pandas will save you time and ensure high-quality analyses. Here's a quick recap:
- Handling Missing Data: use fillna() to replace missing values.
- Removing Duplicates: use drop_duplicates() to avoid redundant records.
- Standardizing Formats: convert dates and text to consistent formats.
- Dealing with Outliers: use statistical techniques like the IQR.
Next time you work with raw data, apply these techniques to clean it efficiently!
In this blog post you have learned some of the most popular data cleaning techniques. What's your favorite data cleaning trick? Let me know in the comments.