Learn the key features of Python pandas with coding examples, then practice on the Loan Recovery dataset. Download the dataset and follow along with the code shared below.
Introduction
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools. Built on top of NumPy, it offers intuitive data structures and functions essential for handling structured data.
Pandas is widely used in data science, analytics, finance, statistics, and machine learning applications. It simplifies complex data processing tasks, making Python a powerful language for data analysis.
“Pandas is designed for efficient data manipulation and analysis.”
Mark zukewin
What Are the Key Features of Pandas?
- Data Structures: Provides Series (1D) and DataFrame (2D) objects.
- Data Cleaning: Helps turn messy or inconsistent data into a consistent structure that can later be analyzed to achieve business goals.
- Handling Missing Values: Provides multiple ways to handle missing data, such as forward fill, backward fill, and filling with a fixed value.
- Data Transformation: Supports filtering, sorting, and reshaping data.
- File Handling: Reads and writes multiple file formats (CSV, Excel, JSON, SQL, etc.).
- Grouping & Aggregation: Enables data summarization using group operations.
- Time-Series Analysis: Supports date/time indexing and time-based functions.
Installing Pandas
To make sure Pandas is installed in your Python environment, run:
pip install pandas
Alternatively, if you use Anaconda, run the following in the conda terminal:
conda install pandas
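Once installation finishes, a quick way to confirm it worked is to import the library and print its version. This is a minimal check, assuming the script runs in the same environment where the install command was executed:

```python
import pandas as pd

# If this import succeeds, pandas is available in the current environment
installed_version = pd.__version__
print("pandas version:", installed_version)
```

If the import raises ModuleNotFoundError, the install most likely went into a different Python environment.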
Pandas Data Structures
Series
A Pandas Series is a one-dimensional labeled array, similar to a column in an Excel sheet. To create one, first import the library with the import keyword, then call pd.Series:
import pandas as pd
# Creating a Series
s = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])
print(s)
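Because a Series is labeled, values can be looked up by label as well as by position. The short sketch below reuses the same Series to show both, plus a boolean filter; the variable names are only illustrative:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

value_by_label = s['B']        # label-based lookup
value_by_position = s.iloc[0]  # position-based lookup
subset = s[s > 20]             # boolean filter keeps the rows labeled C and D

print(value_by_label, value_by_position)
print(subset)
```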
DataFrame
A DataFrame is a two-dimensional table with labeled rows and columns, similar to an Excel spreadsheet. This structured arrangement makes data analysis simpler and easier to follow.
# Creating a DataFrame
data = {
    'Name': ['Ram', 'Jack', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New Delhi', 'Bengaluru', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Reading and Writing Data
Pandas supports multiple file formats and provides functions that make them easy to read, including Excel spreadsheets and CSV (comma-separated values) files. It can also convert a file from one format to another. Let’s implement this in Python.
Reading CSV Files:
df = pd.read_csv('data.csv')
print(df.head())
Writing to CSV Files:
df.to_csv('output.csv', index=False)
Reading Excel Files:
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
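To illustrate converting one format to another, here is a small round trip that writes a CSV file, reads it back, and saves it as JSON. It is only a sketch: the file names are hypothetical and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['Ram', 'Jack'], 'Age': [25, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'people.csv')    # hypothetical file name
    json_path = os.path.join(tmp_dir, 'people.json')  # hypothetical file name

    df.to_csv(csv_path, index=False)             # write CSV without the index column
    loaded = pd.read_csv(csv_path)               # read the CSV back
    loaded.to_json(json_path, orient='records')  # convert to a JSON list of records
    from_json = pd.read_json(json_path)          # read the JSON file

print(from_json)
```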
Exploring Data
Understanding the dataset is crucial before analysis:
df.info() # Prints column data types and non-null counts (info() prints directly and returns None)
print(df.describe()) # Summary statistics for numerical columns
print(df.columns) # List of column names
Data Selection and Filtering
Selecting Columns:
print(df['Name']) # Selecting a single column
print(df[['Name', 'City']]) # Selecting multiple columns
Filtering Rows:
filtered_df = df[df['Age'] > 28] # Filtering rows based on condition
print(filtered_df)
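Filters can also combine several conditions. The sketch below rebuilds the small example table so it runs on its own, then shows a combined condition and a .loc selection; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ram', 'Jack', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New Delhi', 'Bengaluru', 'Chicago'],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
older_not_chicago = df[(df['Age'] > 28) & (df['City'] != 'Chicago')]

# .loc picks rows by condition and columns by name in one step
names_over_28 = df.loc[df['Age'] > 28, 'Name']

print(older_not_chicago)
print(names_over_28.tolist())
```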
Data Cleaning and Handling Missing Values
Missing values can be managed using:
df_dropped = df.dropna() # Returns a new DataFrame without rows that contain missing values
df_filled = df.fillna({'Age': df['Age'].mean()}) # Returns a new DataFrame with missing Age filled by the mean
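Forward fill and backward fill are two other common options. Here is a minimal sketch on a made-up Series of ages, so the values are purely illustrative:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25.0, np.nan, 35.0, np.nan])

forward_filled = ages.ffill()   # copy the previous valid value forward
backward_filled = ages.bfill()  # copy the next valid value backward

print(forward_filled.tolist())
print(backward_filled.tolist())
```

Note that backward fill leaves the last value missing here, because there is no later value to copy from.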
Data Manipulation
Adding a New Column:
df['Salary'] = [60000, 75000, 82000]
Applying Functions:
def categorize_age(age):
    return 'Young' if age < 30 else 'Senior'
df['Category'] = df['Age'].apply(categorize_age)
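For a simple two-way rule like this one, a vectorized alternative to apply is NumPy's where function, which evaluates the condition for the whole column at once. This is a sketch of that alternative, not a replacement for the approach above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35]})

# np.where(condition, value_if_true, value_if_false) works on the full column
df['Category'] = np.where(df['Age'] < 30, 'Young', 'Senior')
print(df)
```

On large tables the vectorized form is usually faster, while apply stays handy for logic that is hard to express as a single condition.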
Grouping and Aggregation
Summarizing data using groupby:
print(df.groupby('City')['Salary'].mean()) # Average salary per city
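A group can be summarized with several statistics at once through agg. The sketch below uses a small made-up table in which one city appears twice, so the grouping has a visible effect:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New Delhi', 'New Delhi', 'Chicago'],
    'Salary': [60000, 80000, 82000],
})

# One row per city, one column per statistic
summary = df.groupby('City')['Salary'].agg(['mean', 'max', 'count'])
print(summary)
```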
Merging and Joining DataFrames
Combining datasets is common in data analysis.
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Department': ['HR', 'IT']})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [70000, 80000]})
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)
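By default, merge keeps only the keys found in both tables (an inner join). The how parameter changes that. In the sketch below, the extra employee Carol is a made-up row added to show the difference:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Department': ['HR', 'IT', 'Sales']})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [70000, 80000]})

inner = pd.merge(df1, df2, on='Name')             # default: only names present in both
left = pd.merge(df1, df2, on='Name', how='left')  # keep every row of df1

print(inner)
print(left)
```

In the left join, Carol is kept and her Salary is missing (NaN), because df2 has no matching row.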
Time Series Analysis
Pandas excels in handling time-series data:
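df['Date'] = pd.to_datetime(df['Date']) # Convert to datetime format (assumes df has a 'Date' column)
df.set_index('Date', inplace=True) # Set datetime column as index
print(df.resample('M').mean(numeric_only=True)) # Monthly mean of numeric columns ('ME' is the preferred alias in pandas 2.2+)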
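Here is a self-contained version of the same idea on a tiny made-up sales table. The column names and values are hypothetical, and the 'MS' (month start) alias is used so each monthly group is labeled with the first day of its month:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-05', '2024-01-20', '2024-02-10'],
    'Sales': [100, 200, 300],
})
df['Date'] = pd.to_datetime(df['Date'])  # strings become datetime values
df = df.set_index('Date')                # resample needs a datetime index

# Group rows by calendar month and average the Sales column
monthly_mean = df.resample('MS')['Sales'].mean()
print(monthly_mean)
```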
Let’s Learn Pandas with Loan Recovery Dataset
1. Load and Explore the Dataset
import pandas as pd
# Load dataset
df = pd.read_csv('loan_recovery_data.csv')
# Display basic information
df.info() # info() prints its report directly and returns None
# Show first five rows
print(df.head())
- The info() method lists column names, data types, and non-null counts, which reveal missing values.
- The head() method displays the first five rows to show the structure.
2. Check for Missing Values
print(df.isnull().sum())
# Fill missing values with mean (for numerical columns)
df.fillna(df.mean(numeric_only=True), inplace=True) # numeric_only skips text columns such as Loan_Status
- isnull().sum() helps identify columns with missing values.
- fillna() with the column means fills missing values in numerical columns.
3. Basic Statistics and Summary
print(df.describe()) # Summary statistics
print(df['Loan_Status'].value_counts()) # Count of loan status categories
- describe() provides insights like mean, min, and max values for numerical columns.
- value_counts() shows the distribution of loan statuses.
4. Analyzing Loan Recovery Rates
# Calculate average recovered amount for different loan statuses
recovery_rate = df.groupby('Loan_Status')['Recovered_Amount'].mean()
print(recovery_rate)
- Groups data by Loan_Status and calculates the mean recovered amount.
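The mean recovered amount is an absolute figure. To express recovery as a share of what was lent, one option is to divide total recovered by total loaned within each status. The sketch below uses a few hypothetical rows; the 'Paid' status and all amounts are assumptions for illustration, not values from the actual dataset:

```python
import pandas as pd

# Hypothetical rows; the real dataset's statuses and amounts will differ
loans = pd.DataFrame({
    'Loan_Status': ['Paid', 'Paid', 'Default', 'Default'],
    'Loan_Amount': [1000, 3000, 2000, 2000],
    'Recovered_Amount': [1000, 3000, 500, 1500],
})

# Sum both amounts per status, then divide recovered by loaned
totals = loans.groupby('Loan_Status')[['Loan_Amount', 'Recovered_Amount']].sum()
recovery_ratio = totals['Recovered_Amount'] / totals['Loan_Amount']
print(recovery_ratio)
```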
5. Identifying Defaulters
# Select only the defaulted loans
defaulters = df[df['Loan_Status'] == 'Default']
# Summary of defaulters
print(defaulters.describe())
- Extracts rows where Loan_Status is 'Default' and summarizes defaulter trends.
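A related figure is the default rate, meaning the share of all loans that defaulted. One way to get it is value_counts with normalize=True. The statuses below are hypothetical sample values, with 'Paid' assumed as the non-default label:

```python
import pandas as pd

# Hypothetical statuses; only 'Default' is a label used in the walkthrough
statuses = pd.Series(['Paid', 'Default', 'Paid', 'Paid'], name='Loan_Status')

status_share = statuses.value_counts(normalize=True)  # fractions instead of counts
default_rate = status_share.get('Default', 0.0)       # 0.0 if no loan defaulted

print(status_share)
print(default_rate)
```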
6. Loan Recovery Trends Over Time
import matplotlib.pyplot as plt
# Convert Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])
# Set date as index
df.set_index('Date', inplace=True)
# Resample monthly and sum recovered amounts
df.resample('M')['Recovered_Amount'].sum().plot(kind='line', title="Loan Recovery Trends")
plt.xlabel('Month')
plt.ylabel('Total Recovered Amount')
plt.show()
- Converts the Date column to a datetime type.
- Resamples monthly data and plots loan recovery trends.
7. Loan Amount vs Recovery Amount Analysis
import seaborn as sns
import matplotlib.pyplot as plt
# Scatter plot of Loan Amount vs Recovered Amount
sns.scatterplot(x=df['Loan_Amount'], y=df['Recovered_Amount'])
plt.title("Loan Amount vs Recovered Amount")
plt.show()
- Visualizes the relationship between loan amounts and recovered amounts.
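A scatter plot shows the relationship visually, and the correlation coefficient puts a number on it. The sketch below uses made-up amounts that rise together in a straight line, so the result should be very close to 1; real data will usually be noisier:

```python
import pandas as pd

# Hypothetical amounts where the recovered amount is always half the loan
loans = pd.DataFrame({
    'Loan_Amount': [1000, 2000, 3000, 4000],
    'Recovered_Amount': [500, 1000, 1500, 2000],
})

# Pearson correlation between the two columns (ranges from -1 to 1)
correlation = loans['Loan_Amount'].corr(loans['Recovered_Amount'])
print(correlation)
```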
8. Recovery by Loan Type
# Grouping by loan type
recovery_by_type = df.groupby('Loan_Type')['Recovered_Amount'].mean()
# Plot bar chart
recovery_by_type.plot(kind='bar', title="Average Recovery by Loan Type")
plt.xlabel('Loan Type')
plt.ylabel('Recovered Amount')
plt.show()
- Groups data by Loan_Type to analyze recovery across different loan categories.
Outcomes
- Identified trends: Monthly recovery variations and loan default rates.
- Explored defaulters: Characteristics of defaulting loans.
- Visualized patterns: Loan amount vs. recovery and loan type recovery rates.
Real-World Applications of Pandas
- Financial Analysis: Stock market trends, risk analysis.
- Healthcare: Patient records management, medical research.
- Marketing: Customer segmentation, sales forecasting.
- Machine Learning: Data preprocessing, feature engineering.
Conclusion
Pandas is an essential library for anyone working with data in Python. Its powerful tools for data manipulation, analysis, and visualization make it a must-have skill for data scientists and analysts. By mastering Pandas, you can process and analyze large datasets efficiently.