Pandas: The Essential Python Library for Data Analysis

Learn the key features of Python Pandas with coding examples, then practice on a loan recovery dataset using the code shared below.

 

Introduction

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools. Built on top of NumPy, it offers intuitive data structures and functions essential for handling structured data.

Pandas is widely used in data science, analytics, finance, statistics, and machine learning. It simplifies complex data processing tasks, making Python a powerful language for data analysis.

What are the Key Features of Pandas?

  • Data Structures: Provides Series (1D) and DataFrame (2D) objects.
  • Data Cleaning: Helps clean messy or inconsistent data, bringing it into a shape that can later be analyzed to achieve business goals.
  • Handling Missing Values: Provides multiple ways to handle missing data, such as forward fill and backward fill (see the short sketch after this list).
  • Data Transformation: Supports filtering, sorting, and reshaping data.
  • File Handling: Reads and writes multiple file formats (CSV, Excel, JSON, SQL, etc.).
  • Grouping & Aggregation: Enables data summarization using group operations.
  • Time-Series Analysis: Supports date/time indexing and time-based functions.
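
As a quick illustration of the missing-value handling mentioned above, here is a minimal sketch on a small made-up Series showing forward fill and backward fill:

import pandas as pd
import numpy as np

# A small Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

print(s.ffill())  # Forward fill: the NaN takes the previous value (1.0)
print(s.bfill())  # Backward fill: the NaN takes the next value (3.0)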

Installing Pandas

How do you make sure Pandas is installed on your local machine? Install it into your Python environment with pip:

pip install pandas

Alternatively, if you are using Anaconda, run the following command in a conda terminal:

conda install pandas
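
To confirm the installation worked, import Pandas and print its version:

import pandas as pd
print(pd.__version__)  # Prints the installed Pandas version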

Pandas Data Structures

Series

A Pandas Series is a one-dimensional labeled array, similar to a single column in an Excel sheet. To create one, first import the library, then call pd.Series:

import pandas as pd

# Creating a Series
s = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])
print(s)

DataFrame

A DataFrame is a two-dimensional table with labeled rows and columns, similar to an Excel spreadsheet. This structured arrangement of data makes analysis simple and easy to understand.

# Creating a DataFrame
data = {
    'Name': ['Ram', 'jack', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New Delhi', 'bengaluru', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

Reading and Writing Data

Pandas supports multiple file formats through a family of reader and writer functions, covering Excel spreadsheets, CSV (comma-separated values) files, and more. Pandas can also convert a file from one format to another, as shown after the examples below. Let’s implement this in Python.

Reading CSV Files:

df = pd.read_csv('data.csv')
print(df.head())

Writing to CSV Files:

df.to_csv('output.csv', index=False)

Reading Excel Files:

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
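
Converting Between Formats:

Because Pandas pairs each reader with a writer, converting a file from one format to another is just a read followed by a write. A minimal sketch, assuming data.csv exists and the openpyxl package is installed for Excel output:

df = pd.read_csv('data.csv')           # Read the CSV file
df.to_excel('data.xlsx', index=False)  # Write the same data as an Excel file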

Exploring Data

Understanding the dataset is crucial before analysis:

print(df.info())  # Displays column data types and non-null values
print(df.describe())  # Summary statistics for numerical columns
print(df.columns)  # List of column names

Data Selection and Filtering

Selecting Columns:

print(df['Name'])  # Selecting a single column
print(df[['Name', 'City']])  # Selecting multiple columns

Filtering Rows:

filtered_df = df[df['Age'] > 28]  # Filtering rows based on condition
print(filtered_df)

Data Cleaning and Handling Missing Values

Missing values can be managed using:

df.dropna()  # Remove rows with missing values
df.fillna({'Age': df['Age'].mean()})  # Fill missing Age with mean value

Data Manipulation

Adding a New Column:

df['Salary'] = [60000, 75000, 82000]

Applying Functions:

def categorize_age(age):
    return 'Young' if age < 30 else 'Senior'

df['Category'] = df['Age'].apply(categorize_age)
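
Sorting Rows:

Sorting, listed earlier under data transformation, works on any column. A quick sketch on the same DataFrame:

# Sort rows by Age in descending order
print(df.sort_values(by='Age', ascending=False))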

Grouping and Aggregation

Summarizing data using groupby:

df.groupby('City')['Salary'].mean()
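
groupby can also compute several summaries at once with agg. A short sketch on the same DataFrame:

# Mean and maximum salary for each city
print(df.groupby('City')['Salary'].agg(['mean', 'max']))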

Merging and Joining DataFrames

Combining datasets is common in data analysis.

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Department': ['HR', 'IT']})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [70000, 80000]})
merged_df = pd.merge(df1, df2, on='Name')
print(merged_df)
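
By default, pd.merge performs an inner join on the key column. Passing how='left' (or 'right' / 'outer') keeps unmatched rows and fills the gaps with NaN. A small sketch with a hypothetical third employee:

df3 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Department': ['HR', 'IT', 'Finance']})
left_joined = pd.merge(df3, df2, on='Name', how='left')  # Charlie has no salary, so NaN appears
print(left_joined)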

Time Series Analysis

Pandas excels at handling time-series data. The example below assumes the DataFrame has a Date column:

df['Date'] = pd.to_datetime(df['Date'])  # Convert to datetime format
df.set_index('Date', inplace=True)  # Set datetime column as index
print(df.resample('M').mean())  # Resample data by month

Let’s Learn Pandas with a Loan Recovery Dataset

Here is a step-by-step guide to exploring a loan recovery dataset with Pandas, covering essential data analysis techniques for understanding trends, defaulters, recovery rates, and more. You can download a loan recovery dataset, or any similar dataset from Kaggle, and apply the same code. Read till the end.
 
1. Load the Loan Recovery Dataset
import pandas as pd

# Load dataset
df = pd.read_csv('loan_recovery_data.csv')

# Display basic information
print(df.info())

# Show first five rows
print(df.head())
  • The info() method provides column names, data types, and missing values.
  • The head() method displays the first five rows to show the structure of the data.

2. Check for Missing Values

print(df.isnull().sum())

# Fill missing values with mean (for numerical columns)
df.fillna(df.mean(numeric_only=True), inplace=True)
  • isnull().sum() helps identify columns with missing values.
  • fillna(df.mean(numeric_only=True)) fills missing values in numerical columns with the column mean.

3. Basic Statistics and Summary

print(df.describe()) # Summary statistics
print(df['Loan_Status'].value_counts()) # Count of loan status categories
  • describe() provides insights like mean, min, and max values for numerical columns.
  • value_counts() shows the distribution of loan statuses.

4. Analyzing Loan Recovery Rates

# Calculate average recovered amount for different loan statuses
recovery_rate = df.groupby('Loan_Status')['Recovered_Amount'].mean() 
print(recovery_rate)
  • Groups data by Loan_Status and calculates the mean recovered amount.

5. Identifying Defaulters

# Filter out defaulted loans
defaulters = df[df['Loan_Status'] == 'Default']

# Summary of defaulters
print(defaulters.describe())

Extracts rows where Loan_Status is ‘Default’ and summarizes defaulter trends.

6. Loan Recovery Trends Over Time

import matplotlib.pyplot as plt

# Convert Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Set date as index
df.set_index('Date', inplace=True)

# Resample monthly and sum recovered amounts
df.resample('M')['Recovered_Amount'].sum().plot(kind='line', title="Loan Recovery Trends")

plt.xlabel('Month')
plt.ylabel('Total Recovered Amount')
plt.show()
  • Converts the Date column to a datetime type.
  • Resamples monthly data and plots loan recovery trends.

7. Loan Amount vs Recovery Amount Analysis

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of Loan Amount vs Recovered Amount
sns.scatterplot(x=df['Loan_Amount'], y=df['Recovered_Amount'])
plt.title("Loan Amount vs Recovered Amount")
plt.show()
  • Visualizes the relationship between loan amounts and recovered amounts.

8. Recovery by Loan Type

# Grouping by loan type
recovery_by_type = df.groupby('Loan_Type')['Recovered_Amount'].mean()

# Plot bar chart
recovery_by_type.plot(kind='bar', title="Average Recovery by Loan Type")

plt.xlabel('Loan Type')
plt.ylabel('Recovered Amount')
plt.show()

Groups data by Loan_Type to analyze recovery across different loan categories.

 

Outcomes

  • Identified trends: Monthly recovery variations and loan default rates.
  • Explored defaulters: Characteristics of defaulting loans.
  • Visualized patterns: Loan amount vs. recovery and loan type recovery rates.

Real-World Applications of Pandas

  • Financial Analysis: Stock market trends, risk analysis.
  • Healthcare: Patient records management, medical research.
  • Marketing: Customer segmentation, sales forecasting.
  • Machine Learning: Data preprocessing, feature engineering.

Conclusion

Pandas is an essential library for anyone working with data in Python. Its powerful tools for data manipulation, analysis, and visualization make it a must-have skill for data scientists and analysts. By mastering Pandas, you unlock the ability to efficiently process and analyze large datasets with ease.
