Data Analysis and Hypothesis Testing Using a Banking Dataset

Introduction

In this analysis, we will explore a banking dataset, preprocess the data, build a classification machine learning model, and perform hypothesis testing to validate insights.

The goal is to classify customers based on their likelihood of subscribing to a term deposit and to use statistical tests to support our findings.

Step 1: Load and Explore the Data

We start by loading a banking dataset and exploring its structure.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from scipy import stats
from scipy.stats import f_oneway, mannwhitneyu, ks_2samp

# Load the dataset
data = pd.read_csv("banking.csv")  # Ensure the correct dataset path

# Display basic information
data.info()
print(data.head())

# Check for missing values
print(data.isnull().sum())
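
Two quick checks are worth adding at this stage: how skewed the target is, and which columns actually contain missing values. The sketch below runs on a small hypothetical frame rather than on banking.csv; the column names simply mirror the ones used later in this article, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for banking.csv (values are invented).
toy = pd.DataFrame({
    "age": [30, 45, 52, 28, 39, 61],
    "marital": ["single", "married", "married", "single", None, "divorced"],
    "y": ["no", "no", "yes", "no", "no", "yes"],
})

# Share of each target class; a heavy skew toward "no" means
# accuracy on its own can look better than the model really is.
class_share = toy["y"].value_counts(normalize=True)

# Missing values per column, keeping only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0]

print(class_share)
print(missing)
```

On the real data, the same two expressions can be applied to `data` directly.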

Step 2: Data Preprocessing

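We encode the categorical variables, standardize the features, and split the data into training and test sets. Note that the code in this step does not impute or drop anything, so any missing values reported in Step 1 must be handled first.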

# Encoding categorical variables (and the target 'y') as integer codes.
# A separate LabelEncoder is kept per column so each mapping can be looked up later.
categorical_cols = ['job', 'marital', 'education', 'default', 'housing',
                    'loan', 'contact', 'month', 'poutcome', 'y']
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])

# Splitting data into features and target
X = data.drop(columns=['y'])
y = data['y']

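# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardizing features: fit the scaler on the training set only,
# then apply the same transformation to the test set to avoid data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)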
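
Label-encoding gives nominal columns such as marital status an arbitrary numeric order, and scaling those codes has no clear meaning. A possible alternative, sketched here on synthetic data, is to one-hot encode the categorical columns and scale only the numeric ones with scikit-learn's ColumnTransformer, fitting on the training split alone. The toy frame, its column subset, and the `stratify=y` choice are assumptions made for this illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the banking data (values are random, not real).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "balance": rng.normal(1500, 800, size=n),
    "marital": rng.choice(["single", "married", "divorced"], size=n),
    "y": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
})

X = toy.drop(columns=["y"])
y = toy["y"]

# Split first so that no statistic from the test rows reaches the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale numeric columns, one-hot encode the categorical one.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "balance"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["marital"]),
    ],
    sparse_threshold=0,
)

X_train_prepared = preprocess.fit_transform(X_train)  # fit on training data only
X_test_prepared = preprocess.transform(X_test)        # reuse the fitted statistics

print(X_train_prepared.shape, X_test_prepared.shape)
```

Tree-based models such as Random Forest do not need scaled inputs, so the scaling part matters mainly if the same features are later fed to a distance-based or linear model.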

Step 3: Build and Train a Classification Model

We will use a Random Forest classifier to predict whether a customer subscribes to a term deposit.

# Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
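
Accuracy can be misleading when one class dominates, which is common for subscription data. The sketch below uses ten hand-written labels (not model output from this article) to show how the confusion matrix imported in Step 1 separates the kinds of error, and how precision and recall follow from it.

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration: 8 non-subscribers, 2 subscribers.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the predicted subscribers, how many were right
recall = tp / (tp + fn)     # of the actual subscribers, how many were found

print(cm)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

Here accuracy is 0.80 even though only half of the actual subscribers are found. With the model from this step, the equivalent call would be `confusion_matrix(y_test, y_pred)`.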

Step 4: Perform Hypothesis Testing

Hypothesis 1: Does the average age of customers who subscribe differ from that of customers who don’t?

# Split ages based on subscription status
subscribed_age = data[data['y'] == 1]['age']
not_subscribed_age = data[data['y'] == 0]['age']

# Perform independent t-test
t_stat, p_value = stats.ttest_ind(subscribed_age, not_subscribed_age)
print(f"T-Statistic: {t_stat:.4f}, P-Value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference in age between subscribed and non-subscribed customers.")
else:
    print("No significant difference in age.")
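
By default, `stats.ttest_ind` assumes both groups have the same variance. When the groups differ a lot in size or spread, Welch's variant (`equal_var=False`) is often the safer choice. The sketch below uses synthetic ages, not the banking data, and checks the SciPy result against the Welch formula computed by hand.

```python
import numpy as np
from scipy import stats

# Synthetic ages for illustration: groups differ in size and in spread.
rng = np.random.default_rng(42)
subscribed_age = rng.normal(loc=42, scale=14, size=120)
not_subscribed_age = rng.normal(loc=40, scale=10, size=900)

# Welch's t-test: does not assume equal variances.
t_welch, p_welch = stats.ttest_ind(
    subscribed_age, not_subscribed_age, equal_var=False
)

# The same statistic by hand: mean difference over its standard error.
mean_diff = subscribed_age.mean() - not_subscribed_age.mean()
std_err = np.sqrt(
    subscribed_age.var(ddof=1) / subscribed_age.size
    + not_subscribed_age.var(ddof=1) / not_subscribed_age.size
)
t_manual = mean_diff / std_err

print(f"Welch t: {t_welch:.4f} (manual: {t_manual:.4f}), p-value: {p_welch:.4f}")
```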

Hypothesis 2: Is there an association between marital status and subscription?

# Create a contingency table
contingency_table = pd.crosstab(data['marital'], data['y'])

# Perform Chi-Square test
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-Square Statistic: {chi2:.4f}, P-Value: {p:.4f}")
if p < 0.05:
    print("Marital status has a significant association with subscription.")
else:
    print("No significant association between marital status and subscription.")
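
The chi-square test compares the observed counts with the counts expected if marital status and subscription were independent. The sketch below uses an invented contingency table (the counts are not from the banking data) to show where the `expected` array comes from: each cell is row total times column total divided by the grand total.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented counts for illustration; labels are kept as text for readability.
table = pd.DataFrame(
    [[90, 10], [160, 40], [50, 10]],
    index=["divorced", "married", "single"],
    columns=["no", "yes"],
)

chi2, p, dof, expected = stats.chi2_contingency(table)

# Expected counts under independence, computed by hand.
row_totals = table.sum(axis=1).to_numpy()
col_totals = table.sum(axis=0).to_numpy()
expected_manual = np.outer(row_totals, col_totals) / table.to_numpy().sum()

print(f"chi2={chi2:.4f}, p={p:.4f}, dof={dof}")
print(expected_manual)
```

A common rule of thumb is to check that the expected counts are not too small (often at least 5 per cell) before relying on the p-value.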

Hypothesis 3: Is there a significant difference in balance across different education levels? (ANOVA Test)

# Perform ANOVA Test across every encoded education level,
# rather than hard-coding the codes 0, 1, and 2
balance_by_education = [group['balance'] for _, group in data.groupby('education')]
anova_stat, anova_p = f_oneway(*balance_by_education)

print(f"ANOVA Statistic: {anova_stat:.4f}, P-Value: {anova_p:.4f}")
if anova_p < 0.05:
    print("Significant difference in balance across education levels.")
else:
    print("No significant difference in balance across education levels.")
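
One-way ANOVA takes one sample per group, so the number of groups has to match the number of education levels in the data. The sketch below builds the groups with `groupby` on a synthetic frame, which avoids assuming a fixed number of levels. The four level names and the balances are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic frame: four hypothetical education levels, 50 customers each.
rng = np.random.default_rng(3)
levels = ["primary", "secondary", "tertiary", "unknown"]
toy = pd.DataFrame({
    "education": np.repeat(levels, 50),
    "balance": rng.normal(1400, 600, size=200),
})

# One array of balances per education level, whatever levels are present.
groups = {
    level: grp["balance"].to_numpy()
    for level, grp in toy.groupby("education")
}

f_stat, p_value = f_oneway(*groups.values())
print(f"levels={list(groups)}, F={f_stat:.4f}, p={p_value:.4f}")
```

ANOVA assumes roughly normal residuals and similar variances across groups; account balances are often skewed, so a rank-based alternative such as `scipy.stats.kruskal` may be worth comparing.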

Hypothesis 4: Are balances different between subscribed and non-subscribed customers? (Mann-Whitney U Test)

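# Split balances based on subscription status
subscribed_balance = data[data['y'] == 1]['balance']
not_subscribed_balance = data[data['y'] == 0]['balance']

# Perform Mann-Whitney U Test
mw_stat, mw_p = mannwhitneyu(subscribed_balance, not_subscribed_balance)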

print(f"Mann-Whitney U Statistic: {mw_stat:.4f}, P-Value: {mw_p:.4f}")
if mw_p < 0.05:
    print("Significant difference in balance between subscribed and non-subscribed customers.")
else:
    print("No significant difference in balance.")
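
The U statistic has a direct interpretation: divided by the number of pairs, it estimates the probability that a randomly chosen value from the first group exceeds one from the second. The sketch below checks this on synthetic, skewed balances (not the banking data). It assumes a recent SciPy release (1.7 or later), where the returned statistic is the U value for the first sample.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic right-skewed balances for illustration.
rng = np.random.default_rng(7)
subscribed_balance = rng.lognormal(mean=7.3, sigma=1.0, size=80)
not_subscribed_balance = rng.lognormal(mean=7.0, sigma=1.0, size=300)

u_stat, p_value = mannwhitneyu(
    subscribed_balance, not_subscribed_balance, alternative="two-sided"
)

# U / (n1 * n2): estimated probability that a subscriber's balance
# is higher than a non-subscriber's (ties count as half).
n_pairs = subscribed_balance.size * not_subscribed_balance.size
prob_superiority = u_stat / n_pairs

# Brute-force check over every pair of customers.
wins = (subscribed_balance[:, None] > not_subscribed_balance[None, :]).sum()
ties = (subscribed_balance[:, None] == not_subscribed_balance[None, :]).sum()
brute_force = (wins + 0.5 * ties) / n_pairs

print(f"U={u_stat:.1f}, p={p_value:.4f}, P(sub > non-sub)={prob_superiority:.3f}")
```

A value near 0.5 indicates little separation between the groups, even when a large sample makes the p-value small.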

Hypothesis 5: Are the distributions of balance different between subscribed and non-subscribed customers? (Kolmogorov-Smirnov Test)

# Perform Kolmogorov-Smirnov Test on balance, split by subscription status
ks_stat, ks_p = ks_2samp(data[data['y'] == 1]['balance'], data[data['y'] == 0]['balance'])

print(f"Kolmogorov-Smirnov Statistic: {ks_stat:.4f}, P-Value: {ks_p:.4f}")
if ks_p < 0.05:
    print("The distributions of balance are significantly different between subscribed and non-subscribed customers.")
else:
    print("No significant difference in distributions.")
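
The two-sample Kolmogorov-Smirnov statistic is the largest vertical gap between the two empirical cumulative distribution functions (ECDFs), so it reacts to differences in shape and spread as well as location. The sketch below reproduces the statistic by hand on synthetic samples; the data and variable names are illustrative only.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic samples that differ in both location and spread.
rng = np.random.default_rng(11)
sample_a = rng.normal(loc=0.0, scale=1.0, size=150)
sample_b = rng.normal(loc=0.3, scale=1.5, size=200)

ks_stat, ks_p = ks_2samp(sample_a, sample_b)

# Evaluate both ECDFs at every observed point and take the largest gap.
pooled = np.sort(np.concatenate([sample_a, sample_b]))
ecdf_a = np.searchsorted(np.sort(sample_a), pooled, side="right") / sample_a.size
ecdf_b = np.searchsorted(np.sort(sample_b), pooled, side="right") / sample_b.size
max_gap = np.abs(ecdf_a - ecdf_b).max()

print(f"KS statistic: {ks_stat:.4f} (manual: {max_gap:.4f}), p-value: {ks_p:.4f}")
```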

Conclusion

We explored a banking dataset, built a classification model, and validated key insights using various hypothesis tests.

The tests compare age, marital status, education level, and balance across customer groups. Which of these differences are statistically significant depends on the p-values printed when the code is run against your copy of the data.

Where the tests do find significant differences, the results can inform more targeted marketing and customer segmentation.
