Building a Spam Filter with Machine Learning

Every day we receive spam emails, messages, or calls. Less digitally literate people can fall victim to these messages and suffer financial loss or mental trauma. Such digital fraud can be reduced with AI tools and technologies.

Do spam filters use machine learning? Yes. Spam filters commonly use ML algorithms to classify each message as spam or ham. Spam messages are more than just annoying; they can be dangerous. Whether it is a phishing attempt or an unwanted ad, filtering spam is a crucial task in modern communication systems. In this blog post, we will walk through how to build a simple yet effective spam filter using machine learning in Python.

By the end, you will have a working model that can classify messages as spam or ham (not spam), and you will understand the key steps involved in building such a system.

Step 1: Set Up Your Environment

First, install the necessary Python libraries:

pip install pandas scikit-learn matplotlib seaborn

Step 2: Load the Dataset

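We will use the SMS Spam Collection Dataset, which contains thousands of labeled SMS messages. The loading code below expects it as a file named spam.csv in the working directory, with the label in column v1 and the message text in column v2.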

import pandas as pd

# Load dataset
df = pd.read_csv("spam.csv", encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'message']
print(df.head())
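
Before going further, it can help to check how the two classes are distributed, since spam datasets are often imbalanced. The sketch below uses a small made-up DataFrame (`toy_df`, not rows from spam.csv) so that it runs on its own; on the real data the same `value_counts` call would apply to `df['label']`.

```python
import pandas as pd

# Hypothetical stand-in rows for illustration; not taken from spam.csv
toy_df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "message": [
        "See you at lunch",
        "Claim your free prize",
        "Running late, sorry",
        "Call me when you can",
    ],
})

# Fraction of messages in each class
class_share = toy_df["label"].value_counts(normalize=True)
print(class_share)

# Rows with a missing label or message would cause trouble in later steps
print("Rows with missing values:", toy_df.isna().any(axis=1).sum())
```

If one class is much rarer than the other, accuracy alone can look good even when most spam is missed, so keep the class shares in mind when reading the evaluation in Step 4.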

Step 3: Preprocess the Data

We’ll convert the labels to binary values and vectorize the text using TF-IDF.

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Encode labels
df['label'] = df['label'].map({'ham': 0, 'spam': 1})

# Split data
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)

# Vectorize text
vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
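
To get a feel for what the vectorizer produces, here is a minimal sketch on three made-up messages (`toy_messages` and the other names are illustrative only). Each message becomes one row of a sparse matrix, with one column per vocabulary word that survives stop-word removal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages for illustration only
toy_messages = [
    "win a free prize today",
    "are we meeting for lunch today",
    "free prize waiting for you",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each kept word to its column index
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print("Vocabulary:", vocabulary)
print("Matrix shape (messages, words):", toy_matrix.shape)

# Non-zero TF-IDF weights for the first message
first_row = toy_matrix[0].toarray().ravel()
for word, weight in zip(vocabulary, first_row):
    if weight > 0:
        print(f"{word}: {weight:.3f}")
```

Words that appear in fewer messages receive a higher weight, and with the default settings each row is scaled to unit length.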

Step 4: Train the Model

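We will use a Multinomial Naive Bayes classifier, which is fast to train and works well with sparse word-based features such as TF-IDF, making it a common baseline for text classification tasks.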

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Train model
model = MultinomialNB()
model.fit(X_train_vec, y_train)

# Predict
y_pred = model.predict(X_test_vec)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
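
One way to see what the trained classifier has learned is to compare the per-class word log-probabilities that `MultinomialNB` stores in `feature_log_prob_`. The following is a sketch on a handful of made-up messages (the `toy_*` names and data are assumptions for illustration); with the real model you would use `model` and `vectorizer` from the steps above instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data for illustration (1 = spam, 0 = ham)
toy_messages = [
    "win free prize now",
    "free cash claim prize",
    "urgent claim your free reward",
    "are we meeting for lunch",
    "see you at home tonight",
    "lunch tomorrow sounds good",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_features = toy_vectorizer.fit_transform(toy_messages)
toy_model = MultinomialNB().fit(toy_features, toy_labels)

# Column index -> word, recovered from the fitted vocabulary
words = np.array(
    sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
)

# classes_ is sorted, so row 0 is ham (0) and row 1 is spam (1)
log_ratio = toy_model.feature_log_prob_[1] - toy_model.feature_log_prob_[0]

# Words with the largest log-ratio pull a message most strongly toward spam
top_spam_words = words[np.argsort(log_ratio)[::-1][:3]]
print("Most spam-indicative words:", list(top_spam_words))
```

On the full dataset, a list like this is a quick sanity check that the model is keying on plausible words rather than on quirks of the data.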

Step 5: Test with New Messages

Now let’s test the model with some custom messages.

def predict_spam(message):
    message_vec = vectorizer.transform([message])
    prediction = model.predict(message_vec)
    return "Spam" if prediction[0] == 1 else "Ham"

# Try it out
print(predict_spam("Congratulations! You've won a free ticket to Bahamas. Call now!"))
print(predict_spam("Hey, are we still meeting for lunch today?"))
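
The helper above returns only a hard label. If you also want a confidence score, `predict_proba` gives the estimated probability of each class. The sketch below bundles the vectorizer and classifier into a scikit-learn pipeline so that raw text can be passed in directly; the toy data and the name `spam_probability` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration (1 = spam, 0 = ham)
toy_messages = [
    "win free prize now",
    "free cash claim prize",
    "urgent claim your free reward",
    "are we meeting for lunch",
    "see you at home tonight",
    "lunch tomorrow sounds good",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# One object that vectorizes and classifies in a single call
toy_pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
toy_pipeline.fit(toy_messages, toy_labels)


def spam_probability(message):
    """Return the estimated probability that a message is spam (hypothetical helper)."""
    # Look up the column for class 1 (spam) instead of assuming its position
    spam_column = list(toy_pipeline.classes_).index(1)
    return float(toy_pipeline.predict_proba([message])[0][spam_column])


for text in ["free prize", "lunch at home"]:
    print(f"{text!r}: P(spam) = {spam_probability(text):.2f}")
```

A threshold other than 0.5 can then be chosen depending on how costly a missed spam message is compared with a blocked legitimate one.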

Step 6: Visualize the Results

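A confusion matrix shows how many ham and spam messages were classified correctly and how many were mistaken for the other class, which tells us more than a single accuracy number.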

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
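
The four cells of the matrix can also be turned into precision and recall for the spam class. Here is a minimal sketch with made-up labels (not results from the SMS dataset); the `toy_*` names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = spam, 0 = ham); not real results
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes, in label order [0, 1]
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
tn, fp, fn, tp = toy_cm.ravel()

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision={precision:.2f} Recall={recall:.2f}")
```

For a spam filter, false positives (ham flagged as spam) are often the more costly error, which makes precision worth watching alongside accuracy.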

Step 7: Take It Further

What machine learning model is used for spam detection? Naive Bayes, as used here, is a common starting point, but there is plenty of room to grow:

- Try different classifiers like SVM or Random Forest
- Use text preprocessing techniques like stemming or lemmatization
- Tune hyperparameters for better performance
- Explore deep learning models like LSTM for sequential data
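
As a starting point for trying other classifiers and tuning hyperparameters, the sketch below compares Naive Bayes with a linear SVM using cross-validation and then searches over the Naive Bayes smoothing parameter `alpha`. The toy data, the fold count, and the `alpha` grid are assumptions chosen to keep the example small; on the real dataset you would pass `df['message']` and `df['label']` and use more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up data for illustration (1 = spam, 0 = ham)
toy_messages = [
    "win free prize now",
    "free cash claim prize",
    "urgent claim your free reward",
    "free prize waiting claim now",
    "are we meeting for lunch",
    "see you at home tonight",
    "lunch tomorrow sounds good",
    "meeting moved to tomorrow",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Each candidate is a full pipeline, so TF-IDF is refit inside every fold
candidates = {
    "Naive Bayes": make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB()),
    "Linear SVM": make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC()),
}

mean_scores = {}
for name, pipeline in candidates.items():
    scores = cross_val_score(pipeline, toy_messages, toy_labels, cv=2)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {mean_scores[name]:.2f}")

# Search over the smoothing strength; the step name comes from make_pipeline
alpha_grid = {"multinomialnb__alpha": [0.1, 0.5, 1.0]}
search = GridSearchCV(candidates["Naive Bayes"], alpha_grid, cv=2)
search.fit(toy_messages, toy_labels)
print("Best alpha:", search.best_params_["multinomialnb__alpha"])
```

Keeping the vectorizer inside the pipeline means each fold learns its vocabulary from its own training portion only, which avoids leaking information from the held-out messages.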

Conclusion

Spam filtering is a classic example of how machine learning can solve real-world problems. With just a few lines of code, you have built a model that can intelligently classify messages. Whether you are building an email client, a messaging app, or just exploring ML, this project is a solid foundation.