Learn how to build a House Price Prediction model using Linear Regression in Python. Step-by-step guide with code examples, data preprocessing, and evaluation metrics explained.
INTRODUCTION
Predicting house prices is one of the most fundamental applications of data science and machine learning. Imagine being able to forecast how much a house is worth based on factors like its size, location, and condition. Sounds interesting, right? That’s where linear regression comes into play. In this article, we’ll break down the concept of linear regression, explore how it works, and explain how you can use it to build a house price prediction model.
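Before moving on to the real dataset, it may help to see what "fitting a line" means on a tiny example. The sketch below uses NumPy's least-squares solver on made-up numbers; the arrays `area` and `price` are hypothetical and are generated from an exact line so the result is easy to check by eye.

```python
import numpy as np

# Hypothetical toy data: house area and price, generated from the exact
# line price = 100 * area + 50_000 so the fitted values are easy to verify.
area = np.array([1000.0, 1500.0, 2000.0, 2500.0])
price = 100.0 * area + 50_000.0

# Design matrix: one column for the feature, one column of ones for the intercept.
A = np.column_stack([area, np.ones_like(area)])

# Ordinary least squares: find the slope and intercept that minimise
# the sum of squared differences between predicted and actual prices.
(slope, intercept), *_ = np.linalg.lstsq(A, price, rcond=None)

print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
```

With more than one feature the idea is the same: the model learns one coefficient per feature plus an intercept, which is what scikit-learn's LinearRegression does in the steps that follow.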
Here’s a step-by-step explanation of how to build the House Price Prediction model using Linear Regression in Python. This includes data loading, preprocessing, training, and evaluating the model.
Step 1: Import Libraries
Start by importing the necessary Python libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
Step 2: Load the Dataset
Download the Housing Price Dataset (Housing.csv). It contains 13 columns and 546 rows, which is small but good enough to practice machine learning concepts. Save the file next to your script and follow along with the steps below.
# Load the dataset
data = pd.read_csv('Housing.csv')
# Display the first few rows of the dataset
print(data.head())
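After loading, a quick look at the shape and column types tells you which columns will need encoding later. The sketch below uses a three-row inline sample instead of the real file; the rows and the subset of column names are hypothetical stand-ins, not values from Housing.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for Housing.csv. The column
# names are assumptions chosen to match the ones used in later steps.
sample_csv = io.StringIO(
    "price,area,bedrooms,mainroad,furnishingstatus\n"
    "500000,3000,3,yes,furnished\n"
    "420000,2500,2,no,semi-furnished\n"
    "610000,4000,4,yes,unfurnished\n"
)
sample = pd.read_csv(sample_csv)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # text columns are not numeric

# Columns that are not numeric will need encoding before modelling.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

Running the same three checks on `data` should show which of the 13 columns hold text values.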
Step 3: Explore the Dataset
Check for missing values, outliers, or categorical columns that need encoding.
# Check for missing values
print(data.isnull().sum())
# Basic statistical summary
print(data.describe())
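# Visualize correlations between numeric columns only
# (text columns are still unencoded at this point, so corr() must skip them)
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')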
plt.title('Correlation Heatmap')
plt.show()
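The code above covers missing values and correlations but not outliers. One common rule of thumb, shown here as a minimal sketch, flags values that lie more than 1.5 times the interquartile range (IQR) outside the quartiles. The helper name `iqr_outliers` and the sample prices are hypothetical.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Hypothetical prices with one obviously extreme value.
prices = pd.Series([100, 110, 120, 130, 140, 1000])
print(iqr_outliers(prices))
```

On the real data you could call `iqr_outliers(data['price'])`, assuming the target column is named `price`. Whether to drop, cap, or keep flagged rows is a judgment call that depends on the data.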
Step 4: Encode Categorical Data
The dataset has several categorical features (such as furnishingstatus, mainroad, and guestroom). Encode them as numeric values using one-hot encoding.
# One-hot encoding for categorical variables
data = pd.get_dummies(
    data,
    columns=['mainroad', 'guestroom', 'basement', 'hotwaterheating',
             'airconditioning', 'prefarea', 'furnishingstatus'],
    drop_first=True,
)
# Check the transformed data
print(data.head())
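To see what `drop_first=True` does, here is a small sketch on a made-up frame. The frame `toy` is hypothetical; only the two column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical mini-frame with one yes/no column and one three-level column.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

encoded = pd.get_dummies(
    toy, columns=["mainroad", "furnishingstatus"], drop_first=True
)

# Each column loses its first category (alphabetically), which becomes the
# implicit baseline: 'no' for mainroad and 'furnished' for furnishingstatus.
print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly redundant columns, which matters for linear regression because redundant columns make the individual coefficients ambiguous.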
Step 5: Split the Data
Split the dataset into features (X) and target (y), then divide it into training and testing sets.
# Define features and target variable
X = data.drop('price', axis=1)  # All columns except 'price' (column names in Housing.csv are lowercase)
y = data['price']  # Target variable
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
print(f"Training set size: {X_train.shape}")
print(f"Testing set size: {X_test.shape}")
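The `test_size` and `random_state` arguments are easier to understand on a tiny array. This sketch, using hypothetical arrays `X_demo` and `y_demo`, shows that 20% of 10 rows gives 2 test rows and that a fixed `random_state` reproduces the same split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state return identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(y_te, split_b[3]))
```

Fixing the seed makes your metrics comparable between runs; changing it gives a different split and slightly different scores.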
Step 6: Train the Linear Regression Model
Use the LinearRegression class from scikit-learn to train the model.
# Initialize the Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Display the model's coefficients and intercept
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
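A raw array of coefficients is hard to read because it does not say which number belongs to which feature. One option, sketched below on hypothetical data, is to wrap the coefficients in a pandas Series indexed by column name. The names `X_toy`, `y_toy`, and `toy_model` are illustrative only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training frame. The target is built from a known formula,
# 100 * area + 5000 * bedrooms + 1000, so the fitted values can be checked.
X_toy = pd.DataFrame({
    "area": [1000, 1500, 2000, 2500, 3000],
    "bedrooms": [2, 3, 3, 4, 5],
})
y_toy = 100 * X_toy["area"] + 5000 * X_toy["bedrooms"] + 1000

toy_model = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with its feature name, largest magnitude first.
coef_table = pd.Series(toy_model.coef_, index=X_toy.columns)
coef_table = coef_table.sort_values(key=abs, ascending=False)
print(coef_table)
```

For the model in this article the equivalent would be `pd.Series(model.coef_, index=X_train.columns)`. Keep in mind that coefficient size depends on each feature's units, so a large coefficient does not by itself mean a feature is more important.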
Step 7: Evaluate the Model
Measure the performance of the model on the test set using metrics like MAE, RMSE, and R² score.
# Make predictions on the test set
y_pred = model.predict(X_test)
# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R² Score: {r2}")
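To make the three metrics less of a black box, the sketch below computes them by hand with NumPy on four made-up values and compares the results with scikit-learn. The arrays `y_true` and `y_hat` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted prices.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_hat

# MAE: average size of the errors, ignoring their sign.
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the average squared error (penalises large misses more).
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (squared error of the model / squared error of predicting the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, rmse_manual, r2_manual)
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat))))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

MAE and RMSE are in the same units as the price, while R² is unitless.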
Step 8: Visualize the Results
Create a scatter plot to compare the actual and predicted house prices.
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs. Predicted Prices")
plt.show()
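The scatter plot is easier to read with a diagonal reference line, since points on that line are perfect predictions. Here is one possible sketch; the helper `plot_actual_vs_predicted` and the sample values are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_actual_vs_predicted(actual, predicted):
    """Scatter actual against predicted values with a y = x reference line."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    fig, ax = plt.subplots()
    ax.scatter(actual, predicted, alpha=0.7)

    # The diagonal spans the full range covered by both arrays.
    lo = min(actual.min(), predicted.min())
    hi = max(actual.max(), predicted.max())
    ax.plot([lo, hi], [lo, hi], linestyle="--", color="red",
            label="Perfect prediction")

    ax.set_xlabel("Actual Prices")
    ax.set_ylabel("Predicted Prices")
    ax.set_title("Actual vs. Predicted Prices")
    ax.legend()
    return ax

# Hypothetical values for illustration only.
ax = plot_actual_vs_predicted([100, 200, 300], [120, 190, 310])
```

Call `plt.show()` afterwards to display the figure. With the model from this article you would pass `y_test` and `y_pred`; points above the line are over-predictions and points below it are under-predictions.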
Full Code Example
Here’s the complete code for the House Price Prediction model using linear regression:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Step 2: Load the dataset
data = pd.read_csv('Housing.csv')
# Step 3: Explore the data
print(data.head())
print(data.describe())
print(data.isnull().sum())
# Visualize correlations between numeric columns only
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
# Step 4: Encode categorical data
data = pd.get_dummies(
    data,
    columns=['mainroad', 'guestroom', 'basement', 'hotwaterheating',
             'airconditioning', 'prefarea', 'furnishingstatus'],
    drop_first=True,
)
# Step 5: Split the data
X = data.drop('price', axis=1)  # Column names in Housing.csv are lowercase
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 6: Train the model
model = LinearRegression()
model.fit(X_train, y_train)
print("Model Coefficients:", model.coef_)
print("Model Intercept:", model.intercept_)
# Step 7: Evaluate the model
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R² Score: {r2}")
# Step 8: Visualize the results
plt.scatter(y_test, y_pred, alpha=0.7)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs. Predicted Prices")
plt.show()
After running the model, you will see the MAE, RMSE, and R² score, which indicate how well the model performs. Aim for low errors and an R² score close to 1.
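A number like R² is easier to judge next to a baseline. As a rough sketch, scikit-learn's DummyRegressor can stand in for a "model" that always predicts the mean price; the arrays `X_b` and `y_b` below are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical data. The dummy model ignores the features entirely.
X_b = np.zeros((4, 1))
y_b = np.array([100.0, 200.0, 300.0, 400.0])

# Baseline that always predicts the mean of the targets it was fitted on.
baseline = DummyRegressor(strategy="mean").fit(X_b, y_b)
baseline_pred = baseline.predict(X_b)

baseline_r2 = r2_score(y_b, baseline_pred)
baseline_mae = mean_absolute_error(y_b, baseline_pred)
print(baseline_r2, baseline_mae)
```

Predicting the mean scores an R² of 0 on the data it was fitted on, so an R² close to 1 means the regression explains most of the variation that the baseline cannot. In practice you would fit the baseline on the training set and score it on the test set, where its R² can come out slightly below 0.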