Machine Learning Model for Real Estate Price Prediction

Predicting house prices is one of the most practical and popular applications of machine learning. It is a classic problem that combines data analysis, feature engineering, and regression modeling. In this step-by-step guide, we will build a price prediction model in Python using a real estate dataset. We will cover the following topics:

Loading and exploring the data
Cleaning and preprocessing
Splitting the data
Training the model
Evaluating performance
Interpreting feature importance
Making predictions on new data

Step 1: Load the Dataset

Download the dataset (real_estate.csv) from Kaggle. First, we will load the dataset, which has the following columns:

Location
Size (sqft)
Bedrooms
Bathrooms
Year Built
Price

import pandas as pd

# Load the dataset
df = pd.read_csv('real_estate.csv')
print(df.head())
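
Beyond head(), a few more calls give a quick picture of the data before cleaning. The sketch below is only an illustration: it reads a tiny made-up CSV from a string (the three rows are invented, not taken from real_estate.csv) so that it runs without the downloaded file. The column names follow the list above.

```python
import io
import pandas as pd

# Made-up sample rows, for illustration only (not from real_estate.csv)
csv_text = """Location,Size (sqft),Bedrooms,Bathrooms,Year Built,Price
New York,2000,3,2,2010,850000
Chicago,1500,2,1,1995,320000
Boston,1800,3,2,2003,610000
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)       # (rows, columns)
print(sample_df.dtypes)      # which columns are numeric and which are text
print(sample_df.describe())  # summary statistics for the numeric columns
```

The same three calls work unchanged on df once the real file is loaded.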

Step 2: Data Preprocessing

Clean and Preprocess the Data

Before training a model, we need to clean and prepare the data.

Handle Missing Values

Missing data can break your model or distort its predictions. Let us inspect and remove incomplete rows:

# Check for missing values
print(df.isnull().sum())

# Drop rows that contain missing values
df = df.dropna()

Note: If dropping rows would discard too much data, we can instead fill missing values with the column mean or median.
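
As a rough illustration of the fill option, the sketch below imputes two numeric columns with their medians. The toy values are made up for this example; only the column names come from the dataset described above.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) with two missing values
toy = pd.DataFrame({
    'Size (sqft)': [1500.0, np.nan, 2100.0, 1800.0],
    'Bedrooms': [3.0, 2.0, np.nan, 4.0],
    'Price': [300000.0, 250000.0, 420000.0, 380000.0],
})

# Fill each feature column with its own median (NaN values are ignored
# when the median is computed)
feature_cols = ['Size (sqft)', 'Bedrooms']
toy_filled = toy.copy()
toy_filled[feature_cols] = toy[feature_cols].fillna(toy[feature_cols].median())

print(toy_filled)
```

In a full pipeline it is safer to compute the medians on the training split only and reuse them for the test split, so that no information from the test set leaks into training.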

Encode Categorical Variables

Most machine learning models cannot work with text directly, so we need to convert categorical data like Location into numbers using one-hot encoding:

# Convert 'Location' to numerical using one-hot encoding
df = pd.get_dummies(df, columns=['Location'], drop_first=True)
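
To see what drop_first=True does, here is a small sketch on a made-up Location column (the three city names are assumptions chosen for illustration). The first category in sorted order gets no column of its own and is represented by zeros in all remaining columns.

```python
import pandas as pd

# Made-up locations, for illustration only
toy = pd.DataFrame({'Location': ['Boston', 'Chicago', 'New York', 'Boston']})

encoded = pd.get_dummies(toy, columns=['Location'], drop_first=True)

# 'Boston' sorts first, so it is dropped and becomes the baseline category
print(encoded.columns.tolist())
print(encoded)
```

This matters later when predicting: a listing in the baseline location has no Location_ column to set and is encoded as all zeros.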

Separate Features and Target

We now split the data into input features (X) and the target variable (y):

X = df.drop('Price', axis=1)  # Features
y = df['Price']               # Target

Step 3: Split the Data

To evaluate our model properly, we will split the data into training and testing sets, holding out 20% of the rows for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train the Model

We will use a Random Forest Regressor, a powerful and easy-to-use ensemble learning model:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

Step 5: Evaluate the Model Performance

To measure how well the model is doing, we shall use:

Mean Squared Error (MSE): Measures the average squared prediction error (lower is better)
R² Score: Measures the proportion of the variance in price that the model explains (closer to 1 is better)

Let’s check how well our model performs.

from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
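
To make the two metrics concrete, the sketch below computes them by hand on four made-up prices, using only the standard library. It also takes the square root of the MSE (the RMSE), which is not part of the code above but is often easier to read because it is in the same units as price. All numbers here are invented for illustration.

```python
import math

# Made-up actual and predicted prices, for illustration only
actual = [300000.0, 250000.0, 420000.0, 380000.0]
predicted = [310000.0, 240000.0, 400000.0, 390000.0]

n = len(actual)

# MSE: the mean of the squared differences between actual and predicted
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse_manual = sum(squared_errors) / n

# RMSE: square root of the MSE, expressed in price units
rmse_manual = math.sqrt(mse_manual)

# R²: 1 minus (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / n
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2_manual = 1 - sum(squared_errors) / ss_tot

print(f"MSE:  {mse_manual:.2f}")
print(f"RMSE: {rmse_manual:.2f}")
print(f"R²:   {r2_manual:.4f}")
```

These are the same definitions that mean_squared_error and r2_score implement, so on the same inputs the results should agree up to floating-point rounding.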

Step 6: Understanding Feature Importance

It is also useful to understand which features influence price the most. Let’s visualize that:

import matplotlib.pyplot as plt

feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances.nlargest(10).plot(kind='barh')
plt.title("Top 10 Feature Importances")
plt.show()
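
For readers without the dataset at hand, the following sketch shows how the importance scores behave on synthetic data. Everything in it is an assumption made for illustration: the data is randomly generated, price is constructed to depend mostly on size, and the 'Noise' column is a hypothetical feature with no relation to price.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, generated only for this illustration
rng = np.random.default_rng(42)
n = 200
toy_X = pd.DataFrame({
    'Size (sqft)': rng.uniform(500, 4000, n),
    'Bedrooms': rng.integers(1, 6, n),
    'Noise': rng.normal(0, 1, n),  # hypothetical feature unrelated to price
})
# Price is driven mainly by size, slightly by bedrooms, plus random noise
toy_y = 150 * toy_X['Size (sqft)'] + 5000 * toy_X['Bedrooms'] + rng.normal(0, 10000, n)

toy_model = RandomForestRegressor(n_estimators=100, random_state=42)
toy_model.fit(toy_X, toy_y)

# Importances are normalized, so they sum to 1 across all features
importances = pd.Series(toy_model.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Because the scores are relative shares, a high value means a feature was used heavily for splits in the trees, not that it causes higher prices.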

Step 7: Make Predictions

With the trained model, you can now predict prices for new listings. Just be sure to include all the features the model was trained on, with the same encoding and column order:

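# Example input
new_listing = pd.DataFrame({
    'Size (sqft)': [2000],
    'Bedrooms': [3],
    'Bathrooms': [2],
    'Year Built': [2010],
    'Location_New York': [1],  # Assuming 'New York' is one of the encoded locations
})

# Match the training columns: missing location columns are added as 0,
# columns are put in training order, and any column the model never saw is dropped
new_listing = new_listing.reindex(columns=X.columns, fill_value=0)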

predicted_price = model.predict(new_listing)
print(f"Predicted Price: ${predicted_price[0]:,.2f}")

Final Thoughts

This project demonstrates how to build a regression model for real estate price prediction using Python and scikit-learn. You can improve it by:

Using advanced models like XGBoost or LightGBM
Performing hyperparameter tuning
Adding more features like proximity to schools, crime rates, etc.
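
As one possible starting point for the hyperparameter tuning suggestion, the sketch below runs a small grid search with cross-validation. It uses synthetic data from make_regression so that it runs on its own. The parameter grid is an arbitrary example, not a recommended setting, and X_toy and y_toy are placeholders for the X_train and y_train built earlier.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (illustration only)
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# A deliberately small, arbitrary grid; widen it for real use
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,          # 3-fold cross-validation
    scoring='r2',  # same R² metric used in Step 5
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"Best cross-validated R²: {search.best_score_:.2f}")
```

After fitting, search.best_estimator_ holds a model retrained on all of the supplied data with the best parameters, and it can be used for prediction in the same way as model above.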