Predicting house prices is one of the most practical and popular applications of machine learning. It is a classic problem that combines data analysis, feature engineering, and regression modeling. In this step-by-step guide, we will build a price prediction model in Python using a real estate dataset. We will cover the following topics:
Loading and exploring the data
Cleaning and preprocessing
Splitting the data
Training the model
Evaluating performance
Interpreting feature importance
Making predictions on new data
Step 1: Load the Dataset
Download the dataset from Kaggle: real_estate.csv
First, we will load the dataset, which has the following columns:
Location
Size (sqft)
Bedrooms
Bathrooms
Year Built
Price
import pandas as pd
# Load the dataset
df = pd.read_csv('real_estate.csv')
print(df.head())
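Beyond head(), a few more calls give a quick feel for the data. The sketch below is only an illustration: sample_df is a small made-up stand-in for real_estate.csv with the same column names, and its values are invented. On the real data you would run the same calls on df.

```python
import pandas as pd

# Hypothetical stand-in for real_estate.csv; the values are made up for illustration.
sample_df = pd.DataFrame({
    "Location": ["New York", "Chicago", "Austin", "Chicago"],
    "Size (sqft)": [850, 1400, 2100, 1750],
    "Bedrooms": [2, 3, 4, 3],
    "Bathrooms": [1, 2, 3, 2],
    "Year Built": [1995, 2004, 2015, 1988],
    "Price": [650000, 320000, 480000, 295000],
})

# Number of rows and columns
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Data type of each column: 'Location' is text, the rest are numeric
print(sample_df.dtypes)

# Summary statistics (count, mean, min, max, quartiles) for the numeric columns
summary = sample_df.describe()
print(summary)

# How many listings there are per location
location_counts = sample_df["Location"].value_counts()
print(location_counts)
```

The dtypes output is worth a glance before preprocessing, since any text column (here, Location) will need encoding before it can be passed to the model.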
Step 2: Data Preprocessing
Clean and Preprocess the Data
Before training a model, we need to clean and prepare the data.
Handle Missing Values
Missing data can break your model or distort its predictions. Let us inspect and remove incomplete rows:
# Check for missing values
print(df.isnull().sum())
# Drop rows that contain missing values
df = df.dropna()
Note: If dropping rows would discard too much data, we can instead fill missing values with the column mean or median.
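As a rough sketch of the fill-in alternative, the example below replaces missing values with each column's median. The toy frame and the choice of numeric_cols are assumptions for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Made-up example with gaps in two numeric columns
toy = pd.DataFrame({
    "Size (sqft)": [1000.0, np.nan, 1500.0, 2000.0],
    "Bedrooms": [2.0, 3.0, np.nan, 4.0],
    "Price": [200000, 260000, 310000, 400000],
})

# Columns to impute (an assumption for this example)
numeric_cols = ["Size (sqft)", "Bedrooms"]

# Replace each missing value with the median of its own column
medians = toy[numeric_cols].median()
toy_filled = toy.copy()
toy_filled[numeric_cols] = toy_filled[numeric_cols].fillna(medians)

print(toy_filled)
```

Unlike dropna(), this keeps all four rows. If you impute in a real project, it is usually safer to compute the medians from the training split only, so that no information from the test set leaks into training.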
Encode Categorical Variables
Most machine learning models, including those in scikit-learn, require numeric input, so we need to convert categorical data like Location into numbers using one-hot encoding:
# Convert 'Location' to numerical using one-hot encoding
df = pd.get_dummies(df, columns=['Location'], drop_first=True)
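To see what this call produces, here is a small sketch on made-up data (the location names are assumptions). Each location becomes its own 0/1 column, and drop_first=True removes the first category in sorted order, which is then represented by all zeros.

```python
import pandas as pd

# Made-up listings in three locations
toy = pd.DataFrame({
    "Location": ["Austin", "Chicago", "New York", "Chicago"],
    "Bedrooms": [3, 2, 1, 4],
})

# 'Austin' sorts first, so drop_first=True leaves it without a column of its own
encoded = pd.get_dummies(toy, columns=["Location"], drop_first=True)

print(encoded.columns.tolist())
# ['Bedrooms', 'Location_Chicago', 'Location_New York']
print(encoded)
```

Keep the dropped category in mind when building inputs for prediction later: a listing in the dropped location has zeros in every Location_ column.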
Separate Features and Target
We now split the data into input features (X) and the target variable (y):
X = df.drop('Price', axis=1) # Features
y = df['Price'] # Target
Step 3: Split the Data
To evaluate our model properly, we will split the data into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
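The two arguments in this call do the following: test_size=0.2 holds out 20% of the rows for testing, and random_state=42 makes the shuffle reproducible. A minimal sketch on made-up arrays (X_demo and y_demo are placeholders, not the real data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 80% of the rows go to training, 20% to testing
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Running the split again with the same random_state returns the same rows in each set, which keeps the evaluation numbers comparable between runs.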
Step 4: Train the Model
We will use a Random Forest Regressor, a powerful and easy-to-use ensemble learning model:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
Step 5: Evaluate the Model Performance
To measure how well the model is doing, we will use:
Mean Squared Error (MSE): Measures the average squared prediction error (lower is better)
R² Score: Measures the proportion of the variance in price that the model explains (closer to 1 is better)
Let’s check how well our model performs.
from sklearn.metrics import mean_squared_error, r2_score
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
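To make the two metrics less of a black box, the sketch below computes them by hand on made-up prices and compares the results with scikit-learn. It also shows RMSE, the square root of MSE, which is not used above but is often easier to read because it is in dollars rather than squared dollars. The arrays y_true and y_hat are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted prices
y_true = np.array([200000.0, 300000.0, 400000.0, 500000.0])
y_hat = np.array([210000.0, 290000.0, 420000.0, 480000.0])

# MSE: mean of the squared errors (units: dollars squared)
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE: square root of MSE, back in dollars
rmse_manual = np.sqrt(mse_manual)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE:  {mse_manual:,.2f}")
print(f"RMSE: {rmse_manual:,.2f}")
print(f"R²:   {r2_manual:.2f}")

# The hand-computed values should agree with scikit-learn
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Here the predictions are off by 10,000 to 20,000 dollars, which gives an MSE of 250,000,000 and an R² of 0.98.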
Step 6: Understanding Feature Importance
Feature importances show which features influence the predicted price the most. Let’s visualize them:
import matplotlib.pyplot as plt
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances.nlargest(10).plot(kind='barh')
plt.title("Top 10 Feature Importances")
plt.show()
Step 7: Make Predictions
With the trained model, you can now predict prices for new listings. Just be sure to include all the features the model was trained on, with the same encoding and column order:
# Example input
new_listing = pd.DataFrame({
'Size (sqft)': [2000],
'Bedrooms': [3],
'Bathrooms': [2],
'Year Built': [2010],
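'Location_New York': [1], # Assumes 'New York' is one of the encoded locations
})
# Align with the training columns: missing one-hot columns are added as 0
new_listing = new_listing.reindex(columns=X.columns, fill_value=0)
predicted_price = model.predict(new_listing)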
print(f"Predicted Price: ${predicted_price[0]:,.2f}")
Final Thoughts
This project demonstrates how to build a regression model for real estate price prediction using Python and scikit-learn. You can improve it by:
- Using advanced models like XGBoost or LightGBM
- Performing hyperparameter tuning
- Adding more features like proximity to schools, crime rates, etc.
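As one possible starting point for hyperparameter tuning, here is a minimal sketch using scikit-learn's GridSearchCV. The data is synthetic (generated with make_regression as a stand-in for X_train and y_train), and the grid values are arbitrary examples rather than tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_syn, y_syn = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=42
)

# Hypothetical grid; the values are illustrative only
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
}

# Try every combination with 3-fold cross-validation.
# scikit-learn maximizes scores, so MSE is reported as a negative number.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(f"Best cross-validated MSE: {-search.best_score_:.2f}")
```

After fitting, search.best_estimator_ is a model refit on all the data passed to fit() with the best parameters, so it can be used in place of model for evaluation and prediction.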