Real Estate Price Prediction with ML: Python Tutorial (sklearn + XGBoost)

Real Estate Price Prediction with ML: Python Tutorial (sklearn + XGBoost)

This tutorial walks through a complete ML pipeline for predicting house prices: data loading, exploratory analysis, feature engineering, training three different models, and comparing them on held-out data. We use the California Housing dataset so you can run everything without downloading anything.

pip install scikit-learn xgboost pandas numpy matplotlib

The Problem

Given features about a housing district in California (median income, house age, location, population), predict the median house value. This is a regression problem with real-world messiness: skewed distributions, correlated features, and geographic data.


Step 1: Load the Data

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
df = housing.frame

print(df.shape)          # (20640, 9)
print(df.dtypes)
print(df.describe())

Features:

  • MedInc — median income in block group
  • HouseAge — median house age
  • AveRooms, AveBedrms — average rooms/bedrooms per household
  • Population, AveOccup — block group population and average occupancy
  • Latitude, Longitude — geographic coordinates
  • MedHouseVal — target (median house value in $100k)

Step 2: Exploratory Data Analysis

import matplotlib.pyplot as plt

# Target distribution
plt.figure(figsize=(8, 4))
plt.hist(df["MedHouseVal"], bins=50, edgecolor="k")
plt.title("Distribution of Median House Values")
plt.xlabel("Median House Value ($100k)")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("target_distribution.png", dpi=100)

# Correlation with target
print(df.corr()["MedHouseVal"].sort_values(ascending=False))

Key finding: MedInc has the strongest correlation with price (0.69). Latitude and longitude are moderately correlated — geography matters, but needs encoding.


Step 3: Feature Engineering

Raw features rarely give you the best signal. We'll add a few derived features:

def engineer_features(df):
    df = df.copy()

    # Rooms and bedrooms per person are more informative than totals
    df["RoomsPerPerson"] = df["AveRooms"] / df["AveOccup"]
    df["BedrmsPerRoom"] = df["AveBedrms"] / df["AveRooms"]

    # High income flag (top quartile)
    df["HighIncome"] = (df["MedInc"] > df["MedInc"].quantile(0.75)).astype(int)

    # Distance from San Francisco (a major price driver)
    sf_lat, sf_lon = 37.7749, -122.4194
    df["DistFromSF"] = np.sqrt(
        (df["Latitude"] - sf_lat) ** 2 + (df["Longitude"] - sf_lon) ** 2
    )

    return df

df_feat = engineer_features(df)
print(df_feat.shape)  # (20640, 13)

Step 4: Train/Test Split and Scaling

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

TARGET = "MedHouseVal"
features = [c for c in df_feat.columns if c != TARGET]

X = df_feat[features].values
y = df_feat[TARGET].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Step 5: Train Three Models

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

models = {
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(
        n_estimators=200, max_features=0.5, n_jobs=-1, random_state=42
    ),
    "XGBoost": xgb.XGBRegressor(
        n_estimators=500,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42,
        n_jobs=-1,
    ),
}

# Ridge uses scaled features; tree models don't need scaling
fit_data = {
    "Ridge Regression": (X_train_scaled, X_test_scaled),
    "Random Forest": (X_train, X_test),
    "XGBoost": (X_train, X_test),
}

for name, model in models.items():
    X_tr, _ = fit_data[name]
    model.fit(X_tr, y_train)
    print(f"Trained {name}")

Step 6: Evaluate and Compare

from sklearn.metrics import mean_squared_error, r2_score

print(f"\n{'Model':<22} {'RMSE':>8} {'R²':>8}")
print("-" * 42)

results = {}
for name, model in models.items():
    _, X_te = fit_data[name]
    preds = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    r2 = r2_score(y_test, preds)
    results[name] = {"rmse": rmse, "r2": r2}
    print(f"{name:<22} {rmse:>8.3f} {r2:>8.3f}")

Expected results:

Model                  RMSE       R²
------------------------------------------
Ridge Regression      0.731    0.600
Random Forest         0.508    0.807
XGBoost               0.463    0.837

RMSE is in units of $100k, so XGBoost's error of ~$46k is meaningfully better than Ridge's ~$73k.


Step 7: Feature Importances (XGBoost)

xgb_model = models["XGBoost"]
importances = sorted(
    zip(features, xgb_model.feature_importances_),
    key=lambda x: -x[1]
)

print("\nTop features (XGBoost):")
for name, imp in importances[:6]:
    bar = "█" * int(imp * 200)
    print(f"  {name:<20} {imp:.3f}  {bar}")

Typical output:

Top features (XGBoost):
  MedInc               0.512  ████████████████████████████
  Longitude            0.098  ██████
  Latitude             0.087  █████
  DistFromSF           0.071  ████
  AveOccup             0.055  ███
  RoomsPerPerson       0.041  ██

Median income dominates. The engineered DistFromSF and RoomsPerPerson features both made the top 6, justifying the engineering step.


Which Model to Use and Why

ModelRMSETraining timeExplainability
Ridge Regression0.7310.60<1sHigh (coefficients)
Random Forest0.5080.81~10sMedium (importances)
XGBoost0.4630.84~20sMedium (SHAP values)

Use Ridge when you need a fast baseline or need to explain the model coefficient-by-coefficient to a stakeholder.

Use Random Forest when you want solid accuracy without hyperparameter tuning.

Use XGBoost when accuracy matters most and you're willing to tune learning rate, depth, and subsampling. Add SHAP for explainability:

import shap

explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test[:100])
shap.summary_plot(shap_values, X_test[:100], feature_names=features)

Related articles