This tutorial walks through a complete ML pipeline for predicting house prices: data loading, exploratory analysis, feature engineering, training three different models, and comparing them on held-out data. We use the California Housing dataset so you can run everything without downloading anything.
pip install scikit-learn xgboost pandas numpy matplotlib
Given features about a housing district in California (median income, house age, location, population), predict the median house value. This is a regression problem with real-world messiness: skewed distributions, correlated features, and geographic data.
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
df = housing.frame
print(df.shape) # (20640, 9)
print(df.dtypes)
print(df.describe())
Features:
MedInc — median income in block groupHouseAge — median house ageAveRooms, AveBedrms — average rooms/bedrooms per householdPopulation, AveOccup — block group population and average occupancyLatitude, Longitude — geographic coordinatesMedHouseVal — target (median house value in $100k)import matplotlib.pyplot as plt
# Target distribution
plt.figure(figsize=(8, 4))
plt.hist(df["MedHouseVal"], bins=50, edgecolor="k")
plt.title("Distribution of Median House Values")
plt.xlabel("Median House Value ($100k)")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("target_distribution.png", dpi=100)
# Correlation with target
print(df.corr()["MedHouseVal"].sort_values(ascending=False))
Key finding: MedInc has the strongest correlation with price (0.69). Latitude and longitude are moderately correlated — geography matters, but needs encoding.
Raw features rarely give you the best signal. We'll add a few derived features:
def engineer_features(df):
df = df.copy()
# Rooms and bedrooms per person are more informative than totals
df["RoomsPerPerson"] = df["AveRooms"] / df["AveOccup"]
df["BedrmsPerRoom"] = df["AveBedrms"] / df["AveRooms"]
# High income flag (top quartile)
df["HighIncome"] = (df["MedInc"] > df["MedInc"].quantile(0.75)).astype(int)
# Distance from San Francisco (a major price driver)
sf_lat, sf_lon = 37.7749, -122.4194
df["DistFromSF"] = np.sqrt(
(df["Latitude"] - sf_lat) ** 2 + (df["Longitude"] - sf_lon) ** 2
)
return df
df_feat = engineer_features(df)
print(df_feat.shape) # (20640, 13)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
TARGET = "MedHouseVal"
features = [c for c in df_feat.columns if c != TARGET]
X = df_feat[features].values
y = df_feat[TARGET].values
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
models = {
"Ridge Regression": Ridge(alpha=1.0),
"Random Forest": RandomForestRegressor(
n_estimators=200, max_features=0.5, n_jobs=-1, random_state=42
),
"XGBoost": xgb.XGBRegressor(
n_estimators=500,
learning_rate=0.05,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
random_state=42,
n_jobs=-1,
),
}
# Ridge uses scaled features; tree models don't need scaling
fit_data = {
"Ridge Regression": (X_train_scaled, X_test_scaled),
"Random Forest": (X_train, X_test),
"XGBoost": (X_train, X_test),
}
for name, model in models.items():
X_tr, _ = fit_data[name]
model.fit(X_tr, y_train)
print(f"Trained {name}")
from sklearn.metrics import mean_squared_error, r2_score
print(f"\n{'Model':<22} {'RMSE':>8} {'R²':>8}")
print("-" * 42)
results = {}
for name, model in models.items():
_, X_te = fit_data[name]
preds = model.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_test, preds))
r2 = r2_score(y_test, preds)
results[name] = {"rmse": rmse, "r2": r2}
print(f"{name:<22} {rmse:>8.3f} {r2:>8.3f}")
Expected results:
Model RMSE R²
------------------------------------------
Ridge Regression 0.731 0.600
Random Forest 0.508 0.807
XGBoost 0.463 0.837
RMSE is in units of $100k, so XGBoost's error of ~$46k is meaningfully better than Ridge's ~$73k.
xgb_model = models["XGBoost"]
importances = sorted(
zip(features, xgb_model.feature_importances_),
key=lambda x: -x[1]
)
print("\nTop features (XGBoost):")
for name, imp in importances[:6]:
bar = "█" * int(imp * 200)
print(f" {name:<20} {imp:.3f} {bar}")
Typical output:
Top features (XGBoost):
MedInc 0.512 ████████████████████████████
Longitude 0.098 ██████
Latitude 0.087 █████
DistFromSF 0.071 ████
AveOccup 0.055 ███
RoomsPerPerson 0.041 ██
Median income dominates. The engineered DistFromSF and RoomsPerPerson features both made the top 6, justifying the engineering step.
| Model | RMSE | R² | Training time | Explainability |
|---|---|---|---|---|
| Ridge Regression | 0.731 | 0.60 | <1s | High (coefficients) |
| Random Forest | 0.508 | 0.81 | ~10s | Medium (importances) |
| XGBoost | 0.463 | 0.84 | ~20s | Medium (SHAP values) |
Use Ridge when you need a fast baseline or need to explain the model coefficient-by-coefficient to a stakeholder.
Use Random Forest when you want solid accuracy without hyperparameter tuning.
Use XGBoost when accuracy matters most and you're willing to tune learning rate, depth, and subsampling. Add SHAP for explainability:
import shap
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test[:100])
shap.summary_plot(shap_values, X_test[:100], feature_names=features)