Most tutorials explain neural networks with diagrams and metaphors. This one explains them with math and code — the minimal version you need to understand what's actually happening when you call model.fit().
By the end you'll have a 2-layer network trained from scratch in NumPy that learns XOR. No PyTorch, no TensorFlow, no abstractions hiding the mechanics.
pip install numpy
Consider the XOR problem:
import numpy as np
X = np.array([[0,0],[0,1],[1,0],[1,1]]) # inputs
y = np.array([[0],[1],[1],[0]]) # XOR outputs
No straight line separates the 1s from the 0s. Logistic regression fails. You need a model that can bend the decision boundary — which is what hidden layers do.
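As a quick check of that claim, the sketch below fits the best linear model to the XOR data. It uses ordinary least squares as a stand-in for logistic regression (an assumption made here for brevity; the variable names are illustrative and not part of the walkthrough). The best a linear fit can do is predict 0.5 for every input.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Append a column of ones so the linear model can learn a bias term
X_aug = np.hstack([X, np.ones((4, 1))])

# Least-squares solution of X_aug @ coeffs ~= y
coeffs, _, _, _ = np.linalg.lstsq(X_aug, y, rcond=None)
linear_pred = X_aug @ coeffs

print("weights and bias:", coeffs.ravel())
print("predictions:", linear_pred.ravel())
```

Both weights come out at zero and the bias at 0.5, so the linear model cannot tell the four points apart.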
A single neuron takes a vector of inputs x, multiplies each by a weight, adds a bias, and passes the result through a non-linear activation function:
output = activation(W · x + b)
That dot product W · x is just a weighted sum. The activation function is what makes it non-linear. Without it, stacking layers would still give you a linear model.
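The claim about stacking can be checked numerically. In this sketch (random weights and names chosen only for illustration), two layers with no activation between them produce the same output as one linear layer whose weight is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
X = rng.normal(size=(5, 2))

# Two "layers" with no activation function between them
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=(1, 1))
stacked = (X @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
single = X @ W_combined + b_combined

print(np.allclose(stacked, single))
```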
We'll use the sigmoid activation: σ(z) = 1 / (1 + e^(-z))
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)
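The derivative formula s * (1 - s) can be sanity-checked against a central finite difference. This is a minimal sketch; the step size h and the test points are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-5, 5, 11)
h = 1e-5  # finite-difference step (illustrative)

# Central difference approximation of d(sigmoid)/dz
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid_deriv(z)

print(np.max(np.abs(numeric - analytic)))
```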
We'll build a network with 2 inputs, one hidden layer of 4 sigmoid units, and a single sigmoid output:
np.random.seed(42)
# Weights start as small random values; biases start at zero
W1 = np.random.randn(2, 4) * 0.1 # (input_size, hidden_size)
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * 0.1 # (hidden_size, output_size)
b2 = np.zeros((1, 1))
Data flows left to right through the network. Each layer applies the linear transformation then the activation:
def forward(X, W1, b1, W2, b2):
    # Hidden layer
    Z1 = X @ W1 + b1          # (4, 2) @ (2, 4) = (4, 4)
    A1 = sigmoid(Z1)          # apply activation elementwise
    # Output layer
    Z2 = A1 @ W2 + b2         # (4, 4) @ (4, 1) = (4, 1)
    A2 = sigmoid(Z2)          # final prediction (probability)
    cache = (Z1, A1, Z2, A2)
    return A2, cache
We store the intermediate values in cache — we'll need them for backprop.
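To see the shapes from the comments concretely, the sketch below runs the same four steps on the XOR inputs and collects the shape of each intermediate array. The seed and the shapes dictionary are illustrative additions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)  # illustrative seed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W1, b1 = rng.normal(size=(2, 4)) * 0.1, np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)) * 0.1, np.zeros((1, 1))

Z1 = X @ W1 + b1   # b1 has shape (1, 4) and broadcasts across the 4 rows
A1 = sigmoid(Z1)
Z2 = A1 @ W2 + b2  # b2 has shape (1, 1) and broadcasts the same way
A2 = sigmoid(Z2)

shapes = {name: arr.shape for name, arr in
          [("Z1", Z1), ("A1", A1), ("Z2", Z2), ("A2", A2)]}
print(shapes)
```

Each row of every array corresponds to one of the four training examples, which is why backprop can later sum over axis 0 to get bias gradients.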
We use binary cross-entropy loss. It penalizes confident wrong predictions much harder than uncertain ones:
def compute_loss(y_true, y_pred):
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, 1e-9, 1 - 1e-9)
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss
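To make "penalizes confident wrong predictions much harder" concrete, this sketch evaluates the loss for a single example whose true label is 1, once with a mildly wrong prediction and once with a confidently wrong one. The two prediction values are arbitrary illustrations.

```python
import numpy as np

def compute_loss(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-9, 1 - 1e-9)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([[1.0]])

mild_loss = compute_loss(y_true, np.array([[0.4]]))        # wrong, but unsure
confident_loss = compute_loss(y_true, np.array([[0.01]]))  # wrong and very sure

print(f"prediction 0.40 -> loss {mild_loss:.3f}")
print(f"prediction 0.01 -> loss {confident_loss:.3f}")
```

The loss is -log(0.4), about 0.916, in the first case and -log(0.01), about 4.605, in the second: roughly five times larger.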
Backprop is just the chain rule applied repeatedly from output to input. We compute how much each weight contributed to the loss and update it in the direction that reduces the loss.
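The output-layer gradient dZ2 = A2 - y in the code follows from the chain rule for a sigmoid output with binary cross-entropy. A short derivation for a single example, writing a for the prediction and z for the pre-activation (symbols introduced here for illustration):

```latex
\begin{aligned}
L &= -\bigl[\, y \log a + (1 - y) \log (1 - a) \,\bigr], \qquad a = \sigma(z) \\
\frac{\partial L}{\partial a} &= -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a (1 - a)} \\
\frac{\partial a}{\partial z} &= \sigma(z)\bigl(1 - \sigma(z)\bigr) = a (1 - a) \\
\frac{\partial L}{\partial z} &= \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} = a - y
\end{aligned}
```

The a(1 - a) factors cancel, which is why no sigmoid derivative appears at the output layer. Averaging the loss over the m examples is what introduces the division by m in dW2 and db2.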
def backward(X, y, W1, b1, W2, b2, cache, lr=0.1):
    Z1, A1, Z2, A2 = cache
    m = X.shape[0]
    # Output layer gradients
    dZ2 = A2 - y                                    # d(loss)/d(Z2)
    dW2 = A1.T @ dZ2 / m
    db2 = np.sum(dZ2, axis=0, keepdims=True) / m
    # Hidden layer gradients (chain rule through sigmoid)
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * sigmoid_deriv(Z1)
    dW1 = X.T @ dZ1 / m
    db1 = np.sum(dZ1, axis=0, keepdims=True) / m
    # Gradient descent update (modifies the arrays in place)
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1
    return W1, b1, W2, b2

lr = 1.0
epochs = 10000
for epoch in range(epochs):
    A2, cache = forward(X, W1, b1, W2, b2)
    loss = compute_loss(y, A2)
    # Biases are passed in explicitly so backward() can update them
    W1, b1, W2, b2 = backward(X, y, W1, b1, W2, b2, cache, lr=lr)
    if epoch % 1000 == 0:
        print(f"Epoch {epoch:5d} | Loss: {loss:.4f}")
Expected output (loss decreasing toward ~0.01):
Epoch 0 | Loss: 0.6928
Epoch 1000 | Loss: 0.6823
Epoch 2000 | Loss: 0.5471
Epoch 3000 | Loss: 0.1702
Epoch 4000 | Loss: 0.0669
Epoch 5000 | Loss: 0.0390
Epoch 6000 | Loss: 0.0271
Epoch 7000 | Loss: 0.0204
Epoch 8000 | Loss: 0.0162
Epoch 9000 | Loss: 0.0132
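If the loss does not fall in a similar way on your machine, a numerical gradient check is one way to confirm that the backprop formulas match the loss. The sketch below is a self-contained version under stated assumptions: loss_fn, analytic_grads, and numeric_grads are hypothetical helper names, and the weights are random rather than the trained ones. It compares the analytic gradients with central finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_fn(params, X, y):
    W1, b1, W2, b2 = params
    A2 = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    A2 = np.clip(A2, 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))

def analytic_grads(params, X, y):
    # Same formulas as the backward pass, without the update step
    W1, b1, W2, b2 = params
    m = X.shape[0]
    A1 = sigmoid(X @ W1 + b1)
    A2 = sigmoid(A1 @ W2 + b2)
    dZ2 = A2 - y
    dW2 = A1.T @ dZ2 / m
    db2 = dZ2.sum(axis=0, keepdims=True) / m
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)
    dW1 = X.T @ dZ1 / m
    db1 = dZ1.sum(axis=0, keepdims=True) / m
    return [dW1, db1, dW2, db2]

def numeric_grads(params, X, y, eps=1e-5):
    # Nudge each parameter up and down and measure the change in loss
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus = loss_fn(params, X, y)
            p[idx] = old - eps
            minus = loss_fn(params, X, y)
            p[idx] = old  # restore the original value
            g[idx] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)  # illustrative seed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
params = [rng.normal(size=(2, 4)), np.zeros((1, 4)),
          rng.normal(size=(4, 1)), np.zeros((1, 1))]

max_diff = max(np.max(np.abs(a - n)) for a, n in
               zip(analytic_grads(params, X, y), numeric_grads(params, X, y)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

A very small difference suggests the gradients are consistent with the loss; a large one usually points to a shape or sign error in the backward pass.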
A2, _ = forward(X, W1, b1, W2, b2)
predictions = (A2 > 0.5).astype(int)
print("Input | Target | Predicted")
for xi, yi, pi in zip(X, y, predictions):
    print(f" {xi} | {yi[0]} | {pi[0]}")
Output:
Input | Target | Predicted
[0 0] | 0 | 0
[0 1] | 1 | 1
[1 0] | 1 | 1
[1 1] | 0 | 0
The network learned XOR — a function that no linear model can represent — by learning a non-linear representation in the hidden layer.
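One way to picture such a hidden representation is a hand-built network. In the sketch below the weights are picked by hand, not learned, and a trained network will generally find different ones: the first hidden unit behaves like OR, the second like NAND, and the output unit behaves like AND over the two. In the hidden space the four points become linearly separable.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked (not learned) weights: hidden unit 0 ~ OR, hidden unit 1 ~ NAND
W1 = np.array([[20.0, -20.0],
               [20.0, -20.0]])
b1 = np.array([[-10.0, 30.0]])

# Output unit ~ AND of the two hidden units
W2 = np.array([[20.0], [20.0]])
b2 = np.array([[-30.0]])

H = sigmoid(X @ W1 + b1)    # hidden representation of each input
out = sigmoid(H @ W2 + b2)
pred = (out > 0.5).astype(int).ravel()

print(np.round(H, 3))
print(pred)
```

Only the inputs [0 1] and [1 0] switch on both hidden units, so a single threshold on their sum is enough to reproduce XOR.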
When you call model.fit(X, y) in Keras (TensorFlow), or run the equivalent training loop in PyTorch, this is what's happening under the hood, millions of times over and at a much larger scale.
The mechanics are the same. Understanding this loop makes debugging failed training runs much easier: exploding gradients, vanishing gradients, and dead ReLUs all have root causes visible at this level.
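As one illustration of a root cause visible at this level, consider vanishing gradients with sigmoid activations. The sigmoid derivative never exceeds 0.25, so each extra sigmoid layer multiplies the backpropagated gradient by a factor of at most 0.25. The sketch below looks only at that activation factor; the weights also scale the gradient in a real network, and the depths shown are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 2001)
s = sigmoid(z)
max_slope = np.max(s * (1 - s))  # peak of the sigmoid derivative, at z = 0

# Upper bound on the product of sigmoid derivatives across `depth` layers
bounds = {depth: max_slope ** depth for depth in (1, 5, 10, 20)}
for depth, bound in bounds.items():
    print(f"{depth:2d} layers -> activation factor at most {bound:.2e}")
```

By 10 layers the bound is already below one in a million, which is why early layers in deep sigmoid networks can stop learning.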