How Neural Networks Actually Work: Forward Pass & Backprop in NumPy

Most tutorials explain neural networks with diagrams and metaphors. This one explains them with math and code — the minimal version you need to understand what's actually happening when you call model.fit().

By the end you'll have a 2-layer network trained from scratch in NumPy that learns XOR. No PyTorch, no TensorFlow, no abstractions hiding the mechanics.

pip install numpy

The Problem: Non-Linear Data

Consider the XOR problem:

import numpy as np

X = np.array([[0,0],[0,1],[1,0],[1,1]])  # inputs
y = np.array([[0],[1],[1],[0]])           # XOR outputs

No straight line separates the 1s from the 0s, so any model with a linear decision boundary, logistic regression included, cannot classify all four points correctly. You need a model that can bend the decision boundary, which is what hidden layers do.
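
To see the failure concretely, here is a minimal sketch that fits the best possible linear model to XOR. It uses ordinary least squares as a stand-in for logistic regression, chosen only because it has a closed-form solution; the names `X_aug`, `coef`, and `linear_pred` are illustrative.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Append a column of ones so the linear model can learn a bias term.
X_aug = np.hstack([X, np.ones((4, 1))])

# Least-squares fit of y ≈ X_aug @ coef (the best linear model for this data).
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
linear_pred = X_aug @ coef

print(linear_pred.ravel())  # every prediction sits at (about) 0.5
```

The fit ends up with zero weight on both inputs and a bias of 0.5, so it predicts the same value for every point and cannot tell the classes apart.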


What a Neuron Actually Computes

A single neuron takes a vector of inputs x, multiplies each by a weight, adds a bias, and passes the result through a non-linear activation function:

output = activation(W · x + b)

That dot product W · x is just a weighted sum. The activation function is what makes it non-linear. Without it, stacking layers would still give you a linear model.
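
The claim that stacked layers without an activation stay linear can be checked numerically. The sketch below uses made-up random weights (the shapes and the `rng` seed are arbitrary) and shows that two activation-free layers equal one linear layer with combined weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Two "layers" with no activation function in between.
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=(1, 1))
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = X @ W_combined + b_combined

print(np.allclose(two_layers, one_layer))  # True
```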

We'll use the sigmoid activation: σ(z) = 1 / (1 + e^(-z))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)
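
As a quick sanity check, the analytic derivative can be compared against a finite-difference estimate. This is a sketch: the test points in `z` and the step size `eps` are arbitrary choices, not values from the rest of the article.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
eps = 1e-5  # step size for the finite-difference estimate

# Central difference: (f(z + eps) - f(z - eps)) / (2 * eps)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_deriv(z)

print(np.max(np.abs(numeric - analytic)))  # tiny: the two estimates agree
print(sigmoid_deriv(0.0))                  # 0.25, the derivative's maximum
```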

Network Architecture

We'll build:

  • Input layer: 2 neurons (x1, x2)
  • Hidden layer: 4 neurons with sigmoid activation
  • Output layer: 1 neuron with sigmoid activation (binary output)

np.random.seed(42)

# Weights start as small random values; biases start at zero
W1 = np.random.randn(2, 4) * 0.1   # (input_size, hidden_size)
b1 = np.zeros((1, 4))

W2 = np.random.randn(4, 1) * 0.1   # (hidden_size, output_size)
b2 = np.zeros((1, 1))

Forward Pass

Data flows left to right through the network. Each layer applies the linear transformation then the activation:

def forward(X, W1, b1, W2, b2):
    # Hidden layer
    Z1 = X @ W1 + b1          # (4, 2) @ (2, 4) = (4, 4)
    A1 = sigmoid(Z1)           # apply activation elementwise

    # Output layer
    Z2 = A1 @ W2 + b2          # (4, 4) @ (4, 1) = (4, 1)
    A2 = sigmoid(Z2)           # final prediction (probability)

    cache = (Z1, A1, Z2, A2)
    return A2, cache

We store the intermediate values in cache — we'll need them for backprop.
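
Running the forward pass once on the freshly initialized weights is a useful shape check. The sketch below repeats the definitions from above so it runs on its own; with small random weights, the outputs should all land near 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1
    A1 = sigmoid(Z1)
    Z2 = A1 @ W2 + b2
    A2 = sigmoid(Z2)
    return A2, (Z1, A1, Z2, A2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

np.random.seed(42)
W1 = np.random.randn(2, 4) * 0.1
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * 0.1
b2 = np.zeros((1, 1))

A2, cache = forward(X, W1, b1, W2, b2)
Z1, A1, Z2, _ = cache

print(Z1.shape, A1.shape, Z2.shape, A2.shape)  # (4, 4) (4, 4) (4, 1) (4, 1)
print(A2.ravel())  # all close to 0.5: the untrained network is just guessing
```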


Loss Function

We use binary cross-entropy loss. It penalizes confident wrong predictions much harder than uncertain ones:

def compute_loss(y_true, y_pred):
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, 1e-9, 1 - 1e-9)
    # Average the per-example cross-entropy over the batch
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss
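
To make "penalizes confident wrong predictions much harder" concrete, the sketch below evaluates the loss for a single example whose true label is 1, at three illustrative predicted probabilities.

```python
import numpy as np

def compute_loss(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-9, 1 - 1e-9)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([[1.0]])  # the correct label is 1

for p in (0.99, 0.5, 0.01):
    loss = compute_loss(y_true, np.array([[p]]))
    print(f"predicted {p:.2f} -> loss {loss:.3f}")
# predicted 0.99 -> loss 0.010
# predicted 0.50 -> loss 0.693
# predicted 0.01 -> loss 4.605
```

An uncertain guess of 0.5 costs about 0.69, while a confident wrong answer of 0.01 costs more than six times as much.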

Backpropagation

Backprop is just the chain rule applied repeatedly from output to input. We compute how much each weight contributed to the loss and update it in the direction that reduces the loss.
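
Here is the chain rule on the smallest possible case: one example, one output neuron. The numbers for `y` and `z` are illustrative. Multiplying the two local derivatives gives `a - y`, which is why the output-layer gradient in the backprop code is simply `A2 - y`.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One training example and one pre-activation value (illustrative numbers).
y, z = 1.0, 0.3
a = sigmoid(z)

# Local derivatives of each step in the chain z -> a -> loss.
dloss_da = -(y / a) + (1 - y) / (1 - a)  # binary cross-entropy w.r.t. a
da_dz = a * (1 - a)                      # sigmoid w.r.t. z

# Chain rule: multiply the local derivatives.
dloss_dz = dloss_da * da_dz

print(np.isclose(dloss_dz, a - y))  # True: the product simplifies to (a - y)
```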

def backward(X, y, W1, b1, W2, b2, cache, lr=0.1):
    Z1, A1, Z2, A2 = cache
    m = X.shape[0]

    # Output layer gradients
    # With sigmoid + binary cross-entropy, d(loss)/d(Z2) simplifies to A2 - y
    dZ2 = A2 - y
    dW2 = A1.T @ dZ2 / m
    db2 = np.sum(dZ2, axis=0, keepdims=True) / m

    # Hidden layer gradients (chain rule through sigmoid)
    dA1 = dZ2 @ W2.T
    dZ1 = dA1 * sigmoid_deriv(Z1)
    dW1 = X.T @ dZ1 / m
    db1 = np.sum(dZ1, axis=0, keepdims=True) / m

    # Gradient descent update (in place, after all gradients are computed)
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1

    return W1, b1, W2, b2

Training Loop

lr = 1.0
epochs = 10000

for epoch in range(epochs):
    A2, cache = forward(X, W1, b1, W2, b2)
    loss = compute_loss(y, A2)
    W1, b1, W2, b2 = backward(X, y, W1, b1, W2, b2, cache, lr=lr)

    if epoch % 1000 == 0:
        print(f"Epoch {epoch:5d} | Loss: {loss:.4f}")

Expected output (your exact numbers may differ slightly; the loss should fall toward ~0.01):

Epoch     0 | Loss: 0.6928
Epoch  1000 | Loss: 0.6823
Epoch  2000 | Loss: 0.5471
Epoch  3000 | Loss: 0.1702
Epoch  4000 | Loss: 0.0669
Epoch  5000 | Loss: 0.0390
Epoch  6000 | Loss: 0.0271
Epoch  7000 | Loss: 0.0204
Epoch  8000 | Loss: 0.0162
Epoch  9000 | Loss: 0.0132

Checking the Result

A2, _ = forward(X, W1, b1, W2, b2)
predictions = (A2 > 0.5).astype(int)

print("Input | Target | Predicted")
for xi, yi, pi in zip(X, y, predictions):
    print(f"  {xi}  |   {yi[0]}    |    {pi[0]}")

Output:

Input | Target | Predicted
  [0 0]  |   0    |    0
  [0 1]  |   1    |    1
  [1 0]  |   1    |    1
  [1 1]  |   0    |    0

The network learned XOR — a function that no linear model can represent — by learning a non-linear representation in the hidden layer.


What This Means in Practice

When you call model.fit(X, y) in Keras, or write a training loop in PyTorch or TensorFlow, this same loop is running under the hood, millions of times, with:

  • Mini-batches instead of full-batch gradient descent
  • Adam or RMSProp instead of vanilla SGD
  • Dropout for regularization and batch norm to stabilize training
  • Autograd computing the chain rule automatically
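
As one example from the list above, mini-batching only changes which rows each update sees. The helper below is a hypothetical sketch (`iterate_minibatches` is not a library function): it shuffles the row indices once per pass and yields slices of the data.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs covering the data once, in random order."""
    indices = rng.permutation(X.shape[0])
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
rng = np.random.default_rng(0)

for X_batch, y_batch in iterate_minibatches(X, y, batch_size=2, rng=rng):
    # In a real loop, forward() and backward() would run on this batch only.
    print(X_batch.shape, y_batch.shape)  # (2, 2) (2, 1) for each of the two batches
```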

The mechanics are the same. Understanding this loop makes debugging failed training runs much easier: exploding gradients, vanishing gradients, and dead ReLUs all have root causes visible at this level.
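
Vanishing gradients, for instance, follow directly from the backprop step that multiplies by the sigmoid derivative. That derivative never exceeds 0.25, so each extra sigmoid layer shrinks the activation-derivative part of the gradient by at least a factor of four. The sketch below computes only that upper bound; it ignores the weight matrices, which also scale the gradient.

```python
max_sigmoid_grad = 0.25  # sigmoid'(z) peaks at z = 0, where it equals 0.25

for depth in (2, 5, 10, 20):
    # Upper bound on the activation-derivative factor after `depth` sigmoid layers.
    bound = max_sigmoid_grad ** depth
    print(f"{depth:2d} layers -> factor <= {bound:.2e}")
#  2 layers -> factor <= 6.25e-02
#  5 layers -> factor <= 9.77e-04
# 10 layers -> factor <= 9.54e-07
# 20 layers -> factor <= 9.09e-13
```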

