Build a Neural Network From Scratch
A practical guide to building a neural network from scratch.
7/13/2025 · 8 min read
I recently watched a series of videos by Andrej Karpathy on neural networks. After following his explanations, I realized how surprisingly approachable it is to build a neural network from scratch. The hands-on experience gave me a practical understanding of how neural networks function and how they are trained, and I’d like to share some of those insights with you.
An Intuitive Understanding of Neural Networks
Let’s begin with an intuitive look at neural networks. At their core, neural networks consist of three types of layers: the input layer, hidden layers, and the output layer. The input layer holds the raw input data — the number of neurons here matches the number of features in each training example. The hidden layers and the output layer are made up of neurons, each with its own set of weights and bias. These weights and biases are used to compute a linear combination of the outputs from the previous layer. The result of this linear combination is passed through an activation function, which can be linear or non-linear. The output of each neuron — after applying the activation function — becomes the input for the next layer. This process repeats layer by layer until the network produces its final output from the output layer.
The weights and bias in each neuron are the parameters the network learns during training. The gradients needed for this learning are computed through a process called backpropagation. To guide the learning, we define a loss function, such as mean squared error (MSE) for regression tasks or cross-entropy for classification. The loss measures the difference between the network’s predictions and the ground truth; cross-entropy can equivalently be read as the negative log-likelihood of the training data under the current model parameters. Our goal is to minimize this loss. We do this with gradient descent, an optimization method that repeatedly moves the weights and biases a small step in the direction opposite to the gradient of the loss. Note: using the directional derivative, one can formally show that the gradient points in the direction of steepest increase of a function, that is, the direction of maximum rate of change at a point. Moving in the opposite direction therefore gives the steepest decrease.
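The note on the directional derivative can be written out in two lines. The notation here is chosen for illustration and is not taken from the videos: f is a differentiable loss, θ the parameter vector, u a unit direction, φ the angle between u and the gradient, and η the learning rate.

```latex
D_{\mathbf{u}} f(\boldsymbol{\theta})
  = \nabla f(\boldsymbol{\theta}) \cdot \mathbf{u}
  = \lVert \nabla f(\boldsymbol{\theta}) \rVert \cos\varphi,
  \qquad \lVert \mathbf{u} \rVert = 1

\boldsymbol{\theta}_{t+1}
  = \boldsymbol{\theta}_{t} - \eta \, \nabla f(\boldsymbol{\theta}_{t})
```

Since cos φ ranges from -1 to 1, the rate of change is largest when u points along the gradient (φ = 0) and most negative when u points against it (φ = π). The second line is the gradient descent update that follows from choosing that steepest-descent direction.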
In essence, training a neural network involves an initialization step followed by three major steps that are repeated: 0) Initialization: set the initial values of the weights and biases. 1) Forward Pass: compute the network’s output using the current parameters. 2) Backpropagation: calculate the loss, then compute the gradients of the loss with respect to the model parameters. 3) Parameter Update: adjust the model parameters using the gradients and a predefined learning rate.
Implement Backpropagation
To build a neural network from scratch, the first critical step is implementing backpropagation. The most important part of this process is computing the derivatives of the loss function with respect to all model parameters, primarily the weights and biases. The key to achieving this lies in the chain rule from calculus. For example, given the loss function Loss = (y-y_pred)^2, the derivative of the loss with respect to y_pred is dLoss/dy_pred = -2(y-y_pred). Now, assume the output neuron uses a linear activation function, such that y_pred = x*w+b. Then dy_pred/dw = x, and by the chain rule, dLoss/dw = (dLoss/dy_pred)*(dy_pred/dw) = -2(y-y_pred)*x. Similarly, we can work backward to all the weights in the previous layers by applying the same rule recursively. In simple terms, the global gradient of a node (child) = its local gradient * the upstream global gradient of its parent, where each "parent" is a function of its "child" in the forward pass. When a child feeds into several parents, the contributions from all parents are summed.
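The chain-rule example above can be checked numerically. The sketch below is only a sanity check under illustrative assumptions: the function names and the input values are made up for this example, and the finite-difference step h is an arbitrary small number.

```python
def loss(w, b, x, y):
    """Squared error for a single linear neuron: (y - (x*w + b))**2."""
    y_pred = x * w + b
    return (y - y_pred) ** 2


def analytic_grads(w, b, x, y):
    """Gradients of the loss w.r.t. w and b, obtained with the chain rule."""
    y_pred = x * w + b
    dloss_dypred = -2.0 * (y - y_pred)  # local gradient of the loss
    dypred_dw = x                       # local gradient of y_pred w.r.t. w
    dypred_db = 1.0                     # local gradient of y_pred w.r.t. b
    return dloss_dypred * dypred_dw, dloss_dypred * dypred_db


def numeric_grads(w, b, x, y, h=1e-6):
    """Central finite differences, used only to cross-check the chain rule."""
    dw = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
    db = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
    return dw, db


# Arbitrary illustrative values (assumptions, not from the original post).
w, b, x, y = 0.5, -1.0, 2.0, 3.0
grad_w, grad_b = analytic_grads(w, b, x, y)
num_w, num_b = numeric_grads(w, b, x, y)
print("chain rule:        ", grad_w, grad_b)
print("finite differences:", num_w, num_b)
```

With these values y_pred is 0, so dLoss/dy_pred is -6, which gives dLoss/dw = -12 and dLoss/db = -6; the finite-difference estimates should agree up to rounding error.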
Andrej implements backpropagation in a particularly elegant way (code link) by defining a Value object. This object not only stores the value of each computation but also keeps track of the operation performed, its input nodes (children), the local gradient for each operation, and the accumulated global gradient for each child. For basic mathematical operations like addition, multiplication, exponentiation, power, tanh, or ReLU, it’s straightforward to compute their local gradients. By embedding these operations directly into the Value class, each instance carries both the computed result and the gradient information needed for backpropagation. Moreover, Andrej’s design includes a method within the Value object that traverses the computation graph in reverse order — effectively setting up the backward pass to compute gradients.
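A pared-down sketch of such a Value object, written in the spirit of the design described above, might look like the following. It is not a copy of Andrej's code: it supports only addition, multiplication, and tanh, and the attribute names (_prev, _op, _backward) are choices made for this illustration.

```python
import math


class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, _children=(), _op=""):
        self.data = data
        self.grad = 0.0                # accumulated global gradient
        self._backward = lambda: None  # pushes this node's grad to its inputs
        self._prev = set(_children)    # input nodes (children)
        self._op = _op                 # operation label, useful for debugging

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            # Local gradient of a sum is 1 for each input.
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")

        def _backward():
            # Local gradient of a product is the other factor.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), "tanh")

        def _backward():
            # d tanh(x) / dx = 1 - tanh(x)**2
            self.grad += (1.0 - t * t) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()

        def build(node):
            if node not in visited:
                visited.add(node)
                for child in node._prev:
                    build(child)
                topo.append(node)

        build(self)
        self.grad = 1.0  # gradient of the output with respect to itself
        for node in reversed(topo):
            node._backward()


a, b, c = Value(2.0), Value(-3.0), Value(10.0)
e = a * b + c
e.backward()
print(e.data, a.grad, b.grad, c.grad)
```

Gradients are added with += rather than assigned, so a node that feeds into several operations receives the sum of all upstream contributions. This is also why gradients need to be reset between training steps.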
Build a Neural Network
As we discussed earlier, a neural network is composed of layers, each consisting of individual neurons, making the neuron the fundamental building block of a neural network. We can define a Neuron class that takes the number of input features to the neuron as input, which determines how many weights the neuron will have. Inside the class, the neuron’s weights and bias are randomly initialized and represented by the Value objects we implemented earlier. The Neuron class provides a call method that computes the neuron’s output by applying an activation function to the linear combination of input x, weights, and bias. Additionally, the class exposes a method that returns all of its parameters (weights and bias), which will be useful later for gradient updates during training. This design integrates naturally with the computation graph maintained by the Value objects, allowing gradients to propagate automatically during backpropagation.
With the Neuron class in place, we can now build a Layer class on top of it, since each layer is simply a collection of neurons. The Layer class takes two inputs: the number of neurons in the layer (e.g., nout) and the input dimension of each neuron (e.g., nin). Upon initialization, the class creates a list of Neuron instances, one for each neuron in the layer. The class also defines a call method, which passes the input through each neuron and collects their outputs into a list, effectively producing the layer’s output. Finally, the Layer class provides a method that returns all parameters (weights and biases) of its neurons, flattened into a single list. This flattening makes it easier to loop over the parameters during backpropagation and gradient updates.
Now, we can construct the entire neural network, also known as a Multi-Layer Perceptron (MLP) — essentially a stack of layers. The MLP class takes two inputs: the initial input dimension and a list specifying the number of neurons in each hidden layer and the output layer, in order. Upon initialization, it creates a list of Layer instances, where the output size (nout) of the previous layer matches the input size (nin) of the next. Note: the output size of one layer is the number of neurons in that layer. The MLP class defines a call method that sequentially passes the output of each layer as input to the next. Finally, it provides a method that returns all parameters of the network, flattened into a single list, which simplifies gradient updates during training.
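The three classes described above can be sketched as follows. To keep the block runnable on its own, the weights here are plain Python floats and only the forward pass is shown; in the design described in this post each weight and bias would be a Value object so that gradients can flow back. The tanh activation and the uniform initialization in [-1, 1] are assumptions made for this sketch.

```python
import math
import random


class Neuron:
    def __init__(self, nin):
        # One weight per input feature plus a bias, randomly initialized.
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # Linear combination of inputs and weights, then the activation.
        act = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return math.tanh(act)

    def parameters(self):
        return self.w + [self.b]


class Layer:
    def __init__(self, nin, nout):
        # nout neurons, each seeing the same nin inputs.
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        return [neuron(x) for neuron in self.neurons]

    def parameters(self):
        # Flatten the parameters of all neurons into one list.
        return [p for neuron in self.neurons for p in neuron.parameters()]


class MLP:
    def __init__(self, nin, nouts):
        # The output size of one layer is the input size of the next.
        sizes = [nin] + list(nouts)
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


random.seed(0)
model = MLP(3, [4, 4, 1])  # 3 inputs, two hidden layers of 4, 1 output
output = model([2.0, 3.0, -1.0])
print(output, len(model.parameters()))
```

For this architecture the parameter count is 4*(3+1) + 4*(4+1) + 1*(4+1) = 41, which is a quick way to confirm that the layer sizes are wired together as intended.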
With the MLP class in place, we now have a complete neural network structure ready for training. As we discussed earlier, training a neural network typically involves three major steps. We start by creating an instance of the MLP class based on the desired network architecture. For each batch of training data, we pass the input through the network to generate predictions, which are then used to compute the loss. We perform backpropagation to compute the gradients of the loss with respect to all model parameters. Then, we update the parameters by subtracting the product of the learning rate and the computed gradients.
One important detail in training is that we must reset the gradients of all parameters to zero before each backward pass (equivalently, after each parameter update). In the Value-based design, the backward pass adds to each parameter’s stored gradient rather than overwriting it, so without a reset, gradients from previous batches would accumulate and interfere with the current update. In classic gradient descent, we compute the "true" gradient over the entire dataset, giving the most accurate direction for minimizing the loss. However, this approach is often computationally expensive, especially for large datasets. To make training more efficient, we typically use Stochastic Gradient Descent (SGD), where instead of calculating the gradient on the entire dataset, we compute it on a single random sample (or a small mini-batch). While this gives a noisy estimate of the true gradient, it remains an unbiased approximation when samples are drawn uniformly at random. Over many updates, these noisy gradients tend to average out, guiding the model in the right direction toward convergence. Because each sample (or batch) is meant to provide its own independent "vote" on how the parameters should adjust, it’s critical that we reset the gradients after each update. This ensures that each training step reflects only the signal from the current batch, rather than a mix of signals from previous ones. You can find Andrej Karpathy's code here.
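The training loop, including the gradient reset, can be sketched on the smallest possible model: a single linear neuron y_pred = x*w + b with hand-written gradients. This stands in for the full MLP only to keep the example short. The toy data, learning rate, and step count are assumptions chosen for illustration.

```python
import random

# Toy data generated from y = 2x + 1 (an illustrative assumption).
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

random.seed(0)
# 0) Initialization: random starting values for the parameters.
params = {"w": random.uniform(-1, 1), "b": random.uniform(-1, 1)}
grads = {"w": 0.0, "b": 0.0}
learning_rate = 0.05

for step in range(500):
    # SGD: one random sample per step instead of the whole dataset.
    x, y = random.choice(data)

    # Reset the gradients; they are accumulated with += below, so stale
    # values from the previous step would otherwise leak into this one.
    grads["w"] = 0.0
    grads["b"] = 0.0

    # 1) Forward pass.
    y_pred = x * params["w"] + params["b"]
    loss = (y - y_pred) ** 2

    # 2) Backpropagation via the chain rule.
    dloss_dypred = -2.0 * (y - y_pred)
    grads["w"] += dloss_dypred * x
    grads["b"] += dloss_dypred * 1.0

    # 3) Parameter update: step against the gradient.
    params["w"] -= learning_rate * grads["w"]
    params["b"] -= learning_rate * grads["b"]

print(params)
```

Because the toy data is exactly linear, the parameters should settle close to w = 2 and b = 1. Removing the two reset lines makes every update use the running sum of all past gradients, which is a quick way to see why the reset matters.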
Here are some practical notes about training a neural network model.
Learning rate: The learning rate determines the step size taken during gradient descent. A learning rate that is too small slows down convergence and may leave the optimizer stuck on a plateau or in a suboptimal region of the loss surface. Conversely, a learning rate that is too large can cause the loss to oscillate or diverge. A common practice is to first try a few small and large values and observe their effect on the loss to identify a range where training is stable. You can then use a linspace function to generate a list of learning rates within this range and iterate through them. By plotting the learning rate against the corresponding loss, you can identify the largest learning rate that still converges stably. Another effective technique is learning rate decay, which gradually reduces the learning rate during training to help fine-tune the model as it approaches a minimum.
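A learning rate sweep of this kind can be sketched on a toy problem. Everything here is an assumption for illustration: the loss (w - 3)**2 stands in for a real network's loss, the linspace helper is a small stand-in for a library function of the same name, and the results are printed rather than plotted so the example needs no extra packages.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop, inclusive (illustrative helper)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def final_loss(learning_rate, steps=20):
    """Run gradient descent on the toy loss (w - 3)**2 and return the last loss."""
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= learning_rate * grad
    return (w - 3.0) ** 2


learning_rates = linspace(0.05, 1.2, 24)
results = [(lr, final_loss(lr)) for lr in learning_rates]
for lr, loss_value in results:
    print(f"lr={lr:.2f}  final loss={loss_value:.3e}")
```

On this toy loss each step multiplies the distance to the minimum by (1 - 2*lr), so rates below 1 shrink the loss, a rate of exactly 1 only bounces back and forth, and rates above 1 make it grow. A real network has no such closed form, which is why the sweep is done empirically.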
Number of training iterations: The number of training iterations refers to how many times the parameters of a neural network are updated using gradient descent. To illustrate this, imagine we are minimizing a function f(x) with respect to some variable x. We start with a random initial value, compute the gradient of f at x, and update x in the opposite direction of the gradient to reduce the function's value; this constitutes one iteration. We then repeat the process: compute the gradient at the updated x, update x again, and so on. In the case of neural networks, the model parameters play the role of x; during each iteration, they are adjusted to reduce the loss function. Selecting an appropriate number of training iterations is crucial to avoid both overfitting and underfitting. This is typically done by splitting the dataset into training, development (dev), and test sets. The model is trained using the training set, while the dev set is used to monitor performance during training. By comparing the loss on the training set and the dev set, we can assess whether the model is underfitting (both losses are high) or overfitting (training loss is low but dev loss is high). This comparison helps determine when to stop training to achieve good generalization. Use validation-based early stopping and regularization to mitigate the risk of overfitting.
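Validation-based early stopping can be sketched as a small helper that watches the dev loss and gives up after it has failed to improve for a set number of evaluations. The function name, the patience parameter, and the loss values below are all hypothetical, chosen only to show the mechanism.

```python
def early_stopping_index(dev_losses, patience=3):
    """Return the index of the best dev loss seen before patience runs out."""
    best_index = 0
    best_loss = float("inf")
    waited = 0
    for i, dev_loss in enumerate(dev_losses):
        if dev_loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_index, best_loss, waited = i, dev_loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` evaluations in a row
    return best_index


# Hypothetical dev-set losses, one per evaluation round.
dev_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.40]
stop_at = early_stopping_index(dev_losses, patience=3)
print(stop_at)
```

In this made-up sequence the helper stops after three non-improving rounds and reports round 3, so it never sees the later value of 0.40. That is the trade-off patience controls: a larger value wastes more compute but is less likely to stop during a temporary plateau.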
Batch size: As mentioned earlier, training on the entire dataset in each iteration can be computationally expensive. Instead, we often use a subset of the training data—called a batch—for each iteration. The batch size determines how many samples are used in each training step. A smaller batch size results in faster iterations but introduces more noise into the loss trajectory, which can make convergence less stable. Conversely, a larger batch size leads to more stable and accurate gradient estimates but slows down each iteration due to increased computation.
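One simple way to form batches is to shuffle the sample indices once per pass over the data and slice them into chunks. The helper below is a minimal sketch; the name minibatches and its parameters are illustrative, not from any particular library.

```python
import random


def minibatches(samples, batch_size, rng=random):
    """Yield shuffled batches that together cover every sample exactly once."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        # The last batch may be smaller if the sizes do not divide evenly.
        yield [samples[i] for i in indices[start:start + batch_size]]


random.seed(0)
samples = list(range(10))  # stand-in for ten training examples
batches = list(minibatches(samples, batch_size=4))
print([len(batch) for batch in batches])
```

With ten samples and a batch size of four, this yields batches of sizes 4, 4, and 2. Setting batch_size to 1 gives pure SGD, and setting it to the dataset size gives full-batch gradient descent.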
Number of Neurons per Layer: This determines the capacity of each layer to capture patterns in the data. Too few neurons can lead to underfitting, while too many may increase the risk of overfitting and computational cost. A good starting point is to choose a size similar to the number of input features, then adjust based on model performance. Use regularization (e.g., dropout, weight decay) when increasing neuron count to mitigate overfitting.
Number of Layers: Shallow networks (1–2 hidden layers) can handle simple tasks, but deeper networks are better suited for complex patterns and non-linear relationships. However, adding too many layers can cause issues like vanishing gradients or overfitting. Start with 1–3 layers and gradually deepen the architecture if the model underperforms, using techniques like batch normalization and residual connections for stability in deeper models.
Activation Functions: Activation functions introduce non-linearity, enabling neural networks to learn complex functions. For hidden layers, common choices include ReLU (a common default), Leaky ReLU (which keeps a small non-zero gradient for negative inputs, helping avoid "dead" neurons), and GELU (a smooth alternative to ReLU). For the output layer, the activation depends on the task: use softmax for multi-class classification, sigmoid for binary classification, and no activation (linear) for regression tasks.
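For reference, the activation functions named above can be written in a few lines of plain Python. These are scalar, unoptimized sketches for illustration; the negative_slope default of 0.01 is a commonly used value assumed here, and GELU is given in its exact form based on the error function.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    # Keeps a small slope for negative inputs instead of a flat zero.
    return x if x > 0 else negative_slope * x


def gelu(x):
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x):
    # Split by sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(relu(-2.0), leaky_relu(-2.0), gelu(1.0), sigmoid(0.0), probs)
```

Softmax turns a list of scores into probabilities that sum to 1, which is why it pairs with cross-entropy for multi-class outputs, while sigmoid does the same job for a single binary score.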