An Overview of Transformer Architecture
Transformer: The Foundation of Modern Large Language Models
6/8/2025 · 4 min read
The Transformer architecture is the foundation of most modern Large Language Models (LLMs). It was introduced by Google researchers in the influential paper "Attention Is All You Need." This blog aims to provide an intuitive understanding of the Transformer architecture and uncover how LLMs work.
Each Transformer layer is composed of a multi-head self-attention mechanism followed by a feed-forward neural network. Modern LLMs typically contain many such layers.
Multi-Head Attention
Multi-head attention is built from several single-head self-attention mechanisms operating in parallel. Each self-attention head processes the input using three key matrices: 1) the query matrix, 2) the key matrix, and 3) the value matrix. After tokenizing the input text, we obtain the embedding for each token by looking it up in a large embedding matrix, commonly of size 12.8k × 50k, where 12.8k is the embedding dimension and 50k is the vocabulary size (i.e., the number of distinct tokens). Each token's embedding is then multiplied by 1) the query matrix to produce its query vector, 2) the key matrix to produce its key vector, and 3) the value matrix to produce its value vector.
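As a concrete (and deliberately tiny) sketch of these steps, the PyTorch snippet below performs the embedding lookup and the three projections for a single head. All sizes and token ids are made up for illustration and are far smaller than the 12.8k-dimension, 50k-vocabulary example above; the embedding matrix is also stored with one row per token, i.e., the transpose of the 12.8k × 50k orientation in the text.

```python
import torch

# Toy sizes for illustration only.
vocab_size, d_model, d_head = 1000, 64, 16

# Embedding matrix: one d_model-dimensional embedding per vocabulary token.
embedding = torch.randn(vocab_size, d_model)

# Hypothetical token ids produced by a tokenizer.
token_ids = torch.tensor([42, 7, 512, 3, 99])
x = embedding[token_ids]            # (seq_len, d_model) token embeddings

# Learned projection matrices for a single attention head.
W_q = torch.randn(d_model, d_head)
W_k = torch.randn(d_model, d_head)
W_v = torch.randn(d_model, d_head)

Q = x @ W_q                         # query vectors, one per token
K = x @ W_k                         # key vectors
V = x @ W_v                         # value vectors
```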
Next, we compute the dot product between each query and every key vector in the sequence, resulting in the attention scores stored in the attention matrix. For decoder-only models like ChatGPT, only the lower triangular part of this matrix is retained (the entries above the diagonal are masked out and become zero after the softmax). This enforces causality, ensuring that each token can only attend to itself and earlier tokens in the sequence.
To compute the attention output for a token, we take a weighted sum of all the value vectors, where the weights are the attention scores (usually normalized via softmax). For example, if the attention score between the 4th and 2nd tokens is 0.8, then 0.8 times the 2nd token's value vector contributes to the 4th token's new representation. Summing the contributions from the token itself and all earlier tokens yields the final attended vector for the 4th token.
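Putting the last two paragraphs together, here is a minimal sketch of the attention computation for one head, including the causal mask. The scaling by the square root of the head dimension follows the original paper and is an extra detail not spelled out above; shapes and values are illustrative.

```python
import math
import torch

# Continuing the single-head sketch: Q, K, V each have shape (seq_len, d_head).
seq_len, d_head = 5, 16
Q, K, V = torch.randn(seq_len, d_head), torch.randn(seq_len, d_head), torch.randn(seq_len, d_head)

# Dot product of every query with every key, scaled by sqrt(d_head).
scores = (Q @ K.T) / math.sqrt(d_head)                  # (seq_len, seq_len)

# Causal mask: token i may only attend to tokens j <= i, so entries above the
# diagonal are set to -inf and become 0 after the softmax.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)                 # each row sums to 1

# Each token's new representation is a weighted sum of the value vectors,
# e.g. weights[3, 1] scales token 2's value vector in token 4's output.
attended = weights @ V                                  # (seq_len, d_head)
```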
Multi-head attention consists of multiple self-attention heads operating in parallel. Each head independently computes attention over the input, capturing different aspects of the token relationships. The outputs from all the heads are concatenated and then projected back to the original hidden dimension. This result is added to the original input vector via a residual connection, producing an updated representation. You can think of the input vector as encoding the standalone semantic meaning of a token without any contextual information. After multi-head attention, the updated vector incorporates contextual meaning by attending to surrounding tokens, especially those that come before it in the sequence (in decoder models like ChatGPT).
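A rough sketch of the concatenation, output projection, and residual connection, again with toy sizes; the separate learned output projection W_o is assumed here, as in the original paper.

```python
import torch

seq_len, d_model, n_heads = 5, 64, 4
d_head = d_model // n_heads

x = torch.randn(seq_len, d_model)                 # token vectors entering the layer

# Suppose each head has already produced its attended output as sketched above.
head_outputs = [torch.randn(seq_len, d_head) for _ in range(n_heads)]

# Concatenate the heads and project back to the model dimension.
concatenated = torch.cat(head_outputs, dim=-1)    # (seq_len, d_model)
W_o = torch.randn(d_model, d_model)               # learned output projection
projected = concatenated @ W_o

# Residual connection: the contextual update is added to the original vectors.
updated = x + projected
```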
The query, key, and value matrices are learned during training through backpropagation. Typically, the query and key matrices project the input embeddings into a lower-dimensional space, for example from 12.8k to 128 dimensions (i.e., matrices of size 128 × 12.8k). The value matrix often retains the higher dimensionality (e.g., 12.8k × 12.8k), but it can also be represented as a composition of two smaller matrices (e.g., a 12.8k × 128 matrix followed by a 128 × 12.8k matrix), depending on implementation and parameter efficiency considerations; a quick parameter count after the list below shows why this factorization matters. These matrices may seem abstract at first, but they each serve a specific purpose in the self-attention mechanism:
The query matrix transforms the input embedding into a query vector, which represents what kind of contextual information this token is seeking.
The key matrix transforms the input into a key vector, which encodes what kind of information this token offers—essentially, it acts like an “answer” to a potential query.
The value matrix transforms the input into a value vector, which carries the actual semantic content to be passed along if the token is considered relevant (i.e., attended to).
Together, these projections enable the model to determine how much attention each token should pay to every other token in the sequence.
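To see why factoring the value matrix into two smaller matrices saves parameters, here is a quick back-of-the-envelope count, treating the rounded 12.8k figure as 12,288 (an assumption) and using 128 as the low-rank dimension:

```python
# Illustrative parameter count for the value matrix factorization.
d_model, d_low = 12_288, 128

full_value_matrix = d_model * d_model                    # one 12.8k × 12.8k matrix
factored_value    = d_model * d_low + d_low * d_model    # 12.8k × 128 followed by 128 × 12.8k

print(f"full:     {full_value_matrix:>12,} parameters")  # 150,994,944
print(f"factored: {factored_value:>12,} parameters")     #   3,145,728
```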
Feed-Forward Neural Network
The output vector from the multi-head attention layer is then passed into a feed-forward neural network (FFN). Each token’s vector is processed independently and in parallel by the same FFN.
The feed-forward network typically consists of two linear transformations with a nonlinear activation in between. First, the input vector is projected into a higher-dimensional space by multiplying it with an upward projection matrix (often called the "intermediate" or "hidden" weight matrix) and adding a bias term. This result is then passed through a ReLU activation function to introduce non-linearity.
Next, the activated vector is multiplied by a downward projection matrix to map it back to the original dimension—matching the size of the input to the FFN. This ensures dimensional consistency so the output can be added back via a residual connection and passed to the next Transformer layer.
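A minimal sketch of this two-step computation for a single token vector, with illustrative sizes; the 4× expansion of the hidden dimension is a common convention rather than something stated above.

```python
import torch

d_model, d_ff = 64, 256
x = torch.randn(d_model)                 # one token's vector after multi-head attention

# Upward projection into a higher-dimensional space, plus bias, then ReLU.
W_up, b_up = torch.randn(d_ff, d_model), torch.randn(d_ff)
hidden = torch.relu(W_up @ x + b_up)     # (d_ff,) feature activations

# Downward projection back to the original dimension.
W_down, b_down = torch.randn(d_model, d_ff), torch.randn(d_model)
ffn_out = W_down @ hidden + b_down       # (d_model,)

# Residual connection before the vector moves on to the next layer.
updated = x + ffn_out
```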
We can think of each row in the upward projection matrix as defining a specific direction in the embedding space. These directions are often nearly orthogonal to one another and can be interpreted as capturing distinct semantic or abstract features. When the input vector is multiplied by this matrix and passed through a non-linear activation (e.g., ReLU), the result indicates how strongly the input vector aligns with each of these feature directions.
Similarly, each column in the downward projection matrix can be viewed as a direction in the embedding space. These directions determine how much each activated feature contributes to the final output vector. In other words, if a particular hidden unit is active (i.e., produces a large value after ReLU), its corresponding column in the downward matrix contributes proportionally to the output.
All components of the feed-forward network—the upward projection matrix, the downward projection matrix, and the bias vectors—are learned during training via backpropagation.
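The small sketch below makes this interpretation concrete: each hidden activation is just the (ReLU-gated) dot product of the input with one row of the upward matrix, and the output is a sum of the downward matrix's columns weighted by those activations. Sizes are arbitrary.

```python
import torch

d_model, d_ff = 16, 64
x = torch.randn(d_model)
W_up, b_up = torch.randn(d_ff, d_model), torch.randn(d_ff)
W_down = torch.randn(d_model, d_ff)

hidden = torch.relu(W_up @ x + b_up)

# Hidden unit i measures how strongly x aligns with row i of the upward matrix.
i = 7
assert torch.allclose(hidden[i], torch.relu(W_up[i] @ x + b_up[i]))

# The output is a sum of the downward matrix's columns, each weighted by its activation.
out = W_down @ hidden
out_as_sum = sum(hidden[j] * W_down[:, j] for j in range(d_ff))
assert torch.allclose(out, out_as_sum, atol=1e-4)
```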
Summary
Transformer layers are stacked sequentially, with each layer taking the output of the previous one as input. Within each layer, residual connections and layer normalization help stabilize training. As the input passes through each layer, the token representations are progressively refined, allowing the model to capture richer contextual and semantic information across the sequence.
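A compact sketch of this stacking, using PyTorch's built-in attention and layer-normalization modules with toy sizes. The pre-norm placement of layer normalization shown here is one common choice, not the only one.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers, seq_len = 64, 4, 3, 5

class Block(nn.Module):
    """One Transformer layer: self-attention and FFN, each with a residual connection."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        # Causal mask: each position attends only to itself and earlier positions.
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool), diagonal=1)
        h = self.ln1(x)                               # pre-norm placement
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                              # residual around attention
        x = x + self.ffn(self.ln2(x))                 # residual around the FFN
        return x

x = torch.randn(1, seq_len, d_model)                  # (batch, seq, d_model)
for layer in [Block() for _ in range(n_layers)]:      # each layer refines the representations
    x = layer(x)
```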
In decoder-only models like ChatGPT, after the input sequence passes through all the Transformer layers, we focus on the updated vector of the last token from the final layer. This vector encodes both the token's own meaning and its contextual relationship with all previous tokens. It is then passed through a linear projection layer (50k × 12.8k) followed by a softmax function to produce a probability distribution over the vocabulary. The model selects the most likely next token based on this distribution, effectively generating the next word in the sequence.
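A sketch of this final step, with toy sizes standing in for the 50k × 12.8k projection:

```python
import torch

d_model, vocab_size = 64, 1000
last_token_vec = torch.randn(d_model)        # final-layer vector of the last token

# Output projection, analogous to the 50k × 12.8k matrix described above.
W_out = torch.randn(vocab_size, d_model)

logits = W_out @ last_token_vec              # one score per vocabulary entry
probs = torch.softmax(logits, dim=-1)        # probability distribution over the vocabulary

next_token_id = torch.argmax(probs).item()   # greedy choice; production systems often sample instead
```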
Through multiple self-attention and feed-forward layers, the updated vector of the last token gradually accumulates contextual information from all previous tokens. Each Transformer layer refines this representation by allowing the token to attend to earlier tokens and combine their semantic signals in increasingly abstract ways. The updated vectors at earlier positions are necessary for self-attention to compute the correct context at the last position.