A Simple Introduction to LLMs
A Glimpse into Large Language Models
5/12/2025 · 6 min read
What is an LLM?
A large language model (LLM) is a neural network trained to predict the next token (usually a sub-word unit, not an entire word) based on the context of what it has seen so far. At inference time, the model:
Receives an input prompt.
Predicts the most likely next token.
Appends the predicted token to the prompt and repeats the process—one token at a time—until it either emits a special end-of-sequence token or reaches a maximum length.
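As a concrete illustration, here is a minimal greedy-decoding loop. It assumes the Hugging Face transformers library and uses gpt2 purely as a small stand-in model; production code would normally call the library's built-in generate method instead of a hand-written loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("Large language models are", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                             # maximum-length cap
        logits = model(input_ids).logits                            # forward pass over the whole sequence
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_token], dim=-1)      # append and repeat
        if next_token.item() == tok.eos_token_id:                   # stop at end-of-sequence
            break

print(tok.decode(input_ids[0]))
```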
How is an LLM pretrained?
Data: billions (or trillions) of tokens scraped from diverse sources like books, web pages, code, and more.
Method: For each sequence of length n, the model is trained to predict token t₂ from t₁, then t₃ from t₁ t₂, and so on up to tₙ from t₁ … tₙ₋₁. This method is known as teacher forcing: the ground-truth tokens (rather than the model's own predictions) are fed back into the network during training.
Loss: Cross-entropy loss is calculated at every position, then summed (or averaged) over the sequence and across all the training examples in the batch (a minimal sketch follows this list).
Optimization: Back-propagation, combined with an optimizer like AdamW, adjusts the model's weights to minimize the loss.
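A minimal sketch of this training step, using toy tensors in place of a real model and dataset (the shapes and values are illustrative only):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a batch of token ids and the logits a model would produce for them.
vocab_size, seq_len, batch = 100, 8, 4
tokens = torch.randint(0, vocab_size, (batch, seq_len))                 # t1 ... tn
logits = torch.randn(batch, seq_len, vocab_size, requires_grad=True)    # pretend model output

# Teacher forcing: the prediction at each position is scored against the next ground-truth token.
pred = logits[:, :-1, :]    # predictions made from prefixes t1 ... t_i
target = tokens[:, 1:]      # the tokens t2 ... tn those prefixes should predict

# Cross-entropy at every position, averaged over the sequence and the batch.
loss = F.cross_entropy(pred.reshape(-1, vocab_size), target.reshape(-1))
loss.backward()             # back-propagation; an optimizer such as AdamW then updates the weights
```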
After pre-training, many models undergo additional stages, such as supervised fine-tuning or reinforcement learning from human feedback (RLHF), to further align them with desired behaviors.
How does an LLM make a prediction?
The LLM first predicts the next token. It then combines the original context with this newly generated token to predict the following token, repeating the cycle—predict → append → predict—until a stop token appears or a maximum length limit is reached.
Efficiency via KV-cache.
During generation, the model re-uses the key and value tensors produced by its self-attention layers. After each forward pass, the keys and values for the new token are appended to an internal cache. For the next token, the model only needs to compute the query-key dot products against those cached tensors; the earlier keys/values do not change. With the KV-cache, each decoding step runs the forward pass for only the single new token, so per-step cost grows roughly linearly with context length instead of re-computing attention over the entire prefix at every step.
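The same greedy loop as above can be rewritten to reuse the cache. This sketch again assumes the Hugging Face transformers API with gpt2 as a stand-in model, where past_key_values carries the cached keys and values between steps:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("Large language models are", return_tensors="pt").input_ids
generated = input_ids
next_input = input_ids
past = None                                        # the KV-cache starts empty

with torch.no_grad():
    for _ in range(20):
        out = model(input_ids=next_input, past_key_values=past, use_cache=True)
        past = out.past_key_values                 # cache grows by one position per step
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        next_input = next_token                    # only the new token is fed forward

print(tok.decode(generated[0]))
```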
Because the attention context depends on token order, shuffling otherwise identical words will alter the keys and values, leading to different predictions. This order sensitivity also shows up when an LLM is asked to compare or evaluate multiple candidate inputs: whichever candidate is presented first can receive systematically different treatment, a problem known as position bias. A few prompting techniques help mitigate it:
Explicit Instructions: Instruct the model to evaluate inputs impartially, emphasizing that the order should not influence its judgment.
Few-shot examples: Provide examples of previous evaluations where the order of presentation was varied. This can help the model understand that the order should not influence its judgment.
Chain-of-Thought Prompting: Encourage the model to articulate its reasoning step-by-step, which can lead to more balanced evaluations.
Multiple Permutations: Instead of just two orders, consider evaluating all possible permutations of the inputs and aggregating the results; this further dilutes position bias (a short sketch follows below).
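The last idea can be turned into a simple evaluation harness. In this sketch, ask_judge is a hypothetical stand-in for an LLM call (mocked here so the example runs on its own); every presentation order is scored and the results are averaged:

```python
from itertools import permutations
from statistics import mean

# Hypothetical 'ask_judge': a real implementation would call an LLM prompted to score the
# candidates and parse the scores from its reply. It is mocked here for illustration only.
def ask_judge(prompt: str) -> dict:
    return {"Answer A": 7.0, "Answer B": 8.5}   # mocked scores

candidates = {"Answer A": "Paris is the capital of France.",
              "Answer B": "The capital of France is Paris, a city of about two million people."}

# Evaluate every presentation order and average the scores to dilute position bias.
scores = {name: [] for name in candidates}
for order in permutations(candidates):
    prompt = "Rate each answer from 1-10 for accuracy and completeness:\n"
    prompt += "\n".join(f"{name}: {candidates[name]}" for name in order)
    result = ask_judge(prompt)
    for name, score in result.items():
        scores[name].append(score)

averaged = {name: mean(vals) for name, vals in scores.items()}
print(averaged)
```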
Sampling vs. Greedy Decoding.
In practice, the “next token” is often chosen with decoding strategies such as temperature sampling, nucleus (top-p) sampling, or beam search, instead of always selecting the single most probable token (greedy decoding). This introduces more variability and creativity into the generated text.
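As an illustration, here is one common way temperature and nucleus (top-p) sampling are implemented over a single logits vector; the random logits stand in for a real model's output, and the hyperparameter values are arbitrary:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Temperature + nucleus (top-p) sampling over one vocabulary-sized logits vector."""
    probs = torch.softmax(logits / temperature, dim=-1)        # temperature reshapes the distribution
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p                   # smallest set of tokens covering top_p mass
    keep[0] = True                                             # always keep the most probable token
    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum()                       # renormalize over the nucleus
    choice = torch.multinomial(filtered, num_samples=1)
    return sorted_ids[choice].item()

# Toy usage with a random logits vector standing in for a model's output.
logits = torch.randn(50_257)
print(sample_next_token(logits))
```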
Supervised Fine-Tuning
What is supervised fine-tuning?
Supervised fine-tuning involves taking a pre-trained language model—one that already understands general language patterns—and further training it on a curated set of input-output pairs (prompts → desired responses). The goal is to minimize the cross-entropy loss on these examples, effectively guiding the model toward a specific style, format, or task.
When does SFT make sense?
Supervised fine-tuning (SFT) is most suitable when: 1) domain expertise is needed (e.g., legal, medical, financial); 2) consistency and accuracy are required on specific tasks (e.g., sentiment analysis, text classification, structured data extraction); 3) behavior must be aligned, such as ensuring safe and polite language in responses; 4) the task requires specialized, highly specific understanding, such as entity extraction or summarization.
Instruction following/style adaptation
Teach the model to answer in a specific tone, format, or domain jargon (legal, medical, etc.).
Training data: 500–10k high-quality prompt-response pairs (an example pair format is sketched below).
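One common way to store such pairs is one JSON object per line (JSONL). The field names and example content below are illustrative assumptions; the exact schema depends on the training framework:

```python
import json

# Illustrative prompt-response pairs in a JSONL layout; field names vary by framework.
examples = [
    {"prompt": "Summarize the clause below in plain English:\n<clause text>",
     "response": "This clause says the tenant must give 30 days' written notice before moving out."},
    {"prompt": "Rewrite this sentence in a formal legal tone:\nWe can end the deal whenever we want.",
     "response": "Either party may terminate this agreement at its sole discretion."},
]

with open("sft_pairs.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```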
Classification or scoring
Train the model to classify or score text: e.g., sentiment analysis, product reviews, quality scores, or preference ranking.
Training data: Labeled sentences or pairwise preference data. Keep labels short (single tokens if possible) to minimize ambiguity.
Apply a loss mask so that only the label tokens contribute to the loss; the model isn't penalized for re-predicting the prompt.
Why SFT is needed: 1) The model needs to learn to classify subjective language accurately. Pre-trained models might struggle to differentiate between subtle sentiments or handle ambiguous language; 2) Specific categories and intents need clear distinctions. Fine-tuning helps the model focus on identifying the most relevant features of the text for accurate classification.
Safety/policy
Implement refusal rules and content filters, and establish a consistent persona for the model.
Training data: Curated prompt-response pairs that specify do’s and don’ts.
Task-specific generation
Use the model to generate specific outputs, such as summaries, SQL queries, code patches, docstrings, etc.
Training data: A few hundred to tens of thousands of prompt-response pairs.
Structured extraction
Teach the model to return structured data, such as JSON, medical ICD codes, etc.
Training data: Parallel pairs of free text and the corresponding structured output (e.g., a description paired with its JSON representation).
How does supervised fine-tuning work?
The pre-trained model is given the full prompt and the desired response. During training, the loss is only calculated for the response tokens, not for the prompt, and the model’s weights are updated using the loss aggregated over all positions in the response across all training examples.
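A minimal sketch of this masking, assuming a Hugging Face causal LM: prompt tokens are given the label -100, which PyTorch's cross-entropy ignores, so only the response tokens contribute to the loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Classify the sentiment of: 'I loved this movie!'\nSentiment:"
response = " positive"

prompt_ids = tok(prompt, return_tensors="pt").input_ids
response_ids = tok(response, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=-1)

labels = input_ids.clone()
labels[:, : prompt_ids.shape[-1]] = -100        # -100 marks positions the loss should ignore

out = model(input_ids=input_ids, labels=labels)  # loss is computed on the response tokens only
out.loss.backward()                              # an optimizer step then updates the weights
```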
Trade-offs of supervised fine-tuning
Pros
Consistency in output: Supervised fine-tuning ensures that the model’s responses are aligned with the labels or output formats you require, reducing variation that might arise from prompt-based methods.
Scaling for large or frequent tasks: Fine-tuning is beneficial when a prompt-based approach isn’t scalable or reliable enough for frequent or large-scale deployments.
Cons
Higher resource cost: Fine-tuning requires computational resources and data to train the model, which may be a limitation for smaller organizations or projects with less infrastructure.
Overfitting: If the fine-tuning data isn’t diverse or is too small, there’s a risk the model might overfit to the training data, reducing its ability to generalize to new situations.
Prompt-Based Approach
The prompt-based approach uses a pre-trained language model directly with well-crafted input prompts to guide its behavior. The model relies on the context provided in the prompt to generate the response.
When does a prompt-based approach make sense?
Information Retrieval
Information retrieval relies heavily on the pre-training phase. The model can pull relevant information from its "knowledge base," which includes facts, concepts, and associations that it learned during training. Because the model has already been trained on a broad spectrum of data, it can identify the most relevant pieces of information in response to a query—without needing to learn or fine-tune specifically for this task. Examples include general-purpose Q&A, text summarization, and text translation.
Text generation
Text generation works well because the model has learned how to predict the next token based on the context of what’s already been generated. During pre-training, the model learns how to generate text that follows typical human language patterns. Supervised fine-tuning refines these abilities to be more task-specific, ensuring the model outputs more coherent, contextually accurate, and appropriate text for a given prompt. Examples include code generation, creative writing and content generation, ideation/brainstorming, email drafting and text completion, and document formatting or rewriting.
Classification
For tasks like sentiment analysis, topic classification, or spam detection, pre-training is helpful because the model has learned to recognize patterns in text (such as positive or negative sentiment, or distinguishing between different types of content) from its exposure to a broad variety of data.
Extraction
Supervised fine-tuning is key for extraction tasks because the model needs to learn to identify and extract specific pieces of information from the input text. Fine-tuning on labeled examples of entities or structured outputs helps the model recognize the context in which certain words or phrases correspond to particular entities (e.g., recognizing that "New York" is a location or "John" is a person's name).
LLM for classification
Instruction prompt only
Pro: Zero cost, easy to implement.
Con: The model may drift, produce inconsistent formats, or perform poorly at scale.
Techniques: Few-shot learning, simple labels (see the prompt sketch below).
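For instance, a few-shot classification prompt might look like the sketch below; the reviews, labels, and wording are illustrative only:

```python
# Illustrative few-shot prompt for sentiment classification; not a prescribed format.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: positive

Review: "It broke after two days and support never replied."
Sentiment: negative

Review: "Setup took five minutes and everything just worked."
Sentiment:"""

# The prompt is sent to the model, which should complete it with a single label token.
print(few_shot_prompt)
```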
SFT on label tokens
Fine-tune the model so it always ends its response with the correct label (e.g., positive, negative), with the loss mask applied only to the label tokens.
Pro: Consistent output, easier post-processing, and the ability to inject domain-specific nuances.
Con: Requires full fine-tuning or LoRA training compute, which may be overkill for simple labels.
If the model is mostly accurate but inconsistent in format, use LoRA/PEFT SFT on a smaller dataset (~100–1,000 examples); a LoRA configuration sketch follows below.
If domain concepts are absent or misclassified, a larger SFT dataset (thousands of examples) is needed.
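For the LoRA/PEFT route mentioned above, a minimal configuration with the Hugging Face peft library might look like this; the base model and target modules are illustrative and depend on the architecture being tuned:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in base model

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # attention projection in GPT-2; differs per architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # only a small fraction of weights are trainable
```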
LLM for regression
Reward model training
Consider training a reward model from pairwise preferences, or using preference-based methods such as DPO (Direct Preference Optimization) or RLAIF (Reinforcement Learning from AI Feedback), to adjust the model’s responses based on those preferences.
Pairwise preference → scalar score: pairs of ranked outputs are used to train a scorer that assigns a scalar reward to any single output (a sketch of this loss follows below).
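For the reward-model route, the standard pairwise (Bradley-Terry) loss encourages the preferred response to score higher than the rejected one. The sketch below uses toy embeddings and a single linear head as a stand-in for an LLM-based scorer:

```python
import torch
import torch.nn.functional as F

# Toy sketch of the pairwise reward-model loss: the scorer should rank the preferred
# ("chosen") response above the "rejected" one for each preference pair.
hidden_dim = 16
reward_head = torch.nn.Linear(hidden_dim, 1)      # stand-in for an LLM plus a scalar head

chosen_repr = torch.randn(4, hidden_dim)          # embeddings of preferred responses (toy data)
rejected_repr = torch.randn(4, hidden_dim)        # embeddings of dispreferred responses

r_chosen = reward_head(chosen_repr).squeeze(-1)
r_rejected = reward_head(rejected_repr).squeeze(-1)

# loss = -log sigmoid(r_chosen - r_rejected), averaged over the batch
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```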
Lightweight Regressor
A simpler approach could involve training a lightweight regressor directly on the LLM embeddings. This is a less computationally expensive method and still leverages the power of the pre-trained LLM.
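A minimal sketch of that idea, assuming Hugging Face transformers for the embeddings and scikit-learn for the regressor; gpt2, mean pooling, and the toy scores are illustrative choices only:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import Ridge

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                       # gpt2 has no pad token by default
encoder = AutoModel.from_pretrained("gpt2").eval()

texts = ["Great product, works perfectly.", "Broke on the first day.", "Average, nothing special."]
scores = [4.8, 1.2, 3.0]                            # toy target scores, for illustration only

with torch.no_grad():
    batch = tok(texts, return_tensors="pt", padding=True)
    hidden = encoder(**batch).last_hidden_state              # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1)        # mean-pool over real tokens

regressor = Ridge().fit(embeddings.numpy(), scores)
print(regressor.predict(embeddings.numpy()))
```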