The complete technical guide to Large Language Models — from how raw text becomes numbers, through the transformer architecture and attention mechanism, to training, alignment, inference, RAG, and production deployment. No hand-waving; real equations and real code.
A Large Language Model (LLM) is a neural network trained to predict the next token in a sequence of text. That single objective — next-token prediction — turns out to be extraordinarily powerful: to predict well, the model must learn grammar, facts, reasoning patterns, code syntax, and much more.
The “large” refers to parameter count (billions to trillions of learned weights) and the scale of training data (trillions of tokens from the web, books, and code). At sufficient scale, models exhibit emergent capabilities — abilities not present in smaller models and not explicitly trained for, such as multi-step arithmetic, analogical reasoning, and in-context learning from a handful of examples.
Architecturally, every major LLM today is a decoder-only transformer (GPT family, Llama, Mistral, Claude, Gemini). The model takes a sequence of token IDs as input and produces a probability distribution over the vocabulary for the next token. Generation is autoregressive: the model samples one token, appends it to the sequence, and repeats.
graph LR
  A[Input Text] --> B[Tokenizer]
  B --> C[Token IDs]
  C --> D[Embedding Layer]
  D --> E[Transformer Blocks xN]
  E --> F[LM Head]
  F --> G[Logits over Vocabulary]
  G --> H[Softmax + Sampling]
  H --> I[Next Token]
  I -->|Autoregressive loop| C
Emergent capabilities: Abilities that appear only above certain scale thresholds — few-shot learning, chain-of-thought reasoning, instruction following — and that were never explicitly trained for.
In-context learning: The model adapts its behaviour based on examples in the prompt (few-shot) without any weight updates. The context window is the only “memory” during inference.
Lossy compression: Model weights encode a lossy compression of the training corpus. Facts are not stored verbatim — they are distributed across billions of weights, which is why hallucination happens.
LLMs do not operate on characters or words — they operate on tokens, sub-word units produced by a tokenizer trained on the same corpus. Understanding tokenization explains cost, context length, and many quirks of model behaviour.
BPE (Byte-Pair Encoding): GPT-2, GPT-3, GPT-4o, Llama 4, Mistral Large 3
Iteratively merges the most frequent adjacent byte or character pair. Starts from individual bytes, so it handles any Unicode text without unknown tokens.
WordPiece: BERT, DistilBERT, ALBERT
Similar to BPE, but merges are chosen to maximise the likelihood of the training data under a language model, rather than raw frequency.
Unigram (SentencePiece): T5, Gemma, Qwen
Treats tokenization as a probabilistic segmentation problem. Language-agnostic — works from raw text without pre-tokenization (spaces are treated as regular characters).
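To make the merge procedure concrete, here is a toy BPE trainer over a classic example word list. Real tokenizers operate on bytes and handle pre-tokenization; treat this as a sketch of the core merge loop only.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merges from a list of words (toy illustration)."""
    # Represent each word as a tuple of symbols (individual characters to start)
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Apply the merge everywhere it occurs
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["low", "low", "lower", "newest", "newest", "widest"], 3))
```

The first merge picks the most frequent pair ("l" + "o" here); each subsequent merge operates on the already-merged symbols, which is how multi-character tokens emerge.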
| Word | GPT-4o Tokens | Token Count |
|---|---|---|
| transformer | transformer | 2 |
| tokenization | tokenization | 2 |
| def calculate_loss(logits): | def calculate_loss(logits): | 6 |
| Üniversität | Üniversität | 4 |
| hello | hello | 1 |
| (3 spaces) | single whitespace token | 1 |
| Model | Tokenizer | Vocab Size | Note |
|---|---|---|---|
| GPT-2 | BPE | 50,257 | Byte-level BPE |
| GPT-3 / GPT-3.5 | BPE (cl100k) | 100,277 | Same as GPT-4 |
| GPT-4 / GPT-4o | BPE (o200k) | 200,019 | Better multilingual |
| Llama 3.x / 4.x | BPE (tiktoken) | 128,256 | Improved from Llama 2's 32k |
| Mistral v0.x | BPE (SentencePiece) | 32,768 | Small but efficient |
| Gemma 2 | SentencePiece | 256,000 | Very large multilingual vocab |
import tiktoken
# GPT-4o uses the o200k_base encoding
enc = tiktoken.get_encoding("o200k_base")
text = "Tokenization is the first step in every LLM pipeline."
tokens = enc.encode(text)
print(f"Token IDs: {tokens}")
# Token IDs: [5808, 2065, 374, 279, 1176, 3094, 304, 1475, 445, 11237, 15598, 13]
print(f"Token count: {len(tokens)}") # 12
print(f"Decoded: {[enc.decode([t]) for t in tokens]}")
# ['Token', 'ization', ' is', ' the', ' first', ' step', ' in', ' every', ' L', 'LM', ' pipeline', '.']
# Cost estimation: GPT-4o input = $2.50 / 1M tokens
cost_per_token = 2.50 / 1_000_000
print(f"Cost for this sentence: ${cost_per_token * len(tokens):.8f}")

Introduced in “Attention Is All You Need” (Vaswani et al., 2017), the transformer replaced recurrent networks with a fully attention-based architecture. Every major LLM today is built on this foundation.
graph TD
  A[Input Tokens] --> B[Token Embeddings]
  B --> C[+ Positional Encoding]
  C --> D[Multi-Head Self-Attention]
  D --> E[Add and Layer Norm]
  E --> F[Feed-Forward Network]
  F --> G[Add and Layer Norm]
  G --> H[Next Block or Output]
The core insight of the transformer: each token can attend to every other token in the sequence simultaneously. Given an input matrix X, three learned projections produce queries (Q), keys (K), and values (V):

Q = X·W_Q    K = X·W_K    V = X·W_V

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

where d_k is the key dimension; dividing by √d_k keeps the dot products from saturating the softmax at large dimensions.
Multi-head attention runs H independent attention operations in parallel, each with different learned projections. This allows the model to jointly attend to information from different representation subspaces. GPT-3 uses 96 attention heads; Llama 4 Maverick uses grouped-query attention (GQA) for efficient serving.
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Q, K, V: (batch, heads, seq_len, head_dim)
    Returns: (output, attn_weights); output has shape (batch, heads, seq_len, head_dim)
"""
d_k = Q.size(-1)
# Compute attention scores
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5) # (batch, heads, seq, seq)
# Apply causal mask (decoder: attend only to past tokens)
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
# Softmax over key dimension
attn_weights = F.softmax(scores, dim=-1)
# Weighted sum of values
    return torch.matmul(attn_weights, V), attn_weights

Each transformer block contains a 2-layer MLP applied independently to each token position. The hidden dimension is typically around 4× the model dimension (SwiGLU variants use roughly 8/3× to keep parameter count comparable; Llama 3 8B uses d_model=4096, d_ff=14336). Modern LLMs use the SwiGLU activation (a gated variant of SiLU), which empirically outperforms ReLU and GELU.
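A minimal sketch of such a gated feed-forward block, with illustrative dimensions not tied to any specific model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated feed-forward block in the Llama style (sketch)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # up projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(x W_gate) elementwise-multiplied by (x W_up), projected back down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

ffn = SwiGLUFFN(d_model=512, d_ff=2048)
out = ffn(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Note the three weight matrices instead of the two in a plain MLP, which is why SwiGLU models shrink d_ff to keep parameter count comparable.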
FFN layers store the bulk of factual knowledge in a model — research shows that “knowledge neurons” concentrated in the FFN can be located and surgically edited (see ROME/MEMIT). Attention handles routing and composition; FFN layers handle storage.
Every sub-layer (attention, FFN) uses a residual connection (output = x + sublayer(x)) and layer normalisation. Modern LLMs use Pre-Norm (normalise before the sub-layer) rather than Post-Norm for training stability, and RMSNorm (Root Mean Square norm, no mean-centering) for efficiency.
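A minimal RMSNorm sketch, with the Pre-Norm residual pattern shown in the usage comment:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the features, no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

# Pre-Norm residual pattern: x = x + sublayer(norm(x))
norm = RMSNorm(512)
x = torch.randn(2, 8, 512)
print(norm(x).shape)  # torch.Size([2, 8, 512])
```

Compared to LayerNorm, RMSNorm drops the mean subtraction and bias term, saving a reduction per token with no measurable quality loss at scale.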
| Type | Examples | Attention | Best For |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, DeBERTa | Bidirectional (full attention) | Classification, NER, embeddings |
| Decoder-only | GPT-4o, Llama 4, Mistral Large 3, Claude Sonnet 4.6 | Causal (left-to-right) | Text generation, chat, reasoning |
| Encoder-Decoder | T5, FLAN-T5, BART | Full encoder + cross-attention | Translation, summarisation, seq2seq |
Pretraining is the most expensive phase — typically 95%+ of total compute. The model sees trillions of tokens and learns to predict the next one. This simple objective, at sufficient scale, produces most of the capabilities we associate with LLMs.
Given a sequence of tokens [t₁, t₂, ..., tₙ], the model is trained to maximise the log-likelihood of each token given all preceding tokens:

L = Σᵢ log P(tᵢ | t₁, ..., tᵢ₋₁), summed over i = 1 … n
Each forward pass processes a full sequence and produces a loss at every position in parallel (teacher forcing). During inference, tokens are generated autoregressively, one at a time.
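A toy illustration of this objective: the logits are shifted against the targets so that position i predicts token i+1. Random tensors stand in for a real model here.

```python
import torch
import torch.nn.functional as F

# Toy next-token objective with teacher forcing
batch, seq_len, vocab = 2, 6, 100
token_ids = torch.randint(0, vocab, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab)  # stand-in for model(token_ids)

# Position i predicts token i+1: drop the last logit, drop the first target
shift_logits = logits[:, :-1, :]
shift_targets = token_ids[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab),  # (batch*(seq-1), vocab)
    shift_targets.reshape(-1),        # (batch*(seq-1),)
)
print(f"mean next-token loss: {loss.item():.3f}")  # near ln(100) ≈ 4.6 for random logits
```

Every position contributes a loss term in a single forward pass, which is what makes pretraining so parallelisable compared to autoregressive inference.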
Hoffmann et al. (2022) showed that previous large models (GPT-3, Gopher) were undertrained — too many parameters for too few tokens. The Chinchilla optimal ratio is:
Optimal tokens ≈ 20× parameters
A 7B parameter model should train on ~140B tokens for compute-optimal training. In practice, models train on far more (Llama 3.1 8B: 15T tokens; Llama 4 models: ~40T est.) because inference cost matters — a smaller but more-trained model costs less to serve.
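The arithmetic, using the common 6·N·D approximation for training FLOPs (an assumption from the scaling-law literature, not from this table):

```python
# Chinchilla rule of thumb: compute-optimal tokens ≈ 20 × parameters,
# and total training compute ≈ 6 × N × D FLOPs (N = params, D = tokens).
def chinchilla_estimate(n_params: float) -> dict:
    optimal_tokens = 20 * n_params
    flops = 6 * n_params * optimal_tokens
    return {"optimal_tokens": optimal_tokens, "train_flops": flops}

est = chinchilla_estimate(7e9)  # a 7B-parameter model
print(f"Optimal tokens: {est['optimal_tokens']:.2e}")  # 1.40e+11 (~140B)
print(f"Training FLOPs: {est['train_flops']:.2e}")     # 5.88e+21
```

Training past the compute-optimal point (as Llama 3.1 did at 15T tokens) trades extra training compute for a smaller, cheaper-to-serve model.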
| Model | Params | Training Tokens | Year |
|---|---|---|---|
| GPT-2 | 117M – 1.5B | ~10B | 2019 |
| GPT-3 | 175B | ~300B | 2020 |
| Chinchilla | 70B | 1.4T | 2022 |
| Llama 2 (historical) | 7B – 70B | 2T | 2023 |
| Mistral 7B | 7.3B | ~8T (est.) | 2023 |
| Llama 3.1 8B | 8B | 15T | 2024 |
| Llama 3.1 405B | 405B | 15T | 2024 |
| Llama 4 Scout | ~17B active (MoE) | ~40T (est.) | 2025 |
| Llama 4 Maverick | ~17B active (MoE) | ~40T (est.) | 2025 |
Common Crawl: Petabyte-scale web crawl, raw and filtered. Forms the bulk of most pretraining corpora. Requires extensive quality filtering (deduplication, language detection, toxicity removal).
The Pile: 825GB curated dataset spanning 22 sources including GitHub, ArXiv, PubMed, FreeLaw, DM Mathematics. Open and reproducible.
RedPajama / DCLM: Open reproductions of LLaMA training data. DCLM (DataComp-LM, 2024) focuses on rigorous data quality ablations to find optimal filtering pipelines.
Books and academic text: Books3, Gutenberg, ArXiv, S2ORC. High signal-to-noise ratio; critical for long-form reasoning and factual depth.
Data parallelism: Each GPU holds a model replica; batches are split across GPUs. Gradients are averaged (AllReduce) after each backward pass. Standard at all sizes.
Tensor parallelism: Individual weight matrices are split across GPUs. Requires high-bandwidth interconnect (NVLink). Used for models that exceed single-GPU VRAM.
Pipeline parallelism: Different layers are assigned to different GPUs. Micro-batches flow through the pipeline. Efficient for very deep models; requires careful scheduling to minimise bubbles.
A pretrained base model is a powerful but unpredictable next-token predictor — it will continue any text, including harmful content. Alignment training transforms it into a helpful, harmless, and honest assistant.
The base model is fine-tuned on a dataset of (instruction, ideal response) pairs, written or curated by human annotators. This teaches the model the instruction-following format. The training objective is identical to pretraining (cross-entropy), but the dataset is small (tens of thousands of examples) and high quality. After SFT, the model can follow instructions but may still be untruthful or harmful.
graph LR
  A[Pretrained LLM] --> B[SFT on Instruction Data]
  B --> C[SFT Model]
  C --> D[Generate Completions]
  D --> E[Human Preference Labels]
  E --> F[Train Reward Model]
  F --> G[RLHF with PPO]
  G --> H[Aligned LLM]
Reward Model: Human annotators compare pairs of model responses and label their preference. A separate model is trained to predict the human-preferred response given a prompt. This scalar reward signal captures nuanced quality judgements that are hard to specify as a loss function.
PPO (Proximal Policy Optimisation): The SFT model (the “policy”) is optimised to maximise the reward model's score while a KL-divergence penalty prevents it from drifting too far from the SFT baseline (which would cause reward hacking). PPO is computationally expensive: it requires four models in memory simultaneously.
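The reward model itself is typically trained with a pairwise Bradley-Terry objective: push the scalar reward of the preferred response above the rejected one. A sketch with toy reward values:

```python
import torch
import torch.nn.functional as F

# Pairwise (Bradley-Terry) reward-model loss
def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.tensor([2.1, 0.5, 1.3])    # rewards for human-preferred responses
r_rejected = torch.tensor([0.4, 0.9, -0.2])  # rewards for rejected responses
print(reward_model_loss(r_chosen, r_rejected))
```

When the model already ranks the preferred response far higher, the loss approaches zero; when it ranks them equally, the loss is log 2.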
Rafailov et al. (2023) showed that the RLHF objective can be optimised directly on the policy model without a separate reward model or RL loop. Given pairs of preferred and rejected responses, DPO reparametrises the reward as a function of the policy and reference model log-probabilities:

L_DPO = −log σ( β [ log(π_θ(y_w|x) / π_ref(y_w|x)) − log(π_θ(y_l|x) / π_ref(y_l|x)) ] )

where y_w is the preferred response, y_l the rejected one, σ the sigmoid, and β controls the strength of the implicit KL constraint.
DPO is simpler, more stable, and cheaper than RLHF with PPO. Most open-source aligned models (Llama 4 Instruct, Mistral Instruct) use DPO or a variant (SimPO, IPO) for the preference alignment stage.
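A sketch of the DPO loss computed from per-response sequence log-probabilities (toy numbers, illustrative β):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from sequence log-probs of preferred (w) and rejected (l) responses."""
    # Implicit rewards are beta * log(pi_theta / pi_ref); their difference is the logit
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

# Toy example: the policy prefers the chosen response slightly more than the reference does
loss = dpo_loss(
    policy_logp_w=torch.tensor([-12.0]), policy_logp_l=torch.tensor([-15.0]),
    ref_logp_w=torch.tensor([-13.0]),    ref_logp_l=torch.tensor([-14.0]),
)
print(loss)
```

Only two models are needed (policy and frozen reference), and the reference log-probs can be precomputed, which is why DPO is so much cheaper than PPO's four-model setup.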
| Method | Reward Model | RL Loop | Stability | Used By |
|---|---|---|---|---|
| RLHF (PPO) | Yes | Yes (PPO) | Moderate | InstructGPT, early ChatGPT |
| DPO | No | No | High | Llama 4, Mistral, Zephyr |
| Constitutional AI (CAI) | Self-critique | Yes (RLAIF) | High | Claude (Anthropic) |
| GRPO / DAPO | Rule-based | Group relative | High | DeepSeek-R1, Qwen |
At each generation step, the model outputs a logit vector of size |vocabulary|. The sampling strategy determines how a single token is chosen from this distribution — and has enormous impact on output quality, diversity, and coherence.
| Temperature | Effect | Use Case |
|---|---|---|
| 0.0 (Greedy) | Always pick the highest-probability token. Deterministic. | Classification, structured extraction, factual Q&A |
| 0.2 – 0.4 | Sharper distribution, mostly follows the most likely path but allows small variations. | Code generation, summarisation |
| 0.6 – 0.8 | Balanced. Good mix of coherence and diversity. | General chat, instruction following (default for most models) |
| 1.0 | Sample directly from the model distribution. More varied. | Creative writing, brainstorming |
| > 1.0 | Flattens the distribution, increasing randomness. Often produces incoherent output. | Rarely useful in production |
Top-p (nucleus) sampling: Sort tokens by probability, take the smallest set whose cumulative probability ≥ p, then sample from that set. At p=0.9, the model only considers tokens that together account for 90% of the probability mass. Adapts the candidate set size dynamically: when confident, the nucleus is small; when uncertain, it is wider.
Top-k sampling: Truncate the distribution to the top-k most likely tokens, then sample from those. Simpler than top-p but uses a fixed k regardless of the distribution shape. Top-p is generally preferred; top-k is useful when you need strict control over the candidate set size.
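A minimal top-p implementation over a single logit vector might look like this. It is a toy sketch; production samplers combine temperature, top-k, top-p, and repetition penalties in one pass.

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits: torch.Tensor, p: float = 0.9, temperature: float = 1.0) -> int:
    """Nucleus (top-p) sampling over a 1-D logit vector (sketch)."""
    probs = F.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix whose cumulative probability reaches p
    cutoff = int((cumulative < p).sum().item()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalise
    choice = torch.multinomial(nucleus, num_samples=1)
    return int(sorted_idx[choice])

logits = torch.tensor([4.0, 3.0, 1.0, 0.5, -2.0])
print(sample_top_p(logits, p=0.9))  # only the top two tokens survive the cutoff here
```

With these logits the top two tokens already carry more than 90% of the mass, so the nucleus contains exactly two candidates regardless of how many low-probability tokens follow.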
from openai import OpenAI
client = OpenAI()
# Factual / structured output: low temperature, no top-p
factual = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "What is the capital of France?"}],
temperature=0.0,
max_tokens=50,
)
# Creative writing: higher temperature + nucleus sampling
creative = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a haiku about neural networks."}],
temperature=0.9,
top_p=0.95,
max_tokens=100,
)
# Code generation: low temp, deterministic
code = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a Python quicksort function."}],
temperature=0.2,
max_tokens=300,
)

A transformer has no inherent notion of token order — attention is permutation-invariant. Positional encoding injects order information. The choice of encoding method determines how well the model generalises to sequences longer than its training length.
Absolute positional embeddings: A learned or fixed (sinusoidal) vector is added to each token embedding at position i. GPT-2 and BERT use learned absolute embeddings. Hard limit: the model cannot generalise to positions it has never seen during training.
RoPE (Rotary Position Embeddings): Encodes position by rotating the Q and K vectors in complex space by an angle proportional to position. The attention score naturally depends on the relative distance between tokens. Used by Llama 4, Mistral Large 3, Qwen, and most modern open-weight models. Enables context extension techniques like YaRN and LongRoPE.
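A sketch of the rotation itself on a single head, pairing even and odd dimensions; real implementations differ in layout and caching details.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, head_dim) (sketch)."""
    seq_len, head_dim = x.shape
    # Each even/odd dimension pair is rotated at a frequency that decays with dim index
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = torch.randn(8, 64)
q_rot = rope(q)
print(q_rot.shape)  # torch.Size([8, 64])
```

Because each pair is a pure rotation, vector norms are preserved and the dot product between a rotated query and key depends only on their relative offset, which is the property that makes RoPE extendable.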
During autoregressive generation, the model recomputes attention for every token at every step — naively O(n²) per step. The KV-cache stores the key and value tensors from all previous tokens. On each new step, only the new token's Q, K, V are computed and the cached K, V are appended. This reduces generation from O(n²) to O(n) per new token.
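The idea in miniature, with random tensors standing in for the model's per-token projections:

```python
import torch

# Toy KV-cache: at each step only the new token's K and V are computed and appended;
# attention for the new token runs against the full cache.
head_dim = 64
k_cache = torch.empty(0, head_dim)  # grows by one row per generated token
v_cache = torch.empty(0, head_dim)

for step in range(5):
    # Stand-ins for the projections of the single new token
    q_new = torch.randn(1, head_dim)
    k_new = torch.randn(1, head_dim)
    v_new = torch.randn(1, head_dim)
    k_cache = torch.cat([k_cache, k_new], dim=0)  # (step+1, head_dim)
    v_cache = torch.cat([v_cache, v_new], dim=0)
    # New token attends over all cached keys: O(n) work per step instead of O(n^2)
    scores = (q_new @ k_cache.T) / head_dim ** 0.5  # (1, step+1)
    out = torch.softmax(scores, dim=-1) @ v_cache   # (1, head_dim)

print(k_cache.shape)  # torch.Size([5, 64])
```

The memory cost is the flip side: the cache grows linearly with sequence length per layer and per head, which is why long-context serving is dominated by KV-cache management.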
| Model | Context Window | Positional Encoding | Effective Length* |
|---|---|---|---|
| GPT-4o | 128K | Learned + RoPE (est.) | 128K |
| Claude Sonnet 4.6 | 200K | Undisclosed | ~150K (practical) |
| Gemini 2.5 Pro | 1M | Undisclosed | ~500K (practical) |
| Llama 4 Maverick | 1M | RoPE | 1M |
| Mistral Large 3 | 128K | Sliding Window + RoPE | 128K |
| DeepSeek-R1 | 128K | RoPE | 128K |
*Effective length: the window over which the model reliably retrieves information. “Lost in the middle” research shows performance degrades for content placed in the middle of very long contexts.
RAG addresses two fundamental limitations of LLMs: their knowledge cutoff and their tendency to hallucinate. Instead of relying on what the model memorised during training, RAG retrieves relevant documents at inference time and injects them into the prompt.
graph LR
  A[User Query] --> B[Embed Query]
  B --> C[Vector Search]
  C --> D[Top-k Chunks]
  D --> E[Inject into Prompt]
  A --> E
  E --> F[LLM Generation]
  F --> G[Grounded Answer]
Naive RAG: Embed query → top-k similarity search → append chunks to prompt → generate. Simple but prone to irrelevant retrieval and context overload.
Advanced RAG: Adds query rewriting, re-ranking (cross-encoders), hypothetical document embeddings (HyDE), recursive retrieval, and context compression.
Modular RAG: Decoupled pipeline: routing, query transformation, retrieval, scoring, filtering, and fusion can each be swapped independently. Maximum flexibility.
Fixed-size chunking: Split every N tokens with M-token overlap. Fast but may cut mid-sentence. Good baseline.
Sentence/paragraph chunking: Split on sentence or paragraph boundaries. Preserves semantic units, but chunks vary in size.
Semantic chunking: Group sentences with high embedding similarity into coherent chunks. Best retrieval quality; highest compute cost at indexing time.
Hierarchical chunking: Index at multiple granularities (document → section → paragraph). Retrieve at the level matching the query's specificity.
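As a concrete baseline, a minimal fixed-size chunker with overlap might look like this (character-based for simplicity; real pipelines usually count tokens with the target model's tokenizer):

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap (character-based sketch)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # stride between chunk starts
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "word " * 200  # 1000-character toy document
chunks = chunk_fixed(doc, chunk_size=200, overlap=50)
print(len(chunks), len(chunks[0]))  # 7 200
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk.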
| Database | Deployment | Scale | Best For |
|---|---|---|---|
| pgvector | Self-hosted (Postgres ext.) | Millions | Existing Postgres users; simpler stack |
| Qdrant | Self-hosted / Cloud | Billions | High performance, rich filtering, open-source |
| Weaviate | Self-hosted / Cloud | Billions | Multi-modal, semantic + keyword hybrid |
| Pinecone | Fully managed cloud | Billions | Managed SaaS, minimal ops |
| Chroma | Local / self-hosted | Millions | Prototyping, local development |
The gap between open-weight and proprietary frontier models has narrowed dramatically. Llama 4 Maverick matches GPT-4o on many benchmarks; DeepSeek-R1 outperforms o1 on math and coding at a fraction of the training cost. The choice is increasingly about deployment model, data privacy, and total cost of ownership.
| Model | Params | Context | License | MMLU | Best Use Case |
|---|---|---|---|---|---|
| Open Weight | |||||
| Llama 4 Scout | ~17B active (MoE) | 1M | Llama 4 Community License | 79.6 | Fast, long context, on-device |
| Llama 4 Maverick | ~17B active (MoE) | 1M | Llama 4 Community License | 85.5 | Balanced capability/cost, long context |
| Llama 4 Behemoth | ~288B active (MoE) | 128K | Llama 4 Community License | ~92 (est.) | Frontier open-weight tasks |
| Mistral Large 3 | ~123B | 128K | Mistral Research License | 84.0 | Multilingual enterprise |
| Qwen2.5 72B | 72B | 128K | Apache 2.0 | 86.0 | Code, math, multilingual |
| DeepSeek-R1 | 671B MoE | 128K | MIT | 90.8 | Reasoning, math, science |
| Gemma 2 27B | 27B | 8K | Gemma License | 75.2 | Research, fine-tuning |
| Phi-4 | 14B | 16K | MIT | 84.8 | Reasoning, STEM, small footprint |
| Proprietary | |||||
| GPT-4o | ~200B est. | 128K | Proprietary | 88.7 | General frontier, multimodal |
| Claude Sonnet 4.6 | Undisclosed | 200K | Proprietary | 88.7 | Long context, coding, safety |
| Gemini 2.5 Pro | Undisclosed | 1M | Proprietary | 85.9 | Very long context, multimodal, reasoning |
| o3 / o1 | Undisclosed | 200K | Proprietary | ~91+ | Complex reasoning, frontier research |
Serving LLMs at scale requires solving memory (VRAM is scarce), throughput (many concurrent users), and latency (users want fast first tokens). The ecosystem has developed powerful solutions for each challenge.
Reducing the numerical precision of model weights shrinks VRAM and speeds up computation. Modern techniques (GPTQ, AWQ, GGUF) apply quantisation non-uniformly, protecting the most sensitive weights.
| Format | Bits/Weight | Relative Size | Quality Retention | Tooling |
|---|---|---|---|---|
| FP32 | 32 | 100% | 100% (baseline) | PyTorch default |
| FP16 / BF16 | 16 | 50% | ~100% | Standard training/inference |
| INT8 | 8 | 25% | ~99% | bitsandbytes, TensorRT-LLM |
| INT4 (GPTQ/AWQ) | 4 | 12.5% | ~95–98% | AutoGPTQ, AutoAWQ, llama.cpp |
| GGUF Q4_K_M | ~4.5 | ~14% | ~96% | llama.cpp, Ollama |
| GGUF Q2_K | ~2.6 | ~8% | ~88% | llama.cpp (CPU focus) |
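To make the mechanics concrete, here is a toy symmetric per-tensor INT8 round-trip. Production methods like GPTQ and AWQ quantise per-channel or per-group and calibrate against activations, so treat this only as a sketch of the core idea.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor INT8 quantisation (sketch)."""
    scale = w.abs().max() / 127.0          # map the largest weight to +/-127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(256, 256)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(f"max abs error: {(w - w_hat).abs().max():.4f}")  # bounded by scale / 2
```

The worst-case rounding error is half a quantisation step; the advanced formats in the table above shrink that step for the weights that matter most.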
vLLM: PagedAttention for efficient KV-cache management, continuous batching, tensor parallelism. The gold standard for high-throughput GPU serving. OpenAI-compatible API.
TGI (Text Generation Inference): Hugging Face's production server. Flash Attention 2, continuous batching, speculative decoding. Powers the Inference API.
Ollama: Dead-simple local serving via GGUF models. One command to pull and run any model. Not designed for high concurrency but excellent for development.
llama.cpp: Pure C++ inference with GGUF quantisation. Runs on MacBook M-series CPUs and consumer GPUs. Powers Ollama under the hood.
vLLM's PagedAttention (Kwon et al., 2023) manages KV-cache memory like virtual memory in an OS — splitting it into fixed-size pages allocated non-contiguously. This eliminates fragmentation and wasted reservation, enabling 2–24× more throughput than HuggingFace Transformers at equal GPU memory.
Continuous batching allows new requests to join an in-flight batch as soon as a sequence finishes, rather than waiting for the entire batch. This dramatically improves GPU utilisation under variable-length workloads.
# Start a vLLM server for Llama 4 Scout Instruct
# Requires: pip install vllm, CUDA GPU with ≥16GB VRAM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--dtype bfloat16 \
--port 8000
# The server exposes an OpenAI-compatible API:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [{"role": "user", "content": "Explain attention mechanisms."}],
"temperature": 0.7,
"max_tokens": 512
}'

Evaluation is one of the hardest open problems in LLM research. Standard benchmarks measure specific capabilities but correlate imperfectly with real-world usefulness. A comprehensive evaluation strategy combines automated benchmarks, human evaluation, and LLM-as-judge.
| Benchmark | Measures | Format | Limitation |
|---|---|---|---|
| MMLU | Broad knowledge (57 subjects) | 4-choice MCQ | MCQ format; contamination risk |
| HumanEval | Python code generation | Function completion + unit tests | Only Python; narrow task distribution |
| GSM8K | Grade-school math word problems | Free-form arithmetic | Saturated by frontier models (>95%) |
| HellaSwag | Commonsense NLI | 4-choice sentence completion | Saturated; adversarial but dated |
| MT-Bench | Instruction following (multi-turn) | LLM-as-judge (GPT-4) | GPT-4 judge has its own biases |
| GPQA Diamond | Graduate-level science | 4-choice MCQ by domain experts | Small dataset; hard to scale |
| MATH-500 | Competition mathematics | Exact answer match | Sensitive to format; solutions can be memorised |
| Model | MMLU | HumanEval | GSM8K | MATH |
|---|---|---|---|---|
| GPT-4o | 88.7 | 90.2 | 96.0 | 76.6 |
| Claude Sonnet 4.6 | 88.7 | 92.0 | 96.0 | 78.3 |
| Gemini 2.5 Pro | 85.9 | 84.1 | 91.7 | 67.7 |
| DeepSeek-R1 | 90.8 | 92.3 | 97.3 | 97.3 |
| Llama 4 Maverick | 85.5 | 85.4 | 95.0 | 72.0 |
| Llama 4 Scout | 79.6 | 77.0 | 89.0 | 58.0 |
| Mistral Large 3 | 84.0 | 92.0 | 93.0 | 69.0 |
| Llama 3.1 8B (2024) | 73.0 | 72.6 | 84.5 | 51.9 |
For open-ended tasks where reference answers don't exist, a powerful LLM can score responses using a structured rubric. MT-Bench and Chatbot Arena use this approach. The key risk is position bias (the judge prefers answers appearing first) and verbosity bias (longer answers score higher regardless of quality).
from openai import OpenAI
client = OpenAI()
def llm_judge(question: str, answer: str, rubric: str) -> dict:
prompt = f"""You are an expert evaluator. Score the following answer on a 1-10 scale.
Question: {question}
Answer: {answer}
Rubric: {rubric}
Respond with JSON: {{"score": <int>, "reasoning": "<str>", "strengths": ["..."], "weaknesses": ["..."]}}"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"},
temperature=0.0,
)
import json
return json.loads(response.choices[0].message.content)
result = llm_judge(
question="Explain the attention mechanism in transformers.",
answer="Attention computes a weighted sum of values...",
rubric="Accuracy (4pt), Clarity (3pt), Completeness (3pt)",
)
print(f"Score: {result['score']}/10 — {result['reasoning']}")

Understanding how LLMs work is the foundation — but choosing the right model, deployment architecture, and evaluation strategy for your specific use case requires hands-on experience. Our team has built production LLM systems across RAG, agents, fine-tuning, and enterprise deployment. Book a consultation to discuss your project.