Introduction to Large Language Models (LLMs)

A deep dive into how Large Language Models work, from transformers to GPT architecture.

2 min read · Nov 29, 2025

Introduction

Large Language Models (LLMs) have revolutionized how we interact with AI. From ChatGPT to Claude, these models demonstrate remarkable capabilities in understanding and generating human-like text.

The Transformer Architecture

The foundation of modern LLMs is the Transformer architecture, introduced in the 2017 paper “Attention Is All You Need.”

Key Components

  1. Self-Attention Mechanism: Lets the model weigh how much each token should attend to every other token when building its representation
  2. Positional Encoding: Injects information about token order, since attention itself has no built-in notion of sequence position (a sinusoidal sketch appears after the attention code below)
  3. Feed-Forward Networks: Apply a position-wise transformation to each attended representation

Putting component 1 into code, a compact PyTorch implementation of multi-head self-attention looks like this; the attention weights are softmax(QK^T / sqrt(d_head)) applied to the values:

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size: int, heads: int):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        # Full-width projections; the heads are split out by reshaping below
        self.queries = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.values = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)

    def forward(self, query, key, value, mask=None):
        N = query.shape[0]
        q_len, k_len, v_len = query.shape[1], key.shape[1], value.shape[1]

        # Linear projections
        Q = self.queries(query)
        K = self.keys(key)
        V = self.values(value)

        # Split the embedding dimension into (heads, head_dim)
        Q = Q.reshape(N, q_len, self.heads, self.head_dim)
        K = K.reshape(N, k_len, self.heads, self.head_dim)
        V = V.reshape(N, v_len, self.heads, self.head_dim)

        # Scaled dot-product scores, shape (N, heads, q_len, k_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [Q, K])

        if mask is not None:
            # Positions where mask == 0 cannot be attended to
            energy = energy.masked_fill(mask == 0, float("-inf"))

        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=3)

        # Weighted sum over the key/value positions (k_len == v_len)
        out = torch.einsum("nhqk,nkhd->nqhd", [attention, V])
        out = out.reshape(N, q_len, self.embed_size)

        return self.fc_out(out)
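
Component 2, positional encoding, can take several forms. As a minimal sketch, here is the fixed sinusoidal variant from the original Transformer paper, reusing the embed_size naming from the attention module above and assuming an even embed_size:

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to token embeddings."""

    def __init__(self, embed_size: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
        # Geometric progression of wavelengths across the embedding dimensions
        div_term = torch.exp(
            torch.arange(0, embed_size, 2, dtype=torch.float32)
            * (-math.log(10000.0) / embed_size)
        )
        pe = torch.zeros(max_len, embed_size)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe)                 # stored with the module, not trained

    def forward(self, x):
        # x: (batch, seq_len, embed_size) token embeddings
        return x + self.pe[: x.size(1)]

Because the encoding is simply added to the token embeddings before the first attention layer, the model can recover order information that attention alone would discard.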

How LLMs Learn

Pre-training

LLMs are trained on massive text corpora using self-supervised learning:

  • Masked Language Modeling (BERT-style): Predict masked tokens
  • Causal Language Modeling (GPT-style): Predict the next token
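
As a concrete illustration of the causal objective, the sketch below computes the standard next-token cross-entropy loss by shifting the sequence one position; model is a stand-in for any network that maps token IDs to vocabulary logits, not a particular library's API:

import torch
import torch.nn.functional as F

def causal_lm_loss(model, token_ids):
    """Next-token prediction loss for a (batch, seq_len) tensor of token IDs."""
    inputs = token_ids[:, :-1]    # tokens the model conditions on
    targets = token_ids[:, 1:]    # the same sequence shifted left by one
    logits = model(inputs)        # assumed shape: (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten batch and time steps
        targets.reshape(-1),                   # matching flat targets
    )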

Fine-tuning

After pre-training, models can be fine-tuned for specific tasks:

  • Instruction following
  • Code generation
  • Question answering
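
One common way to implement this is supervised fine-tuning on instruction/response pairs, where the loss is computed only over the response tokens. The sketch below shows that masking pattern; the tokenizer, model, and example field names are illustrative placeholders rather than a specific library's API:

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # cross_entropy skips positions labeled with this value

def sft_step(model, optimizer, tokenizer, example):
    """One fine-tuning step on an {"instruction": ..., "response": ...} pair."""
    prompt_ids = tokenizer(example["instruction"])    # assumed to return a list of ints
    response_ids = tokenizer(example["response"])

    input_ids = torch.tensor([prompt_ids + response_ids])   # shape (1, seq_len)
    labels = input_ids.clone()
    labels[:, : len(prompt_ids)] = IGNORE_INDEX              # don't train on the prompt

    logits = model(input_ids[:, :-1])                        # (1, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),                           # shifted targets
        ignore_index=IGNORE_INDEX,
    )

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

In practice, training frameworks batch many such pairs with padding, but the core idea of masking the prompt and predicting the response stays the same.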

Popular Model Families

Model Family   Organization   Parameters   Key Feature
GPT-4          OpenAI         ~1T+         Multimodal
Claude         Anthropic      ~100B+       Constitutional AI
Llama          Meta           7B-70B       Open weights
Gemini         Google         ~1T+         Native multimodal

Conclusion

LLMs represent a paradigm shift in AI. Understanding their architecture helps us leverage them effectively and anticipate their limitations.

Further Reading

  • “Attention Is All You Need” (Vaswani et al., 2017)
  • “Language Models are Few-Shot Learners” (GPT-3 paper)
  • Andrej Karpathy’s “Let’s build GPT” video series