Introduction to Large Language Models (LLMs)

A deep dive into how Large Language Models work, from transformers to GPT architecture.

Introduction

Large Language Models (LLMs) have revolutionized how we interact with AI. From ChatGPT to Claude, these models demonstrate remarkable capabilities in understanding and generating human-like text.

The Transformer Architecture

The foundation of modern LLMs is the Transformer architecture, introduced in the 2017 paper “Attention Is All You Need.”

Key Components

  1. Self-Attention Mechanism: Lets each token weigh the relevance of every other token in the sequence
  2. Positional Encoding: Injects sequence-order information, since attention itself is order-agnostic
  3. Feed-Forward Networks: Transform each attended representation position-wise
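A minimal multi-head self-attention module in PyTorch looks like this (a sketch of the mechanism, not a production implementation):
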
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size: int, heads: int):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        assert self.head_dim * heads == embed_size, "embed_size must be divisible by heads"
        
        self.queries = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.values = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)
        
    def forward(self, query, key, value, mask=None):
        N = query.shape[0]
        q_len, k_len, v_len = query.shape[1], key.shape[1], value.shape[1]
        
        # Linear projections
        Q = self.queries(query)
        K = self.keys(key)
        V = self.values(value)
        
        # Reshape for multi-head attention
        Q = Q.reshape(N, q_len, self.heads, self.head_dim)
        K = K.reshape(N, k_len, self.heads, self.head_dim)
        V = V.reshape(N, v_len, self.heads, self.head_dim)
        
        # Attention scores for every query/key pair: (N, heads, q_len, k_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [Q, K])
        
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-inf"))
        
        # Scale by sqrt(head_dim) and normalize over the key dimension
        attention = torch.softmax(energy / (self.head_dim ** 0.5), dim=-1)
        
        # Weighted sum of values; the key/value length must share one einsum index
        out = torch.einsum("nhqk,nkhd->nqhd", [attention, V])
        out = out.reshape(N, q_len, self.embed_size)
        
        return self.fc_out(out)
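
Positional encoding is similarly compact. The sketch below implements the sinusoidal scheme from the original paper, reusing the torch imports above; max_len is an assumed upper bound on sequence length:

import math

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, embed_size: int, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, embed_size)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Wavelengths increase geometrically across the embedding dimensions
        div_term = torch.exp(
            torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)  # fixed, not a learned parameter

    def forward(self, x):
        # x: (batch, seq_len, embed_size); add each position's encoding
        return x + self.pe[: x.size(1)]

Because the encodings are deterministic rather than learned, the paper hypothesized they might help the model generalize to sequence lengths rarely seen during training.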

How LLMs Learn

Pre-training

LLMs are trained on massive text corpora using self-supervised learning:

  • Masked Language Modeling (BERT-style): Predict randomly masked tokens from their surrounding context
  • Causal Language Modeling (GPT-style): Predict the next token from everything that precedes it (sketched below)
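
In the causal case, the whole objective is next-token cross-entropy: shift the input one position to obtain the targets. A minimal sketch, where the model callable is assumed to map token IDs to per-position logits over the vocabulary:

import torch.nn.functional as F

def causal_lm_loss(model, tokens):
    # tokens: (batch, seq_len) integer token IDs
    inputs = tokens[:, :-1]   # the model sees positions 0 .. n-2
    targets = tokens[:, 1:]   # and must predict positions 1 .. n-1
    logits = model(inputs)    # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (tokens, vocab)
        targets.reshape(-1),
    )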

Fine-tuning

After pre-training, models can be fine-tuned for specific tasks (see the sketch after this list):

  • Instruction following
  • Code generation
  • Question answering
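
Supervised instruction tuning, for example, trains on prompt/response pairs with the same next-token loss, but only the response tokens count. A sketch under assumptions: prompt_ids and response_ids are token-ID tensors, F is the functional import from above, and -100 is PyTorch's ignore_index convention for cross_entropy:

def instruction_tuning_loss(model, prompt_ids, response_ids):
    # Concatenate prompt and response into one sequence per example
    tokens = torch.cat([prompt_ids, response_ids], dim=1)
    inputs = tokens[:, :-1]
    targets = tokens[:, 1:].clone()
    # Mask prompt positions so only response tokens contribute to the loss
    targets[:, : prompt_ids.size(1) - 1] = -100
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )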

Notable Model Families

Model Family   Organization   Parameters   Key Feature
GPT-4          OpenAI         ~1T+         Multimodal
Claude         Anthropic      ~100B+       Constitutional AI
Llama          Meta           7B-70B       Open weights
Gemini         Google         ~1T+         Native multimodal

Conclusion

LLMs represent a paradigm shift in AI. Understanding their architecture helps us leverage them effectively and anticipate their limitations.

Further Reading

  • “Attention Is All You Need” (Vaswani et al., 2017)
  • “Language Models are Few-Shot Learners” (Brown et al., 2020)
  • Andrej Karpathy’s “Let’s build GPT” video (from his “Neural Networks: Zero to Hero” series)