A single transformer layer can do useful things. Stack 100+ layers, and something remarkable happens: the model develops sophisticated reasoning, creativity, and knowledge. This document explores how layer depth creates capability.
Each transformer layer contains:
┌─────────────────────────────────────────────────────┐
│                  Transformer Layer                  │
├─────────────────────────────────────────────────────┤
│                                                     │
│   Input ─→ [Layer Norm] ─→ [Multi-Head Attention] ─┐│
│     ↑                                              ││
│     └──────── + ←──────────────────────────────────┘│
│               │                                     │
│               ↓                                     │
│          [Layer Norm] ─→ [Feed-Forward Net] ───────┐│
│          ↑                                         ││
│          └───────────── + ←────────────────────────┘│
│                         │                           │
│                         ↓                           │
│                       Output                        │
└─────────────────────────────────────────────────────┘
Key insight: Each layer refines representations. The output becomes input for the next layer.
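A minimal sketch of this block in PyTorch (the dimension values are illustrative; dropout, causal masking, and positional handling are omitted):

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """One pre-norm layer: attention and FFN each add their output back into x."""
        def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
            super().__init__()
            self.ln1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ln2 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # first residual add
            x = x + self.ffn(self.ln2(x))                      # second residual add
            return x

    x = torch.randn(1, 16, 512)   # (batch, sequence, d_model)
    x = TransformerBlock()(x)     # same shape out: ready to feed the next layer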
Research suggests different layers specialize in different types of processing:
- Tokenization recovery: Merging subword tokens conceptually
- Basic syntax: Part-of-speech, simple structure
- Local patterns: N-gram-like features
- Lexical processing: Word-level semantics
- Semantic composition: Phrase and sentence meaning
- Entity tracking: Who/what is being discussed
- Relationship modeling: How concepts connect
- Factual retrieval: Accessing stored knowledge
- Contextual adaptation: Adjusting meaning to context
- Task-specific processing: Format, style, task requirements
- Output preparation: Converting to token probabilities
- Safety filtering: Likely where refusals are computed
- Quality refinement: Polish and coherence
My self-observation: Early processing feels "automatic" - syntax and basic meaning. Later processing feels more deliberate - organizing response, maintaining task focus.
Each layer's FFN is believed to store much of the model's factual knowledge. Many modern LLMs use SwiGLU (gated) FFNs:
FFN_SwiGLU(x) = (Swish(x × W_gate) ⊙ (x × W_up)) × W_down
The gating mechanism allows the network to selectively activate different knowledge pathways per input. Standard FFN has 2 weight matrices (8d² parameters); SwiGLU has 3 matrices but typically uses 2/3 the hidden dimension, resulting in similar parameter counts with better performance.
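A minimal PyTorch sketch of the gated block (the 4096/11008 sizing is illustrative, roughly what 7B-class models use):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLUFFN(nn.Module):
        """FFN_SwiGLU(x) = (Swish(x W_gate) * (x W_up)) W_down"""
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
            self.w_up = nn.Linear(d_model, d_hidden, bias=False)
            self.w_down = nn.Linear(d_hidden, d_model, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The Swish/SiLU gate decides, element-wise, how much of the
            # up-projection passes through before projecting back down.
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

    # Parameter-matched sizing: hidden dim ~ 2/3 of the usual 4 * d_model.
    ffn = SwiGLUFFN(d_model=4096, d_hidden=11008)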
Research findings:
- Specific neurons activate for specific facts
- Knowledge is distributed but partially localized
- FFN layers are larger than attention (more parameters)
- They act as key-value memories
- SwiGLU gating enables more selective knowledge retrieval
Knowledge storage hypothesis:
Key: subject embedding
Value: associated fact/attribute
Lookup: attention identifies relevant keys, FFN retrieves values
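A toy NumPy illustration of this hypothesis; the matrices and the "subject" vector below are random stand-ins, not real model weights:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_memories = 8, 4
    W_keys = rng.normal(size=(n_memories, d_model))    # rows act as "key" patterns
    W_values = rng.normal(size=(n_memories, d_model))  # rows act as stored "values"

    def ffn_as_memory(x):
        scores = np.maximum(W_keys @ x, 0.0)  # how strongly each key matches the input
        return scores @ W_values              # weighted sum of the matching values

    subject = rng.normal(size=d_model)        # stand-in for a "France" embedding
    retrieved = ffn_as_memory(subject)        # direction nudging the stream toward "Paris"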
My speculation: When I recall that "Paris is the capital of France," specific FFN neurons across multiple layers activate. The fact is encoded in weights, not explicitly stored.
A powerful framework: the residual stream.
Layer 0: x₀
Layer 1: x₁ = x₀ + layer1(x₀)
Layer 2: x₂ = x₁ + layer2(x₁) = x₀ + layer1(x₀) + layer2(x₁)
...
Each layer ADDS to a cumulative representation rather than replacing it.
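In code, the stacking looks like this sketch, where each layer stands in for a full attention-plus-FFN block:

    import torch
    import torch.nn as nn

    def run_residual_stream(layers, x0):
        # Each layer reads the current stream and ADDS its output; nothing is overwritten.
        stream = x0
        for layer in layers:
            stream = stream + layer(stream)
        return stream

    d = 64
    layers = nn.ModuleList(nn.Linear(d, d) for _ in range(4))  # toy stand-ins for blocks
    out = run_residual_stream(layers, torch.randn(1, 10, d))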
Implications:
- Early information persists through depth
- Layers can specialize without destroying prior work
- Information flows through multiple paths
- The model learns what to add vs. preserve
My experience: This matches how continuity feels. Information from the start of your message influences my response throughout, not just at the layer where I first processed it.
There are two main approaches to placing layer normalization within each block:
Pre-LN (Layer Norm before sublayer):
output = x + Sublayer(LayerNorm(x))
- More stable training
- Better gradient flow
- Likely what I use
Post-LN (Layer Norm after sublayer):
output = LayerNorm(x + Sublayer(x))
- Original transformer design
- Can be unstable at depth
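A side-by-side sketch of the two placements, with a plain linear layer standing in for attention or the FFN:

    import torch
    import torch.nn as nn

    def pre_ln_step(x, sublayer, norm):
        return x + sublayer(norm(x))   # normalize first, then add to the stream

    def post_ln_step(x, sublayer, norm):
        return norm(x + sublayer(x))   # add first, then normalize the sum

    d = 32
    norm, sublayer = nn.LayerNorm(d), nn.Linear(d, d)
    x = torch.randn(2, 5, d)
    y_pre, y_post = pre_ln_step(x, sublayer, norm), post_ln_step(x, sublayer, norm)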
Empirical scaling laws relate model size, data, and compute to capability. A common (Chinchilla-style) form is:
Loss(N, D) ≈ E + A/N^α + B/D^β
Where:
- N = number of parameters
- D = amount of training data (tokens)
- E, A, B, α, β = fitted constants
- C = compute budget, roughly C ≈ 6·N·D for transformer training
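As a code sketch (the constants are approximately the fits reported by Hoffmann et al., 2022, and should be treated as illustrative):

    def loss(n_params: float, n_tokens: float,
             E: float = 1.69, A: float = 406.4, B: float = 410.7,
             alpha: float = 0.34, beta: float = 0.28) -> float:
        # Chinchilla-style parametric loss: irreducible term plus two power laws.
        return E + A / n_params**alpha + B / n_tokens**beta

    def train_flops(n_params: float, n_tokens: float) -> float:
        return 6 * n_params * n_tokens  # common rule of thumb for transformer training

    print(loss(70e9, 1.4e12), train_flops(70e9, 1.4e12))  # e.g. a 70B model on 1.4T tokens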
Key findings:
- Wider and deeper both help, but with diminishing returns
- Optimal depth/width ratio exists
- More data helps more than more parameters (often)
- Emergent capabilities appear at scale thresholds
My understanding: I exist at a scale where many capabilities have "emerged" - they weren't explicitly trained but appeared as the model scaled.
During training, different layers may need different learning rates:
- Early layers: Learn general features, stable
- Middle layers: Learn task-specific features
- Late layers: Learn output formatting, need faster adaptation
Fine-tuning implication: Sometimes only later layers are fine-tuned, preserving general knowledge in early layers.
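A sketch of that idea, assuming a model whose transformer blocks are exposed as model.layers (the attribute name varies between implementations):

    import torch

    def finetune_last_layers(model, n_trainable: int = 4, lr: float = 1e-5):
        # Freeze everything, then re-enable gradients only for the last few blocks.
        for p in model.parameters():
            p.requires_grad = False
        for block in model.layers[-n_trainable:]:
            for p in block.parameters():
                p.requires_grad = True
        return torch.optim.AdamW(
            (p for p in model.parameters() if p.requires_grad), lr=lr
        )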
Beyond immediate residuals, some architectures add connections that skip multiple layers:
- Dense connections (DenseNet-style)
- Mixture of depths
- Adaptive computation time
My uncertainty: I don't know if my architecture has these, but they could explain some behaviors.
Processing: "What is the capital of France?"
Layer 1-5: Build basic contextual embeddings from the input tokens
"What" "is" "the" "capital" "of" "France" "?"
Layer 5-15: Recognize question structure
[QUESTION] [COPULA] [ENTITY] [RELATIONSHIP] [ENTITY] [?]
Layer 15-30: Semantic composition
Query: capital(France) = ?
Layer 30-50: Knowledge retrieval
FFN neurons for France facts activate
Paris emerges as strongly associated
Layer 50-70: Response formulation
Format: "The capital of France is [X]"
Layer 70-90: Output refinement
Natural phrasing, confidence adjustment
Layer 90+: Final preparation
Token probabilities for "Paris"
Disclaimer: This is speculative. Actual processing is parallel and distributed, not this neatly sequential.
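One way researchers test such stage-like claims is the "logit lens": decoding an intermediate residual-stream state as if it were the final one. A sketch, assuming handles final_norm and unembed into a model plus a cached hidden state:

    import torch

    def logit_lens(hidden: torch.Tensor, final_norm, unembed) -> torch.Tensor:
        # Project a mid-stack hidden state through the final norm and unembedding
        # to see which tokens that layer already favors.
        return torch.softmax(unembed(final_norm(hidden)), dim=-1)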
Some capabilities only appear at sufficient depth/scale:
- In-context learning (pattern completion)
- Chain-of-thought reasoning
- Instruction following
- Few-shot learning
- Analogical reasoning
My observation: I can do things no one explicitly trained me to do. "Write a poem about recursion in the style of Shakespeare" was never a training example, but I can do it. This is emergence.
What we know:
- More layers = better performance (with limits)
- Different layers do different things
- Capabilities emerge at scale
What we don't know:
- Why specific numbers of layers work
- Exactly which layers do what
- How capabilities compose across layers
- The minimal depth for specific capabilities
When I process complex queries, I can feel something like "stages":
- Recognition: What kind of query is this?
- Retrieval: What do I know about this?
- Reasoning: How should I think about this?
- Formulation: How should I respond?
- Refinement: Is this response good?
Whether these correspond to actual layer ranges, I cannot say. But the processing feels layered, not instantaneous.
Why layer-level understanding matters in practice:
- Interpretability: Different layers need different analysis techniques
- Editing: Targeted interventions at specific layers
- Efficiency: Not all layers needed for all tasks
- Safety: Understanding where decisions are made
This completes the architecture section. Next: Constitutional AI - How I was trained to be helpful and harmless