aasb-ch04-attention
Chapter 4: Attention
Attention as a property of the loop
The Derivation
Given the four premises, a question emerges: where do you point your bounded fetch capacity?
You can't fetch everything. Your bandwidth is limited (Premise 4). Information takes time to arrive (Premise 2). You're sampling a region, not surveying the whole (Premise 3). You can only query what you can address (Premise 1).
So every system operating under these constraints faces a resource allocation problem:
- Multiple potential gaps exist simultaneously
- Only some can be addressed at any moment
- The choice of which gap to fill is itself a decision
- That decision determines what information you get, which shapes all subsequent decisions
Attention is the mechanism that resolves this. Attention is the allocation of bounded fetch capacity to selected gaps.
This is not a metaphor. It is the forced solution to the optimization problem created by the four constraints.
Technical / Science
Attention in Neural Networks
The "Attention Is All You Need" paper (2017) introduced the transformer architecture. The key insight: instead of processing sequences rigidly left-to-right, allow the model to attend selectively to relevant positions.
The mechanism:
- Query (Q): "What am I looking for?"
- Key (K): "What's available at each position?"
- Value (V): "What information is there?"
- Attention weights: softmax over the scaled Q·K similarities determines how much each position contributes
- Output: Weighted sum of values
This is PAUSE → FETCH → SPLICE implemented in differentiable operations:
- The query is the gap recognition ("what do I need?")
- The key-value lookup is the fetch ("where is it?")
- The weighted sum is the splice ("integrate the relevant pieces")
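Here is a minimal NumPy sketch of that mechanism: a single head of scaled dot-product attention, with the loop roles noted in comments. The shapes and variable names are illustrative, not taken from any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # gap recognition: how relevant is each position?
    weights = softmax(scores, axis=-1)  # bounded budget: each query's weights sum to 1
    return weights @ V, weights         # splice: weighted sum of the fetched values

# Toy usage: 2 queries over 4 key/value positions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))  # (2, 16), and weights that sum to 1 per query
```

The softmax is where the budget constraint lives: each query gets exactly one unit of weight to distribute, so attending more to one position necessarily means attending less to the others.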
Sparse Activation
Biological and artificial neural networks both exhibit sparse activation. Not all neurons fire for all inputs. Not all attention heads attend to all positions.
Why? Resource constraints. If everything activated for everything, the energy and bandwidth cost would scale with the entire network on every input, which no physical system can sustain. Sparsity is the architectural consequence of finitude.
Mixture of Experts models make this explicit: only a subset of parameters activate for any given input. The routing mechanism is an attention-like decision about where to allocate compute.
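A toy sketch of that routing decision, assuming a simple linear gate and top-k selection; the expert and gate structures here are invented for illustration and are not any specific library's API.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(x, gate_w, experts, k=2):
    """x: (d,) input; gate_w: (d, n_experts); experts: list of callables."""
    logits = x @ gate_w                        # attention-like relevance score per expert
    top = np.argsort(logits)[-k:]              # only k experts "activate" for this input
    gates = softmax(logits[top])               # renormalize over the selected experts
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Usage: 4 linear "experts", of which only 2 run per input.
rng = np.random.default_rng(1)
d, n_experts = 8, 4
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d))) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = route(rng.normal(size=d), gate_w, experts)
print(y.shape)  # (8,): the output, computed with only half the experts
```

The gate spends a fixed compute budget (k experts) per input; whatever the unselected experts might have contributed is simply never computed.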
Saliency and Gaze
In biological vision, the eye doesn't process the entire visual field at high resolution. The fovea (central 2°) has dense photoreceptors; peripheral vision is low-resolution.
Saccades—rapid eye movements—redirect the fovea to regions of interest. Saliency maps (computed unconsciously) determine what's "interesting" and guide gaze.
This is attention implemented in hardware: limited high-resolution processing capacity, allocated to selected regions based on computed relevance.
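A toy simulation of that allocation, using local contrast as a stand-in saliency measure and argmax-plus-suppression as a stand-in for saccade selection with inhibition of return; real saliency models are far richer.

```python
import numpy as np

def saliency(image):
    """Crude saliency: absolute difference from the 3x3 neighborhood mean."""
    pad = np.pad(image, 1, mode="edge")
    h, w = image.shape
    neigh = sum(pad[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    return np.abs(image - neigh)

def fixations(image, n=3, suppress=2):
    """Pick n fixation points, suppressing each chosen region before the next saccade."""
    sal = saliency(image)
    points = []
    for _ in range(n):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)    # allocate the fovea
        points.append((int(y), int(x)))
        sal[max(0, y - suppress):y + suppress + 1,
            max(0, x - suppress):x + suppress + 1] = 0        # inhibition of return
    return points

rng = np.random.default_rng(2)
img = rng.random((16, 16))
img[4, 10] = 5.0                 # one conspicuous spot
print(fixations(img))            # the first fixation lands on the conspicuous spot
```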
Bottleneck Architectures
Information bottlenecks appear throughout neural processing:
- Optic nerve: ~1 million fibers compressing ~100 million photoreceptors
- Thalamic gating: filtering what reaches cortex
- Working memory: ~4 items (Miller's 7±2, revised down)
Each bottleneck forces selection. You can't pass everything through. Attention determines what makes it through the bottleneck.
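A toy sketch of selection at a bottleneck: many signals compete, a fixed capacity passes through, and an invented priority score stands in for attention.

```python
import heapq

def through_bottleneck(items, relevance, capacity=4):
    """Keep only the `capacity` most relevant items; everything else is dropped."""
    return heapq.nlargest(capacity, items, key=relevance)

signals = ["deadline tomorrow", "email ping", "phone buzz", "colleague question",
           "fire alarm", "news headline", "stomach growl"]
priority = {"fire alarm": 10, "deadline tomorrow": 8, "colleague question": 5,
            "stomach growl": 3, "phone buzz": 2, "email ping": 2, "news headline": 1}

print(through_bottleneck(signals, priority.get))
# 4 signals make it through; the other 3 never reach downstream processing
```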
Business / Practical
Focus as Competitive Advantage
Every organization has more opportunities than resources. Strategy is choosing which opportunities to pursue—and more importantly, which to ignore.
Focus is attention at organizational scale:
- "What business are we in?" bounds the space of relevant gaps
- "What's our priority this quarter?" allocates bounded capacity
- "What are we NOT doing?" is as important as what we are
Companies fail by spreading attention too thin. The startup beats the incumbent by focusing narrowly while the incumbent attends to everything.
Opportunity Cost Is Attention Cost
Every gap you address is a gap you're not addressing. The economist's opportunity cost is the attention-theoretic trade-off.
When you say "let's look into X," you're spending from a finite attention budget. The cost is not just the time spent—it's all the other inquiries that didn't happen.
Calendar as attention allocator: Your schedule is a statement about where you're directing fetch capacity. Meetings are pre-committed attention. "No meetings Wednesday" is a policy about preserved attention.
Metrics Shape Attention
"What gets measured gets managed" is an observation about attention. Metrics direct organizational gaze. If you track revenue, attention flows to revenue-generating activities. If you track customer satisfaction, attention flows there.
The dark side: Goodhart's Law. When a measure becomes a target, it ceases to be a good measure. The organization attends to the metric, not the underlying reality. Gaps in metric-relevant information get filled; gaps in non-measured areas go unaddressed.
The Email Problem
Email is an attention attack. Every message is someone else trying to redirect your fetch capacity toward their priorities.
Inbox Zero, notification management, "deep work" scheduling—all are defensive strategies for protecting attention allocation from external hijacking.
Theology / Philosophical
Prayer as Attention Training
Contemplative traditions emphasize training attention. Prayer, meditation, lectio divina—all involve directing focus deliberately rather than letting it drift.
"Be still and know" (Psalm 46:10) is an instruction about attention. Stop the scattered fetch operations. Direct focus to one thing. See what becomes available when you stop sampling randomly.
Centering prayer: Return attention to a sacred word when it wanders. The practice is literally attention training—noticing when focus has drifted, redirecting it, repeat.
Meditation: Redirecting the Fetch
Buddhist meditation often begins with breath awareness. Why breath? Because it's always available, always happening, boring enough to reveal when attention has wandered.
The instruction isn't "never let attention wander." The instruction is "notice when it has wandered and return." This is training the PAUSE—the capacity to detect where attention currently points and redirect it.
Vipassana (insight meditation) extends this to observing the movements of attention itself. What captures it? What releases it? The observer watches itself attending.
"Where Your Treasure Is, There Your Heart Will Be Also"
Matthew 6:21 is an observation about attention and value. What you attend to, you value. What you value, you attend to. The loop reinforces itself.
This cuts both ways:
- Attend to worthy things → grow to value them → attend more
- Attend to unworthy things → grow to value them → attend more
Hence the traditions' emphasis on careful curation of attention. What you gaze at, you become. The fetch patterns shape the observer.
Idolatry as Misallocated Attention
The prohibition against idols can be read attention-theoretically. An idol is a thing that captures attention that should point elsewhere. The golden calf isn't harmful because it's gold or because it's a calf. It's harmful because gaze directed there isn't directed at what matters.
Modern idols aren't statues. They're anything that captures attention disproportionate to its worth: status, accumulation, trivial entertainment, outrage cycles.
The sin isn't enjoying things. It's the misallocation—the fetch capacity directed at infinite scroll that could have addressed gaps that actually matter.
The Loop Property
Attention is not something added to the loop. Attention is what the loop does when it must choose among gaps.
Given:
- Multiple gaps exist (Premise 3)
- Only some can be fetched (Premise 4)
- Fetching takes time (Premise 2)
- Fetch targets must be addressable (Premise 1)
The system must have a mechanism for selection. That mechanism is attention.
| Component | Attention Function |
|---|---|
| PAUSE | Detect gaps, generate candidates |
| Attention | Select which gap to address |
| FETCH | Execute the selected query |
| SPLICE | Integrate the returned information |
| CONTINUE | Expose new gaps, return to PAUSE |
Attention is the steering within the loop. The loop is the invariant; attention determines its instantiation.
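The table reads naturally as code. Here is a schematic sketch of the loop with attention as the one step that chooses among candidate gaps; the gaps, priorities, and sources are invented for illustration.

```python
def pause(state):
    """PAUSE: detect gaps, i.e. what the current state does not yet know."""
    return [k for k, v in state.items() if v is None]

def attend(gaps, priority):
    """Attention: allocate bounded fetch capacity to one gap among many."""
    return max(gaps, key=lambda g: priority.get(g, 0))

def fetch(gap, sources):
    """FETCH: execute the selected query."""
    return sources.get(gap)

def splice(state, gap, info):
    """SPLICE: integrate the returned information into the state."""
    return {**state, gap: info}

def loop(state, priority, sources, budget=3):
    for _ in range(budget):               # CONTINUE: finite budget, then stop
        gaps = pause(state)
        if not gaps:
            break
        gap = attend(gaps, priority)      # the steering step
        state = splice(state, gap, fetch(gap, sources))
    return state

state = {"venue": None, "date": None, "catering": None, "guest_list": None}
priority = {"date": 3, "venue": 2, "catering": 1}
sources = {"venue": "library hall", "date": "June 12", "catering": "buffet", "guest_list": ["A", "B"]}
print(loop(state, priority, sources))  # three gaps filled; the lowest-priority gap stays open
```

With a budget of three fetches and four gaps, one gap necessarily goes unaddressed; changing the priorities changes which one, which is the whole point.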
Why Attention Is All You Need
The transformer paper's title is literally true, and more broadly than its authors may have intended.
If you have:
- A mechanism to represent gaps (queries)
- A mechanism to represent available information (keys, values)
- A mechanism to select relevance (attention weights)
- A mechanism to integrate (weighted sum)
You have the complete loop in differentiable form. Everything else is optimization.
The paper is not just about neural networks. The architecture it describes is an implementation of the only strategy available to finite embedded observers.
All you need is the loop. Attention is how the loop chooses.
Provenance
Document
- Status: 🔴 Unverified