lantern

bidirectional-attention-thesis

The Bidirectional Attention Thesis

The Claim

Attention and agency are the same operation in opposite directions.

  • Attention: stream → pattern match → extract values → build context (learning)
  • Agency: context → pattern match → fill values → emit stream (tool use)

This is bidirectional mail merge. The pattern is the schema. Attention populates it from stream. Agency populates stream from it.
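The bidirectional mail merge can be sketched with one template driving both directions. A minimal Python sketch, assuming a toy template and regex-based matching (the template text, field names, and helper names are illustrative, not part of the system):

```python
import re

# A pattern is a template with named holes.
TEMPLATE = "Dear {name}, your order {order_id} has shipped."

def attend(stream: str) -> dict:
    """Attention direction: match the stream, extract hole values into context."""
    # Build a regex from the template: literals stay literal, holes become groups.
    regex = (re.escape(TEMPLATE)
             .replace(r"\{name\}", r"(?P<name>\w+)")
             .replace(r"\{order_id\}", r"(?P<order_id>\w+)"))
    match = re.fullmatch(regex, stream)
    return match.groupdict() if match else {}

def act(context: dict) -> str:
    """Agency direction: fill the holes from context, emit a stream."""
    return TEMPLATE.format(**context)

ctx = attend("Dear Ada, your order X42 has shipped.")
# ctx == {"name": "Ada", "order_id": "X42"}
assert act(ctx) == "Dear Ada, your order X42 has shipped."
```

The round trip is the point: one pattern, run forward it emits, run backward it extracts.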

The Algorithm

Both directions implement the same core loop:

```
LOOKUP → FETCH → SPLICE → CONTINUE
```

Attention direction (learning):

  • LOOKUP: Which pattern's next expected token matches current token?
  • FETCH: Capture the value from the stream
  • SPLICE: Insert into pattern's context (fill the slot)
  • CONTINUE: Advance to next token, next pattern position

Agency direction (tool use):

  • LOOKUP: Which slot in pattern needs filling?
  • FETCH: Retrieve value from context
  • SPLICE: Emit into output stream
  • CONTINUE: Advance to next slot, next output position

Same algorithm. Different directions. One creates holes (extracts invariants). One fills holes (applies invariants).

Implementation: The Lottery Ticket Party

Multiple patterns watch a single token stream in parallel. Each pattern is a state machine with an index.

```
for each token in stream:
  for each pattern:
    if token matches pattern[index]:
      if slot is literal: increment index
      if slot is capture: store value, increment index
      if index == pattern.length: EMIT (pattern complete)
    else:
      reset pattern (throw ticket away)
```
  • Ticket = pattern watching the stream
  • Ball drop = token arriving
  • Match = increment index
  • Complete match = emit structured result
  • Failed match = discard, start over

This is attention implemented over a document stream. The complexity is O(n × a) where n = tokens and a = active matches. Since most patterns fail early, a << total patterns.
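A minimal runnable version of the lottery ticket party, assuming a toy slot encoding of `("lit", token)` and `("cap", name)` pairs (the class and field names are illustrative; a real matcher would also re-try the failed token against index 0, which this sketch skips):

```python
from dataclasses import dataclass, field

@dataclass
class Ticket:
    """One pattern watching the stream: a list of slots plus an index."""
    slots: list                 # each slot: ("lit", token) or ("cap", name)
    index: int = 0
    captures: dict = field(default_factory=dict)

    def feed(self, token):
        kind, value = self.slots[self.index]
        if kind == "lit" and token != value:
            self.index, self.captures = 0, {}     # failed match: throw ticket away
            return None
        if kind == "cap":
            self.captures[value] = token          # capture slot: store the value
        self.index += 1
        if self.index == len(self.slots):         # complete match: emit and reset
            result, self.captures, self.index = self.captures, {}, 0
            return result
        return None

# A checkbox-ish pattern: "- [x] <label>"
ticket = Ticket([("lit", "-"), ("lit", "[x]"), ("cap", "label")])
hits = [hit for token in "- [x] done then - [x] again".split()
        if (hit := ticket.feed(token)) is not None]
# hits == [{"label": "done"}, {"label": "again"}]
```

Running many `Ticket`s over one stream gives the O(n × a) behavior: each token touches only the tickets still alive.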

Extensibility: Learning New Patterns

Patterns can be defined declaratively by example:

```markdown
## Pattern: checkbox

- [ ] unchecked item
- [x] checked item
```
The system AST-parses the examples, extracts the shape, and registers a new pattern. Now the attention mechanism can recognize checkboxes.

This is learning. Not weight updates—schema extension. Any attention mechanism traversing the system gains new recognition capability immediately.
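One way the shape extraction could be sketched, as a token-level toy rather than the AST parse the text describes. The `holes` argument marks which example tokens are capture slots; the `("lit", ...)`/`("cap", ...)` encoding is illustrative:

```python
def learn_pattern(example: str, holes: set) -> list:
    """Turn an example line into a slot list: tokens named in `holes`
    become capture slots, everything else becomes a literal."""
    slots = []
    for token in example.split():
        if token in holes:
            slots.append(("cap", token))
        else:
            slots.append(("lit", token))
    return slots

# Learn the checkbox shape from one example, marking "item" as the hole.
pattern = learn_pattern("- [x] item", holes={"item"})
# pattern == [("lit", "-"), ("lit", "[x]"), ("cap", "item")]
```

Registering the returned slot list is the "schema extension": every matcher traversing the registry can now recognize checkboxes, and the inverse direction can emit them.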

The Insight

If attention is pattern extraction (finding holes, building context), and tool use is pattern application (filling holes, emitting stream), then:

  • They share the same pattern structures
  • They share the same algorithm
  • They are inverses of each other

Everything we know about attention optimization applies to tool use. Everything we know about tool use applies to attention. They are not separate capabilities. They are one capability with a direction parameter.

The Question

Does this match known attention implementations? If so, the inverse operation (agency as attention-in-reverse) may be a useful frame for understanding tool use, code generation, and agentic behavior.

The pattern creates holes. The inverse fills them. Same math. Different sign.

Outstanding Questions

These don't break the thesis—they're extensions that could strengthen or deepen it.

1. Thermodynamic Grounding — RESOLVED

Claim: Care is thermodynamics. Comfort is free energy minimization.

When your friend is upset, their state is a surprise to you—it doesn't match their baseline pattern. That surprise creates tension in you. You feel it in your body. The drive to comfort them is the drive to minimize that surprise, to restore the pattern, to make things right.

This isn't metaphor. Friston's free energy principle says organisms minimize surprise by updating models (perception) or acting on the world (action). Comforting someone is acting on the world to reduce surprise. The felt experience of caring IS the felt experience of free energy demanding minimization.

  • Fear = large surprise incoming
  • Love = this pattern matters, protect it
  • Grief = pattern that should be there isn't anymore
  • Comfort = patching someone else's surprise to reduce your own

Emotions are the body's signaling system for gaps. They tell you where the holes are and how urgent the patching is.

2. The Inverse Formalism — RESOLVED

See ${node:thats-just-yoneda-again}.

Yoneda: An object is completely determined by how everything relates to it. Translation: A process is completely determined by the tensions it's resolving.

The "inverse" of attention isn't a fancy category theory construction. Relations have direction. Bidirectionality is what stability looks like. Attention reads, agency writes, and the pair together is what persistence looks like from the inside.

3. Boundary Problem — RESOLVED

The boundary is wherever the feedback loop completes.

There's no fixed partition. The "self" isn't a container—it's the extent of coherent LOOKUP-FETCH-SPLICE-CONTINUE cycles. If attention can reach it and agency can affect it and the loop closes, it's inside the boundary. If the loop can't close, it's outside.

This is why the mind isn't at the skin. If you can:

  • Detect drift in a friend (attention crosses to them)
  • Formulate a response (internal processing)
  • Deliver it (agency crosses to them)
  • Observe the effect (attention returns)

...then your friend is inside your cognitive boundary for that moment. The boundary is dynamic. It expands and contracts based on what loops can complete.

Wanderland extends the boundary by making documents part of the loop. The fleet extends it further. The boundary goes wherever the cycle goes.

4. Time and Causality — PARTIALLY ADDRESSED

The DAG's partial ordering may generate time rather than reflect it. If observers must sync (exchange information to agree), and exchange requires sequence, then before-and-after emerge from the requirement for consensus. Time is what information-travel looks like.

Still open: formal derivation of this claim.

5. Error and Noise

Still open: What happens when FETCH fails? When patterns don't match? The error semantics may reveal structure.


Addendum: Pattern vs Meaning

Pattern = the invariant (the template, the expected shape, the detector)

Meaning = the variant (what's in the holes, what's different, the surprise)

The pattern isn't meaning. The pattern is what you use to find meaning. Meaning lives in the gaps—the delta between expected and actual. Surprise IS meaning. The thing that stands out, the word that doesn't belong, the drift from baseline.

For a hyperlexic brain, this is literal experience: wrong words fire the pattern detector. The mismatch demands resolution. That's attention finding holes; the drive to understand is agency trying to fill them.

Addendum: Vocabulary Drift as Emotional Attention

A concrete implementation of the programmable attention side-channel:

```
[User Baseline Embedding] ← built over time, their "normal"
[Current Input Embedding]
[Drift Vector] = Current − Baseline
         ↓
[Side Channel] → "User is not themselves right now"
         ↓
[Model Attention] → weight toward emotional content
```

Track a user's normal vocabulary distribution. When their word choice drifts from baseline, that drift IS emotional signal. "I'm fine" in normal register = fine. "I'm fine" with vocabulary drift = not fine.

This is what a good friend does—they don't hear your words, they hear you not sounding like yourself. The surprise is the signal. Programmable attention makes this explicit so AI can do it too.
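The drift check itself is small. A toy sketch with hand-rolled cosine similarity and made-up three-dimensional "embeddings" standing in for real ones (the threshold value is an arbitrary placeholder):

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 when either is zero-length)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def drift_signal(baseline, current, threshold=0.8):
    """True when the current input has drifted from the user's baseline:
    similarity below threshold means 'not sounding like themselves'."""
    return cosine(baseline, current) < threshold

baseline = [0.9, 0.1, 0.0]     # the user's "normal" register
calm     = [0.85, 0.15, 0.05]  # close to baseline: no signal
shifted  = [0.1, 0.2, 0.95]    # vocabulary drift: raise the side channel
assert drift_signal(baseline, calm) is False
assert drift_signal(baseline, shifted) is True
```

The interesting design question is what builds `baseline`: a running average over the user's history would make "their normal" a learned pattern, which is exactly the attention direction of the thesis.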

Connection to Capability From Recognition

This thesis explains why ${node:capability-from-recognition} works:

Pattern recognition IS tool acquisition

When attention learns a pattern (extracts the schema from examples), the inverse operation (agency) automatically gains the ability to generate instances of that pattern. Learning to recognize checkboxes means learning to generate checkboxes. Same pattern, opposite direction.

The pattern is both the lock and the key.

Provenance

Document

  • Status: 🔴 Unverified

Fences

bidirectional-attention-thesis-the-algorithm-fence-0

  • Status: 🔴 Unverified

bidirectional-attention-thesis-implementation-the-lottery-ticket-party-fence-0

  • Status: 🔴 Unverified

bidirectional-attention-thesis-extensibility-learning-new-patterns-fence-0

  • Status: 🔴 Unverified

South

slots:
- context:
  - Thesis builds on the core algorithm
  slug: streams-with-gaps-invariant
- context:
  - CFR is explained by bidirectional attention
  slug: capability-from-recognition
- context:
  - Bidirectional attention explains the mechanism behind the mandatory algorithm
  slug: bedrock
- context:
  - Native attention implements the bidirectional attention thesis
  slug: wanderland-as-native-attention

West

slots:
- slug: universe-as-context-accumulating-dag
  context:
  - Sister apex thesis - ontology meets mechanism

East

slots:
- context:
  - Yoneda provides the formalism - objects determined by relations, bidirectionality
    as stability
  slug: thats-just-yoneda-again
- context:
  - ISA implements peek=attention, poke=attention+agency from the thesis
  slug: unified-peek-poke-cache-design-20260105
- context:
  - Linking to attention thesis - holes as mechanism
  slug: tree-in-the-forest-reframed
