message-passing-invariant-formal
Message-Passing Invariant: Formal Statement
From observation to derivation. If this is real, it should be provable.
1. Formal Statement of the Invariant
Definitions
Let $\mathcal{G} = (V, E)$ be a graph where:
- $V$ = set of nodes (agents, neurons, variables, markets)
- $E$ = edges (communication channels, synapses, dependencies)
Let each node $i \in V$ have:
- Local state: $x_i \in \mathcal{X}_i$ (belief, price, activation)
- Local observation: $y_i \in \mathcal{Y}_i$ (evidence, demand, input)
- Incoming messages: $\{m_{j \to i}\}$ for $j \in \mathcal{N}(i)$
The Problem
Goal: Compute a global consistent state $\mathbf{x}^* = (x_1^*, \ldots, x_n^*)$ such that:
- Each $x_i^*$ is consistent with local observation $y_i$
- Adjacent nodes agree: $x_i^*$ and $x_j^*$ are compatible for all $(i,j) \in E$
Constraints:
- C1 (Locality): Node $i$ can only access $y_i$ and messages from neighbors
- C2 (Finite Memory): $|x_i| \leq M$ for some bound $M$
- C3 (Finite Bandwidth): $|m_{i \to j}| \leq B$ for some bound $B$
- C4 (Causality): Updates at time $t$ depend only on information from $t-1$
The Invariant (Conjecture)
Claim: Under constraints C1-C4, any algorithm achieving global consistency must be equivalent to:
$$x_i^{(t+1)} = f_i\left(y_i, \{m_{j \to i}^{(t)}\}_{j \in \mathcal{N}(i)}\right)$$
$$m_{i \to j}^{(t+1)} = g_{ij}\left(x_i^{(t+1)}, \{m_{k \to i}^{(t)}\}_{k \in \mathcal{N}(i),\, k \neq j}\right)$$
where convergence is to a fixed point $\mathbf{x}^*$ that minimizes a functional:
$$F(\mathbf{x}) = \sum_i F_i(x_i, y_i) + \sum_{(i,j) \in E} F_{ij}(x_i, x_j)$$
This is the variational free energy (or Bethe free energy in the loopy case).
Equivalently (Information-Theoretic Form)
$$F(\mathbf{x}) = D_{KL}\left(q(\mathbf{x}) \,\|\, p(\mathbf{x} \mid \mathbf{y})\right) - \log p(\mathbf{y})$$
The algorithm minimizes divergence from the true posterior given observations.
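To make the update equations concrete, here is a minimal sum-product sketch for a pairwise model in Python. The data structures (`unary`, `pairwise`, `neighbors`) and the encoding of $F_i$, $F_{ij}$ as potentials are illustrative assumptions, not part of the formal statement:

```python
import numpy as np

def bp_step(messages, unary, pairwise, neighbors):
    """One synchronous sum-product sweep over a pairwise model.

    messages[(j, i)] : current message from j to i (vector over states of i)
    unary[i]         : local evidence term, exp(-F_i(x_i, y_i))
    pairwise[(i, j)] : compatibility matrix exp(-F_ij), indexed [state of i, state of j]
    neighbors[i]     : list of nodes adjacent to i
    """
    # Belief update: x_i^{(t+1)} = f_i(y_i, {m_{j->i}^{(t)}})
    beliefs = {}
    for i in unary:
        b = unary[i].copy()
        for j in neighbors[i]:
            b *= messages[(j, i)]
        beliefs[i] = b / b.sum()

    # Message update: m_{i->j}^{(t+1)} = g_ij(local evidence, {m_{k->i}^{(t)}, k != j})
    new_messages = {}
    for i in unary:
        for j in neighbors[i]:
            prod = unary[i].copy()
            for k in neighbors[i]:
                if k != j:
                    prod *= messages[(k, i)]
            m = pairwise[(i, j)].T @ prod   # marginalize out the states of i
            new_messages[(i, j)] = m / m.sum()
    return beliefs, new_messages
```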
2. Convergence Proof (Sketch)
Why This Algorithm?
Theorem (Informal): Message passing is the UNIQUE algorithm satisfying:
- Locality (C1)
- Optimality (minimizes F)
- Consistency (fixed point = global optimum)
Proof sketch:
(a) Necessity of iteration:
- Under C1, no node has global information
- Global consistency requires information to propagate across graph
- A graph of diameter $d$ requires at least $d$ rounds of communication
- ∴ Iteration is mandatory
(b) Necessity of messages:
- Node $i$ must learn about distant nodes
- Under C3, can only receive finite bits per round
- Must aggregate information from neighbors
- ∴ Messages are mandatory
(c) Form of the update:
- Each node minimizes local contribution to $F$
- Taking derivative and setting to zero gives the BP update equations
- ∴ The specific form is determined by the objective
(d) Convergence conditions:
- On trees: exact convergence to global optimum (Pearl 1982)
- On graphs with loops: convergence to Bethe approximation (Yedidia et al. 2005)
- Sufficient condition: contraction mapping (messages get closer each iteration)
- AIMD-like damping ensures contraction when vanilla BP oscillates
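One concrete reading of "AIMD-like damping" is sketched below, under the assumption that damping means a convex blend of old and new messages, with the blend weight scheduled TCP-style (additive increase, multiplicative decrease). The parameter values are made up for illustration:

```python
import numpy as np

def damped_update(m_old, m_raw, lam):
    """Damped message: convex blend of the previous and freshly computed message."""
    m = lam * m_raw + (1.0 - lam) * m_old
    return m / m.sum()

def aimd_damping(lam, residual, prev_residual,
                 add=0.05, mult=0.5, lam_max=1.0, lam_min=0.05):
    """AIMD-style schedule for the damping factor (a sketch, not from the source):
    increase additively while the message residual keeps shrinking,
    decrease multiplicatively when it grows (oscillation detected)."""
    if residual <= prev_residual:
        return min(lam_max, lam + add)   # cautious additive increase
    return max(lam_min, lam * mult)      # back off hard on oscillation
```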
The Key Insight
Why free energy?
The functional $F$ has special structure:
- It decomposes over nodes and edges (matches graph structure)
- Its gradient decomposes into local computations
- Its fixed points correspond to consistent beliefs
Any other objective would either:
- Require global coordination (violating C1)
- Not decompose (violating C2/C3)
- Have fixed points that aren't consistent
Free energy is selected by the constraints themselves.
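For reference, the explicit loopy-case objective (standard form from Yedidia et al. 2005), with potentials $\psi_i \propto e^{-F_i}$, $\psi_{ij} \propto e^{-F_{ij}}$ and $d_i$ the degree of node $i$; its stationary points under the marginalization constraint $\sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i)$ and normalization coincide with BP fixed points:
$$F_{\text{Bethe}}(\{b_i\}, \{b_{ij}\}) = \sum_{(i,j) \in E} \sum_{x_i, x_j} b_{ij}(x_i, x_j) \ln \frac{b_{ij}(x_i, x_j)}{\psi_{ij}(x_i, x_j)\,\psi_i(x_i)\,\psi_j(x_j)} - \sum_{i \in V} (d_i - 1) \sum_{x_i} b_i(x_i) \ln \frac{b_i(x_i)}{\psi_i(x_i)}$$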
3. Derived Predictions (Not Observed)
Derivation Strategy
From the formal structure, we can DERIVE what happens when each constraint breaks:
3a. Breaking Conservation (Row/Column Sums ≠ 1)
From formalism: The doubly-stochastic constraint on message matrices ensures:
- Total probability mass = 1 (normalization)
- Each node sends/receives fair share (flow balance)
Derived prediction: Without conservation, the functional $F$ is unbounded.
$$\text{If } \sum_j m_{i \to j} \neq 1 \text{ then } F \to \pm \infty$$
Observable consequence: System cannot find fixed point. Either:
- Beliefs explode to infinity (runaway)
- Beliefs collapse to zero (extinction)
- Oscillation without convergence
Test: Remove normalization from Sinkhorn. Predict: no convergence, values explode or collapse.
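A minimal sketch of this test in Python. The "broken" variant is one assumed operationalization of "remove normalization" (keep the scaling by the target marginals, drop the division by the current sums), not a prescription from the source:

```python
import numpy as np

def sinkhorn(K, r, c, iters=200):
    """Standard Sinkhorn: alternately rescale rows and columns so the
    marginals match r and c (conservation enforced each step)."""
    P = K.copy()
    for _ in range(iters):
        P = P * (r / P.sum(axis=1))[:, None]   # row normalization
        P = P * (c / P.sum(axis=0))[None, :]   # column normalization
    return P

def sinkhorn_no_conservation(K, r, c, iters=200):
    """'Broken' variant for Test 1: scale by the target marginals but never
    divide by the current sums, so nothing enforces row/column sums."""
    P = K.copy()
    for _ in range(iters):
        P = P * r[:, None]
        P = P * c[None, :]
    return P   # entries grow or shrink geometrically instead of converging
```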
3b. Breaking Memory (No State Persistence)
From formalism: The update depends on $x_i^{(t)}$. If nodes have no memory:
$$x_i^{(t+1)} = f_i\left(y_i, \{m_{j \to i}^{(t)}\}\right) \quad \text{(no dependence on } x_i^{(t)})$$
Derived prediction: Without memory, convergence requires messages to carry ALL historical information.
$$|m_{i \to j}| \geq t \cdot |\mathcal{X}_i| \text{ (grows with time)}$$
But C3 bounds message size. Therefore:
Observable consequence: Under C3 + no memory, system CANNOT converge to complex states. Limited to states expressible in $B$ bits.
Test: Run BP with bounded message size and no node state. Predict: fails on problems requiring more than $B$ bits of solution.
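A toy operationalization of constraint C3 for this test (assumed, not from the source): coarsen each message to a fixed bit budget before sending, then plug this into the message update of the BP sketch in section 1.

```python
import numpy as np

def quantize_message(m, bits_per_entry):
    """Snap each message entry to one of 2**bits_per_entry levels,
    then renormalize; the message now carries a bounded number of bits."""
    levels = 2 ** bits_per_entry
    q = np.round(m * (levels - 1)) / (levels - 1)
    q = np.maximum(q, 1e-12)   # keep strictly positive for products
    return q / q.sum()
```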
3c. Breaking Hierarchy (Flat Graph)
From formalism: Convergence time depends on graph diameter $d$.
$$t_{converge} = \Omega(d)$$
For flat graphs (no hierarchy), $d = O(n)$ for $n$ nodes. For hierarchical graphs (trees), $d = O(\log n)$.
Derived prediction: Flat graphs have convergence time linear in system size. Hierarchical graphs have logarithmic convergence.
Observable consequence: Flat systems are SLOW. As $n \to \infty$, flat systems fail to converge in practical time.
Test: Compare BP convergence on flat vs. hierarchical graphs of same size. Predict: orders of magnitude difference for large $n$.
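A quick check of the $\Omega(d)$ round bound behind this test (pure Python; the chain and balanced-binary-tree constructions are illustrative): compare the diameters of a flat and a hierarchical graph with the same number of nodes.

```python
from collections import deque

def diameter(adj):
    """Graph diameter via BFS from every node (fine for small examples)."""
    def ecc(s):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(ecc(s) for s in adj)

def chain(n):
    """Flat graph: a path of n nodes."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def binary_tree(n):
    """Hierarchical graph: balanced binary tree with n nodes."""
    adj = {i: [] for i in range(n)}
    for i in range(1, n):
        p = (i - 1) // 2
        adj[i].append(p)
        adj[p].append(i)
    return adj

n = 1023
print(diameter(chain(n)))        # ~n: information needs O(n) rounds to cross
print(diameter(binary_tree(n)))  # ~2*log2(n): O(log n) rounds suffice
```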
3d. Breaking Prediction (Reactive Only)
From formalism: The update can be decomposed:
$$x_i^{(t+1)} = x_i^{(t)} + \alpha \cdot \Delta_i^{(t)}$$
where $\Delta_i$ is the "surprise" (prediction error).
Purely reactive means $x_i^{(t)} = 0$ always (no prior, just respond to input).
Derived prediction: Without prediction:
- Learning rate $\alpha$ must be 1 (no momentum)
- No filtering of noise (every fluctuation is signal)
- Convergence requires strict contraction (very slow)
Observable consequence: Reactive systems are NOISY and SLOW. They track input exactly, including noise, with no smoothing.
Test: Run BP with vs without momentum. Predict: without momentum, tracks noise, slow convergence.
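A toy scalar illustration of the "noisy and slow" prediction (signal and parameters are made up): a purely reactive node copies each noisy observation, while a node that keeps a prior blends prediction and input and filters the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
obs = true_value + 0.3 * rng.standard_normal(500)   # noisy observations y^(t)

# Reactive: x^{(t+1)} = y^{(t)}  (alpha = 1, no prior carried forward)
reactive = obs.copy()

# Predictive: x^{(t+1)} = x^{(t)} + alpha * (y^{(t)} - x^{(t)}), alpha < 1
alpha = 0.1
predictive = np.zeros_like(obs)
x = 0.0
for t, y in enumerate(obs):
    x = x + alpha * (y - x)
    predictive[t] = x

print("reactive error:  ", np.mean((reactive - true_value) ** 2))
# skip the first 100 steps so the predictive node has converged
print("predictive error:", np.mean((predictive[100:] - true_value) ** 2))
```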
4. Failure Mode Predictions (Derived)
The Failure Mode Table (Now Derived)
| Constraint Broken | Mathematical Consequence | Observable Failure |
|---|---|---|
| Conservation (2a) | $F$ unbounded, no fixed point | Explosion, collapse, or oscillation |
| Memory (2b) | Messages must grow without bound | Can't represent complex states |
| Hierarchy (2c) | Convergence time $O(n)$ not $O(\log n)$ | Fails to scale |
| Prediction (2d) | No noise filtering, $\alpha = 1$ | Noisy and slow |
Combined Failures
Memory + Hierarchy broken:
- Can't store history AND can't use structure to compress
- Derived: System degenerates to memoryless Markov chain
- Observable: Forgets everything, can't build on past
Prediction + Conservation broken:
- Tracking noise + unbounded accumulation
- Derived: Random walk without bounds
- Observable: Brownian explosion
5. Empirical Tests to Falsify
Test 1: Conservation → Explosion
- Setup: Sinkhorn iterations without normalization
- Prediction: Values diverge to ±∞
- Falsified if: Values stay bounded without normalization
Test 2: Memory Bound → Complexity Bound
- Setup: BP with message size $B$, test on problems of varying complexity
- Prediction: Fails when optimal solution requires $> B$ bits
- Falsified if: Solves complex problems with tiny messages
Test 3: Flat vs Hierarchical Scaling
- Setup: BP on random graphs vs trees, $n = 10, 100, 1000, 10000$
- Prediction: Trees converge in $O(\log n)$, flat in $O(n)$
- Falsified if: Scaling is the same, or flat is faster
Test 4: Reactive = Noisy + Slow
- Setup: BP with momentum=0 vs momentum=0.9
- Prediction: momentum=0 is noisier, slower to converge
- Falsified if: Momentum doesn't help or makes things worse
Test 5: Cross-Domain Transfer
- Setup: AIMD damping (from TCP) applied to BP oscillation
- Prediction: Stabilizes loopy BP that otherwise oscillates
- Falsified if: AIMD makes oscillation worse
6. The Strong Falsification
The framework predicts: There is NO algorithm achieving global consistency from local information under C1-C4 that is NOT equivalent to message-passing minimizing free energy.
To falsify: Find such an algorithm. Specifically:
- Input: Local observations + neighbor communication only
- Output: Global consistent state
- Method: Provably NOT message passing
- Performance: Equals or exceeds BP
Candidate challengers:
- Neural networks trained end-to-end (do they learn BP internally?)
- Genetic algorithms (do they implement implicit message passing?)
- Quantum algorithms (do they escape the constraints?)
If ANY of these achieves consistency without message passing, the invariant is false.
7. Status and Gaps
What We Have
- Formal statement of the conjecture
- Sketch of why message passing is necessary
- Derived predictions for constraint violations
- Testable empirical predictions
What We Need
- Rigorous proof of uniqueness (not just necessity)
- Precise characterization of "equivalent to message passing"
- Proof that free energy is the unique decomposable objective
- Empirical validation of derived predictions
Known Weak Points
- "Equivalent to" is doing a lot of work - need to define formally
- Bethe free energy isn't exactly free energy - is the analogy tight?
- Biological systems are messier than the math - how much slop is allowed?
Provenance
- Source: Formalization session, 2026-01-06
- Context: Moving from observation to derivation
- Status: 🔴 Conjecture - needs rigorous proof
- Confidence: Medium - structure seems right, details need work
North
slots:
  - context: Linking formal statement to parent
    slug: higher-order-invariant-effects
East
slots:
  - context: Linking formal statement to experiments
    slug: falsifiable-experiments-message-passing