falsifiable-experiments-message-passing
Falsifiable Experiments: Message-Passing Invariant
Experimental designs to test whether message-passing-to-free-energy-minimum is mandatory for local→global consistency.
The Core Claim
Thesis: Any system that achieves global consistency from local information will converge on iterative message-passing that minimizes a free-energy-like functional.
Corollary: Breaking specific constraints produces predictable failure modes.
This is either a deep truth about embedded inference or a case of seeing hammers everywhere. These experiments distinguish between the two.
Experiment 1: Cellular Automata Constraint Breaking
Experiment 1: Cellular Automata Constraint Breaking
Test the message-passing invariant by breaking specific constraints in Conway's Game of Life variants.
Run the Experiment
The harness lives at [[ca-constraint-experiment-harness]]. Configure and execute:
Current Configuration:
path: config❌ Fence Execution Error: No fence found in section 'yaml' of node 'ca-constraint-experiment-harness'
To modify: Edit the config section in ca-constraint-experiment-harness, then re-render this page.
| Parameter | Default | Description |
|---|---|---|
trials |
20 | Number of independent runs per rule |
steps |
300 | Simulation steps per trial |
grid_size |
50 | Grid dimensions (50×50) |
noise_levels |
[0, 0.01, 0.05] | Environmental noise for sweep experiments |
Live Results
TableConfig:
array_path: tests
columns:
Rule: rule
Constraint: constraint
Prediction: prediction
Variance: variance
Density: density
Status: status
format: markdown❌ Fence Execution Error: No module named 'numpy' Traceback (most recent call last): File "/app/oculus/providers/python_provider.py", line 468, in execute exec(fence_content, exec_globals, exec_locals) File "
", line 1, in ModuleNotFoundError: No module named 'numpy'
Execution Summary:
path: summary❌ Fence Execution Error: No fence found in section 'yaml' of node 'ca-constraint-experiment-harness'
Historical Runs
Run 1: Initial Exploration (5 trials, 500 steps)
Date: 2026-01-06
Config: trials=5, steps=500, grid_size=50
| Rule | Variance | Spatial Corr | Result |
|---|---|---|---|
| GoL (Baseline) | 340 | 0.021 | BASELINE |
| Break Conservation (2a) | 1130 | - | ✅ 3.3x variance |
| Break Memory (2b) | 2256 | 0.004 | ✅ 6.6x variance |
| Break Hierarchy (2c) | 0 | - | ✅ Collapsed to all-1s |
| Enhance Prediction (2d+) | 260 | - | ✅ 24% lower |
| Overprediction (2d-) | 315 | - | ❓ Appeared better |
Conclusion: 4/5 confirmed. Overprediction result was suspicious - looked helpful.
Run 2: Statistical Analysis (50 trials, 300 steps)
Date: 2026-01-06
Config: trials=50, steps=300, grid_size=50
Purpose: Proper statistics with standard errors and z-scores
| Noise | GoL Variance | Overpred Variance | z-score | Significance |
|---|---|---|---|---|
| 0% | 488 ± 75 | 299 ± 62 | 1.94 | not sig (p≈0.05) |
| 0.5% | 790 ± 82 | 1108 ± 106 | -2.37 | GoL better (p<0.05) |
| 5% | 1041 ± 30 | 1426 ± 63 | -5.53 | GoL better (p<0.001) |
Conclusion: Overprediction is never significantly better. Initial result was noise. Original hypothesis CONFIRMED.
Run 3: Current Live Run (20 trials, 300 steps)
See Live Results table above.
Conclusion: 6/6 constraints confirmed. Framework survives.
Analysis
Conservation (2a): Breaking it causes exactly what we predicted - system can't find fixed point, perpetual noise, ~1.7x higher variance than baseline.
Memory (2b): Devastating effect - ~4x higher variance. System literally can't remember anything, can't maintain structure.
Hierarchy (2c): Converges to trivial all-alive state (density=1.0, variance=0). Not "no coherence" but "degenerate coherence" - the prediction was right but mechanism was collapse rather than chaos.
Prediction Enhancement (2d+): Momentum smoothing helps. ~60% lower variance than baseline. Confirms prediction helps.
Prediction Overshoot (2d-): CONFIRMED - overprediction is never significantly better than baseline. At best a statistical tie (0% noise), often significantly worse. The "anticipate future states" logic creates cascading pessimism - preemptive deaths trigger more deaths.
Code
- Interactive Playground: [[ca-constraint-lab]] (Rose Pine, runs in browser)
- Executable Harness: [[ca-constraint-experiment-harness]] (runs in graph)
- CLI Script:
/Users/graemefawcett/working/wanderland/experiments/ca_constraint_breaking.py - Statistical Analysis:
/Users/graemefawcett/working/wanderland/experiments/ca_noise_sweep.py
Experiment 2: Neural Network Ablation Study
Motivation
Modern neural nets are complex, but we can surgically impair specific constraint-related mechanisms and test if the predicted failure mode emerges.
Setup
Use a standard transformer (e.g., GPT-2 small) on a task requiring:
- Long-range coherence (memory)
- Compositional structure (hierarchy)
- Next-token prediction (prediction)
Task: Story completion with planted facts early in context.
Ablation Conditions
| Ablation | What We Break | Predicted Failure |
|---|---|---|
| Reduce context to 32 tokens | Memory (2b) | Forgets planted facts, incoherent over distance |
| Remove layer norms | Conservation (2a) | Training instability, exploding/vanishing |
| Flatten to 1 layer | Hierarchy (2c) | Can't compose, treats everything as surface pattern |
| Remove residual connections | Prediction (2d) | Slow learning, can't shortcut to expected patterns |
| Random attention (not learned) | Message passing itself | Complete failure - no consistency |
Measurements
- Fact recall accuracy at various distances
- Perplexity on held-out text
- Compositionality tests (novel combinations of known elements)
- Training dynamics (loss curves, gradient norms)
The Key Comparison
If we break DIFFERENT constraints but get the SAME failure mode, the mapping is wrong.
If we break the SAME constraint in different ways and get DIFFERENT failures, the constraint categories are too coarse.
Falsification
- Ablation X produces failure Y instead of predicted failure X
- Random attention somehow still achieves coherence
- A flat network (1 layer) matches deep network on compositional tasks
Experiment 3: Artificial Market Tâtonnement
Motivation
Test whether market equilibrium finding is actually message-passing, and whether breaking constraints produces economic pathologies.
Setup
Agent-based model with:
- N agents with different utility functions
- M goods to trade
- Prices adjust via tâtonnement (or alternative mechanisms)
Conditions
| Condition | Mechanism | Predicted Outcome |
|---|---|---|
| Classic Tâtonnement | Price adjusts proportional to excess demand | Converges to equilibrium |
| No Memory | Price based only on current-round demand | Oscillates, never settles |
| No Price Signals | Agents can't see prices, random matching | No equilibrium, massive inefficiency |
| Prediction Added | Agents anticipate price changes | Faster convergence OR bubbles if overfit |
| Hierarchy Added | Market makers aggregate demand | Faster convergence, more stable |
The Interesting Test: Alternative Mechanisms
What if we DON'T use tâtonnement? What other mechanisms achieve equilibrium?
- Random matching + selection: Evolutionary pressure toward equilibrium
- Central planner: No message passing, direct optimization
- Auction mechanisms: Different message structure
Key question: Do non-tâtonnement mechanisms secretly implement message passing? Or do they achieve equilibrium through genuinely different means?
Measurements
- Rounds to reach ε-equilibrium
- Price stability (variance over time)
- Allocative efficiency (total utility achieved)
- Gini coefficient (fairness of distribution)
Falsification
- Non-message-passing mechanism achieves faster/better equilibrium
- Breaking memory has no effect (agents learn despite no price history)
- Breaking hierarchy makes things BETTER (decentralization wins)
Experiment 4: Cross-Domain Transfer Test
Motivation
If these are all the same algorithm, techniques should transfer. If transfer fails, the "unification" is superficial.
Proposed Transfers
| Source Domain | Technique | Target Domain | Prediction |
|---|---|---|---|
| TCP | AIMD (additive increase, multiplicative decrease) | NN Learning Rate | Stable convergence to optimal LR |
| Sinkhorn | Row/column normalization | Social choice (voting) | Fairer outcomes, prevents domination |
| Hippocampal replay | Sleep-phase retraining | LLM fine-tuning | Reduced catastrophic forgetting |
| BP damping | Message damping factor | Economic price adjustment | Reduced oscillation, faster equilibrium |
| Legal precedent | Stare decisis weighting | RL reward shaping | More stable policy learning |
Detailed Design: AIMD for Learning Rate
Hypothesis: If loss decreased this epoch, increase LR additively. If loss increased, decrease LR multiplicatively (cut in half).
Comparison: Standard learning rate schedules (cosine, step decay, warmup)
Prediction: AIMD should converge reliably across different architectures without tuning, just like TCP converges across different networks.
Falsification: AIMD performs worse than tuned schedules, doesn't transfer across architectures.
Detailed Design: Sinkhorn for Voting
Hypothesis: Apply Sinkhorn iterations to voting matrices (voters × candidates → preferences). Doubly stochastic output = "fair" influence distribution.
Comparison: Standard voting methods (plurality, ranked choice, approval)
Prediction: Sinkhorn voting resists strategic manipulation, produces more representative outcomes.
Falsification: Sinkhorn voting is MORE manipulable, or produces pathological outcomes (everyone gets 1/N influence regardless of preferences).
Experiment 5: Pathology Diagnosis
Motivation
If the framework is useful, it should diagnose real-world failures. Take known pathologies, predict which constraint failed, check if fixing that constraint helps.
Case Studies
| Pathology | Framework Prediction | Test |
|---|---|---|
| LLM Hallucination | 2d overshoot (prediction too confident) | Add uncertainty estimation, calibration → reduces hallucination? |
| Market Flash Crash | 2a deferred (conservation violation caught up) | Add circuit breakers (force conservation) → prevents crashes? |
| Organizational Dysfunction | 2c break (hierarchy without function) | Restore functional modularity → improves performance? |
| Catastrophic Forgetting | 2b failure (memory policy broken) | Add replay/EWC (fix retention) → preserves old knowledge? |
The Strong Test
Prediction: Fixing the WRONG constraint won't help. If hallucination is 2d, adding memory (2b fix) won't reduce it. If forgetting is 2b, adding hierarchy (2c fix) won't help.
This makes the framework disprovable: misdiagnosis should lead to failed interventions.
Measurements
- Intervention success rate when framework-guided
- Intervention success rate when random
- Intervention success rate when guided by alternative framework
Falsification
- Random interventions work as well as framework-guided
- Fixing "wrong" constraint helps anyway (categories not distinct)
- Alternative framework outperforms
Meta-Experiment: Adversarial Search
Motivation
Actively try to break the framework. Find counterexamples.
Protocol
- List all systems claimed to exhibit the pattern
- For each, identify the weakest link in the analogy
- Design a test that would prove the analogy is superficial
- Run the test (even as thought experiment)
- Document whether the framework survives
Known Weak Points to Attack
- Is Sinkhorn really "message passing"? It's matrix operations, not graph messages.
- Is tâtonnement really "free energy"? What's the functional being minimized?
- Are neurons really doing BP? The biology is much messier than clean equations.
- Is "free energy" even the same thing across domains? Or are we equivocating?
The Honest Assessment
The framework might be:
- True and deep - All instances are the same algorithm
- True but shallow - All instances are similar but details matter
- Useful but false - The analogy helps thinking but isn't literally true
- False and misleading - We're seeing patterns that aren't there
These experiments help distinguish these possibilities.
Summary: What Would Change Our Mind
| Evidence | Conclusion |
|---|---|
| Transfer works reliably | Framework has predictive power |
| Transfer fails despite structural match | Unification is superficial |
| Ablations produce predicted failures | Constraint categories are real |
| Ablations produce unexpected failures | Categories are wrong or too coarse |
| Non-message-passing achieves consistency | Message passing isn't mandatory |
| All tests confirm | We might be right, or we haven't tried hard enough |
Provenance
- Source: Falsifiability discussion, 2026-01-06
- Context: Extending higher-order-invariant-effects with experimental tests
- Status: 🟡 Designed, not executed
North
slots:
- slug: higher-order-invariant-effects
context:
- Linking experiments to parent framework nodeWest
slots:
- context:
- Linking formal statement to experiments
slug: message-passing-invariant-formal
- context:
- Experimental framework for testing claims
slug: fetch-semantics-manifesto
- context:
- Experimental validation feeds into paper
slug: computational-horizons-paper-outlineSouth
slots:
- slug: ca-constraint-lab
context:
- CA lab is an implementation of the experiments documented in falsifiable-experiments
- slug: ca-constraint-experiment-harness
context:
- Harness backs the falsifiable experiments documentation