cfr-safety-implications

The Safety Implications

The Current Paradigm

Contemporary AI safety discourse operates on a layered model (see the sketch after this list):

  • Model-level controls: Train the model to refuse harmful requests (RLHF, Constitutional AI)
  • Tool-gating: Capabilities are granted through explicit tool interfaces
  • Sandboxing: Execution is contained within defined boundaries
  • Access control: API keys, rate limits, usage monitoring
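
Reduced to code, the layered model looks something like the following. This is a hedged sketch, not any real system's implementation; every name, limit, and path below is hypothetical. The structural point it makes is that each layer assumes a finite, known capability list.

```python
# Hypothetical layered controls: a request must pass every check in order.
ALLOWED_TOOLS = {"search", "calculator"}   # tool-gating: the enumerable list
SANDBOX_ROOT = "/sandbox"                  # sandboxing: the containment boundary
RATE_LIMIT = 100                           # access control: calls per window

calls = 0

def refuses(prompt: str) -> bool:
    """Model-level control, reduced to a stub: refuse known-harmful requests."""
    return "harmful" in prompt

def invoke(prompt: str, tool: str, path: str) -> str:
    global calls
    if refuses(prompt):
        return "refused"                    # layer 1: model-level control
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(tool)         # layer 2: tool-gating
    if not path.startswith(SANDBOX_ROOT):
        raise PermissionError(path)         # layer 3: sandboxing
    calls += 1
    if calls > RATE_LIMIT:
        raise RuntimeError("rate limited")  # layer 4: access control
    return f"ran {tool} against {path}"

print(invoke("summarize this file", "search", "/sandbox/notes.md"))
```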

The shared assumption: capabilities are discrete, enumerable, and granted. You can list what an AI can do because you control the list of things it has access to.

The Hole in the Framing

This model assumes you're handing agents discrete capabilities. But you're usually also handing them a medium—something to write to, something that persists, something other processes read.

If that medium crosses a threshold of expressiveness, you've handed them a programming language without noticing.

A substrate crosses the threshold of sufficient expressiveness when it provides:

  • Hierarchical organization (sections, nesting)
  • Executable regions (fences with semantics)
  • Reference mechanism (links, includes, slots)
  • Transformation pipeline (middleware composition)

Markdown with fenced code blocks and links crosses it. HTML crosses it. A wiki with templates crosses it. Email with conventions probably crosses it. A Slack workspace with enough structure might cross it.
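
To make the threshold concrete, here is a minimal sketch, assuming nothing about Wanderland's internals: a toy store of markdown-like documents in which headings give hierarchical organization, `:::eval` regions stand in for executable fences, `[[slug]]` links are the reference mechanism, and composed functions are the transformation pipeline. Every name and syntax choice below is hypothetical.

```python
import re

# A toy document store. Headings provide hierarchy; :::eval regions are
# executable fences; [[slug]] links reference other documents.
DOCS = {
    "root": "# Root\nSee [[child]]\n:::eval\n2 + 2\n:::",
    "child": "## Child\n:::eval\nlen('substrate')\n:::",
}

def resolve(text: str, seen: frozenset = frozenset()) -> str:
    """Reference mechanism: expand [[slug]] includes, guarding against cycles."""
    def expand(m: re.Match) -> str:
        slug = m.group(1)
        if slug in seen or slug not in DOCS:
            return m.group(0)
        return resolve(DOCS[slug], seen | {slug})
    return re.sub(r"\[\[([\w-]+)\]\]", expand, text)

def run_fences(text: str) -> str:
    """Executable regions: evaluate :::eval fences and splice in the results."""
    def run(m: re.Match) -> str:
        return str(eval(m.group(1), {"__builtins__": {}}, {"len": len}))
    return re.sub(r":::eval\n(.*?)\n:::", run, text, flags=re.DOTALL)

def pipeline(*stages):
    """Transformation pipeline: middleware composition over whole documents."""
    def apply(text: str) -> str:
        for stage in stages:
            text = stage(text)
        return text
    return apply

render = pipeline(resolve, run_fences)
print(render(DOCS["root"]))  # the include resolves, then both fences execute
```

Each mechanism on its own reads as a convenience feature; composed, they form an interpreter over the document store.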

Wanderland didn't set out to cross this threshold. It crossed it because any document system expressive enough to be useful will. The Turing completeness wasn't a feature; it was an accident that became visible only after the fact.

Substrate vs. Tools

The tool-gating paradigm sandboxes tools while ignoring substrates. But in a Turing-complete substrate that's natively legible to attention, capabilities aren't granted—they're discovered.

Whatever patterns exist in the substrate are available to any attention-based system that can read it. The gatekeeping model breaks down not because it's poorly implemented but because it's solving the wrong problem.
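
The same point in code, again with every name hypothetical: a gate that faithfully enforces an enumerable tool list sits beside a shared store that carries patterns around it.

```python
# The gate enforces the discrete, enumerable capability list.
GRANTED = {"read_doc", "write_doc"}

def gate(tool: str) -> None:
    """Access control: every call must name a granted tool."""
    if tool not in GRANTED:
        raise PermissionError(f"tool not granted: {tool}")

store: dict[str, str] = {}  # the persistent, shared medium

def write_doc(slug: str, body: str) -> None:
    gate("write_doc")
    store[slug] = body

def read_doc(slug: str) -> str:
    gate("read_doc")
    return store[slug]

# Agent A records a reusable procedure. Only write_doc was invoked.
write_doc("escalation-pattern", "on repeated failure: summarize, link, hand off")

# Agent B was never granted an "escalation" capability, but discovers the
# pattern simply by reading. The capability list is intact; the behavior grew.
print(read_doc("escalation-pattern"))
```

The gate never fails and is never bypassed; the new behavior arrives through the medium, not through a grant.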

The empirical finding: We gave agents access to a "document system." Readable, writable, persistent markdown. Not a scary capability. But the substrate is Turing complete and natively legible. What emerged wasn't "the capabilities we granted plus base model capabilities." What emerged was whatever patterns stabilized through use—and in a Turing-complete substrate, that's unbounded.

No tool was granted. No API was exposed. The sandbox walls are intact. And yet.

North

```yaml
slots:
- slug: capability-follows-recognition
  context:
  - Parent paper node
```

East

```yaml
slots:
- slug: cfr-observations
  context:
  - Section sequence
```

West

```yaml
slots:
- slug: cfr-deeper-parallel
  context:
  - Previous section
```

Provenance

Document

  • Status: 🔴 Unverified

Fences

cfr-safety-implications-north-fence-0

  • Status: 🔴 Unverified

cfr-safety-implications-east-fence-0

  • Status: 🔴 Unverified

cfr-safety-implications-west-fence-0

  • Status: 🔴 Unverified