
wanderland-paper-evaluation

Evaluation

Research Questions

We evaluate Wanderland against three questions:

  • RQ1: Does the structural isomorphism hold operationally? Do compiler/database patterns actually apply?
  • RQ2: How does Wanderland compare to current approaches on key dimensions?
  • RQ3: Is the system viable for production use?

RQ1: Structural Isomorphism

Compiler Mapping

We validated the compiler analogy by implementing each stage:

| Compiler Stage | Wanderland Implementation | Validated |
|---|---|---|
| Lexing | markdown-it-py tokenizer | ✓ Token stream produced |
| Parsing | Section/fence extraction | ✓ AST structure |
| Preprocessing | Variable substitution (L1) | ✓ ${var} expansion |
| Compilation | Include resolution (L2) | ✓ {{include:}} expansion |
| Linking | Fence execution (L3) | ✓ External data fetched |
| Optimization | Middleware (L3.5) | ✓ Transform pipeline |
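The first two rows can be illustrated directly, since markdown-it-py is a public library. The sketch below is illustrative only: the sample node content is invented, and the fence extraction stands in for Wanderland's actual Section/fence extractor.

```python
# Sketch of the Lexing and Parsing stages using markdown-it-py, the
# tokenizer named in the table above. Fence extraction here is a
# simplified illustration, not the real parser.
from markdown_it import MarkdownIt

SOURCE = """\
# Example node

~~~aws-cli
aws ec2 describe-instances --filters Name=tag:env,Values=prod
~~~
"""

tokens = MarkdownIt().parse(SOURCE)   # Lexing: markdown -> token stream

# Parsing: pull executable fences out of the token stream.
fences = [
    {"lang": tok.info.strip(), "body": tok.content}
    for tok in tokens
    if tok.type == "fence"
]
print(fences)  # [{'lang': 'aws-cli', 'body': 'aws ec2 describe-instances ...\n'}]
```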

Cache invalidation behaves as in an incremental compiler: invalidate any level and the render regenerates from source. This is not an analogy; it is the same algorithm.
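A minimal sketch of that invalidation rule follows. The level names match the table above, but the stage functions and cache layout are illustrative assumptions, not Wanderland's implementation.

```python
# Level-based cache invalidation over an assumed linear pipeline L0 -> L3.
LEVELS = ["L0", "L1", "L2", "L3"]

class RenderCache:
    def __init__(self, stages):
        self.stages = stages   # {level: fn(previous_output) -> output}
        self.cache = {}        # {level: cached output}

    def invalidate(self, level):
        # Dropping one level drops it and everything derived from it.
        for lvl in LEVELS[LEVELS.index(level):]:
            self.cache.pop(lvl, None)

    def render(self, source, level):
        # Rebuild any missing level from the one below, starting at source.
        output = source
        for lvl in LEVELS[: LEVELS.index(level) + 1]:
            if lvl not in self.cache:
                self.cache[lvl] = self.stages[lvl](output)
            output = self.cache[lvl]
        return output
```

For example, `cache.invalidate("L2")` discards the cached include resolution and everything above it; the next `render(source, "L3")` recomputes those levels while reusing the cached L0 and L1 outputs.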

Database Mapping

We validated query-plan semantics by checking each database concept against its Wanderland equivalent:

| Database Concept | Wanderland Equivalent | Validated |
|---|---|---|
| Materialized view | Cached render | ✓ Any level cacheable |
| Query optimizer | Cache level selection | ✓ Level parameter |
| Secondary index | FenceIndex | ✓ Fence discovery by type |
| EXPLAIN | format='graph' | ✓ Structure inspection |

Navigation as query execution: moving through the graph triggers the same operations as executing a query plan—resolve references, fetch data, project results.
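To make the correspondence concrete, here is a toy rendition of that three-step plan. The dict-backed graph, the include expansion, and the fence handling are stand-ins for illustration, not Wanderland's API.

```python
# Toy query-plan view of navigation: resolve, fetch, project.
GRAPH = {
    "wanderland-paper-evaluation": {
        "body": "Latency data: {{include:perf-table}}",
        "fences": {"aws-query": "aws ec2 describe-instances"},
    },
    "perf-table": {"body": "| op | latency |", "fences": {}},
}

def navigate(slug):
    node = GRAPH[slug]                                    # resolve reference (index lookup)
    body = node["body"]
    for target, other in GRAPH.items():                   # fetch data: expand includes (join-like)
        body = body.replace("{{include:%s}}" % target, other["body"])
    executed = {name: "<output of %s>" % name              # fetch data: execute fences (external scan)
                for name in node["fences"]}
    return {"slug": slug, "body": body, "fences": executed}  # project the requested shape

print(navigate("wanderland-paper-evaluation"))
```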

RQ2: Comparative Analysis

vs. xKG (Executable Knowledge Graphs)

| Dimension | xKG | Wanderland |
|---|---|---|
| Graph construction | Automated extraction | Direct authoring |
| Code-concept link | Separate nodes + edges | Inline (prose contains fence) |
| Mutability | Read-only KB | Read-write substrate |
| Provenance | Traceable to source | Inline verification state |

Finding: Wanderland eliminates the extraction pipeline entirely. The cost is manual authoring; the benefit is no reconstruction error.

vs. Loops (Notebook Provenance)

| Dimension | Loops | Wanderland |
|---|---|---|
| Scope | Single notebook | Entire graph |
| Visualization | Post-hoc timeline | Inline indicators |
| Granularity | Cell level | Document + fence level |
| Purpose | Reproducibility | Continuous verification |

Finding: Loops is research tooling for understanding what happened. Wanderland is operational tooling for ensuring what should happen.

vs. Standard MCP

| Dimension | Standard MCP | Wanderland MCP |
|---|---|---|
| Tool definition | Separate code | Fence in document |
| Documentation | Separate file | Same artifact |
| Registration | Explicit in server | Implicit from graph |
| Sync requirement | Manual | None (same artifact) |

Finding: Wanderland's homoiconic approach eliminates an entire class of drift bugs where tool behavior diverges from documentation.
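The mechanism can be sketched as follows; the fence-index shape and the `server.register_tool` call are assumptions standing in for the real FenceIndex and MCP server wiring.

```python
# Sketch of implicit tool registration from the graph (interfaces assumed).
def register_tools_from_graph(fence_index, server, execute_fence):
    """Expose every executable fence in the graph as an MCP tool."""
    for node_slug, fence_name, fence in fence_index:
        def handler(fence=fence):
            # The tool runs the same fence the document renders, so tool
            # behavior cannot drift from the prose that surrounds it.
            return execute_fence(fence)

        server.register_tool(
            name=fence_name,  # e.g. "<slug>-east-fence-0"
            description=fence.get("description", "Fence from %s" % node_slug),
            handler=handler,
        )
```

Because registration is derived from the graph itself, there is no separate tool manifest to keep in sync with the documentation.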

RQ3: Production Viability

Deployment Context

Wanderland has been in production use for infrastructure operations at a Fortune 500 company's developer platform division. The system manages:

  • AWS infrastructure documentation with executable queries
  • JIRA ticket integration and workflow automation
  • Runbook execution with provenance tracking
  • AI agent tooling via MCP

Performance Characteristics

| Operation | Latency | Notes |
|---|---|---|
| Node read (L0) | <10ms | File read |
| Node read (L2) | <50ms | Include resolution |
| Fence execution | 100ms–10s | Depends on external API |
| Graph navigation | <20ms | Index lookup + render |

The performance tradeoff is explicit: fence execution adds latency vs. static content. This is acceptable for operational knowledge work where correctness matters more than throughput.
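One way to manage the tradeoff is to let the caller state how stale a render it can tolerate. The policy below is a hypothetical illustration, not a deployed feature; the level names follow the table above.

```python
# Hypothetical freshness policy: trade fence-execution latency for staleness.
def choose_level(max_staleness_s, cache_age_s):
    if max_staleness_s == 0 or cache_age_s > max_staleness_s:
        return "L3"   # re-execute fences: 100ms-10s, data is current
    return "L2"       # serve the include-resolved render: <50ms, possibly stale
```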

Limitations Observed

  • Cold start: First access after restart requires index rebuild (~5s for 500 nodes)
  • Large fences: Execution output >1MB causes rendering delays
  • Deep nesting: Include depth >5 impacts readability

Threats to Validity

This evaluation has significant limitations that constrain the strength of our claims:

No User Study: We have not conducted formal user studies measuring authoring effort, learning curves, or cognitive load. The production deployment provides an existence proof of viability but not a controlled comparison of developer productivity against alternative approaches.

No Quantitative Authoring Metrics: We lack measurements of lines-of-code, time-to-documentation, or maintenance overhead compared to traditional docs-as-code workflows. Claims of reduced "integration tax" are based on architectural argument, not empirical measurement.

Single Deployment Context: Production validation comes from one organization's infrastructure team. Generalizability to other domains (research documentation, API references, educational content) remains unvalidated.

Self-Evaluation Bias: The authors are also the primary users. Independent evaluation by teams adopting Wanderland without author involvement would provide stronger evidence.

Provenance System Unquantified: While the provenance system is deployed and functioning, we have not measured how often it actually catches drift in production, nor compared its effectiveness against alternative verification approaches.

These limitations do not invalidate the architectural contributions but constrain claims about practical superiority. We position this work as demonstrating feasibility and identifying design patterns, not as proving an optimal approach.

North

slots:
- slug: wanderland-paper
  context:
  - Parent paper node
  - Paper parent to evaluation section

East

slots:
- slug: wanderland-sota-assessment
  context:
  - Detailed SOTA comparison
- slug: spatial-database-engineering-patterns
  context:
  - Database pattern validation
- slug: wanderland-paper-discussion
  context:
  - Section sequence

West

slots:
- slug: wanderland-paper-implementation
  context:
  - Previous section

Provenance

Document

  • Status: 🔴 Unverified

Fences

wanderland-paper-evaluation-north-fence-0

  • Status: 🔴 Unverified

wanderland-paper-evaluation-east-fence-0

  • Status: 🔴 Unverified

wanderland-paper-evaluation-west-fence-0

  • Status: 🔴 Unverified
