wanderland-paper-evaluation

Evaluation

Research Questions

We evaluate Wanderland against three questions:

RQ1: Does the structural isomorphism hold operationally? Do compiler/database patterns actually apply?
RQ2: How does Wanderland compare to current approaches on key dimensions?
RQ3: Is the system viable for production use?

RQ1: Structural Isomorphism

Compiler Mapping

We validated the compiler analogy by implementing each stage:

Compiler Stage	Wanderland Implementation	Validated
Lexing	markdown-it-py tokenizer	✓ Token stream produced
Parsing	Section/fence extraction	✓ AST structure
Preprocessing	Variable substitution (L1)	✓ ${var} expansion
Compilation	Include resolution (L2)	✓ {{include:}} expansion
Linking	Fence execution (L3)	✓ External data fetched
Optimization	Middleware (L3.5)	✓ Transform pipeline

The cache invalidation behavior also matches: invalidating any level causes it to be regenerated from source. This is not merely an analogy; it is the same algorithm.
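The layered behavior described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `LayeredCache` class, the stage functions, the `VARS` and `INCLUDES` tables, and the choice to drop downstream levels while reusing upstream ones are hypothetical and are not taken from Wanderland's code. Only the `${var}` and `{{include:}}` syntaxes come from the table.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical lookup tables standing in for real variable and include sources.
VARS = {"region": "us-east-1"}
INCLUDES = {"footer": "-- end --"}

Stage = Tuple[str, Callable[[str], str]]


def substitute_vars(text: str) -> str:
    """L1 (assumed): expand ${var}, leaving unknown variables untouched."""
    return re.sub(r"\$\{(\w+)\}", lambda m: VARS.get(m.group(1), m.group(0)), text)


def resolve_includes(text: str) -> str:
    """L2 (assumed): expand {{include:name}}, leaving unknown names untouched."""
    return re.sub(r"\{\{include:(\w+)\}\}",
                  lambda m: INCLUDES.get(m.group(1), m.group(0)), text)


class LayeredCache:
    """Caches each stage's output; invalidating a level drops it and all later levels."""

    def __init__(self, source: str, stages: List[Stage]) -> None:
        self.source = source
        self.stages = stages
        self.cache: Dict[str, str] = {}
        self.runs: Dict[str, int] = {name: 0 for name, _ in stages}

    def render(self, level: str) -> str:
        text = self.source
        for name, fn in self.stages:
            if name not in self.cache:      # cache miss: recompute this stage
                self.cache[name] = fn(text)
                self.runs[name] += 1
            text = self.cache[name]
            if name == level:
                return text
        raise KeyError(level)

    def invalidate(self, level: str) -> None:
        names = [name for name, _ in self.stages]
        for name in names[names.index(level):]:
            self.cache.pop(name, None)


cache = LayeredCache("Region: ${region} {{include:footer}}",
                     [("L1", substitute_vars), ("L2", resolve_includes)])
rendered = cache.render("L2")
```

In this sketch, invalidating `L2` reruns only include resolution on the next render, while invalidating `L1` reruns both stages, which is the same dependency rule an incremental build follows.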

Database Mapping

We validated the query-plan semantics in the same way:

Database Concept	Wanderland Equivalent	Validated
Materialized view	Cached render	✓ Any level cacheable
Query optimizer	Cache level selection	✓ Level parameter
Secondary index	FenceIndex	✓ Fence discovery by type
EXPLAIN	format='graph'	✓ Structure inspection

Navigation as query execution: moving through the graph triggers the same operations as executing a query plan, namely resolving references, fetching data, and projecting results.
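To make the secondary-index row concrete, the following is a rough sketch of what fence discovery by type could look like. The name `FenceIndex` appears in the table, but the `Fence` record, the `add` and `find` methods, and the sample slugs and fence types are assumptions made for illustration and do not reflect the actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Fence:
    """Hypothetical record for one fence: owning node, fence type, and body."""
    node: str
    fence_type: str
    body: str


class FenceIndex:
    """Secondary index: maps a fence type to the fences of that type."""

    def __init__(self) -> None:
        self._by_type: Dict[str, List[Fence]] = defaultdict(list)

    def add(self, fence: Fence) -> None:
        self._by_type[fence.fence_type].append(fence)

    def find(self, fence_type: str) -> List[Fence]:
        # .get avoids creating empty entries for unknown types
        return list(self._by_type.get(fence_type, []))


index = FenceIndex()
index.add(Fence("runbook-a", "sql", "SELECT 1"))
index.add(Fence("runbook-b", "shell", "uptime"))
index.add(Fence("runbook-b", "sql", "SELECT 2"))
sql_nodes = [f.node for f in index.find("sql")]
```

As with a database secondary index, the lookup by type avoids scanning every node; the cost is keeping the index current whenever a node changes.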

RQ2: Comparative Analysis

vs. xKG (Executable Knowledge Graphs)

Dimension	xKG	Wanderland
Graph construction	Automated extraction	Direct authoring
Code-concept link	Separate nodes + edges	Inline (prose contains fence)
Mutability	Read-only KB	Read-write substrate
Provenance	Traceable to source	Inline verification state

Finding: Wanderland eliminates the extraction pipeline entirely. The cost is manual authoring; the benefit is no reconstruction error.

vs. Loops (Notebook Provenance)

Dimension	Loops	Wanderland
Scope	Single notebook	Entire graph
Visualization	Post-hoc timeline	Inline indicators
Granularity	Cell level	Document + fence level
Purpose	Reproducibility	Continuous verification

Finding: Loops is research tooling for understanding what happened. Wanderland is operational tooling for ensuring what should happen.

vs. Standard MCP

Dimension	Standard MCP	Wanderland MCP
Tool definition	Separate code	Fence in document
Documentation	Separate file	Same artifact
Registration	Explicit in server	Implicit from graph
Sync requirement	Manual	None (same artifact)

Finding: Wanderland's homoiconic approach eliminates an entire class of drift bugs where tool behavior diverges from documentation.
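The "same artifact" property can be sketched as follows: if a tool's description is the prose directly above its fence, there is no second file to fall out of sync. The fence header syntax (`tool name=...`), the `Tool` tuple, and `discover_tools` are all hypothetical and chosen only for this illustration; they are not Wanderland's or MCP's actual formats.

```python
import re
from typing import Dict, NamedTuple

FENCE = "`" * 3  # built programmatically so this example can sit inside a fence itself


class Tool(NamedTuple):
    name: str
    description: str  # taken from the prose line directly above the fence
    body: str         # the fence content that would be executed


# Assumed layout: one prose line, a blank line, then a fence tagged "tool name=<id>".
TOOL_FENCE = re.compile(
    r"(?P<prose>[^\n]+)\n\n"
    + FENCE + r"tool name=(?P<name>\w+)\n(?P<body>.*?)\n" + FENCE,
    re.DOTALL,
)


def discover_tools(document: str) -> Dict[str, Tool]:
    """Derive the tool registry from the document instead of registering tools by hand."""
    return {
        m["name"]: Tool(m["name"], m["prose"].strip(), m["body"])
        for m in TOOL_FENCE.finditer(document)
    }


DOC = (
    "List the buckets in the current account.\n\n"
    f"{FENCE}tool name=list_buckets\n"
    "aws s3api list-buckets\n"
    f"{FENCE}\n"
)

tools = discover_tools(DOC)
```

Editing the prose line in `DOC` changes the tool description on the next discovery pass, which is the sense in which registration is implicit and no manual sync step exists.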

RQ3: Production Viability

Deployment Context

Wanderland has been in production use for infrastructure operations at a Fortune 500 company's developer platform division. The system manages:

AWS infrastructure documentation with executable queries
JIRA ticket integration and workflow automation
Runbook execution with provenance tracking
AI agent tooling via MCP

Performance Characteristics

Operation	Latency	Notes
Node read (L0)	<10ms	File read
Node read (L2)	<50ms	Include resolution
Fence execution	100ms-10s	Depends on external API
Graph navigation	<20ms	Index lookup + render

The performance tradeoff is explicit: fence execution adds latency relative to serving static content. This is acceptable for operational knowledge work, where correctness matters more than throughput.

Limitations Observed

Cold start: The first access after a restart requires an index rebuild (~5s for 500 nodes)
Large fences: Execution output >1MB causes rendering delays
Deep nesting: Include depth >5 impacts readability
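The deep-nesting limitation suggests a simple check that could flag nodes before they become hard to read. The sketch below is an assumption-laden illustration: the `include_depth` helper, the in-memory `nodes` mapping, and the slug pattern inside `{{include:...}}` are hypothetical, and the threshold of 5 is taken from the observation above.

```python
import re
from typing import Dict, FrozenSet

# Assumed include syntax: {{include:<slug>}} with word characters and hyphens.
INCLUDE = re.compile(r"\{\{include:([\w-]+)\}\}")
MAX_DEPTH = 5  # threshold from the limitation noted above


def include_depth(slug: str, nodes: Dict[str, str],
                  _seen: FrozenSet[str] = frozenset()) -> int:
    """Return the longest chain of nested includes starting at `slug`."""
    if slug in _seen:
        raise ValueError(f"include cycle at {slug!r}")
    children = INCLUDE.findall(nodes.get(slug, ""))
    if not children:
        return 0
    return 1 + max(include_depth(c, nodes, _seen | {slug}) for c in children)


nodes = {
    "a": "intro {{include:b}}",
    "b": "middle {{include:c}}",
    "c": "leaf text",
}
depth_a = include_depth("a", nodes)
too_deep = depth_a > MAX_DEPTH
```

A check like this could run at index-rebuild time; it also surfaces include cycles, which would otherwise recurse without end.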

Threats to Validity

This evaluation has significant limitations that constrain the strength of our claims:

No User Study: We have not conducted formal user studies measuring authoring effort, learning curves, or cognitive load. The production deployment provides an existence proof of viability but not a controlled comparison of developer productivity against alternative approaches.

No Quantitative Authoring Metrics: We lack measurements of lines-of-code, time-to-documentation, or maintenance overhead compared to traditional docs-as-code workflows. Claims of reduced "integration tax" are based on architectural argument, not empirical measurement.

Single Deployment Context: Production validation comes from one organization's infrastructure team. Generalizability to other domains (research documentation, API references, educational content) remains unvalidated.

Self-Evaluation Bias: The authors are also the primary users. Independent evaluation by teams adopting Wanderland without author involvement would provide stronger evidence.

Provenance System Unquantified: While the provenance system is deployed and functioning, we have not measured how often it actually catches drift in production, nor compared its effectiveness against alternative verification approaches.

These limitations do not invalidate the architectural contributions, but they do constrain claims about practical superiority. We position this work as demonstrating feasibility and identifying design patterns, not as proving that the approach is optimal.

North

slots:
- slug: wanderland-paper
  context:
  - Parent paper node
  - Paper parent to evaluation section

East

slots:
- slug: wanderland-sota-assessment
  context:
  - Detailed SOTA comparison
- slug: spatial-database-engineering-patterns
  context:
  - Database pattern validation
- slug: wanderland-paper-discussion
  context:
  - Section sequence

West

slots:
- slug: wanderland-paper-implementation
  context:
  - Previous section

Provenance

Document

Status: 🔴 Unverified

Fences

wanderland-paper-evaluation-north-fence-0

Status: 🔴 Unverified

wanderland-paper-evaluation-east-fence-0

Status: 🔴 Unverified

wanderland-paper-evaluation-west-fence-0

Status: 🔴 Unverified
