wanderland-paper-evaluation
Evaluation
Research Questions
We evaluate Wanderland against three questions:
- RQ1: Does the structural isomorphism hold operationally? Do compiler/database patterns actually apply?
- RQ2: How does Wanderland compare to current approaches on key dimensions?
- RQ3: Is the system viable for production use?
RQ1: Structural Isomorphism
Compiler Mapping
We validated the compiler analogy by implementing each stage:
| Compiler Stage | Wanderland Implementation | Validated |
|---|---|---|
| Lexing | markdown-it-py tokenizer | ✓ Token stream produced |
| Parsing | Section/fence extraction | ✓ AST structure |
| Preprocessing | Variable substitution (L1) | ✓ ${var} expansion |
| Compilation | Include resolution (L2) | ✓ {{include:}} expansion |
| Linking | Fence execution (L3) | ✓ External data fetched |
| Optimization | Middleware (L3.5) | ✓ Transform pipeline |
The cache invalidation behavior matches that of a compiler's build cache: invalidate any level and it regenerates from source. This is not an analogy; it is the same algorithm.
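The level-by-level regeneration described above can be sketched as follows. This is a minimal illustration, not Wanderland's implementation: the `LevelCache` class, its method names, and the exact `${var}` and `{{include:slug}}` patterns are assumptions, includes expand only one level deep and insert the raw source of the target, and fence execution (L3) is left out.

```python
import re
from typing import Dict, Tuple

# Assumed syntaxes, based on the ${var} and {{include:}} forms named in the table.
VAR = re.compile(r"\$\{(\w+)\}")
INCLUDE = re.compile(r"\{\{include:([\w-]+)\}\}")


class LevelCache:
    """Hypothetical per-level render cache: L0 source, L1 variables, L2 includes."""

    MAX_LEVEL = 2

    def __init__(self, sources: Dict[str, str], variables: Dict[str, str]) -> None:
        self.sources = sources
        self.variables = variables
        self.cache: Dict[Tuple[str, int], str] = {}
        self.stage_runs = 0  # number of stage computations (cache misses)

    def render(self, slug: str, level: int) -> str:
        key = (slug, level)
        if key in self.cache:
            return self.cache[key]
        if level == 0:
            out = self.sources[slug]
        elif level == 1:
            # Unknown variables are left untouched.
            out = VAR.sub(
                lambda m: self.variables.get(m.group(1), m.group(0)),
                self.render(slug, 0),
            )
        elif level == 2:
            # Unknown include targets are left untouched.
            out = INCLUDE.sub(
                lambda m: self.sources.get(m.group(1), m.group(0)),
                self.render(slug, 1),
            )
        else:
            raise ValueError(f"unsupported level: {level}")
        self.stage_runs += 1
        self.cache[key] = out
        return out

    def invalidate(self, slug: str, level: int) -> None:
        """Drop the given level and every level derived from it."""
        for lv in range(level, self.MAX_LEVEL + 1):
            self.cache.pop((slug, lv), None)


sources = {"a": "Hello ${name} {{include:b}}", "b": "world"}
cache = LevelCache(sources, {"name": "Ada"})
first = cache.render("a", 2)

cache.variables["name"] = "Bob"
cache.invalidate("a", 1)   # L1 and L2 are dropped, L0 stays cached
second = cache.render("a", 2)
```

In this sketch, invalidating L1 forces only L1 and L2 to recompute, which mirrors how an incremental build reuses artifacts below the changed stage.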
Database Mapping
We validated query plan semantics against the following mapping:
| Database Concept | Wanderland Equivalent | Validated |
|---|---|---|
| Materialized view | Cached render | ✓ Any level cacheable |
| Query optimizer | Cache level selection | ✓ Level parameter |
| Secondary index | FenceIndex | ✓ Fence discovery by type |
| EXPLAIN | format='graph' | ✓ Structure inspection |
Navigation behaves as query execution: moving through the graph triggers the same operations as executing a query plan, namely resolving references, fetching data, and projecting results.
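To make the secondary-index row concrete, the sketch below builds an inverted index from fence type to fence location. It is an assumption-laden illustration: the real FenceIndex interface is not shown in this section, so the `add` and `by_type` methods, the line-based fence scanner, and the `(slug, ordinal)` addressing are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

FENCE = "`" * 3  # fence delimiter, built programmatically to keep this block self-contained


def extract_fences(text: str) -> List[Tuple[str, str]]:
    """Return (type, body) pairs for each closed fence; unterminated fences are ignored."""
    fences: List[Tuple[str, str]] = []
    current_type = None
    body: List[str] = []
    for line in text.splitlines():
        if line.startswith(FENCE):
            if current_type is None:
                # Opening fence: the info string is treated as the fence type.
                current_type = line[len(FENCE):].strip() or "text"
                body = []
            else:
                # Closing fence: record the completed block.
                fences.append((current_type, "\n".join(body)))
                current_type = None
        elif current_type is not None:
            body.append(line)
    return fences


class FenceIndex:
    """Hypothetical secondary index: fence type -> [(node slug, fence ordinal)]."""

    def __init__(self) -> None:
        self._by_type: Dict[str, List[Tuple[str, int]]] = defaultdict(list)

    def add(self, slug: str, text: str) -> None:
        for ordinal, (fence_type, _body) in enumerate(extract_fences(text)):
            self._by_type[fence_type].append((slug, ordinal))

    def by_type(self, fence_type: str) -> List[Tuple[str, int]]:
        return list(self._by_type.get(fence_type, []))


doc_a = "\n".join(
    ["Intro", FENCE + "sql", "SELECT 1", FENCE, "More", FENCE + "python", "print(1)", FENCE]
)
doc_b = "\n".join([FENCE + "sql", "SELECT 2", FENCE])

index = FenceIndex()
index.add("node-a", doc_a)
index.add("node-b", doc_b)
sql_fences = index.by_type("sql")
```

As with a database secondary index, lookup by type avoids scanning every node, at the cost of rebuilding the index when nodes change.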
RQ2: Comparative Analysis
vs. xKG (Executable Knowledge Graphs)
| Dimension | xKG | Wanderland |
|---|---|---|
| Graph construction | Automated extraction | Direct authoring |
| Code-concept link | Separate nodes + edges | Inline (prose contains fence) |
| Mutability | Read-only KB | Read-write substrate |
| Provenance | Traceable to source | Inline verification state |
Finding: Wanderland eliminates the extraction pipeline entirely. The cost is manual authoring; the benefit is no reconstruction error.
vs. Loops (Notebook Provenance)
| Dimension | Loops | Wanderland |
|---|---|---|
| Scope | Single notebook | Entire graph |
| Visualization | Post-hoc timeline | Inline indicators |
| Granularity | Cell level | Document + fence level |
| Purpose | Reproducibility | Continuous verification |
Finding: Loops is research tooling for understanding what happened. Wanderland is operational tooling for ensuring what should happen.
vs. Standard MCP
| Dimension | Standard MCP | Wanderland MCP |
|---|---|---|
| Tool definition | Separate code | Fence in document |
| Documentation | Separate file | Same artifact |
| Registration | Explicit in server | Implicit from graph |
| Sync requirement | Manual | None (same artifact) |
Finding: Wanderland's homoiconic approach eliminates an entire class of drift bugs where tool behavior diverges from documentation.
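The "implicit from graph" registration row can be illustrated with a small sketch. Everything named here is hypothetical: the `Node`, `Fence`, and `Tool` types, the `"tool"` fence kind, and the `slug-fence-N` naming (loosely modeled on the fence identifiers used by this node) are assumptions for illustration and do not describe the actual MCP server.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Fence:
    kind: str  # hypothetical fence type, e.g. "tool" or "python"
    body: str


@dataclass
class Node:
    slug: str
    prose: str
    fences: List[Fence]


@dataclass
class Tool:
    name: str
    description: str
    body: str


def derive_tools(graph: Dict[str, Node], kind: str = "tool") -> Dict[str, Tool]:
    """Build the tool registry by scanning the graph; nothing is registered by hand."""
    tools: Dict[str, Tool] = {}
    for slug, node in graph.items():
        for ordinal, fence in enumerate(node.fences):
            if fence.kind != kind:
                continue
            name = f"{slug}-fence-{ordinal}"
            # Description and behavior are read from the same node,
            # so they cannot be edited independently.
            tools[name] = Tool(name=name, description=node.prose, body=fence.body)
    return tools


graph = {
    "list-buckets": Node(
        slug="list-buckets",
        prose="Lists storage buckets.",
        fences=[Fence("python", "x = 1"), Fence("tool", "list_buckets()")],
    )
}
before = derive_tools(graph)

# Editing the node updates the documentation the registry exposes on the next scan.
graph["list-buckets"].prose = "Lists storage buckets in the current account."
after = derive_tools(graph)
```

Because the registry is recomputed from the graph, there is no separate tool manifest that could fall out of sync with the document.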
RQ3: Production Viability
Deployment Context
Wanderland has been in production use for infrastructure operations at a Fortune 500 company's developer platform division. The system manages:
- AWS infrastructure documentation with executable queries
- JIRA ticket integration and workflow automation
- Runbook execution with provenance tracking
- AI agent tooling via MCP
Performance Characteristics
| Operation | Latency | Notes |
|---|---|---|
| Node read (L0) | <10ms | File read |
| Node read (L2) | <50ms | Include resolution |
| Fence execution | 100ms-10s | Depends on external API |
| Graph navigation | <20ms | Index lookup + render |
The performance tradeoff is explicit: fence execution adds latency relative to static content. This is acceptable for operational knowledge work where correctness matters more than throughput.
Limitations Observed
- Cold start: First access after restart requires index rebuild (~5s for 500 nodes)
- Large fences: Execution output >1MB causes rendering delays
- Deep nesting: Include depth >5 impacts readability
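The deep-nesting limitation could be surfaced with a simple check. The sketch below is an assumption, not an existing Wanderland feature: it computes include depth from an in-memory map of slug to source text, assumes the `{{include:slug}}` form, and treats missing targets as leaves.

```python
import re
from typing import Dict, Tuple

INCLUDE = re.compile(r"\{\{include:([\w-]+)\}\}")  # assumed include syntax


def include_depth(slug: str, sources: Dict[str, str], _stack: Tuple[str, ...] = ()) -> int:
    """Return the longest chain of nested includes starting at `slug`."""
    if slug in _stack:
        raise ValueError("include cycle: " + " -> ".join(_stack + (slug,)))
    children = INCLUDE.findall(sources.get(slug, ""))
    if not children:
        return 0
    return 1 + max(include_depth(child, sources, _stack + (slug,)) for child in children)


def too_deep(slug: str, sources: Dict[str, str], limit: int = 5) -> bool:
    """Flag nodes whose include depth exceeds the readability threshold noted above."""
    return include_depth(slug, sources) > limit


sources = {
    "a": "Top {{include:b}}",
    "b": "Middle {{include:c}}",
    "c": "Leaf",
}
depth_a = include_depth("a", sources)
```

The default `limit` of 5 simply mirrors the threshold in the list above and would need tuning per deployment.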
Threats to Validity
This evaluation has significant limitations that constrain the strength of our claims:
No User Study: We have not conducted formal user studies measuring authoring effort, learning curves, or cognitive load. The production deployment provides an existence proof of viability but not a controlled comparison of developer productivity against alternative approaches.
No Quantitative Authoring Metrics: We lack measurements of lines-of-code, time-to-documentation, or maintenance overhead compared to traditional docs-as-code workflows. Claims of reduced "integration tax" are based on architectural argument, not empirical measurement.
Single Deployment Context: Production validation comes from one organization's infrastructure team. Generalizability to other domains (research documentation, API references, educational content) remains unvalidated.
Self-Evaluation Bias: The authors are also the primary users. Independent evaluation by teams adopting Wanderland without author involvement would provide stronger evidence.
Provenance System Unquantified: While the provenance system is deployed and functioning, we have not measured how often it actually catches drift in production, nor compared its effectiveness against alternative verification approaches.
These limitations do not invalidate the architectural contributions but constrain claims about practical superiority. We position this work as demonstrating feasibility and identifying design patterns, not as proving that the approach is optimal.
North
slots:
- slug: wanderland-paper
context:
- Parent paper node
- Paper parent to evaluation section
East
slots:
- slug: wanderland-sota-assessment
context:
- Detailed SOTA comparison
- slug: spatial-database-engineering-patterns
context:
- Database pattern validation
- slug: wanderland-paper-discussion
context:
- Section sequence
West
slots:
- slug: wanderland-paper-implementation
context:
- Previous section
Provenance
Document
- Status: 🔴 Unverified
Fences
wanderland-paper-evaluation-north-fence-0
- Status: 🔴 Unverified
wanderland-paper-evaluation-east-fence-0
- Status: 🔴 Unverified
wanderland-paper-evaluation-west-fence-0
- Status: 🔴 Unverified