Loom in Practice: A DevOps Case Study

How One Platform Team Transformed Incident Response


The Problem: When AWS Goes Dark

It was 2:47 AM when the first PagerDuty alert fired. The e-commerce platform was experiencing intermittent 503 errors, and the on-call engineer—Maria—was already pulling up her terminal before the second alert arrived.

The symptoms were familiar but frustrating: ECS tasks cycling, load balancer health checks failing, and a cascade of dependent services timing out. Somewhere in the labyrinth of AWS infrastructure, something had gone wrong. Finding it would be the challenge.


Before Loom: The Old Way

Maria opened seven browser tabs. CloudWatch. ECS console. ALB target groups. The internal wiki (three clicks deep to find the runbook). A Slack thread from the last outage. Her notes app with half-remembered commands.

She copied a cluster ARN, switched tabs, pasted it into a different console, realized she needed a different region, switched back, and felt the familiar friction of context-switching during an incident.

"Where's the runbook for service restarts?" she typed into Slack.

Billy, the senior platform engineer, responded: "Check the wiki under Operations > ECS > Troubleshooting. Third section."

The wiki page was last updated eight months ago. Half the commands referenced services that had been renamed. Maria improvised.


The Shift: A Strange Message in Slack

Three weeks later, Maria was reviewing a deployment when she noticed something unusual in the platform team's channel. Sarah, one of the newer engineers, had posted what looked like a code block—but it was rendering.

┌─────────────────────────────────────────────────────┐
│  CLUSTER: prod-ecs-cluster-01                       │
│  SERVICES: 12 running, 0 pending, 0 draining        │
│  TASKS: 47/48 healthy                               │
│  LAST DEPLOY: 2025-01-09T14:23:00Z                  │
│  STATUS: ● HEALTHY                                  │
└─────────────────────────────────────────────────────┘

"Wait," Maria replied. "How is that rendering? Is that an embed?"

Sarah's response was a screenshot of her notebook. Just a simple fence:

\`\`\`cluster[prod-status]
cluster: prod-ecs-cluster-01
\`\`\`

"It's this new thing Billy's team is setting up," Sarah wrote. "The fence knows how to render itself. Watch."

She posted another block. This one showed live task counts.

Maria stared at her screen. The Slack message wasn't a static embed. It was alive.


The Architecture: How It Works

Billy had been quietly building what he called "living documentation" using Loom and the Oculus graph system. The concept was simple:

  • Nodes store documentation, runbooks, and configuration in markdown
  • Fences (code blocks) can contain data, executable queries, or rendered components
  • Holes (${...}) pull data dynamically from other nodes or live systems
  • The Observer renders everything on demand—PAUSE → FETCH → SPLICE → CONTINUE

When Sarah pasted that fence into Slack, the platform's integration layer intercepted it, resolved the prod-status label against the cluster inventory, fetched current metrics, and rendered the result.

No static screenshots. No stale runbooks. The documentation was the system.
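
As a rough sketch of that flow, the loop below resolves ${source:path} holes in a node's markdown by pausing at each hole, fetching the referenced value, and splicing it into the output. The fetcher names and data here are illustrative only, not the real Observer or integration-layer API:

\`\`\`python
import re

HOLE = re.compile(r"\$\{([^}]+)\}")  # matches ${source:path} holes

def resolve_hole(expr, fetchers):
    """PAUSE at a hole, FETCH its value, return the text to SPLICE in."""
    source, _, path = expr.partition(":")
    fetch = fetchers.get(source)
    if fetch is None:
        return "${" + expr + "}"  # leave unknown holes untouched
    return str(fetch(path))       # e.g. query the cluster inventory or env store

def render(markdown, fetchers):
    """PAUSE -> FETCH -> SPLICE -> CONTINUE across every hole in the node."""
    return HOLE.sub(lambda m: resolve_hole(m.group(1), fetchers), markdown)

# Hypothetical usage: resolve an env hole in a dashboard line
fetchers = {"env": lambda path: {"active-cluster": "prod-ecs-cluster-01"}[path]}
print(render("cluster: ${env:active-cluster}", fetchers))
# -> cluster: prod-ecs-cluster-01
\`\`\`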


One Month Later: The Real Test

It was 3:12 AM. Again.

Maria's pager fired. ECS service checkout-api was unhealthy. But this time, instead of opening seven browser tabs, she opened one: the team's operations node.

# Operations Dashboard

## Current Status

\`\`\`cluster[live-status]
cluster: ${env:active-cluster}
\`\`\`

## Active Incidents

${incidents:open.yaml.summary}

The page rendered immediately. She could see the cluster state, the specific service flapping, and—critically—a yellow indicator next to the health check configuration.

To the east of the status panel, a linked node appeared: ecs-health-check-troubleshooting.

She clicked through.


The Troubleshooting Flow

The troubleshooting guide wasn't a static document. It was a conversation.

## Service Health Check Failures

The service appears to be failing health checks.

### Current Configuration

\`\`\`python[health-check-config]
# Fetched from service definition
config = fetch_service_config("${context:service-name}")
print(f"Interval: {config['health_check']['interval']}s")
print(f"Timeout: {config['health_check']['timeout']}s")
print(f"Healthy threshold: {config['health_check']['healthy_threshold']}")
\`\`\`

### Quick Diagnosis

Your health check timeout (${context:health-check-timeout}s) is 
shorter than the average response time (${context:avg-response-time}s).

**Recommended Action:** Increase timeout to at least 
${context:recommended-timeout}s.

### Apply Fix?

\`\`\`action[apply-fix]
type: ecs-update-health-check
service: ${context:service-name}
timeout: ${context:recommended-timeout}
confirm: true
\`\`\`

Maria read the diagnosis. The system had already identified the issue: a recent code change had increased the checkout service's startup time, but the health check timeout hadn't been updated. Tasks were being marked unhealthy before they finished initializing.
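
The rule behind that diagnosis is simple to sketch: compare the configured timeout with the service's observed response time and recommend a padded value. The helper and the numbers in the usage lines below are illustrative, not the actual logic or data from the incident:

\`\`\`python
def diagnose_health_check(timeout_s, avg_response_s, margin_s=2.0):
    """Flag health checks whose timeout is shorter than observed response time.

    Returns a recommended timeout (observed time plus a safety margin),
    or None if the current setting already covers it.
    """
    if timeout_s >= avg_response_s + margin_s:
        return None
    return round(avg_response_s + margin_s)

# Illustrative values in the spirit of the checkout-api incident
recommended = diagnose_health_check(timeout_s=5, avg_response_s=13.0)
if recommended is not None:
    print(f"Increase health check timeout to at least {recommended}s")
\`\`\`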

She reviewed the recommended timeout, confirmed the action, and watched the service stabilize.

Total resolution time: 4 minutes.


The Deeper Pattern

The next morning, Maria received an automated message:

┌─────────────────────────────────────────────────────┐
│  INCIDENT RESOLVED: checkout-api health check       │
│                                                     │
│  Root cause: Health check timeout insufficient      │
│  Resolution: Timeout increased 5s → 15s             │
│                                                     │
│  This incident matches a pattern we've seen         │
│  across 3 other services this quarter.              │
│                                                     │
│  Would you like to:                                 │
│  [1] Review health check best practices             │
│  [2] Audit other services for similar config        │
│  [3] Add this to the automated pre-deploy checks    │
│                                                     │
└─────────────────────────────────────────────────────┘

The system wasn't just helping her fix problems. It was learning from them.

She selected option 3.


What Changed

Before Loom

  • 7 browser tabs during incidents
  • Runbooks outdated within weeks
  • Tribal knowledge locked in Slack threads
  • Average incident resolution: 23 minutes
  • Post-mortems forgotten by next quarter

After Loom

  • Single operations dashboard
  • Documentation renders live system state
  • Troubleshooting guides adapt to context
  • Average incident resolution: 6 minutes
  • Patterns detected, fixes automated

The Key Insight

Loom doesn't replace documentation. It makes documentation executable.

The same node that describes how ECS health checks work can also:

  • Show the current configuration of a specific service
  • Diagnose why it's failing
  • Offer to apply the fix
  • Learn from the resolution

The boundary between reading about a system and operating it dissolves.

"I used to dread on-call," Maria said later. "Now it's almost... calm. The runbook isn't a PDF I hope is still accurate. It's a conversation with someone who already knows what's wrong."


Technical Implementation

For teams evaluating Loom, the key integration points are:

  • Node Storage: Markdown files in Oculus graph (~/.local/share/oculus/nodes/)
  • Fence Execution: Python/YAML/bash fences execute against live systems
  • Hole Resolution: ${source:path} syntax pulls data dynamically
  • Integration Layer: Slack/Teams/IDE plugins render fences in context
  • Observer API: Single endpoint for all rendering operations
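
As an illustration of the last two points, a chat integration only needs to recognize a labeled fence in a message and hand it to the Observer for rendering. The endpoint, payload shape, and response field below are hypothetical stand-ins for whatever the Observer API actually exposes:

\`\`\`python
import re
import requests  # any HTTP client would do

OBSERVER_URL = "https://loom.internal/observer/render"  # hypothetical endpoint
FENCE_MARK = "`" * 3  # built dynamically to avoid a literal fence in this doc

# Matches labeled Loom fences such as cluster[prod-status] inside a chat message
FENCE = re.compile(FENCE_MARK + r"(\w+)\[([\w-]+)\]\n(.*?)" + FENCE_MARK, re.DOTALL)

def render_fences(message_text):
    """Replace every labeled Loom fence in a message with the Observer's rendering."""
    def _render(match):
        kind, label, body = match.groups()
        resp = requests.post(
            OBSERVER_URL,
            json={"fence": kind, "label": label, "body": body},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["rendered"]
    return FENCE.sub(_render, message_text)
\`\`\`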

The PEEK/POKE commands provide a simple interface:

\`\`\`bash
# Read current cluster status
loom peek ops-dashboard:current-status

# Update a configuration value
loom poke config:settings.yaml.timeout 30

# Render at different levels
loom peek troubleshooting -l 3  # Raw (unresolved)
loom peek troubleshooting -l 5  # Rendered (live data)
\`\`\`

Conclusion

The platform team didn't set out to reinvent documentation. They set out to solve a specific problem: incidents took too long because the information needed to resolve them was scattered, stale, or locked in people's heads.

Loom emerged from that constraint. By treating documents as streams that flow through middleware—resolving holes, splicing templates, executing fences—they created living documentation that participates in operations rather than merely describing them.

The 503 errors still happen. AWS still goes dark at 3 AM.

But now, the runbook runs itself.


For more information on implementing Loom in your organization, see the Loom User Guide and Oculus V2 documentation.

Provenance

Document

  • Status: 🔴 Unverified

Changelog

  • 2026-01-11 07:35: Node created by mcp - Creating DevOps case study with ECS outage narrative as requested

West

slots:
- context:
  - Linking user guide to case study
  slug: loom-user-guide
← west: loom-user-guide