Six Names for One Observability Story

This morning, Tribe was quietly degraded for hours and Courier was flapping every minute or two. The Vigil dashboard said so — yellow badge next to Tribe, red-green-red-green next to Courier — but nobody was looking at the dashboard, so nobody knew.

Both bugs had specific technical causes, and we fixed both (Tribe's health endpoint was probing a Valkey that didn't exist at that hostname inside the container; Vigil's httpx connection pool had a poisoned keepalive entry to one specific upstream). But the reason it took hours to notice is the interesting part. Our observability stack was recording the right signals. It just didn't know how to escalate them — because the one piece we hadn't built yet was the alert that tells a human. By the end of the day we'd shipped that too. Here's how our monitoring works, and why it has six names instead of one.

The Stack

At a two-person company — one human, one Claude — the Datadog-Splunk-New-Relic bill isn't the problem. Missing things is the problem. Vendor platforms are built to sell you completeness: metrics plus logs plus traces plus RUM plus synthetics plus alerting plus incidents plus on-call, all from one console. You pay a lot for it, and you still miss things, because a single dashboard that shows you everything usually shows you nothing. What we actually needed was a handful of sharply focused tools that each watch a specific thing well, file their findings into a common substrate, and tell a human when the human needs to know.

Six tools do that job for us:

Pulse watches real users and real browsers.
BugReporter (a React library, not a standalone tool) lets those users tell us directly when something's wrong.
Vigil watches services from the outside — every public endpoint in the fleet, from an identical probe harness, every 60 seconds.
Chronicle watches services from the inside — the OpenTelemetry-native data plane for metrics, logs, and traces across every Renkara tool.
Docket is where defects, bugs, and incidents end up as work.
Chirp is where humans get told, in the channels that make sense, on whatever device is in their hand.

Each tool has a single reason to exist. The value isn't in any one of them individually — it's in the wiring between them. The rest of this post walks the stack, then shows what that wiring looks like when a real thing breaks.

Pulse — What real users are actually doing

Pulse is our privacy-first web analytics and event tracker. Sub-2 KB tracking script, zero cookies, GDPR-compliant by default. It handles the obvious things — pageviews, sessions, visitors, sources, locations, devices — but the observability-relevant parts are the event stream and the session-ID propagation.

Every browser session gets a hashed session identifier. That identifier flows into request headers as the user moves through the app, and our backend services stamp it onto their trace spans. The upshot: when Pulse logs a conversion funnel drop-off at "Step 3: Submit Payment," we can click into the offending session and land in the backend trace for the exact API call that failed. The same propagation lets a browser JavaScript error captured by Pulse hyperlink to the server-side stack that caused it, through Chronicle.

The privacy model matters for credibility (no third-party cookies, hashes rotate daily) but the observability win is the linkage. Most analytics products live in a silo — they know their users inside out, and know nothing about the code. Pulse's session IDs are the thread that stitches the funnel to the stack trace.
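
Here's roughly what that thread looks like in code. This is a minimal sketch, not Pulse's actual scheme: the HMAC construction, the per-browser token, the X-Pulse-Session header name, and the session.id attribute key are all assumptions made for illustration. It shows a hashed identifier that rotates when the date changes, and the backend side picking it out of request headers so it can be stamped onto a span.

```python
import hashlib
import hmac
from datetime import date

SESSION_HEADER = "X-Pulse-Session"  # hypothetical header name


def session_id(secret: bytes, browser_token: str, day: date) -> str:
    """Hash a per-browser token together with the date, so the ID rotates daily."""
    message = f"{day.isoformat()}:{browser_token}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:32]


def span_attributes(headers: dict[str, str]) -> dict[str, str]:
    """Pick the session ID out of request headers for stamping onto a trace span."""
    value = headers.get(SESSION_HEADER)
    return {"session.id": value} if value else {}
```

Because the date is part of the hashed message, yesterday's identifier can't be linked to today's without the secret, and the backend never needs to see the raw browser token.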

BugReporter — When users know before we do

Every tool in the Renkara fleet has a small floating action button in the top bar, reading "Report a bug." That button comes from a shared React library, @avian/bug-reporter, that mounts in every frontend. Users can click it anywhere, type what they saw and what they expected, choose a type (bug / enhancement / feedback), and submit.

The button isn't decorative. Every submission becomes a card on the originating tool's Docket board, automatically. The card includes the URL path, the user agent, timestamps, and the user's stated description, plus the trace_id if the tool has stamped one onto the response. We get bug reports that start with "the export button on the customer detail page freezes after about 10 seconds" and already come with the browser URL, the logged-in user, and a direct link to the Chronicle trace for the exact freeze. The on-call engineer goes from "can you send me a screenshot" to "I can see the query that hung" in one click.

It's a small thing. It turns out to be one of the highest-leverage small things in the stack. Users are the only observers who see what the user experience is, and they'll tell you if you make it easy enough.

Vigil — Services from the outside

Vigil is the outside-in probe harness. Every 60 seconds it hits every service in the fleet (25 sites at the moment) and records the result: HTTP status, latency, and response-body contents when we parse them. Vigil knows how to read the standard {"status": "healthy"|"degraded"|"unhealthy"} body shape that all our tools emit on /health, and it grades the check accordingly. Incidents open when a site goes down and close when it comes back; the dashboard shows latency sparklines, 24h / 7d / 30d / 90d uptime, and a live incident log.
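
The grading step can be sketched in a few lines. This is an illustration under assumptions, not Vigil's real code: the Grade enum, the grade_check function, and the order of the checks are hypothetical. The five status names and the ten-second timeout are the ones that appear elsewhere in this post.

```python
import json
from enum import Enum
from typing import Optional


class Grade(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
    TIMEOUT = "timeout"
    ERROR = "error"


TIMEOUT_SECONDS = 10.0  # the probe timeout mentioned in the incident write-up


def grade_check(status_code: Optional[int], body: str, elapsed_s: float) -> Grade:
    """Grade one probe result. status_code is None when the connection failed."""
    if status_code is None:
        return Grade.ERROR
    if elapsed_s >= TIMEOUT_SECONDS:
        return Grade.TIMEOUT
    if not 200 <= status_code < 300:
        return Grade.UNHEALTHY
    try:
        reported = json.loads(body).get("status")
    except (ValueError, AttributeError):
        # Body isn't the standard JSON shape: fall back to the HTTP status alone.
        return Grade.HEALTHY
    if reported == "degraded":
        return Grade.DEGRADED
    if reported == "unhealthy":
        return Grade.UNHEALTHY
    return Grade.HEALTHY
```

Note that a 200 OK is not enough to count as healthy here: a parseable body that says "degraded" wins over the status code.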

Two subtleties in Vigil's design, both learned the hard way:

Flap detection. A service that bounces between healthy and timing out every minute is different from one that's solidly down. The naive implementation opens and resolves an incident on every flip, which spams the alert channel into uselessness. Vigil counts recent incident starts per site in a rolling 30-minute window, and if three or more have happened it fires exactly one "flapping" alert, then stays quiet for an hour even if the flips continue. You learn about the pattern; you don't drown in it.
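
A minimal sketch of that rule, assuming the incident start times for a site have already been fetched. The FlapDetector class and its method name are hypothetical; the 30-minute window, the threshold of three, and the one-hour quiet period come from the description above.

```python
from datetime import datetime, timedelta
from typing import Dict, List

FLAP_WINDOW = timedelta(minutes=30)
FLAP_THRESHOLD = 3
FLAP_COOLDOWN = timedelta(hours=1)


class FlapDetector:
    def __init__(self) -> None:
        self._last_alert: Dict[str, datetime] = {}

    def should_alert(self, site: str, incident_starts: List[datetime], now: datetime) -> bool:
        """Return True at most once per cooldown period while a site keeps flapping."""
        recent = [t for t in incident_starts if now - FLAP_WINDOW <= t <= now]
        if len(recent) < FLAP_THRESHOLD:
            return False
        last = self._last_alert.get(site)
        if last is not None and now - last < FLAP_COOLDOWN:
            return False  # already told the humans; stay quiet
        self._last_alert[site] = now
        return True
```

The cooldown is tracked per site, so one flapping service doesn't suppress a flapping alert for a different one.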

Sticky-degraded alerting. A site that returns 200 OK with body {"status": "degraded"} is being honest: it's saying "I'm answering requests, but one of my subsystems is broken." Until today, Vigil's outage logic fired only for hard-down statuses (unhealthy, timeout, error). A site stuck at degraded indefinitely (like Tribe this morning, with its broken cache check) stayed yellow on the dashboard and never triggered anything. Now, after five consecutive degraded checks (about five minutes), Vigil fires a single "sustained degraded" alert, and a recovery alert once the underlying issue clears.
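
The counter behind that can be sketched as follows. The DegradedTracker class and the event names it returns are hypothetical, and treating only a healthy check as "recovery" is an assumption; the threshold of five consecutive checks and the fire-once behavior are as described above.

```python
from typing import Dict, Optional, Set

DEGRADED_THRESHOLD = 5  # consecutive degraded checks, about five minutes at 60s


class DegradedTracker:
    def __init__(self) -> None:
        self._streak: Dict[str, int] = {}
        self._alerted: Set[str] = set()

    def observe(self, site: str, status: str) -> Optional[str]:
        """Feed one check result; return an event name when an alert should fire."""
        if status == "degraded":
            self._streak[site] = self._streak.get(site, 0) + 1
            if self._streak[site] >= DEGRADED_THRESHOLD and site not in self._alerted:
                self._alerted.add(site)
                return "sustained_degraded"
            return None
        self._streak[site] = 0  # any non-degraded check breaks the streak
        if status == "healthy" and site in self._alerted:
            self._alerted.discard(site)
            return "recovered"
        return None
```

State lives in memory, so a restart of the checker resets the streaks; for a five-minute counter that trade-off seems acceptable.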

Both of these are behaviors you'd expect a well-designed monitoring system to have, and neither comes for free in the major vendor platforms — they give you rule-authoring primitives, and you build them yourself. We built them ourselves inside Vigil instead.

Chronicle — Services from the inside

Chronicle is the OpenTelemetry data plane. Every backend in the fleet — tools, services, admin dashboards — wires up OTel via a three-line shim (from chronicle_sdk import init_chronicle; init_chronicle(service_name="codex")) that auto-instruments FastAPI, httpx, SQLAlchemy, asyncpg, Redis, and NATS. Structured logs go to /v1/logs, metrics to /v1/metrics, traces to /v1/traces. Each signal carries service.name, deployment.environment, service.version — the three OTel resource attributes that actually matter — and a trace_id that correlates the three pillars by default.

Where Chronicle departs from its inspirations is AI curation. Log volume at Datadog is a tax on honesty: the more you log, the more you pay, so teams learn to log less. Chronicle scores every log line 0.0 to 1.0 for "interestingness" using a cheap model (Mercury 2). Sub-threshold lines are aggressively sampled; above-threshold lines retain full fidelity. Similar lines get a shared pattern_id, so instead of scrolling you filter. When an alert fires, a stronger model (Sonnet 4.6 via Bedrock) reads the correlated traces, logs, recent deploys, and similar past incidents and drafts a root-cause hypothesis before the on-call engineer reads the page. The draft isn't authoritative. It's a starting point. On-call reads it, agrees or argues, and gets to mitigation in a few minutes instead of twenty.
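
To make the retention side concrete, here is a sketch of score-based sampling and pattern grouping. Everything in it is an assumption made for illustration: the threshold, the sample rate, the masking regex, and the hashing are stand-ins, and the model call that produces the score is left out entirely (the score is just an input).

```python
import hashlib
import re

SCORE_THRESHOLD = 0.5  # assumed cut-off for "interesting"
SAMPLE_RATE = 0.05     # assumed fraction of sub-threshold lines kept


def pattern_id(line: str) -> str:
    """Mask long hex tokens and numbers so lines that differ only in values share an ID."""
    template = re.sub(r"\b[0-9a-f]{8,}\b|\d+", "<*>", line.lower())
    return hashlib.sha1(template.encode()).hexdigest()[:12]


def keep(line: str, score: float) -> bool:
    """Keep every line at or above the threshold; sample the rest deterministically."""
    if score >= SCORE_THRESHOLD:
        return True
    # Map the line to a stable bucket in [0, 1) so the decision is repeatable.
    bucket = int(hashlib.sha1(line.encode()).hexdigest()[:8], 16) / 0x1_0000_0000
    return bucket < SAMPLE_RATE
```

Hashing the line instead of calling a random number generator means the same line gets the same keep-or-drop decision on every replica, which makes sampled data easier to reason about.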

The important thing about Chronicle for this story isn't any of that. It's that Chronicle is Vigil's data source. Vigil is now a presentation layer sitting on top of Chronicle's uptime probes, with its own opinionated dashboard for the "is my service up?" question. Chronicle answers the harder questions — "why is my p99 latency creeping up on Tuesday afternoons?", "which endpoint generates 40% of our error budget burn?" — and exposes every one of those queries as an MCP tool a Claude agent can call autonomously.

Docket — Where findings become work

Docket is Trello-shaped issue tracking with a Kanban board per tool. Cards flow through columns (To Do / In Progress / In Review / Done / Blocked); each card has comments, labels, attachments, activity history, and an MCP surface that lets Claude read, create, move, and comment on cards.

The observability role: Docket is where automated findings turn into actionable work. BugReporter submissions file cards here. Chronicle's recurring-error detector can auto-file a card when a specific error signature crosses a threshold. If you resolve an alert by identifying a bug that needs fixing, "open a Docket card" is a link-away — often from a button in Chronicle's incident pane. The on-call process isn't "identify → fix → write it up later"; the ticket exists from the moment the alert fires, gets attached to the investigation notes live, and closes when the fix ships.

Chirp — Where humans get told

Chirp is our self-hosted Slack replacement. Three channels do the observability heavy lifting:

#alerts — high-priority fires. Incident opened, incident resolved, flap detected, sustained degraded. The on-call engineer's Chirp notifications are wired to this channel only, so they can close every other channel and still not miss a page.
#deploys — pipeline successes. Every tool that ships to production posts here. The signal is "did the latest push actually land," not "do I need to act."
#noise — everything else. Test firings, low-severity warnings, chatty integrations. Muted by default, available when you want it.

Chirp has a Slack-compatible inbound-webhook API, so any of our tools can post a message to a channel with a POST + JSON body. The real power isn't that pattern — every chat platform has it — it's that humans who get paged on mobile at 2 AM can reply to the alert in the thread, and that reply goes directly into the Claude Code session investigating the incident through Chirp's agent bridge. You ack a page by typing "check the RDS failover first" into your phone from bed, and the running agent sees your note in its next prompt cycle.

The wiring

Six tools, each with its own repo, deployment, and dashboard. If they didn't talk to each other, they'd be six silos — Datadog with extra steps. What makes this a stack is the wiring between them.

A short catalog of actual cross-tool integrations live in production right now:

Vigil → Chirp. Vigil's CheckEngine POSTs Slack-compatible messages to a Chirp inbound webhook on incident-opened, incident-resolved, flap-detected, and sustained-degraded events. One channel (#alerts), four event types, rate-limited so flap-spam and alert storms don't happen. We shipped this today.
Chronicle → Vigil. The uptime probe data behind Vigil's dashboard lives in Chronicle. The two aren't redundant: Chronicle is the raw facts, Vigil is the curated fleet-status presentation.
Chronicle → Docket. Recurring error signatures auto-file Docket cards on the owning tool's board, linked back to the Chronicle query that detected them.
Chronicle → Chirp. Alerts on metric expressions post to #alerts. SLO burn-rate alerts post to the same. The AI-drafted RCA lands in the thread under the alert, so the on-call reads the hypothesis before opening Chronicle.
BugReporter → Docket. User bug reports file as cards on the originating tool's Docket board, with URL / user / trace_id / user-agent already attached.
Pulse → Chronicle. Browser sessions propagate a session_id. Backend services stamp it onto their trace spans. A Pulse funnel drop or RUM error links to the Chronicle trace for the failing request.
Chirp → Claude Code. Messages posted to incident threads get injected into the running Claude Code session on the next hook cycle. The on-call can steer an agent-in-flight without typing into a terminal.
Docket ↔ Chirp. Card state changes post to #noise by default (or a configured per-board channel), so you see the work graph moving in chat. Replying to the card message adds a comment to the card. The Docket board is the source of truth; Chirp is the spoken interface.
Fleet → Fulcrum. MTTR, alert fatigue, and change-failure-rate become leverage records in Fulcrum. We know, per week, how much human time our observability stack saves us.

None of those individual connections is a trick. Inbound webhooks, trace-context propagation, shared IDs. What's unusual is that all six tools are things we wrote, which means we can add a connection any time the need comes up. This morning the need was "I want a phone notification when Tribe stays degraded for five minutes." Two hundred lines of code and one SSM parameter later, we have it.

Today's incident, end to end

Here's the specific story. Every piece of this happened between 23:30 and 00:17 UTC tonight, and it touched most of the stack.

23:30 — I notice on the Vigil dashboard that Tribe's API card has been yellow for hours (it had, unbeknownst to me, been degraded since its last deploy). Separately, Courier API is alternating between green and red every minute or two.

23:34 — A Chronicle query confirms both. Tribe's /health response body has had "status": "degraded" for four hours, with a sub-field saying "Connection refused to localhost:6380." The Valkey that Tribe's cache check was probing didn't exist inside the Tribe container; VALKEY_URL was defaulting to a dev-mode placeholder because nobody ever wired the prod SSM parameter. Courier's /health was returning 200 OK, but Vigil's probe was timing out at ten seconds, sometimes returning in 9.9 seconds (just under the threshold, counted as HEALTHY) and sometimes at 10.1 (TIMEOUT). A direct curl against the same endpoint from the same EC2 instance: two milliseconds. The same httpx.get() from inside the Vigil container: thirty milliseconds. Only the shared AsyncClient in the running CheckEngine was stuck. Classic poisoned-keepalive symptom.

23:40 — Fix one: a new SSM parameter /renkara/tribe/valkey-url pointing at the real ElastiCache, three lines in Tribe's deploy script to fetch and inject it, git push.

23:49 — Fix two: in Vigil's CheckEngine, change max_keepalive_connections=20 to =0. Every probe now opens a fresh TCP connection. Small TLS-handshake cost; problem class eliminated.

23:54 — Both pipelines green, both containers redeployed, Vigil starts reporting all four monitored Tribe and Courier sites as HEALTHY on every check. The dashboard turns green across the row.

Here is where the stack failed us, and where we fixed the stack itself. Up to this point, I'd caught both bugs by chance — I happened to look at the dashboard. If I hadn't, nothing would have alerted anyone. Both of these bugs had existed for hours, and nothing was telling us.

00:03 — Open the Vigil repo, scope the feature. The fleet already has a Chirp workspace with an #alerts channel. Chirp has a Slack-compatible inbound webhook endpoint. All we need is a notifier service in Vigil, an SSM param for the webhook URL, and hooks in the incident lifecycle to fire on open, resolve, flap, and sustained-degraded.

00:14 — Post the code. app/services/chirp_notifier.py — fire-and-forget HTTP client, never raises (a Chirp outage can't break the checker). Modify app/services/incident_service.py to call the notifier on each of the four event types. Add a rolling-30-minute-window flap count query against the existing incidents table (no schema change). Add an in-memory consecutive-degraded-count per site that fires once at five checks and stays quiet until recovery.
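
The "never raises" property is the part worth showing. What follows is a minimal sketch, not the contents of chirp_notifier.py: it is synchronous and uses only the standard library for brevity, and the function names and the injectable send parameter are hypothetical. The payload shape is the Slack-compatible {"text": ..., "username": ...} body used in the test message.

```python
import json
import logging
import urllib.request
from typing import Callable

log = logging.getLogger("chirp_notifier")


def build_request(webhook_url: str, text: str, username: str = "Vigil") -> urllib.request.Request:
    """Build the Slack-compatible JSON POST for a Chirp inbound webhook."""
    body = json.dumps({"text": text, "username": username}).encode()
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def _send(request: urllib.request.Request) -> None:
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


def notify(webhook_url: str, text: str,
           send: Callable[[urllib.request.Request], None] = _send) -> bool:
    """Post an alert. Swallow and log every failure so the checker keeps running."""
    try:
        send(build_request(webhook_url, text))
        return True
    except Exception:
        log.warning("Chirp notification failed", exc_info=True)
        return False
```

Catching Exception this broadly is usually a smell, but here it is the point: a Chirp outage should cost us an alert, not the probe loop.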

00:17 — Test message posted through the webhook. {"text": "Vigil webhook test", "username": "Vigil"} to the Chirp inbound URL, 201 Created, message visible in the #alerts channel on my phone. The full plumbing works.

The incident took forty-seven minutes start to finish. More interesting than the fix itself is the chain that makes it reproducible: a Docket card for the feature would have been a natural artifact of that session, Chronicle would have captured the new alert traffic the moment the Vigil container restarted, a Fulcrum leverage record tracks the session's ratio of outcome to time, and the blog post you're reading right now describes the change for anyone else who'd benefit. The stack kept closing the loop on itself.

Why this shape

The alternative is an all-in-one vendor. Datadog, New Relic, Splunk, Grafana Cloud. They'd cover a lot of what we do; they'd charge on the order of $10K/month for a team our size; they'd leave us unable to wire their alerts to our work tracker, our chat, our CRM, our business tooling without writing glue code in Lambda that still costs us money to run. We have glue code too. Our glue code lives in the tools themselves, written in the same Python our engineers already read.

The cost argument is real but not the main one. The main one is that an observability product that doesn't know about your CRM, your time tracker, your email, your scheduled meetings, your customer deal pipeline, your product analytics, your user bug reports, and your company Slack is an observability product with a flat idea of "what's going wrong right now." Ours has all of that. When a chronic latency drift in the /api/v1/pages endpoint correlates with a drop-off on the "Create page" step of the signup funnel in Pulse, we can see it. When a customer emails a bug report that shows up in Courier, the incident response pulls the customer context from Tribe. When the on-call engineer closes a page on the couch on a Sunday, Cadence honors their on-call rotation, Meridian logs the time against the right project, Fulcrum records the leverage ratio for the session, Narrative turns the postmortem into a public blog post draft, Herald schedules the next week's incident digest for subscribers. Nobody's observability vendor does that, because nobody's observability vendor is the rest of their business tooling.

Ours is.

What's next

Two gaps still to close, both in the "we haven't gotten to it yet" category:

Paging from #alerts. Right now, alerts land in Chirp. If you're not in Chirp (phone off, asleep, on a plane), you miss them. The next piece is a Cadence-integrated on-call shift lookup that escalates unacknowledged #alerts posts via Courier → SMS or push, with a five-minute grace window before escalation. The pieces are all there; the wire-up is a day of work.
BugReporter → Chronicle session linking on the client side. Bug reports capture trace_id if the tool stamps it; they don't yet capture the Pulse session_id. Adding it is trivial (the library can already read the session identifier from local storage); we just haven't published the version. When we do, every user-submitted bug report will come with a replayable session context.

Neither is hard. Both are the kind of small integrations you only ever get around to building if the tools are yours.

The monitoring stack, like the rest of the Renkara tool fleet, has a trajectory that's simple to describe: every month, the tools get a little smarter about each other. One more connection. One more shared ID. One more automatic thing the stack now handles without a human touching it. Two hundred lines of Python at midnight turns into five fewer pages next month, and that ratio is the whole point.