
Building a Runbook That's Actually Searchable


Most runbooks have the same hidden author: the senior engineer who resolved the incident at 2 AM. That engineer knows the system deeply enough that the runbook they write is full of implicit context — it assumes you know which cluster "the prod cluster" refers to, which version of the deployment script is current, and what "restart the service" means in an environment with seven services that all match that description.

The result is a document that reads perfectly well to its author and to anyone who was in the same on-call rotation. To a junior engineer hitting the same alert for the first time six months later, it reads like half a conversation. And when they search for it, they may not find it at all, because they're searching with different words than the title uses.

Writing a runbook that works for both audiences — expert and non-expert, author and searcher — requires thinking about structure and vocabulary in ways that most runbook writing guides don't address.

The two audiences a runbook must serve

The person who writes a runbook and the person who first uses it cold are almost never the same person. The author has mental context that doesn't make it onto the page: they know why the procedure works, what the normal state looks like, and what to do if the documented steps don't resolve the issue. The first-time reader has none of that.

A well-structured runbook handles this by being explicit about things the author considers obvious. Not just "restart the kafka-consumer service," but which host it runs on, which systemd unit corresponds to it, what a successful restart looks like, and how long to wait before checking whether the problem is resolved. The expert reader can skim these details. The first-time reader needs them.

The second audience is the searcher — the person trying to find the runbook at all. This person may not know the runbook exists. They know what they're experiencing: an alert, an error message, a symptom described by a user. The runbook needs to be written so that a search query starting from any of those symptom descriptions lands on it.

Symptom-first titles and aliases

The most common structural mistake in runbooks is naming them after the system rather than the problem. "Kafka Consumer Service Runbook" is accurate but does nothing to help someone searching for "consumer lag alert" or "messages not processing" or "CONSUMER_LAG_HIGH alert." Those are the queries that come in at 3 AM.

Title your runbooks from the searcher's perspective first. "What to do when consumer lag alerts fire" is findable. "Kafka Consumer Service Runbook" is not, unless the searcher already knows to look for it.

For documents you can't retitle (existing pages in shared spaces, or runbooks with established internal names), add an explicit aliases or "also triggered by" section near the top. This serves two purposes. For human searchers in a keyword-indexed wiki, having the symptom text in the document increases the chance of a match. For semantic retrieval systems, the vocabulary in the document shapes the embedding — more symptom descriptions mean more conceptual surface area, making the document retrievable from more query angles.

A practical format:

Title: Kafka Consumer Lag — Alert Response Runbook
Triggered by: CONSUMER_LAG_HIGH alert, messages not being processed,
  consumer group stuck, partition offset not advancing, dead letter queue
  filling, downstream system reporting stale data

That small addition lets the document match queries phrased as an alert name, a symptom, or a downstream effect, which widens the retrieval surface for both keyword and semantic search.
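To make the keyword half of that claim concrete, here is a minimal sketch of an all-terms-must-match search, which is only an assumption about how a keyword-indexed wiki might behave. The `tokens` and `keyword_match` helpers are hypothetical, and the sample text is adapted from the format above.

```python
import re

def tokens(text):
    """Lowercase word tokens; underscores are kept so alert names stay whole."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def keyword_match(query, document_text):
    """True when every query token appears somewhere in the document text."""
    return tokens(query) <= tokens(document_text)

# The system-named title on its own, versus a title plus a "Triggered by" block.
title_only = "Kafka Consumer Service Runbook"
with_aliases = """Kafka Consumer Lag - Alert Response Runbook
Triggered by: CONSUMER_LAG_HIGH alert, messages not being processed,
consumer group stuck, partition offset not advancing, dead letter queue
filling, downstream system reporting stale data"""

for query in ["CONSUMER_LAG_HIGH alert", "consumer group stuck", "stale data"]:
    print(query, keyword_match(query, title_only), keyword_match(query, with_aliases))
```

Each of the three queries misses the title-only text and hits the version with aliases. Under this exact-token assumption, "messages not processing" would still miss, because the alias says "processed". That is a reason to list the phrasings people actually type rather than a single canonical one.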

Concrete scenario: the on-call engineer who's never seen this alert

Imagine a scenario: a DevOps team at a growing fintech company has a runbook for their payment processor reconciliation service. The runbook was written by the team's lead infrastructure engineer, who has deep context about the service. The lead engineer is on vacation. A junior engineer receives a PagerDuty alert at 11 PM for "reconciliation batch job failed." They search the company's internal wiki for "reconciliation batch job failed" — no results. They search for "payment reconciliation" — three results, none of which are the runbook. They search for "batch job failure" — seven results, all unrelated.

The runbook exists. It's titled "Recon Service Ops Playbook." The junior engineer's search vocabulary and the document's title have no overlap, so they end up paging the lead engineer, who is on vacation.

This scenario is preventable. The runbook needs to include the alert name exactly as it appears in PagerDuty: "reconciliation batch job failed" is the paging condition, so that string should appear verbatim in the document. It also needs a section covering what each category of failure looks like to a person who hasn't seen it before.
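One way to catch this gap before an incident is a simple check that every paging condition appears verbatim in at least one runbook. The sketch below assumes you can export alert names and runbook text as plain strings. The function name and the runbook body text are hypothetical, and only the alert string and title come from the scenario above.

```python
def alerts_without_runbook(alert_names, runbooks):
    """Return alert names that do not appear verbatim in any runbook body.

    Matching is a case-insensitive substring search, roughly what an
    exact-phrase wiki search would need in order to hit.
    """
    bodies = [body.lower() for body in runbooks.values()]
    return [
        name for name in alert_names
        if not any(name.lower() in body for body in bodies)
    ]

# Hypothetical data: the body text below is invented for illustration.
alert_names = ["reconciliation batch job failed"]
runbooks = {
    "Recon Service Ops Playbook": "Procedures for the recon service nightly run.",
}

print(alerts_without_runbook(alert_names, runbooks))  # the alert is reported as uncovered

# Adding the paging condition verbatim closes the gap.
runbooks["Recon Service Ops Playbook"] += (
    "\nTriggered by: reconciliation batch job failed"
)
print(alerts_without_runbook(alert_names, runbooks))  # nothing left to report
```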

Structure that survives handoffs

Beyond findability, there's the question of usability under pressure. An engineer working an active incident at 2 AM reads differently than they would during normal working hours. Runbooks used in incident conditions need to front-load the critical information and separate "what to check first" from "background on why this happens."

A structure that works well for both findability and incident usability:

  • Alert / symptom name (verbatim). Exactly how the alert appears in your monitoring or ticketing system.
  • Severity and SLA. Is this P1 or P2? What's the expected resolution time?
  • Immediate triage steps. The first three things to check, with expected output. If you check X and see Y, continue. If you see Z, jump to section 4.
  • Remediation steps. Numbered, with commands or UI navigation paths spelled out. Include what success looks like.
  • Escalation path. If the steps above don't resolve the issue, who do you call? With what information in hand?
  • Background / context. Why does this alert fire? What is the component doing? This section is for the person who wants to understand, not just execute.
  • Last reviewed. Date and reviewer. Stale runbooks are dangerous; making the date visible creates social pressure to update.
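The structure above can also be checked mechanically. This is a minimal sketch, assuming each section heading starts its own line in the runbook text. The `missing_sections` helper and the draft content are hypothetical.

```python
REQUIRED_SECTIONS = [
    "Alert / symptom name",
    "Severity and SLA",
    "Immediate triage steps",
    "Remediation steps",
    "Escalation path",
    "Background / context",
    "Last reviewed",
]

def missing_sections(runbook_text):
    """Return required section names that no line in the text starts with."""
    lines = [line.strip().lower() for line in runbook_text.splitlines()]
    return [
        section for section in REQUIRED_SECTIONS
        if not any(line.startswith(section.lower()) for line in lines)
    ]

# Hypothetical draft with several sections not yet written.
draft = """Alert / symptom name: CONSUMER_LAG_HIGH
Severity and SLA: P2
Remediation steps
1. Restart the kafka-consumer service.
Last reviewed: (date, reviewer)
"""

print(missing_sections(draft))
```

For this draft the check reports the triage, escalation, and background sections as missing. It only confirms that a heading exists, not that the content under it is useful.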

The maintenance problem — and why you can't fully solve it

We're not suggesting that good runbook structure eliminates the maintenance burden. It doesn't. Systems change, procedures become outdated, and the most beautifully structured runbook is actively harmful if it describes a configuration that no longer exists.

Three approaches to runbook hygiene work in practice. Tie runbook review to the on-call rotation: anyone who uses a runbook owns updating it if they find an error. Set calendar reminders for quarterly review of the highest-severity runbooks. And use "last reviewed" visibility as a social signal: a runbook last touched two years ago should be treated with appropriate skepticism.
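The quarterly review can be backed by a small report of overdue runbooks. The sketch below treats "quarterly" as 90 days, which is an assumption, and uses hypothetical titles and review dates. `stale_runbooks` is an illustrative helper, not part of any wiki's API.

```python
from datetime import date, timedelta

def stale_runbooks(last_reviewed, today, max_age=timedelta(days=90)):
    """Return (title, age_in_days) for runbooks older than max_age, oldest first."""
    stale = [
        (title, (today - reviewed).days)
        for title, reviewed in last_reviewed.items()
        if today - reviewed > max_age
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)

# Hypothetical review dates, keyed by runbook title.
last_reviewed = {
    "Kafka Consumer Lag - Alert Response Runbook": date(2024, 5, 1),
    "Recon Service Ops Playbook": date(2022, 6, 1),
}

for title, age in stale_runbooks(last_reviewed, today=date(2024, 6, 1)):
    print(f"{title}: last reviewed {age} days ago")
```

Passing `today` explicitly keeps the function easy to test. A scheduled job would pass `date.today()`.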

Search and retrieval improvements help here too. A semantic search layer that also surfaces document metadata — including last-modified date — lets the searcher immediately assess whether what they found is likely current. This is not a substitute for maintenance, but it gives engineers the context to decide whether to trust a retrieved document before acting on it.

Findability is not a cosmetic improvement

Runbooks that can't be found during incidents are operationally equivalent to runbooks that don't exist. The investment in writing thorough, structured procedures only pays off if the person who needs them can retrieve them under pressure, in time to act on them. That means thinking about how the document will be found — not just how it will read once it is.

Structure for the searcher. Write for the person who doesn't know the system. Make the alert name and symptom language explicit. And build a retrieval system that bridges the gap between how people describe problems and how documentation titles them.
