Observability · Chapter 42 of 51

Observability Driven Development

Akhil Sharma
20 min

Observability-Driven Development (ODD) - Building Observable Systems from Day One

Audience: software engineers and architects who want to build systems that are debuggable, understandable, and operationally excellent.

This article assumes:

  • Observability is not an afterthought - it shapes how you design systems.
  • The hardest bugs to fix are the ones you can't reproduce or understand.
  • Production is the only environment where real failures happen.
  • Your code will be maintained by someone else (maybe future you) who doesn't understand it.

[CHALLENGE] Challenge: You can't debug what you can't observe

Scenario

3 AM page: "Checkout is failing for some users."

Your debugging session:

  • You: "What error do they see?"
  • On-call: "Just 'Something went wrong'"
  • You: "Check the logs"
  • On-call: "Which service? There are 12 in the checkout flow"
  • You: "Check them all"
  • On-call: "Nothing obvious in any of them"
  • You: "Is there a trace?"
  • On-call: "We don't have tracing"
  • You: "Check metrics"
  • On-call: "Which metric? We have 500"
  • You: [4 AM] Still debugging, no progress

The system works 99% of the time. But that 1% is invisible.

Interactive question (pause and think)

What failed here?

  1. The on-call engineer wasn't skilled enough
  2. The system wasn't designed to be observable
  3. Need better monitoring tools
  4. This is just how distributed systems work

Take 10 seconds.

Progressive reveal (question -> think -> answer)

Answer: (2) - the system wasn't designed with observability in mind.

Better tools (3) help, but they can't compensate for a system that doesn't emit useful signals.

Real-world analogy (car dashboard)

Imagine driving a car with no dashboard:

No observability: Car makes a weird noise. Is it engine, transmission, brakes? Open the hood and guess.

Basic observability: Dashboard shows: engine temp, RPM, speed. Better, but why is "check engine" light on?

Full observability: Dashboard shows: cylinder 3 misfiring, fuel injector clogged, O2 sensor voltage low. You know exactly what's wrong.

Observable systems are like cars with diagnostic ports - they tell you what's wrong.

Key insight box

Observability-Driven Development (ODD) means designing systems to explain their own behavior, not just work correctly.

Challenge question

If observability is so important, why do most teams add it after the system is built?


[MENTAL MODEL] Mental model - Observable systems explain themselves

Scenario

You're building a new payment processing service. Two approaches:

Traditional approach:

  1. Write code to process payments
  2. Deploy to production
  3. Add logging when bugs occur
  4. Add metrics when performance degrades
  5. Add tracing when you can't figure out why it's slow

ODD approach:

  1. Define what observability you need (before writing code)
  2. Write code that emits structured logs, metrics, traces
  3. Deploy with observability built-in
  4. Debug using the signals you designed
  5. Iterate based on what you learn

Interactive question (pause and think)

Which approach leads to better debuggability?

  A. Traditional (add observability reactively)
  B. ODD (design observability proactively)
  C. Doesn't matter, same end result
  D. Depends on the team

Progressive reveal

Answer: B - proactive observability is exponentially better.

Reactive observability is like adding airbags after a car crash. Proactive is designing the car with airbags from the start.

Mental model

Think of ODD as:

  • Test-Driven Development for operations: Define how you'll debug before writing code
  • Design by contract for runtime: System makes promises about what it will tell you
  • Defensive programming meets observability: Anticipate what will go wrong and instrument for it

The mindset shift: stop asking only "does it work?" and start asking "when it breaks, will the system tell me why?"
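One way to capture the shift, as an illustrative sketch (the wording here is mine, not a standard):

```yaml
traditional_mindset:
  question: "Does the code work?"
  debugging: "Add logs after the incident"
odd_mindset:
  question: "Will I be able to tell why it isn't working at 3 AM?"
  debugging: "Design the signals before writing the code"
```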

Key insight box

ODD treats observability as a first-class design concern, like security or performance, not an operational afterthought.

Challenge question

You're in a code review. The PR has perfect logic but zero observability. Do you approve it?


[WARNING] The three pillars are necessary but not sufficient

Scenario

Your team says: "We have metrics, logs, and traces. We're observable!"

Then an incident happens. You have:

  • 10,000 metrics (which one matters?)
  • 1TB of logs per day (needle in haystack)
  • 1 million traces (which trace explains the bug?)

You're drowning in data but starving for insight.

Beyond the three pillars

Raw telemetry is necessary but not sufficient. On top of metrics, logs, and traces you need context (which request, which user), structure (machine-parseable events), and meaning (signals tied to business outcomes).

The context problem

A log line that says "payment failed" tells you almost nothing. Without the request, user, and order that produced it, you cannot connect the event to anything else in the system.
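A minimal sketch of the difference, using the standard library's log/slog. The field names and IDs (request_id, u-42, and so on) are illustrative, not from any real system:

```go
package main

import (
	"log/slog"
	"os"
)

// paymentFailureAttrs bundles the context a responder needs on every event.
func paymentFailureAttrs(requestID, userID, orderID string, amountCents int) []any {
	return []any{
		"request_id", requestID,
		"user_id", userID,
		"order_id", orderID,
		"amount_cents", amountCents,
	}
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Without context: which user? which order? which request?
	logger.Error("payment failed")

	// With context: the same event, now debuggable.
	logger.Error("payment failed", paymentFailureAttrs("req-8f14", "u-42", "ord-1001", 4599)...)
}
```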

The structure problem

Free-text log lines can only be grepped. Structured events can be filtered, grouped, and joined by field, which is what incident debugging actually requires.
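A quick illustration in the shell; the log lines are invented for the example:

```shell
# Unstructured: one brittle regex per question, broken by any wording change.
echo 'ERROR payment failed for user 42 order 1001' | grep -o 'user [0-9]*'

# Structured: every question becomes a field lookup (here simulated with grep;
# real pipelines would use a log query engine or jq).
echo '{"level":"error","msg":"payment failed","user_id":"42","order_id":"1001"}' \
  | grep -o '"user_id":"[0-9]*"'
```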

Key insight box

The three pillars give you data. Context, structure, and meaning turn that data into understanding. ODD designs for all of these layers.

Challenge question

You have perfect metrics, logs, and traces. But during an incident, you still can't figure out the root cause. What's missing?


[DEEP DIVE] Designing observable code - practical patterns

Scenario

You're writing a new feature: "Apply discount code to shopping cart."

How do you make it observable?

Pattern 1: Structured logging with context


Pattern 2: Metrics at decision points

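A sketch of metrics at decision points. A plain in-memory map stands in for a real metrics client such as Prometheus; the metric names are illustrative:

```go
package main

import "fmt"

// counters is a toy stand-in for a real metrics registry.
var counters = map[string]int{}

func inc(name, labels string) {
	counters[name+"{"+labels+"}"]++
}

// applyDiscount increments a labeled counter at every branch, so aggregate
// accept/reject rates are visible without reading a single log line.
func applyDiscount(code string) bool {
	inc("discount_attempts_total", "")
	if code != "SAVE10" {
		inc("discount_rejected_total", `reason="unknown_code"`) // low-cardinality label
		return false
	}
	inc("discount_applied_total", `code="SAVE10"`)
	return true
}

func main() {
	applyDiscount("SAVE10")
	applyDiscount("BOGUS")
	fmt.Println(counters)
}
```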

Pattern 3: Distributed tracing at boundaries


Pattern 4: Error enrichment


Key insight box

Observable code emits signals at every decision point: logs for events, metrics for aggregates, traces for causality, structured errors for debugging.

Challenge question

How much observability is too much? When does instrumentation become noise?


[PUZZLE] The correlation problem - linking signals across pillars

Scenario

User reports: "My order failed at 2:45 PM."

You have:

  • Logs showing errors at 2:45 PM (but which user?)
  • Metrics showing error rate spike at 2:45 PM (but why?)
  • Traces from 2:45 PM (but which trace is theirs?)

How do you connect the dots?

Think about it

You need a way to correlate logs, metrics, and traces for the same user request.

Interactive question (pause and think)

What's the key to correlation?

  A. Use the same timestamp
  B. Use the same request ID / trace ID
  C. Store everything in one database
  D. Hope you get lucky

Progressive reveal

Answer: B - correlation IDs are the glue.

Correlation ID strategy


Unified observability with correlation


Exemplar: bridging metrics and traces

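A toy sketch of the exemplar idea: a latency metric that remembers the trace_id of its slowest observation, so a spike on a dashboard links to one concrete trace. The histogram here is a stand-in; Prometheus supports exemplars natively:

```go
package main

import "fmt"

type exemplar struct {
	value   float64
	traceID string
}

type histogram struct {
	count   int
	sum     float64
	slowest exemplar
}

// observe records a latency and keeps the slowest observation's trace_id
// as the exemplar to jump to from the metric.
func (h *histogram) observe(seconds float64, traceID string) {
	h.count++
	h.sum += seconds
	if seconds > h.slowest.value {
		h.slowest = exemplar{seconds, traceID}
	}
}

func main() {
	var h histogram
	h.observe(0.02, "tr-aaa")
	h.observe(1.75, "tr-bbb") // the outlier worth investigating
	h.observe(0.03, "tr-ccc")
	fmt.Printf("count=%d avg=%.2fs exemplar_trace=%s\n",
		h.count, h.sum/float64(h.count), h.slowest.traceID)
}
```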

Key insight box

Correlation IDs (request_id, trace_id, user_id) are the thread that connects logs, metrics, and traces. Without them, you have isolated data islands.

Challenge question

Your microservices span 3 languages (Go, Python, Java). How do you ensure consistent correlation ID propagation across all of them?


[DEEP DIVE] ODD in practice - building an observable feature end-to-end

Scenario

You're tasked with building: "Real-time inventory reservation system."

Requirements:

  • Reserve inventory when user adds item to cart
  • Release reservation after 15 minutes if not purchased
  • Must handle 10K requests/second
  • Need to debug: reservation leaks, double bookings, performance issues

Step 1: Define observability requirements FIRST

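An illustrative requirements sketch (the metric and event names are examples to adapt, not a standard):

```yaml
feature: inventory_reservation
slos:
  reserve_latency_p99_ms: 50
  reservation_leak_rate: 0            # reservations older than their TTL
metrics:
  - reservations_active               # gauge: current holds
  - reservations_total{outcome}       # counter: created|expired|converted|conflict
logs:
  - event: reservation_created        # fields: request_id, sku, quantity, ttl
  - event: reservation_conflict       # fields: request_id, sku
traces:
  - span: inventory.reserve           # attributes: sku, quantity
alerts:
  - reservations_active grows while conversions stay flat (leak)
```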

Step 2: Implement with observability built-in


Step 3: Add diagnostic endpoints

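A sketch of a diagnostic endpoint that exposes current reservations as JSON, so "is there a leak right now?" is answerable without attaching a debugger. The path and shape are illustrative:

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
	"time"
)

type reservation struct {
	SKU     string    `json:"sku"`
	Expires time.Time `json:"expires"`
}

var (
	mu   sync.Mutex
	held = map[string]reservation{} // keyed by request_id
)

// snapshot copies current state under the lock so the handler can't race writers.
func snapshot() map[string]reservation {
	mu.Lock()
	defer mu.Unlock()
	out := make(map[string]reservation, len(held))
	for k, v := range held {
		out[k] = v
	}
	return out
}

func main() {
	http.HandleFunc("/debug/reservations", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(snapshot())
	})
	// http.ListenAndServe(":8080", nil) // left commented in this sketch
}
```

In production such endpoints should be internal-only; they expose operational state, not a public API.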

Key insight box

ODD means designing observability INTO the feature from day one, not bolting it on after production issues surface.

Challenge question

You've built perfect observability into your feature. But other teams don't follow these patterns. How do you scale ODD across an organization?


[WARNING] Common ODD anti-patterns to avoid

Scenario

Your team claims to practice ODD, but observability is still poor during incidents.

Anti-patterns catalog

Anti-pattern 1: Log everything (noise)

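A sketch of the anti-pattern and its fix, using log/slog; the event and field names are illustrative:

```go
package main

import (
	"log/slog"
	"os"
)

// discountEvent gathers everything about one operation into a single event.
func discountEvent(requestID, code string, percent, durationMS int) []any {
	return []any{"request_id", requestID, "code", code, "percent", percent, "duration_ms", durationMS}
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Anti-pattern: narrating every step buries the one line that matters.
	//   logger.Info("entering applyDiscount")
	//   logger.Info("validating code")
	//   logger.Info("code validated")
	//   logger.Info("exiting applyDiscount")

	// Better: one rich structured event per operation.
	logger.Info("discount applied", discountEvent("req-8f14", "SAVE10", 10, 4)...)
}
```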

Anti-pattern 2: Metrics without purpose

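A sketch of the test I apply to every metric: what incident question does it answer? The types and names below are my own illustration:

```go
package main

import "fmt"

type metric struct {
	name     string
	question string // the incident question this metric answers; empty = vanity
}

// keepActionable filters out metrics nobody would ever query in an incident.
func keepActionable(ms []metric) []metric {
	var out []metric
	for _, m := range ms {
		if m.question != "" {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	ms := []metric{
		{"checkout_errors_total", "is checkout failing right now?"},
		{"goroutines_started_total", ""}, // vanity: no question, no alert, no dashboard
		{"checkout_latency_p99", "are users waiting too long?"},
	}
	for _, m := range keepActionable(ms) {
		fmt.Println(m.name, "->", m.question)
	}
}
```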

Anti-pattern 3: Inconsistent naming

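A sketch of one naming convention (lowercase snake_case with unit suffixes, in the Prometheus style) enforced by a tiny checker. The convention is the point, not this particular code:

```go
package main

import (
	"fmt"
	"strings"
)

// conforms checks a metric name against the convention: lowercase
// snake_case ending in a unit suffix or _total.
func conforms(name string) bool {
	if name != strings.ToLower(name) || strings.Contains(name, "-") {
		return false
	}
	for _, suffix := range []string{"_total", "_seconds", "_bytes", "_ms"} {
		if strings.HasSuffix(name, suffix) {
			return true
		}
	}
	return false
}

func main() {
	// Three teams, three spellings of the same idea -- ungraphable together:
	for _, n := range []string{"CheckoutErrors", "checkout-error-count", "checkout_errors_total"} {
		fmt.Printf("%-22s conforms=%v\n", n, conforms(n))
	}
}
```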

Anti-pattern 4: Dropped context


Anti-pattern 5: PII in logs

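A sketch of the fix: never log raw PII; log a stable pseudonymous token instead, so one user's events can still be correlated. Plain SHA-256 is shown for brevity; a real system should use a keyed hash (HMAC) so the mapping cannot be brute-forced:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// pseudonymize returns a short, stable token for a PII value.
func pseudonymize(pii string) string {
	sum := sha256.Sum256([]byte(pii))
	return hex.EncodeToString(sum[:])[:12]
}

func main() {
	email := "jane@example.com"
	// Anti-pattern: logger.Info("order placed", "email", email)
	fmt.Println("order placed user_hash=" + pseudonymize(email)) // correlatable, not identifying
}
```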

Key insight box

ODD anti-patterns: excessive logging (noise), vanity metrics (unused), inconsistent naming (chaos), dropped context (broken correlation), PII exposure (compliance violation).

Challenge question

Your logs accidentally contained customer emails. How do you scrub this PII from historical data?


[SYNTHESIS] Final synthesis - ODD transformation for a team

Synthesis challenge

You're the engineering lead for 15 developers building a fintech app.

Current state:

  • No distributed tracing
  • Unstructured text logs (grep debugging)
  • Metrics exist but unused
  • Incidents take 4+ hours to debug
  • Developers hate on-call

Requirements:

  • Reduce MTTR from 4 hours to 30 minutes
  • Make on-call sustainable
  • Pass SOC2 audit (need audit trail)
  • Maintain development velocity

Your tasks (pause and think)

  1. Define ODD standards
  2. Create observability onboarding checklist
  3. Plan retrofit for existing services
  4. Measure ODD adoption
  5. Make ODD part of code review
  6. Justify investment to leadership

Write down your transformation plan.

Progressive reveal (solution)

1. ODD standards: structured JSON logs with request_id and trace_id everywhere, low-cardinality metrics with consistent names, and traces on every service boundary.
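An illustrative standards document (the field names and thresholds are examples to adapt):

```yaml
logging:
  format: json
  required_fields: [request_id, trace_id, service, level]
  pii: redact_or_hash
metrics:
  naming: snake_case with unit suffixes (_seconds, _total)
  cardinality: no user_id, email, or raw-URL labels
tracing:
  propagation: W3C traceparent across all services
  required_spans: every inbound and outbound network call
errors:
  rule: wrapped with operation and identifiers, never returned bare
```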

2. Observability checklist: a per-feature gate covering requirements, logging, metrics, tracing, runbook, and alerts, enforced in code review rather than after deploy.
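An illustrative onboarding checklist, in the form each PR template could carry:

```markdown
## Before merging a new feature
- [ ] Observability requirements written (SLIs, key events)
- [ ] Structured logs with request_id / trace_id
- [ ] Metrics at decision points (low cardinality)
- [ ] Spans on service boundaries with business attributes
- [ ] Runbook: "how do I debug this at 3 AM?"
- [ ] Alert defined if the feature is SLO-critical
```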

3. Retrofit plan: phase the work by pain, starting with correlation IDs and structured logs on the most incident-prone services, then tracing on the critical user paths, then the long tail.
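An illustrative phasing (timelines are examples, not estimates for any real team):

```yaml
phase_1_weeks_1_4:
  - add correlation-ID middleware to all edge services
  - switch the top-3 incident-prone services to structured logs
phase_2_weeks_5_8:
  - distributed tracing on the checkout and payment paths
  - delete unused dashboards and metrics
phase_3_weeks_9_12:
  - remaining services
  - observability linting in CI
```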

4. Adoption metrics: track MTTR, MTTD, the percentage of services with tracing and structured logs, and how often on-call can debug without escalating.
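An illustrative scorecard, reviewed monthly (baselines and targets are examples):

```yaml
mttr_minutes: {baseline: 240, target: 30}
mttd_minutes: {baseline: 60, target: 5}
services_with_tracing_pct: {baseline: 0, target: 100}
structured_log_coverage_pct: {baseline: 20, target: 100}
pages_debuggable_without_escalation_pct: {target: 80}
```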

5. Code review culture: observability questions carry the same weight as correctness; a PR without signals is incomplete, just like a PR without tests.
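An illustrative review gate, phrased as the questions reviewers ask:

```yaml
pr_checklist:
  - "Could a responder debug this change at 3 AM from its signals alone?"
  - "Is every error wrapped with operation and identifiers?"
  - "Do new metrics have an owner question and low-cardinality labels?"
  - "Is the request context threaded through every new call path?"
blocking: true   # missing observability fails review, same as missing tests
```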

6. Leadership justification: frame the investment as reduced downtime minutes, a built-in SOC2 audit trail, and sustainable on-call, with MTTR as the headline number.
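An illustrative business case outline (the figures are placeholders, not benchmarks):

```yaml
costs:
  tooling_and_storage: "budget line, per month"
  engineering_time: "~10% of sprint capacity for one quarter"
returns:
  mttr: "4h to 30min target; fewer customer-facing minutes of downtime"
  audit: "structured, correlated logs double as the SOC2 audit trail"
  retention: "sustainable on-call reduces attrition risk"
```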

Key insight box

ODD transformation is cultural change: measure impact (MTTR), enforce standards (code review), demonstrate ROI (business case).

Final challenge question

After 6 months, adoption is 60% (not 100%). Some teams resist, saying "observability slows us down." How do you overcome this?


Appendix: Quick checklist (printable)

ODD feature checklist:

  • Define observability requirements before coding
  • Add structured logging with context
  • Emit metrics at decision points
  • Instrument with distributed tracing
  • Propagate correlation IDs
  • Write debugging runbook
  • Create dashboard (if critical)
  • Define alerts (if SLO-critical)

Observability standards:

  • All logs structured JSON
  • All logs include request_id, trace_id
  • All metrics low-cardinality
  • All traces include business attributes
  • All errors logged with context
  • All PII redacted/hashed

Code review checklist:

  • Can I debug at 3 AM?
  • Errors logged with context?
  • Metrics emitted?
  • Context propagated?
  • No high-cardinality labels?
  • No PII exposure?

Team adoption:

  • Training for all engineers
  • Observability advocate per team
  • Automated linting
  • In definition of done
  • Monthly reviews

Measurement:

  • Track MTTR (target <30 min)
  • Track MTTD (target <5 min)
  • Track % services observable
  • Track on-call satisfaction
  • Track self-service debugging rate

Red flags:

  • MTTR increasing (observability degrading)
  • New features without observability
  • High-cardinality explosion
  • PII in logs/traces
  • Checklist bypassed
  • No log/metric/trace correlation

Key Takeaways

  1. Observability-driven development instruments code before deploying — not as an afterthought when things break
  2. Every new feature should ship with its SLIs defined — what does "working correctly" mean for this feature, in measurable terms?
  3. Structured events are more valuable than scattered log lines — one rich event per operation beats ten context-free log statements
  4. Test your observability in staging — verify that dashboards, alerts, and traces work before they need to work in a real incident
Chapter complete. Up next: OpenTelemetry and Distributed Tracing.