
Microservices Interview Questions for Senior Engineers (2026)

Top microservices interview questions with detailed answer frameworks covering service decomposition, inter-service communication, data management, and operational patterns for FAANG interviews.

20 min read · Updated Apr 19, 2026
interview-questions · microservices · system-design · senior-engineer

Why Microservices Knowledge Matters in Senior Engineering Interviews

Microservices architecture has become the dominant pattern for building large-scale applications at companies like Netflix, Amazon, and Uber. However, the industry has also learned hard lessons about the complexity microservices introduce. Senior engineering candidates are expected to understand not just the benefits but the costs, anti-patterns, and operational requirements of microservices.

Interviewers want to see that you can reason about when to use microservices versus a monolith, how to decompose services along the right boundaries, and how to handle the distributed systems challenges that microservices create. The best candidates demonstrate experience with the operational reality: deployment pipelines, observability, failure isolation, and the organizational dynamics that microservices enable and require.

For foundational reading, see our microservices architecture concepts, the system design interview guide, and our learning paths for structured preparation.

1. When should you use microservices versus a monolith?

What the interviewer is really asking: Can you make a pragmatic architectural decision instead of blindly following trends?

Answer framework:

Start with the controversial but correct take: most projects should start as a monolith. The monolith-first approach, advocated by Martin Fowler and practiced by companies like Shopify and Basecamp, lets you discover the right service boundaries through experience rather than guessing upfront.

Choose a monolith when: the team is small (under 10 engineers), the domain is not well understood (you do not know where the boundaries should be), you are optimizing for development speed (one deploy, one database, simple debugging), or the system does not need independent scaling of components.

Choose microservices when: you have multiple teams that need to deploy independently (organizational scaling), different components have fundamentally different scaling requirements (the video transcoding service needs GPUs, the user service needs memory), you need technology diversity (one service needs real-time processing in Go, another needs ML in Python), or you have a well-understood domain with clear bounded contexts.

The costs of microservices that are often underestimated: network latency replaces function calls (microseconds become milliseconds), data consistency across services is fundamentally harder than within a single database, debugging distributed systems requires sophisticated observability (distributed tracing, structured logging, metrics), deployment complexity increases dramatically (CI/CD for each service, environment management), and testing is harder (integration tests require running multiple services).

Discuss the modular monolith as a middle ground: a monolith with well-defined module boundaries, enforced through the build system (separate packages or modules). Each module has its own data access layer and clear APIs. This gives you the organizational benefits of clear boundaries without the operational complexity of distributed systems. When a module needs to become a service, the boundary is already defined.

Reference real-world examples: Amazon evolved from a monolith to microservices as they scaled. But Shopify runs one of the largest e-commerce platforms as a modular monolith. Netflix went all-in on microservices but has hundreds of engineers dedicated to platform tooling.

Follow-up questions:

  • How would you decompose a monolith into microservices?
  • What is the right size for a microservice?
  • How do you handle shared libraries and code reuse across microservices?

2. How do you determine service boundaries in a microservices architecture?

What the interviewer is really asking: Do you understand Domain-Driven Design and can you apply it to find natural service boundaries?

Answer framework:

The most common mistake is decomposing by technical layer (a service for the database layer, a service for the API layer) instead of by business capability (a service for orders, a service for payments, a service for inventory).

Use Domain-Driven Design (DDD) bounded contexts: identify the core domains of the business. Each bounded context has its own ubiquitous language (the same term might mean different things in different contexts: "order" means something different in the ordering context vs the shipping context). Each bounded context is a strong candidate for a service.

Evaluate coupling and cohesion: services should be loosely coupled (changing one service rarely requires changing others) and highly cohesive (everything within a service is closely related). If two services always deploy together, they should probably be one service. If a service handles too many unrelated responsibilities, it should probably be split.

Consider the team structure (Conway's Law): services should align with team boundaries. A service owned by a single team (two-pizza team, 5-8 engineers) can move at its own pace. If a service requires coordination between multiple teams for every change, the boundary is wrong.

Practical signals for splitting a service: deploys are slow because unrelated changes are bundled, different parts of the service have very different scaling needs, one part of the service changes rapidly while another is stable, or the codebase is so large that new engineers take months to become productive.

Anti-patterns in service decomposition: distributed monolith (services that are tightly coupled and must deploy in lockstep), nano-services (services so small that the overhead of inter-service communication outweighs the benefits), and shared database (multiple services reading from and writing to the same database tables, creating hidden coupling).

Discuss database design implications: each service should own its data. If service A needs data from service B, it should call service B's API, not read from service B's database. This ensures encapsulation and allows each service to evolve its data model independently.

Follow-up questions:

  • How do you handle entities that span multiple bounded contexts?
  • What do you do when you realize a service boundary is wrong?
  • How does the strangler fig pattern help with monolith decomposition?

3. How do microservices communicate, and how do you choose between communication patterns?

What the interviewer is really asking: Do you understand the trade-offs between synchronous and asynchronous communication and can you choose the right pattern for each interaction?

Answer framework:

Synchronous communication (request-response): HTTP/REST for simplicity and interoperability, gRPC for performance and type safety. Use when the caller needs an immediate response (user-facing request that must return data), the operation is short-lived (under 1 second), or the services are in the same data center (low network latency).

Asynchronous communication (event-driven): publish events to a message queue like Kafka or RabbitMQ. Use when the caller does not need an immediate response (order placed → send confirmation email), you need to decouple services (the publisher does not know or care who consumes the event), the operation is long-running (video transcoding), or you need to handle traffic spikes (the queue buffers load).

Event patterns: event notification (thin events that signal something happened, consumers call back for details if needed), event-carried state transfer (fat events that contain all the data consumers need, reducing coupling but increasing message size), and event sourcing (store the event as the source of truth, derive state from the event stream).
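
To make the first two patterns concrete, here is a minimal sketch in Java (the event names and fields are illustrative, not a prescribed schema) contrasting a thin notification event with a fat event-carried state transfer event:

    // Hypothetical event shapes for an "order placed" scenario.

    // Event notification: a thin signal. Consumers that need details
    // call back to the order service's API using the orderId.
    record OrderPlacedNotification(String orderId, java.time.Instant occurredAt) {}

    // Event-carried state transfer: the event carries everything a typical
    // consumer needs, so no callback is required -- at the cost of a larger
    // message and a schema that more consumers now depend on.
    record OrderPlacedEvent(
            String orderId,
            String customerId,
            String shippingAddress,
            java.util.List<LineItem> items,
            long totalCents,
            java.time.Instant occurredAt) {

        record LineItem(String sku, int quantity, long priceCents) {}
    }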

Service mesh for communication management: use a sidecar proxy (Envoy in Istio) to handle cross-cutting communication concerns: mutual TLS (encryption between services), retries with exponential backoff, circuit breaking (stop calling a failing service), load balancing (distribute requests across service instances), and observability (automatic metrics and tracing).

Discuss the API gateway pattern: external clients communicate with a single API gateway that routes requests to the appropriate internal services. This handles authentication, rate limiting, protocol translation, and response aggregation.

Common mistake: using synchronous communication for everything. This creates a chain of dependencies where one slow or failing service cascades failures through the entire system. Use async communication for anything that does not need an immediate response.

Follow-up questions:

  • How do you handle timeouts and retries in synchronous communication?
  • How do you ensure message ordering in asynchronous communication?
  • When would you choose gRPC over REST for inter-service communication?

4. How do you handle distributed transactions across microservices?

What the interviewer is really asking: Do you understand the data consistency challenge and the practical patterns for solving it?

Answer framework:

The fundamental problem: in a monolith, a single database transaction can atomically update multiple tables. In microservices, each service has its own database, and distributed transactions such as two-phase commit (2PC) are impractical at scale due to blocking, latency, and availability concerns.

The saga pattern is the standard solution. Decompose the distributed transaction into a sequence of local transactions, each within a single service. If any step fails, execute compensating transactions to undo previous steps.

Choreography-based sagas: each service publishes events after its local transaction. Other services react to events and perform their step. No central coordinator. Advantage: simple, decoupled. Disadvantage: hard to trace the overall flow, difficult to add new steps, circular dependencies can emerge.

Orchestration-based sagas: a central orchestrator service defines the transaction flow. It tells each service what to do and handles failures. Advantage: clear flow definition, easy to understand and modify. Disadvantage: the orchestrator is a single point of logic (not a single point of failure if properly designed).

Designing compensating transactions: not all operations can be perfectly reversed. You cannot unsend an email or unship a physical item. In these cases, use compensating business operations: send a cancellation email, issue a return label. Design your services with compensability in mind from the start.
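
A minimal orchestration-based saga sketch in Java (the step interface and service clients are hypothetical) showing forward steps and reverse-order compensation:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Each step runs a local transaction in one service; on failure, the
    // orchestrator runs the compensations for all completed steps in reverse.
    public class OrderSagaOrchestrator {

        interface SagaStep {
            void execute(String orderId);     // forward action
            void compensate(String orderId);  // compensating business operation
        }

        private final SagaStep[] steps;

        public OrderSagaOrchestrator(SagaStep... steps) {
            this.steps = steps;
        }

        public boolean run(String orderId) {
            Deque<SagaStep> completed = new ArrayDeque<>();
            for (SagaStep step : steps) {
                try {
                    step.execute(orderId);
                    completed.push(step);
                } catch (Exception failure) {
                    // Undo completed steps in reverse order. Compensations are
                    // business operations (cancel, refund), not literal rollbacks.
                    while (!completed.isEmpty()) {
                        completed.pop().compensate(orderId);
                    }
                    return false;
                }
            }
            return true;
        }
    }

In a real orchestrator, each saga instance's state is persisted so a crashed orchestrator can resume or compensate after a restart, and compensation calls need retries because they can fail too.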

The outbox pattern (critical for reliability): when a service needs to update its database AND publish an event, both must succeed or both must fail. Writing to the database and publishing to Kafka are two separate operations that cannot be wrapped in a single transaction. Solution: write the event to an outbox table in the same database transaction as the business data. A separate process (CDC with Debezium, or polling) reads the outbox and publishes to the message queue.
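
A minimal transactional outbox sketch in JDBC (table and column names are hypothetical). The business row and the event row commit in one local transaction; a separate relay (Debezium CDC or a poller) publishes the outbox rows to the message queue:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class OrderService {

        public void placeOrder(Connection conn, String orderId, String payloadJson)
                throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement insertOrder = conn.prepareStatement(
                         "INSERT INTO orders (id, status) VALUES (?, 'PLACED')");
                 PreparedStatement insertOutbox = conn.prepareStatement(
                         "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)")) {

                insertOrder.setString(1, orderId);
                insertOrder.executeUpdate();

                insertOutbox.setString(1, orderId);
                insertOutbox.setString(2, "OrderPlaced");
                insertOutbox.setString(3, payloadJson);
                insertOutbox.executeUpdate();

                conn.commit(); // both rows commit atomically, or neither does
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }

The relay marks or deletes outbox rows after a successful publish; consumers must still be idempotent because the relay provides at-least-once delivery.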

Discuss event-driven architecture as the foundation for sagas. Events are the communication mechanism between saga participants. The event store provides durability and replayability.

Follow-up questions:

  • How do you handle sagas that involve external systems (payment gateways, shipping providers)?
  • How do you debug a saga that failed halfway through?
  • What is the difference between semantic lock and compensating transaction approaches?

5. How do you handle service discovery in a microservices environment?

What the interviewer is really asking: Do you understand how services find each other in a dynamic environment where instances come and go?

Answer framework:

The problem: in a microservices environment, services are deployed across multiple instances that are created, destroyed, and moved dynamically (auto-scaling, rolling deploys, failure recovery). Hardcoding IP addresses is not viable.

Client-side discovery: services query a service registry (Consul, etcd, ZooKeeper) to find available instances of the target service. The client chooses which instance to call (load balancing on the client side). Netflix Eureka is a well-known example. Advantage: no additional proxy hop. Disadvantage: each client must implement discovery and load balancing logic.

Server-side discovery: services call a load balancer or API gateway that queries the registry and routes the request to an available instance. AWS ELB, Kubernetes Services, and API gateways work this way. Advantage: clients are simple (just call a single address). Disadvantage: additional network hop and a potential single point of failure.

Kubernetes-native discovery: Kubernetes Services provide a virtual IP (ClusterIP) that routes to healthy pods. DNS-based discovery (service.namespace.svc.cluster.local) lets services find each other by name. The kube-proxy handles load balancing. This is the most common approach in modern deployments.

Service mesh discovery: Istio and Linkerd use sidecar proxies that intercept all network traffic. The control plane maintains a service registry and distributes configuration to sidecars. This provides discovery, load balancing, retries, circuit breaking, and mTLS without changing application code.

Health checking: the service registry must know which instances are healthy. Services register with a health check endpoint. The registry periodically probes this endpoint and removes unhealthy instances. Discuss the difference between liveness checks (is the process running?) and readiness checks (is the service ready to accept traffic?).
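
A minimal sketch of separate liveness and readiness endpoints using the JDK's built-in HTTP server (the paths and readiness conditions are illustrative; Kubernetes probes or a registry health check would call these):

    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpServer;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Liveness: "is the process alive?"            -> restart the instance if it fails.
    // Readiness: "can I serve traffic right now?"  -> remove from load balancing if it fails.
    public class HealthEndpoints {
        static final AtomicBoolean ready = new AtomicBoolean(false); // flipped after startup work finishes

        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);

            server.createContext("/healthz", exchange -> respond(exchange, 200, "ok")); // liveness

            server.createContext("/readyz", exchange -> {                                // readiness
                boolean ok = ready.get(); // real checks: DB reachable, config loaded, caches warm
                respond(exchange, ok ? 200 : 503, ok ? "ready" : "not ready");
            });

            server.start();
            ready.set(true); // pretend initialization has completed
        }

        private static void respond(HttpExchange ex, int status, String body) throws IOException {
            byte[] bytes = body.getBytes();
            ex.sendResponseHeaders(status, bytes.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(bytes); }
        }
    }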

Discuss fault tolerance: what happens when the service registry itself is unavailable? Clients should cache the last known instance list. Registries should be highly available (replicated using consensus).

Follow-up questions:

  • How does service discovery work across multiple regions or data centers?
  • What is the difference between DNS-based and API-based service discovery?
  • How do you handle service discovery for non-Kubernetes workloads?

6. What is the circuit breaker pattern and when should you use it?

What the interviewer is really asking: Do you understand cascading failure prevention, one of the most critical patterns in microservices?

Answer framework:

The problem: in a chain of microservices (A calls B, B calls C), if service C is slow or failing, service B's threads are blocked waiting for C. B's resources are exhausted, and it becomes slow or fails. Service A, waiting for B, also fails. One failing service cascades through the entire system.

The circuit breaker pattern prevents this cascading failure by wrapping calls to external services in a circuit breaker object that monitors failures.

Three states: Closed (normal operation, requests pass through, failures are counted), Open (the circuit is tripped after a failure threshold, all requests immediately fail without calling the downstream service, returning a fallback response or error), and Half-Open (after a timeout, the circuit allows a limited number of test requests through. If they succeed, the circuit closes. If they fail, it reopens).

Configuration parameters: failure threshold (e.g., 50% of requests in the last 10 seconds), timeout duration (e.g., 30 seconds in open state before transitioning to half-open), and test request count (e.g., allow 3 requests in half-open state).
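
To show the state machine itself, here is a deliberately minimal circuit breaker sketch in Java. It uses a consecutive-failure count for brevity; production code would use a library or mesh configuration (see below) with sliding-window failure rates:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.Supplier;

    public class CircuitBreaker {
        enum State { CLOSED, OPEN, HALF_OPEN }

        private final int failureThreshold;  // consecutive failures before opening
        private final Duration openTimeout;  // how long to stay open before probing
        private State state = State.CLOSED;
        private int consecutiveFailures = 0;
        private Instant openedAt;

        public CircuitBreaker(int failureThreshold, Duration openTimeout) {
            this.failureThreshold = failureThreshold;
            this.openTimeout = openTimeout;
        }

        public synchronized <T> T call(Supplier<T> downstream, Supplier<T> fallback) {
            if (state == State.OPEN) {
                if (Duration.between(openedAt, Instant.now()).compareTo(openTimeout) < 0) {
                    return fallback.get();       // fail fast, do not touch the downstream
                }
                state = State.HALF_OPEN;         // timeout elapsed: allow a test request
            }
            try {
                T result = downstream.get();
                state = State.CLOSED;            // success closes (or keeps closed) the circuit
                consecutiveFailures = 0;
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                    state = State.OPEN;
                    openedAt = Instant.now();
                }
                return fallback.get();
            }
        }
    }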

Fallback strategies when the circuit is open: return cached data (stale but available), return a default response (e.g., a generic product recommendation instead of personalized), degrade gracefully (show a simplified UI without the failing feature), or return an error with a clear message (preferable to hanging indefinitely).

Implementation: use libraries like Resilience4j (Java), Polly (.NET), or implement in the service mesh (Istio circuit breaker configuration). In a service mesh, circuit breaking is configured per-route and handled by the sidecar proxy, no application code changes needed.

Combine with other resilience patterns: retries with exponential backoff (retry transient failures before tripping the circuit), timeouts (do not wait indefinitely for a response, a prerequisite for circuit breaking), bulkheads (isolate failure domains so that a failing dependency only affects the features that use it, not the entire service), and rate limiting (protect downstream services from being overwhelmed).

Discuss fault tolerance design: circuit breakers are reactive (they respond to failures). Proactive measures include load shedding (reject requests before the system is overwhelmed), health checks (detect problems before they cause failures), and capacity planning (ensure sufficient resources for expected load).

Follow-up questions:

  • How do you test circuit breaker behavior in a staging environment?
  • How do you set appropriate thresholds when the failure rate varies?
  • How does the circuit breaker pattern interact with retry logic?

7. How do you implement distributed tracing across microservices?

What the interviewer is really asking: Do you understand observability in a microservices architecture and how to debug problems that span multiple services?

Answer framework:

Distributed tracing tracks a request as it flows through multiple services, creating a trace that shows the full journey. Each service adds a span to the trace with timing information, metadata, and any errors.

Core concepts: a trace is an end-to-end journey of a request, a span is a unit of work within a service (HTTP request, database query, cache lookup), trace context (trace ID, span ID, parent span ID) is propagated through request headers (W3C Trace Context standard), and sampling determines which requests are traced (trace all errors, sample 1% of successful requests to manage volume).

Instrumentation approaches: manual (add tracing code around each operation, most control but most effort), auto-instrumentation (libraries automatically instrument HTTP clients, database drivers, and framework handlers), and service mesh (the sidecar proxy captures network-level spans automatically, no code changes needed).

The OpenTelemetry standard: the convergence of OpenTracing and OpenCensus into a single vendor-neutral standard. Use OTel SDKs for instrumentation, OTel Collector for data processing and routing, and any compatible backend for storage and visualization (Jaeger, Zipkin, Grafana Tempo, Datadog).
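
A minimal manual-instrumentation sketch with the OpenTelemetry Java API, assuming the OTel SDK and an exporter (for example, to an OTel Collector) are configured at startup; the span and attribute names are illustrative:

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.StatusCode;
    import io.opentelemetry.api.trace.Tracer;
    import io.opentelemetry.context.Scope;

    public class PaymentHandler {
        private static final Tracer tracer =
                GlobalOpenTelemetry.getTracer("payment-service");

        public void chargeCard(String orderId) {
            Span span = tracer.spanBuilder("charge-card").startSpan();
            try (Scope ignored = span.makeCurrent()) {   // work done here joins this trace
                span.setAttribute("order.id", orderId);
                callPaymentGateway(orderId);             // an auto-instrumented HTTP client would add a child span
            } catch (RuntimeException e) {
                span.recordException(e);
                span.setStatus(StatusCode.ERROR);
                throw e;
            } finally {
                span.end();                              // always end the span, even on failure
            }
        }

        private void callPaymentGateway(String orderId) { /* ... */ }
    }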

Practical usage: identify slow services in the request path, find the root cause of errors (which service returned the error?), understand service dependencies (build a service dependency graph from trace data), and detect performance regressions (compare trace durations across deployments).

Discuss the cost of tracing: traces consume storage (each span is a data point), network bandwidth (propagating context and sending spans), and processing (the collector pipeline). Sampling strategies manage this: head-based sampling (decide at the start of the trace, simple but misses important traces), tail-based sampling (decide at the end, can keep all error traces and slow traces, but requires buffering), and adaptive sampling (adjust sampling rate based on traffic volume).

Combine with metrics and logs (the three pillars of observability): correlate traces with metrics (identify which traced requests are causing metric anomalies) and logs (include the trace ID in every log entry so you can jump from a trace span to the relevant logs).

Follow-up questions:

  • How do you handle trace context propagation across asynchronous messaging?
  • How do you implement tail-based sampling?
  • What is the storage cost of distributed tracing at scale?

8. How do you handle configuration management across microservices?

What the interviewer is really asking: Can you manage the operational complexity of configuring dozens or hundreds of services consistently?

Answer framework:

The challenge: each microservice needs configuration (database URLs, feature flags, API keys, timeouts, retry policies). With dozens of services across multiple environments (dev, staging, production), configuration management becomes a significant operational concern.

Configuration sources (in priority order): environment-specific overrides, central configuration service, environment variables, configuration files, and hardcoded defaults. Higher priority sources override lower ones. This allows a base configuration with environment-specific adjustments.
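
A minimal sketch of layered configuration lookup in Java (the key naming convention and sources are illustrative). Higher-priority sources win, and a missing required value fails fast:

    import java.util.Map;
    import java.util.Optional;

    public class Config {
        private final Map<String, String> centralConfig; // snapshot fetched from a config service
        private final Map<String, String> defaults;      // bundled with the service

        public Config(Map<String, String> centralConfig, Map<String, String> defaults) {
            this.centralConfig = centralConfig;
            this.defaults = defaults;
        }

        public String get(String key) {
            String envKey = key.toUpperCase().replace('.', '_'); // db.url -> DB_URL
            return Optional.ofNullable(System.getenv(envKey))                  // highest priority
                    .or(() -> Optional.ofNullable(centralConfig.get(key)))     // central config service
                    .or(() -> Optional.ofNullable(defaults.get(key)))          // hardcoded default
                    .orElseThrow(() -> new IllegalStateException("Missing config: " + key)); // fail fast
        }
    }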

Centralized configuration services: Consul, Spring Cloud Config, AWS AppConfig, or a custom service backed by etcd or ZooKeeper. Centralized configuration enables dynamic updates without redeployment, audit logging of all changes, access control (who can change production config), and consistency across all instances of a service.

Dynamic configuration (feature flags and runtime tuning): some configuration should be changeable without redeployment. Feature flags (enable/disable features for specific users or percentages), rate limits, timeout values, and log levels are common examples. Use a push model (service subscribes to changes) for low-latency updates. Cache configuration locally with a reasonable TTL as a fallback.

Secrets management: API keys, database passwords, and certificates must be stored securely, not in configuration files or environment variables. Use a secrets manager (HashiCorp Vault, AWS Secrets Manager, Kubernetes Secrets encrypted at rest). Rotate secrets automatically. Inject secrets at runtime, never bake them into container images.

Configuration validation: validate configuration at startup. If a required config value is missing or invalid, the service should fail to start with a clear error message (fail fast). Implement configuration schemas that define types, ranges, and required fields. Run validation in CI/CD to catch configuration errors before deployment.

Discuss fault tolerance: what happens when the configuration service is unavailable? Services should cache their last known good configuration. If a service starts without reaching the configuration service, it should use the cached config or fail to start (depending on the criticality of the config).

Follow-up questions:

  • How do you handle configuration for canary deployments?
  • How do you roll back a bad configuration change across all services?
  • How do you manage configuration for multi-region deployments?

9. How do you handle API versioning and backward compatibility across microservices?

What the interviewer is really asking: Can you manage the contract evolution between services without causing coordination nightmares?

Answer framework:

The core principle: each service must be deployable independently. This means that a change to service A's API should not require simultaneous deployment of all services that depend on A.

Backward-compatible changes (safe to deploy without coordinating with consumers): adding new fields to responses, adding new optional request fields, adding new endpoints, and adding new event types. Design all changes to be additive.
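
Additive changes are only safe if consumers are tolerant readers. A minimal sketch with Jackson (assuming a jackson-databind version with record support; the DTO and field names are illustrative): the consumer deserializes only the fields it uses and ignores anything it does not recognize, so the provider can add fields freely.

    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class OrderClient {

        @JsonIgnoreProperties(ignoreUnknown = true) // unknown fields are skipped, not errors
        record OrderSummary(String orderId, String status) {}

        private final ObjectMapper mapper = new ObjectMapper();

        public OrderSummary parse(String responseJson) throws Exception {
            return mapper.readValue(responseJson, OrderSummary.class);
        }
    }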

Breaking changes (require coordination): removing fields, changing field types, changing URL paths, changing event schemas, and changing error formats. Avoid breaking changes. When unavoidable, use a two-phase approach.

For synchronous APIs (REST, gRPC): use semantic versioning for service APIs. Support the current and previous version simultaneously. Give consumers a migration window (typically one sprint or one month for internal services). Use header-based versioning (Accept: application/vnd.service.v2+json) to keep URLs stable.

For gRPC specifically: Protocol Buffers has built-in backward compatibility. Never reuse field numbers. Mark removed fields as reserved. New fields are optional by default. This makes additive changes safe. For breaking changes, create a new service definition (v2.proto).

For asynchronous events: use a schema registry (Confluent Schema Registry for Kafka). Define schemas in Avro, Protobuf, or JSON Schema. The registry enforces compatibility rules (backward, forward, or full). Producers cannot publish events that break compatibility with registered consumers.

Contract testing: use Pact or similar tools to define consumer-driven contracts. Each consumer specifies what it expects from a provider. The provider runs these contracts in CI/CD to ensure changes do not break any consumer. This catches incompatibilities before deployment.

Discuss the expand/contract pattern for database schema changes within a microservices context: apply the same principles to inter-service API evolution. Expand the API (add new fields/endpoints), migrate consumers, contract the API (remove old fields/endpoints).

Follow-up questions:

  • How do you handle versioning for event-driven communication?
  • What is the difference between provider-driven and consumer-driven contracts?
  • How do you manage a breaking change that affects 20 downstream services?

10. How do you implement the strangler fig pattern for monolith decomposition?

What the interviewer is really asking: Can you plan a realistic, incremental migration from a monolith to microservices?

Answer framework:

The strangler fig pattern (named after the strangler fig tree that grows around a host tree, eventually replacing it) is an incremental migration strategy. Instead of a risky big-bang rewrite, you gradually extract functionality from the monolith into microservices while the monolith continues to serve traffic.

Phase 1 - Intercept: place a routing layer (API gateway or reverse proxy) in front of the monolith. All traffic flows through this layer. Initially, it forwards everything to the monolith. This is a zero-functional-change deployment that establishes the infrastructure for incremental migration.

Phase 2 - Extract: identify a bounded context to extract first. Choose one that is (a) well-defined with clear boundaries, (b) has minimal data sharing with the rest of the monolith, (c) would benefit from independent deployment or scaling, and (d) is low risk (not the most critical revenue path). Build the new microservice and its own database.

Phase 3 - Route: update the routing layer to send requests for the extracted functionality to the new microservice. Keep the monolith's implementation intact as a fallback. Use feature flags to gradually shift traffic (1%, 10%, 50%, 100%). Compare responses from both systems (shadow testing) to verify correctness.

Phase 4 - Iterate: repeat Phases 2 and 3 for additional bounded contexts. Each iteration teaches you more about service boundaries, data migration, and operational requirements.

Data migration considerations: the new service needs its own database. Initially, it might share the monolith's database (pragmatic but creates coupling). The end goal is full data ownership. Use the CDC pattern to sync data from the monolith's database to the new service's database. See our database design guide for migration strategies.

Common pitfalls: trying to extract too much at once (start small), not investing in observability (you need to compare behavior between monolith and microservice), ignoring organizational changes (the team structure needs to evolve alongside the architecture), and not setting clear success criteria (how do you know the extraction is complete?).

Discuss event-driven architecture as an enabler: the monolith publishes events for its domain operations. New microservices can subscribe to these events to build their own data stores. This reduces the need for direct data migration.

Follow-up questions:

  • How do you handle database references between extracted services and the remaining monolith?
  • What is the typical timeline for a monolith-to-microservices migration?
  • How do you handle shared authentication between the monolith and new microservices?

11. How do you implement the sidecar pattern in microservices?

What the interviewer is really asking: Do you understand the operational patterns that make microservices manageable at scale?

Answer framework:

The sidecar pattern deploys a helper process alongside each service instance. The sidecar handles cross-cutting concerns (networking, observability, security) so the service can focus on business logic.

In Kubernetes, the sidecar runs as a container in the same pod as the main container. They share the network namespace (communicate via localhost) and can share volumes. The sidecar starts before the main container and stops after it.

Common sidecar use cases: service mesh proxy (Envoy in Istio, handles mTLS, retries, circuit breaking, load balancing), logging agent (Fluentd or Filebeat that ships logs to a central system), metrics collector (Prometheus sidecar that scrapes metrics and forwards them), and configuration agent (watches a config service and updates local config files).

Service mesh deep dive: the data plane consists of sidecar proxies deployed with each service. The control plane (Istiod) configures all sidecars with routing rules, retry policies, and certificates. The sidecar intercepts all inbound and outbound network traffic, providing: automatic mTLS (encryption without code changes), traffic management (canary deployments, blue-green, A/B testing), resilience (retries, timeouts, circuit breakers), and observability (distributed traces, metrics, access logs).

Trade-offs of the sidecar pattern: added latency (each request passes through the sidecar proxy, adding 1-2ms), resource overhead (each sidecar consumes CPU and memory, multiplied by the number of instances), operational complexity (managing the sidecar lifecycle alongside the main service), and debugging difficulty (the sidecar adds a layer of indirection).

Alternatives to sidecars: shared libraries (the cross-cutting logic is compiled into the service, no network hop but requires the same language and coordinated library updates), proxyless service mesh (gRPC's xDS API allows the service to participate in the mesh without a sidecar), and ambient mesh (Istio Ambient uses per-node proxies instead of per-pod sidecars, reducing overhead).

Discuss when the sidecar pattern is worth the overhead: at scale (dozens of services), in polyglot environments (services in different languages), and when you need to enforce policies consistently across all services without trusting each service to implement them correctly.

Follow-up questions:

  • What is the latency overhead of a service mesh sidecar?
  • How does the sidecar handle failures and restarts?
  • When would you choose a shared library approach over a sidecar?

12. How do you handle testing in a microservices architecture?

What the interviewer is really asking: Do you understand the testing pyramid in a distributed system and the unique challenges of testing across service boundaries?

Answer framework:

The testing pyramid for microservices (bottom to top):

Unit tests: test business logic within a single service in isolation. Mock external dependencies (databases, other services). Fast, reliable, catch most bugs. These should constitute the majority of your tests.

Integration tests: test a single service with its real dependencies (database, cache). Use testcontainers to spin up real databases and caches in Docker containers. These catch issues with database queries, serialization, and schema mismatches.
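
A minimal integration-test sketch with Testcontainers and JUnit 5 (assuming the Testcontainers PostgreSQL module and the PostgreSQL JDBC driver are on the test classpath; the table and test name are illustrative):

    import org.junit.jupiter.api.AfterAll;
    import org.junit.jupiter.api.BeforeAll;
    import org.junit.jupiter.api.Test;
    import org.testcontainers.containers.PostgreSQLContainer;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    import static org.junit.jupiter.api.Assertions.assertTrue;

    // The repository code under test runs against a real PostgreSQL
    // instance in Docker instead of a mock.
    class OrderRepositoryIT {
        static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16");

        @BeforeAll
        static void start() { postgres.start(); }

        @AfterAll
        static void stop() { postgres.stop(); }

        @Test
        void insertsAndReadsAnOrder() throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    postgres.getJdbcUrl(), postgres.getUsername(), postgres.getPassword());
                 Statement stmt = conn.createStatement()) {

                stmt.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)");
                stmt.execute("INSERT INTO orders VALUES ('o-1', 'PLACED')");

                var rs = stmt.executeQuery("SELECT status FROM orders WHERE id = 'o-1'");
                assertTrue(rs.next());
                // a real test would exercise the service's repository class, not raw SQL
            }
        }
    }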

Contract tests: verify that the interface between two services is compatible. Consumer-driven contracts (Pact): each consumer defines what it expects from a provider. The provider runs these contracts in its CI/CD pipeline. This catches breaking API changes without running the full system.

Component tests: test a single service end-to-end with mocked external services. Send HTTP requests and verify responses. Use WireMock or similar tools to simulate dependent services. These test the service's behavior as a black box.

End-to-end tests: test the full system with all services running. Expensive to set up, slow to run, and brittle (any service failure breaks the test). Minimize these. Focus on critical business flows (user registration, order placement, payment processing).

Additional testing strategies for microservices: chaos testing (inject failures, latency, and partitions to test resilience, tools like Chaos Monkey, Gremlin, Litmus), shadow testing (send production traffic to a new version alongside the current version, compare responses without affecting users), canary testing (deploy the new version to a small percentage of traffic, monitor metrics, gradually increase), and load testing (test individual services and the full system under expected and peak load).

Discuss the testing anti-pattern: trying to test everything with end-to-end tests. This creates a slow, brittle test suite that nobody trusts. Invest in contract tests instead, which provide confidence in inter-service compatibility without the overhead of running the full system.

Follow-up questions:

  • How do you manage test data across multiple services?
  • How do you test eventual consistency scenarios?
  • What is the role of feature flags in testing?

13. How do you handle observability across microservices?

What the interviewer is really asking: Do you understand the three pillars of observability and how they work together in a distributed system?

Answer framework:

Observability is the ability to understand the internal state of a system from its external outputs. In microservices, this is critical because no single service's logs tell the full story.

The three pillars:

Metrics: numerical measurements over time. Use RED metrics for services (Rate of requests, Error rate, Duration of requests) and USE metrics for infrastructure (Utilization, Saturation, Errors). Collect with Prometheus, store in a TSDB, visualize with Grafana. Metrics tell you WHAT is happening.

Logs: structured event records from each service. Use structured logging (JSON) with consistent fields across all services: timestamp, service name, trace ID, log level, message, and context-specific fields. Ship to a centralized system (ELK stack, Loki). Logs tell you WHY something happened.

Traces: the path of a request through multiple services. Each service adds a span with timing and metadata. Use OpenTelemetry for instrumentation, Jaeger or Tempo for storage and visualization. Traces tell you WHERE time is spent and WHERE errors occur.

Correlation: the real power of observability is connecting the three pillars. Include trace IDs in logs so you can jump from a trace to the relevant logs. Link traces to dashboards so you can jump from a metric anomaly to example traces. Create alerts based on metrics that include links to relevant dashboards and runbooks.

Service Level Objectives (SLOs): define measurable reliability targets. Example: 99.9% of requests complete in under 200ms. Measure SLIs (Service Level Indicators) continuously. Calculate error budgets (if your SLO is 99.9%, your error budget is 0.1% of requests). Use error budgets to balance reliability with development velocity.
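
Worked example: a 99.9% availability SLO over a 30-day window leaves an error budget of 0.1% of the period, i.e. 30 × 24 × 60 × 0.001 ≈ 43 minutes of allowed unavailability (or, for a request-based SLI, 0.1% of requests). If one incident burns 30 of those 43 minutes, the team slows risky feature work until the budget recovers.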

Discuss the cost of observability: metrics, logs, and traces generate significant data. For a system processing 100K requests per second, storing all logs and traces is expensive. Use log levels (debug off in production), sampling (trace 1% of requests, 100% of errors), and retention policies (keep detailed data for 7 days, aggregated data for 90 days).

Follow-up questions:

  • How do you implement alerting that avoids alert fatigue?
  • How do you handle observability for async operations (event consumers, batch jobs)?
  • What is the cost of running an observability platform at scale?

14. How do you handle deployment strategies in microservices?

What the interviewer is really asking: Can you deploy changes safely in a system where one bad deploy can cascade across services?

Answer framework:

Deployment strategies from simplest to most sophisticated:

Rolling update: replace old instances with new ones, one at a time. No downtime if at least one instance is always running. Default in Kubernetes. Risk: during the rollout, both old and new versions are serving traffic. The new version must be backward compatible with the old version's data and APIs.

Blue-green: run two identical environments (blue = current, green = new). Deploy the new version to green. Switch traffic from blue to green. If issues occur, switch back to blue. Advantage: instant rollback. Disadvantage: requires double the infrastructure (temporarily).

Canary: deploy the new version to a small subset of instances (e.g., 5%). Route a small percentage of traffic to canary instances. Monitor error rates, latency, and business metrics. If healthy, gradually increase traffic. If unhealthy, roll back immediately. This is the safest strategy for high-risk changes.

Feature flags: deploy the new code to all instances but hide it behind a feature flag. Enable the flag for internal users first, then 1% of users, then 10%, then 100%. This separates deployment (code is on servers) from release (feature is enabled for users). If the feature causes issues, disable the flag without redeploying.
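
A minimal flag-evaluation sketch in Java (the flag name and user groups are illustrative; real systems back this with a flag service so percentages change without a redeploy). The key ideas are the internal-first rollout and deterministic bucketing, so a given user stays on the same code path across requests:

    import java.util.Set;

    public class FeatureFlags {
        private volatile int newCheckoutPercent = 0;                      // 0 -> 1 -> 10 -> 100
        private volatile Set<String> internalUsers = Set.of("alice", "bob");

        public boolean newCheckoutEnabled(String userId) {
            if (internalUsers.contains(userId)) {
                return true;                                              // dogfood internally first
            }
            // Deterministic bucketing: the same user is always in or out of the
            // rollout instead of flapping between old and new behaviour.
            int bucket = Math.floorMod(userId.hashCode(), 100);
            return bucket < newCheckoutPercent;
        }

        public void setNewCheckoutPercent(int percent) {                  // updated from config, no redeploy
            this.newCheckoutPercent = percent;
        }
    }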

Progressive delivery: combine canary deploys with automated analysis. Deploy canary, automatically compare canary metrics with baseline, and automatically promote or roll back based on metric thresholds. Tools like Argo Rollouts and Flagger implement this. This removes human judgment from the deployment decision for well-defined metrics.

GitOps: define the desired state of each service in a Git repository (Kubernetes manifests, Helm charts). A controller (ArgoCD, Flux) continuously reconciles the cluster state with the Git state. Changes are made through Git pull requests, which provide audit trails, reviews, and easy rollback (revert the commit).

Discuss high availability during deployments: ensure health checks are configured correctly (readiness probes that verify the service can handle traffic), use pod disruption budgets (never drain more than N instances simultaneously), and implement graceful shutdown (drain in-flight requests before terminating).

Follow-up questions:

  • How do you handle database schema changes during a rolling deployment?
  • How do you implement automated rollback based on metrics?
  • How do you handle deployment dependencies between services?

15. What is the CQRS pattern and when should you use it?

What the interviewer is really asking: Do you understand advanced data management patterns and can you identify when the additional complexity is justified?

Answer framework:

CQRS (Command Query Responsibility Segregation) separates the write model (commands: create, update, delete) from the read model (queries: fetch data). Each has its own data store optimized for its specific purpose.

The write store is optimized for transactional integrity: normalized relational schema, ACID transactions, strong consistency. The read store is optimized for query performance: denormalized, pre-joined, indexed for specific query patterns. Could be a different database entirely (write to PostgreSQL, read from Elasticsearch or a materialized view in Redis).

Synchronization: when a command modifies the write store, an event is published. Event handlers update the read store(s). This introduces eventual consistency between writes and reads.
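
A deliberately small CQRS sketch in Java with in-memory stores (the names and shapes are illustrative): the command side records the change and publishes an event, and a projector maintains a denormalized read model. The hop between the two is the eventual-consistency window discussed below; in production the event would travel through a broker rather than a direct method call.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CqrsSketch {

        record OrderPlaced(String orderId, String customerId, long totalCents) {}

        // Write side: validates and appends, then publishes the event.
        static class CommandHandler {
            final List<OrderPlaced> writeStore = new ArrayList<>();
            final Projector projector;

            CommandHandler(Projector projector) { this.projector = projector; }

            void placeOrder(String orderId, String customerId, long totalCents) {
                OrderPlaced event = new OrderPlaced(orderId, customerId, totalCents);
                writeStore.add(event);      // local transaction in the write store
                projector.apply(event);     // in production: asynchronously, via a message broker
            }
        }

        // Read side: a denormalized view optimized for one query pattern,
        // e.g. "total spend per customer".
        static class Projector {
            final Map<String, Long> spendByCustomer = new ConcurrentHashMap<>();

            void apply(OrderPlaced event) {
                spendByCustomer.merge(event.customerId(), event.totalCents(), Long::sum);
            }

            long totalSpend(String customerId) {
                return spendByCustomer.getOrDefault(customerId, 0L);
            }
        }

        public static void main(String[] args) {
            Projector projector = new Projector();
            CommandHandler commands = new CommandHandler(projector);
            commands.placeOrder("o-1", "c-42", 2_500);
            commands.placeOrder("o-2", "c-42", 1_000);
            System.out.println(projector.totalSpend("c-42")); // 3500
        }
    }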

When CQRS is appropriate: read and write workloads have very different scaling requirements (read-heavy system where the write model is a bottleneck for reads), read queries are complex and do not map well to the write model (many joins, aggregations, full-text search), different consumers need different views of the same data, or you are already using event-driven architecture and event sourcing.

When CQRS is overkill: simple CRUD applications, systems with low read complexity (the write model is fine for reads), or teams unfamiliar with eventual consistency and event-driven patterns.

Combine with event sourcing: instead of storing current state, store events. The write side appends events to an event store. The read side builds materialized views by replaying events. This gives you a complete audit trail, time-travel queries, and the ability to build new read models by replaying the event history.

Discuss the consistency challenge: after a user creates a record, they might immediately query for it. If the read model has not been updated yet, the user sees stale data. Solutions: read-your-own-writes consistency (after a write, the next read for that user goes to the write store), optimistic UI (show the local change immediately before the server confirms), or use a synchronous read model update for critical paths.

Follow-up questions:

  • How do you handle the eventual consistency gap in a user-facing application?
  • How do you rebuild a read model when the schema changes?
  • What is the relationship between CQRS and the event sourcing pattern?

Common Mistakes in Microservices Interviews

  1. Defaulting to microservices without justification. Not every system needs microservices. Show that you understand when a monolith is the right choice.

  2. Ignoring data ownership. Sharing databases between services is the most common microservices anti-pattern. Each service should own its data.

  3. Underestimating operational complexity. Microservices require investment in CI/CD, monitoring, logging, tracing, and incident response. If you do not discuss these, you are showing inexperience.

  4. Not discussing organizational alignment. Microservices work best when service boundaries align with team boundaries. Conway's Law is a real force.

  5. Treating all communication as synchronous. Over-reliance on synchronous calls creates brittle systems. Understand when to use async patterns.

How to Prepare for Microservices Interviews

Study real-world microservices architectures: Netflix (Zuul, Eureka, Hystrix), Uber (domain-oriented microservices), and Amazon (team-per-service). Understand why they made their architectural choices.

Build a small microservices system: 3-4 services that communicate via REST and events. Deploy on Kubernetes. This hands-on experience teaches you the operational challenges that reading alone cannot.

Review our microservices architecture guide, event-driven architecture patterns, and the system design interview guide. Explore learning paths for structured preparation. For staff-level roles, see the senior to staff engineer transition guide.
