
Site Reliability Engineering Interview Questions for Senior Engineers (2026)

Top SRE interview questions with detailed answer frameworks covering SLOs, error budgets, incident response, capacity planning, reliability architecture, and production excellence practices used at Google, Netflix, and other leading technology companies.

20 min read · Updated Apr 21, 2026
interview-questions · sre · senior-engineer · reliability · production-engineering

Why SRE Skills Matter in Senior Engineering Interviews

Site Reliability Engineering, pioneered at Google and now adopted across the industry, represents a fundamental shift in how organizations think about operating production systems. SRE applies software engineering principles to infrastructure and operations problems, replacing manual toil with automation and replacing reactive firefighting with proactive reliability engineering. In 2026, SRE principles have become so widely adopted that senior engineering candidates at most technology companies are expected to demonstrate SRE competency regardless of their primary role.

SRE interviews assess a unique combination of skills: deep systems knowledge (networking, operating systems, distributed systems), software engineering ability (building automation and tooling), and operational judgment (making decisions under pressure during incidents). Interviewers are looking for candidates who understand the economics of reliability, who can quantify the cost of downtime versus the cost of additional engineering investment, and who can design systems that balance feature velocity with operational stability.

The distinction between a mid-level and senior SRE candidate is the ability to think in systems. Senior candidates do not just know how to configure a monitoring alert; they design observability strategies that give teams confidence in their SLOs. They do not just respond to incidents; they redesign systems so that entire categories of incidents become impossible. For a comprehensive view of how SRE fits into broader engineering preparation, see our system design interview guide and the learning paths designed for senior reliability engineers.

1. How do you define and implement SLOs for a complex distributed service?

What the interviewer is really asking: Do you understand the SLI/SLO/SLA hierarchy, can you choose meaningful indicators, and do you know how to use error budgets to balance reliability with velocity?

Answer framework:

Start with the hierarchy: SLIs (Service Level Indicators) are the metrics that matter to users, SLOs (Service Level Objectives) are the targets for those metrics, and SLAs (Service Level Agreements) are the contractual obligations with financial penalties. SREs primarily work with SLIs and SLOs.

Choosing the right SLIs is the most critical step. Bad SLIs lead to meaningless SLOs. For a user-facing web service, the key SLIs are availability (percentage of requests that return a successful response), latency (measured at specific percentiles: p50, p95, p99), and correctness (percentage of requests that return the correct result). For a data processing pipeline, use freshness (how old is the most recent processed data) and completeness (percentage of input records that were successfully processed).

Define SLOs that reflect actual user expectations. Do not set SLOs at the maximum achievable reliability because this leaves no room for deploying changes. A 99.9 percent availability SLO allows 43 minutes of downtime per month. A 99.99 percent SLO allows only 4.3 minutes. The tighter the SLO, the more engineering investment is required, and the slower feature development becomes. The right SLO is the lowest reliability target that keeps users satisfied.

Error budgets operationalize SLOs. If your SLO is 99.9 percent availability, your error budget is 0.1 percent of requests per month. When the error budget is healthy, deploy aggressively. When the budget is nearly exhausted, freeze deployments and focus on reliability improvements. This removes the subjective debate about whether to ship features or fix reliability by replacing it with a data-driven policy.
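
As a rough illustration, the budget math is simple enough to codify. The SLO, request volume, and failure counts below are assumptions for illustration, not numbers from any particular service:

```python
# Minimal error-budget math, assuming a request-based availability SLI.
# All inputs are illustrative; substitute your own SLO and traffic numbers.

SLO = 0.999                      # 99.9 percent availability target
monthly_requests = 500_000_000   # assumed monthly request volume

error_budget_requests = (1 - SLO) * monthly_requests   # 500,000 failed requests allowed
minutes_per_month = 30 * 24 * 60
error_budget_minutes = (1 - SLO) * minutes_per_month   # ~43.2 minutes of full downtime

def budget_remaining(failed_requests: int) -> float:
    """Fraction of the monthly error budget still unspent."""
    return 1 - failed_requests / error_budget_requests

print(int(error_budget_requests), round(error_budget_minutes, 1), budget_remaining(120_000))
```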

For implementation, measure SLIs at the point closest to the user. For web services, measure at the load balancer or API gateway, not at the application level, because infrastructure failures between the application and the user would otherwise be invisible. Use a dedicated SLO monitoring tool or build custom Prometheus recording rules that compute error budget consumption in real time.

Set up alerting based on error budget burn rate. A fast burn (consuming the entire monthly budget in one hour) triggers immediate paging. A slow burn (consuming the budget over several days) creates a ticket for investigation during business hours.

Follow-up questions:

  • How do you handle SLOs for services with variable user expectations across different markets?
  • What do you do when stakeholders push for an unrealistically high SLO?
  • How do you attribute SLO violations to specific services in a microservices architecture?

2. Describe your approach to capacity planning for a service that experiences seasonal traffic spikes.

What the interviewer is really asking: Can you predict future resource needs, plan for growth, and ensure the system can handle peak loads without over-provisioning during normal periods?

Answer framework:

Capacity planning is the process of determining the production capacity needed to meet future demand. The challenge is balancing headroom (enough capacity for unexpected spikes) with cost efficiency (not over-provisioning).

Start with demand modeling: collect historical traffic data and identify patterns. Most services have daily patterns (peaks during business hours), weekly patterns (lower on weekends for B2B services), and seasonal patterns (Black Friday for e-commerce, tax season for financial services). Use time-series forecasting (exponential smoothing, ARIMA, or Prophet) to project future demand based on historical trends.
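
A minimal forecasting sketch along these lines, using Holt-Winters exponential smoothing from statsmodels (one of the methods named above); the synthetic history, weekly seasonality, and 90-day horizon are assumptions for illustration:

```python
# Hedged sketch: forecast daily peak RPS with Holt-Winters exponential smoothing.
# Synthetic history stands in for real traffic data; parameters are illustrative.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(42)
days = np.arange(365)
history = pd.Series(                                   # one year of daily peak RPS
    5000 + 3.0 * days                                  # slow organic growth
    + 800 * np.sin(2 * np.pi * days / 7)               # weekly pattern
    + rng.normal(0, 100, days.size)                    # noise
)

model = ExponentialSmoothing(
    history, trend="add", seasonal="add", seasonal_periods=7
).fit()

forecast = model.forecast(90)              # project the next quarter
projected_peak_rps = float(forecast.max())
print(f"Projected 90-day peak: {projected_peak_rps:.0f} RPS")
```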

Convert demand forecasts into resource requirements using load testing data. Establish the relationship between request rate and resource consumption: for example, a single pod serving 1,000 requests per second uses roughly 400 millicores of CPU and 512 MB of memory. From this, calculate how many pods are needed for projected peak traffic, then add headroom.

Define the headroom buffer based on risk tolerance: N+1 redundancy (one extra instance per group), N+2 for critical services, or a percentage buffer (20-30 percent above projected peak). The buffer must account for the time needed to scale up, which includes procurement time for reserved instances, node provisioning time, and pod startup time.
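
A back-of-the-envelope sketch of that conversion; the per-pod capacity, buffer, and forecast numbers are assumptions:

```python
# Convert a demand forecast into a pod count with a headroom buffer.
# Every constant here is an illustrative assumption.
import math

projected_peak_rps = 14_000      # from the demand forecast
rps_per_pod = 1_000              # measured in load tests at target CPU/memory
headroom = 0.25                  # 25 percent buffer above projected peak

pods_for_peak = math.ceil(projected_peak_rps / rps_per_pod)
pods_with_headroom = math.ceil(pods_for_peak * (1 + headroom))

print(f"Base pods: {pods_for_peak}, with headroom: {pods_with_headroom}")
# Also verify that autoscaling can add this many pods within the time traffic
# takes to ramp, including node provisioning and pod startup time.
```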

For cloud environments, use a tiered approach: reserved instances for baseline load (cheapest per hour), autoscaling with on-demand instances for daily peaks, and spot or preemptible instances for burst capacity. This optimizes cost while ensuring capacity.

Conduct regular capacity reviews (monthly or quarterly) that compare actual traffic against projections, update forecasts, and adjust capacity accordingly. Flag services that are trending toward capacity limits weeks before they hit the wall.

For organic growth planning, work with product and business teams to understand upcoming launches, marketing campaigns, and market expansion plans that will drive traffic changes outside of historical patterns. A capacity planning model that only looks at history will miss step-function changes from new feature launches.

Follow-up questions:

  • How do you handle capacity planning for a new service with no historical data?
  • What is your approach when a capacity increase requires a 3-month lead time for hardware procurement?
  • How do you manage capacity across multiple regions with different traffic patterns?

3. How do you design a system for graceful degradation under overload?

What the interviewer is really asking: Do you understand load shedding, circuit breaking, priority-based traffic management, and how to maintain partial service when full service is impossible?

Answer framework:

Graceful degradation means maintaining core functionality when the system is under stress, even if non-critical features are disabled. The alternative, where the entire system fails under overload, is unacceptable for production services.

Start with load shedding: when incoming requests exceed capacity, reject excess requests early with a clear signal (HTTP 503 with Retry-After header) rather than accepting them and timing out, which wastes resources and provides a worse user experience. Implement load shedding at the edge (API gateway or load balancer) where it is cheapest. Use adaptive load shedding that measures actual backend capacity (CPU utilization, request latency, queue depth) rather than a static rate limit.

Implement priority-based traffic management: classify requests into priority tiers. For an e-commerce site, checkout and payment are highest priority, product browsing is medium, and recommendation loading is lowest. Under overload, shed low-priority requests first. Use request headers or routing rules to classify traffic.
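
A toy sketch combining the two ideas above: shedding driven by a measured utilization signal, with low-priority traffic rejected first. The thresholds and priority tiers are assumptions, not production-tuned values:

```python
# Toy priority-aware load shedder: as measured utilization rises, reject
# lower-priority request classes first. Thresholds are illustrative.
from enum import IntEnum

class Priority(IntEnum):
    CRITICAL = 0   # checkout, payment
    NORMAL = 1     # product browsing
    LOW = 2        # recommendations, prefetch

# Above each utilization level, shed everything at or below the listed priority.
SHED_RULES = [
    (0.95, Priority.NORMAL),   # near saturation: keep only critical traffic
    (0.80, Priority.LOW),      # under pressure: drop low-priority traffic
]

def should_shed(priority: Priority, utilization: float) -> bool:
    """Return True if this request should be rejected (e.g. with HTTP 503)."""
    for threshold, min_shed_priority in SHED_RULES:
        if utilization >= threshold and priority >= min_shed_priority:
            return True
    return False

# At 85 percent utilization, recommendations are shed but checkout is not.
print(should_shed(Priority.LOW, 0.85), should_shed(Priority.CRITICAL, 0.85))
```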

Circuit breakers prevent cascading failures between services. When a downstream dependency starts failing, the circuit breaker trips and immediately returns an error or cached response instead of waiting for timeouts. This prevents the calling service from exhausting its connection pool and thread pool. Implement circuit breakers with three states: closed (normal operation), open (all requests immediately fail), and half-open (allow a few test requests to check if the dependency has recovered).
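
A minimal sketch of the three-state pattern just described; the thresholds and timings are assumptions, and in practice the implementation usually comes from a shared resilience library or the service mesh rather than hand-rolled code:

```python
# Minimal three-state circuit breaker sketch. Thresholds and timings are
# illustrative; real implementations add jitter, metrics, and thread safety.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"           # allow a probe request through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"                  # probe succeeded, resume normal operation
        return result
```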

Design feature flags for degradation: identify which features can be disabled without breaking core user flows. For a streaming service like Netflix, during overload, disable personalized recommendations and show a static trending list, reduce video quality tiers, and disable autoplay. Each degradation level should be toggleable independently.

For back-pressure propagation: when an internal service is overloaded, it should signal upstream callers to slow down rather than accepting requests and queuing them indefinitely. Use flow control mechanisms: HTTP 429 (Too Many Requests) with rate limit headers, gRPC flow control, or message queue consumer lag metrics.

Test degradation behavior regularly. Include overload scenarios in load testing and chaos engineering experiments. Verify that the system degrades predictably rather than failing in unexpected ways.

Follow-up questions:

  • How do you decide which features to degrade first?
  • How do you prevent load shedding from disproportionately affecting certain user segments?
  • How do you communicate graceful degradation status to users?

4. Explain how you would implement a comprehensive on-call program.

What the interviewer is really asking: Can you build sustainable on-call practices that effectively protect production systems without burning out the engineers involved?

Answer framework:

A well-designed on-call program balances production reliability with engineer well-being. The Google SRE book established that on-call should consume no more than 25 percent of an SRE's time, with the bulk of the remaining time spent on engineering work that reduces future on-call burden.

For rotation structure: minimum two people on-call at all times (primary and secondary) for any critical service. Rotate weekly rather than daily (daily handoffs lose context) or biweekly (too long, leads to burnout). Time-zone-aware rotations avoid waking engineers at night: a US team covers US business hours, an APAC team covers APAC hours. Ensure the on-call pool is large enough that each engineer is on-call no more than one week per month.

For alert quality: this is the most important factor in on-call sustainability. Every page must be actionable (the on-call engineer can do something about it), urgent (it requires response within minutes, not hours), and real (it represents actual user impact, not a noisy threshold). Audit alerts monthly. Any alert that is consistently ignored, acknowledged without action, or not correlated with user impact should be deleted or downgraded to a non-paging notification.

For escalation paths: define clear escalation criteria. If the primary cannot resolve within 15 minutes, escalate to the secondary. If neither can resolve within 30 minutes, escalate to the team lead or subject matter expert. For cross-team issues, maintain an incident management process with clear communication channels.

For runbooks: every paging alert should link to a runbook that describes what the alert means, why it matters, diagnostic steps, and remediation actions. Runbooks should be living documents updated after every incident. The goal is to make on-call accessible to any team member, not just the engineers who built the system.

For compensation and recognition: on-call duty outside business hours should be compensated (either financially or with time off). Track on-call burden per engineer and ensure it is distributed equitably. Recognize engineers who improve on-call experience through better automation and alert tuning.

Measure on-call health with metrics: pages per shift, pages by hour of day, mean time to acknowledge, mean time to resolve, and percentage of pages that were actionable. Review these metrics monthly and set improvement targets.

Follow-up questions:

  • How do you handle on-call for a service that is too complex for any single engineer to fully understand?
  • What is your approach when on-call burden consistently exceeds sustainable levels?
  • How do you onboard new engineers to the on-call rotation effectively?

5. How do you approach toil reduction in an SRE context?

What the interviewer is really asking: Can you identify repetitive manual work, quantify its cost, and build automation that eliminates it permanently?

Answer framework:

Toil is defined as work that is manual, repetitive, automatable, tactical (reactive rather than strategic), without enduring value (the system is no better after you do it), and that scales linearly with service growth. Examples include manually restarting failed processes, manually provisioning resources for each new team, running routine database maintenance, and manually generating reports.

The first step is measurement: track how engineers spend their time. Use time tracking or categorize completed work items. Quantify toil as a percentage of each engineer's week. The SRE principle is that toil should not exceed 50 percent of an engineer's time. If it does, the team is effectively an operations team, not an SRE team.

Prioritize toil reduction by impact: multiply the frequency of the task by the time per occurrence by the number of engineers affected. A task that takes 10 minutes and occurs 20 times per week for each of 5 engineers costs roughly 70 engineer-hours per month. That justifies significant investment in automation.

For automation approaches: self-service tooling (a CLI or web UI that lets developers provision their own resources), declarative configuration (define the desired state and let automation achieve it), auto-remediation (monitoring detects the problem and automation fixes it before a human is paged), and elimination (redesign the system so the manual step is no longer needed).

The highest-leverage toil elimination is auto-remediation. When you identify that 30 percent of pages are resolved by the same sequence of steps (restart the service, clear a queue, increase a connection pool), encode that sequence in a remediation bot that executes automatically when the condition is detected. The on-call engineer is notified that remediation occurred but is not woken up.
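
A hedged sketch of that idea: a small dispatcher that maps known alert signatures to remediation actions and only pages when no playbook matches. The alert names, remediation actions, and notification hooks are all hypothetical placeholders:

```python
# Sketch of an auto-remediation dispatcher. Alert names, remediation actions,
# and notification functions are hypothetical placeholders.

def restart_service(alert):
    print(f"restarting {alert['service']}")           # placeholder action

def clear_stuck_queue(alert):
    print(f"purging backlog for {alert['service']}")  # placeholder action

def page_oncall(alert):
    print(f"PAGING on-call for {alert['name']}")      # placeholder pager call

def notify_oncall_async(alert):
    print(f"FYI: auto-remediated {alert['name']}")    # non-paging notification

REMEDIATIONS = {
    "ProcessCrashLooping": restart_service,
    "QueueBacklogStuck": clear_stuck_queue,
}

def handle_alert(alert: dict) -> None:
    action = REMEDIATIONS.get(alert["name"])
    if action is None:
        page_oncall(alert)          # unknown signature: a human decides
        return
    action(alert)                   # known signature: fix it automatically
    notify_oncall_async(alert)      # record that remediation ran, without a page

handle_alert({"name": "ProcessCrashLooping", "service": "checkout"})
```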

Track toil reduction over time. Set quarterly goals for reducing toil percentage. Celebrate engineering work that eliminates entire categories of manual work. Make toil reduction a valued engineering contribution that is recognized in performance reviews.

Follow-up questions:

  • How do you prevent automation from masking underlying systemic issues?
  • How do you handle organizational resistance to investing in toil reduction versus feature work?
  • What is your approach to automating tasks that require human judgment?

6. How would you design a multi-region active-active architecture for a critical service?

What the interviewer is really asking: Do you understand the distributed systems challenges of running services across geographic regions, including data replication, consistency trade-offs, and failover mechanisms?

Answer framework:

Multi-region active-active means that every region independently serves user traffic simultaneously, with automatic failover if one region becomes unhealthy. This is the gold standard for critical services but introduces significant complexity.

For traffic routing, use DNS-based global load balancing (AWS Route 53, Google Cloud DNS, or Cloudflare) to route users to the nearest healthy region. Implement health checks that probe each region's full request path (not just whether the load balancer responds, but whether the application can serve requests end-to-end). Configure failover behavior: when a region fails health checks, DNS removes it and traffic is redistributed to healthy regions.

The hardest challenge is data layer consistency. For eventually consistent data (user profiles, content, configuration), use asynchronous cross-region replication. Writes go to the local region and propagate to others within seconds. This means a user might see stale data if they are routed to a different region, but for most use cases this is acceptable.

For strongly consistent data (financial transactions, inventory counts), you have two options. First, single-leader replication: designate one region as the write leader and replicate to followers. All writes route to the leader region regardless of user location. This adds cross-region latency to writes but guarantees consistency. Second, multi-leader with conflict resolution: allow writes in any region and resolve conflicts using last-writer-wins, application-specific merge logic, or CRDTs. This is complex but eliminates the write latency penalty.
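
A tiny sketch of last-writer-wins conflict resolution for the multi-leader case; the timestamp-plus-region tiebreak is one common convention, shown as an assumption rather than a recommendation:

```python
# Last-writer-wins merge for a multi-leader replicated register.
# Note that LWW silently discards the "losing" write, which is why it only
# suits data where that loss is acceptable; clock skew is a real caveat.
from dataclasses import dataclass

@dataclass(frozen=True)
class VersionedValue:
    value: str
    timestamp_ms: int     # wall-clock write time
    region: str           # deterministic tiebreak when timestamps collide

def merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))

us = VersionedValue("shipping=express", 1_700_000_000_123, "us-east-1")
eu = VersionedValue("shipping=standard", 1_700_000_000_200, "eu-west-1")
print(merge(us, eu))   # the later write (eu-west-1) wins in every region
```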

For the application layer, design services to be region-aware. Each service instance knows its region and can make routing decisions accordingly. Use a service mesh or service discovery system that understands region topology.

Test regional failover regularly. Conduct controlled failover drills where you intentionally take a region offline and verify that traffic shifts correctly, data remains consistent, and the service stays within its SLOs. Netflix popularized this approach with their region evacuation drills, which are a form of chaos engineering at the largest scale.

Address the thundering herd problem during failover: when one region fails, the remaining regions suddenly absorb its share of the traffic (with only two regions, roughly double their normal load). Ensure each region has sufficient headroom capacity or can autoscale rapidly enough to absorb the additional load.

Follow-up questions:

  • How do you handle user sessions during a regional failover?
  • What is your approach to deploying changes across multiple regions safely?
  • How do you debug issues that only occur in specific regions?

7. How do you measure and improve the reliability of a legacy system with no observability?

What the interviewer is really asking: Can you improve reliability incrementally without rewriting the system from scratch, and do you know how to prioritize when everything seems broken?

Answer framework:

This is a common real-world scenario. Legacy systems often generate the most revenue and the most incidents, yet have the least observability. The approach must be incremental and risk-conscious because breaking the existing system while trying to improve it is the worst possible outcome.

Phase 1 (External observation, week 1-2): before touching the application, instrument at the boundaries. Add monitoring at the load balancer or reverse proxy to capture request rate, error rate, and latency. Set up synthetic monitoring (scheduled HTTP checks from external probes) to detect outages from the user's perspective. Analyze access logs to understand traffic patterns, most-used endpoints, and error distribution. This gives you baseline SLIs without modifying the application.
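
A minimal synthetic-check sketch for Phase 1. The endpoint, interval, and thresholds are assumptions, and in practice a monitoring platform runs probes like this from multiple external locations:

```python
# Minimal synthetic availability/latency probe. Endpoint and interval are
# illustrative; ship the resulting records to a metrics backend rather than stdout.
import time
import requests

ENDPOINT = "https://example.com/healthz"   # hypothetical user-facing check URL

def probe() -> dict:
    start = time.monotonic()
    try:
        resp = requests.get(ENDPOINT, timeout=5)
        ok = resp.status_code < 500
    except requests.RequestException:
        ok = False
    latency_ms = (time.monotonic() - start) * 1000
    return {"ok": ok, "latency_ms": round(latency_ms, 1), "ts": time.time()}

if __name__ == "__main__":
    while True:
        print(probe())
        time.sleep(60)      # one-minute probe interval
```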

Phase 2 (Infrastructure instrumentation, week 2-4): deploy node-level monitoring (CPU, memory, disk, network) using agents that require no application changes. Monitor database metrics (query latency, connection pool usage, replication lag, slow query log). Monitor queue depths if the system uses message queues. These metrics often reveal the root causes of reliability issues: disk filling up, connection pool exhaustion, or replication lag causing stale reads.

Phase 3 (Application-level instrumentation, month 2-3): add structured logging and metrics incrementally. Start with the most critical request paths identified from the access log analysis. Add timing instrumentation around database queries, external API calls, and cache lookups. Introduce distributed tracing using OpenTelemetry auto-instrumentation, which provides basic tracing with minimal code changes.

Phase 4 (Define SLOs and improve, month 3+): with sufficient observability, define SLOs based on the baseline data. Identify the top sources of error budget consumption and address them in priority order. Often, 80 percent of reliability problems come from 20 percent of components.

Throughout this process, resist the temptation to rewrite. Each incremental improvement delivers immediate value. A rewrite takes months, carries high risk, and delivers no value until complete. Focus on making the existing system observable, understandable, and incrementally more reliable.

Follow-up questions:

  • How do you build organizational support for investing in legacy system reliability?
  • What do you do when the legacy system has no test coverage?
  • How do you handle a legacy system that is too fragile to add instrumentation to safely?

8. How do you implement change management to reduce the risk of production deployments?

What the interviewer is really asking: Can you design processes and tooling that enable fast, safe deployments, reducing change failure rate without slowing down feature delivery?

Answer framework:

Change management in an SRE context is about making deployments safe by default rather than imposing heavyweight approval processes that slow down engineering velocity. The goal is to make small, frequent, well-tested changes rather than large, infrequent, risky ones.

For deployment safety mechanisms: implement canary deployments as the default deployment strategy. Every change goes through a canary phase where a small percentage of traffic exercises the new version. Automated canary analysis compares error rates, latency, and business metrics between the canary and the baseline. If the canary degrades any metric beyond a statistical significance threshold, the deployment automatically rolls back.
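
A simplified sketch of the comparison step. The metric names and fixed thresholds are assumptions; real canary analysis tools (Kayenta, for example) replace fixed deltas with proper statistical tests over many metrics:

```python
# Simplified automated canary comparison: gate promotion on error rate and
# latency deltas between canary and baseline. Thresholds are illustrative.

def canary_passes(baseline: dict, canary: dict,
                  max_error_rate_delta: float = 0.005,
                  max_p99_ratio: float = 1.2) -> bool:
    """Return True if the canary may be promoted, False to trigger rollback."""
    error_delta = canary["error_rate"] - baseline["error_rate"]
    p99_ratio = canary["p99_latency_ms"] / baseline["p99_latency_ms"]
    return error_delta <= max_error_rate_delta and p99_ratio <= max_p99_ratio

baseline = {"error_rate": 0.002, "p99_latency_ms": 310}
canary   = {"error_rate": 0.011, "p99_latency_ms": 325}
print(canary_passes(baseline, canary))   # False: error rate regressed, roll back
```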

For change velocity management: enforce progressive rollout schedules based on risk level. Low-risk changes (config tweaks, minor bug fixes) deploy through automated canary with no human approval. Medium-risk changes (new features behind feature flags, dependency upgrades) require a team lead review of the canary metrics before full rollout. High-risk changes (database migrations, infrastructure changes, security patches) require an explicit approval and a scheduled maintenance window.

For blast radius limitation: use feature flags to decouple deployment from activation. Deploy code to production without enabling the new feature, then activate it for a small percentage of users, monitor, and expand. This separates the risk of deployment (will the process crash?) from the risk of activation (does the feature work correctly for users?).

For automated safety checks: block deployments during active incidents (do not add another variable during troubleshooting), block deployments that fail automated test suites or security scans, block deployments outside business hours unless explicitly overridden, and limit the number of concurrent deployments across the organization.

For rollback readiness: every deployment must have a tested rollback path. For application deployments, this means keeping the previous version available for instant redeployment. For database changes, this means using the expand-and-contract pattern. For infrastructure changes, this means using Terraform plan to preview changes and maintaining the ability to revert. Practice rollbacks regularly; the first time you execute a rollback should not be during an incident.

Follow-up questions:

  • How do you handle deployments that cannot be easily rolled back?
  • What metrics do you use to measure change management effectiveness?
  • How do you balance deployment safety with the need for emergency hotfixes?

9. How do you conduct and improve post-incident reviews?

What the interviewer is really asking: Do you treat incidents as learning opportunities, and can you lead a blameless post-incident process that produces systemic improvements?

Answer framework:

Post-incident reviews (PIRs), also called postmortems or retrospectives, are the primary mechanism through which organizations learn from failures. The quality of the PIR process directly correlates with whether the same classes of incidents recur.

For the review structure, hold the PIR within 48 hours of the incident while details are fresh. Include everyone involved in the response plus relevant stakeholders. The PIR document should contain: a summary (one paragraph), impact statement (users affected, duration, revenue impact), detailed timeline (every action taken, including timestamps), root cause analysis, contributing factors, what went well, what could have improved, and action items with owners and due dates.

For root cause analysis, use the Five Whys technique to push past surface-level causes. If a deployment caused an outage, the root cause is not the bad deployment; it is the lack of automated testing that would have caught the bug, or the lack of canary analysis that would have detected the impact before full rollout. Keep asking why until you reach systemic factors that, if addressed, would prevent the entire class of incidents.

The blameless culture is non-negotiable. The purpose of the PIR is to improve the system, not to punish individuals. When the investigation reveals that an engineer made an error, the question is: what about the system made that error easy to make and hard to detect? A blame-oriented culture causes engineers to hide errors, which makes the system less reliable.

For action items, categorize by impact and effort. Quick wins (alert tuning, runbook updates) should be completed within a week. Medium efforts (adding monitoring, implementing circuit breakers) should be completed within the sprint. Large efforts (architectural changes, new tooling) should be planned into the roadmap with clear timelines.

Track action item completion rates. If action items consistently go uncompleted, the PIR process is performative rather than effective. Report completion rates to leadership and advocate for prioritizing reliability improvements.

Conduct periodic meta-reviews: every quarter, analyze all incidents and action items to identify recurring themes. Are most incidents caused by deployment failures? Invest in deployment automation. Are they caused by dependency failures? Invest in circuit breakers and graceful degradation. This systemic analysis is more valuable than any individual PIR.

Follow-up questions:

  • How do you handle a situation where the same type of incident recurs despite previous PIR action items?
  • How do you ensure PIRs are read and internalized by the broader engineering organization?
  • How do you conduct PIRs for near-misses that did not result in user-facing impact?

10. How do you design a reliable alerting strategy that minimizes false positives and alert fatigue?

What the interviewer is really asking: Can you build an alerting system where every page represents real user impact and requires immediate human action?

Answer framework:

Alert fatigue is the single biggest threat to on-call effectiveness. When engineers are paged for non-actionable alerts, they develop alert blindness and start ignoring or delaying responses to all alerts, including the critical ones.

The fundamental principle: alert on symptoms (what users experience) not causes (what might be wrong internally). Page when the user-facing error rate exceeds the SLO threshold, not when a single server's CPU is high. The server might be handling the load fine. Symptom-based alerting dramatically reduces false positives because it directly measures what matters.

Implement multi-window, multi-burn-rate alerting as described in the Google SRE book. Fast burn alert: if the current error rate would exhaust the entire monthly error budget within one hour, page immediately. This catches acute incidents. Slow burn alert: if the current error rate would exhaust the error budget within three days, create a ticket for business-hours investigation. This catches gradual degradation.
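
A sketch of the burn-rate math behind those two alerts; the window sizes and thresholds below match the description above and are treated as assumptions rather than universal defaults:

```python
# Burn-rate sketch: how fast is the error budget being consumed relative to
# the rate the SLO allows? Numbers are illustrative. Real multi-window alerting
# also checks a short window (e.g. the last 5 minutes) to confirm the burn is
# still ongoing before paging.

SLO = 0.999
BUDGET = 1 - SLO                       # allowed error ratio over the window
WINDOW_HOURS = 30 * 24                 # 30-day SLO window

def burn_rate(observed_error_ratio: float) -> float:
    return observed_error_ratio / BUDGET

def hours_to_exhaustion(observed_error_ratio: float) -> float:
    return WINDOW_HOURS / burn_rate(observed_error_ratio)

def classify(observed_error_ratio: float) -> str:
    hours = hours_to_exhaustion(observed_error_ratio)
    if hours <= 1:
        return "page"        # fast burn: whole monthly budget gone within an hour
    if hours <= 72:
        return "ticket"      # slow burn: budget gone within three days
    return "ok"

print(classify(0.9))     # catastrophic error rate -> page
print(classify(0.02))    # gradual degradation -> ticket
print(classify(0.0005))  # within budget -> ok
```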

For alert design principles: every alert must have a clear owner (which team is responsible), a runbook link (what to do when it fires), a severity level (page vs ticket vs informational), and a defined escalation path. Delete any alert that does not meet all four criteria.

Reduce noise with alert aggregation and deduplication. When a single root cause triggers alerts across multiple services, the alerting system should group them into a single incident rather than paging the on-call engineer 50 times. Use tools like PagerDuty's intelligent alert grouping or build custom aggregation rules.

For threshold tuning, use historical data to set alert thresholds. Analyze the distribution of the metric and set the threshold above the normal variance. For example, if p99 latency normally varies between 200ms and 400ms, setting the alert at 350ms will cause false positives during normal variance. Set it at 500ms with a sustained duration of 5 minutes.

Conduct monthly alert reviews: analyze every alert that fired, categorize it as actionable or noise, and tune or delete noisy alerts. Track the signal-to-noise ratio over time and set a target (greater than 80 percent actionable). Observability tooling choices matter here: Datadog and New Relic, for example, take different approaches to alert management and noise reduction.

Follow-up questions:

  • How do you handle alerts for services that have different reliability requirements during business hours versus off-hours?
  • What is your approach to alerting during known maintenance windows?
  • How do you alert on the absence of expected events rather than the presence of errors?

11. How do you implement effective load testing to validate system reliability?

What the interviewer is really asking: Can you design load tests that accurately simulate production traffic patterns and reveal system limits before they cause incidents?

Answer framework:

Load testing is a reliability practice, not just a performance practice. The goal is to understand system behavior under various load conditions and identify breaking points before users find them.

Design realistic load profiles: do not simply blast the system with maximum request rate. Model actual user behavior including request mix (what percentage are reads vs writes, which endpoints are most popular), timing patterns (requests per user session, think time between requests), and data patterns (realistic payload sizes, query parameters that exercise different code paths).
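
One way to encode such a profile, sketched here with Locust; the tool choice, endpoints, weights, and think times are assumptions:

```python
# Sketch of a realistic load profile using Locust: a weighted request mix plus
# think time between requests. Endpoints and weights are illustrative.
from locust import HttpUser, task, between

class ShopperUser(HttpUser):
    wait_time = between(1, 5)          # think time between user actions

    @task(10)                          # browsing dominates the traffic mix
    def browse_products(self):
        self.client.get("/products?page=1")

    @task(3)
    def view_product(self):
        self.client.get("/products/12345")

    @task(1)                           # writes are comparatively rare
    def checkout(self):
        self.client.post("/checkout", json={"cart_id": "abc123"})
```

Run with something like `locust -f loadtest.py --host https://staging.example.com` (the host is a placeholder) against the test environment rather than production.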

Implement multiple test types: baseline test (normal traffic, validate performance meets SLOs), stress test (gradually increase load until the system degrades, identify the breaking point), spike test (sudden traffic increase, validate autoscaling and load shedding), soak test (sustained load for hours, identify memory leaks, connection pool exhaustion, and other time-dependent failures), and chaos test (combine load with fault injection for the most realistic scenario).

For the testing environment: ideally, run load tests against a production-identical environment. If cost prohibits this, use a scaled-down environment but scale the test traffic proportionally. Test against production data volumes since many performance issues only manifest at production scale.

For results analysis, focus on these metrics: throughput (requests per second at each load level), latency distribution (p50, p95, p99 as separate signals since averages hide tail latency), error rate by type (distinguish client errors from server errors), resource utilization (CPU, memory, connections, file descriptors), and downstream dependency behavior (does increasing load on your service cause cascading pressure on dependencies?).

Integrate load testing into the CI/CD pipeline: run a lightweight performance test on every deployment to catch regressions. Run comprehensive load tests weekly or before major releases. Store results historically so you can track performance trends over time.

Address the production load testing question: some organizations run load tests directly in production using dark traffic (replaying production requests to a shadow service) or synthetic traffic during low-usage periods. This provides the most realistic results but requires careful safeguards to prevent impacting real users.

Follow-up questions:

  • How do you load test a service that depends on third-party APIs with rate limits?
  • What is your approach when load testing reveals that the breaking point is below the projected peak traffic?
  • How do you ensure load test results are reproducible?

12. How do you approach reliability for stateful services like databases?

What the interviewer is really asking: Do you understand the unique reliability challenges of data persistence, including replication lag, backup recovery, failover, and data integrity?

Answer framework:

Stateful services are fundamentally harder to make reliable than stateless services because you cannot simply restart them or spin up new instances without considering the data they hold.

For high availability, implement primary-replica replication with automated failover. The primary handles all writes and replicates to one or more replicas. If the primary fails, an orchestrator (MySQL Orchestrator, Patroni for PostgreSQL) promotes a replica to primary. The critical design decision is synchronous vs asynchronous replication. Synchronous guarantees zero data loss but adds write latency. Asynchronous provides better performance but can lose the most recent writes during failover (replication lag becomes data loss).

For backup strategy, implement continuous backups (WAL archiving for PostgreSQL, binary log streaming for MySQL) that enable point-in-time recovery. Store backups in a different region from the primary database. Test backup restoration monthly by actually restoring to a temporary instance and verifying data integrity. Measure recovery time: if restoring a 2TB database takes 6 hours, your RTO for a catastrophic database failure is at least 6 hours.

For connection management, use connection pooling (PgBouncer, ProxySQL) between applications and databases. Without pooling, a microservices architecture can exhaust database connection limits. Configure pool sizes based on database capacity, not application demand. Implement connection health checks and automatic reconnection.

For query performance reliability, monitor slow query logs and set alerts on query latency degradation. Use query analysis tools to identify problematic queries before they cause outages. Implement query timeouts to prevent runaway queries from consuming all database resources. Separate read traffic to replicas using a load balancer or application-level routing.

For schema management, use online DDL tools for schema changes that avoid locking production tables. Test schema migrations against production-sized datasets to understand execution time. Implement the expand-and-contract pattern for zero-downtime schema evolution.

Address data integrity: implement checksums for critical data, monitor replication consistency with tools like pt-table-checksum, and set up alerts for data anomalies (unexpected NULL rates, value distribution shifts).

Follow-up questions:

  • How do you handle a split-brain scenario where two database nodes both believe they are the primary?
  • What is your approach to database capacity planning for storage growth?
  • How do you manage database reliability across multiple microservice databases versus a shared database?

13. How do you implement and leverage feature flags for reliability?

What the interviewer is really asking: Can you use feature flags as a reliability tool for progressive rollouts, instant rollbacks, and graceful degradation, not just for A/B testing?

Answer framework:

Feature flags are one of the most powerful reliability tools available. They decouple deployment (shipping code to production) from release (enabling functionality for users), which fundamentally changes the risk profile of deployments.

For reliability use cases, implement three categories of flags. Release flags: wrap new features in a flag that is off by default. Deploy the code, then gradually enable for 1 percent, 10 percent, 50 percent, and 100 percent of users while monitoring error rates and performance. If problems emerge at any stage, disable the flag instantly without a deployment. Operational flags (kill switches): wrap calls to non-critical dependencies in flags. When a recommendation service is slow, disable personalized recommendations and fall back to a cached default. This is the core of graceful degradation. Experiment flags: enable A/B testing of different implementations.

For flag infrastructure, use a centralized flag management service (LaunchDarkly, Unleash, or a custom solution) that supports real-time flag evaluation, targeting rules (enable for specific users, percentages, or segments), and audit logging. The flag evaluation must be fast (sub-millisecond) since it is in the hot path of every request. Cache flag values locally with a short TTL and subscribe to change events for near-real-time updates.

The flag service itself is a critical dependency. If it is unavailable, applications must fall back to cached values or sensible defaults; never make the flag service a single point of failure. Design the SDK to be resilient: local cache, default values, and a circuit breaker on the flag service connection.
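
A sketch of that client-side fallback behavior; the cache TTL, defaults, and remote-fetch function are assumptions, and commercial SDKs implement this logic for you:

```python
# Sketch of resilient client-side flag evaluation: serve from a local cache and
# fall back to safe defaults if the flag service is unreachable.
import time

DEFAULTS = {"personalized_recs": False, "new_checkout": False}  # safe defaults

class FlagClient:
    def __init__(self, fetch_remote, ttl_seconds=30):
        self.fetch_remote = fetch_remote      # callable returning {flag: bool}
        self.ttl = ttl_seconds
        self.cache = dict(DEFAULTS)
        self.fetched_at = 0.0

    def is_enabled(self, flag: str) -> bool:
        if time.monotonic() - self.fetched_at > self.ttl:
            try:
                self.cache.update(self.fetch_remote())
                self.fetched_at = time.monotonic()
            except Exception:
                pass                          # flag service down: keep last-known values
        return self.cache.get(flag, DEFAULTS.get(flag, False))

client = FlagClient(fetch_remote=lambda: {"personalized_recs": True})
print(client.is_enabled("personalized_recs"), client.is_enabled("unknown_flag"))
```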

Manage flag lifecycle rigorously. Flags that are permanently on or permanently off are technical debt. Establish a process to clean up flags after the release is complete: remove the flag and the old code path. Track flag age and alert when flags exceed a maximum age (for example, 90 days for release flags).

For incident response, maintain a dashboard of all operational flags and their current state. During an incident, the incident commander should be able to toggle kill switches without a deployment pipeline. This requires the flag service to have an administrative UI with appropriate access controls.

Follow-up questions:

  • How do you test all flag combinations when the number of flags is large?
  • How do you handle feature flags for backend services that do not have a concept of users?
  • What is your strategy for preventing flag conflicts where enabling one flag breaks another feature?

14. How do you design effective runbooks for production operations?

What the interviewer is really asking: Can you create operational documentation that enables any on-call engineer to diagnose and resolve issues quickly, even if they did not build the system?

Answer framework:

Runbooks are the bridge between an alert firing and an engineer taking effective action. Good runbooks reduce mean time to resolution and make on-call accessible to the entire team rather than depending on tribal knowledge from the original system builders.

Structure every runbook with these sections: alert description (what this alert means in plain language), impact assessment (what users are experiencing and the severity level), diagnostic steps (specific commands and queries to understand the current state, with expected outputs for both healthy and unhealthy states), remediation steps (step-by-step instructions to resolve the issue, including rollback procedures), escalation criteria (when to escalate and who to contact), and related documentation (architecture diagrams, dependency maps, historical incidents).

For diagnostic steps, provide copy-pasteable commands. Instead of writing "check the database replication lag", give the exact query to run, state that the result should be below 5 seconds, and note that anything above 30 seconds means escalating to the DBA team. Specificity prevents confusion during the high-stress context of incident response.
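
For example, a runbook diagnostic step might ship as a small script like this rather than prose. The connection string, thresholds, and the PostgreSQL replica query are assumptions about the environment:

```python
# Hypothetical runbook diagnostic: check replication lag on a PostgreSQL
# replica and tell the on-call engineer exactly what to do next.
import psycopg2

REPLICA_DSN = "host=db-replica-1 dbname=app user=readonly"   # hypothetical DSN

LAG_QUERY = """
SELECT COALESCE(
    EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())), 0)::int
"""

with psycopg2.connect(REPLICA_DSN) as conn, conn.cursor() as cur:
    cur.execute(LAG_QUERY)
    lag_seconds = cur.fetchone()[0]

if lag_seconds < 5:
    print(f"OK: replication lag {lag_seconds}s (expected < 5s)")
elif lag_seconds < 30:
    print(f"WARN: lag {lag_seconds}s, re-check in 5 minutes before escalating")
else:
    print(f"ESCALATE: lag {lag_seconds}s exceeds 30s, page the DBA team")
```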

Maintain runbooks as living documents. After every incident where a runbook was used, update it based on what was unclear, what steps were missing, and what additional information would have helped. Track which runbooks are used most frequently and invest in making those the most polished.

Automate runbook steps wherever possible. The ideal progression for each runbook entry is: manual steps documented, then semi-automated (script that executes the diagnostic steps and presents results), then fully automated (auto-remediation with human notification). Track what percentage of runbook steps are automated versus manual.

Store runbooks in a searchable location that is accessible during incidents (not behind a VPN that might be affected by the incident). Link each alerting rule directly to its corresponding runbook. Conduct regular runbook drills where team members walk through a scenario using only the runbook, which is the most effective way to identify gaps. This practice connects to broader preparation in distributed systems reasoning.

Follow-up questions:

  • How do you keep runbooks up to date as the system evolves?
  • How do you handle situations where the runbook does not cover the specific failure mode?
  • What is your approach to runbooks for rare but catastrophic scenarios?

15. How do you balance reliability investment with feature development?

What the interviewer is really asking: Can you articulate the business case for reliability work and navigate the organizational dynamics of competing priorities?

Answer framework:

This is perhaps the most important question for a senior SRE because the technical work is only as valuable as your ability to ensure it gets prioritized. The error budget model provides the data-driven framework for this balance.

Start by quantifying the cost of unreliability. Calculate the revenue impact of downtime (if the service generates a certain amount of revenue per hour, each hour of downtime costs that amount), the engineering cost of incident response (multiply average incident duration by average number of responders by their hourly cost), the productivity cost of context switching (engineers pulled from project work to fight fires), and the customer trust cost (churn rate correlation with reliability).

Use error budgets as the negotiation mechanism. When the error budget is healthy (service is exceeding its SLO), the team can deploy aggressively and take risks with new features. When the error budget is depleted or at risk, the team shifts focus to reliability improvements. This removes the subjective debate: the data determines the priority, not organizational politics.

For long-term reliability investment, frame reliability work as enabling faster feature delivery. A robust CI/CD pipeline enables daily deployments instead of weekly. Comprehensive observability reduces debugging time from hours to minutes. Self-healing infrastructure eliminates classes of incidents entirely. These are not costs; they are productivity multipliers.

Present reliability investment as a portfolio. Allocate a fixed percentage (typically 20-30 percent) of engineering capacity to reliability work each quarter. Within that budget, prioritize by error budget impact: which investments will prevent the most SLO violations?

Address the common objection that "we do not have time for reliability work" by reframing it: you do not have time not to invest in reliability. Every incident consumes engineering hours that could have been spent on features. Every manual operational task that could be automated is an ongoing tax on velocity. Track the total engineering hours spent on reactive operational work and present it as the cost of not investing in reliability.

For organizational alignment, ensure that reliability metrics (SLO attainment, incident frequency, error budget consumption) are visible to leadership alongside feature delivery metrics. When leadership sees both metrics together, they can make informed trade-off decisions.

Follow-up questions:

  • How do you handle a situation where product leadership consistently overrides reliability priorities?
  • How do you measure the return on investment for reliability engineering work?
  • How do you decide when a system is reliable enough and further investment has diminishing returns?

Common Mistakes in SRE Interviews

  1. Focusing on tools instead of principles. Candidates who describe their experience purely in terms of tools (Prometheus, Grafana, PagerDuty) without explaining the underlying SRE principles (SLO-based alerting, error budget policies, toil reduction) demonstrate operator-level thinking. Tools change; principles endure.

  2. Treating reliability as a binary state. Saying a system should be 100 percent reliable reveals a misunderstanding of SRE economics. Every nine of availability (99.9 to 99.99 percent) costs roughly 10x more to achieve. Senior SRE candidates understand that the right reliability target depends on user expectations and business impact.

  3. Describing incident response without systemic improvement. Telling a story about heroically debugging a 3 AM outage is impressive, but an SRE interviewer wants to hear what you changed afterward to prevent that class of incident from recurring. The hero narrative without follow-up suggests reactive rather than proactive engineering.

  4. Ignoring the human side of reliability. SRE is as much about organizational design as system design. Candidates who do not discuss on-call sustainability, team culture, cross-team collaboration, and change management miss a critical dimension of the role.

  5. Over-indexing on prevention while neglecting detection and response. Perfect prevention is impossible in complex systems. Senior SRE candidates design for resilience: systems that detect problems quickly, contain blast radius, and recover automatically.

How to Prepare for SRE Interviews

Read the foundational texts: the Google SRE book and the SRE Workbook. These define the vocabulary and principles that interviewers expect you to know. Pay particular attention to chapters on SLOs, error budgets, toil, and incident management.

Build operational experience: if you do not have production on-call experience, set up a personal project with monitoring, alerting, and SLOs. Break it intentionally and practice debugging. The interview will probe for real operational intuition that cannot be faked.

Study distributed systems fundamentals: consensus algorithms, replication strategies, CAP theorem trade-offs, and failure modes. SRE interviews frequently venture into distributed systems territory because reliability at scale requires deep systems knowledge.

Prepare incident stories with the STAR format: Situation (what broke), Task (your role), Action (what you did), Result (what improved). Have stories that demonstrate both incident response skills and proactive reliability engineering.

Practice system design with a reliability lens. For any system design problem, ask: what are the SLIs and SLOs? What are the failure modes? How does it degrade gracefully? What is the monitoring and alerting strategy? This reliability-first thinking is what distinguishes SRE candidates. Study architectures from Netflix and Google to understand how the industry's leaders approach reliability at scale.

For structured preparation, explore our learning paths and check pricing for premium resources that include mock SRE interviews with experienced practitioners.
