
Infrastructure as Code Interview Questions for Senior Engineers (2026)

Top Infrastructure as Code interview questions with detailed answer frameworks covering Terraform, Pulumi, CloudFormation, state management, modules, drift detection, and immutable infrastructure patterns used at companies like Google, Amazon, and Netflix.

20 min read · Updated Apr 25, 2026
interview-questions, infrastructure-as-code, terraform, pulumi, cloudformation, devops, senior-engineer

Why Infrastructure as Code Expertise Matters in Senior Engineering Interviews

Infrastructure as Code has moved from a best practice recommendation to an absolute requirement at every company operating cloud infrastructure at scale. Senior and staff engineering candidates are expected to understand not just how to write Terraform or CloudFormation templates, but the deep architectural decisions behind state management strategies, module design patterns, blast radius containment, and the organizational workflows that make IaC sustainable across hundreds of engineers and thousands of resources.

Interviewers asking IaC questions at the senior level are probing for judgment that goes beyond syntax. They want to hear you articulate why a particular state backend architecture prevents team conflicts, how module versioning strategies affect deployment velocity, or why drift detection is not just a nice-to-have but a critical operational concern that has caused real production incidents. They expect you to connect IaC decisions to broader concerns like CI/CD pipelines, cloud architecture, and the downstream effects on incident response and disaster recovery.

The questions in this guide are drawn from real interviews at companies like Google, Amazon, and other top-tier organizations. Each answer framework provides the structure to demonstrate senior-level reasoning: articulate the mechanism, explain the trade-offs, reference production experience, and address failure modes. For a broader preparation strategy, see our system design interview guide and explore the learning paths tailored to senior engineers.


1. Compare Terraform, Pulumi, and CloudFormation. When would you choose each?

What the interviewer is really asking: Do you understand the fundamental design philosophies behind each tool, and can you make a reasoned recommendation based on organizational context rather than personal preference?

Answer framework:

Start by establishing the core distinction: CloudFormation is a cloud-provider-native declarative service, Terraform is a cloud-agnostic declarative tool with its own state management, and Pulumi uses general-purpose programming languages to define infrastructure with imperative constructs that compile to declarative state.

CloudFormation is the right choice when your organization is fully committed to AWS, wants tight integration with AWS-native services like Service Catalog and Control Tower, and values the implicit state management that AWS handles for you. Its limitations become apparent with complex logic, cross-cloud requirements, or large stack sizes that hit the 500-resource limit.

Terraform excels in multi-cloud or hybrid environments. Its provider ecosystem is the broadest in the industry, covering not just cloud providers but SaaS platforms, DNS providers, monitoring tools, and virtually any API-driven service. The HCL language strikes a balance between readability for operations teams and expressiveness for complex configurations. The trade-off is explicit state management, which requires careful backend configuration and locking strategies.

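A minimal sketch of that breadth (provider versions and names are illustrative): one configuration can manage cloud and SaaS resources together, something neither CloudFormation nor any provider-native tool can do.

```hcl
# Sketch: one Terraform configuration spanning a cloud provider and a
# SaaS DNS provider. Versions and variable names are illustrative.
terraform {
  required_providers {
    aws        = { source = "hashicorp/aws", version = "~> 5.0" }
    cloudflare = { source = "cloudflare/cloudflare", version = "~> 4.0" }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_lb" "app" {
  name               = "app-lb"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
}

# DNS for the load balancer lives in Cloudflare, managed in the same plan.
resource "cloudflare_record" "app" {
  zone_id = var.cloudflare_zone_id
  name    = "app"
  type    = "CNAME"
  value   = aws_lb.app.dns_name
}
```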

Pulumi is strongest when your team has deep software engineering expertise and wants to leverage existing programming language features like loops, conditionals, type systems, and testing frameworks directly in infrastructure definitions. It eliminates the learning curve of a domain-specific language but introduces the risk of over-engineering infrastructure definitions with unnecessary abstractions.

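A short sketch of what that looks like in practice (assumes the `@pulumi/aws` SDK; names are illustrative): ordinary language constructs replace HCL's `count` and `for_each`, and the type system catches bad arguments at compile time.

```typescript
// Sketch (assumes @pulumi/aws): plain TypeScript loops and conditionals
// define infrastructure; resource names here are hypothetical.
import * as aws from "@pulumi/aws";

const environments = ["dev", "staging", "prod"] as const;

// An ordinary loop creates one bucket per environment.
const buckets = environments.map((env) =>
  new aws.s3.Bucket(`app-assets-${env}`, {
    versioning: { enabled: env === "prod" }, // conditional per environment
    tags: { environment: env },
  })
);

export const bucketNames = buckets.map((b) => b.bucket);
```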

A senior answer acknowledges that the choice often depends more on team composition, existing tooling, and organizational maturity than on technical superiority. For a detailed comparison, see Terraform vs Pulumi.


2. Explain Terraform state in depth. What problems does it solve and what problems does it create?

What the interviewer is really asking: Do you understand why state exists as a concept, how it affects team workflows, and what operational risks it introduces?

Answer framework:

Terraform state is a JSON file that maps your declared resources to real-world infrastructure objects. It exists because cloud provider APIs do not natively understand the concept of desired state in the way Terraform defines it. The state file serves as the source of truth for what Terraform has created, enabling it to compute diffs between desired and actual configuration.

State solves three critical problems. First, it provides resource identity mapping. When you rename a resource in your configuration, Terraform uses state to know that the old name and the new name refer to the same physical resource, preventing unnecessary destruction and recreation. Second, it caches resource attributes so that terraform plan does not need to query every resource from the cloud API on every run, which matters when you manage thousands of resources. Third, it tracks dependency metadata that enables correct ordering during creation and destruction.

State creates equally significant problems. It contains sensitive data in plaintext, including database passwords, API keys, and private keys for any resource that generates them. It becomes a single point of contention in team environments because concurrent modifications can corrupt it. It can drift from reality when changes are made outside of Terraform, creating a false sense of security about what actually exists in your environment.

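A remote backend with locking and encryption at rest is the baseline mitigation; a sketch (bucket and table names are hypothetical):

```hcl
# Sketch: S3 backend with server-side encryption and DynamoDB locking.
# Bucket and table names are hypothetical.
terraform {
  backend "s3" {
    bucket         = "acme-tf-state"
    key            = "platform/network.tfstate"
    region         = "us-east-1"
    encrypt        = true            # encryption at rest
    dynamodb_table = "acme-tf-locks" # prevents concurrent writes
  }
}
```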

A senior engineer should discuss mitigation strategies: remote backends with locking via DynamoDB or Consul, state encryption at rest and in transit, state file segmentation to reduce blast radius, and the use of terraform state subcommands for surgical state manipulation during refactoring. Mention that you have personally dealt with state corruption or state lock issues in production and how you resolved them.


3. How do you structure Terraform modules for a large organization with multiple teams?

What the interviewer is really asking: Can you design module architectures that balance reusability with team autonomy, and do you understand the versioning and dependency challenges that emerge at scale?

Answer framework:

Module architecture at scale requires thinking about three layers: foundational modules that encode organizational standards, composition modules that assemble infrastructure patterns, and root modules that represent deployable units owned by specific teams.

Foundational modules are thin wrappers around individual resources that enforce company-wide policies. A foundational S3 bucket module, for example, ensures that encryption, versioning, access logging, and tagging standards are always applied. These modules are owned by a platform team and versioned semver-style in a private module registry.

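A sketch of such a module (file layout and names hypothetical): encryption, versioning, and tagging are baked in rather than optional.

```hcl
# Sketch: modules/s3-bucket/main.tf in a hypothetical private registry.
variable "name" { type = string }
variable "team" { type = string }

resource "aws_s3_bucket" "this" {
  bucket = var.name
  tags = {
    team       = var.team
    managed_by = "terraform"
  }
}

# Versioning is always on; consumers cannot opt out.
resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id
  versioning_configuration { status = "Enabled" }
}

# Encryption is enforced, not a variable.
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default { sse_algorithm = "aws:kms" }
  }
}
```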

Composition modules combine foundational modules into opinionated application patterns. A "web service" composition module might create a load balancer, an ECS service, a database, and the associated networking and IAM resources, all wired together with sensible defaults that teams can override.

Root modules are the entry points that teams actually apply. They reference composition modules with pinned versions, supply environment-specific variables, and define the state backend configuration. Each root module should have a blast radius that the owning team is comfortable with. A common mistake is making root modules too large, which means a single terraform apply touches hundreds of resources across multiple concerns.

Version pinning is critical. Use exact version constraints for production (version = "2.3.1") and allow minor version ranges for development (version = "~> 2.3"). Publish modules through a private registry with automated testing on every change. For more on designing scalable platform infrastructure, see platform engineering concepts.


4. What is infrastructure drift and how do you detect and remediate it?

What the interviewer is really asking: Do you understand the operational reality that infrastructure changes outside of IaC are inevitable, and do you have a systematic approach to maintaining IaC as the source of truth?

Answer framework:

Infrastructure drift occurs when the actual state of your infrastructure diverges from what your IaC definitions declare. This happens through manual console changes during incident response, automated scaling or patching by cloud services, changes made by other tools or scripts, or even cloud provider behavior changes that modify default settings.

Drift detection has three approaches. The simplest is running terraform plan on a schedule and alerting when the plan shows changes. This works but is slow for large estates and requires valid credentials and provider configuration. More sophisticated approaches use cloud-native tools like AWS Config rules or Azure Policy that continuously monitor resource configurations against desired baselines. The most comprehensive approach combines both: IaC-level drift detection for resources Terraform manages and cloud-native policy engines for resources that might exist outside any IaC definition.

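The scheduled-plan approach can be a small CI job; a GitHub Actions sketch (names illustrative), relying on `terraform plan -detailed-exitcode`, which exits 2 when changes are pending:

```yaml
# Sketch: nightly drift detection. A pending-changes exit code (2)
# fails the job, which is the alert signal.
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * *" # daily
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - name: Detect drift
        run: |
          terraform plan -detailed-exitcode -input=false || code=$?
          if [ "${code:-0}" -eq 2 ]; then
            echo "Drift detected" && exit 1
          fi
```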

Remediation strategy matters as much as detection. Not all drift should be corrected by reverting to the IaC definition. Sometimes the manual change was correct and the IaC should be updated to match reality. A senior engineer establishes a triage process: classify drift as intentional (update the code), accidental (revert via apply), or emergent (investigate the cause before acting). Drift that affects security posture, like security group changes or IAM policy modifications, should trigger immediate alerts regardless of cause.


5. How do you handle secrets and sensitive values in Infrastructure as Code?

What the interviewer is really asking: Do you understand the tension between IaC's desire for declarative completeness and the security requirement to never store secrets in version control, and do you have practical patterns for resolving it?

Answer framework:

The fundamental challenge is that infrastructure often requires sensitive values such as database passwords, API keys, and TLS certificates, yet IaC definitions live in version control, where secrets must never appear. Senior engineers recognize that there is no single solution; the approach depends on the secret type, the IaC tool, and the organization's security maturity.

The most common pattern separates secret storage from secret reference. Secrets are stored in a dedicated secrets manager like AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault. IaC definitions reference secrets by path or ARN rather than by value. The infrastructure creates the secret container and configures rotation policies, but the actual secret value is injected through a separate, audited process.

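A sketch of the reference pattern (resource names hypothetical): Terraform manages the secret container and wires the value through at apply time, which is exactly why the state file must be encrypted.

```hcl
# Sketch: the secret container is managed by IaC; its value is injected
# out of band. Names are hypothetical.
resource "aws_secretsmanager_secret" "db_password" {
  name = "prod/orders/db-password"
}

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = aws_secretsmanager_secret.db_password.id
}

resource "aws_db_instance" "orders" {
  identifier        = "orders-prod"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 50
  username          = "orders"
  # The value flows through at apply time and lands in state,
  # which is why state encryption is mandatory.
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```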

For Terraform specifically, discuss the sensitive variable attribute that prevents values from appearing in plan output, the risk that sensitive values still appear in state files (which is why state encryption is mandatory), and tools like sops or git-crypt for encrypting variable files that must live in version control. Mention that Pulumi has built-in secret management that encrypts sensitive values in state automatically, which is a meaningful advantage. For more on securing infrastructure at scale, see security architecture patterns.


6. Explain the concept of immutable infrastructure. How does it relate to IaC?

What the interviewer is really asking: Do you understand the philosophical shift from mutable to immutable infrastructure, can you articulate the practical benefits and costs, and have you implemented it in production?

Answer framework:

Immutable infrastructure is the practice of never modifying running infrastructure components in place. Instead of updating a server with a new application version or configuration change, you build a completely new server image, deploy it alongside the old one, verify it works, and then destroy the old one. The server is treated like a compiled binary: once built, it is never changed.

This approach solves configuration drift at the most fundamental level. When servers are modified in place over time, each one accumulates a unique history of patches, configuration changes, and ad-hoc fixes. This creates snowflake servers that are impossible to reproduce reliably. Immutable infrastructure eliminates this by ensuring that every server is built from the same automated pipeline and is identical to every other server from the same build.

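A Terraform sketch of the pattern (assumes an image pipeline such as Packer produces a fresh AMI per release; names hypothetical): instances are replaced on each rollout rather than patched in place.

```hcl
# Sketch: each release ships a new AMI; instances are replaced, never mutated.
resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = var.release_ami_id # new AMI per release
  instance_type = "t3.medium"

  lifecycle {
    create_before_destroy = true # new version exists before old is removed
  }
}

resource "aws_autoscaling_group" "web" {
  name_prefix         = "web-"
  min_size            = 2
  max_size            = 6
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.web.id
    version = aws_launch_template.web.latest_version
  }

  # A new AMI triggers a rolling replacement of every instance.
  instance_refresh {
    strategy = "Rolling"
  }
}
```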

The relationship to IaC is symbiotic. IaC defines what infrastructure should exist; immutable infrastructure ensures that each component is built consistently. Together, they create a system where the entire infrastructure can be destroyed and recreated from code with high confidence. The trade-off is increased deployment complexity and longer deployment times because building a new image takes longer than updating a configuration file. Containers have largely adopted this philosophy by default, which is one reason Kubernetes has become so dominant. For a deeper comparison, see mutable vs immutable infrastructure.


7. How do you implement a CI/CD pipeline for Terraform in a team environment?

What the interviewer is really asking: Can you design a workflow that enables multiple engineers to make infrastructure changes safely, with appropriate review gates, and without stepping on each other's changes?

Answer framework:

A production-grade Terraform CI/CD pipeline must solve four problems: plan visibility so reviewers can see exactly what will change, concurrency control so two engineers cannot apply conflicting changes simultaneously, policy enforcement so changes comply with organizational standards, and rollback capability when an apply goes wrong.

The pipeline begins with a pull request. When a PR is opened that modifies Terraform files, the CI system runs terraform init and terraform plan, then posts the plan output as a PR comment. This gives reviewers a clear diff of what will actually change in the infrastructure, not just what changed in the code. Tools like Atlantis, Spacelift, and Terraform Cloud automate this pattern.

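A minimal GitHub Actions sketch of plan-on-PR (Atlantis or Terraform Cloud achieve the same with less glue; paths, directories, and the wrapper's `stdout` output are illustrative assumptions):

```yaml
# Sketch: run plan on every PR that touches infra/ and post the output
# as a PR comment. Names and paths are hypothetical.
name: terraform-plan
on:
  pull_request:
    paths: ["infra/**"]
jobs:
  plan:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write # needed to post the comment
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform -chdir=infra init -input=false
      - id: plan
        run: terraform -chdir=infra plan -no-color -input=false
      - uses: actions/github-script@v7
        env:
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: "Terraform plan:\n\n" + process.env.PLAN,
            });
```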

Concurrency control is handled at two levels. The state backend's locking mechanism (DynamoDB for S3 backends) prevents concurrent applies on the same state. At the workflow level, the CI system should queue applies and execute them serially per workspace. Atlantis handles this with its built-in locking: once someone runs atlantis plan on a directory, no one else can plan or apply that directory until the first person's changes are merged or discarded.

Policy enforcement uses tools like Open Policy Agent, Sentinel (Terraform Enterprise), or Checkov to evaluate plans against organizational rules before apply is allowed. These policies can enforce tagging standards, prevent overly permissive IAM policies, require encryption on storage resources, or limit instance sizes to approved types.

A mature pipeline also includes post-apply verification: automated tests that confirm the infrastructure is functioning as expected after changes are applied. This might include HTTP health checks, DNS resolution verification, or smoke tests against newly created endpoints. For more on CI/CD design, see CI/CD pipeline patterns.


8. What are Terraform workspaces and when should you use them versus other environment isolation strategies?

What the interviewer is really asking: Do you understand the different approaches to managing multiple environments, and can you articulate when workspaces are appropriate versus when they introduce unnecessary risk?

Answer framework:

Terraform workspaces provide lightweight environment isolation by maintaining separate state files within the same backend configuration. When you switch workspaces, Terraform reads and writes a different state file but uses the same configuration files. Environment-specific differences are handled through conditional logic or variable files keyed to the workspace name.

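A sketch of workspace-keyed configuration (values illustrative): a lookup map keyed on `terraform.workspace` keeps per-environment differences in one place.

```hcl
# Sketch: environment differences expressed as a workspace-keyed map.
locals {
  instance_type = {
    default = "t3.small"
    staging = "t3.medium"
    prod    = "m5.large"
  }
}

resource "aws_instance" "app" {
  ami = var.ami_id
  instance_type = lookup(
    local.instance_type,
    terraform.workspace,
    local.instance_type["default"],
  )

  tags = {
    Name        = "app-${terraform.workspace}"
    environment = terraform.workspace
  }
}
```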

Workspaces work well for simple differences like instance sizes, replica counts, or feature flags. They break down when environments have fundamentally different architectures. Production might need multi-region deployment, a read replica database, and a WAF configuration that development does not need at all. Cramming these differences into workspace conditionals creates brittle, hard-to-read configurations full of ternary expressions.

The alternative is directory-based isolation, where each environment has its own directory with its own root module, state backend, and variable files. The directories share code through versioned module references. This approach provides complete isolation: a change to staging cannot accidentally affect production even if there is a bug in the conditional logic. The trade-off is code duplication in the root modules, which tools like Terragrunt mitigate by generating common boilerplate.

A senior answer recommends directory-based isolation for production workloads and acknowledges that workspaces can be appropriate for ephemeral environments like feature branch deployments or short-lived testing environments. The key insight is that workspace isolation is logical (same backend, different state key) while directory isolation is physical (different backend configuration entirely), and production infrastructure deserves physical isolation.


9. How do you manage cross-stack or cross-service dependencies in IaC?

What the interviewer is really asking: Do you understand how to decompose infrastructure into manageable units while maintaining the ability for those units to reference each other's outputs safely?

Answer framework:

Cross-stack dependencies arise naturally when infrastructure is decomposed into logical units. A networking stack creates VPCs and subnets, a database stack creates RDS instances in those subnets, and an application stack deploys services that connect to those databases. Each stack needs to reference outputs from other stacks without creating tight coupling.

Terraform provides several mechanisms. Remote state data sources allow one configuration to read outputs from another configuration's state. This works but creates an implicit dependency on the state backend and requires read access to potentially sensitive state files.

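A sketch of the remote state data source (backend details hypothetical; the consumer needs read access to the producer's state):

```hcl
# Sketch: the database stack reads the networking stack's outputs.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-tf-state"
    key    = "network/prod.tfstate"
    region = "us-east-1"
  }
}

resource "aws_db_subnet_group" "orders" {
  name       = "orders-prod"
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```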

A more decoupled approach uses an intermediate data store. Stack A writes its outputs to AWS SSM Parameter Store, Consul, or a similar key-value store. Stack B reads from that store. This eliminates the need for direct state file access and creates a clear contract between stacks: as long as the parameter exists at the expected path with the expected format, the consuming stack works regardless of how the producing stack is implemented.

The most sophisticated approach uses a service catalog or infrastructure registry that tracks all deployed infrastructure components, their endpoints, and their health status. Teams register their infrastructure outputs in the catalog, and consuming teams discover dependencies through a standardized API. This is the pattern that large organizations like Google and Amazon use internally. For more on distributed systems architecture, see microservices architecture patterns.


10. Explain the Terraform plan and apply lifecycle. What can go wrong between plan and apply?

What the interviewer is really asking: Do you understand that terraform plan is not a guarantee, and do you know the specific scenarios where the applied result can differ from the planned result?

Answer framework:

The plan-apply lifecycle is Terraform's core workflow. Plan reads the current state, queries cloud provider APIs for actual resource status, compares both against the desired configuration, and produces a diff showing what will be created, modified, or destroyed. Apply executes that diff, making real API calls to cloud providers.

The critical insight is that the plan is a snapshot in time, and multiple factors can cause the actual apply to diverge from what was planned. The most common scenario is concurrent modification: between the time you run plan and the time you run apply, someone else (or an automated process) changes the infrastructure. Terraform will attempt to apply changes against an infrastructure state that no longer matches what was planned.

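The standard mitigation is a saved plan file, which pins exactly what apply may do; if the state has changed since the plan was created, Terraform refuses to apply the stale plan:

```shell
# Sketch: plan/apply with a saved plan file.
terraform plan -out=tfplan       # snapshot the intended changes
terraform show -no-color tfplan  # reviewable, exact diff
terraform apply tfplan           # applies only this saved plan;
                                 # errors if state has since changed
```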

Other failure modes include provider API rate limiting that causes some resources to be created while others fail, resulting in a partial apply that leaves infrastructure in an inconsistent state. There are also resources with eventual consistency where the cloud provider reports creation as complete but the resource is not yet available for dependent resources to reference. Some resources have attributes that can only be determined at apply time (like auto-generated IDs or ARNs) that might conflict with existing resources.

A senior engineer mitigates these risks by using saved plan files (the -out flag), implementing state locking, keeping blast radius small through state decomposition, and designing configurations that are idempotent so that re-running apply after a partial failure converges to the correct state rather than creating duplicates. The create_before_destroy lifecycle rule and the prevent_destroy lifecycle rule are important tools for managing specific resource types safely.


11. How do you test Infrastructure as Code? What testing strategies are effective?

What the interviewer is really asking: Do you treat infrastructure code with the same engineering rigor as application code, and do you understand the unique testing challenges that infrastructure presents?

Answer framework:

Infrastructure testing operates at multiple levels, each with different trade-offs between speed, cost, and confidence. Static analysis is the fastest and cheapest. Tools like terraform validate, tflint, and Checkov analyze configuration files without making any API calls. They catch syntax errors, invalid references, deprecated resource arguments, and policy violations. These run in seconds and should execute on every commit.

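As a toy illustration of the kind of rule these tools encode, a few lines of standard-library Python can flag an unencrypted S3 bucket in raw HCL text. This regex sketch is illustrative only; real tools like tflint and Checkov parse HCL properly.

```python
import re

def flag_unencrypted_buckets(hcl_text: str) -> list[str]:
    """Return names of aws_s3_bucket resources that have no matching
    server-side-encryption resource in the same file. Toy static check."""
    buckets = re.findall(r'resource\s+"aws_s3_bucket"\s+"(\w+)"', hcl_text)
    encrypted = set(re.findall(
        r'resource\s+"aws_s3_bucket_server_side_encryption_configuration"\s+"(\w+)"',
        hcl_text))
    return [b for b in buckets if b not in encrypted]

config = '''
resource "aws_s3_bucket" "logs" {}
resource "aws_s3_bucket" "assets" {}
resource "aws_s3_bucket_server_side_encryption_configuration" "logs" {}
'''
print(flag_unencrypted_buckets(config))  # ['assets']
```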

Unit testing in the IaC context means testing module logic in isolation. Terraform's built-in test framework (introduced in version 1.6) allows you to define test scenarios with mock providers. Pulumi's testing story is stronger here because you can use standard language testing frameworks to verify that your infrastructure code produces the expected resource graph without provisioning anything.

Integration testing deploys real infrastructure in an ephemeral environment, validates it, and then destroys it. Tools like Terratest (Go) and Kitchen-Terraform (Ruby) automate this workflow. These tests are expensive and slow (minutes to hours) but provide the highest confidence. The key practice is ensuring complete cleanup: every integration test must destroy everything it creates, and you should have a scheduled cleanup job that destroys any test resources that leaked due to failed teardowns.

Contract testing verifies that module interfaces are stable: inputs and outputs have not changed in breaking ways between versions. This is particularly important for shared modules consumed by multiple teams.


12. What is blast radius in the context of IaC, and how do you minimize it?

What the interviewer is really asking: Do you understand that the most dangerous risk in IaC is not failing to create something but accidentally destroying or modifying the wrong thing, and do you have concrete strategies for containment?

Answer framework:

Blast radius is the maximum scope of damage that a single IaC operation can cause. If a single terraform apply manages your entire production infrastructure, including networking, databases, compute, DNS, and CDN, then the blast radius is your entire production environment. A bug in a variable, a misconfigured provider, or a corrupted state file could destroy everything.

Minimizing blast radius requires decomposition at multiple levels. At the state level, split infrastructure into independent state files by concern and by risk profile. Networking, databases, compute, and edge infrastructure should each have their own state. A database state file change should never affect your CDN configuration.

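A sketch of that decomposition (bucket and key names hypothetical): each concern gets its own state key, so an apply in one directory cannot touch the others.

```hcl
# Sketch: network/backend.tf — one state file per concern.
terraform {
  backend "s3" {
    bucket = "acme-tf-state"
    key    = "prod/network.tfstate"
    region = "us-east-1"
  }
}

# database/backend.tf uses  key = "prod/database.tfstate"
# compute/backend.tf  uses  key = "prod/compute.tfstate"
# edge/backend.tf     uses  key = "prod/edge.tfstate"
```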

At the resource level, use lifecycle rules to protect critical resources. prevent_destroy = true on databases and encryption keys prevents accidental destruction even if the resource is removed from configuration. create_before_destroy = true on stateless resources ensures zero-downtime replacements.

At the process level, require separate approvals for high-risk changes. A change to a security group rule might auto-apply, but a change to a database instance class requires a senior engineer's explicit approval. Implement terraform plan output analysis that flags destructive operations (destroy or recreate) and blocks automatic apply when they are detected.
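That plan-analysis gate can be a short script over the JSON that `terraform show -json` emits from a saved plan; a sketch using the `resource_changes` and `actions` fields of Terraform's JSON plan format:

```python
import json

def destructive_changes(plan_json: str) -> list[str]:
    """Return addresses of resources a plan would destroy or replace.
    Input is the JSON produced by `terraform show -json tfplan`."""
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions:  # covers both destroy and replace
            flagged.append(change["address"])
    return flagged

# Hypothetical plan fragment: a replaced database and an updated SG.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_db_instance.orders",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_security_group.web",
         "change": {"actions": ["update"]}},
    ]
})
print(destructive_changes(sample))  # ['aws_db_instance.orders']
```

A CI gate would fail the pipeline and require explicit approval whenever this returns a non-empty list.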

A senior engineer also discusses recovery. Even with blast radius minimization, incidents happen. Infrastructure should be rebuildable from code within a defined recovery time objective. This means testing the full rebuild regularly, not just incremental changes. For more on disaster recovery architecture, see our system design interview guide.


13. How does Pulumi's approach to state management differ from Terraform, and what are the implications?

What the interviewer is really asking: Can you compare tooling at an architectural level, not just a feature level, and do you understand how state management design decisions affect team workflows?

Answer framework:

Pulumi offers three state management options: the Pulumi Cloud service (SaaS), self-managed backends (S3, Azure Blob, GCS), and local file storage. This mirrors Terraform's options but with important differences in implementation and philosophy.

The most significant difference is Pulumi's built-in secret management. When you mark a value as secret in Pulumi, it is encrypted in the state file using either the Pulumi Cloud's encryption service, a cloud KMS key, or a passphrase. Terraform stores all values in state as plaintext and relies on backend encryption (like S3 server-side encryption) for protection. This means that anyone with access to Terraform's state file can read every sensitive value, while Pulumi's state file is safe to inspect even if the backend encryption is compromised.

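A sketch (assumes the `@pulumi/pulumi`, `@pulumi/aws`, and `@pulumi/random` SDKs; names hypothetical): secret-wrapped values stay encrypted in state and are redacted from preview output.

```typescript
// Sketch: secret values are encrypted in Pulumi state automatically.
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as random from "@pulumi/random";

// RandomPassword.result is marked secret by the provider itself.
const dbPassword = new random.RandomPassword("db-password", { length: 32 });

const db = new aws.rds.Instance("orders", {
  engine: "postgres",
  instanceClass: "db.t3.medium",
  allocatedStorage: 50,
  username: "orders",
  password: dbPassword.result,
});

export const endpoint = db.endpoint;                      // plain output
export const password = pulumi.secret(dbPassword.result); // stays encrypted
```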

Pulumi Cloud provides additional capabilities that have no direct Terraform equivalent: a deployment history with full audit trail, a resource search API that queries across all stacks, and built-in RBAC at the stack level. Terraform Cloud offers similar features but with different design trade-offs.

The implication for teams is that Pulumi's state management requires less defensive configuration. You do not need to separately configure state encryption, worry about sensitive values leaking in plan output, or set up a separate locking mechanism. The trade-off is vendor dependency if you use Pulumi Cloud, though the self-managed backend option mitigates this. For a detailed feature comparison, see Terraform vs Pulumi.


14. How do you handle IaC for multi-region and multi-account deployments?

What the interviewer is really asking: Can you design IaC architectures that work at organizational scale, dealing with the real complexity of multiple AWS accounts, regions, and environments without creating unmaintainable sprawl?

Answer framework:

Multi-region and multi-account deployments are where IaC complexity escalates significantly. The naive approach of duplicating configurations per region and account creates maintenance nightmares. The over-abstracted approach of building a single configuration that handles every region and account combination creates brittle complexity. The right approach depends on how different your regions and accounts actually are.

For multi-account, AWS Organizations with Terraform requires separate provider configurations per account. The recommended pattern uses provider aliases and assumes roles into target accounts from a central management account.

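A sketch of the alias pattern (account IDs and role names hypothetical):

```hcl
# Sketch: aliased providers assume a deployer role in each member account.
provider "aws" {
  alias  = "prod"
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform-deployer"
  }
}

provider "aws" {
  alias  = "staging"
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/terraform-deployer"
  }
}

module "prod_network" {
  source    = "../modules/network"
  providers = { aws = aws.prod }
}
```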

Terragrunt is particularly effective for multi-region deployments because it supports hierarchical configuration. A root terragrunt.hcl defines account-level settings, region directories inherit from it and add region-specific settings, and individual component directories inherit from both and add component-specific configuration. The dependency block lets you reference outputs from other components in the same environment without reaching across region or account boundaries.

Global resources like IAM roles, Route53 hosted zones, and CloudFront distributions need special handling. They exist once per account, not once per region. Create a separate global directory at the account level that manages these resources with their own state. Regional stacks reference global outputs through the intermediate data store pattern described earlier.

A senior answer also addresses the operational challenge: deploying changes across multiple regions and accounts safely. Rolling deployments that start in a single region, validate, and progressively expand to other regions are essential. Never apply the same change to all regions simultaneously. For more on multi-region architecture, see distributed systems design.


15. What are the most important operational practices for running IaC at scale?

What the interviewer is really asking: Beyond the tooling, do you understand the people and process aspects that determine whether IaC succeeds or fails in a large organization?

Answer framework:

IaC at scale fails more often due to process and cultural issues than technical ones. The most important operational practices address governance, education, and feedback loops.

First, establish a clear ownership model. Every piece of infrastructure must have a team that owns its IaC definitions. This means every Terraform directory or Pulumi project has a CODEOWNERS entry, and changes require review from the owning team. Without ownership, shared infrastructure drifts because everyone assumes someone else is maintaining it.

Second, implement progressive delivery for infrastructure changes. Never apply changes directly to production. Every change flows through a pipeline: development, staging, production canary, production full rollout. Each stage has validation gates. This mirrors application deployment best practices but is often overlooked for infrastructure.

Third, invest in developer experience for IaC. If writing Terraform is significantly harder than clicking buttons in the console, engineers will circumvent the IaC process during incidents and never come back to codify their changes. Provide scaffolding tools that generate new module structures, pre-configured CI pipelines, and documentation templates. Make the right thing the easy thing.

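A sketch of what a scaffolding tool might generate for a new service (registry, module, and placeholder names are hypothetical): the backend, versioned composition module, and required tags are pre-wired so teams start from the paved path rather than a blank file.

```hcl
# Sketch: generated root module for a new service. Placeholders in
# UPPER_CASE are filled in by the scaffolding tool.
terraform {
  required_version = ">= 1.6"
  backend "s3" {
    bucket         = "acme-tf-state"
    key            = "services/NEW_SERVICE/prod.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "acme-tf-locks"
  }
}

module "service" {
  source  = "app.terraform.io/acme/web-service/aws" # composition module
  version = "4.1.0"

  name = "NEW_SERVICE"
  team = "TEAM_NAME"
}
```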

Fourth, monitor IaC health metrics: how many resources are managed by IaC versus manually created, how frequently drift is detected, how long it takes to resolve drift, and how often applies fail. These metrics reveal whether your IaC practice is healthy or degrading.

Fifth, plan for disaster recovery of the IaC system itself. Your state backend, module registry, CI/CD pipeline, and secrets management are all critical infrastructure. If your Terraform state bucket is deleted, can you recover? If your module registry goes down, can teams still deploy? These meta-infrastructure concerns are often neglected until they cause an incident. For comprehensive interview preparation, check our pricing page to explore structured learning plans.


How to Practice

Infrastructure as Code skills are best developed through hands-on practice with real cloud resources. Start with a free-tier cloud account and build progressively complex infrastructure.

  1. Build a complete environment from scratch. Create a VPC, subnets, security groups, an EC2 instance, an RDS database, and a load balancer entirely in Terraform. Do not touch the console. When something breaks, fix it in code.

  2. Practice state surgery. Intentionally create state issues and resolve them. Import existing resources into state, move resources between state files, remove resources from state without destroying them. These skills are critical during production incidents.

  3. Implement a module with tests. Build a reusable module, publish it to a private registry, consume it from another configuration, and write both static analysis checks and integration tests for it.

  4. Simulate drift and remediate it. Deploy infrastructure with Terraform, make manual changes through the console, then detect and resolve the drift through code. Practice the decision process of whether to update code or revert the manual change.

  5. Build a CI/CD pipeline. Set up Atlantis or a GitHub Actions workflow that runs plan on PRs and apply on merge. Experience the workflow that production teams use daily.

  6. Try multiple tools. Build the same infrastructure in Terraform and Pulumi. Notice what each tool makes easy and what each makes hard. This comparative experience gives you credibility in tool selection discussions.
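For the state-surgery drills in step 2, recent Terraform versions let you rehearse much of this declaratively in configuration rather than with imperative commands (the older CLI equivalents are terraform state mv, terraform import, and terraform state rm). The resource addresses below are illustrative:

```hcl
# Rename a resource in state without destroy/recreate (Terraform 1.1+).
moved {
  from = aws_instance.old_name
  to   = aws_instance.web
}

# Adopt an existing, manually created resource into state (Terraform 1.5+).
import {
  to = aws_s3_bucket.logs
  id = "example-logs-bucket" # the real resource identifier goes here
}

# Drop a resource from state without destroying the real resource (Terraform 1.7+).
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false
  }
}
```

Practicing both the declarative blocks and their CLI equivalents is worthwhile, since incident-time surgery often happens on older Terraform versions.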

For structured practice with feedback, explore algoroq's learning paths for infrastructure and DevOps topics, and review our system design interview guide for broader architectural preparation.
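For the integration tests in exercise 3, Terraform's native test framework (1.6+) is one option. A minimal sketch, assuming a module that exposes name and environment variables and an aws_s3_bucket.artifacts resource following a naming convention:

```hcl
# tests/basic.tftest.hcl -- assumed module interface; names are illustrative
run "bucket_follows_naming_convention" {
  command = plan

  variables {
    name        = "demo"
    environment = "dev"
  }

  assert {
    condition     = aws_s3_bucket.artifacts.bucket == "demo-dev-artifacts"
    error_message = "Bucket name did not follow the <name>-<environment>-artifacts convention"
  }
}
```

Running against command = plan keeps the test cheap; promoting selected runs to command = apply exercises real resources in an ephemeral account.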


Common Mistakes to Avoid

  1. Treating IaC as a one-time migration project. IaC is an ongoing discipline, not a project with an end date. Organizations that import existing infrastructure into Terraform and then stop investing in IaC practices quickly find themselves back in manual-change territory.

  2. Over-abstracting modules too early. Building highly parameterized, generic modules before you have enough use cases to understand the real abstraction boundaries leads to modules that are harder to use than raw resources. Start specific, refactor to generic when the pattern repeats three times.

  3. Ignoring state security. State files contain sensitive data. Treating them as regular files without encryption, access control, and audit logging is a security vulnerability that interviewers will probe.

  4. Monolithic state files. Managing hundreds of resources in a single state file creates long plan times, high blast radius, and frequent state lock contention. Decompose early and aggressively.

  5. Not testing infrastructure code. Skipping tests because infrastructure is expensive to test is a false economy. The cost of a production incident caused by an untested change far exceeds the cost of ephemeral test environments.

  6. Conflating environment differences with IaC complexity. If your staging and production configurations diverge significantly, the problem is not your IaC tooling. The problem is that staging does not accurately represent production, which undermines its purpose.

  7. Manual changes during incidents without follow-up. Making manual changes during an incident is sometimes necessary. Not codifying those changes in IaC afterward is how drift becomes permanent and trust in IaC erodes.

  8. Forgetting to plan for IaC tool upgrades. Terraform, Pulumi, and CloudFormation all release breaking changes. Pin your tool versions, test upgrades in lower environments, and budget time for version migrations. Running outdated versions accumulates security and compatibility debt.
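The remedy for mistake 8 is largely mechanical: pin both the tool and provider versions in configuration so upgrades are deliberate. The version constraints below are examples, not recommendations:

```hcl
terraform {
  # Allow patch releases only; bump the minor version deliberately.
  required_version = "~> 1.9.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.60" # pessimistic constraint: any 5.x >= 5.60
    }
  }
}
```

Committing the generated .terraform.lock.hcl file alongside this pins exact provider builds across machines and CI.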

For more interview preparation topics, explore our guides on Kubernetes, Docker, CI/CD, and cloud architecture. Check out algoroq's pricing for structured interview preparation plans.

GO DEEPER

Master this topic in our 12-week cohort

Our Advanced System Design cohort covers this and 11 other deep-dive topics with live sessions, assignments, and expert feedback.