
Data Engineering Interview Questions for Senior Engineers (2026)

Master advanced data engineering interview questions covering data pipeline architecture, warehouse design, batch and stream processing, data quality, and scalable data infrastructure for senior roles.

20 min read · Updated Apr 19, 2026

Why Data Engineering Matters in Senior Engineering Interviews

Data engineering has evolved from a niche specialty into a core competency that every senior engineer encounters. Companies like Airbnb, Spotify, and Uber have built their competitive advantages on sophisticated data infrastructure, and the demand for engineers who can design, build, and operate data platforms at scale continues to outpace supply. Even if your title isn't "Data Engineer," senior-level interviews increasingly include questions about data pipeline design, warehouse architecture, and data quality.

At the senior level, interviewers expect you to reason about end-to-end data flows — from ingestion through transformation to serving — with deep understanding of trade-offs around latency, cost, reliability, and data quality. They want to see that you can choose the right tools for specific problems rather than defaulting to whatever's trending.

This guide covers the most demanding data engineering interview questions with frameworks that demonstrate architectural thinking. For related topics, see our ETL pipeline concepts, how data warehouses work, and system design interview guide.


1. Design a data pipeline architecture for a company processing 10 billion events per day.

What the interviewer is really asking: Can you think end-to-end about data flow, from ingestion to analytics, with appropriate technology choices at each layer?

Answer framework:

10 billion events/day = ~115K events/second average, with peaks potentially 3-5x higher (350-575K events/sec). At an average event size of 500 bytes, that's ~5TB/day of raw data.

Architecture layers:

Ingestion layer: Use Apache Kafka as the central event bus. Events arrive from application servers, mobile SDKs, and third-party integrations via HTTP endpoints. An API gateway validates and routes events to Kafka topics. Kafka handles the peak throughput with horizontal partitioning — 100+ partitions on high-throughput topics.
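
As a rough sketch of that ingestion path, the producer side might batch and compress events before they hit Kafka. This is a minimal illustration using the kafka-python client; broker addresses, the topic name, and the tuning values are placeholders rather than recommendations.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical brokers and topic; tune batching and compression for your workload.
producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    acks="all",                      # wait for in-sync replicas (durability over latency)
    compression_type="lz4",          # cheap CPU, large bandwidth savings at this volume
    linger_ms=20,                    # allow a small batching delay to build larger batches
    batch_size=256 * 1024,           # 256 KB batches
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=lambda k: k.encode("utf-8"),
)

event = {"event_type": "page_view", "user_id": "u-123", "ts": "2026-04-19T12:00:00Z"}
# Keying by user_id keeps a given user's events ordered within one partition.
producer.send("events.page_views", key=event["user_id"], value=event)
producer.flush()
```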

Stream processing layer: Apache Flink or Kafka Streams processes events in real-time for use cases requiring low latency: fraud detection, real-time dashboards, alerting. Flink provides exactly-once semantics, event-time processing, and windowed aggregations. Output goes to operational stores (Redis for real-time metrics, Elasticsearch for real-time search).

Batch processing layer: Events are also written from Kafka to object storage (S3/GCS) in Parquet format using a connector like Kafka Connect with the S3 Sink connector. A batch processing engine (Spark or dbt on a warehouse) runs scheduled transformations — hourly, daily aggregations, ML feature computation, and data quality checks.

Storage layer: Raw events in object storage (data lake), processed data in a data warehouse (BigQuery, Snowflake, or Redshift) for analytics, and derived datasets in serving stores (Redis, PostgreSQL) for application consumption.

Orchestration: Apache Airflow schedules and monitors batch pipelines. DAGs define dependencies between tasks with retry logic, alerting, and SLA monitoring.

Data quality: Implement schema validation at ingestion (schema registry for Kafka), data quality checks between pipeline stages (Great Expectations or dbt tests), and end-to-end data reconciliation (compare event counts between source, Kafka, and warehouse).

This architecture implements the Lambda pattern (parallel batch and stream paths) or can be simplified to a Kappa architecture (stream-only) if batch reprocessing requirements are minimal. See our Kafka deep dive and Kafka vs RabbitMQ comparison for technology selection context.

Follow-up questions:

  • How would you handle schema evolution as event formats change over time?
  • What happens when a Kafka consumer falls behind? How do you handle backpressure?
  • How would you ensure exactly-once delivery from source to warehouse?

2. Explain the difference between a data lake, data warehouse, and data lakehouse.

What the interviewer is really asking: Do you understand the evolution of data architecture and can you choose the right approach for different requirements?

Answer framework:

Data warehouse (Snowflake, BigQuery, Redshift): Structured, schema-on-write storage optimized for analytical queries. Data is cleaned, transformed, and loaded into a predefined schema (star or snowflake schema). Strengths: fast query performance (columnar storage, query optimization), strong governance (schema enforcement, access control), excellent for BI/reporting. Weaknesses: expensive for large data volumes, requires upfront schema design, limited support for unstructured data.

Data lake (S3/GCS + Spark/Presto): Raw data stored in open formats (Parquet, ORC, JSON) on object storage. Schema-on-read — schema is applied when data is queried, not when stored. Strengths: cheap storage, handles any data format (structured, semi-structured, unstructured), supports ML workloads on raw data. Weaknesses: no query optimization (query performance depends on file organization), easy to become a "data swamp" without governance, limited ACID transactions.

Data lakehouse (Databricks/Delta Lake, Apache Iceberg, Apache Hudi): Combines data lake economics with warehouse capabilities. Uses open table formats (Delta, Iceberg, Hudi) that add ACID transactions, schema evolution, time travel, and efficient query planning on top of object storage. Strengths: single copy of data for both analytics and ML, warehouse-grade performance with data lake economics, open formats prevent vendor lock-in. Weaknesses: newer technology (less mature tooling), requires expertise to tune performance.
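
As a minimal sketch of what a lakehouse table format adds on top of object storage, the following assumes Delta Lake via PySpark (the delta-spark package) with placeholder paths; Iceberg and Hudi expose similar capabilities through their own APIs.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is on the classpath; paths are placeholders.
spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.read.json("s3://my-bucket/raw/events/2026-04-19/")

# ACID write: concurrent readers never see a partially written table.
events.write.format("delta").mode("append").save("s3://my-bucket/lake/events")

# Time travel: read the table as it existed at an earlier version (42 is illustrative).
earlier_snapshot = spark.read.format("delta").option("versionAsOf", 42) \
    .load("s3://my-bucket/lake/events")
```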

Decision framework: If your primary workload is BI/reporting and team prefers SQL, start with a warehouse. If you have diverse data types and ML workloads, consider a lakehouse. Pure data lakes are increasingly rare as lakehouse formats provide better governance at minimal additional cost.

See our guides on how data warehouses work and distributed systems.

Follow-up questions:

  • How does Apache Iceberg solve the metadata management problem of data lakes?
  • When would you use both a data lake and a data warehouse in the same architecture?
  • What is the query performance difference between Iceberg tables and native warehouse tables?

3. How do you design a dimensional model for an analytics warehouse?

What the interviewer is really asking: Do you understand star schemas, slowly changing dimensions, and the modeling decisions that determine query performance and analyst productivity?

Answer framework:

Dimensional modeling (Kimball methodology) organizes data into fact tables (measurements/events) and dimension tables (descriptive context). This structure optimizes for analytical queries by minimizing joins and enabling intuitive exploration.

Fact tables contain quantitative measurements — revenue, click counts, order amounts, page views. Each row represents an event or measurement at the lowest grain (individual transactions, hourly metrics). Fact tables have foreign keys to dimension tables and numeric measure columns. Types: transaction facts (one row per event), periodic snapshot facts (one row per entity per period), and accumulating snapshot facts (one row per process lifecycle).

Dimension tables contain descriptive attributes — customer demographics, product categories, store locations, time hierarchies. They are wide (many columns) and relatively small. They enable slicing and dicing facts by different perspectives.

Star schema: Fact table in the center with foreign keys to denormalized dimension tables. Simple, fast queries, and easy for analysts to understand. This is the standard approach.

Slowly Changing Dimensions (SCD): How to handle dimension attributes that change over time (a Type 2 sketch follows the list).

  • Type 1: Overwrite the old value. Simple, but loses history.
  • Type 2: Add a new row with effective dates and a current flag. Preserves history, allows point-in-time analysis. Most common for important dimensions.
  • Type 3: Add a "previous" column. Limited history (only one change).
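
A minimal pandas sketch of the Type 2 pattern, with illustrative column names; a warehouse implementation would typically use MERGE statements or dbt snapshots instead.

```python
import pandas as pd

# Existing dimension with one current row per customer (illustrative columns).
dim_customer = pd.DataFrame([
    {"customer_id": 1, "segment": "consumer", "effective_from": "2024-01-01",
     "effective_to": "9999-12-31", "is_current": True},
])

def apply_scd2(dim: pd.DataFrame, customer_id: int,
               new_segment: str, change_date: str) -> pd.DataFrame:
    """Close the current row and append a new current row (SCD Type 2)."""
    current = (dim["customer_id"] == customer_id) & dim["is_current"]
    if dim.loc[current, "segment"].iloc[0] == new_segment:
        return dim  # attribute unchanged, nothing to do
    dim.loc[current, ["effective_to", "is_current"]] = [change_date, False]
    new_row = {"customer_id": customer_id, "segment": new_segment,
               "effective_from": change_date, "effective_to": "9999-12-31",
               "is_current": True}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim_customer = apply_scd2(dim_customer, customer_id=1,
                          new_segment="enterprise", change_date="2026-04-19")
```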

Example for an e-commerce company: a fact table fct_orders (order_id, customer_key, product_key, date_key, quantity, revenue, discount) with dimensions dim_customer (customer_key, name, segment, city, state), dim_product (product_key, name, category, subcategory, brand), dim_date (date_key, date, day_of_week, month, quarter, year, is_holiday).

Modern approaches: the dbt community has popularized a layered approach — staging (rename/type-cast), intermediate (business logic), and mart (final dimensional model) layers, enabling testing and version control for data transformations.

See our guides on how data warehouses work and system design interviews.

Follow-up questions:

  • When would you use a snowflake schema instead of a star schema?
  • How would you handle a fact table with billions of rows in terms of partitioning and clustering?
  • What is the One Big Table (OBT) pattern and when is it appropriate?

4. How do you ensure data quality in production data pipelines?

What the interviewer is really asking: Do you treat data quality as a first-class concern with systematic validation, or just hope the data is correct?

Answer framework:

Data quality encompasses six dimensions: accuracy, completeness, consistency, timeliness, uniqueness, and validity. A production data quality strategy addresses all of these.

Prevention (schema-level): Enforce schemas at ingestion using a schema registry (Confluent Schema Registry for Kafka). Reject invalid events at the API gateway. Use strict typing in pipeline code.

Detection (testing):

  • Schema tests: Column types, not-null constraints, unique constraints, referential integrity (foreign keys exist in dimension tables).
  • Volume tests: Row count within expected range (±20% of yesterday), no empty tables after pipeline runs.
  • Freshness tests: Data arrived within the expected time window (e.g., the latest timestamp is within the last hour).
  • Distribution tests: Column value distributions haven't shifted dramatically (detecting data drift). Percentile values are within expected ranges.
  • Custom business rules: Revenue is never negative, order dates are not in the future, customer age is between 0 and 150.

Tools: dbt tests (built-in and custom), Great Expectations (Python-based data validation), Monte Carlo / Bigeye (data observability platforms), and Soda Core (open-source data quality).
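
When a full framework is overkill, the checks above can also be hand-rolled between pipeline stages. A minimal pandas sketch, with illustrative column names and thresholds:

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, expected_rows: int) -> list[str]:
    """Hand-rolled versions of a few checks from the list above; returns failures."""
    failures = []
    # Schema test: the primary key must be unique and non-null.
    if df["order_id"].isna().any() or df["order_id"].duplicated().any():
        failures.append("order_id must be unique and not null")
    # Volume test: row count within ±20% of the previous run.
    if not 0.8 * expected_rows <= len(df) <= 1.2 * expected_rows:
        failures.append(f"row count {len(df)} outside expected range around {expected_rows}")
    # Freshness test: the newest event must be less than an hour old.
    if pd.Timestamp.now(tz="UTC") - df["event_ts"].max() > pd.Timedelta(hours=1):
        failures.append("stale data: latest event_ts is older than one hour")
    # Business rule: revenue is never negative.
    if (df["revenue"] < 0).any():
        failures.append("negative revenue values found")
    return failures

# Toy batch standing in for a pipeline output; halt downstream tasks on failure.
orders_df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "revenue": [10.0, 25.5, 7.2],
    "event_ts": [pd.Timestamp.now(tz="UTC") - pd.Timedelta(minutes=10)] * 3,
})
failures = run_quality_checks(orders_df, expected_rows=3)
if failures:
    raise RuntimeError(f"Data quality checks failed: {failures}")
```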

Response (alerting and recovery): When data quality checks fail, alert the data team (PagerDuty, Slack), halt downstream pipelines to prevent bad data propagation, and provide clear information about which check failed and the impact.

Data contracts: Formal agreements between data producers and consumers defining schema, SLAs (freshness, completeness), and ownership. When a producer changes their schema, the contract ensures consumers are notified and migration is coordinated.

Lineage tracking: Know where data comes from and where it goes. When a quality issue is detected, lineage helps identify the root cause and all affected downstream assets. Tools: dbt docs, OpenLineage, DataHub, Apache Atlas.

See our monitoring system design and backend development crash course for related observability patterns.

Follow-up questions:

  • How would you implement data quality checks without slowing down pipeline execution?
  • What is a data contract, and how do you enforce it across teams?
  • How would you handle a scenario where 5% of events have corrupted data?

5. Compare batch processing frameworks: Spark, Flink, and dbt.

What the interviewer is really asking: Can you choose the right processing framework based on requirements rather than defaulting to Spark for everything?

Answer framework:

Apache Spark: General-purpose distributed processing engine. Strengths: massive ecosystem, handles batch and micro-batch streaming, supports SQL (Spark SQL), Python (PySpark), Scala, and Java. ML libraries (MLlib). Best for: large-scale data transformations requiring custom code, ML feature engineering, processing diverse data formats. Weaknesses: complex cluster management, high latency for streaming (micro-batch), expensive for simple SQL transformations.
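
As a rough illustration of the Spark sweet spot, here is a minimal PySpark batch aggregation over lake data; bucket paths and column names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

# Read raw Parquet events from the data lake and restrict to one day (placeholder path).
events = (
    spark.read.parquet("s3://my-bucket/lake/events/")
    .filter(F.col("event_date") == "2026-04-19")
)

daily_revenue = (
    events.filter(F.col("event_type") == "purchase")
    .groupBy("event_date", "country")
    .agg(F.sum("amount").alias("revenue"),
         F.countDistinct("user_id").alias("buyers"))
)

# Overwrite the day's partition so re-runs are idempotent.
daily_revenue.write.mode("overwrite").partitionBy("event_date") \
    .parquet("s3://my-bucket/lake/marts/daily_revenue/")
```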

Apache Flink: Stream-first processing engine that also handles batch. Strengths: true streaming with exactly-once semantics, event-time processing with watermarks, low latency (milliseconds). Best for: real-time analytics, CEP (complex event processing), streaming ETL. Weaknesses: smaller ecosystem than Spark, steeper learning curve, and a smaller pool of engineers familiar with it.

dbt (data build tool): SQL-based transformation framework that runs inside the warehouse. Strengths: pure SQL (accessible to analysts), version-controlled transformations, built-in testing and documentation, incremental models. Best for: warehouse-centric transformations, analytics engineering, when the team is SQL-proficient. Weaknesses: limited to what SQL can express (no ML, complex algorithms), depends on warehouse compute (can be expensive at scale), no streaming support.

Decision framework:

  • Need real-time processing? → Flink
  • Need complex transformations on diverse data formats with custom code? → Spark
  • Need SQL transformations within a warehouse for analytics? → dbt
  • Need both batch and simple streaming? → Spark Structured Streaming
  • Building an analytics engineering practice? → dbt

Many organizations use multiple tools: dbt for warehouse transformations, Spark for data lake processing, and Flink for real-time streams. The key is matching the tool to the use case rather than standardizing on one.

See our guides on how Spark works, the Kafka vs Flink comparison, and distributed systems.

Follow-up questions:

  • How would you decide between running Spark on your own cluster vs using a managed service?
  • What are dbt's incremental models and when would you use them?
  • How does Flink achieve exactly-once semantics in streaming?

6. How would you design a data platform for a company transitioning from monolithic to microservices?

What the interviewer is really asking: Do you understand the data challenges of microservices — specifically, how to aggregate data from distributed services for analytics?

Answer framework:

In a monolith, analytics queries join tables directly. In microservices, each service owns its database, and direct cross-service joins are impossible. This requires a data integration strategy.

Event-driven data integration: Each service publishes domain events to Kafka (OrderCreated, PaymentProcessed, UserRegistered). A data platform team consumes these events and integrates them in a central data store.

Change Data Capture (CDC): Use Debezium to capture changes from each service's database and publish to Kafka. This is less invasive than requiring services to publish events — the data platform taps into the database changelog directly.
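
A hedged sketch of registering a Debezium Postgres connector through the Kafka Connect REST API; hostnames and credentials are placeholders, and some property names (for example the topic prefix) vary across Debezium versions, so check the docs for your release.

```python
import requests

# Register a Debezium Postgres source connector with the Kafka Connect REST API.
connector = {
    "name": "orders-service-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "orders",
        "topic.prefix": "orders-service",   # CDC topics like orders-service.public.orders
        "table.include.list": "public.orders,public.order_items",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
```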

Central data store: Events and CDC data flow into a data lake (object storage) and/or data warehouse. The data platform team builds integrated models that join data across service boundaries — something individual services cannot do.

Data mesh consideration: For large organizations, a centralized data platform becomes a bottleneck. The data mesh pattern decentralizes data ownership — each domain team owns their data products (well-defined, quality-assured datasets) while a central team provides the platform infrastructure (storage, processing, governance tools).

Key challenges:

  • Schema evolution: Services evolve independently. Use a schema registry to manage compatibility. Choose backward-compatible evolution (new fields are optional, old fields are deprecated not removed).
  • Data consistency: Eventual consistency between services means the central store may temporarily have inconsistent cross-service data. Build reconciliation processes.
  • Data ownership: Define clear ownership for each dataset. The data platform team integrates but doesn't own domain data.
  • PII and compliance: Centralized data stores must respect per-service access controls. Implement column-level encryption and access policies.

See our guides on how Kafka works, microservices architecture, and the backend development crash course.

Follow-up questions:

  • How would you handle a service that changes its database schema without notifying the data team?
  • What is the difference between the data mesh and the traditional centralized data platform?
  • How do you handle PII from multiple services in a central data store?

7. Explain partitioning and clustering strategies for data warehouses.

What the interviewer is really asking: Do you understand how physical data organization affects query performance and cost in modern data warehouses?

Answer framework:

Partitioning divides a table into segments based on a column value (usually a date). Queries that filter on the partition column only scan relevant partitions, dramatically reducing data processed and cost.

BigQuery: partition by date column (daily partitions), integer range, or ingestion time. Partitions can be required in queries (require_partition_filter = true) to prevent expensive full-table scans.

Snowflake: uses micro-partitions (data is automatically divided into chunks of roughly 50-500MB of uncompressed data each). Pruning automatically skips micro-partitions based on metadata (min/max values per partition). Clustering keys co-locate similar values in the same micro-partitions, improving pruning for specific query patterns.

Redshift: uses distribution styles (KEY, EVEN, ALL, AUTO) to distribute data across nodes, and sort keys (compound or interleaved) to optimize range-restricted queries.

Clustering (sort order within partitions) optimizes queries that filter on specific columns. In BigQuery, clustering on [region, product_category] sorts data by these columns within each partition, enabling efficient range pruning. In Snowflake, clustering keys serve the same purpose.
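
A sketch of the BigQuery example above as DDL issued through the google-cloud-bigquery client; the dataset, table, and column names are placeholders.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Daily partitioning on the event timestamp, clustering on common filter columns.
# require_partition_filter blocks accidental full-table scans.
ddl = """
CREATE TABLE IF NOT EXISTS analytics.events (
  event_ts         TIMESTAMP,
  region           STRING,
  product_category STRING,
  revenue          NUMERIC
)
PARTITION BY DATE(event_ts)
CLUSTER BY region, product_category
OPTIONS (require_partition_filter = TRUE)
"""
client.query(ddl).result()
```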

Design guidelines:

  • Partition by the column most commonly used for time-range filters (usually a date)
  • Cluster by columns frequently used in WHERE clauses and JOIN conditions
  • Order clustering columns deliberately (most frequently filtered first in BigQuery; lowest to highest cardinality in Snowflake)
  • Monitor query scan volumes to identify partitioning/clustering opportunities

Anti-patterns: partitioning on high-cardinality columns (creates millions of tiny partitions), not including partition filters in queries (expensive full scans), and clustering on columns rarely used in filters.

See our guides on how data warehouses work and system design interviews.

Follow-up questions:

  • How does BigQuery's automatic re-clustering work?
  • What is the cost impact of poorly chosen partition/clustering keys?
  • How would you handle a table that's queried by both date and customer_id equally?

8. How do you handle late-arriving data in a data pipeline?

What the interviewer is really asking: Do you understand the practical challenges of time-based processing in distributed systems where events arrive out of order?

Answer framework:

Late-arriving data is inevitable in distributed systems. Events may be delayed due to network issues, mobile devices being offline, third-party integrations with batch delivery, or timezone/clock skew issues.

Stream processing approach: Apache Flink uses watermarks to handle late data. A watermark declares that no events earlier than a certain timestamp will arrive. Events arriving after the watermark are "late." Strategies:

  • Drop late events: Simplest but loses data. Acceptable when precision isn't critical.
  • Allow lateness: Flink's allowedLateness parameter accepts late events within a window (e.g., 1 hour) and updates results. Downstream consumers must handle updates.
  • Side output: Route late events to a separate stream for special handling (manual review, separate processing).

Batch processing approach: Design pipelines for reprocessing.

  • Idempotent writes: Pipelines can be re-run without creating duplicates. Use MERGE/UPSERT operations or write to partitions that overwrite completely.
  • Restatement windows: Process data in overlapping windows. The daily pipeline processes the last 3 days, not just today, catching any late events from previous days.
  • Event time vs processing time: Always use event timestamps (when the event occurred) for partitioning and windowing, not processing timestamps (when the pipeline processed it).

Warehouse approach: Use MERGE (upsert) operations to handle late-arriving facts. For Type 2 SCD dimensions, late-arriving dimension changes require retroactive adjustments — insert the correct historical row and adjust effective dates.
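
A minimal sketch combining the restatement-window and idempotent-write ideas above; pandas stands in for warehouse MERGE logic, and the three-day window is illustrative.

```python
from datetime import date, timedelta
import pandas as pd

RESTATEMENT_DAYS = 3  # reprocess the trailing window to pick up late events

def partitions_to_reprocess(run_date: date) -> list[date]:
    """Event-date partitions the daily job rebuilds on every run."""
    return [run_date - timedelta(days=d) for d in range(RESTATEMENT_DAYS)]

def upsert(target: pd.DataFrame, fresh: pd.DataFrame, key: str = "order_id") -> pd.DataFrame:
    """Idempotent merge: re-running with the same input yields the same table."""
    kept = target[~target[key].isin(fresh[key])]
    return pd.concat([kept, fresh], ignore_index=True)

# On 2026-04-19 the job rebuilds 2026-04-17..19, so an event arriving two days
# late still lands in the right partition before the metric is declared final.
print(partitions_to_reprocess(date(2026, 4, 19)))
```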

Data quality implications: late-arriving data means dashboard numbers may change retroactively. Communicate this to stakeholders. Define SLAs for when numbers are considered "final" (e.g., T+3 days for daily metrics).

See our stream processing concepts and distributed systems guide.

Follow-up questions:

  • How do you decide the watermark delay for a streaming pipeline?
  • What is the impact of late-arriving data on aggregate metrics that have already been served to dashboards?
  • How would you design a pipeline that needs to be reprocessed from scratch?

9. Describe your approach to data modeling in a modern data stack.

What the interviewer is really asking: Do you follow a structured modeling methodology, or do you create ad-hoc tables without a strategy?

Answer framework:

The modern data stack typically uses a layered modeling approach, popularized by dbt:

Staging layer (stg_): 1:1 with source tables. Light transformations — rename columns to consistent conventions, cast types, standardize values (lowercase, trim). No business logic. Each staging model maps to one source.

Intermediate layer (int_): Business logic transformations. Joins between staging models, deduplication, business rule application, aggregation preparation. These are internal building blocks, not exposed to end users.

Mart layer (mart_ or fct_/dim_): Final tables exposed to analysts and dashboards. Follows dimensional modeling — fact tables for metrics and dimension tables for attributes. Optimized for query patterns.

Metrics layer: dbt Metrics or a semantic layer (Looker, Cube.js, MetricFlow) defines business metrics centrally. Instead of each dashboard calculating "revenue" differently, the metrics layer provides a single definition. This ensures consistency across all consumers.

Modeling principles:

  • DRY (Don't Repeat Yourself): Business logic appears in one place. Downstream models reference it.
  • Incremental models: For large tables, process only new/changed rows instead of rebuilding the entire table. dbt's incremental materialization handles this with a merge strategy.
  • Snapshot models: Capture slowly changing dimension history automatically. dbt snapshots check for changes and create SCD Type 2 records.
  • Testing at every layer: Schema tests (not null, unique, accepted values), data tests (custom SQL assertions), and freshness tests.

Version control and CI/CD: all transformations in Git, PR reviews for model changes, CI pipelines that test on a development copy of data before merging, and documentation auto-generated from model configs.
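
Outside dbt, the incremental-model pattern mentioned above reduces to finding the high-water mark already loaded, pulling only rows newer than it, and merging them in. A minimal pandas sketch with illustrative column names:

```python
import pandas as pd

def incremental_load(target: pd.DataFrame, source: pd.DataFrame,
                     key: str = "order_id", cursor: str = "updated_at") -> pd.DataFrame:
    """Process only rows newer than the target's high-water mark, then merge them in."""
    high_water_mark = target[cursor].max() if len(target) else pd.Timestamp.min
    new_or_changed = source[source[cursor] > high_water_mark]
    unchanged = target[~target[key].isin(new_or_changed[key])]
    return pd.concat([unchanged, new_or_changed], ignore_index=True)

# Daily run (conceptually what dbt's incremental materialization does with MERGE):
# mart_orders = incremental_load(mart_orders, todays_extract)
```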

See our guides on how data warehouses work and system design interviews.

Follow-up questions:

  • How would you handle a transformation that's too complex for SQL?
  • What is the difference between a view, table, and incremental materialization in dbt?
  • How do you handle breaking changes in the staging layer without affecting downstream models?

10. How do you approach data pipeline orchestration and monitoring?

What the interviewer is really asking: Do you understand the operational aspects of keeping data pipelines reliable, or just the initial build?

Answer framework:

Orchestration defines when pipelines run, in what order, and what happens when they fail.

Apache Airflow is the industry standard. Key concepts (a minimal DAG sketch follows this list):

  • DAGs (Directed Acyclic Graphs): Define task dependencies. Task B runs only after Task A succeeds.
  • Sensors: Wait for external conditions (file arrival, partition creation, API availability) before proceeding.
  • Retry policies: Automatic retry with exponential backoff for transient failures.
  • SLAs: Define expected completion times. Alert when pipelines miss their SLAs.
  • Idempotency: Every task should be safely re-runnable. Use date-partitioned outputs and upsert logic.
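
A minimal DAG sketch illustrating these concepts; the task bodies are placeholders, and the schedule argument assumes Airflow 2.4+ (older versions use schedule_interval).

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...
def transform(): ...
def load(): ...

# Daily schedule, retries with exponential backoff, a task SLA, and catchup disabled.
with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2026, 4, 1),
    schedule="@daily",
    catchup=False,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True,
        "sla": timedelta(hours=2),
    },
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task  # each task runs only after the previous succeeds
```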

Alternatives: Prefect (modern Python-native), Dagster (asset-centric, strong testing), and Mage (collaborative notebook-style). Choose based on team experience and requirements.

Monitoring ensures pipelines are healthy and data is trustworthy:

  • Pipeline health: Task success/failure rates, execution duration trends (detect drift), queue depth (tasks waiting to run), resource utilization (Spark cluster CPU/memory).
  • Data health: Row counts per pipeline run (detect drops), freshness (time since last update), schema changes (detected automatically), and data quality test results.
  • Alerting: PagerDuty for critical pipeline failures (revenue-impacting), Slack for warnings (SLA approaching). Avoid alert fatigue — only page for actionable issues.
  • Observability: End-to-end lineage from source to dashboard. When a dashboard shows wrong numbers, trace back through the lineage to find where the data went wrong.

See our monitoring system design and system design interview guide.

Follow-up questions:

  • How would you handle a pipeline that occasionally fails due to source API rate limits?
  • What is the difference between task-centric (Airflow) and asset-centric (Dagster) orchestration?
  • How do you handle backfilling when a pipeline logic change requires reprocessing historical data?

11. How do you handle PII and data privacy in data pipelines?

What the interviewer is really asking: Do you consider compliance (GDPR, CCPA, HIPAA) as part of pipeline design, or as an afterthought?

Answer framework:

Data privacy must be built into pipeline architecture from the start, not bolted on later.

Classification: Catalog all data fields by sensitivity level. PII (names, emails, phone numbers), sensitive PII (SSN, financial data), and non-sensitive data. Use automated scanning tools (AWS Macie, Google DLP) to discover PII in data lakes.

Anonymization techniques (a short hashing and generalization sketch follows the list):

  • Hashing: One-way hash (SHA-256) for pseudonymization. Allows joining across datasets without exposing the original value. Add salt to prevent rainbow table attacks.
  • Tokenization: Replace PII with a random token, storing the mapping in a separate, tightly controlled vault. Reversible when needed (customer support).
  • Generalization: Replace specific values with ranges (age: 25 → age_group: 20-30, zip: 10001 → region: NYC metro).
  • Differential privacy: Add calibrated noise to aggregate statistics to prevent individual identification while preserving statistical utility.
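
A short sketch of pseudonymization and generalization in Python; it uses a keyed hash (HMAC-SHA-256), so the key plays the role of the salt, and the key value shown is a placeholder that belongs in a secrets manager.

```python
import hashlib
import hmac

# Placeholder key; in production, load it from a secrets manager, never store it with the data.
PSEUDONYMIZATION_KEY = b"replace-with-secret-from-your-vault"

def pseudonymize(value: str) -> str:
    """Keyed one-way hash: same input always maps to the same token, so joins still work."""
    return hmac.new(PSEUDONYMIZATION_KEY, value.lower().encode("utf-8"),
                    hashlib.sha256).hexdigest()

def generalize_age(age: int) -> str:
    """Generalization: replace an exact age with a 10-year bucket."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

print(pseudonymize("Jane.Doe@example.com"))
print(generalize_age(25))  # "20-29"
```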

Access control: Column-level access policies in the warehouse (BigQuery column-level security, Snowflake column masking). Analysts see masked values; only authorized roles see raw PII. Row-level security filters data by tenant/region.

Right to erasure (GDPR Article 17): Design pipelines that can delete a specific user's data across all systems. This is extremely difficult in data lakes (immutable files). Strategies: use Delta Lake / Iceberg for mutable data lake files, maintain a deletion log, run periodic deletion jobs.

Retention policies: Automated data lifecycle management. Raw data deleted after 90 days, aggregated data retained for 2 years, PII removed from cold storage after the retention period.

See our cryptography concepts and healthcare tech considerations.

Follow-up questions:

  • How would you handle a GDPR deletion request that affects data already processed into aggregate tables?
  • What is the difference between anonymization and pseudonymization?
  • How do you audit who accessed PII in your data warehouse?

12. Explain the concept of data mesh and when it makes sense.

What the interviewer is really asking: Do you understand organizational patterns for data, or just technical patterns? Data mesh is as much about team structure as technology.

Answer framework:

Data mesh (coined by Zhamak Dehghani) proposes four principles for decentralizing data ownership:

  1. Domain-oriented ownership: Each business domain (orders, payments, inventory) owns its data end-to-end, including the analytical data products derived from it. No central data team bottleneck.

  2. Data as a product: Domain teams treat their data outputs as products — with SLAs, documentation, discoverability, and quality guarantees. A data product is a well-defined, trustworthy dataset that other teams can consume.

  3. Self-serve data platform: A central platform team provides the infrastructure (storage, processing, governance tools) as a platform. Domain teams use this platform to build and serve their data products without needing deep infrastructure expertise.

  4. Federated computational governance: Global policies (naming conventions, access control, data quality standards) are defined centrally but enforced computationally through the platform. This ensures interoperability across data products.

When data mesh makes sense: large organizations (100+ engineers) where a centralized data team is a bottleneck, companies with clearly defined business domains, and organizations that have already invested in platform engineering.

When it doesn't make sense: small teams (under 50 engineers), early-stage companies where domains are not well-defined, organizations without strong engineering culture in domain teams, or when a centralized team can handle the workload without bottlenecks.

Common implementation mistakes: treating data mesh as a technology choice (it's an organizational pattern), decentralizing without providing the self-serve platform (domain teams flounder), and not investing in governance (data products become incompatible).

See our microservices architecture concepts and system design interview guide.

Follow-up questions:

  • How do you maintain data quality standards across decentralized teams?
  • What tools and platforms support a data mesh architecture?
  • How does data mesh interact with regulatory requirements like GDPR?

13. How do you handle schema evolution in data pipelines?

What the interviewer is really asking: Schema changes are inevitable. Do you have a strategy that prevents pipeline breakage?

Answer framework:

Schema evolution is the process of modifying data schemas while maintaining backward (or forward) compatibility with existing consumers.

Schema registry: For streaming pipelines (Kafka), use a schema registry (Confluent Schema Registry) that enforces compatibility rules. Four compatibility modes:

  • Backward compatible: New schema can read old data. Safe for consumers to upgrade first. (Adding optional fields, removing fields consumers don't use.)
  • Forward compatible: Old schema can read new data. Safe for producers to upgrade first. (Adding fields, removing optional fields.)
  • Full compatibility: Both backward and forward. (Only adding or removing optional fields.)
  • None: No compatibility checks. Dangerous in production.

Data lake format evolution: Apache Iceberg, Delta Lake, and Hudi support schema evolution natively. You can add columns, rename columns, widen types (int → long), and reorder columns without rewriting data files. Iceberg tracks schema changes in metadata, and historical data is read using the schema version at query time.

Warehouse evolution: dbt handles schema changes through model definitions. When a source adds a column, update the staging model. When a column is renamed, update the staging model and let downstream models adapt through the DAG. Use on_schema_change: "sync_all_columns" for incremental models.

Pipeline resilience patterns (a combined sketch follows the list):

  • Lenient parsing: Ignore unknown fields rather than failing on unexpected columns.
  • Default values: When a required field is missing, use a documented default rather than failing.
  • Schema validation: Validate schemas at the boundary (when data enters your system) and reject invalid records to a dead-letter queue for investigation.
  • Contract testing: Producers run schema compatibility tests in CI before deploying changes.
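
A combined sketch of lenient parsing, documented defaults, and dead-letter routing; field names and defaults are illustrative.

```python
from typing import Any

REQUIRED = {"event_id", "user_id", "event_type"}
DEFAULTS = {"country": "unknown"}                  # documented default for a missing optional field
KNOWN_FIELDS = REQUIRED | set(DEFAULTS) | {"amount"}

dead_letter_queue: list[dict[str, Any]] = []       # stand-in for a real dead-letter topic

def parse_event(raw: dict[str, Any]) -> dict[str, Any] | None:
    """Lenient parsing: ignore unknown fields, apply defaults, dead-letter bad records."""
    if not REQUIRED.issubset(raw):
        dead_letter_queue.append(raw)              # reject at the boundary for later investigation
        return None
    event = {k: v for k, v in raw.items() if k in KNOWN_FIELDS}  # drop unknown fields
    for field, default in DEFAULTS.items():
        event.setdefault(field, default)
    return event

ok = parse_event({"event_id": "e1", "user_id": "u1", "event_type": "click", "new_col": 1})
bad = parse_event({"event_id": "e2"})              # missing required fields, goes to the DLQ
```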

See our backend development crash course and distributed systems guide.

Follow-up questions:

  • How would you handle a breaking schema change (column type change) in a production pipeline?
  • What is the difference between schema evolution support in Iceberg vs Delta Lake?
  • How do you communicate schema changes to downstream consumers?

14. How do you optimize data pipeline costs?

What the interviewer is really asking: Do you think about cost as a design constraint? Can you balance performance, reliability, and cost?

Answer framework:

Data infrastructure is often one of the largest cloud expenses. Optimization spans compute, storage, and query costs.

Compute optimization:

  • Right-size Spark clusters (autoscaling, spot instances for fault-tolerant workloads — 60-80% savings)
  • Use incremental processing instead of full refreshes (process only changed data)
  • Schedule non-critical pipelines during off-peak hours (cheaper compute in some platforms)
  • Use serverless options (BigQuery, Athena) for sporadic workloads instead of always-on clusters

Storage optimization:

  • Use columnar formats (Parquet, ORC) for 50-90% compression vs CSV/JSON
  • Implement tiered storage (hot/warm/cold) with lifecycle policies
  • Remove unused tables and intermediate datasets
  • Compact small files (the "small file problem" in data lakes wastes storage and degrades query performance)
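
The small-file compaction mentioned in the last bullet can be sketched with Spark by reading a fragmented partition and rewriting it as a few right-sized files; the paths and file count are illustrative, and table formats like Delta (OPTIMIZE) or Iceberg (rewrite_data_files) do this transactionally.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

source_path = "s3://my-bucket/lake/events/event_date=2026-04-18/"               # fragmented partition
compacted_path = "s3://my-bucket/lake/events_compacted/event_date=2026-04-18/"  # rewrite target

# Rewrite thousands of tiny files as a handful of large ones (aim for roughly 128MB-1GB per file).
df = spark.read.parquet(source_path)
df.coalesce(8).write.mode("overwrite").parquet(compacted_path)
# Swap the compacted partition into place afterwards, or rely on your table format's
# compaction command to avoid the manual swap.
```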

Query optimization:

  • Partition and cluster tables appropriately (reduce data scanned per query)
  • Materialize commonly queried aggregations (pre-compute instead of re-computing)
  • Monitor query costs (BigQuery slot usage, Snowflake credit consumption) and set alerts for expensive queries
  • Use query result caching and avoid unnecessary full table scans

Organizational practices:

  • Implement chargeback/showback — attribute costs to teams so they're motivated to optimize
  • Set up cost anomaly detection — alert when daily spend exceeds historical patterns
  • Review and decommission unused pipelines, datasets, and dashboards quarterly
  • Use reserved capacity (Snowflake pre-purchased credits, BigQuery flat-rate pricing) for predictable workloads

See our system design interview guide and learning paths for cost-aware architecture patterns.

Follow-up questions:

  • How would you reduce a Snowflake bill that's 3x over budget?
  • What is the small file problem in data lakes, and how do you solve it?
  • How do you balance cost optimization with data freshness requirements?

15. Describe how you would build a feature store for machine learning.

What the interviewer is really asking: Do you understand the intersection of data engineering and ML engineering? Can you design infrastructure that serves both batch training and real-time inference?

Answer framework:

A feature store is a centralized system for managing, serving, and sharing ML features. It solves the problem of training-serving skew (features computed differently for training vs inference) and feature duplication (multiple teams computing the same features).

Architecture:

  • Feature registry: Metadata catalog of all features — name, description, owner, data type, freshness SLA, and lineage (how it's computed).
  • Offline store: Stores historical feature values for model training. Typically backed by a data warehouse or data lake (Parquet files on S3). Supports point-in-time joins (retrieve feature values as they were at a specific timestamp, preventing data leakage).
  • Online store: Stores the latest feature values for real-time inference. Backed by a low-latency store (Redis, DynamoDB, Bigtable). Updated by streaming pipelines or periodic batch jobs.
  • Feature computation: Batch features computed by Spark/dbt (daily user aggregates, 7-day rolling averages). Streaming features computed by Flink/Spark Streaming (real-time click counts, session features).

Key capabilities:

  • Point-in-time correctness: When training a model, features must reflect what was known at prediction time, not future data. The offline store supports time-travel queries.
  • Feature sharing: Teams discover and reuse existing features instead of recomputing. This reduces compute costs and ensures consistency.
  • Monitoring: Track feature freshness (is the online store up to date?), feature drift (has the distribution changed?), and serving latency (are features served within SLA?).

Tools: Feast (open-source, Kubernetes-native), Tecton (managed, real-time focus), Databricks Feature Store (integrated with Delta Lake), and SageMaker Feature Store (AWS-native).
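
The point-in-time correctness requirement above can be illustrated with pandas.merge_asof, which performs exactly this kind of as-of join; the feature and label tables here are toy data.

```python
import pandas as pd

# Historical feature values (offline store) and training labels, both event-timed.
features = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "feature_ts": pd.to_datetime(["2026-04-01", "2026-04-10", "2026-04-05"]),
    "orders_7d": [2, 5, 1],
})
labels = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "label_ts": pd.to_datetime(["2026-04-12", "2026-04-06"]),
    "churned": [0, 1],
})

# Point-in-time join: take the latest feature value at or before each label time,
# so training never sees information from the future (no data leakage).
training_set = pd.merge_asof(
    labels.sort_values("label_ts"),
    features.sort_values("feature_ts"),
    left_on="label_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
```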

See our ML pipeline design, how recommendation engines work, and AI engineering guide.

Follow-up questions:

  • How do you handle feature backfilling when a new feature is added?
  • What is training-serving skew and how does a feature store prevent it?
  • How would you handle features that require real-time computation with sub-10ms latency?

Common Mistakes in Data Engineering Interviews

  1. Jumping to tools without understanding requirements — Suggesting Spark or Kafka without analyzing the data volume, latency requirements, and team capabilities shows tool-first thinking.

  2. Ignoring data quality — Building pipelines without discussing testing, validation, and monitoring reveals a lack of production experience.

  3. Not considering cost — Designing architectures without considering compute and storage costs shows a gap in senior-level thinking.

  4. Treating batch and stream as separate problems — Modern architectures often need both. Discuss how they complement each other.

  5. Forgetting about schema evolution — Data formats change. Not discussing schema management and backward compatibility is a red flag.

  6. Overlooking data governance and privacy — PII handling, access control, and compliance are first-class concerns, not afterthoughts.

How to Prepare

Week 1: Set up a local data pipeline with Kafka, Spark, and PostgreSQL. Ingest sample events, transform them, and load into a warehouse.

Week 2: Implement data quality checks with Great Expectations or dbt tests. Build an Airflow DAG with error handling and SLA monitoring.

Week 3: Study real-world data architectures from companies like Airbnb (data mesh), Netflix (real-time processing), and Spotify (event-driven pipelines).

Week 4: Practice end-to-end system design for data platforms, including cost estimation, technology selection, and operational monitoring.

For comprehensive preparation, see our system design interview guide and explore the learning paths. Check out our pricing plans for full access.


Master this topic in our 12-week cohort

Our Advanced System Design cohort covers this and 11 other deep-dive topics with live sessions, assignments, and expert feedback.