ETL/Data Pipeline Interview Questions for Senior Engineers (2026)
Master advanced ETL and data pipeline interview questions covering pipeline architecture, ELT patterns, data transformation strategies, error handling, and production reliability for senior engineering roles.
Why ETL/Data Pipeline Knowledge Matters in Senior Engineering Interviews
ETL (Extract, Transform, Load) and data pipelines form the circulatory system of modern data-driven organizations. Every business decision powered by data, every ML model prediction, and every analytics dashboard depends on reliable data pipelines moving and transforming data from source systems to consumable formats. At companies like Netflix, Airbnb, and Uber, data pipeline engineering is a core infrastructure function that directly impacts product quality and business outcomes.
Senior engineers are expected to design pipelines that handle terabytes of daily data with strict SLAs, recover gracefully from failures, evolve with changing business requirements, and operate within cost budgets. Interviewers test whether you can think beyond the happy path — handling schema drift, late-arriving data, partial failures, and data quality issues.
This guide covers the most challenging ETL/data pipeline interview questions with frameworks for structured answers. For related concepts, see our ETL pipeline concepts, how Kafka works, and system design interview guide.
1. Explain the difference between ETL and ELT, and when you would choose each.
What the interviewer is really asking: Do you understand modern data architecture patterns and can you choose the right approach based on the tools and requirements?
Answer framework:
ETL (Extract, Transform, Load): Data is extracted from source systems, transformed in a separate processing engine (Spark, custom code), and then loaded into the target system. The transformation happens outside the target database, typically on dedicated compute infrastructure.
When to use ETL: when the target system lacks compute power for transformations (flat files, legacy databases), when transformations require complex logic beyond SQL (ML feature engineering, image processing), when you need to reduce data volume before loading (filter, aggregate, deduplicate), or when source and target systems are on different networks and you want to minimize data transfer.
ELT (Extract, Load, Transform): Data is extracted from sources, loaded raw into the target system (typically a cloud data warehouse like BigQuery, Snowflake, or Redshift), and then transformed inside the target using its native compute engine (SQL).
When to use ELT: when the target is a modern cloud warehouse with elastic compute (Snowflake, BigQuery), when transformations are expressible in SQL, when you want to preserve raw data for audit and reprocessing, when the team is SQL-proficient and uses dbt for transformations. ELT has become the dominant pattern in modern data stacks because cloud warehouses provide virtually unlimited compute and storage separation.
The modern pattern is often "EtLT" — light extraction transforms (schema normalization, deduplication), load into the warehouse, then heavy transformations in SQL. This balances raw data preservation with practical data quality at ingestion.
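A minimal sketch of the EtLT idea, assuming a hypothetical extract_orders() source, a DB-API style warehouse_cursor, and placeholder table names: light transforms (schema normalization, deduplication) happen in Python before loading, and the heavy aggregation stays in SQL inside the warehouse.

```python
import hashlib
import json

def light_transform(records):
    """Light pre-load transforms: normalize field names and drop exact duplicates."""
    seen = set()
    for rec in records:
        normalized = {k.lower().strip(): v for k, v in rec.items()}   # schema normalization
        digest = hashlib.sha256(
            json.dumps(normalized, sort_keys=True, default=str).encode()
        ).hexdigest()
        if digest not in seen:                                        # dedupe on content hash
            seen.add(digest)
            yield normalized

def run(extract_orders, warehouse_cursor):
    # "E + light t + L": load cleaned records into a staging table in the warehouse.
    for rec in light_transform(extract_orders()):
        warehouse_cursor.execute(
            "INSERT INTO staging_orders (order_id, order_date, amount) VALUES (%s, %s, %s)",
            (rec["order_id"], rec["order_date"], rec["amount"]),
        )
    # Heavy "T" stays inside the warehouse as SQL, where elastic compute does the work.
    warehouse_cursor.execute(
        """
        INSERT INTO daily_revenue (order_date, revenue)
        SELECT order_date, SUM(amount)
        FROM staging_orders
        GROUP BY order_date
        """
    )
```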
See our how data warehouses work and data engineering interview guide for deeper context.
Follow-up questions:
- How does the choice between ETL and ELT affect data pipeline testing strategies?
- What are the cost implications of ELT in a consumption-based pricing model like BigQuery?
- How would you handle a transformation that's too complex for SQL in an ELT pipeline?
2. How do you design an idempotent data pipeline?
What the interviewer is really asking: Can you build pipelines that are safe to retry and rerun without creating duplicates or data corruption? This is a fundamental production requirement.
Answer framework:
Idempotency means running the same pipeline multiple times with the same input produces the same result — no duplicates, no missed records, no side effects. This is critical because pipelines fail and must be retried.
Strategies for idempotent pipelines:
Partition overwrite: Write outputs to date-partitioned targets. Re-running the pipeline for a specific date completely replaces that partition's data. This is inherently idempotent — the output is always the same regardless of how many times you run it. Example: INSERT OVERWRITE TABLE target PARTITION (date='2026-04-19') SELECT ....
Upsert/MERGE: Use MERGE (SQL) or upsert operations that insert new records and update existing ones based on a natural key. Even if the pipeline runs twice, the second run updates to the same values. Example: MERGE INTO target USING source ON target.id = source.id WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ....
Deduplication: Include deduplication logic in the pipeline. Assign each source event a unique ID (event_id, or hash of content). Use window functions or GROUP BY to keep only the latest version per ID.
Transactional writes: Use atomic write operations — either all records are written or none are. Delta Lake, Iceberg, and Hudi support atomic commits. If a pipeline fails mid-write, the partial write is rolled back.
Checkpointing: For streaming pipelines, checkpoint consumer offsets after successful processing. On restart, resume from the last checkpoint. Kafka's consumer group offsets and Flink's checkpointing mechanism provide this.
Common idempotency mistakes: using INSERT without deduplication (re-runs create duplicates), using auto-increment IDs as the deduplication key (each run generates new IDs), and relying on processing timestamps instead of event timestamps for partitioning.
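To make the partition-overwrite strategy concrete, here is a minimal PySpark sketch, assuming Spark 3.x dynamic partition overwrite and hypothetical event_id, event_date columns and paths; rerunning the job for the same date always produces the same partition contents.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("idempotent-daily-load").getOrCreate()

def load_partition(run_date: str, source_path: str, target_path: str) -> None:
    """Rebuild exactly one date partition; safe to rerun any number of times."""
    df = (
        spark.read.parquet(source_path)
             .filter(F.col("event_date") == run_date)    # one logical date per run
             .dropDuplicates(["event_id"])                # dedupe on the natural key
    )
    (
        df.write
          .mode("overwrite")
          .option("partitionOverwriteMode", "dynamic")    # replace only the partitions present in df
          .partitionBy("event_date")
          .parquet(target_path)
    )
```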
See our distributed systems guide and backend development crash course.
Follow-up questions:
- How would you make a pipeline that calls an external API idempotent?
- What is the difference between idempotent and exactly-once processing?
- How do you handle idempotency when the pipeline writes to multiple targets?
3. How do you handle errors and failures in production data pipelines?
What the interviewer is really asking: Do you design for failure, or do you only handle the happy path? Production pipelines fail regularly.
Answer framework:
Error handling in data pipelines operates at multiple levels:
Record-level errors: Individual records that fail validation or transformation. Never fail the entire pipeline for one bad record. Route invalid records to a dead-letter queue (DLQ) or error table with the error reason, timestamp, and original data. Process errors separately — alert if error rate exceeds a threshold (e.g., >1% of records), investigate and fix, then replay from the DLQ.
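A minimal sketch of record-level error routing, assuming a hypothetical transform() function and generic write_output/write_dlq/alert callables; bad records are captured with context rather than failing the run.

```python
import datetime
import json

ERROR_RATE_THRESHOLD = 0.01  # alert if more than 1% of records fail

def process_batch(records, transform, write_output, write_dlq, alert):
    good, bad = [], []
    for record in records:
        try:
            good.append(transform(record))
        except Exception as exc:                          # never fail the whole batch for one record
            bad.append({
                "error": str(exc),
                "failed_at": datetime.datetime.utcnow().isoformat(),
                "original_record": json.dumps(record, default=str),
            })
    write_output(good)
    write_dlq(bad)                                        # keep bad records replayable with their error reason
    if records and len(bad) / len(records) > ERROR_RATE_THRESHOLD:
        alert(f"DLQ rate {len(bad) / len(records):.2%} exceeds threshold")
```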
Task-level errors: A pipeline step fails (API timeout, out of memory, temporary network issue). Implement automatic retry with exponential backoff. Set a maximum retry count (3-5 attempts). After max retries, alert and pause downstream tasks (don't propagate partial data).
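And a sketch of task-level retries with exponential backoff and a capped attempt count; the task argument is a placeholder for whatever step is flaky (an API call, a network read).

```python
import random
import time

def with_retries(task, max_attempts=4, base_delay=2.0):
    """Run a flaky task with exponential backoff; re-raise after the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise                                     # let the orchestrator mark the run failed and alert
            sleep_for = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)  # backoff plus jitter
            time.sleep(sleep_for)
```

In Airflow, retries are usually configured declaratively on the task (retries, retry_delay, retry_exponential_backoff) rather than hand-rolled like this.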
Pipeline-level errors: The entire pipeline fails (source system down, schema change, credentials expired). Airflow/orchestrator marks the run as failed and sends alerts. The pipeline must be designed for manual re-runs — idempotent operations ensure safe retry.
Partial failure handling: In a multi-step pipeline (A → B → C), if step B fails after processing 50% of data, you need a strategy. Options: reprocess from the beginning (safe with idempotent writes, but slow), resume from checkpoint (faster, but requires checkpointing infrastructure), or use transactional writes (atomic commit only on full success).
Monitoring and alerting:
- Data freshness alerts: target table hasn't been updated in X hours
- Volume anomaly alerts: row count is significantly different from the expected range
- Error rate alerts: percentage of records in DLQ exceeds threshold
- Duration alerts: pipeline taking significantly longer than usual (may indicate upstream issues)
Circuit breaker pattern: if a source system is consistently failing, stop retrying and alert rather than overwhelming it with retry requests. Resume automatically when the source recovers.
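A minimal circuit breaker sketch, assuming a hypothetical fetch callable; after repeated failures the breaker opens and the pipeline stops hammering the source until a cooldown passes.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=300):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: skipping call to failing source")
            self.opened_at = None                  # cooldown elapsed, allow a trial call (half-open)
        try:
            result = fn()
            self.failure_count = 0                 # success resets the breaker
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()       # open the circuit; alerting hooks in here
            raise
```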
See our monitoring system design and system design interview guide.
Follow-up questions:
- How do you decide between failing fast vs. continuing with partial data?
- How would you implement a dead-letter queue for a Spark pipeline?
- What is the circuit breaker pattern and how does it apply to data pipelines?
4. Design a real-time data pipeline for fraud detection.
What the interviewer is really asking: Can you design an end-to-end streaming pipeline with strict latency requirements, enrichment, and ML inference?
Answer framework:
Fraud detection requires processing transactions in real-time (sub-second latency) with enriched features and ML model inference.
Architecture:
1. Event ingestion: Payment events published to Kafka from the payment service. Each event contains transaction ID, user ID, amount, merchant, timestamp, device info, and IP address.
2. Stream enrichment (Flink): Enrich each event with:
   - User profile (from Redis/feature store): account age, historical transaction patterns, risk score
   - Merchant profile: category, fraud rate, location
   - Velocity features (computed in Flink): transaction count in the last 5 minutes, 1 hour, and 24 hours for this user; amount velocity; unique merchant count
3. Rule engine: Apply deterministic rules first (blocklisted merchants, amount exceeds user limit, geographic impossibility). Rules provide instant decisions without ML latency.
4. ML inference: For transactions passing the rules, invoke the ML model (served via a model serving platform like TensorFlow Serving or SageMaker endpoints). The model scores the transaction risk based on the enriched features.
5. Decision and action: Based on the rule results and ML score, the pipeline outputs one of: approve (score below threshold), review (score in the middle range, queued for human review), or block (score above threshold or rule triggered). Decisions are published to a Kafka topic consumed by the payment service.
6. Feedback loop: Human review outcomes and confirmed fraud cases are fed back as training data for model retraining.
Latency budget: Total end-to-end latency must stay under 500ms. A rough allocation: Kafka ingestion 10ms, enrichment lookups in Redis 5ms, stream enrichment 50ms, rule engine 10ms, ML inference 100ms, decision publishing 10ms, with the remainder kept as headroom for network hops, queuing, and retries.
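To make the velocity features in step 2 of the architecture concrete, here is a toy sliding-window counter per user; a real pipeline would keep this state in Flink keyed state or a feature store rather than in process memory, and all names here are illustrative.

```python
import time
from collections import defaultdict, deque

WINDOWS = {"5m": 300, "1h": 3600, "24h": 86400}

class VelocityFeatures:
    """Toy per-user transaction counters over sliding time windows."""

    def __init__(self):
        self.events = defaultdict(deque)   # user_id -> deque of (timestamp, amount)

    def update_and_get(self, user_id, amount, ts=None):
        ts = ts or time.time()
        q = self.events[user_id]
        q.append((ts, amount))
        while q and q[0][0] < ts - WINDOWS["24h"]:      # evict events older than the largest window
            q.popleft()
        features = {}
        for name, size in WINDOWS.items():
            recent = [(t, a) for t, a in q if t >= ts - size]
            features[f"txn_count_{name}"] = len(recent)
            features[f"amount_sum_{name}"] = sum(a for _, a in recent)
        return features
```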
See our fraud detection system design, stream processing concepts, and ML pipeline design.
Follow-up questions:
- How would you handle ML model updates without pipeline downtime?
- What happens if the enrichment data store (Redis) is temporarily unavailable?
- How do you evaluate the fraud model's performance in production?
5. How do you handle backfilling and reprocessing in data pipelines?
What the interviewer is really asking: Do you design pipelines that can be rerun for historical data? This is essential for bug fixes, new feature rollouts, and data recovery.
Answer framework:
Backfilling is reprocessing historical data through a pipeline, typically needed when: a bug is discovered and fixed (reprocess affected dates), a new derived table or feature is added (compute for all historical data), source data is corrected retroactively, or a pipeline logic change requires recomputing historical outputs.
Design principles for backfill-friendly pipelines:
Parameterized date ranges: Every pipeline accepts start_date and end_date parameters. Normal runs process yesterday. Backfills specify a custom range. The pipeline reads source data filtered by the date range and writes to the corresponding output partitions.
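A minimal sketch of a date-parameterized runner, assuming a hypothetical process_date() that reads one day of source data and overwrites the matching output partition; a normal run defaults to yesterday, a backfill passes an arbitrary range.

```python
import argparse
from datetime import date, datetime, timedelta

def run_range(start: date, end: date, process_date) -> None:
    """Run the same idempotent daily job for every date in [start, end]."""
    current = start
    while current <= end:
        process_date(current)       # reads source filtered to `current`, overwrites that output partition
        current += timedelta(days=1)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--start-date", type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
                        default=date.today() - timedelta(days=1))
    parser.add_argument("--end-date", type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
                        default=date.today() - timedelta(days=1))
    args = parser.parse_args()
    run_range(args.start_date, args.end_date, process_date=print)  # swap `print` for the real job
```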
Idempotent writes: Backfills overwrite the same partitions as normal runs. Re-running for any date range is always safe.
Incremental vs full refresh: Incremental pipelines (append-only, merge-based) are efficient for normal runs but tricky for backfills. Full-refresh pipelines (rebuild entire partition) are simpler for backfills but more expensive for normal runs. A good pattern is to support both modes — incremental by default, full-refresh for backfills.
Resource management: Backfilling years of data can overwhelm the system. Implement: throttling (process N days concurrently, not all at once), priority (backfills run at lower priority than production pipelines), and progress tracking (checkpoint which dates have been processed, resume from there on failure).
Airflow backfill: airflow dags backfill --start-date 2025-01-01 --end-date 2025-12-31 --reset-dagruns dag_id. This creates DAG runs for each logical date in the range.
Streaming pipeline backfill: More complex. Options: replay Kafka topics from archived data (if using tiered storage), consume from object storage instead of Kafka for historical data, or maintain a separate batch pipeline for historical processing (Lambda architecture).
See our ETL pipeline concepts and distributed systems guide.
Follow-up questions:
- How would you backfill a pipeline that depends on external API data that's no longer available?
- How do you handle downstream dependencies during a backfill (don't trigger downstream pipelines for backfill runs)?
- What is the operational impact of backfilling a pipeline that feeds real-time dashboards?
6. Compare CDC (Change Data Capture) approaches: log-based vs query-based.
What the interviewer is really asking: Do you understand the trade-offs between different CDC approaches and when each is appropriate?
Answer framework:
Log-based CDC (Debezium, AWS DMS, Fivetran): Reads the database's transaction log (PostgreSQL WAL, MySQL binlog, MongoDB oplog) to capture every change as it happens. The CDC tool connects as a replication client.
Advantages: captures all changes (including intermediate states between queries), minimal impact on source database (reads the log, not the tables), low latency (changes captured within seconds), captures deletes (which query-based CDC often misses).
Disadvantages: requires database configuration (enable logical replication, appropriate WAL level), specific to each database type, schema changes in the source may break the CDC connector, and the tool must maintain its position in the log (if it falls behind and the log is truncated, a full re-snapshot is needed).
Query-based CDC (periodic polling): Queries the source table periodically for changes, typically using an updated_at timestamp or a version column, e.g. SELECT * FROM orders WHERE updated_at > last_run_timestamp.
Advantages: works with any database (just needs SQL access), simple to implement, no database configuration changes needed, works through read replicas.
Disadvantages: cannot capture deletes (deleted rows aren't returned by the query), misses changes between polling intervals (if a row is updated twice between polls, the intermediate state is lost), adds query load to the source database, higher latency (depends on polling interval).
Hybrid approach: Use log-based CDC for critical tables where you need every change and low latency. Use query-based CDC for less critical tables or systems where log-based CDC isn't available. Implement soft deletes (a deleted_at column) to make query-based CDC capable of capturing deletes.
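A minimal sketch of query-based CDC with an overlap window and soft-delete handling, assuming an updated_at column, a deleted_at column, a DB-API cursor, and a small hypothetical state store for the last-run timestamp.

```python
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=15)   # re-read a small window to tolerate clock skew and late commits

def poll_changes(cursor, state_store):
    last_run = state_store.get("orders.last_run", datetime(1970, 1, 1))
    since = last_run - OVERLAP
    cursor.execute(
        """
        SELECT id, status, amount, updated_at, deleted_at
        FROM orders
        WHERE updated_at > %s
        ORDER BY updated_at
        """,
        (since,),
    )
    rows = cursor.fetchall()
    upserts = [r for r in rows if r[-1] is None]        # live rows: deleted_at IS NULL
    deletes = [r for r in rows if r[-1] is not None]    # soft-deleted rows become delete events
    state_store.set("orders.last_run", datetime.utcnow())
    return upserts, deletes
```

Because the overlap re-reads recent rows, the downstream write must be idempotent (a MERGE on the primary key), otherwise each poll introduces duplicates.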
Tool comparison: Debezium (open-source, Kafka-native, log-based), Fivetran (managed, supports both approaches), Airbyte (open-source, both approaches), AWS DMS (managed, log-based for AWS databases).
See our how Kafka works, PostgreSQL interview questions, and backend development crash course.
Follow-up questions:
- How would you handle the initial snapshot when setting up log-based CDC on a large table?
- What happens when the CDC connector goes down for several hours and the source log is truncated?
- How do you handle schema evolution in CDC pipelines?
7. How would you design a data pipeline for processing 100TB of daily log data?
What the interviewer is really asking: Can you design for massive scale with appropriate technology choices and cost awareness?
Answer framework:
100TB/day = ~1.15GB/second sustained throughput. This is a significant data volume requiring careful architecture.
Ingestion: Logs are collected by agents (Fluentd, Filebeat, Vector) on application servers and forwarded to Kafka. With an average log line of 500 bytes and ~200 billion log lines/day, Kafka needs significant cluster capacity. Use multiple Kafka clusters or tiered storage (Kafka with S3-backed cold storage) to manage costs.
Processing: Apache Spark on a managed cluster (EMR, Dataproc) processes log data in hourly or daily batches. Use columnar formats (Parquet) for intermediate storage — 100TB of raw logs compresses to approximately 10-20TB in Parquet.
Key processing steps: parse and structure raw logs (grok patterns, JSON parsing), enrich with metadata (service name, environment, region), filter noise (health checks, debug logs in production), aggregate (hourly/daily metrics), and write to both a data lake (for ad-hoc analysis) and a warehouse (for dashboards).
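A minimal PySpark sketch of these steps, assuming JSON-structured log lines and hypothetical S3 paths and field names; a real pipeline would add grok parsing for unstructured lines and partition the outputs by date and hour.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-batch").getOrCreate()

raw = spark.read.json("s3://logs-raw/dt=2026-04-19/")                  # parse JSON log lines

cleaned = (
    raw.withColumn("timestamp", F.to_timestamp("timestamp"))           # structure the event time
       .filter(F.col("path") != "/healthz")                            # drop health-check noise
       .filter(~((F.col("level") == "DEBUG") & (F.rand() > 0.1)))      # keep ~10% of debug logs
       .withColumn("region", F.coalesce(F.col("region"), F.lit("unknown")))  # enrich/normalize metadata
)

hourly = (
    cleaned.groupBy(F.window("timestamp", "1 hour"), "service", "level")
           .agg(F.count("*").alias("events"))                          # hourly metrics for dashboards
)

cleaned.write.mode("overwrite").parquet("s3://logs-lake/dt=2026-04-19/")     # data lake copy (ad-hoc analysis)
hourly.write.mode("overwrite").parquet("s3://logs-metrics/dt=2026-04-19/")   # aggregates for the warehouse
```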
Storage tiers:
- Hot (0-7 days): Elasticsearch for real-time search and debugging. Only index a subset of fields to control costs.
- Warm (7-30 days): Parquet on S3/GCS. Queryable via Athena/Presto.
- Cold (30-365 days): Compressed Parquet on S3 with Glacier lifecycle policy for rarely accessed data.
- Archive (>1 year): Glacier Deep Archive or equivalent. Compliance retention only.
Cost optimization: At 100TB/day, cost is a primary design constraint. Compress aggressively (zstd compression in Parquet). Use spot instances for Spark processing (60-80% savings). Implement aggressive retention — most log data loses value quickly. Sample high-volume logs (keep 10% of debug logs, 100% of error logs).
See our monitoring system design and system design interview guide.
Follow-up questions:
- How would you handle a 10x spike in log volume during an incident?
- What is the cost breakdown between compute, storage, and query for this pipeline?
- How would you make specific log entries searchable within 60 seconds of generation?
8. How do you test data pipelines?
What the interviewer is really asking: Do you apply software engineering rigor to data pipelines, or are they untested scripts?
Answer framework:
Data pipeline testing should follow a layered approach similar to application testing:
Unit tests: Test individual transformation functions in isolation. Given specific input records, verify the output matches expectations. Use frameworks like pytest (Python), ScalaTest (Spark), or dbt's built-in tests. Mock external dependencies (databases, APIs).
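A minimal pytest sketch, assuming a hypothetical per-record transformation normalize_order() that the pipeline applies.

```python
import pytest

def normalize_order(record: dict) -> dict:
    """Example transformation under test: lowercase currency, amounts in cents."""
    if record["amount"] < 0:
        raise ValueError("amount must be non-negative")
    return {
        "order_id": record["order_id"],
        "currency": record["currency"].lower(),
        "amount_cents": round(record["amount"] * 100),
    }

def test_normalize_order_converts_amount_and_currency():
    result = normalize_order({"order_id": "o-1", "currency": "USD", "amount": 12.5})
    assert result == {"order_id": "o-1", "currency": "usd", "amount_cents": 1250}

def test_normalize_order_rejects_negative_amounts():
    with pytest.raises(ValueError):
        normalize_order({"order_id": "o-2", "currency": "EUR", "amount": -1})
```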
Integration tests: Test the pipeline end-to-end with a local or test database. Seed the source with known test data, run the pipeline, and verify the output. Use Docker Compose to spin up test databases, Kafka, etc.
Contract tests: Verify that source data conforms to expected schemas and business rules. Schema tests (column types, not-null), referential integrity tests (foreign keys exist), and value range tests (amounts are positive, dates are reasonable). Run these before the pipeline processes data.
Data quality tests: Run after pipeline completion. dbt tests for uniqueness, not-null, accepted values, and relationships. Custom SQL assertions for business rules. Volume tests comparing expected vs actual row counts.
Regression tests: After pipeline logic changes, compare new output against previous output for the same input. Flag unexpected differences. Use data comparison tools (datacompy, Great Expectations profiling).
Performance tests: Benchmark pipeline execution time and resource usage with production-scale data samples. Detect performance regressions before deploying to production.
CI/CD for data pipelines: Run unit and integration tests on every PR. Run data quality tests on a development warehouse with sampled production data. Deploy to staging first, run full validation, then promote to production.
See our backend development crash course and system design interview guide.
Follow-up questions:
- How do you generate representative test data for a pipeline that processes sensitive production data?
- What is the difference between testing a batch pipeline and testing a streaming pipeline?
- How do you test idempotency in a data pipeline?
9. How do you implement incremental processing in data pipelines?
What the interviewer is really asking: Can you build efficient pipelines that process only changed data instead of reprocessing everything?
Answer framework:
Incremental processing is essential for pipelines where reprocessing the entire dataset is too expensive or slow. Instead of scanning all historical data, process only records that are new or changed since the last run.
Strategies for identifying changes:
Timestamp-based: Filter source data by updated_at > last_successful_run_timestamp. Simple and works with most databases. Limitations: misses deletes, requires a reliable timestamp column, and clock skew can cause missed records. Always use a small overlap (e.g., last run minus 1 hour) to catch edge cases.
CDC-based: Use Change Data Capture to capture inserts, updates, and deletes from the source. Most reliable method for identifying all changes. See the CDC question above for approaches.
Watermark-based: Maintain a high watermark (the maximum value of a monotonically increasing column like an auto-increment ID or timestamp) from the last successful run. Process records with values above the watermark. Simple and efficient but doesn't capture updates to existing records.
dbt incremental models: dbt supports incremental materialization where the model definition includes an is_incremental() block that filters for new or changed records. The dbt framework handles the merge/insert logic.
Output strategies for incremental loads:
- Append: Simply add new records. Works for immutable event data.
- Merge/Upsert: Match on a key and update existing records, insert new ones. Essential when source records can be updated.
- Partition overwrite: Replace entire partitions (e.g., today's partition). Simple but may reprocess more data than necessary.
Key challenge: exactly-once processing. If the pipeline fails after processing some records but before committing the watermark, a retry might reprocess those records. Solutions: transactional writes (commit data and watermark atomically), or idempotent writes (re-processing the same records produces the same result).
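A minimal sketch of committing data and watermark atomically, assuming a psycopg2-style DB-API connection and a hypothetical watermarks table; if the insert fails, the watermark is not advanced and the next run safely reprocesses the same slice.

```python
def incremental_load(conn, pipeline_name: str) -> None:
    """Process rows above the stored watermark and advance it in the same transaction."""
    with conn:                                            # one transaction: commit or roll back together
        cur = conn.cursor()
        cur.execute("SELECT last_value FROM watermarks WHERE pipeline = %s", (pipeline_name,))
        watermark = cur.fetchone()[0]

        cur.execute(
            """
            INSERT INTO target_orders (id, status, amount, updated_at)
            SELECT id, status, amount, updated_at
            FROM staging_orders
            WHERE updated_at > %s
            """,
            (watermark,),
        )
        cur.execute(
            """
            UPDATE watermarks
            SET last_value = (SELECT COALESCE(MAX(updated_at), %s) FROM staging_orders)
            WHERE pipeline = %s
            """,
            (watermark, pipeline_name),
        )
```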
See our ETL pipeline concepts and distributed systems guide.
Follow-up questions:
- How do you handle late-arriving data in an incremental pipeline?
- What are the limitations of dbt incremental models?
- How would you switch an existing full-refresh pipeline to incremental without data loss?
10. How do you handle data pipeline dependencies across teams?
What the interviewer is really asking: Do you understand the organizational challenges of data pipelines, not just the technical ones?
Answer framework:
Cross-team pipeline dependencies are one of the most common sources of data pipeline failures and organizational friction.
Data contracts: Formalize the interface between producer and consumer teams. A data contract specifies: schema (field names, types, nullability), SLAs (freshness — data available by 6am, completeness — 99.9% of records present), semantic meaning (what does the "revenue" field include? gross or net?), and change management process (how changes are communicated and migrated).
Schema registry: Use a central schema registry (Confluent Schema Registry, AWS Glue Schema Registry) to enforce schema compatibility. Producers register schemas before publishing data. Consumers are guaranteed backward compatibility.
Event-driven dependencies: Instead of polling for data availability, use events. When Team A's pipeline completes, publish a completion event to Kafka or SNS. Team B's pipeline subscribes and starts only when the event arrives. Airflow's ExternalTaskSensor provides this for DAG-level dependencies.
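A minimal Airflow 2.x sketch of the DAG-level dependency, with hypothetical DAG and task IDs; the sensor waits for Team A's run before Team B's transformation starts.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="team_b_revenue_model",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    wait_for_orders = ExternalTaskSensor(
        task_id="wait_for_team_a_orders",
        external_dag_id="team_a_orders_ingest",   # Team A's DAG (waits for the same logical date by default)
        external_task_id="publish_orders",        # the task whose completion we depend on
        mode="reschedule",                        # free the worker slot while waiting
        timeout=6 * 60 * 60,                      # give up and alert after 6 hours
    )

    build_revenue = EmptyOperator(task_id="build_revenue_model")  # placeholder for the real work

    wait_for_orders >> build_revenue
```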
Data catalog and discovery: Use a data catalog (DataHub, Apache Atlas, Amundsen) so teams can discover available datasets, understand their schemas, and see lineage. This reduces duplicate pipelines and ensures teams build on the correct source.
Ownership and escalation: Every dataset must have a clear owner (team, not individual). SLA violations trigger alerts to the owning team. Establish escalation paths for cross-team data issues.
Platform approach: Provide self-serve infrastructure so teams can build their own pipelines without depending on a central data team for every request. The platform provides: ingestion connectors, transformation frameworks, quality checks, monitoring, and documentation tools.
See our system design interview guide and career path from senior to staff engineer.
Follow-up questions:
- How would you handle a situation where an upstream team pushes a breaking schema change without notice?
- What is the difference between data contracts and schema validation?
- How do you handle versioning of datasets consumed by multiple downstream teams?
11. Compare data ingestion tools: Fivetran, Airbyte, and custom connectors.
What the interviewer is really asking: Can you evaluate build vs buy for data ingestion and choose appropriately?
Answer framework:
Fivetran (managed SaaS): Fully managed connectors for 300+ sources. Zero maintenance — Fivetran handles schema changes, API pagination, rate limiting, and error handling. Pricing based on monthly active rows (MAR), which can become expensive at scale. Best for: teams that want to focus on transformation (dbt) not ingestion, standard SaaS sources (Salesforce, Stripe, Google Analytics), and companies where engineering time is more expensive than Fivetran licensing.
Airbyte (open-source, self-hosted or cloud): Open-source alternative to Fivetran with 300+ connectors. Can be self-hosted (free, but operational overhead) or cloud-hosted (paid, managed). Community-contributed connectors vary in quality. Best for: cost-sensitive organizations, teams that need custom connectors, companies requiring data to stay within their infrastructure.
Custom connectors: Build your own ingestion using Python, Spark, or dedicated frameworks. Maximum flexibility but highest development and maintenance cost. Best for: sources without off-the-shelf connectors, complex ingestion logic (custom APIs, file formats, business rules), and extremely high-volume sources where managed tools are too expensive.
Decision framework:
- Standard SaaS sources + small-medium volume → Fivetran (minimize engineering effort)
- Standard sources + cost-sensitive or self-hosting required → Airbyte
- Non-standard sources or extreme volume → Custom connectors
- Mix: use Fivetran/Airbyte for standard sources, custom connectors for special cases
Key evaluation criteria: connector reliability (handling of API changes, rate limits, pagination), schema change handling (automatic vs manual), monitoring and alerting, cost at your data volume, and data security (where does data transit?).
See our ETL pipeline concepts and backend development crash course.
Follow-up questions:
- How would you estimate the cost of Fivetran vs building custom connectors?
- What is the operational overhead of self-hosting Airbyte?
- How do you handle connector failures and data gaps with managed ingestion tools?
12. How do you implement data lineage tracking?
What the interviewer is really asking: Do you understand why lineage matters and how to implement it practically?
Answer framework:
Data lineage tracks the journey of data from source to consumption — which sources feed which transformations, which transformations produce which outputs, and which dashboards/models consume which outputs.
Why lineage matters: root cause analysis (a dashboard shows wrong numbers — trace back to find where the error was introduced), impact analysis (before changing a table, understand all downstream consumers), compliance (prove where a data point came from for regulatory audits), and debugging (understand why a pipeline takes 3 hours — identify the slow path in the DAG).
Implementation approaches:
Metadata-driven (dbt, Airflow): dbt automatically generates lineage from model references (ref() and source() functions). Airflow tracks task dependencies within and across DAGs. These tools provide lineage within their scope but not across tool boundaries.
Instrumentation-based (OpenLineage): OpenLineage is an open standard for lineage metadata. Integrations for Spark, Airflow, dbt, Flink, and others emit lineage events as pipelines run. A lineage backend (Marquez, DataHub) collects and visualizes the end-to-end lineage.
Catalog-driven (DataHub, Amundsen, Atlan): Data catalog platforms crawl metadata from databases, warehouses, BI tools, and pipeline tools to construct lineage graphs. They combine automated discovery with manual annotation.
Practical implementation strategy: start with dbt lineage (low effort, immediate value for warehouse transformations), add OpenLineage integration for Airflow and Spark, feed lineage into a catalog like DataHub for visualization and search. This provides column-level lineage from source databases through transformations to dashboards.
See our monitoring system design and system design interview guide.
Follow-up questions:
- What is column-level lineage, and why is it more valuable than table-level lineage?
- How do you handle lineage for pipelines that include external API calls?
- How does lineage help with GDPR compliance?
13. Design a pipeline for near-real-time data synchronization between microservices.
What the interviewer is really asking: Can you bridge the gap between operational (microservices) and analytical (data platform) data needs with low latency?
Answer framework:
Microservices need eventual consistency between their databases for operational purposes (e.g., the order service needs to know current inventory, the recommendation service needs recent user activity).
Architecture using CDC and Kafka:
1. Source capture: Each microservice's database changes are captured by Debezium CDC connectors, producing events to service-specific Kafka topics (e.g., cdc.orders.order_events, cdc.inventory.stock_levels).
2. Stream processing: Flink or Kafka Streams consumes the CDC events, applies transformations (denormalization, enrichment, filtering), and produces derived events to output topics.
3. Consumer services: Other microservices subscribe to relevant output topics and update their local data stores. This implements the CQRS (Command Query Responsibility Segregation) pattern: each service has a read-optimized local view of the data it needs.
4. Materialized views: For complex cross-service queries (e.g., "orders with customer and product details"), a stream processor joins events from multiple services and materializes the result in a query-optimized store (Elasticsearch for search, Redis for key-value lookups, PostgreSQL for relational queries).
Key design considerations:
- Ordering: Kafka guarantees ordering within a partition. Partition by entity ID to ensure all events for a given entity are processed in order.
- Exactly-once: Use Kafka transactions and idempotent producers to prevent duplicates. Consumer services must handle duplicate delivery (idempotent consumers; see the sketch after this list).
- Consistency: The system is eventually consistent. Design services to handle temporary inconsistency (e.g., an order references a product that hasn't been replicated yet — retry or use a fallback).
- Latency: CDC → Kafka → consumer typically achieves sub-second latency. Monitor end-to-end latency and alert on degradation.
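To illustrate the idempotent-consumer point, here is a minimal sketch that deduplicates on a globally unique event_id before applying an update; the processed-IDs store could be Redis or a local table, shown here as a plain set, and apply_update is a placeholder for the service's own write logic.

```python
def handle_event(event: dict, processed_ids: set, apply_update) -> None:
    """Apply each event at most once, even if Kafka redelivers it."""
    event_id = event["event_id"]          # globally unique ID assigned by the producer
    if event_id in processed_ids:         # duplicate delivery: safe to skip
        return
    apply_update(event)                   # e.g., upsert into the service's local store
    processed_ids.add(event_id)           # record only after the update succeeds
```

If the process crashes between the update and recording the ID, the event is reapplied on restart, which is why apply_update should itself be an upsert.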
See our how Kafka works, microservices concepts, and system design interview guide.
Follow-up questions:
- How would you handle a scenario where a consumer service is down for 2 hours?
- What is the Outbox pattern, and when would you use it instead of CDC?
- How do you test a multi-service data synchronization pipeline?
14. How do you handle data deduplication at scale?
What the interviewer is really asking: Can you implement efficient deduplication for billions of records with appropriate accuracy/performance trade-offs?
Answer framework:
Deduplication strategies depend on whether you need exact or approximate deduplication, and whether you're processing batch or streaming data.
Exact deduplication in batch processing: Use a GROUP BY or window function on the deduplication key (event_id, or a composite key). ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY timestamp DESC) = 1 keeps the latest version. For Spark, use dropDuplicates() on the key columns. This requires shuffling all data by the key, which is expensive for large datasets.
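A minimal PySpark sketch of the window-function approach, assuming columns event_id and event_ts and hypothetical paths; it keeps only the latest version of each event.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("dedup").getOrCreate()

events = spark.read.parquet("s3://events/raw/")

latest_per_event = Window.partitionBy("event_id").orderBy(F.col("event_ts").desc())

deduped = (
    events.withColumn("rn", F.row_number().over(latest_per_event))
          .filter(F.col("rn") == 1)       # keep only the most recent version per event_id
          .drop("rn")
)

deduped.write.mode("overwrite").parquet("s3://events/deduped/")
```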
Exact deduplication in streaming: Maintain a state store of seen event IDs. For each incoming event, check if the ID exists in the state store. Flink and Kafka Streams maintain this state internally. Challenge: the state store grows unboundedly. Solutions: TTL on state entries (assume duplicates arrive within a time window), periodic cleanup, or use a compact Kafka topic as a changelog.
Approximate deduplication: When exact deduplication is too expensive, use probabilistic data structures. Bloom filters provide a space-efficient set membership test with a configurable false positive rate (false negatives are impossible). HyperLogLog estimates cardinality (useful for counting unique events, not filtering).
Cross-system deduplication: The same event may arrive from multiple sources (mobile app retry, webhook retry, CDC replay). Use a globally unique event ID (UUID generated at the source) as the deduplication key across all systems.
Deduplication in data warehouses: MERGE operations naturally deduplicate (update if exists, insert if not). For append-only tables, schedule periodic deduplication jobs that identify and remove duplicates. Delta Lake and Iceberg support MERGE for efficient deduplication.
See our distributed systems guide and data structures for system design.
Follow-up questions:
- How would you deduplicate events across a 30-day window with billions of events?
- What is the memory/storage cost of exact vs approximate deduplication?
- How do you handle deduplication when the deduplication key is not unique (e.g., duplicate records with different IDs)?
15. How do you approach data pipeline migration from legacy systems?
What the interviewer is really asking: Have you dealt with the messy reality of migrating from old systems while maintaining business continuity?
Answer framework:
Pipeline migration is one of the highest-risk projects in data engineering. A phased approach minimizes business disruption:
Phase 1 — Assessment: Document the existing pipeline: inputs, outputs, transformations, schedules, SLAs, dependencies, and consumers. Map every downstream consumer (dashboards, reports, ML models, applications). Identify data quality issues in the current system (migration is an opportunity to fix them).
Phase 2 — Parallel build: Build the new pipeline alongside the old one. Both run simultaneously, producing output to separate targets. The old pipeline remains the source of truth. Do not modify the old pipeline during migration.
Phase 3 — Validation: Compare outputs between old and new pipelines. Automated comparison: row counts, sum/avg of numeric columns, sample record matching. Investigate and resolve every discrepancy. Some differences may be intentional (bug fixes), but all must be documented and approved.
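A minimal sketch of the automated comparison, assuming both pipelines land results in the same warehouse under hypothetical legacy and new_stack schemas and a DB-API cursor; it checks row counts and numeric aggregates per day and flags discrepancies above a tolerance.

```python
COMPARISON_SQL = """
SELECT o.run_date,
       o.row_count AS old_rows, n.row_count AS new_rows,
       o.revenue   AS old_rev,  n.revenue   AS new_rev
FROM (SELECT run_date, COUNT(*) AS row_count, SUM(revenue) AS revenue
      FROM legacy.daily_revenue GROUP BY run_date) o
JOIN (SELECT run_date, COUNT(*) AS row_count, SUM(revenue) AS revenue
      FROM new_stack.daily_revenue GROUP BY run_date) n
  ON o.run_date = n.run_date
"""

def compare_outputs(cursor, tolerance=0.001):
    cursor.execute(COMPARISON_SQL)
    mismatches = []
    for run_date, old_rows, new_rows, old_rev, new_rev in cursor.fetchall():
        if old_rows != new_rows or abs(old_rev - new_rev) > abs(old_rev) * tolerance:
            mismatches.append((run_date, old_rows, new_rows, old_rev, new_rev))
    return mismatches   # every entry must be investigated, documented, and approved
```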
Phase 4 — Shadow mode: Switch consumers to read from both old and new outputs. Consumers compare results and log discrepancies. No business impact — the old output is still the source of truth.
Phase 5 — Cutover: Switch consumers to the new pipeline. Keep the old pipeline running for a rollback window (2-4 weeks). Monitor closely.
Phase 6 — Decommission: After the rollback window, stop the old pipeline. Archive its code and documentation.
Key risks and mitigations: undocumented business logic in the old pipeline (spend time understanding every transformation, talk to consumers), data format differences (implicit type conversions in old systems may not be obvious), timing differences (old pipeline runs at midnight, new pipeline uses different timezone handling), and stakeholder resistance (involve consumers early, demonstrate accuracy).
See our backend development crash course and career path from senior to staff engineer.
Follow-up questions:
- How would you handle a migration where the old system has no documentation?
- What is the rollback strategy if the new pipeline has a critical bug after cutover?
- How do you manage the project timeline when the old system keeps getting new requirements during migration?
Common Mistakes in ETL/Data Pipeline Interviews
1. Designing non-idempotent pipelines: if re-running creates duplicates or misses data, the pipeline is not production-ready. Always discuss idempotency.
2. Ignoring error handling: describing only the happy path signals a lack of production experience. Discuss dead-letter queues, retries, and alerting.
3. Not considering backfill: pipelines that can't be rerun for historical data are brittle. Design for reprocessing from the start.
4. Over-engineering for batch needs: not every pipeline needs Kafka and Flink. Simple scheduled SQL queries are appropriate for many use cases.
5. Ignoring cost: data pipelines can be extremely expensive at scale. Always discuss the cost implications of technology choices.
6. Not testing data pipelines: treating pipelines differently from application code by skipping tests is a red flag at the senior level.
How to Prepare
Week 1: Build a complete ETL pipeline locally using Python, Airflow, and PostgreSQL. Implement error handling, retries, and idempotent writes.
Week 2: Set up a Kafka-based streaming pipeline with Flink or Kafka Streams. Implement CDC with Debezium.
Week 3: Practice dbt transformations in a cloud warehouse. Implement tests, incremental models, and documentation.
Week 4: Design end-to-end data pipeline architectures for different scenarios (real-time analytics, batch reporting, ML feature engineering). Practice articulating trade-offs.
For comprehensive preparation, see our system design interview guide and explore the learning paths. Check our pricing plans for full access.