10 Popular Interview Questions for Mid-Level Databricks Data Engineer Roles ($120-150k)
Where "I Know The Syntax" Stops Working
🎯 1: Your pipeline writes 10,000 small files daily. Why is this a problem and how do you fix it?
⚠️ The Junior Answer (Not Hired):
"Small files are slower to read. I'd combine them somehow."
Why this fails: Doesn't explain the mechanism or show practical solutions.
✅ The Mid-Level Answer (Hired):
"Small files create three problems:
1) Metadata overhead - listing 10,000 files is slow, especially on cloud storage,
2) Inefficient reads - each file means separate I/O operations and task scheduling,
3) Poor parallelism - many files smaller than a partition target.
Solutions: run OPTIMIZE regularly to compact files, enable auto-compaction in Delta (delta.autoOptimize.autoCompact), or repartition/coalesce before writing.
For streaming, increase trigger intervals to batch more data per write. Target file size should be 100MB-1GB."
Key concepts to mention:
• Cloud storage metadata overhead (list operations)
• One file = one task minimum overhead
• OPTIMIZE for compaction
• delta.autoOptimize.autoCompact setting
• Target file size: 100MB-1GB
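The fixes above can be sketched in Databricks SQL. This is a minimal example assuming a Delta table named `events` (a hypothetical name):

```sql
-- Compact existing small files into larger, target-sized files
OPTIMIZE events;

-- Opt future writes into auto-compaction and optimized writes
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
);
```

In practice you'd schedule OPTIMIZE as a periodic maintenance job rather than running it after every write.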
💡 What Interviewers Are Testing:
Do you understand production data problems? Shows you've run real pipelines, not just tutorials.
🎯 2: When would you use Delta Live Tables vs a regular notebook pipeline?
⚠️ The Junior Answer (Not Hired):
"DLT is the newer way to build pipelines in Databricks."
Why this fails: Doesn't explain trade-offs or decision criteria.
✅ The Mid-Level Answer (Hired):
"DLT is ideal when:
1) You need built-in data quality checks via expectations,
2) The pipeline is mostly transformations following medallion pattern,
3) You want automatic dependency management between tables.
Regular notebooks are better when:
1) You need fine-grained control over execution,
2) Complex branching logic or external API calls,
3) Integration with non-Delta targets. DLT abstracts away orchestration and incremental processing - great for standardization, but less flexible for edge cases."
Key concepts to mention:
• DLT expectations for data quality
• Automatic dependency resolution
• Incremental processing handled automatically
• Trade-off: abstraction vs control
• Streaming and batch in same DLT pipeline
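A sketch of what a DLT expectation looks like in SQL, assuming hypothetical names (`raw_orders` source, `order_id` column):

```sql
-- DLT silver table: the dependency on LIVE.raw_orders is resolved automatically
CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
  -- Drop rows failing the check; EXPECT alone warns, FAIL UPDATE stops the pipeline
  CONSTRAINT valid_order EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.raw_orders);
```

The ON VIOLATION clause is where the abstraction-vs-control trade-off shows up: you pick a policy per constraint instead of writing your own quarantine logic.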
💡 What Interviewers Are Testing:
Can you make architecture decisions with trade-offs? Shows you think beyond "what tool exists."
🎯 3: Explain MERGE INTO. When would you use it vs INSERT OVERWRITE?
⚠️ The Junior Answer (Not Hired):
"MERGE does upserts - updates existing rows or inserts new ones."
Why this fails: Correct but doesn't explain performance implications or alternatives.
✅ The Mid-Level Answer (Hired):
"MERGE INTO matches source rows against target based on a condition, then updates/inserts/deletes.
Use it for: CDC (Change Data Capture), SCD Type 2 history tracking, or when source has mixed updates and inserts.
INSERT OVERWRITE replaces entire partitions - faster when you're recomputing complete partitions anyway (daily reload of a partition).
MERGE scans the target table to find matches - for large tables, this is expensive. If I'm always inserting new data with no updates, plain INSERT is faster.
Key optimization: partition the target table and include partition columns in the MERGE condition."
Key concepts to mention:
• MERGE scans target for matching keys
• Partition pruning in MERGE conditions
• INSERT OVERWRITE for full partition replacement
• CDC and SCD Type 2 use cases
• Performance: MERGE on large tables is expensive
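The partition-pruning optimization above can be sketched as follows; `target`, `updates`, `customer_id`, and `event_date` are hypothetical names:

```sql
-- Upsert CDC rows; the partition column in the match condition
-- lets Delta prune partitions instead of scanning the whole target
MERGE INTO target t
USING updates s
  ON t.customer_id = s.customer_id
 AND t.event_date = s.event_date   -- partition column included for pruning
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Without the `event_date` predicate, the match condition alone forces a scan of every partition of the target.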
💡 What Interviewers Are Testing:
Do you understand write pattern trade-offs? Shows you can choose the right pattern for the workload.
🎯 4: How do you handle duplicate records in your pipeline?
⚠️ The Junior Answer (Not Hired):
"Use distinct() or dropDuplicates()."
Why this fails: Works for one-time dedup, but doesn't address incremental pipelines.
✅ The Mid-Level Answer (Hired):
"Depends on where and when duplicates appear:
1) Source duplicates in batch data: dropDuplicates() with specific columns, or window functions with row_number() to pick the latest.
2) Duplicates from pipeline reruns: MERGE with match condition on business keys - duplicates become no-ops.
3) Streaming duplicates: dropDuplicatesWithinWatermark() with event-time deduplication.
4) Prevention: Delta doesn't enforce uniqueness (primary key constraints are informational), so rely on idempotent MERGE writes and catch violations with DLT expectations or custom checks. I'd identify WHY duplicates exist (source issue? reprocessing?) and fix the root cause, not just filter them."
Key concepts to mention:
• dropDuplicates() with specific columns
• Window functions for keeping latest
• MERGE for idempotent writes
• Streaming watermark-based dedup
• Root cause analysis vs symptom treatment
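The window-function pattern for keeping the latest record can be sketched like this, with hypothetical table and column names (`raw_orders`, `order_id`, `updated_at`):

```sql
-- Keep only the most recent row per business key
WITH ranked AS (
  SELECT *,
         row_number() OVER (
           PARTITION BY order_id        -- business key
           ORDER BY updated_at DESC     -- latest version wins
         ) AS rn
  FROM raw_orders
)
SELECT * EXCEPT (rn)   -- EXCEPT is Databricks SQL syntax for dropping a column
FROM ranked
WHERE rn = 1;
```

Unlike dropDuplicates(), this gives you deterministic control over which duplicate survives.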
💡 What Interviewers Are Testing:
Can you handle real data quality issues? Shows you've dealt with messy production data.
🎯 5: What's the difference between partitioning and Z-ordering? When do you use each?
⚠️ The Junior Answer (Not Hired):
"Both improve query performance by organizing data."
Why this fails: Too vague - doesn't explain how they work differently.
✅ The Mid-Level Answer (Hired):
"Partitioning physically separates data into directories (table/date=2024-01-01/) - queries with partition filters skip entire directories. Use for: low-cardinality filter columns (date, region) that most queries filter on.
Z-ordering co-locates related data WITHIN files using space-filling curves - enables data skipping via file-level min/max stats. Use for: high-cardinality columns, or columns commonly filtered together.
Key difference: partition pruning is directory-level (free), Z-ordering is file-level (needs stats lookup). I'd partition by date, Z-order by customer_id or product_category."
Key concepts to mention:
• Partition = directory structure
• Z-ordering = data co-location within files
• Partition pruning vs data skipping
• Partition column cardinality limits (~10k values max)
• Z-order after OPTIMIZE
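Both layouts together can be sketched as below, assuming a hypothetical `events` table with `event_date` (low cardinality) and `customer_id` (high cardinality):

```sql
-- Partition by the low-cardinality column at creation time
CREATE TABLE events (
  event_date  DATE,
  customer_id BIGINT,
  payload     STRING
)
PARTITIONED BY (event_date);

-- Z-order the files within each partition on the high-cardinality column
OPTIMIZE events
ZORDER BY (customer_id);
```

Note that ZORDER is a clause of OPTIMIZE, which is why it's listed as something you run "after OPTIMIZE" style maintenance rather than a table property.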
💡 What Interviewers Are Testing:
Do you understand storage optimization strategies? Shows you can design tables for performance.
🎯 6: How do you monitor and alert on pipeline failures?
⚠️ The Junior Answer (Not Hired):
"Check the Databricks job logs when something fails."
Why this fails: Reactive, not proactive - you shouldn't find out from users.
✅ The Mid-Level Answer (Hired):
"Multi-layer monitoring:
1) Job-level: Configure email/Slack alerts on job failure in Databricks Workflows,
2) Data quality: DLT expectations that fail or warn on quality issues, plus custom checks (row counts, null percentages),
3) Freshness: track table update timestamps, alert if data is stale beyond SLA,
4) Metrics: push custom metrics to monitoring tools (Datadog, CloudWatch) for dashboards.
I'd also implement a "heartbeat" table that downstream consumers check - if not updated, they know upstream failed. Log important checkpoints for debugging."
Key concepts to mention:
• Workflow alerts (email, Slack, PagerDuty)
• DLT expectations for data quality
• Data freshness/SLA monitoring
• Custom metrics to external systems
• Heartbeat tables for dependency awareness
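The freshness check above can be sketched against Delta's own commit history - a hedged example assuming a table `my_table` and a 2-hour SLA (both hypothetical); the result would be wired to a Databricks SQL alert:

```sql
-- Staleness check: flag the table if its last commit is older than the SLA
SELECT
  max(timestamp) AS last_write,
  max(timestamp) < current_timestamp() - INTERVAL 2 HOURS AS is_stale
FROM (DESCRIBE HISTORY my_table);
```

Using the transaction log this way avoids scanning the data itself just to find the last write time.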
💡 What Interviewers Are Testing:
Do you build production-ready pipelines? Shows you think about operations, not just development.
🎯 7: Explain VACUUM in Delta Lake. What's the risk if you get it wrong?
⚠️ The Junior Answer (Not Hired):
"VACUUM cleans up old files to save storage."
Why this fails: Doesn't explain retention or the time travel impact.
✅ The Mid-Level Answer (Hired):
"VACUUM removes data files no longer referenced by the transaction log, reclaiming storage. Default retention is 7 days - files older than that are deleted.
The risk: time travel queries and long-running jobs. If a query that started before VACUUM tries to read a file VACUUM has since deleted, it fails.
Setting retention too low (going below 7 days requires spark.databricks.delta.retentionDurationCheck.enabled = false) breaks time travel to versions older than the window.
I'd schedule VACUUM:
1) when no long-running queries or streams are active,
2) after verifying no pipelines need historical versions,
3) with retention aligned to compliance/audit requirements."
Key concepts to mention:
• Removes unreferenced data files
• Default 7-day retention
• Time travel depends on retained files
• Long-running query failure risk
• retentionDurationCheck safety guard
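A safe VACUUM workflow can be sketched as below, with `my_table` as a hypothetical table name; 168 hours is the 7-day default retention expressed explicitly:

```sql
-- Preview which files would be deleted, without removing anything
VACUUM my_table RETAIN 168 HOURS DRY RUN;

-- Actually reclaim storage, keeping 7 days of history for time travel
VACUUM my_table RETAIN 168 HOURS;
```

The DRY RUN pass is the cheap insurance step: it lets you confirm the file list before any history becomes unrecoverable.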
💡 What Interviewers Are Testing:
Do you understand Delta maintenance and risks? Shows you can manage tables safely in production.


