10 Popular Interview Questions for Mid-Level Databricks Data Engineer Roles ($120-150k)
Where "I Know The Syntax" Stops Working
🎯 1: Your pipeline writes 10,000 small files daily. Why is this a problem and how do you fix it?
⚠️ The Junior Answer (Not Hired):
"Small files are slower to read. I'd combine them somehow."
Why this fails: Doesn't explain the mechanism or show practical solutions.
✅ The Mid-Level Answer (Hired):
"Small files create three problems:
1) Metadata overhead - listing 10,000 files is slow, especially on cloud storage,
2) Inefficient reads - each file means separate I/O operations and task scheduling,
3) Poor parallelism - many files smaller than a partition target.
Solutions: run OPTIMIZE regularly to compact files, enable auto-compaction in Delta (delta.autoOptimize.autoCompact), or repartition/coalesce before writing.
For streaming, increase trigger intervals to batch more data per write. Target file size should be 100MB-1GB."
Key concepts to mention:
• Cloud storage metadata overhead (list operations)
• One file = one task minimum overhead
• OPTIMIZE for compaction
• delta.autoOptimize.autoCompact setting
• Target file size: 100MB-1GB
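The fixes above can be sketched in Databricks SQL. This is a minimal example assuming a Delta table named `events` (a hypothetical name):

```sql
-- Compact existing small files into larger, target-sized files
OPTIMIZE events;

-- Opt future writes into auto-compaction and optimized writes
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
);
```

In practice you'd schedule OPTIMIZE as a periodic maintenance job rather than running it after every write.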
💡 What Interviewers Are Testing:
Do you understand production data problems? Shows you've run real pipelines, not just tutorials.
🎯 2: When would you use Delta Live Tables vs a regular notebook pipeline?
⚠️ The Junior Answer (Not Hired):
"DLT is the newer way to build pipelines in Databricks."
Why this fails: Doesn't explain trade-offs or decision criteria.
✅ The Mid-Level Answer (Hired):
"DLT is ideal when:
1) You need built-in data quality checks via expectations,
2) The pipeline is mostly transformations following medallion pattern,
3) You want automatic dependency management between tables.
Regular notebooks are better when:
1) You need fine-grained control over execution,
2) Complex branching logic or external API calls,
3) Integration with non-Delta targets. DLT abstracts away orchestration and incremental processing - great for standardization, but less flexible for edge cases."
Key concepts to mention:
• DLT expectations for data quality
• Automatic dependency resolution
• Incremental processing handled automatically
• Trade-off: abstraction vs control
• Streaming and batch in same DLT pipeline
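A sketch of what a DLT expectation looks like in SQL, assuming hypothetical names (`raw_orders` source, `order_id` column):

```sql
-- DLT silver table: the dependency on LIVE.raw_orders is resolved automatically
CREATE OR REFRESH STREAMING LIVE TABLE clean_orders (
  -- Drop rows failing the check; EXPECT alone warns, FAIL UPDATE stops the pipeline
  CONSTRAINT valid_order EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.raw_orders);
```

The ON VIOLATION clause is where the abstraction-vs-control trade-off shows up: you pick a policy per constraint instead of writing your own quarantine logic.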
💡 What Interviewers Are Testing:
Can you make architecture decisions with trade-offs? Shows you think beyond "what tool exists."
🎯 3: Explain MERGE INTO. When would you use it vs INSERT OVERWRITE?
⚠️ The Junior Answer (Not Hired):
"MERGE does upserts - updates existing rows or inserts new ones."
Why this fails: Correct but doesn't explain performance implications or alternatives.
✅ The Mid-Level Answer (Hired):
"MERGE INTO matches source rows against target based on a condition, then updates/inserts/deletes.
Use it for: CDC (Change Data Capture), SCD Type 2 history tracking, or when source has mixed updates and inserts.
INSERT OVERWRITE replaces entire partitions - faster when you're recomputing complete partitions anyway (daily reload of a partition).
MERGE scans the target table to find matches - for large tables, this is expensive. If I'm always inserting new data with no updates, plain INSERT is faster.
Key optimization: partition the target table and include partition columns in the MERGE condition."
Key concepts to mention:
• MERGE scans target for matching keys
• Partition pruning in MERGE conditions
• INSERT OVERWRITE for full partition replacement
• CDC and SCD Type 2 use cases
• Performance: MERGE on large tables is expensive
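The partition-pruning optimization above can be sketched as follows; `target`, `updates`, `customer_id`, and `event_date` are hypothetical names:

```sql
-- Upsert CDC rows; the partition column in the match condition
-- lets Delta prune partitions instead of scanning the whole target
MERGE INTO target t
USING updates s
  ON t.customer_id = s.customer_id
 AND t.event_date = s.event_date   -- partition column included for pruning
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Without the `event_date` predicate, the match condition alone forces a scan of every partition of the target.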
💡 What Interviewers Are Testing:
Do you understand write pattern trade-offs? Shows you can choose the right pattern for the workload.
🎯 4: How do you handle duplicate records in your pipeline?
⚠️ The Junior Answer (Not Hired):
"Use distinct() or dropDuplicates()."
Why this fails: Works for one-time dedup, but doesn't address incremental pipelines.
✅ The Mid-Level Answer (Hired):
"Depends on where and when duplicates appear:
1) Source duplicates in batch data: dropDuplicates() with specific columns, or window functions with row_number() to pick the latest.
2) Duplicates from pipeline reruns: MERGE with match condition on business keys - duplicates become no-ops.
3) Streaming duplicates: dropDuplicatesWithinWatermark() with event-time deduplication.
4) Prevention: Delta doesn't enforce uniqueness (primary key constraints are informational), so rely on idempotent MERGE writes and catch violations with DLT expectations or custom checks. I'd identify WHY duplicates exist (source issue? reprocessing?) and fix the root cause, not just filter them."
Key concepts to mention:
• dropDuplicates() with specific columns
• Window functions for keeping latest
• MERGE for idempotent writes
• Streaming watermark-based dedup
• Root cause analysis vs symptom treatment
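The window-function pattern for keeping the latest record can be sketched like this, with hypothetical table and column names (`raw_orders`, `order_id`, `updated_at`):

```sql
-- Keep only the most recent row per business key
WITH ranked AS (
  SELECT *,
         row_number() OVER (
           PARTITION BY order_id        -- business key
           ORDER BY updated_at DESC     -- latest version wins
         ) AS rn
  FROM raw_orders
)
SELECT * EXCEPT (rn)   -- EXCEPT is Databricks SQL syntax for dropping a column
FROM ranked
WHERE rn = 1;
```

Unlike dropDuplicates(), this gives you deterministic control over which duplicate survives.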
💡 What Interviewers Are Testing:
Can you handle real data quality issues? Shows you've dealt with messy production data.
🎯 5: What's the difference between partitioning and Z-ordering? When do you use each?
⚠️ The Junior Answer (Not Hired):
"Both improve query performance by organizing data."
Why this fails: Too vague - doesn't explain how they work differently.
✅ The Mid-Level Answer (Hired):
"Partitioning physically separates data into directories (table/date=2024-01-01/) - queries with partition filters skip entire directories. Use for: low-cardinality filter columns (date, region) that most queries filter on.
Z-ordering co-locates related data WITHIN files using space-filling curves - enables data skipping via file-level min/max stats. Use for: high-cardinality columns, or columns commonly filtered together.
Key difference: partition pruning is directory-level (free), Z-ordering is file-level (needs stats lookup). I'd partition by date, Z-order by customer_id or product_category."
Key concepts to mention:
• Partition = directory structure
• Z-ordering = data co-location within files
• Partition pruning vs data skipping
• Partition column cardinality limits (~10k values max)
• Z-order after OPTIMIZE
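Both layouts together can be sketched as below, assuming a hypothetical `events` table with `event_date` (low cardinality) and `customer_id` (high cardinality):

```sql
-- Partition by the low-cardinality column at creation time
CREATE TABLE events (
  event_date  DATE,
  customer_id BIGINT,
  payload     STRING
)
PARTITIONED BY (event_date);

-- Z-order the files within each partition on the high-cardinality column
OPTIMIZE events
ZORDER BY (customer_id);
```

Note that ZORDER is a clause of OPTIMIZE, which is why it's listed as something you run "after OPTIMIZE" style maintenance rather than a table property.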
💡 What Interviewers Are Testing:
Do you understand storage optimization strategies? Shows you can design tables for performance.
🎯 6: How do you monitor and alert on pipeline failures?
⚠️ The Junior Answer (Not Hired):
"Check the Databricks job logs when something fails."
Why this fails: Reactive, not proactive - you shouldn't find out from users.
✅ The Mid-Level Answer (Hired):
"Multi-layer monitoring:
1) Job-level: Configure email/Slack alerts on job failure in Databricks Workflows,
2) Data quality: DLT expectations that fail or warn on quality issues, plus custom checks (row counts, null percentages),
3) Freshness: track table update timestamps, alert if data is stale beyond SLA,
4) Metrics: push custom metrics to monitoring tools (Datadog, CloudWatch) for dashboards.
I'd also implement a "heartbeat" table that downstream consumers check - if not updated, they know upstream failed. Log important checkpoints for debugging."
Key concepts to mention:
• Workflow alerts (email, Slack, PagerDuty)
• DLT expectations for data quality
• Data freshness/SLA monitoring
• Custom metrics to external systems
• Heartbeat tables for dependency awareness
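The freshness check above can be sketched against Delta's own commit history - a hedged example assuming a table `my_table` and a 2-hour SLA (both hypothetical); the result would be wired to a Databricks SQL alert:

```sql
-- Staleness check: flag the table if its last commit is older than the SLA
SELECT
  max(timestamp) AS last_write,
  max(timestamp) < current_timestamp() - INTERVAL 2 HOURS AS is_stale
FROM (DESCRIBE HISTORY my_table);
```

Using the transaction log this way avoids scanning the data itself just to find the last write time.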
💡 What Interviewers Are Testing:
Do you build production-ready pipelines? Shows you think about operations, not just development.
🎯 7: Explain VACUUM in Delta Lake. What's the risk if you get it wrong?
⚠️ The Junior Answer (Not Hired):
"VACUUM cleans up old files to save storage."
Why this fails: Doesn't explain retention or the time travel impact.
✅ The Mid-Level Answer (Hired):
"VACUUM removes data files no longer referenced by the transaction log, reclaiming storage. Default retention is 7 days - files older than that are deleted.
The risk: time travel queries and long-running jobs. If a query that started before VACUUM tries to read a file VACUUM has since deleted, it fails.
Setting retention too low (going below 7 days requires spark.databricks.delta.retentionDurationCheck.enabled = false) breaks time travel to versions older than the window.
I'd schedule VACUUM:
1) when no long-running queries or streams are active,
2) after verifying no pipelines need historical versions,
3) with retention aligned to compliance/audit requirements."
Key concepts to mention:
• Removes unreferenced data files
• Default 7-day retention
• Time travel depends on retained files
• Long-running query failure risk
• retentionDurationCheck safety guard
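A safe VACUUM workflow can be sketched as below, with `my_table` as a hypothetical table name; 168 hours is the 7-day default retention expressed explicitly:

```sql
-- Preview which files would be deleted, without removing anything
VACUUM my_table RETAIN 168 HOURS DRY RUN;

-- Actually reclaim storage, keeping 7 days of history for time travel
VACUUM my_table RETAIN 168 HOURS;
```

The DRY RUN pass is the cheap insurance step: it lets you confirm the file list before any history becomes unrecoverable.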
💡 What Interviewers Are Testing:
Do you understand Delta maintenance and risks? Shows you can manage tables safely in production.


