The Databricks Data Engineer

10 Popular Interview Questions for Mid-Level Databricks Data Engineer Roles ($120-150k)

Where "I Know The Syntax" Stops Working

Jakub Lasak
Dec 07, 2025

🎯 1: Your pipeline writes 10,000 small files daily. Why is this a problem and how do you fix it?

⚠️ The Junior Answer (Not Hired):

"Small files are slower to read. I'd combine them somehow."

Why this fails: Doesn't explain the mechanism or show practical solutions.

✅ The Mid-Level Answer (Hired):

"Small files create three problems:

1) Metadata overhead - listing 10,000 files is slow, especially on cloud storage,

2) Inefficient reads - each file means separate I/O operations and task scheduling,

3) Wasted parallelism - thousands of tiny tasks whose scheduling overhead dominates the actual work.

Solutions: run OPTIMIZE regularly to compact files, enable auto-compaction in Delta (delta.autoOptimize.autoCompact), or repartition/coalesce before writing.

For streaming, increase trigger intervals to batch more data per write. Target file size should be 100MB-1GB."

Key concepts to mention:

• Cloud storage metadata overhead (list operations)

• One file = one task minimum overhead

• OPTIMIZE for compaction

• delta.autoOptimize.autoCompact setting

• Target file size: 100MB-1GB
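A minimal Spark SQL sketch of the compaction fixes above (the table name sales_bronze is a placeholder):

```sql
-- Compact existing small files into larger ones (default target ~1 GB)
OPTIMIZE sales_bronze;

-- Opt the table into automatic compaction and optimized writes going forward
ALTER TABLE sales_bronze SET TBLPROPERTIES (
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true'
);
```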

💡 What Interviewers Are Testing:

Do you understand production data problems? Shows you've run real pipelines, not just tutorials.


🎯 2: When would you use Delta Live Tables vs a regular notebook pipeline?

⚠️ The Junior Answer (Not Hired):

"DLT is the newer way to build pipelines in Databricks."

Why this fails: Doesn't explain trade-offs or decision criteria.

✅ The Mid-Level Answer (Hired):

"DLT is ideal when:

1) You need built-in data quality checks via expectations,

2) The pipeline is mostly transformations following the medallion pattern,

3) You want automatic dependency management between tables.

Regular notebooks are better when:

1) You need fine-grained control over execution,

2) You have complex branching logic or external API calls,

3) You need integration with non-Delta targets.

DLT abstracts away orchestration and incremental processing - great for standardization, but less flexible for edge cases."

Key concepts to mention:

• DLT expectations for data quality

• Automatic dependency resolution

• Incremental processing handled automatically

• Trade-off: abstraction vs control

• Streaming and batch in same DLT pipeline
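A sketch of a DLT expectation in SQL; the table and constraint names are hypothetical, and exact DLT syntax varies by release:

```sql
-- Silver table with a quality gate: rows with a NULL order_id are
-- dropped and counted in the pipeline's data quality metrics.
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(bronze_orders);
```

Swapping `ON VIOLATION DROP ROW` for `FAIL UPDATE` turns the same expectation from a filter into a hard stop.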

💡 What Interviewers Are Testing:

Can you make architecture decisions with trade-offs? Shows you think beyond "what tool exists."


🎯 3: Explain MERGE INTO. When would you use it vs INSERT OVERWRITE?

⚠️ The Junior Answer (Not Hired):

"MERGE does upserts - updates existing rows or inserts new ones."

Why this fails: Correct but doesn't explain performance implications or alternatives.

✅ The Mid-Level Answer (Hired):

"MERGE INTO matches source rows against target based on a condition, then updates/inserts/deletes.

Use it for: CDC (Change Data Capture), SCD Type 2 history tracking, or when source has mixed updates and inserts. INSERT OVERWRITE replaces entire partitions - faster when you're recomputing complete partitions anyway (daily reload of a partition). MERGE scans the target table to find matches - for large tables, this is expensive. If I'm always inserting new data with no updates, plain INSERT is faster. Key optimization: partition the target table and include partition columns in the MERGE condition."

Key concepts to mention:

• MERGE scans target for matching keys

• Partition pruning in MERGE conditions

• INSERT OVERWRITE for full partition replacement

• CDC and SCD Type 2 use cases

• Performance: MERGE on large tables is expensive
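The partition-pruning optimization can be sketched as follows (orders, daily_updates, and the columns are placeholders):

```sql
-- Target is partitioned by event_date; putting the partition column in
-- the match condition lets MERGE skip irrelevant partitions entirely.
MERGE INTO orders AS t
USING daily_updates AS s
  ON t.order_id = s.order_id
 AND t.event_date = s.event_date
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```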

💡 What Interviewers Are Testing:

Do you understand write pattern trade-offs? Shows you can choose the right pattern for the workload.


🎯 4: How do you handle duplicate records in your pipeline?

⚠️ The Junior Answer (Not Hired):

"Use distinct() or dropDuplicates()."

Why this fails: Works for one-time dedup, but doesn't address incremental pipelines.

✅ The Mid-Level Answer (Hired):

"Depends on where and when duplicates appear:

1) Source duplicates in batch data: dropDuplicates() with specific columns, or window functions with row_number() to pick the latest.

2) Duplicates from pipeline reruns: MERGE with a match condition on business keys - duplicates become no-ops.

3) Streaming duplicates: dropDuplicatesWithinWatermark() for event-time deduplication.

4) Prevention: Delta doesn't enforce unique constraints, so I'd rely on idempotent MERGE writes and DLT expectations that flag duplicate keys. I'd identify WHY duplicates exist (source issue? reprocessing?) and fix the root cause, not just filter them."

Key concepts to mention:

• dropDuplicates() with specific columns

• Window functions for keeping latest

• MERGE for idempotent writes

• Streaming watermark-based dedup

• Root cause analysis vs symptom treatment
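The window-function approach in SQL, assuming placeholder tables with a customer_id business key and an updated_at timestamp:

```sql
-- Keep only the newest record per business key.
-- QUALIFY is Databricks SQL; elsewhere, wrap in a subquery and filter rn = 1.
CREATE OR REPLACE TABLE customers_deduped AS
SELECT *
FROM customers_raw
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY updated_at DESC
) = 1;
```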

💡 What Interviewers Are Testing:

Can you handle real data quality issues? Shows you've dealt with messy production data.


🎯 5: What's the difference between partitioning and Z-ordering? When do you use each?

⚠️ The Junior Answer (Not Hired):

"Both improve query performance by organizing data."

Why this fails: Too vague - doesn't explain how they work differently.

✅ The Mid-Level Answer (Hired):

"Partitioning physically separates data into directories (table/date=2024-01-01/) - queries with partition filters skip entire directories. Use for: low-cardinality filter columns (date, region), when most queries filter on this column.

Z-ordering co-locates related data WITHIN files using space-filling curves - enables data skipping via file-level min/max stats. Use for: columns commonly filtered together, and high-cardinality columns where partitioning would explode into too many small directories.

Key difference: partition pruning is directory-level (free), Z-ordering is file-level (needs stats lookup). I'd partition by date, Z-order by customer_id or product_category."

Key concepts to mention:

• Partition = directory structure

• Z-ordering = data co-location within files

• Partition pruning vs data skipping

• Partition column cardinality limits (~10k values max)

• Z-order after OPTIMIZE
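The date-partition/customer Z-order layout described above, sketched with placeholder names:

```sql
-- Partition on the low-cardinality date column...
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10, 2),
  sale_date   DATE
)
USING DELTA
PARTITIONED BY (sale_date);

-- ...then compact and Z-order on the high-cardinality filter column.
OPTIMIZE sales ZORDER BY (customer_id);
```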

💡 What Interviewers Are Testing:

Do you understand storage optimization strategies? Shows you can design tables for performance.


🎯 6: How do you monitor and alert on pipeline failures?

⚠️ The Junior Answer (Not Hired):

"Check the Databricks job logs when something fails."

Why this fails: Reactive, not proactive - you shouldn't find out from users.

✅ The Mid-Level Answer (Hired):

"Multi-layer monitoring:

1) Job-level: configure email/Slack alerts on job failure in Databricks Workflows,

2) Data quality: DLT expectations that fail or warn on quality issues, plus custom checks (row counts, null percentages),

3) Freshness: track table update timestamps, alert if data is stale beyond SLA,

4) Metrics: push custom metrics to monitoring tools (Datadog, CloudWatch) for dashboards.

I'd also implement a 'heartbeat' table that downstream consumers check - if not updated, they know upstream failed. Log important checkpoints for debugging."

Key concepts to mention:

• Workflow alerts (email, Slack, PagerDuty)

• DLT expectations for data quality

• Data freshness/SLA monitoring

• Custom metrics to external systems

• Heartbeat tables for dependency awareness
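The heartbeat and freshness ideas can be sketched as follows (the table names and the 2-hour SLA are assumptions):

```sql
-- Last step of the pipeline: stamp a successful run.
INSERT INTO pipeline_heartbeat
SELECT 'daily_sales_load' AS pipeline_name,
       current_timestamp() AS last_success;

-- Freshness check a consumer or alerting job can poll.
SELECT pipeline_name,
       max(last_success) AS last_success,
       max(last_success) < current_timestamp() - INTERVAL 2 HOURS AS is_stale
FROM pipeline_heartbeat
GROUP BY pipeline_name;
```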

💡 What Interviewers Are Testing:

Do you build production-ready pipelines? Shows you think about operations, not just development.


🎯 7: Explain VACUUM in Delta Lake. What's the risk if you get it wrong?

⚠️ The Junior Answer (Not Hired):

"VACUUM cleans up old files to save storage."

Why this fails: Doesn't explain retention or the time travel impact.

✅ The Mid-Level Answer (Hired):

"VACUUM removes data files no longer referenced by the transaction log, reclaiming storage. Default retention is 7 days - unreferenced files older than that are deleted.

The risk: time travel queries and long-running jobs. If a query that started before VACUUM still needs a file VACUUM deletes, it fails.

Setting retention too low (going below 7 days requires spark.databricks.delta.retentionDurationCheck.enabled = false) shortens or breaks time travel.

I'd run VACUUM:

1) When no long-running queries are active,

2) After verifying no pipelines need historical versions,

3) With retention aligned to compliance/audit requirements."

Key concepts to mention:

• Removes unreferenced data files

• Default 7-day retention

• Time travel depends on retained files

• Long-running query failure risk

• retentionDurationCheck safety guard
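A cautious VACUUM sequence, using a placeholder table name:

```sql
-- Preview which files would be deleted without removing anything
VACUUM sales DRY RUN;

-- Then delete unreferenced files older than the default 7-day window
VACUUM sales RETAIN 168 HOURS;
```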

💡 What Interviewers Are Testing:

Do you understand Delta maintenance and risks? Shows you can manage tables safely in production.


🎯 8: Your query takes 10 minutes but should take 30 seconds. How do you investigate?
