The Databricks Data Engineer

Databricks CDC Interview: The 200GB/Day Pipeline with SCD Type 2

Why custom MERGE logic is the trap, and how AUTO CDC + Liquid Clustering is the senior answer that wins FAANG offers

Jakub Lasak
Dec 05, 2025
∙ Paid

Senior Databricks Engineer Interview: “Design a CDC (Change Data Capture) pipeline for 200GB/day of database changes with < 15 min freshness. You must maintain full SCD Type 2 history.”

(Hint: Writing your own MERGE logic is the trap).

This isn’t just about moving data. It’s a test of handling complexity at scale. You have to balance correctness (handling out-of-order data), performance (merging into massive tables), and storage costs (rewrite amplification).

A junior engineer writes complex PySpark code. A senior engineer simplifies the architecture.

Here’s the breakdown that gets you the offer.


⚙️ Trade-off 1: The Processing Logic - Custom Logic vs. Declarative Frameworks

Context: You need to apply inserts, updates, and deletes while maintaining SCD Type 2 history (tracking historical changes with start/end dates).
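Before weighing the options, it helps to pin down what SCD Type 2 actually requires. Here is a minimal in-memory sketch in plain Python (column names like `start_date`, `end_date`, and `is_current` are illustrative assumptions, not a fixed Databricks schema): every update closes out the current row and appends a new versioned one, so history is never overwritten.

```python
from datetime import date

def apply_scd2_update(history, key, new_value, change_date):
    """Close the current row for `key` and append a new current row
    (SCD Type 2: keep every historical version with validity dates)."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            row["end_date"] = change_date   # close out the old version
            row["is_current"] = False
    history.append({
        "key": key,
        "value": new_value,
        "start_date": change_date,
        "end_date": None,                   # open-ended: still current
        "is_current": True,
    })

history = []
apply_scd2_update(history, "cust_42", "Berlin", date(2025, 1, 1))
apply_scd2_update(history, "cust_42", "Munich", date(2025, 6, 1))
# history now holds two rows: the closed "Berlin" version
# (valid 2025-01-01 to 2025-06-01) and the current "Munich" version.
```

At 200GB/day, doing this close-and-append dance means rewriting target files on every merge, which is exactly where the performance and rewrite-amplification trade-offs bite.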

The Trap: Writing a custom MERGE INTO statement in PySpark/SQL. Juniors often think, “I know SQL, I’ll just write the MERGE.” Then they hit reality: out-of-order change events, massive merge scans, and rewrite amplification.
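One concrete failure mode of the hand-rolled approach, sketched in plain Python (a hypothetical illustration, not Databricks API code): if your merge effectively trusts arrival order instead of sequencing by the source commit timestamp, a late-arriving older change silently overwrites a newer one.

```python
def apply_naive(target, changes):
    """Trust arrival order -- what a simple hand-written MERGE
    effectively does when the stream delivers changes out of order."""
    for change in changes:
        target[change["key"]] = change  # last arrival wins

def apply_sequenced(target, changes):
    """Only accept a change if its source timestamp is newer than
    what the target already holds (sequence-by semantics)."""
    for change in changes:
        current = target.get(change["key"])
        if current is None or change["ts"] > current["ts"]:
            target[change["key"]] = change

# A newer update (ts=2) arrives before an older change (ts=1) is replayed.
changes = [
    {"key": "cust_42", "ts": 2, "city": "Munich"},
    {"key": "cust_42", "ts": 1, "city": "Berlin"},  # late, out of order
]

naive, sequenced = {}, {}
apply_naive(naive, changes)
apply_sequenced(sequenced, changes)
# naive ends with the stale "Berlin"; sequenced keeps "Munich"
```

Handling this correctly by hand means carrying the sequencing column through every MERGE condition, and getting it wrong corrupts your SCD Type 2 history silently.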
