Databricks CDC Interview: The 200GB/Day Pipeline with SCD Type 2
Why custom MERGE logic is the trap, and how AUTO CDC + Liquid Clustering is the senior answer that wins FAANG offers
Senior Databricks Engineer Interview: “Design a CDC (Change Data Capture) pipeline for 200GB/day of database changes with < 15 min freshness. You must maintain full SCD Type 2 history.”
(Hint: Writing your own MERGE logic is the trap).
This isn’t just about moving data. It’s a test of handling complexity at scale. You have to balance correctness (handling out-of-order change events), performance (merging into massive tables), and storage costs (write amplification from rewriting large data files on every merge).
A junior engineer writes complex PySpark code. A senior engineer simplifies the architecture.
Here’s the breakdown that gets you the offer.
⚙️ Trade-off 1: The Processing Logic - Custom Logic vs. Declarative Frameworks
Context: You need to apply inserts, updates, and deletes while maintaining SCD Type 2 history (tracking historical changes with start/end dates).
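To make the SCD Type 2 mechanics concrete, here is a minimal pure-Python sketch (deliberately outside Spark, with illustrative column names like `start_at`/`end_at`/`is_current`) of the bookkeeping every merge must perform: close the currently open row for the key, then append a new version.

```python
# Pure-Python sketch of SCD Type 2 bookkeeping (illustrative, not Spark).
# Each change closes the open row for the key and appends a new version,
# so the full history of the business key is preserved.

def apply_change(history, key, new_value, effective_at):
    """Close the open row for `key` (if any) and append the new version."""
    for row in history:
        if row["key"] == key and row["end_at"] is None:
            row["end_at"] = effective_at       # close the old version
            row["is_current"] = False
    history.append({
        "key": key,
        "value": new_value,
        "start_at": effective_at,              # new version becomes current
        "end_at": None,
        "is_current": True,
    })

history = []
apply_change(history, "cust_1", "Bronze", "2024-01-01")
apply_change(history, "cust_1", "Gold", "2024-03-01")

# Two rows total: "Bronze" closed at 2024-03-01, "Gold" open and current.
```

This is the entire value proposition of SCD Type 2: you never lose the old value, you just timestamp when it stopped being true.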
The Trap: Writing a custom MERGE INTO statement in PySpark/SQL. Juniors often think, “I know SQL, I’ll just write the Merge.” Then they hit reality:
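One concrete way the hand-rolled merge bites: CDC events rarely arrive in commit order. Here is a tiny pure-Python model (all names illustrative) contrasting a naive "last write wins" merge with one that compares source commit timestamps, the extra logic a custom MERGE forces you to write and get right yourself.

```python
# Why a naive custom merge breaks: CDC events can arrive out of order.
# Illustrative pure-Python model, not Spark.

def naive_merge(table, event):
    # The junior trap: trust arrival order, always overwrite.
    table[event["key"]] = event["value"]

def timestamp_aware_merge(table, versions, event):
    # Only apply the event if it is newer than what we already hold.
    if event["ts"] > versions.get(event["key"], ""):
        table[event["key"]] = event["value"]
        versions[event["key"]] = event["ts"]

# The 10:05 update reaches the pipeline BEFORE the 10:00 event.
events = [
    {"key": "cust_1", "value": "Gold",   "ts": "2024-01-01T10:05"},
    {"key": "cust_1", "value": "Bronze", "ts": "2024-01-01T10:00"},
]

naive, versioned, seen = {}, {}, {}
for e in events:
    naive_merge(naive, e)
    timestamp_aware_merge(versioned, seen, e)

# naive ends with the stale "Bronze"; the timestamp-aware merge keeps "Gold".
```

Multiply this by deletes, full SCD Type 2 row-closing, and schema drift, and the "simple" MERGE statement becomes a pile of brittle edge-case handling.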


