The Databricks Data Engineer

Databricks CDC Interview: The 200GB/Day Pipeline with SCD Type 2

Why custom MERGE logic is the trap, and how AUTO CDC + Liquid Clustering is the senior answer that wins FAANG offers

Jakub Lasak
Dec 05, 2025
∙ Paid

Senior Databricks Engineer Interview: “Design a CDC (Change Data Capture) pipeline for 200GB/day of database changes with < 15 min freshness. You must maintain full SCD Type 2 history.”

(Hint: Writing your own MERGE logic is the trap).

This isn’t just about moving data. It’s a test of handling complexity at scale. You have to balance correctness (handling out-of-order data), performance (merging into massive tables), and storage costs (rewrite amplification).

A junior engineer writes complex PySpark code. A senior engineer simplifies the architecture.

Here’s the breakdown that gets you the offer.


⚙️ Trade-off 1: The Processing Logic - Custom Logic vs. Declarative Frameworks

Context: You need to apply inserts, updates, and deletes while maintaining SCD Type 2 history (tracking historical changes with start/end dates).
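Before weighing the options, it helps to pin down what SCD Type 2 actually requires. Here is a minimal in-memory sketch in plain Python (column names like `start_date`, `end_date`, and `is_current` are illustrative assumptions, not a fixed Databricks schema): every update closes out the current row and appends a new versioned one, so history is never overwritten.

```python
from datetime import date

def apply_scd2_update(history, key, new_value, change_date):
    """Close the current row for `key` and append a new current row
    (SCD Type 2: keep every historical version with validity dates)."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            row["end_date"] = change_date   # close out the old version
            row["is_current"] = False
    history.append({
        "key": key,
        "value": new_value,
        "start_date": change_date,
        "end_date": None,                   # open-ended: still current
        "is_current": True,
    })

history = []
apply_scd2_update(history, "cust_42", "Berlin", date(2025, 1, 1))
apply_scd2_update(history, "cust_42", "Munich", date(2025, 6, 1))
# history now holds two rows: the closed "Berlin" version
# (valid 2025-01-01 to 2025-06-01) and the current "Munich" version.
```

At 200GB/day, doing this close-and-append dance means rewriting target files on every merge, which is exactly where the performance and rewrite-amplification trade-offs bite.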

The Trap: Writing a custom MERGE INTO statement in PySpark/SQL. Juniors often think, “I know SQL, I’ll just write the MERGE.” Then they hit reality: out-of-order change events, massive merge scans, and rewrite amplification.
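One concrete failure mode of the hand-rolled approach, sketched in plain Python (a hypothetical illustration, not Databricks API code): if your merge effectively trusts arrival order instead of sequencing by the source commit timestamp, a late-arriving older change silently overwrites a newer one.

```python
def apply_naive(target, changes):
    """Trust arrival order -- what a simple hand-written MERGE
    effectively does when the stream delivers changes out of order."""
    for change in changes:
        target[change["key"]] = change  # last arrival wins

def apply_sequenced(target, changes):
    """Only accept a change if its source timestamp is newer than
    what the target already holds (sequence-by semantics)."""
    for change in changes:
        current = target.get(change["key"])
        if current is None or change["ts"] > current["ts"]:
            target[change["key"]] = change

# A newer update (ts=2) arrives before an older change (ts=1) is replayed.
changes = [
    {"key": "cust_42", "ts": 2, "city": "Munich"},
    {"key": "cust_42", "ts": 1, "city": "Berlin"},  # late, out of order
]

naive, sequenced = {}, {}
apply_naive(naive, changes)
apply_sequenced(sequenced, changes)
# naive ends with the stale "Berlin"; sequenced keeps "Munich"
```

Handling this correctly by hand means carrying the sequencing column through every MERGE condition, and getting it wrong corrupts your SCD Type 2 history silently.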
