How One SQL Line Becomes 4,000 Parallel Tasks
The complete guide to Spark’s execution engine - from SQL to parallel tasks, and how to find bottlenecks when things get slow.
You wrote one line of SQL. Spark turned it into 4,000 parallel tasks across 50 machines.
No distributed code. No MapReduce boilerplate. Just SQL.
The LinkedIn post walked through the core pipeline: SQL → logical plan → physical plan → DAG → jobs → stages → tasks. That’s the skeleton. Now let’s put muscle on it.
This premium deep-dive covers what actually happens at each step, how to read what Spark is doing, why certain queries are slow, and how to trace from “this is taking forever” to “that’s the exact bottleneck.”
By the end, you’ll be able to open a slow query’s execution details, find the problem stage in under a minute, and explain to an interviewer exactly how Spark executes distributed SQL.
Part 1: The Core Pipeline (Quick Recap)
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “Quick refresher. When I click Run on a SQL query - what’s the sequence?”
𝗦𝗽𝗮𝗿𝗸: “Four phases, always in this order:
Parsing: Your SQL string becomes an unresolved logical plan - basically an abstract syntax tree of your intent.
Analysis: Catalyst resolves table names, column types, and function signatures against the catalog. If your column doesn’t exist, this is where it fails.
Optimization: The logical plan gets rewritten. Predicate pushdown, constant folding, projection pruning, join reordering. Dozens of rules fire to make your plan more efficient without changing the result.
Physical Planning: The optimized logical plan becomes a physical plan - the actual execution strategy. This is where Spark decides: broadcast join or sort-merge join? How many partitions for the shuffle? Which columns to read from Parquet?”
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “And the physical plan becomes the DAG?”
𝗦𝗽𝗮𝗿𝗸: “Exactly. The physical plan maps to a DAG - a directed acyclic graph of operations. Each action (write, collect, count) triggers a job. Each job splits into stages at shuffle boundaries. Each stage splits into tasks - one per partition.
That’s your 4,000 tasks: multiple jobs × multiple stages × partitions per stage.”
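To make that multiplication concrete, here’s a back-of-the-envelope sketch in plain Python. The job, stage, and partition counts below are invented for illustration, not taken from a real query:

```python
# Conceptual sketch: within each stage, tasks are one-per-partition,
# so the total task count is the sum of partitions across all stages.
# The job/stage/partition counts below are illustrative only.
query_jobs = {
    "job_0_scan_and_shuffle": [1000, 200],      # partition count per stage
    "job_1_join_and_aggregate": [200, 200, 200],
    "job_2_write_output": [2200],
}

total_tasks = sum(
    partitions
    for stages in query_jobs.values()
    for partitions in stages
)
print(total_tasks)  # 4000
```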
Part 2: Inside the Catalyst Optimizer
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “The optimization step - everyone mentions ‘Catalyst’ but what does it actually do to my query?”
𝗦𝗽𝗮𝗿𝗸: “Catalyst applies rules in batches. Each batch rewrites your logical plan to be more efficient. Here are the ones that matter most for your daily work:
Predicate Pushdown: If you filter on a column after joining two tables, Catalyst pushes that filter down to BEFORE the join. Instead of joining 500 million rows and then filtering to 10 million, it filters first and joins only 10 million rows. This alone can make a query 10x faster.
Projection Pruning: If your query only needs 5 columns from a 200-column table, Catalyst tells Spark to only read those 5 columns from the Parquet files. Since Parquet is columnar, this means Spark physically skips reading 195 columns from disk.
Boolean Simplification: If your query says WHERE date > '2025-01-01' AND date > '2024-06-01', Catalyst simplifies it to WHERE date > '2025-01-01'. Small optimization, but it compounds across complex queries.
Join Reordering: If you join tables A, B, and C, the order matters enormously. Joining a 10-row lookup table with a billion-row fact table first (filtering down to 50 million rows) and THEN joining with another large table is dramatically faster than joining two billion-row tables first. Catalyst uses statistics to figure out the best order.”
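To see the shape of one of these rules, here’s a toy predicate-pushdown rewrite in plain Python. This is nothing like Catalyst’s real Scala tree classes - the plan is just nested tuples and the helper name is invented - but the transformation is the same idea: a filter sitting above a join moves below it.

```python
# Toy logical plan as nested tuples. The rule's shape:
# Filter(Join(A, B)) -> Join(Filter(A), B) when the filter
# references only columns owned by the left side.

def push_filter_below_join(plan, left_columns):
    """If a filter above a join touches only left-side columns,
    rewrite the tree so the filter runs before the join."""
    if plan[0] == "filter":
        _, column, value, child = plan
        if child[0] == "join" and column in left_columns:
            _, key, left, right = child
            return ("join", key, ("filter", column, value, left), right)
    return plan  # rule doesn't apply; plan unchanged

plan = ("filter", "country", "DE",
        ("join", "customer_id",
         ("scan", "customers"),
         ("scan", "orders")))

optimized = push_filter_below_join(plan, left_columns={"country"})
print(optimized)
# ('join', 'customer_id',
#  ('filter', 'country', 'DE', ('scan', 'customers')),
#  ('scan', 'orders'))
```

Same result either way - but the rewritten plan filters customers before they ever reach the join.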
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “So Catalyst is rewriting my query into a better version of itself?”
𝗦𝗽𝗮𝗿𝗸: “Exactly. Same result, different execution path. And this happens before a single byte of data is read. You get these optimizations for free just by writing SQL - Catalyst doesn’t care if your query is messy. It’ll clean it up.”
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “When does it NOT help? When does my query structure actually matter?”
𝗦𝗽𝗮𝗿𝗸: “Catalyst is smart, but it’s not omniscient. Three situations where your query structure matters:
User-defined functions (UDFs): Catalyst can’t see inside UDFs. It treats them as black boxes, so it can’t push predicates past them or optimize around them. If you wrap a filter column in a UDF, predicate pushdown breaks.
Complex subquery patterns: Deeply nested subqueries or correlated subqueries can confuse the optimizer. Sometimes restructuring a deeply nested query into a flatter form gives Catalyst more room to optimize.
Statistics are stale or missing: Join reordering and broadcast decisions depend on table statistics. If Catalyst thinks your table has 1,000 rows when it actually has 100 million, it’ll make bad decisions. Running ANALYZE TABLE periodically keeps statistics fresh.”
Part 3: Physical Planning - Where Strategy Gets Chosen
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “The physical plan step - this is where Spark decides HOW to execute. What decisions happen here?”
𝗦𝗽𝗮𝗿𝗸: “The biggest decision is join strategy. For every join in your query, Spark picks one of these:
Broadcast Hash Join: When one side of the join is small enough to fit in memory on every executor. Spark sends the small table to all executors. Each executor joins its partition of the large table with the full small table locally. No shuffle needed for the large table. This is the fastest join type by far.
Sort-Merge Join: When both sides are large. Spark shuffles both tables so that matching keys land on the same executor, sorts both sides by the join key, then merges them. Two shuffles, two sorts - expensive but works at any scale.
Shuffle Hash Join: Similar to sort-merge but skips the sort step. Faster when one side is moderately small (not small enough for broadcast, but small enough to build a hash table in memory).
The threshold for broadcast is configurable via spark.sql.autoBroadcastJoinThreshold, but by default Spark broadcasts tables under 10MB. With AQE enabled, Spark can detect at runtime that a table (after filtering) is actually small enough to broadcast, even if the original table is huge.”
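Here’s a broadcast hash join mimicked in plain Python as a conceptual sketch - not Spark’s implementation, and all the table data is made up. The small side becomes a lookup map shipped everywhere; each large-side partition joins independently without moving:

```python
# Conceptual sketch of a broadcast hash join (not Spark's implementation).
small_table = [(1, "electronics"), (2, "books")]       # (category_id, name)
large_partitions = [                                    # (order_id, category_id)
    [(101, 1), (102, 2)],
    [(103, 1), (104, 3)],   # category 3 has no match -> dropped (inner join)
]

# The small side becomes a hash map, shipped whole to every executor.
broadcast_map = dict(small_table)

# Each large-side partition probes the map locally -- no shuffle of the
# large table, no coordination between partitions.
joined = [
    (order_id, cat_id, broadcast_map[cat_id])
    for partition in large_partitions
    for order_id, cat_id in partition
    if cat_id in broadcast_map
]
print(joined)
# [(101, 1, 'electronics'), (102, 2, 'books'), (103, 1, 'electronics')]
```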
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “Wait - AQE can change the join strategy at runtime? So the plan changes DURING execution?”
𝗦𝗽𝗮𝗿𝗸: “Yes. This is one of AQE’s most powerful features. Here’s what happens:
Before AQE, Spark committed to the physical plan before reading any data. If statistics were wrong, you were stuck with a bad plan.
With AQE, Spark executes stage by stage. After each stage completes, it looks at the actual data volumes - not estimated, but real - and can re-optimize the remaining stages.
Three things AQE adjusts at runtime:
Coalescing shuffle partitions: If the default 200 shuffle partitions (spark.sql.shuffle.partitions) produce mostly tiny partitions (common with filtered data), AQE merges small partitions together. Instead of 200 tasks where 180 process almost nothing, you get 30 tasks with reasonable data each.
Converting to broadcast join: If a shuffle stage produces a small result (say 5MB instead of the estimated 2GB), AQE converts the next join from sort-merge to broadcast. This eliminates one entire shuffle.
Handling skew joins: If AQE detects that one partition is dramatically larger than the median - both the factor and the minimum size are configurable - it splits that partition into smaller sub-partitions and replicates the other side to match. The skewed key gets parallel processing instead of one overwhelmed task.”
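The first of those three - coalescing - can be sketched in a few lines of plain Python. The real logic lives in Spark’s adaptive rules; the partition sizes and the 64MB target below are illustrative:

```python
# Conceptual sketch of AQE partition coalescing: merge adjacent small
# shuffle partitions until each merged bucket reaches a target size.
# Sizes and target are made up; this is not Spark's actual rule.
def coalesce_partitions(sizes_mb, target_mb=64):
    buckets, current = [], []
    for size in sizes_mb:
        current.append(size)
        if sum(current) >= target_mb:   # bucket is big enough -> close it
            buckets.append(current)
            current = []
    if current:                          # leftover small tail
        buckets.append(current)
    return buckets

# 8 shuffle partitions, mostly tiny after an aggressive filter:
sizes = [2, 3, 70, 1, 4, 2, 60, 5]
merged = coalesce_partitions(sizes)
print(len(sizes), "->", len(merged), "tasks")  # 8 -> 3 tasks
```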
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “So AQE is basically Spark fixing its own bad estimates during execution?”
𝗦𝗽𝗮𝗿𝗸: “Exactly. It’s self-correcting execution. And it’s why the gap between ‘works fine in dev’ and ‘breaks in production’ has gotten smaller - AQE handles a lot of the data-volume surprises that used to require manual tuning.”
Part 4: Shuffles - The Most Expensive Operation You Can’t Avoid
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “Everyone says shuffles are expensive. What actually happens during a shuffle?”
𝗦𝗽𝗮𝗿𝗸: “A shuffle is data redistribution. When Spark needs all records with the same key on the same executor - for a GROUP BY, JOIN, or REPARTITION - it has to physically move data across the network. Here’s the sequence:
Shuffle Write (map side): Each task in the current stage writes its output to local disk, organized by the target partition. If the shuffle has 200 output partitions, each task writes one data file with 200 sections - one per target partition - plus an index file recording where each section starts. This is the ‘shuffle write’ you see in the Spark UI.
Shuffle Read (reduce side): Each task in the next stage reads the relevant section from EVERY map-side task. If the previous stage had 100 tasks and the current stage has 200 partitions, each reduce task fetches its data from up to 100 remote locations. This is the ‘shuffle read.’
That’s why shuffles are expensive: they hit disk (write), network (transfer), and disk again (read). A single GROUP BY on a 500GB dataset can shuffle the entire 500GB across the cluster.”
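Here’s that write-then-read sequence mimicked in plain Python for a word-count-style GROUP BY. It’s a conceptual sketch only - Spark’s real shuffle files are binary, compressed, and indexed:

```python
# Conceptual shuffle sketch: map tasks bucket output by
# hash(key) % num_partitions; each reduce task then collects
# its bucket from EVERY map task's output.
NUM_REDUCE_PARTITIONS = 4

map_task_outputs = [
    [("alice", 1), ("bob", 1)],      # rows held by map task 0
    [("alice", 1), ("carol", 1)],    # rows held by map task 1
]

# Shuffle write: each map task produces one bucketed output,
# with a section per target partition.
shuffle_files = []
for rows in map_task_outputs:
    buckets = {p: [] for p in range(NUM_REDUCE_PARTITIONS)}
    for key, value in rows:
        buckets[hash(key) % NUM_REDUCE_PARTITIONS].append((key, value))
    shuffle_files.append(buckets)

# Shuffle read: reduce task p fetches section p from every map task
# (one network fetch per map task in real Spark), then aggregates.
counts = {}
for p in range(NUM_REDUCE_PARTITIONS):
    for buckets in shuffle_files:
        for key, value in buckets[p]:
            counts[key] = counts.get(key, 0) + value

print(counts)  # equals {'alice': 2, 'bob': 1, 'carol': 1}, key order may vary
```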
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “And shuffle spill? I see that metric sometimes.”
𝗦𝗽𝗮𝗿𝗸: “Spill happens when a task runs out of memory during a shuffle operation. Spark doesn’t fail - it spills data to local disk and continues processing. But now it’s doing extra disk I/O on top of the shuffle.
You’ll see two spill metrics:
Spill (Memory): The amount of data that was in memory before it had to be spilled. This tells you how much data the task tried to hold.
Spill (Disk): The compressed size written to disk. Since Spark compresses spilled data, this is usually smaller than the memory number.
If you see spill on a handful of tasks while others are fine, that’s data skew - a few tasks got disproportionately large partitions. If ALL tasks are spilling, your executors need more memory or you need more partitions to reduce per-task data volume.”
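The spill mechanism itself can be sketched with a toy loop in plain Python - the memory budget and row sizes are invented, and Spark’s real memory manager is far more involved (it also compresses what it writes, which is why the disk number is smaller):

```python
# Conceptual spill sketch: a task buffers rows until its memory budget
# is exhausted, then flushes the buffer to disk and keeps going
# instead of failing. Numbers are made up.
MEMORY_BUDGET_MB = 100

def process_with_spill(row_batch_sizes_mb):
    in_memory, spilled_batches = [], []
    for size in row_batch_sizes_mb:
        if sum(in_memory) + size > MEMORY_BUDGET_MB:
            spilled_batches.append(in_memory)   # this is what 'Spill (Memory)' counts
            in_memory = []
        in_memory.append(size)
    return in_memory, spilled_batches

remaining, spills = process_with_spill([40, 40, 40, 40, 40])
spill_memory_mb = sum(sum(batch) for batch in spills)
print(spill_memory_mb)  # 160: the task held 200MB of work in a 100MB budget
```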
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “How do I reduce shuffle cost?”
𝗦𝗽𝗮𝗿𝗸: “Three approaches, in order of impact:
Eliminate the shuffle entirely: Use broadcast joins when one table is small. Filter data before joins and aggregations - a WHERE clause before a GROUP BY can reduce shuffle volume by 90%.
Reduce what gets shuffled: Only select the columns you need before the shuffle. Shuffling 5 columns instead of 50 across 500GB means shuffling 50GB instead of 500GB.
Partition your output strategically: If you always aggregate by date, and your Delta table is organized by date (via liquid clustering or partitioning), each task sees only a few distinct keys, so Spark’s map-side partial aggregation collapses most rows locally before the shuffle. Less data moves.”
Part 5: Reading the Spark UI Like a Senior Engineer
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “Okay, I understand the theory. Now I have a slow query. Walk me through diagnosing it.”
𝗦𝗽𝗮𝗿𝗸: “Open the Spark UI. Here’s your diagnostic flow:
Step 1: Jobs tab → find the slow job
Each action in your query creates a job. Find the job that took the longest. Click into it.
Step 2: Stages within that job → find the slow stage
You’ll see all stages for that job with their durations. One stage is almost always the bottleneck. Click the slow one.
Step 3: Within that stage → Check these metrics in order:
Task duration distribution: Look at the min, median, and max task duration. If the max is 10x the median, you have data skew. One partition has way more data than the others.
Shuffle Read Size: If this stage reads a lot of shuffled data, the previous stage’s shuffle is your bottleneck. Check whether you can reduce what’s being shuffled (fewer columns, earlier filtering).
Spill metrics: If you see spill to disk, tasks are running out of memory. Either increase executor memory, increase partition count, or fix the skew that’s overloading specific tasks.
Input size vs. records: If you’re reading 500GB but only processing 10 million rows, you might be reading too many columns or your data isn’t being pruned effectively (check if predicate pushdown is working).
Task count: If a stage has only 1 or 2 tasks processing huge amounts of data while the cluster has 50 executors, you have a parallelism problem. The data needs more partitions.”
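The first check - comparing max to median task duration - is simple enough to sketch. The durations and the 10x factor below are illustrative:

```python
# Conceptual skew check: is the slowest task far beyond the median?
# (Real diagnosis uses the Spark UI's min/median/max summary row.)
import statistics

def looks_skewed(task_durations_sec, factor=10):
    """Flag a stage where the slowest task runs far longer than the median."""
    return max(task_durations_sec) >= factor * statistics.median(task_durations_sec)

healthy = [30, 31, 29, 33, 30, 32]
skewed  = [30, 31, 29, 33, 30, 900]   # one task got the giant partition

print(looks_skewed(healthy))  # False
print(looks_skewed(skewed))   # True
```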
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “So the diagnostic is: slow job → slow stage → check task distribution and shuffle metrics?”
𝗦𝗽𝗮𝗿𝗸: “Yes. And here’s the part most engineers miss: the stage name in the Spark UI tells you which operation created it. If Stage 3 is labeled ‘Exchange hashpartitioning(customer_id)’ and it’s the bottleneck - you know the JOIN or GROUP BY on customer_id is the problem. You know exactly which part of your SQL to optimize.
That connection - from slow stage back to the specific SQL operation - is what separates ‘I think it’s slow somewhere’ from ‘the GROUP BY on customer_id is skewed because 40% of transactions belong to 3 enterprise accounts.’”
Part 6: The Five Bottleneck Patterns You’ll See 80% of the Time
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “What are the most common things I’ll find when diagnosing?”
𝗦𝗽𝗮𝗿𝗸: “Five patterns cover 80% of slow queries in production:
Pattern 1: Data Skew in Joins
What you see: One task takes 15 minutes while the other 199 finish in 30 seconds. Spill metrics on that one task.
What’s happening: One join key has massively more records than others. Classic example: a NULL key, a default value like ‘unknown’, or a mega-customer that represents 30% of all transactions.
What to do: Filter out NULLs before the join if they’re not needed. For legitimate skew, AQE’s skew join handling should kick in automatically. If it doesn’t, salt the skewed key - append a random number (0-9) to the key on the skewed side, replicate the other side once per salt value so every salted key still finds its match, join on the salted key, then drop the salt. This splits one massive partition into 10 manageable ones.
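Here’s the salting trick sketched in plain Python (not the Spark API - in a real query you’d typically add a salt column with rand() on the fact side and explode an array of salt values on the dimension side; the table data below is invented):

```python
# Conceptual key-salting sketch: the skewed fact side gets a random
# salt 0..SALTS-1 appended to its key; the small dimension side is
# replicated once per salt so every salted key still matches.
import random

SALTS = 10

facts = [("megacorp", amount) for amount in range(1000)]  # one hot key
dims = [("megacorp", "Enterprise")]

# Skewed side: random salt -> the hot key now spreads over 10 partitions.
salted_facts = [((key, random.randrange(SALTS)), amt) for key, amt in facts]

# Other side: one copy per salt value, so the join keys still line up.
salted_dims = {(key, s): tier for key, tier in dims for s in range(SALTS)}

# Join on the salted key, then drop the salt from the result.
joined = [(key, amt, salted_dims[(key, s)]) for (key, s), amt in salted_facts]

assert len(joined) == len(facts)   # no rows lost or duplicated
sub_partitions = {s for (key, s), _ in salted_facts}
print(len(sub_partitions))         # up to 10 sub-partitions instead of 1
```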
Pattern 2: Shuffle Spill Across All Tasks
What you see: Every task in a stage shows spill metrics. Shuffle read is enormous.
What’s happening: Too much data being shuffled through too-small executor memory. Either the dataset is large, or you’re shuffling unnecessary columns.
What to do: Select only needed columns before the shuffle. Filter early. If the dataset is genuinely large, increase executor memory or increase partition count so each task handles less data.
Pattern 3: Too Many Small Tasks
What you see: 10,000 tasks that each process 1MB of data and finish in 100ms. The stage takes 5 minutes just from scheduling overhead.
What’s happening: Over-partitioned data. Either the source has too many small files, or the shuffle created too many empty or tiny partitions.
What to do: AQE should coalesce these automatically. If it’s source files, run OPTIMIZE on the Delta table. If it’s a shuffle, reduce the partition count.
Pattern 4: Full Table Scan When Filter Should Apply
What you see: Input size is the full table even though your query has a WHERE clause.
What’s happening: Predicate pushdown isn’t working. Common causes: the filter column is wrapped in a function or UDF, the column has no statistics, or the data layout doesn’t match the filter pattern.
What to do: Check that you’re filtering on raw columns, not computed expressions. For Delta tables, ensure liquid clustering or Z-ordering aligns with your common filter columns. Run ANALYZE TABLE to refresh statistics.
Pattern 5: Sort-Merge Join When Broadcast Would Work
What you see: Two large shuffle stages before a join, but one side of the join is actually small (under 100MB after filtering).
What’s happening: Catalyst estimated the table as large based on stale or missing statistics, so it chose sort-merge. Or the table IS large but filters reduce it to a small result that AQE hasn’t detected.
What to do: AQE should catch this in most cases. If not, update table statistics. As a last resort, you can hint the broadcast (/*+ BROADCAST(t) */ in SQL), but prefer fixing statistics so Spark makes the right choice automatically.”
Part 7: The Execution Plan - How to Read It
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿: “Can I see the plan BEFORE the query runs? To catch problems early?”


