spilling and skewing
hello there! if you've dabbled in distributed computing or played around with Apache Spark, you've probably heard the terms "spilling" and "skewing" thrown around. if that's the case, this post might not be for you, but hey, stick around! you never know what new nuggets of knowledge you might pick up.
for the rest of you, don't worry: i'm keeping things simple and using plenty of examples to help you grasp these concepts. simplicity is key.
so, let's dive in.
a quick peek at Spark architecture
before we get to what brought you here, let's take a quick look at how Spark is set up. when you start a Spark application, you're dealing with two main components:
the driver process: think of this as the brains of the operation. it:
- runs the main() function
- keeps track of all the application info
- responds when you poke it (user input)
- analyzes, distributes, and schedules work across the executors
the executors: these are the workhorses. they:
- run the code assigned by the driver
- report results back to base
imagine a trucking company: the driver process is like the command center, telling the trucks where to go, what to carry, and which routes to take. the **executors**? they're the trucks themselves.

for me, the driver process is the heart of Spark! you can have more trucks (executors) to handle more jobs, or bigger trucks (more memory per executor) to haul more data. but if the command center can't keep up, you're in for a world of trouble.
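to make this concrete, here's a minimal PySpark sketch of the split (the app name and sizes are made up, and the executor settings only matter on a real cluster): the script itself runs on the driver, while the counting is farmed out to the executors.

```python
from pyspark.sql import SparkSession

# hypothetical fleet sizing: 4 trucks (executors) with 8g of space each.
# these numbers are illustrative; in local mode they have no effect.
spark = (
    SparkSession.builder
    .appName("trucking-demo")
    .config("spark.executor.instances", "4")  # more trucks
    .config("spark.executor.memory", "8g")    # bigger trucks
    .getOrCreate()
)

# the driver analyzes and schedules this job; the executors do the
# actual scanning and counting, then report the results back
df = spark.range(100_000_000)
print(df.count())
```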
now, let's tackle our main topics.
data spilling: when your truck overflows
so, what's data spilling all about? sticking with our trucking analogy, each truck (executor) has a set amount of space (memory) to carry materials (data). what happens when you need to transport more stuff than your truck can handle?
the simple solution: make multiple trips! that's essentially what data spilling is in Spark.

when the data doesn't fit in memory, Spark writes the overflow to disk. every time you need that spilled data, you have to fetch it from the disk. it's like going back and forth to pick up the extra cargo – not very efficient, right?
but wait, it gets trickier! you can't just dump a data table onto a disk as-is; you need to "serialize" it first – basically, transform the data object into a series of bytes.
and guess what? when you want to use that data again, you have to "deserialize" it. so now your poor truck is not only making multiple trips but also has to repackage the cargo every single time.
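you don't have to guess whether this is happening: Spark reports it. here's a rough sketch (the path and sizes are invented) of a shuffle-heavy aggregation you could run and then inspect in the Spark UI, where a spilled stage shows non-zero spill metrics in its task table.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spill-demo").getOrCreate()

# a wide shuffle like this is a classic spill suspect: every row gets
# hashed to a shuffle partition, and whatever doesn't fit in execution
# memory is serialized and written to disk
orders = spark.range(50_000_000).withColumn("key", F.col("id") % 1_000)
totals = orders.groupBy("key").agg(F.sum("id").alias("total"))
totals.write.mode("overwrite").parquet("/tmp/spill_demo")  # hypothetical path

# while (or after) this runs, open the Spark UI (http://localhost:4040
# by default) and check the stage metrics: non-zero spill (memory/disk)
# values mean your trucks are making extra trips
```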
why does this happen?
- maybe your partitions (the wooden pallets being loaded into the truck) are too big.
- your partitions might be skewed (some pallets are way bigger than others).
- the memory allocated to your executors is too small (time for bigger trucks, my friend).
what can we do about it?
- if there's skewing, we need to tackle that first (more on this in a bit!).
- adjust the number of partitions to a more manageable size (aim for 128-256MB per partition; see the sketch after this list).
- allocate more memory to each executor (upgrade those trucks!).
- keep a close eye on the Spark UI.
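here's what those two knobs look like in PySpark (a sketch: the dataset path, the 400, and the 16g are made-up values you'd tune for your own data):

```python
from pyspark.sql import SparkSession

# bigger trucks: executor memory is fixed at launch, so it's set when
# the session is built (or via spark-submit), not mid-job
spark = (
    SparkSession.builder
    .appName("spill-fixes")
    .config("spark.executor.memory", "16g")
    .getOrCreate()
)

sales = spark.read.parquet("/tmp/customer_sales")  # hypothetical dataset

# resize partitions: aiming for 128-256MB each, a ~60GB dataset lands
# somewhere around 240-480 partitions, so 400 is a reasonable target
sales = sales.repartition(400)
```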
data skewing: when some trucks carry more than others
now, let's talk about data skewing. this happens when your data isn't spread out evenly across partitions. it's like some trucks are hauling massive loads while others are practically empty.
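before you can fix skew, you have to spot it. a quick way (a sketch, reusing the invented table from above) is to count rows per key and eyeball the top offenders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()
sales = spark.read.parquet("/tmp/customer_sales")  # hypothetical dataset

# if one customer_id has millions of rows while the rest have a handful,
# that column is skewed and the partitions built from it will be too
(sales.groupBy("customer_id")
      .count()
      .orderBy(F.desc("count"))
      .show(10))
```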
how can we fix it?
- got a lot of nulls causing trouble? if you don't need them, remove them.
- play around with Spark configs like dynamic partition pruning or adaptive query execution (see the sketch after this list).
- use a custom partitioner if you're feeling fancy.
- try salting (this is my personal favourite).
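for the config route, here's a sketch of the adaptive query execution knobs (these are real Spark 3.x settings, but the threshold value is just an example, and defaults vary by version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-fixes").getOrCreate()

# adaptive query execution can detect and split skewed shuffle
# partitions at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# a partition counts as skewed when it's several times the median
# partition size AND larger than this threshold
spark.conf.set(
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB"
)
```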
what's salting?
salting is all about breaking up those big, skewed partitions into smaller, more manageable chunks. how? by adding random integers to the values in the skewed column. this spreads out the data more evenly across different partitions. it's like repackaging your cargo so it fits better in all your trucks.
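here's roughly what that looks like in Spark SQL. assume a customer_sales_table (invented for this example) where one customer, 'xyz', dominates the data, and we want each customer's most recent transaction date: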
```sql
SELECT
  -- step 2: strip the salt so every chunk maps back to the original key
  CASE WHEN salted_key LIKE '%_salt'
       THEN 'xyz'
       ELSE salted_key
  END AS customer_id,
  MAX(transaction_date) AS last_transaction_date
FROM (
  SELECT
    -- step 1: split the hot key 'xyz' into up to 500 salted keys
    -- (e.g. '42_salt') so the aggregation spreads across partitions
    CASE WHEN customer_id = 'xyz'
         THEN CONCAT(CAST(FLOOR(RAND(12345) * 500) AS STRING), '_salt')
         ELSE customer_id
    END AS salted_key,
    MAX(transaction_date) AS transaction_date
  FROM customer_sales_table
  GROUP BY salted_key
) AS presalted
GROUP BY customer_id
```
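two things worth noticing here: the heavy lifting happens in the inner query, where the 'xyz' rows are spread across up to 500 groups instead of piling into one, and the outer query stays cheap because it only merges those partial results back into a single row per customer. the seed in RAND(12345) just keeps the example repeatable; drop it and you'll get a different spread each run.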