data skipping pt1 - z-ordering

hey yo

so, i like to talk about cool things. we’ve talked about spilling and skewing (which will show up again here), we’ve talked about RDDs and why they’re dope (get it?). today i’m going to talk about z-order. if you’ve played around with databricks you’ve probably already heard of z-order as an optimization trick to speed up your queries.

by the end of this post, you’ll (hopefully) understand how it works, why it works, and the motivations behind it. so, without further ado, let’s start with the last one.


chapter 1

imagine a world where the files you store data in don’t carry any metadata about what they contain. now imagine you have a query engine sitting on top of those files.

let’s say you have a table like this:

country, client_id, client_name, email

without metadata, if you wanted to run a query like:

select * from table where country = 'sweden'

the engine would have to scan all the files and check every row. that’s basically unusable.

to fix this, modern file formats store metadata, usually the min and max values for each column. that way the engine can skip files that clearly don’t match your query. this is called data skipping.
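here’s the idea as a toy python sketch (the file names and stats are made up; real formats like parquet keep these stats in the file footer):

    # each file is represented by the min/max stats of its country column
    files = [
        {"path": "part-0", "min": "argentina", "max": "brazil"},
        {"path": "part-1", "min": "canada", "max": "france"},
        {"path": "part-2", "min": "spain", "max": "sweden"},
    ]

    def files_to_scan(files, value):
        # keep only files whose [min, max] range could contain the value
        return [f["path"] for f in files if f["min"] <= value <= f["max"]]

    print(files_to_scan(files, "sweden"))  # ['part-2'] - the other two are skipped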

sounds pretty good, right? imagine you have 6 files:

[figure: 6 files]

if you query for sweden, the engine only needs 1 out of 6 files. neat!

[figure: the sweden file]

but here’s the catch: data skipping works best on the column that drives how the files are partitioned, because that’s the column whose min/max ranges end up narrow and non-overlapping.

let’s go back to the example. if your query is:

select * from table where country = 'sweden' and client_name = 'martin'

you’ll still need to scan the entire sweden file, because all the names are crammed in there together. the second filter (on client_name) doesn’t help much.

[figure: names inside the sweden file]

or let’s try an even worse case:

select * from table where client_id = 12345

ids usually have super high cardinality, meaning each value is unique or close to unique. every file’s min and max range will be wide enough to include that id. so guess what? you end up scanning all the files again. feels like we’re back at square one.
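a quick simulation makes the problem obvious (the numbers are made up, but the effect is real):

    import random

    # 6 files of 1000 random ids each, drawn from a space of a million
    files = [random.sample(range(1_000_000), 1000) for _ in range(6)]
    for i, ids in enumerate(files):
        print(f"file {i}: min={min(ids)}, max={max(ids)}")
    # every file's min/max range spans almost the whole id space,
    # so the engine can't rule out a single one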


chapter 2

to fix this, we need something smarter: multi-dimensional clustering using z-order.

a z-order curve is a way of mapping multi-dimensional data into a single dimension while still preserving data locality. in other words, if two values are close to each other across multiple columns, they’ll still end up close to each other after being mapped.

[figure: z-order curve example]

this gives us a better way to store and group rows across multiple columns, so queries with multiple filters can skip more files.

let’s look at an example. say we z-order the table on country and client_name. the sweden rows no longer all land in one big file: they get spread across files that each cover a narrow range of names, so the query from before (country = 'sweden' and client_name = 'martin') now touches a single small file instead of everything sweden-related.

that doesn’t sound like a big deal here, but it scales fast: the more partitions and dimensions you add, the bigger the savings.

the key insight is that z-order respects locality. points that are “neighbors” in the data space stay neighbors when stored, which means fewer files to scan and faster queries.


chapter 3

so how do we build a z-order? with something called bit interleaving. this is where it gets beautiful.

bit interleaving means taking the binary representation of two (or more) values and weaving their bits together.

example: take the pair (2,3) → in binary that’s (010, 011). interleaved bit by bit, that becomes 001101, which equals 13. that’s its z-order index.
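here’s a toy version in python (fixed at 3 bits per value, just enough for this example):

    def interleave(x, y, bits=3):
        # weave the two values' bits together: bit i of x lands at
        # position 2i+1, bit i of y lands at position 2i
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i + 1)
            z |= ((y >> i) & 1) << (2 * i)
        return z

    print(bin(interleave(2, 3)))  # 0b1101, i.e. 13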

why does this preserve locality? because values that are numerically close usually only differ in their least significant bits. when you interleave bits, the most significant ones stay aligned, so “neighbors” remain close in the ordering.

think of it like zooming into quadrants of a 2d space. each round of interleaving tells you which quadrant you’re in, and the quadrants keep getting smaller and smaller. changing a least significant bit just nudges you into a nearby quadrant, instead of throwing you somewhere far away.

[figure: quadrants]
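you can see the quadrants for yourself by printing the z-index of every cell in a small grid (this reuses the toy interleave() from above; the argument order just controls the curve’s orientation):

    for y in range(4):
        print([interleave(y, x, bits=2) for x in range(4)])
    # [0, 1, 4, 5]
    # [2, 3, 6, 7]
    # [8, 9, 12, 13]
    # [10, 11, 14, 15]
    # each 2x2 quadrant holds a contiguous run: 0-3, 4-7, 8-11, 12-15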

that’s why z-ordering feels like a quad-tree in disguise.


chapter 4

ok, but what do we actually z-order on?

take the id example. you wouldn’t directly map raw ids into bits - that’d be useless, since raw values can be huge, sparse, or not even numeric. instead, you assign each value a rank (using dense_rank or similar). this way you’re ordering them relative to each other with small, dense integers.

now, you might say: “but hugo, aren’t window functions slow?” fair point - a global rank pulls every row through a single task. that’s why the implementation is optimized to group and rank in chunks instead of calculating a full rank per value. way less overhead.

once you’ve got your ranks, you interleave the bits, and finally repartition the data using those z-order keys. boom. your files are now z-ordered.
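put together, a naive version of the whole pipeline could look like this in pyspark (just a sketch, not databricks’ actual OPTIMIZE ZORDER implementation; the paths, columns and partition counts are all made up):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/table")

    # 1. rank the z-order columns so raw values become small dense integers.
    #    (a global window like this pulls every row through a single task,
    #    which is exactly the slowness mentioned above - fine for a demo only)
    df = (df
          .withColumn("r_country", F.dense_rank().over(Window.orderBy("country")) - 1)
          .withColumn("r_id", F.dense_rank().over(Window.orderBy("client_id")) - 1))

    # 2. interleave the rank bits into a single z-order key
    @F.udf(LongType())
    def z_key(x, y, bits=21):
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i + 1)
            z |= ((y >> i) & 1) << (2 * i)
        return z

    # 3. repartition by the key so each output file covers a tight z-range
    (df.withColumn("z", z_key("r_country", "r_id"))
       .repartitionByRange(16, "z")
       .write.mode("overwrite").parquet("/path/to/zordered"))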

but wait — what about skewing?


chapter 5

ah yes, skewing, the eternal villain.

z-order itself doesn’t break with skewed data, but the write step can get messy. if one partition has way more rows than the others, one unlucky executor is stuck doing all the work.

the fix? salting.

if you’ve read my post about skewing, you know this trick. basically, you sprinkle some randomness into the z-order key (like an extra random byte). this spreads the heavy partitions across executors.

and since you’re only tweaking the least significant bits, you’re not destroying locality! neat little hack.
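in code, the salt can be appended as fresh low-order bits (continuing the pyspark sketch from chapter 4; SALT_BITS is a made-up knob):

    from pyspark.sql import functions as F

    SALT_BITS = 8  # one random byte, 256 possible salts

    # shift the z-key left and fill the new low bits with a random salt.
    # distinct z values keep their relative order, so locality survives,
    # but rows that shared one hot key now spread across many partitions
    df = df.withColumn(
        "z_salted",
        F.col("z") * (2 ** SALT_BITS) + (F.rand() * (2 ** SALT_BITS)).cast("int"),
    )
    df = df.repartitionByRange(16, "z_salted")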


so that’s z-order: sorting data into a one-dimensional curve that acts like a quad-tree, preserves locality, and lets you skip more files. honestly, it’s a beautiful idea.

next time i’ll talk about liquid clustering, which is basically z-order on steroids. same idea, but with a hilbert curve, which is even better at grouping data. stay tuned :)