Experimental ClickHouse: Projections
ClickHouse tags a major release around once a month, which is an order of magnitude more often than similar projects, and it does it while maintaining speed and stability.
One of the drawbacks of a fast release cycle is that large developments are harder to introduce, especially if they can't be broken down into smaller changes. ClickHouse handles this by introducing such features as experimental, meaning they are not ready for production use: they may not be fully implemented, the design may need more iteration, or they may require more testing or further review. With time they might be promoted to stable features, replaced by a better alternative, or discarded.
Some features that were experimental in the past are the Decimal and LowCardinality types; some that are currently experimental are Live Views and Window Functions. A new addition to this list is projections, introduced in 21.6 by Amos Bird, which can be thought of as materialized views for table parts. A nice aspect is that the database automatically decides, based on the query, whether to use a projection and which one.
An example
To present an example with real data, I took the GitHub events dataset and loaded the events from January and February of 2020.
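The events landed in a table shaped roughly like this. This is a minimal sketch: the column names (created_at, event_type, actor_login, repo_name) are assumptions for illustration rather than the exact dataset schema.

```sql
-- Hypothetical layout: one partition per day, rows ordered by repository
CREATE TABLE github_events
(
    created_at  DateTime,
    event_type  LowCardinality(String),
    actor_login String,  -- username that generated the event
    repo_name   String   -- repository the event belongs to
)
ENGINE = MergeTree
PARTITION BY toDate(created_at)
ORDER BY repo_name;
```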
Once fully merged, I have one partition per day, and each of them is internally ordered by repository, so the data related to events in the same repository will be stored together on disk. If I wanted to know how many events happened around the Postgis repository, I could count them with a simple filtered count.
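Something along these lines, reusing the assumed column names from the sketch above:

```sql
-- Count all events that touch this exact repository
SELECT count()
FROM github_events
WHERE repo_name = 'postgis/postgis';
```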
To compute the result, ClickHouse will go to each of the 60 table partitions, open the block containing the repository names, and count how many of them contain this exact string. As the data is ordered, it doesn't need to check all names; instead it can look for where 'postgis/postgis' starts and ends. As a result it only explores ~500k rows instead of the whole 113 million.
On the other hand, if I wanted to know how many events were generated by my username, ClickHouse needs to read the username blocks completely, since the ORDER BY doesn't help us in this query.
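A plausible version of that query, again with the assumed schema ('my_username' is a placeholder):

```sql
-- The table is ordered by repo_name, so this filter cannot narrow the scan
SELECT count()
FROM github_events
WHERE actor_login = 'my_username';
```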
We will also need to read the complete blocks if we mix both columns and want to know which repositories are interacted with by the users interacting with Postgis.
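A plausible reconstruction of that query with the assumed names; the [bot] filter anticipates the step-by-step breakdown further below:

```sql
-- Repositories touched by the (human) users who interact with Postgis
SELECT repo_name, count() AS events
FROM github_events
WHERE actor_login IN
(
    SELECT DISTINCT actor_login
    FROM github_events
    WHERE repo_name = 'postgis/postgis'
      AND actor_login NOT LIKE '%[bot]'  -- discard bot accounts
)
GROUP BY repo_name
ORDER BY events DESC;
```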
One of the subqueries, the one looking for interactions with Postgis, benefits from ordering by repository while the subquery that counts events per username would benefit from ordering by username. There isn’t a good way to improve both subqueries by sorting the data in a different way.
A possible solution could be to create a materialized view. By having a second table, this time ordered by username, we could have the data ordered both by repository and by username and use one or the other depending on the query. This has a clear drawback: you now need to remember which table is best to use when writing each query.
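A sketch of that alternative, keeping the hypothetical schema (the view name is my choice):

```sql
-- A second copy of the data, this time ordered by username;
-- POPULATE backfills the rows that already exist in the source table
CREATE MATERIALIZED VIEW github_events_by_user
ENGINE = MergeTree
PARTITION BY toDate(created_at)
ORDER BY actor_login
POPULATE
AS SELECT * FROM github_events;
```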
Let's see how we could address this with projections instead.
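One possible way to declare it; the projection name by_user is my choice:

```sql
-- Store a copy of each part's rows re-ordered by username
ALTER TABLE github_events
    ADD PROJECTION by_user
    (
        SELECT *
        ORDER BY actor_login
    );

-- Build the projection for parts written before it existed
ALTER TABLE github_events MATERIALIZE PROJECTION by_user;
```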
Now, running the exact same query requires reading far fewer rows from disk.
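For instance, along these lines; the setting below is the switch the feature used while experimental:

```sql
-- Projections were experimental in 21.6, so the optimization is off by default
SET allow_experimental_projection_optimization = 1;

-- The exact same query as before; ClickHouse picks the projection on its own
SELECT repo_name, count() AS events
FROM github_events
WHERE actor_login IN
(
    SELECT DISTINCT actor_login
    FROM github_events
    WHERE repo_name = 'postgis/postgis'
      AND actor_login NOT LIKE '%[bot]'
)
GROUP BY repo_name
ORDER BY events DESC;
```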
In the background ClickHouse will:
- Find all the usernames that have interacted with the 'postgis/postgis' repository. To do that it will use the default partitions, ordered by repository. This requires reading only ~0.5M rows.
- Out of those usernames, discard the ones ending in [bot]. No need to read anything from disk.
- For each remaining username (50), find which repositories they have interacted with. Using the projection partitions, ordered by username, it can read only ~0.5M rows instead of the whole dataset. 50 usernames x 0.5M rows = 25M rows (if you go one by one).
- Lastly, it groups everything together to count them. No extra reading required.
Without requiring any change from the caller, we've reduced the data read from disk by about 85%, at the cost of roughly doubling the storage used.
But wait, there is more. ClickHouse really shines when you know the kind of queries you are going to run. In the same way that we could have an aggregation in a materialized view, we can create a projection that stores the relation between usernames and repositories.
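A sketch of such an aggregate projection; the name user_repo and the column choices are assumptions:

```sql
-- Pre-aggregated (username, repository) pairs with their event counts
ALTER TABLE github_events
    ADD PROJECTION user_repo
    (
        SELECT actor_login, repo_name, count()
        GROUP BY actor_login, repo_name
    );

ALTER TABLE github_events MATERIALIZE PROJECTION user_repo;
```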
When we execute the exact same query, it will detect that using the new projection makes sense and use it; this automatic selection is the magical part that wasn't available in ClickHouse before.
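One way to confirm which projection served a query is the query log; note that the projections column arrived in a release later than 21.6, so this is version-dependent:

```sql
-- Inspect recent queries and the projections they used, if any
SELECT query, projections
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 5;
```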
Note that we could theoretically improve things further if the new projection were sorted to make later lookups easier, but this isn't currently supported. It's an experimental feature, at the end of the day.
Projections and Skipping Indexes
How do projections differ from other ClickHouse features like skipping indexes? Well, indexes are more limited, since they are applied per granule ("the smallest indivisible data set that ClickHouse reads when selecting data") and there are only a few types to choose from, but they can be equally powerful depending on the query.
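A comparable index for the username lookups might look like this; the index name, type, and granularity are my choices:

```sql
-- A bloom filter over usernames lets ClickHouse skip granules
-- that cannot contain the value being searched
ALTER TABLE github_events
    ADD INDEX actor_login_idx actor_login TYPE bloom_filter GRANULARITY 4;

-- Build the index for the already-existing parts
ALTER TABLE github_events MATERIALIZE INDEX actor_login_idx;
```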
The result is the same, with slightly more data read.
Both skipping indexes and projections share the limitation that they are only available for the MergeTree family, and the benefit that they are applied to queries automatically. As indexes can be much smaller, I would consider them first when optimizing the storage for a query, and only reach for projections when a more complex transformation or aggregation is needed.
Projections and Materialized Views
Materialized views have some clear benefits over projections aside from the fact that they are already a stable feature: you can use them with any table engine and you can output several views to the same target table. With materialized views you can easily enrich the input rows with data from multiple sources.
Having a target table is both a benefit and a drawback of materialized views. For simple cases, like aggregations or transformations of the source table, having to change your queries or to know that there are materialized tables aside from the main one can be inconvenient; but as queries become more complex, hiding them behind a materialized table can be a very good thing.
One important distinction is that even though both projections and materialized views are populated after the data has been inserted into the main table, they behave completely differently when the query that builds them fails. A failure while adding data through a materialized view means that the data won't be available in the target table. This leads to many headaches, as some data may have been inserted while some failed, and recovering from this inconsistent state requires extra information. Projections, on the other hand, are generated in the background and deal with failures silently. This is fine, since ClickHouse can mix data from both projections and normal parts; and because the issue was caused by the projection's query, you can expect to hit the same exception once you run that query yourself later. I find this way of error handling preferable, as the caller will either get a clear error from their SQL query or a full result with no missing data.