July 26, 2022

[Tech] The Origin of Clickhouse - Aaron Katz

15 minutes

Source: the Analytics Engineering Podcast: https://roundup.getdbt.com/p/aaron-katz-clickhouse (2mins in)

Feedback/Discuss on Twitter: https://twitter.com/swyx/status/1552835669373894656

Survey

Fill out our 2022 Survey! https://forms.gle/g2s1Np9wS5qmrKSRA!
Survey context: https://mixtape.swyx.io/episodes/swyx-mixtape-survey-refactor-and-deadpool-swyx

More on Clickhouse

Our previous episode on Clickhouse: https://twitter.com/swyx/status/1502129209111576577

HN comments on Clickhouse: https://news.ycombinator.com/item?id=28595419

I'd like to thank the creators of ClickHouse as i hope they are reading here. We've been using it since 2019 in a single server setup with billions of rows. No problems at all. And query speeds that seem unreal compared to MySQL and pg.

https://news.ycombinator.com/item?id=26316401

Also wanted to share my overall positive experience with Clickhouse.

UPSIDES

* started a 3-node cluster using the official Docker images super quickly

* ingested billions of rows super fast

* great compression (of course, depends on your data's characteristics)

* features like https://clickhouse.tech/docs/en/engines/table-engines/merget... are amazing to see

* ODBC support. I initially said "Who uses that??", but we used it to connect PostgreSQL and so we can keep the non-timeseries data in PostgreSQL but still access PostgreSQL tables in Clickhouse (!)

* you can go the other way too: read Clickhouse from PostgreSQL (see https://github.com/Percona-Lab/clickhousedb_fdw, although we didn't try this)

* PRs welcome, and quickly reviewed. (We improved the ODBC UUID support)

* code quality is pretty high.

DOWNSIDES

* limited JOIN capabilities, which is expected from a timeseries-oriented database like Clickhouse. It's almost impossible to implement JOINs at this kind of scale. The philosophy is "If it won't be fast as scale, we don't support it"

* not-quite-standard SQL syntax, but they've been improving it

* limited DELETE support, which is also expected from this kind of database, but rarely used in the kinds of environments that CH usually runs in (how often do people delete data from ElasticSearch?)
It's really an impressive piece of engineering. Hats off to the Yandex crew.

I'd like to add an upside which is:

Totally great and simple on a single node.

I looked at a bunch of analytical databases and had a lot that started with "so here's a basic 10 node cluster". Clickhouse installed and worked instantly with decent but not "big" data with no hassle. A hundred million rows with lots of heavy text blobs and a lot of columns, that kind of thing. Happily dealt with triple nested joins over that, and with billions of entries in arrays on those columns didn't bat an eye.

https://news.ycombinator.com/item?id=29098637

This has been my experience with ClickHouse as well...that is, you can basically close your eyes while writing the schema and still maintain to get extremely impressive performance.

That being said, ClickHouse also has a ton of clever levers you can pull to squeeze out better performance and compression which aren't used by default, such as using Delta/DoubleDelta CODECs with LZ4/ZSTD compression, etc. Not to mention, MATERIALIZED VIEWs and/or the relatively newer feature MergeTree Projections[1]

...more

View all episodes

By Swyx

11 ratings