The Swyx Mixtape

[Tech] The Origin of Clickhouse - Aaron Katz


Listen Later

Source: the Analytics Engineering Podcast: https://roundup.getdbt.com/p/aaron-katz-clickhouse (2mins in)

Feedback/Discuss on Twitter: https://twitter.com/swyx/status/1552835669373894656

Survey

  • Fill out our 2022 Survey! https://forms.gle/g2s1Np9wS5qmrKSRA!
  • Survey context: https://mixtape.swyx.io/episodes/swyx-mixtape-survey-refactor-and-deadpool-swyx


More on Clickhouse

Our previous episode on Clickhouse: https://twitter.com/swyx/status/1502129209111576577

HN comments on Clickhouse: https://news.ycombinator.com/item?id=28595419

I'd like to thank the creators of ClickHouse as i hope they are reading here. We've been using it since 2019 in a single server setup with billions of rows. No problems at all. And query speeds that seem unreal compared to MySQL and pg.


https://news.ycombinator.com/item?id=26316401 

Also wanted to share my overall positive experience with Clickhouse.

UPSIDES

* started a 3-node cluster using the official Docker images super quickly

* ingested billions of rows super fast

* great compression (of course, depends on your data's characteristics)

* features like https://clickhouse.tech/docs/en/engines/table-engines/merget... are amazing to see

* ODBC support. I initially said "Who uses that??", but we used it to connect PostgreSQL and so we can keep the non-timeseries data in PostgreSQL but still access PostgreSQL tables in Clickhouse (!)

* you can go the other way too: read Clickhouse from PostgreSQL (see https://github.com/Percona-Lab/clickhousedb_fdw, although we didn't try this)

* PRs welcome, and quickly reviewed. (We improved the ODBC UUID support)

* code quality is pretty high.

DOWNSIDES

* limited JOIN capabilities, which is expected from a timeseries-oriented database like Clickhouse. It's almost impossible to implement JOINs at this kind of scale. The philosophy is "If it won't be fast as scale, we don't support it"

* not-quite-standard SQL syntax, but they've been improving it

* limited DELETE support, which is also expected from this kind of database, but rarely used in the kinds of environments that CH usually runs in (how often do people delete data from ElasticSearch?)
It's really an impressive piece of engineering. Hats off to the Yandex crew.


more

I'd like to add an upside which is:

Totally great and simple on a single node.

I looked at a bunch of analytical databases and had a lot that started with "so here's a basic 10 node cluster". Clickhouse installed and worked instantly with decent but not "big" data with no hassle. A hundred million rows with lots of heavy text blobs and a lot of columns, that kind of thing. Happily dealt with triple nested joins over that, and with billions of entries in arrays on those columns didn't bat an eye.


https://news.ycombinator.com/item?id=29098637 

This has been my experience with ClickHouse as well...that is, you can basically close your eyes while writing the schema and still maintain to get extremely impressive performance.

That being said, ClickHouse also has a ton of clever levers you can pull to squeeze out better performance and compression which aren't used by default, such as using Delta/DoubleDelta CODECs with LZ4/ZSTD compression, etc. Not to mention, MATERIALIZED VIEWs and/or the relatively newer feature MergeTree Projections[1]

...more
View all episodesView all episodes
Download on the App Store

The Swyx MixtapeBy Swyx

  • 5
  • 5
  • 5
  • 5
  • 5

5

1 ratings


More shows like The Swyx Mixtape

View all
The Joe Rogan Experience by Joe Rogan

The Joe Rogan Experience

228,882 Listeners

CLUBLIFE by Tiësto

CLUBLIFE

6,535 Listeners

The Bulwark Podcast by The Bulwark

The Bulwark Podcast

12,271 Listeners

The Binge Cases: Where is Daniel Morcombe? by Sony Music Entertainment

The Binge Cases: Where is Daniel Morcombe?

4,220 Listeners

Camp Shame by iHeartPodcasts

Camp Shame

425 Listeners

Unicorn Girl by Apple TV / Seven Hills

Unicorn Girl

1,671 Listeners