Building ClickHouse Cloud from scratch in a year

sytse · on March 21, 2023

Very fast progress and great article with lots of useful information. I found myself nodding a lot in agreement having recently seen GitLab Dedicated https://docs.gitlab.com/ee/subscriptions/gitlab_dedicated/ being developed that had many similar challenges. I wonder what other people think of Kops https://kops.sigs.k8s.io/

manishsharan · on March 21, 2023

For about $70 per month, you could get managed Kubernetes from AWS, GCP or Azure. Why would you bother with Kops on these platforms? Also DO, OVH and Vultr provide managed K8s for free.

re-thc · on March 21, 2023

Depends on how well managed it is (hint: not always good).

For a high performance service like ClickHouse the nodes may need to be optimized and that's not often done in the fully managed solutions.

For EKS they often take forever to get to the current version of k8s and that might be a problem.

Whilst performance of the master nodes have increased recently they might not be up to scratch for what's required if you're doing a lot of operations.

All in all managed does not mean fit for all requirements. It certainly is great for at least 80% of cases but not all.

hodgesrm · on March 21, 2023

EKS and GKE are good products. We run a competing platform for ClickHouse on AWS and made the same decision as ClickHouse Inc 3 years ago and for very similar reasons. Kubernetes requires investment to run well, especially if you aspire to be multi-platform.

Disclaimer: I work for Altinity.

manishsharan · on March 21, 2023

The question I would ask is would your team have the bandwidth and experience to do it better than GKE , EKS ,AKS etc.? My employer deploys to AKS and Anthos ; they both have limitations but my advise to teams is to work within those limitations as we are in the business of building business applications and not systems management s/w.

re-thc · on March 21, 2023

And it is a question to ask. Maybe a good question for ClickHouse and you can wait and see as their cloud offerings play out.

Well in this case they are providing systems management services as in the ClickHouse database as a service.

FlyingSnake · on March 21, 2023

> Why would you bother with Kops on these platforms?

Given sytse is the CEO and founder of GitLab, I am sure he’s not interested in getting vendor locked. Community solutions like kops help keep things open and accessible to everyone.

dilyevsky · on March 21, 2023

“Hey we decided to forcefully upgrade your cluster and everything is fucked now, good luck! -love, gke team”

Haven’t used ovh but do and vultr offering were severely limited when i tried it

pm90 · on March 21, 2023

> A bi-directional API layer between the Control Plane and the Data Plane defines the only integration point between the two planes. We decided to go with a REST API for the following reasons:

> REST APIs are independent of technology used, which helps avoid any dependency between Control Plane and Data Plane. We were able to change the language from Python to Golang in the Data Plane without any changes or impact to the Control Plane. They offer a lot of flexibility, decoupling various server components which can evolve independently. > They can scale efficiently due to the stateless nature of the requests - the server completes every client request independently of previous requests.

Im not sure what REST apis are being compared against? Something like RPCs? If so, then using an interchange format like protobufs allows for automatic client generation in several languages. Additionally, schema changes are much less painful.

sho · on March 21, 2023

> using an interchange format like protobufs allows for automatic client generation in several languages

Maybe things have improved, and maybe I was using the "wrong" languages, but when we tried this at a previous company the generated code was deemed so awful that we abandoned that idea almost on day 1. To this day I tend to believe the promise of "write once, generate anywhere!" to be mostly BS.

Not even picking on protobuf specifically here - last time I tried a REST generative approach (swagger?) the results were similarly unusable. The whole concept is a fantasy, unless things have staggeringly improved.

closeparen · on March 21, 2023

What do you mean “awful”? You’re not meant to read it. We don’t even materialize it into our repos - the build system generates it at compile time. When you go-to-definition in there it’s a little weird, but it works.

J_tt · on March 21, 2023

We’re actually finding good success with GraphQL codegen, specifically for a Go backend and Apollo + TypeScript frontend.

Out of all the options for generating typesafe interfaces this seems the most flexible and generates usable code.

davewritescode · on March 21, 2023

I've had really good luck with gRPC + protobufs for generating solid client side code. Swagger + REST never seemed to deliver what it promised but then again, neither did SOAP + WSDL.

sho · on March 21, 2023

Mind sharing what the client side languages were? Admittedly what I was using at the time was ruby, hardly a priority for the google team I imagine.

ramesh31 · on March 21, 2023

>Im not sure what REST apis are being compared against? Something like RPCs? If so, then using an interchange format like protobufs allows for automatic client generation in several languages. Additionally, schema changes are much less painful.

It's so much more expensive in the long run, though. Everything talks in HTTP. Everyone can read and understand HTTP. Proprietary binary protocols add a layer of complexity to literally everything else you do, and require a translation layer to talk to the front end.

yunohn · on March 21, 2023

Yeah, I’m not sure why they need to justify REST APIs in 2023?

citrin_ru · on March 21, 2023

Probably to prevent readers asking questions like ‘why not gRPC’.

yunohn · on March 21, 2023

Fair enough, I’m surprised no one has asked about GraphQL yet.

halfmatthalfcat · on March 21, 2023

GraphQL is REST (normal POST against /graphql usually). It’s just a grammar by which you retrieve information.

presentation · on March 21, 2023

Depending on your definition of REST I guess, what you describe is just saying GraphQL works over HTTP, but I certainly wouldn’t call it RESTful in the slightest.

yunohn · on March 21, 2023

GraphQL is an entire API definition. Yes, it’s available at a specific POST endpoint - but that does not make it RESTful.

tylerhannan · on March 22, 2023

Having also spent time at a GraphQL company, I keep expecting this question too ;)

But, as of yet, it has not come up.

faizshah · on March 21, 2023

This is an amazing post on building a cloud product. While I was reading I was thinking “how did they get around the IAM role limits with dedicated roles” and they even touched on how they used a cellular architecture.

I wish there was a little more info on the team size and which parts took the most engineering effort to build.

zinclozenge · on March 21, 2023

Cellular architecture is a complexity trap that's 100% not worth it unless you're a cloud provider, or Confluent size. I've been at DBaaS companies that wanted to do it and instead ended up with the worst of all worlds.

With that said, it sounds like they went with a simpler design than what is described in the AWS article they linked in their blog post. It just sounds like they're able to spin up multiple data plane cells per region, which is what every DBaaS provider needs to do.

jpnc · on March 21, 2023

OT: anyone know what they're using for diagramming? I like how it looks.

ryadh · on March 21, 2023

Thanks for this feedback, I also love how the diagrams render in this post. They were made by our design folks at ClickHouse, Shout-out to them!

eternalban · on March 21, 2023

they made a typo: arch diagram - storage - r/visually/virtually

(to GP: Powerpoint and Keynote can go a long way.)

ryadh · on March 22, 2023

Good catch! thanks for letting us know

eternalban · on March 22, 2023

cristina_a · on March 21, 2023

Figma :)

orf · on March 21, 2023

Interesting that they use a single IAM role per client. This will also run into limits (which can be upped), but wouldn’t a better design be to use a single role with a specific STS session name and session tag enforcement policy?

You can then say this role can read/write to “s3://bucket/${account-tag-name}/“ and have it enforced.

Because they use a single service account per client, which has no limits as far as I know, they can add a IAM assume role policy to say “the client-ID session tag must match the Kubernetes service account name” or something like that.

What’s annoying is that you can’t go any more granular than the service account name - currently there is no way to create a pod IAM role with specific pre-baked tags, so you can’t say like “this IAM session can only be assumed when tag X equals the k8s pod label value Y”.

Being able to do this would be insanely valuable for specific and fairly niche IAM problems. As I understand it, it’s a limitation with Kubernetes.

cpnz · on March 21, 2023

First time poster, long time lurker. I'm a fairly proficient kubernetes admin and developer. I'm using it in depth every day. It seems like that functionality would be really easy to add to the IRSA (iam roles for service accounts). Is it judt that nobody has bothered, or am I missing an important blocker?

orf · on March 21, 2023

The issue as I understand it is that there is no way to encode this information in the Kubernetes token.

The pod IAM roles stuff leverages Kubernetes stuff, and the token that’s mounted into the container is a YAML representation of a Kubernetes token object. There are no fields or other way to add this information into the object.

You would need to encode it into the JWT itself, which isn’t possible or something.

I’m half remembering this, and I can’t find the issue on Guthub because everything has been shuffled around since.

llama052 · on March 21, 2023

Feels good to see them land on the same technical decisions my team and I have made. Some crazy progress from clickhouse!

winrid · on March 21, 2023

So they basically extended the buffer cache to mean an entire machine and the IO just goes out to S3? Pretty neat - that way you just use the existing cache manager's heuristics, maybe tuned a little, to determine what to keep local and what stays cold.

At least, that's what I would have done, given the timeframe.

zX41ZdbW · on March 21, 2023

Yes, and actually, it is simpler than expected. The cached segments are written into the local filesystem, and therefore we get both the cache on SSD and the page cache in memory "for free". As ClickHouse is OLAP, the read/written segments are not so small (more than 1 MB is expected), so it works fairly well (the same would be more tricky for OLTP).

It requires some considerations. For example, the linear read/write performance on typical c5d/m5d/r5d instances is just around 4 GB/sec, which is faster than 25 Gbit network, but slower than 40 Gbit network. You get 25 Gbit network on larger normal instances in AWS, otherwise, it requires "n" - network-optimized instances. S3 read performance is higher - 50 Gbit/sec achieved from a single server, higher values will require either more CPU or uncompressible data (the data is usually compressed "too much" for a 100 Gbit network to become a bottleneck). But S3 is a tricky beast, and it won't give a good performance unless you read in parallel and with carefully selected ranges from multiple files. The higher latencies of S3 are expected to be around 50 ms unless you are being throttled (and you will be throttled), which is worse than both local SSDs and every type of EBS. S3 is cheap and saves cost on cross-AZ traffic... unless you make too many requests. So, a lot of potential troubles and a lot of things to improve.

orf · on March 21, 2023

You won’t be throttled if you partition your data “correctly” and have a reasonable amount of data in your bucket. That’s difficult to do, but more than possible.

zX41ZdbW · on March 21, 2023

We already partition our data correctly, as the AWS solution architects recommend, but there are limitations on the: - total throughput (100 Gbit/sec); - the number of requests.

For example, at this moment, I'm doing an experiment: creating 10, 100, and 500 Clickhouse servers, and reading the data from s3, either from a MergeTree table or from a set of files.

Ten instances saturate 100 Gbit/sec bandwidth, and there is no subsequent improvement.

JFYI, 100 Gbit/sec is less than one PCI-e with a few M.2 SSDs can give.

orf · on March 21, 2023

Ahh yes, sorry, you are running it on a single instance. That’s capped, but the aggregate throughput across instances can be a lot higher.

There are request limits but these are per partition. There is also an undocumented hard cap on list requests per second which is much lower than get object.

zX41ZdbW · on March 21, 2023

No, I'm creating 10, 100, and 500 different EC2 instances and measuring the aggregate throughput.

zX41ZdbW · on March 21, 2023

Worth to note that ClickHouse is a distributed MPP DBMS, it can scale up to 1000s of bare-metal servers and process terabytes (not terabits) of data per second.

It also works on a laptop, or even without installation.

zX41ZdbW · on March 21, 2023

100 instances of m5.8xlarge, 500 instances of m5.2xlarge

orf · on March 21, 2023

Something must be wrong, at my previous work we were able to achieve a much higher aggregate throughput - are you spreading them across AZ, and are you using a VPC endpoint?

We did use multiple VPCs, that might make a difference

zX41ZdbW · on March 21, 2023

Yes, the current experiment was run from a single VPC. It should explain the difference.

orf · on March 21, 2023

Why not randomise VPCs between instances? It’s fairly easy to do with a fixed pool and random selection.

zX41ZdbW · on March 21, 2023

https://repost.aws/knowledge-center/s3-maximum-transfer-spee...

zX41ZdbW · on March 21, 2023

https://imgur.com/a/uh8H9PO

debarshri · on March 21, 2023

This post highlights the change in infrastructure engineering in past couple of years and the importance of k8s.

I think all the choices they made makes so much sense.

lex2ross · on March 21, 2023

One thing is not clear from the post is how multi-tenancy is handled. Suppose Clickhouse has 10K customers and a data plane cluster has 1K only nodes then it can not cater all the customers in a single cluster, right?

This essentially means they must be running multiple dataplane clusters however they might have just a single control plane.

Thoughts?

nhoughto · on March 21, 2023

That’s the cell architecture, so N data plane clusters aka cells, testing qualifies a cell to meet a cert perf level and you horizontally scale out more cells to meet demand.

Whilst hoping a single customer doesn’t exceed the limits of a cell =)

mayank · on March 21, 2023

This is a wonderful article, architecture, and project. Can anyone from Clickhouse comment on any non-technical factors that allowed such a rapid pace of development, e.g. team size, structure, etc.?

rtangirala · on March 21, 2023

Couple of things- 1. Hiring the best engineers globally - this was huge as we are a distributed company and can hire the best talent anywhere in the world. 2. Flat team structure specially in the first year with a concept of verticals(technical areas - like autoscaling, security, proxy layer, operations etc) and having one engineer own and drive these verticals. 3. Lot of cross team collaboration and communication (across product, engineering, business, sales) 4. Lastly as mentioned in the blog post, it was very important for us to stick to the milestone based sprints for faster development and product team helped a lot to prioritize the right features so that engineering could deliver.

qoega · on March 21, 2023

Passion and experience

dboreham · on March 21, 2023

I wanted to like this article, but tbh it's buzzword bingo.

zX41ZdbW · on March 21, 2023

Ok, then here is the summary:

We picked these ten technologies to translate YAML configs to cloud bills and put ClickHouse inside.

goodpoint · on March 21, 2023

CV driven development...

nhoughto · on March 21, 2023

Watch out firebolt.io sounds like CH has solved the ease of deployment problem!

bydlocoder · on March 21, 2023

I wonder what's the performance and cost hit of storing data in S3?

ryadh · on March 22, 2023

That's a fair question and something we obsess about at ClickHouse! It's mentioned in the post but we maintain continuously updated benchmarks for ClickHouse Cloud and it's on-prem shared nothing counterpart (as well as other databases!). You'll find in [1] the comparison for every ClickHouse deployment option.

From the blog post: "the fastest baseline here is ClickHouse server running on an AWS m5d.24xlarge instance that uses 48 threads for query execution. As you can see, an equivalent cloud service with 48 threads performs very well in comparison for a variety of simple and complex queries represented in the benchmark" so there can be a small difference depending on the sizing but it's something to consider on a case per case basis and often other dimensions need to be taken into account (operational cost, bottomless storage, linear scalability, reliability etc.).

[1] https://tinyurl.com/chbench

bydlocoder · on March 30, 2023

That's impressive. ClickHouse is great but it's a huge PITA to operate on-prem so you guys will be rolling in them moneys. Speaking of, maybe you know someone who had to stay in Russia and didn't go to youse guys or Altinity? We (AliExpress Russia) are looking for a CH DBA to help us implement an on-prem CHaaS. Salary is like $9-10K/month.

Basically, we're looking for the person who has at least some expertise with operating CH in production and paying big money. In four month offering $6K-$7K/month we've got ziltch.

UnCommonLisp · on March 20, 2023

It is truly a tremendous feat of engineering, product, and project management to build a solution of this magnitude in just a year. I believe fluid cross-team communication, and excellent foresight in flagging future issues and design decisions are some of the ingredients that led ClickHouse to such flawless execution.

umeshunni · on March 21, 2023

Lol - is this comment written by a bot?

UnCommonLisp · on March 21, 2023

dboreham · on March 21, 2023

Turing test: check

FBISurveillance · on March 21, 2023

Space launch 'commoditization', JWST, AI/NLP advancements (regardless the current hype), mRNA vaccines are real examples of "truly a tremendous feat of engineering".

Running a database on Kubernetes backed by an object based storage? Not even close.

Don't get me wrong, Clickhouse folks are great and their DB is fantastic, but your scale may need some adjustments.