For about $70 per month, you could get managed Kubernetes from AWS, GCP or Azure. Why would you bother with Kops on these platforms? Also DO, OVH and Vultr provide managed K8s for free.
Depends on how well managed it is (hint: not always good).
For a high performance service like ClickHouse the nodes may need to be optimized and that's not often done in the fully managed solutions.
For EKS they often take forever to get to the current version of k8s and that might be a problem.
Whilst performance of the master nodes have increased recently they might not be up to scratch for what's required if you're doing a lot of operations.
All in all managed does not mean fit for all requirements. It certainly is great for at least 80% of cases but not all.
EKS and GKE are good products. We run a competing platform for ClickHouse on AWS and made the same decision as ClickHouse Inc 3 years ago and for very similar reasons. Kubernetes requires investment to run well, especially if you aspire to be multi-platform.
The question I would ask is would your team have the bandwidth and experience to do it better than GKE , EKS ,AKS etc.? My employer deploys to AKS and Anthos ; they both have limitations but my advise to teams is to work within those limitations as we are in the business of building business applications and not systems management s/w.
> Why would you bother with Kops on these platforms?
Given sytse is the CEO and founder of GitLab, I am sure he’s not interested in getting vendor locked. Community solutions like kops help keep things open and accessible to everyone.
> A bi-directional API layer between the Control Plane and the Data Plane defines the only integration point between the two planes. We decided to go with a REST API for the following reasons:
> REST APIs are independent of technology used, which helps avoid any dependency between Control Plane and Data Plane. We were able to change the language from Python to Golang in the Data Plane without any changes or impact to the Control Plane.
They offer a lot of flexibility, decoupling various server components which can evolve independently.
> They can scale efficiently due to the stateless nature of the requests - the server completes every client request independently of previous requests.
Im not sure what REST apis are being compared against? Something like RPCs? If so, then using an interchange format like protobufs allows for automatic client generation in several languages. Additionally, schema changes are much less painful.
> using an interchange format like protobufs allows for automatic client generation in several languages
Maybe things have improved, and maybe I was using the "wrong" languages, but when we tried this at a previous company the generated code was deemed so awful that we abandoned that idea almost on day 1. To this day I tend to believe the promise of "write once, generate anywhere!" to be mostly BS.
Not even picking on protobuf specifically here - last time I tried a REST generative approach (swagger?) the results were similarly unusable. The whole concept is a fantasy, unless things have staggeringly improved.
What do you mean “awful”? You’re not meant to read it. We don’t even materialize it into our repos - the build system generates it at compile time. When you go-to-definition in there it’s a little weird, but it works.
I've had really good luck with gRPC + protobufs for generating solid client side code. Swagger + REST never seemed to deliver what it promised but then again, neither did SOAP + WSDL.
>Im not sure what REST apis are being compared against? Something like RPCs? If so, then using an interchange format like protobufs allows for automatic client generation in several languages. Additionally, schema changes are much less painful.
It's so much more expensive in the long run, though. Everything talks in HTTP. Everyone can read and understand HTTP. Proprietary binary protocols add a layer of complexity to literally everything else you do, and require a translation layer to talk to the front end.
Depending on your definition of REST I guess, what you describe is just saying GraphQL works over HTTP, but I certainly wouldn’t call it RESTful in the slightest.
This is an amazing post on building a cloud product. While I was reading I was thinking “how did they get around the IAM role limits with dedicated roles” and they even touched on how they used a cellular architecture.
I wish there was a little more info on the team size and which parts took the most engineering effort to build.
Cellular architecture is a complexity trap that's 100% not worth it unless you're a cloud provider, or Confluent size. I've been at DBaaS companies that wanted to do it and instead ended up with the worst of all worlds.
With that said, it sounds like they went with a simpler design than what is described in the AWS article they linked in their blog post. It just sounds like they're able to spin up multiple data plane cells per region, which is what every DBaaS provider needs to do.
Interesting that they use a single IAM role per client. This will also run into limits (which can be upped), but wouldn’t a better design be to use a single role with a specific STS session name and session tag enforcement policy?
You can then say this role can read/write to “s3://bucket/${account-tag-name}/“ and have it enforced.
Because they use a single service account per client, which has no limits as far as I know, they can add a IAM assume role policy to say “the client-ID session tag must match the Kubernetes service account name” or something like that.
What’s annoying is that you can’t go any more granular than the service account name - currently there is no way to create a pod IAM role with specific pre-baked tags, so you can’t say like “this IAM session can only be assumed when tag X equals the k8s pod label value Y”.
Being able to do this would be insanely valuable for specific and fairly niche IAM problems. As I understand it, it’s a limitation with Kubernetes.
First time poster, long time lurker. I'm a fairly proficient kubernetes admin and developer. I'm using it in depth every day. It seems like that functionality would be really easy to add to the IRSA (iam roles for service accounts). Is it judt that nobody has bothered, or am I missing an important blocker?
The issue as I understand it is that there is no way to encode this information in the Kubernetes token.
The pod IAM roles stuff leverages Kubernetes stuff, and the token that’s mounted into the container is a YAML representation of a Kubernetes token object. There are no fields or other way to add this information into the object.
You would need to encode it into the JWT itself, which isn’t possible or something.
I’m half remembering this, and I can’t find the issue on Guthub because everything has been shuffled around since.
So they basically extended the buffer cache to mean an entire machine and the IO just goes out to S3? Pretty neat - that way you just use the existing cache manager's heuristics, maybe tuned a little, to determine what to keep local and what stays cold.
At least, that's what I would have done, given the timeframe.
Yes, and actually, it is simpler than expected. The cached segments are written into the local filesystem, and therefore we get both the cache on SSD and the page cache in memory "for free". As ClickHouse is OLAP, the read/written segments are not so small (more than 1 MB is expected), so it works fairly well (the same would be more tricky for OLTP).
It requires some considerations. For example, the linear read/write performance on typical c5d/m5d/r5d instances is just around 4 GB/sec, which is faster than 25 Gbit network, but slower than 40 Gbit network. You get 25 Gbit network on larger normal instances in AWS, otherwise, it requires "n" - network-optimized instances. S3 read performance is higher - 50 Gbit/sec achieved from a single server, higher values will require either more CPU or uncompressible data (the data is usually compressed "too much" for a 100 Gbit network to become a bottleneck). But S3 is a tricky beast, and it won't give a good performance unless you read in parallel and with carefully selected ranges from multiple files. The higher latencies of S3 are expected to be around 50 ms unless you are being throttled (and you will be throttled), which is worse than both local SSDs and every type of EBS. S3 is cheap and saves cost on cross-AZ traffic... unless you make too many requests. So, a lot of potential troubles and a lot of things to improve.
You won’t be throttled if you partition your data “correctly” and have a reasonable amount of data in your bucket. That’s difficult to do, but more than possible.
We already partition our data correctly, as the AWS solution architects recommend,
but there are limitations on the:
- total throughput (100 Gbit/sec);
- the number of requests.
For example, at this moment, I'm doing an experiment: creating 10, 100, and 500 Clickhouse servers, and reading the data from s3, either from a MergeTree table or from a set of files.
Ten instances saturate 100 Gbit/sec bandwidth, and there is no subsequent improvement.
JFYI, 100 Gbit/sec is less than one PCI-e with a few M.2 SSDs can give.
Ahh yes, sorry, you are running it on a single instance. That’s capped, but the aggregate throughput across instances can be a lot higher.
There are request limits but these are per partition. There is also an undocumented hard cap on list requests per second which is much lower than get object.
Worth to note that ClickHouse is a distributed MPP DBMS, it can scale up to 1000s of bare-metal servers and process terabytes (not terabits) of data per second.
It also works on a laptop, or even without installation.
Something must be wrong, at my previous work we were able to achieve a much higher aggregate throughput - are you spreading them across AZ, and are you using a VPC endpoint?
We did use multiple VPCs, that might make a difference
One thing is not clear from the post is how multi-tenancy is handled. Suppose Clickhouse has 10K customers and a data plane cluster has 1K only nodes then it can not cater all the customers in a single cluster, right?
This essentially means they must be running multiple dataplane clusters however they might have just a single control plane.
That’s the cell architecture, so N data plane clusters aka cells, testing qualifies a cell to meet a cert perf level and you horizontally scale out more cells to meet demand.
Whilst hoping a single customer doesn’t exceed the limits of a cell =)
This is a wonderful article, architecture, and project. Can anyone from Clickhouse comment on any non-technical factors that allowed such a rapid pace of development, e.g. team size, structure, etc.?
Couple of things-
1. Hiring the best engineers globally - this was huge as we are a distributed company and can hire the best talent anywhere in the world.
2. Flat team structure specially in the first year with a concept of verticals(technical areas - like autoscaling, security, proxy layer, operations etc) and having one engineer own and drive these verticals.
3. Lot of cross team collaboration and communication (across product, engineering, business, sales)
4. Lastly as mentioned in the blog post, it was very important for us to stick to the milestone based sprints for faster development and product team helped a lot to prioritize the right features so that engineering could deliver.
That's a fair question and something we obsess about at ClickHouse! It's mentioned in the post but we maintain continuously updated benchmarks for ClickHouse Cloud and it's on-prem shared nothing counterpart (as well as other databases!). You'll find in [1] the comparison for every ClickHouse deployment option.
From the blog post: "the fastest baseline here is ClickHouse server running on an AWS m5d.24xlarge instance that uses 48 threads for query execution. As you can see, an equivalent cloud service with 48 threads performs very well in comparison for a variety of simple and complex queries represented in the benchmark"
so there can be a small difference depending on the sizing but it's something to consider on a case per case basis and often other dimensions need to be taken into account (operational cost, bottomless storage, linear scalability, reliability etc.).
That's impressive. ClickHouse is great but it's a huge PITA to operate on-prem so you guys will be rolling in them moneys. Speaking of, maybe you know someone who had to stay in Russia and didn't go to youse guys or Altinity? We (AliExpress Russia) are looking for a CH DBA to help us implement an on-prem CHaaS. Salary is like $9-10K/month.
Basically, we're looking for the person who has at least some expertise with operating CH in production and paying big money. In four month offering $6K-$7K/month we've got ziltch.
It is truly a tremendous feat of engineering, product, and project management to build a solution of this magnitude in just a year. I believe fluid cross-team communication, and excellent foresight in flagging future issues and design decisions are some of the ingredients that led ClickHouse to such flawless execution.
Space launch 'commoditization', JWST, AI/NLP advancements (regardless the current hype), mRNA vaccines are real examples of "truly a tremendous feat of engineering".
Running a database on Kubernetes backed by an object based storage? Not even close.
Don't get me wrong, Clickhouse folks are great and their DB is fantastic, but your scale may need some adjustments.