Also saying they “missed the Snowflake opportunity” doesn’t make sense either: they have Redshift? They don’t need to acquire Snowflake, they’re already the incumbent in the field.
Author here. They missed the Snowflake opportunity by having the wrong architecture for Redshift (decoupled storage and compute). They shifted to the Snowflake model in 2019, but the damage might already be done. For the other services I listed, the differentiators are mainly plugins/extensibility and developer experience.
If you’re going to pay someone for a data warehouse, you can pay AWS some money, or you can pay Snowflake a heinous amount of money, for something not significantly better?
Mongo, ES and Redis all come with open source versions, which shifts the appeal back away from the AWS offerings, especially with things like Elastic Cloud for Kubernetes, which has made our ES cluster basically a hands-free experience.
Snowflake is technically far superior to Redshift. The performance and features are somewhere else. Even if you pay more in operating cost (which is not always the case, especially if you count Redshift DBA salary), you get to do what you actually need to do much faster instead of fighting the platform.
Snowflake is also technically going to always be far superior to Redshift, because AWS is a follower, not a leader. Their strategy is to copy what others are doing and build a moat of "but we can do it too, and you already have other things with us".
And if you really care about operating cost, you should be running Trino+Minio or Clickhouse onprem or something as your warehouse.
> Snowflake is technically far superior to Redshift. The performance and features are somewhere else
From the discussions I’ve had about this before, I think I’m in the minority when I say I’m categorically unimpressed by Snowflake’s performance.
Add that to the hideous cost and the world’s most aggressive sales/account management team, and I’ve less than zero desire to ever deal with them again.
> Snowflake is also technically going to always be far superior to Redshift, because AWS is a follower
While this is true, how many businesses actually need or actually utilise the features they’re paying for with things like Snowflake? Sure, it’s got separate storage and compute, but how many places have so much data that they need that? I’ve worked with places that were running multi-node data warehouse clusters that we migrated to a single ClickHouse or Postgres instance and got equivalent or better performance for a fraction of the economic and operational overhead.
> And if you really care about operating cost, you should be running Trino+Minio or Clickhouse onprem or something as your warehouse.
Can’t disagree here, although I’d say it shouldn’t just be relegated to operating cost: CH blows most alternatives out of the water and is available in hosted options now.
Haven’t used Trino recently; last time I used it, it was still called Presto, and it was frustratingly slow. Do you know if it’s improved recently?
I think that depends on your target and especially scope.
ClickHouse demonstrates this well - it is incredibly fast and powerful, but also very limited. It has its own dialect of SQL incompatible with anything else. It doesn't even have a traditional query planner, so you have to be an expert to write fast queries. It has no update, no merge, no CTEs. To get the most out of it, you have to think about data sorting and data types (is this LowCardinality String or just String?). And once you reach a scale where a single node can't handle it (which is much more than a single-node Redshift could handle), you manage a stateful cluster, which is a pain compared to storage/compute separated clusters like Presto/Trino over S3.
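To make the LowCardinality question concrete: `LowCardinality(String)` is a dictionary-encoded column, which pays off when a column has few distinct values. A rough Python sketch of the idea follows. It is not ClickHouse's actual implementation, and `dictionary_encode`, `dictionary_decode` and the sample data are names invented for illustration.

```python
def dictionary_encode(values):
    """Replace each string with a small integer index into a list of distinct values."""
    dictionary = []   # distinct values, in first-seen order
    index_of = {}     # value -> its position in `dictionary`
    codes = []        # one integer per input row
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the integer codes."""
    return [dictionary[code] for code in codes]


# Hypothetical low-cardinality column: many rows, only three distinct values.
column = ["mobile", "desktop", "mobile", "mobile", "tablet", "desktop"]
dictionary, codes = dictionary_encode(column)
```

With few distinct values, each string is stored once and the rows become small integers. With mostly unique values the dictionary ends up about as large as the data itself, which is why this is a per-column decision you have to think about.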
So ClickHouse is very good if it is in the hands of a dedicated team that is willing to learn CH inside and out and is using it for a particular purpose. But you can't really set up ClickHouse, dump data into it and then send 100 random Data Analysts to "go forth and make reports". The barrier to entry is too high and the list of things that are different in ClickHouse is just too long.
Snowflake is exactly at the opposite end of "how much knowledge / how much of a dedicated team do I need to make use of this". Dumping data in and giving other teams access is something you can do easily. Scaling isn't a problem, users interfering with each other isn't a problem (you can give them their own compute that suspends when not in use), all the typical "data features" are there, backup is handled via time travel when they delete something they shouldn't, and they don't have to understand data types or what the best partitioning and sorting is for the querying they are planning to do. It even has a fancy web UI with charts and CSV upload.
You do pay a pretty penny for that, but there are many companies with a lot of money and many data problems to solve, so paying more for infrastructure and enabling many less technically skilled users (who are likely SMEs) is worth it.
Redshift sits somewhere in the middle. You still need DBAs, you still somewhat need to think about data layout, you still need to worry about managing backups and workload management where one user can slow down everyone else, etc. It feels like the worst of all worlds.
In the onprem world, Trino behaves more like Snowflake than ClickHouse, which has its uses, again if you're looking to build a general platform instead of a dedicated application. Might be worth it even if you never get millisecond-latency queries out of it.
> ClickHouse demonstrates this well - it is incredibly fast and powerful, but also very limited.
Your examples are not entirely correct. ClickHouse introduced CTEs in early 2021. I use them constantly. ClickHouse SQL is turning into a superset of standard SQL, at least for queries. Most SELECT syntax, including window functions, just works. ClickHouse does have updates, but they are asynchronous. This too is being fixed. Synchronous DELETEs will be available this year; UPDATE is next. One big issue for my money is distributed joins. They still require a lot of reasoning about data locality.
You are right that ClickHouse requires attention and skills to get the best performance. However, that includes fixed, low-latency use cases like real-time marketing that Snowflake simply does not handle. ClickHouse and Snowflake aren't interchangeable for these use cases.
Oh wow, I missed that. I guess I heard about the CTE somewhere and understood dbt-clickhouse not having ephemeral materialization as CH still not supporting it.
I am still somewhat salty about not being able to do GROUP BY 1, 2, 3 :)
> I am still somewhat salty about not being able to do GROUP BY 1, 2, 3 :)
Nevertheless, it is already available in ClickHouse as well under the `enable_positional_arguments` setting, and we are considering enabling it by default.
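For anyone unfamiliar with the syntax being discussed: `GROUP BY 1, 2` refers to columns by their position in the SELECT list instead of by name. The minimal sketch below uses Python's built-in `sqlite3` rather than ClickHouse, purely because it runs anywhere and accepts the same positional form. The `events` table and its columns are made up for illustration. In ClickHouse, per the comment above, the equivalent query depends on the `enable_positional_arguments` setting.

```python
import sqlite3

# Hypothetical table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, device TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("DE", "mobile", 3),
        ("DE", "mobile", 2),
        ("DE", "desktop", 5),
        ("US", "mobile", 7),
    ],
)

# Positional form: 1 and 2 mean the first and second SELECT expressions.
positional = conn.execute(
    "SELECT country, device, SUM(clicks) FROM events GROUP BY 1, 2 ORDER BY 1, 2"
).fetchall()

# Named form: the same query with the columns spelled out.
named = conn.execute(
    "SELECT country, device, SUM(clicks) FROM events "
    "GROUP BY country, device ORDER BY country, device"
).fetchall()

conn.close()
```

Both queries return the same rows. The positional form mostly saves typing when the grouped expressions are long.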
> Snowflake is also technically going to always be far superior to Redshift, because AWS is a follower, not a leader.
Redshift was the first cloud data warehouse-as-a-service in the Amazon cloud. Every data warehouse since then has built on that concept. Speaking of innovation, Snowflake depends on object storage, which Amazon basically invented.
Not so sure about that: https://aws.amazon.com/ground-station/. Enabling data collection from the rest of the solar system seems pretty cutting edge to me.
Are you saying AWS does not make money off the infrastructure Snowflake consumes? Snowflake consumes a lot of compute, plus there are lots of other SaaS services that integrate with it and also need compute, networking, etc.
It's my understanding the margins on compute at least are pretty good. So it seems as if AWS wins either way.
I don't know what you are smoking here, but let's take MongoDB vs DocumentDB or Elastic Cloud. Mongo built a $30B business around Atlas that is growing at a staggering rate. DocumentDB has a tiny share. Same with Elastic.
Also, what the hell are you talking about with Redshift? Decoupled compute and storage is a far superior architecture. Here is an excerpt from snowflake.com: "Snowflake was founded on the belief that tying compute and storage together is not an effective approach for limitless, seamless scaling."
Yeah, I'm not sure what point they are trying to make here. Each of these AWS services is clearly and demonstrably less popular than the alternative it's compared against. Mongo and DocumentDB for example aren't even close: https://db-engines.com/en/ranking_trend/system/Amazon+Docume...
OpenSearch is a massively inferior offering compared to Elasticsearch too. It became outdated the moment it was forked, the documentation is lacking, and since you'll end up looking up ES docs and forgetting to switch to version 7.10, you'll get a nice reminder of everything new that has been added that you can't actually use.
The only thing it has going for it is that it's managed and you're already on AWS, so you don't need to spend months working up a contract with a new vendor and doing the security audit dance.
> so you don't need to spend months working up a contract with a new vendor and doing the security audit dance.
That's a very big moat. Many decision makers are risk-averse w.r.t. infrastructure vendors and don't mind paying (or making someone else pay) a premium for that.
The only thing that changed in the saying "no one was ever fired for choosing IBM" is the name.
> The only thing it has going for it is that it's managed and you're already on AWS, so you don't need to spend months working up a contract with a new vendor and doing the security audit dance.
That's a pretty substantial win for many projects, similar to how popular RDS is while giving up the ability to get updates as fast as if you run your own servers. People pay a lot for stability and reduced staffing requirements, which is a choice — calling it “massively inferior” seems like the wrong call versus recognizing that not everyone has the same needs and resources as you do.
It's functionally inferior, because if you go into it expecting modern elasticsearch you'll be sorely disappointed. It's only now that it's been renamed to OpenSearch that this becomes less of an issue, as the two technologies have completely diverged.
Anything AWS decides to run a managed version of will automatically have an advantage over the non-AWS equivalent but that's not a property of the technology itself really, it's just that you've already done all the paperwork and bureaucracy to use AWS in your org.
I mean, the word we're not using here but ought to is vendor lock-in. I don't really need the patronising remark about recognising others' needs either.
> It's functionally inferior, because if you go into it expecting modern elasticsearch you'll be sorely disappointed.
Or, for a large number of people, won't notice the difference. There are some important questions about how ElasticSearch's license & community works, and whether you're risking lock-in by using a managed service, but detail-free hyperbole contributes noise but not value to that discussion.
For example, what are the features you think no user of ElasticSearch could live without which are in ElasticSearch after 7.10 but not OpenSearch 1.1? What percentage of users depend on those features? How concerned are you about vendor lock-in with ElasticSearch's unilateral control of the project's roadmap? Do you think there's more or less risk from an open source project you can run anywhere which, you fear, will not be updated as frequently or from one which has a history of backwards-incompatible changes requiring you to stay current or fall out of support by popular clients?
These are all questions about the merits of either technology.
We've already identified that the merit of a managed AWS (but really any cloud provider) solution is that you've already done your due diligence for that provider, and that alone far outweighs whatever other merit you might consider.
Some of the other stuff is what I might consider if I had to make a case for approving a new vendor, but this is an overwhelming barrage of rhetorical questions that makes it difficult to have a reasoned conversation.
I mean, come on... what percentage of users depend on those features? Unilateral control over a product roadmap? Am I supposed to actually know these things when stating my opinion that OpenSearch is lame in comparison to the original ElasticSearch, based on my anecdotal experience of dealing with the two? Do you have those answers yourself?
As to the risk, a fair question. There's currently a threat that ES client libraries will reject a connection to OS (and I recall this has already been done for some languages). Not the end of the world but the only people who have lost out from the licensing spat you've alluded to are the users. The risks of onboarding Elastic as a new vendor or self-hosting are well known and part of the usual discussion of trade-offs one will have.
You were making some absolute assertions, so yes, I would expect you to have some actual examples based on real experience. Asking you what homework you based that on wasn’t rhetoric but simply a question because I know multiple projects using OpenSearch who’ve had no problems other than Elastic sabotaging certain clients. It would be useful to know, for example, if there was a certain class of functionality or performance challenge where the difference is significant.
- #2, #3, even maybe #4/#5 are big revenue. Cloud is unusually big and still growing insanely.
- crazy margins when they don't have to invent the core concept, go through core product/market fit R&D, nor fight for a distribution channel to market+sell it, nor fight middlemen for competitive pricing
- cross-selling & ecosystem lock-in means even revenue/profit don't have to be high or even positive
> Each of these AWS services is clearly and demonstrably less popular than the alternative it's compared against.
Subject to the limitations of the data, which is mostly what they can scrape from open sources with an unspecified weighting algorithm: https://db-engines.com/en/ranking_definition
That's important to pay attention to because there are many areas this can go wrong — for example, it doesn't include AWS support or Amazon's own Q&A forums so you know you're missing a certain fraction of highly-relevant activity and, more importantly, as a bulk data-mining exercise you have a big challenge differentiating breadth and depth. MongoDB was heavily promoted about a decade ago so there are a ton of SO questions from people who fired up a copy and were looking to use it — which is great, but it doesn't tell you how many of them ended up actually using it for something serious or how big their project was. A thousand hobbyists storing 1% of the number of records of a single enterprise customer is not really something you can easily distill down to a single value. It also doesn't tell you how well people are sticking with it — for example, mature development / ops teams tend to ask fewer basic questions (they've been resolved & in-house expertise means they might never hit StackOverflow) but they might post harder questions about scaling. Does that mean that use of the technology is tapering or that the community is maturing?
The other big question is how you account for managed services. For example, if I use Kinesis I'm outsourcing a lot of operations to AWS; if I use Kafka I have to bring that to the table — one of those scenarios is likely going to involve a LOT more questions and open activity which on its own doesn't tell you anything about how many applications or how much data I'm using it for in either case.
Lol. Wrong on all of those.