
Worked at Hortonworks and post-merger Cloudera. Was interesting to see how market demand changed over the years and how the company worked on reinventing itself. Databricks and Snowflake seemed to understand SaaS earlier and better, though. Still have some great friends working at Cloudera, and hope this will indeed accelerate a great next phase.



For the uninitiated, do you mind diffing what Hortonworks used to do and what post-merger Cloudera is now focused on?


Before the cloud took off, Hortonworks and Cloudera owned the Big Data market.

They both offered a Hadoop distribution but had different strengths, e.g. Hortonworks had fine-grained access control, while Cloudera had a better SQL product in Impala.

Then AWS came along and built their own (EMR), which was significantly cheaper and more flexible, since you could easily scale your cluster up/down. And so companies moved to it as they gradually shifted to the cloud.

The Hortonworks/Cloudera response to this threat was to put aside their differences and merge.

Over time Big Data has evolved from being Hadoop-centric to being much more ML/AI focused, i.e. not just manipulating and querying the data but doing something interesting with it. And AWS, Azure, and GCP have really jumped in with a whole suite of products that are tightly integrated with the rest of their clouds. It's a large part of what differentiates their offerings, so they compete very hard.

So Cloudera has no choice but to do things that cloud providers won't or can't do: (1) focus on non-cloud or multi-cloud and (2) offer a much more integrated and cohesive solution.

But having spent 10+ years in this space and deployed many Hadoop clusters, I can tell you that Cloudera is going to struggle. Companies that I never thought would move to the cloud, e.g. banks, are figuring out the security and regulatory challenges and eagerly moving across. And so it's going to be Cloudera versus Amazon/Google/Microsoft, which is an impossible fight.


Competing with Amazon/Google/Microsoft on their own cloud is... ehm... good luck with that indeed. I believe they should have partnered with them early on (a real partnership, like a premier offering, not the rubber-stamp / marketplace kind).


It can work, as that's exactly what Snowflake has done and it's one of the fastest-growing SaaS companies today.

A good product is more valuable than a partnership.


Good products, but don't discount their go-to-market strategy.

IIRC MS/Azure is an early Databricks investor and their sales folks were heavily incentivised to sell it. They also pushed Snowflake in the early days until they had a competing product and their relationship status was upgraded to 'it's complicated'.


Databricks as well. Azure sells both their own version and “Azure Databricks”


AWS EMR is still pretty pricey compared to free Ambari/Cloudera running on EC2, although a lot of time and effort needs to go into the automation that uses those Ambari/Cloudera Hadoop management layers. After they merged, they got really aggressive and made moves that effectively killed each of the free versions. They definitely put another nail in the coffin of Hadoop. Spark on Kubernetes is pretty gorgeous and has been a successful route out of pricey Hadoop infrastructure for my company.
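For a sense of what that route looks like, here is a minimal sketch of running PySpark against a Kubernetes cluster in client mode (not from the comment above; it assumes Spark 3.x, and the API server URL, namespace, image, service account, bucket path, and column name are all placeholders; in practice the driver host also has to be reachable from the executor pods, or you would use spark-submit in cluster mode instead):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("k8s://https://my-apiserver:6443")          # placeholder API server
        .appName("spark-on-k8s-sketch")
        .config("spark.kubernetes.namespace", "spark-jobs")  # placeholder namespace
        .config("spark.kubernetes.container.image", "myrepo/spark-py:3.5.0")  # placeholder image
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
        .config("spark.executor.instances", "4")
        .getOrCreate()
    )

    # Data lives in object storage instead of HDFS.
    df = spark.read.parquet("s3a://my-bucket/events/")       # placeholder path
    df.groupBy("event_type").count().show()                  # placeholder column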


It need not have been this way. All three major providers work with third-party vendors all the time. They could have been another Databricks. Or, even better, a fully managed solution on the cloud as well (like Snowflake).


They will struggle as on-prem only. They are trying to get to the cloud. The cloud vendors struggle some because their PaaS products aren’t as good. Lift and Shift doesn’t work. It’s Lift and Whoops and Refactor.


> Lift and Shift doesn’t work. It’s Lift and Whoops and Refactor.

I’d like to learn more. What doesn’t work? What can vendors do to make it easier? My understanding is that lift and shift doesn’t mean one-and-done, with no grueling manual testing required.

Once you set up the lift and shift, as long as the source schema doesn’t change, you could run it as often as you’d like as you deploy, test, and fix?


Thank you! I have no history in this space so I can't ask follow-ups, except to observe that the tendency of ML/AI to reward the "big gets bigger" phenomenon is exemplified here. I don't feel too great about that, but also don't have ideas for a better system.


Worked at Cloudera pre- and post-merger. I thought of on-premises CDH clusters (and similarly HDP clusters) as trying to be the majority of your data infrastructure, but open so that they could integrate with other stuff. It's not just about having big data, but having one place to store all of that data regardless of schema: massive database tables, logs, etc., all on shared hardware. AND frameworks to process it in different ways in place: SQL queries, Spark jobs, Search, etc. Data gravity was very important to the business model.

As more people moved to the cloud, Hadoop-style storage became extremely expensive (naively moving your Hadoop cluster to 3x replication on EBS volumes would result in a nasty case of sticker shock), so the data would move to S3 / ADLS / GCS. And now you've lost your data gravity.
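To make the sticker shock concrete, a back-of-the-envelope sketch (the prices are rough assumed ballpark list figures, not quotes, and this ignores the EC2 instances behind HDFS as well as S3 request costs):

    logical_tb = 1000                # assume 1 PB of data before replication
    ebs_price_per_gb_month = 0.08    # assumed ~gp3-class block storage price
    s3_price_per_gb_month = 0.023    # assumed ~S3 Standard price

    # HDFS-style 3x replication on EBS stores three copies of every byte.
    ebs_monthly = logical_tb * 1024 * 3 * ebs_price_per_gb_month
    s3_monthly = logical_tb * 1024 * s3_price_per_gb_month

    print(f"EBS (3x replication): ~${ebs_monthly:,.0f}/month")  # ~$246,000
    print(f"S3 (single copy):     ~${s3_monthly:,.0f}/month")   # ~$24,000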

Post-merger Cloudera focused less on on-premises clusters and tried to offer those same diverse workloads as a multi-cloud SaaS, with more focus on elasticity. This is hard because (a) there's a massive amount of surface area if you want enterprise customers to bring their own accounts, run all these managed open-source services in those accounts, and be multi-cloud, and (b) you're just competing more directly with the cloud vendors, on their turf, as a customer, partner, and competitor all at once.


Would add that HDFS was a particular nightmare to manage.

You had to worry about the size and number of files, since the NameNode would get overloaded. Being a Java app running on older JVMs, it would do a full GC under heavy load and cause failovers. And it was impossible to get data in/out from outside the cluster using third-party tools.

I remember many companies seeing S3 and just being in shock that it was so cheap, limitless and that someone else was going to manage it all.


It's interesting, because I think HDFS (and NameNodes in particular) were impressively engineered for a use case which didn't quite materialize, i.e. very fast metadata queries (they are still much faster than S3 API calls). It turns out that cheap, simple, and massively scalable object storage is just far, far more important in practice.

I think there are still a couple of use cases where HDFS dominates S3 (some HBase workloads, I think?). But yeah, I scaled up and maintained a 2000+ node Hadoop cluster for years, and I would never choose it over object storage if given any plausible alternative.


This is actually a topic I love to talk about, because I spent a lot of my time on S3A and the cloud FileSystem implementations. Fast metadata queries were actually a huge deal for query planning, and of course there were a lot of potential performance surprises on S3.

HBase was (unsurprisingly) heavily dependent on semantics that HDFS has but that are hard to get right on object storage, and required a couple of layers to be able to work properly on S3 (and even then, write-ahead logs were still on a small HDFS cluster last I heard).

My biggest complaint about S3 was always eventual consistency (for which Hadoop developed a work-around; it originally employed a lot of worst practices on S3 and suffered from eventual consistency A LOT), but now that S3 has much better consistency guarantees, I agree: it's incredibly hard to beat something that cheap.
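(The work-around referenced here is presumably S3Guard, which kept S3A's view of listings consistent by mirroring metadata in a DynamoDB table. A minimal sketch of what enabling it from Spark looked like, assuming Hadoop 3.x with the S3A connector; the table name, region, and path are placeholders, and S3Guard has since been deprecated now that S3 is strongly consistent:)

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3guard-sketch")
        # Route S3A listing/metadata checks through DynamoDB instead of
        # trusting eventually-consistent S3 LIST results.
        .config("spark.hadoop.fs.s3a.metadatastore.impl",
                "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
        .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "my-s3guard-table")  # placeholder
        .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-1")        # placeholder
        .getOrCreate()
    )

    df = spark.read.parquet("s3a://my-bucket/events/")  # placeholder path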


For a job that needs to access hundreds of thousands of small files, the ability to read the metadata quickly is very important.

This is the wider issue with small files. On HDFS each file uses up some NameNode memory, but if there are jobs that need to touch 100k+ files (and I have seen plenty of those), that puts a real strain on the NameNode too.
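As a rough illustration of that memory pressure (not from the comment above; the ~150 bytes figure is the usual rule of thumb, and the counts are made up):

    BYTES_PER_OBJECT = 150    # rough rule of thumb per file/directory/block entry
    files = 100_000_000       # assume 100M small files
    blocks_per_file = 1       # small files typically map to a single block
    objects = files * (1 + blocks_per_file)

    heap_gb = objects * BYTES_PER_OBJECT / 1e9
    print(f"~{heap_gb:.0f} GB of NameNode heap just to hold the namespace")  # ~30 GB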

I have no experience with S3 to know how it would behave in terms of metadata queries for lots of small objects.


Small files are both slow and expensive on S3 too. But at least one bad query won't be able to kill your whole cluster the way it can with HDFS.


Yeah, I would have loved to see HDFS get really scalable metadata management. I remember hearing about LinkedIn's intentions to do some significant work there at the last community event I attended, but from their blog post this week it doesn't sound like that's happened since the read-from-standby work [1].

Kerberos (quite popular on big enterprise clusters) is really what makes it hard to get data in/out, IMO. I see generic Hadoop connectors in A LOT of third-party tools.

[1] https://engineering.linkedin.com/blog/2021/the-exabyte-club-...


Apache Ozone (https://hadoop.apache.org/ozone/) is an attempt to build a more scalable (for small files / metadata) HDFS-compatible object store with an S3 interface. Solving the metadata problem in the HDFS NameNode will probably never happen now; too much of the NameNode code expects all the metadata to be in memory. Efforts to overcome NameNode scalability have centered on "read from standby", which offers impressive results.

Metadata is not the only problem with small files. Massively parallel jobs that need to read tiny files will always be slower than if the files were larger: the overhead of fetching the metadata for each file and setting up a connection for the read is quite large when the payload is only a few hundred KB or a few MB.
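To put some illustrative numbers on that overhead (all assumed figures, not measurements; the totals are aggregate task-seconds, so parallelism divides the wall-clock time but not the ratio):

    PER_FILE_OVERHEAD_S = 0.05   # assume ~50 ms per file for metadata + connection setup
    THROUGHPUT_MB_S = 100        # assume ~100 MB/s sequential read per task

    def total_read_seconds(total_gb, file_size_mb):
        n_files = total_gb * 1024 / file_size_mb
        return n_files * PER_FILE_OVERHEAD_S + total_gb * 1024 / THROUGHPUT_MB_S

    # Same 1 TB of data, different file sizes:
    print(total_read_seconds(1024, 1))    # 1 MB files: overhead is ~5x the actual read time
    print(total_read_seconds(1024, 256))  # 256 MB files: overhead is negligible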

The other issue with the HDFS NameNode is that it has a single read/write lock protecting all of the in-memory data. Breaking that lock into a more fine-grained set of locks would be a big win, but it is quite tricky at this stage.



