Cloudera taken private for $5.3B, acquires Datacoral and Cazena (cloudera.com)
153 points by swyx on June 1, 2021 | 109 comments



I worked at Cloudera in the early 2010s, before they went IPO. It's a bit sad to see the way things played out. All I can say is, it was a fun ride while it lasted.

Cloudera actually started in the cloud, but quickly moved to selling mostly on-premise software. Back then, cloud was a lot less cost competitive (it still is not really cost competitive) https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap... The on-premise business had much higher margins, which kind of "took the oxygen out of the room" for the cloud side of the business. In retrospect, neglecting the cloud market was a big mistake, of course.

Cloudera management knew even back then that Hadoop was eventually going to be superseded by a newer technology. We tried to build that technology, in the form of Impala, an optimized SQL query engine. Unfortunately, it never really took off in the marketplace, for several reasons. For one, its dialect of SQL was not compatible with Hive, the traditional SQL frontend for Hadoop. For another, early versions of Impala were somewhat unstable, and didn't have the fault-tolerance of traditional Hadoop tools.

From my perspective, the main issue that cloud solves is that managing the server side stuff is really hard, and requires experts. Cloud also aligns expectations: it's really clear to both customer and vendor that a cloud contract is not a stepping stone towards self-managing. Unfortunately, Cloudera just didn't see the shift towards cloud coming, and they're now paying the price.


Worked at Hortonworks and post merger Cloudera. Was interesting to see how market demand changed over the years and how the company worked on re-inventing themselves. Databricks and Snowflake seemed to understand SaaS earlier and better though. Still have some great friends working at Cloudera, and hope this will indeed accelerate a great next phase.


for the uninitiated, do you mind diffing what Hortonworks used to do and what post merger Cloudera is now focused on?


Before the cloud took off Hortonworks and Cloudera owned the Big Data market.

They both offered a Hadoop distribution but had different strengths e.g. Hortonworks had fine grained access control, Cloudera had a better SQL product with Impala.

Then AWS came along and built their own, which was significantly cheaper and more flexible since you could easily scale your cluster up and down. And so companies adopted it as, over time, they moved to the cloud.

The Hortonworks/Cloudera response to this threat was to put away their differences and merge together.

Over time Big Data has evolved from being Hadoop centric to being much more ML/AI focused i.e. not just manipulating and querying the data but doing something interesting with it. And AWS, Azure, GCP have really jumped in with a whole suite of products that are tightly integrated with the rest of their cloud offerings. And it's a large part of what differentiates their offerings so they compete very hard.

So Cloudera has no choice but to do things that cloud providers won't or can't do: (1) focus on non-cloud or multi-cloud and (2) offer a much more integrated and cohesive solution.

But having spent 10+ years in this space and deployed many Hadoop clusters, I can tell you that Cloudera is going to struggle. Companies that I never thought would move to the cloud, e.g. banks, are figuring out the security and regulatory challenges and eagerly moving across. And so it's going to be Cloudera versus Amazon/Google/Microsoft, which is an impossible fight.


Competing with Amazon/Google/Microsoft on their own cloud is...ehm...good luck with that indeed. I believe they should have partnered with them early on (real partnership, like a premier offering, not the rubber stamp / marketplace partnership).


It can work, as that's exactly what Snowflake has done and it's one of the fastest-growing SaaS companies today.

A good product is more valuable than a partnership.


Good products, but don't discount their go-to-market strategy.

IIRC MS/Azure is an early Databricks investor and their sales folks were heavily incentivised to sell it. They also pushed Snowflake in the early days until they had a competing product and their relationship status was upgraded to 'it's complicated'.


Databricks as well. Azure sells both their own version and “Azure Databricks”


AWS EMR is still pretty pricey compared to free Ambari/Cloudera running on EC2, although a lot of time and effort needs to be put into automation around those Ambari/Cloudera Hadoop management layers. After they merged, they got really aggressive and made moves that effectively killed each of the free versions. They definitely put another nail in the coffin of Hadoop. Spark on Kubernetes is pretty gorgeous and has been a successful route out of pricey Hadoop infrastructure for my company.
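
For anyone curious what that route looks like, here is a minimal sketch of Spark on Kubernetes reading from object storage (the API server URL, image name, and bucket are placeholders, and it assumes pyspark plus the S3A connector are available - not our actual setup):

    from pyspark.sql import SparkSession

    # Minimal Spark-on-Kubernetes session; endpoints and image names below
    # are placeholders, not a real deployment.
    spark = (
        SparkSession.builder
        .appName("spark-on-k8s-sketch")
        .master("k8s://https://kubernetes.example.com:6443")   # k8s API server
        .config("spark.kubernetes.container.image",
                "registry.example.com/spark-py:3.1.1")         # image with Spark + deps
        .config("spark.kubernetes.namespace", "data-eng")
        .config("spark.executor.instances", "4")
        .getOrCreate()
    )

    # Executors run as ordinary pods and the data sits in object storage,
    # so there is no long-lived HDFS/Hadoop cluster to operate.
    df = spark.read.parquet("s3a://example-bucket/events/")
    print(df.count())
    spark.stop()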


It need not have been this way. All three major providers do work with third party vendors all the time. They could have been the Databricks. Or even better a fully managed solution on the cloud as well (like Snowflake).


They will struggle as on-prem only. They are trying to get to the cloud. The cloud vendors struggle some because their PaaS products aren't as good. Lift and Shift doesn't work. It's Lift and Whoops and Refactor.


> Lift and Shift doesn’t work. It’s Lift and Whoops and Refactor.

I’d like to learn more. What doesn’t work? What can vendors do to make it easier? My understanding is that lift and shift doesn’t mean one and done, no grueling manual testing required.

Once you set up the lift and shift, as long as the source schema doesn't change, you could run it as often as you like as you deploy, test, and fix?


Thank you! I have no history in this space so I can't ask follow-ups, except to observe that the tendency of ML/AI to reward the "big gets bigger" phenomenon is exemplified here. I don't feel too great about that, but I also don't have ideas for a better system.


Worked at Cloudera pre- and post-merger. I thought of on-premises CDH clusters (and similarly HDP clusters) as trying to be the majority of your data infrastructure, but open so that it can integrate with other stuff. It's not just about having big data, but one place to store all of that data regardless of schema: massive database tables, logs, etc. all on shared hardware. AND frameworks to process it different ways in-place: SQL queries, Spark jobs, Search, etc. Data gravity was very important to the business model.

As more people moved to the cloud, Hadoop-style storage was extremely expensive (naively moving your Hadoop cluster to 3x replication on EBS volumes would result in a nasty case of sticker shock) so the data would move to S3 / ADLS / GCP. And now you've lost your data gravity.
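
A rough back-of-the-envelope of that sticker shock (the per-GB prices below are ballpark list prices, purely illustrative):

    # Illustrative only: ballpark per-GB-month prices, not a quote.
    logical_gb = 1_000_000        # ~1 PB of logical data
    ebs_gb_month = 0.10           # general-purpose EBS, roughly
    s3_gb_month = 0.023           # S3 standard, roughly

    hdfs_on_ebs = logical_gb * 3 * ebs_gb_month   # HDFS keeps 3 replicas itself
    s3 = logical_gb * s3_gb_month                 # replication is S3's problem

    print(f"HDFS on EBS: ~${hdfs_on_ebs:,.0f}/month")   # ~$300,000/month
    print(f"S3:          ~${s3:,.0f}/month")            # ~$23,000/month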

Post-merger Cloudera focused less on on-premises clusters and tried to offer those same diverse workloads as a multi-cloud SaaS, with more focus on elasticity. This is hard because (a) there's a massive amount of surface area if you want enterprise customers to bring their own accounts, run all these managed open-source services in those accounts, and be multi-cloud, and (b) you're just competing more directly with the cloud vendors, on their turf as both a customer, partner and competitor.


Would add that HDFS was a particular nightmare to manage.

You had to worry about the size (and number) of files, since the NameNode would get overloaded. Being a Java app running on the older JVMs, it would do a full GC under heavy load and cause failovers. And it was impossible to get data in/out from outside the cluster using third-party tools.

I remember many companies seeing S3 and just being in shock that it was so cheap, limitless and that someone else was going to manage it all.


It's interesting, because I think HDFS (and NameNodes in particular) were impressively engineered for a use-case which didn't quite materialize — ie, very fast metadata queries (they are still much faster than S3 API calls). Turns out that cheap, simple, and massively scalable object storage is just far far more important in practice.

I think there are still a couple of use-cases where HDFS dominates S3 (I think some HBase workloads?). But yeah, I scaled up and maintained a 2000+ node Hadoop cluster for years, and I would never choose it over object storage if given any plausible alternative.


This is actually a topic I love to talk about because I spent a lot of my time on S3A and the cloud FileSystem implementations. Fast metadata queries were actually a huge deal for query planning, and of course with performance there were a lot of potential surprises on S3. HBase was (unsurprisingly) heavily dependent on semantics that HDFS has but that are hard to get right on object storage, and required a couple of layers to be able to work properly on S3 (and even then - write-ahead logs were still on a small HDFS cluster last I heard). My biggest complaint about S3 was always eventual consistency (for which Hadoop developed a work-around - it originally employed a lot of worst practices on S3 and suffered from eventual consistency A LOT) but now that S3 has much better consistency guarantees, I agree: it's incredibly hard to beat something that cheap.


For a job that needs to access hundreds of thousands of small files, the ability to read the metadata quickly is very important.

This is the wider issue with small files. On HDFS each file uses up some namenode memory, but if there are jobs that need to touch 100k+ files (which I have seen plenty of), that puts a real strain on the Namenode too.
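
As a rough illustration of that strain (using the commonly cited ~150 bytes of NameNode heap per namespace object, a rule of thumb rather than an exact figure):

    # Back-of-the-envelope NameNode heap pressure from small files.
    # ~150 bytes per file/block object is a folk rule of thumb; real usage
    # varies with Hadoop version, path lengths, replication, etc.
    bytes_per_object = 150
    files = 100_000_000              # 100M small files
    objects = files * 2              # one file object + one block object each

    heap_gb = objects * bytes_per_object / 1024**3
    print(f"~{heap_gb:.0f} GB of heap just for namespace metadata")   # ~28 GB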

I have no experience with S3 to know how it would behave in terms of metadata queries for lots of small objects.


Small files with S3 is both slow and expensive too. But at least one bad query won't be able to kill your whole cluster like HDFS.


Yeah I would have loved to see HDFS get really scalable metadata management. I remember hearing about LinkedIn's intentions to really do some significant work there at the last community event I attended, but from their blog post this week it doesn't sound like that's happened since the read-from-standby work [1].

Kerberos (quite popular on big enterprise clusters) is really what makes it hard to get data in / out IMO. I see generic Hadoop connectors in A LOT of third party tools.

[1] https://engineering.linkedin.com/blog/2021/the-exabyte-club-...


Apache Ozone https://hadoop.apache.org/ozone/ is an attempt to make a more scalable (for small files / metadata) HDFS-compatible object store with an S3 interface. Solving the metadata problem in the HDFS NameNode will probably never happen now; too much of the NameNode code expects all the metadata to be in memory. Efforts to overcome the NN scalability limits have centered on "read from standby", which offers impressive results.

The metadata is not the only problem with small files. Massively parallel jobs that need to read tiny files will always be slower than if the files were larger. The overhead of fetching the metadata for a file and setting up a connection to do the read is quite large when you're only reading a few hundred KB or a few MB.

The other issue with the HDFS NameNode is that it has a single read/write lock to protect all the in-memory data. Breaking that lock into a more fine-grained set of locks would be a big win, but quite tricky at this stage.


One annoying thing I have noticed about many company/project blogs is that they have no link in the header to go to the actual project website (project.com). Instead the main logo takes you back to the blog (blog.project.com). You have to manually type the URL for the actual website.

For example, even this site: https://blog.ycombinator.com/


The HN Algolia page also doesn't link back to HN: https://hn.algolia.com/


The elephant (no pun intended) in the room continues to be that most data in the enterprise isn't big data. You don't need MapReduce when your data set fits in RAM.


This is simply nonsense. Big data has never been just about MapReduce.

It has always revolved around the concept of a data lake, with data stored as objects, a series of data engineering pipelines moving data around and a query engine on top. And in almost every enterprise company this is the high level architecture you see today.

And this model only continues to grow in popularity as the use of siloed SaaS products drives data sprawl and the need for tools like Spark, Fivetran etc to move it all back to a centralised data lake for analysis.


Big data just means big data.

A data lake is one way to deal with it. It's convenient but not always the best.


Sort of. The issue is that coupling data with SQL databases doesn't play well with even basic inferential statistics, so you end up replicating data based on analytics use cases. Having Parquet files in a cheap storage system that can be queried with SQL, Python, R, etc. is very convenient. What I don't get is why so little investment is going into making data lakes safer and easier to govern with proper access controls.


You can run those queries straight from relational databases too, since everything ends up in the same dataframe structure for processing in the end. There are tons of data source adaptors now and all the modern OLAP databases support various deeper integrations to run stats/ML directly.
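
For example, the same dataframe workflow runs against a relational source directly (the connection string and table below are placeholders):

    # Sketch: a relational/OLAP source lands in the same dataframe you'd get
    # from Parquet files, so downstream stats/ML code doesn't care where the
    # data came from. Connection string and table are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://analyst@warehouse.example.com/sales")
    df = pd.read_sql("SELECT region, revenue FROM orders WHERE year = 2020", engine)

    print(df.groupby("region")["revenue"].describe())   # ordinary stats on the result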


Yes! And if it doesn’t fit in RAM, then most businesses will only need an OLAP data warehouse, like Snowflake or Redshift.


And if you want to join data from different SaaS and internal systems e.g. Google Analytics and a Pega decisioning system.

Are you going to spend months upfront carefully modelling the data in order to ingest it, making sure to handle schema and DQ issues, etc.? All to support one use case that only needs a handful of fields?

No. Which is why data lakes exist. Because it's cost effective. You simply dump the data and ask the Engineer or Data Scientist building the use case to do the heavy lifting rather than a centralised data team.


There are integration companies that solve this specific use-case. I’ve used Fivetran [0] and highly recommend it. They will extract-load data from your SaaS to your warehouse and your data scientists can run SQL against the tables. Their most popular warehouses are Redshift and Snowflake. So you can still use a centralized data warehouse without dedicating internal resources to the integrations.

[0] https://fivetran.com/


What I find amazing is that Fivetran is a bunch of glue code to forward data between different APIs and database formats, and it's legitimately useful, in part because when the upstream API breaks they go and fix the connectors for you instead of you having to deal with the resulting emergency... but it's only needed because data interchange standards are in such poor shape. If users demanded that SaaS products make data/event streams/replication logs available via robust and standardized APIs, a lot of the use cases for Fivetran would disappear.


The real problem is that even when it doesn't fit in RAM, you still don't need MapReduce.


Cloudera and Hortonworks promise an easy way to deploy on-premise Hadoop clusters. Yet HDP & CDH are still too hard to maintain for the majority of companies:

* Tuning for HDFS and HBase is hard. Too many files and GC pauses will kill you. Hive Metastore is also a bottleneck.

* It's a nightmare to figure out how to configure some third-party software that's outside the bundle.

* It's even harder if you want to use a more up-to-date version of software included in the bundle. We still haven't figured out how to use Spark 3 with our old HDP clusters.

* Sometimes the included software is a buggy mess. For example, HDP 3 provides HBase 2.0, which is absolutely unusable. Luckily, we were able to migrate to HBase 2.2 before putting too much data into it.

All in all, it still takes a lot of ops work to own a Cloudera Hadoop cluster. I can't wait until we can move to a better solution.


This might be a silly question but: Why is it so hard to use? What made Cloudera not see how difficult their product was to use and invest in fixing that? It appears to have cost them the market, yes?


> Why is it so hard to use? What made Cloudera not see how difficult their product was to use and invest in fixing that?

It is packaged open source software, from multiple projects (some of which are financed by Cloudera) all of which release at different paces.

And distributed systems are really hard to get right. There are lots of knobs to tune, all of which are needed in some circumstances, but there are lots of things that can go wrong.

> It appears to have cost them the market, yes?

Cloudera is probably the market leader for on-prem Hadoop clusters.

This is a big market (even now - lots of defence clients who have issues with cloud).

But generally this market is getting eaten by cloud SaaS products.


TL;DR: Distributed systems are hard, and Hadoop clusters are not really designed for a dynamic environment.

There are probably as many services running inside a Hadoop cluster as in a microservice mesh, except:

* The services can be huge: the HDFS NameNode can take hundreds of GB of RAM and hours to start up in a multi-PB cluster. Updating configuration requires a restart, so it is a huge pain if you care about zero downtime.

* The services are often bound to a host and cannot be easily migrated to other hosts.

* The communication interfaces between services are not well-defined: Hive Metastore, for example, didn't have formal protocol documentation for a long time, yet everything depends on it to store metadata (I'm not even sure if they have the protocol documented now). As a result, services are often tightly coupled: you need everything to be compiled with the same library versions, or stuff might not work correctly. Furthermore, due to dynamic code loading, issues might not surface until days later - in the middle of the night, making life miserable for everyone.


This is an interesting move. Many companies are still operating on-prem, hybrid cloud or they need to have a multicloud strategy (to operate in certain geographies or to avoid vendor lock-in). If Cloudera can get to the point where they have a big data infrastructure that is competitive with the native offerings from GCP, AWS, etc but easily supports on-prem, hybrid-cloud and multi-cloud, I would imagine there would be many ready corporate customers.

Going private could give them the opportunity to make significant R&D investments away from the quarterly demands of a public company. On the other hand, the private equity backers could enact a bunch of cost-cutting, gut the company, cause its top employees to leave, and load the company up with debt. It will be interesting to see how this shakes out.


Well. Not.

PE is usually about discipline, i.e. cash-based discipline. So PE changes the debt/equity ratio from 30/70 to 70/30, thus forcing the company to be much more cash efficient, i.e. take LESS risk.

To sum up, I do not envision more R & D.


I think the problem with PE is 1) they need a quick return, and 2) they don't necessarily understand the details of the tech (although they probably have access to some talent who does, or claims to). So usually they just cut costs and make a profit from that.


I hear this argument a lot, but with all due respect, simply understanding the tech is what got them into this issue. They didn't understand the commercial side.

PE has some scars, but they don’t typically take over healthy, well run companies. They take over mismanaged companies, and refocus on the pursuit of profit that you seem to take issue with, ultimately saving the company from bankruptcy.


Eh, you had me up until the ‘ultimately saving from bankruptcy’ part. PE cares about investor return, not some high and mighty altruistic end goal no?

There are plenty of PE deals for companies nowhere near insolvency, where the companies just had some more juice that could be squeezed (short/medium/long term depending, but usually short/medium term). Whether it's a problem if THAT ends in bankruptcy depends on whether they are able to get a sufficient positive return/exit before they get hit by any losses - seen that many times.


How is saving from bankruptcy altruistic? It’s a financial outcome…

If you can save a company from bankruptcy, that’s wealth that would otherwise have been destroyed.


Because that isn’t the goal and is often not what happens?

It is purchased to make return, not to save them from bankruptcy. If it is a side effect, that’s fine, but that is not why it’s done or what the goal is no?


I was a Cloudera client in 2015. I've since moved to a company where we did on-prem (thousands of nodes), then to another that was an AWS shop. Each case was an example of a superior solution to Cloudera. Cloudera's market seems to be shrinking & they need to change.

On Datacoral, it seems like a very good product, simplifying the overhead of data pipeline development but still staying modern. Datacoral has many cool features, such as treating ETLs as code with an option to develop via GUI. The industry is moving that way, has been for some time.

Good for Datacoral to get acquired and get exposure to so many new customers. But Cloudera has their work cut out for them to stay relevant in a market that is extremely competitive.


Cloudera / Hortonworks merger -> President steps down -> Cloudera gets taken private. Interesting trajectory over the last few years.


Agreed. It's astonishing, to me at least, that a company that was on the vanguard of both the cloud and big data was so incapable of figuring out a way to make real money off either one. Their sale today, despite the buyer premium, is well below their IPO price.


One thing to note, and this was seen with the Skype purchase: contracts as part of a private equity deal typically limit stock issued to lower-level employees and also generally have an aggressive clawback clause.

I'm too lazy to find the best reference, but this is one re: Skype - https://www.businessinsider.com/skype-scandal-silver-lake-20...


The Skype one sounds like it wasn't such a weird deal. Those that got fired were rumoured to not get their options but in reality they did. Only someone that decided to quit voluntarily 13 months in didn't get an equity deal on the exit that happened later. And this wasn't a secret clause somewhere, those were the terms he had from the beginning.


Is this surprising? I wouldn't be surprised if the value in cloud accretes almost exclusively to the big cloud companies. Exit strategies for these kind of enterprises should primarily be "get acquired by X cloud provider".


Sounds like businesses can't compete running on or off these platforms. Which could stifle competition. Why start a business if one of the few, enormous players can just step in and crush it with network effects or preferential pricing.


they got outspent.

they raised a bunch, built a market... then Amazon and Google came for them. Impala was great - I was a big advocate, but then I worked on BQ for a bit, and now Omni.... Cloudera cannot compete.

The only stumble I can identify is that they didn't support Spark. They backed the Hadoop side too hard and left Spark open to Databricks. They should have signed those guys before any other investors got in (told Matei and co "go get offers, we will add 30%").


WDYM? They were the Hadoop company. You can't just become the Spark company, the philosophies of the products are very different. This comment is pretty silly.

The "only" stumble I can identify is that they're selling a last-generation solution and most companies see Hadoop as tech debt nowadays. Which is to say, it's a systemic issue with their entire product, not a tiny mistake. This is like Mesos vs Kubernetes. One got squashed.


Spark's initial path to success was "a faster way to process your data in HDFS". Cloudera was selling users Spark before DataBricks was even founded. The idea was that Hadoop was an ecosystem of tools for processing data built on commodity storage and compute hardware, for when your data was too big and expensive to transfer to the cloud.

Over time it became increasingly popular to use cloud storage instead of running HDFS. This really destroyed Cloudera's moat, because there was no operational overhead to putting your data in S3 or GCS. You just needed to run some stateless compute, and if you fucked up it didn't matter. Nowadays your "data lake" is a bunch of files in commodity storage someone else runs.


Yes, I agree. It isn't really that Spark killed Hadoop, but that S3/GCS made managing Hadoop clusters pointless. Spark plays well with the storage ecosystem so it's thriving now. But my whole point is that it seems unlikely to me that Cloudera would just have become a compute company if they had invested more into Spark. The core thing they were selling became less and less important over time. That was the problem.


Yeah, branding themselves as the "Hadoop company" made it difficult to get on board with Spark, etc. If they had branded themselves as the "big data company," it would have been far easier to move with the market.

There is probably a business-school case study there for branding yourself with the problem area rather than a single solution, esp. in fast-moving domains such as tech.


They just proved the hard way that there is no one-size-fits-all data infrastructure based on Hadoop that makes sense financially. Most of the value comes from a deep understanding of data access patterns and having the right solution.


It reads similar to a leveraged buyout. Easier to do things when the company doesn't have to show its books.

Also interesting is that these are market areas that AWS is starting to expand into.


Been quite a ride to watch, feels like a surprise every couple of years!

Looking ahead, on one side we're seeing a privately hosted (on-prem + cloud) revival due to multicloud / lock-in distrust plus extreme cloud markups for AI/ML (GPUs etc.), and, for cloud, insane 30-50% YoY growth rates even at $B scales. On the other, I'm not sure how that lines up w/ Cloudera's current products and the investment team's R&D/M&A appetite.

It also makes me wonder what the value of being in n-th place is. We're reviewing which new connectors to prioritize for our own viz tool dev, and cloud-native offerings like Snowflake/Databricks/Redshift are generally ahead of on-prem ones, and we're not the only ones. So what's the value of the current growth, and of paying premiums to leapfrog (ML, GPUs, etc.), esp. when you have the kind of footprint Cloudera earned?


>After all, we invented the whole idea of Big Data.

Really?


To all Cloudera clients: pack your stuff and try to get an alternative vendor. Private equity will not bring anything valuable to this business except higher prices, more aggressive sales, and poor customer service.


As seen with ExtJS/Sencha and TravisCI. I'm sure the story will be the same: chill for about 6 months to 1 year, then lay off all the engineers, lay off the support while shipping it overseas, and then hope long-tail subscriptions and "trapped" organizations continue to pay the fees.


Are there any examples of tech PE companies doing good? Seems like all I hear are horror stories


It's so obvious, isn't it?


If you're looking for an alternative vendor: We've started a new company (called Stackable) to build a distribution for all these open source "data" tools (e.g. Apache Kafka, Apache NiFi, Apache Spark, ...).

I'm a committer for Apache HBase and Apache Hive myself, and I've been in this space for 13 years now. Yes, the hype is over but there are tons of companies using this stuff in production and tons of companies choosing it for new projects.

We're trying to tackle the biggest pain points our customers had: lack of flexibility (i.e. being locked into specific versions for ages), CDH/HDP not being built on Infrastructure as Code principles, security being hard to do, ...

The three-sentence (buzzword heavy) technical summary: Our distro uses the Kubernetes control plane but we've developed a custom Kubelet that runs software using systemd as its backend as well as a bunch of operators (all in Rust...). This allows us to leverage the best of both worlds and also allows hybrid scenarios (part in containers, part on "bare metal"). We've replaced Ranger/Sentry with OpenPolicyAgent.

If anyone's interested feel free to reach out to me (E-Mail in profile / https://www.stackable.de/en ).


Your cookie pop-up is in German even on the English website.


Thanks for the heads-up, I'll forward the feedback.


After the merger with Hortonworks, they already went extremely aggressive, and I didn't think it was possible for them to have an even dimmer future...


Serious question: Has PE M&A ever led to an improved product?

(Maybe that's just a stupid rather than a serious question)


You could make an argument that Berkshire Hathaway is the largest M&A firm ever. But it's not really private equity (though it's not really public, either).


Dell? Silver Lake played a big role in that IIRC.


Dell's approach was actually more like the classic "taking a company private again", where you use public equity markets to grow big but keep control, then take it private at terms that don't really reward shareholders for the massive growth. This looks like the modern variety of PE capturing predictable revenues from a large, mature client base that can pay their fund the expected returns for the next 5-7 years. It's boring as hell and never means (a) a better product, or (b) a bigger pay-off for employees.


Silver Lake was only a source for money, not “management expertise” on that deal.


Having worked for a Silver Lake-funded company (I originally called it a startup, but that's not fair to say anymore for a private company that now makes billions), I can assure you that they don't take a back seat to how the company is run (that's not to say they take a direct hands-on approach, either).


In 2017, I joined a company that had been spun out of Ebay and bought by a PE firm. The firm invested a large amount of "growth capital" in the biz to transform the product from a software license to a cloud-based service. This transition not only increased our revenue exponentially but also gave us the ability to analyze data on how customers were using our product (prior to this, we had zero visibility into how customers used our on-prem product). Using this data, we were able to better serve our customers & partners and improve the overall experience of using the product. A few years later, we sold the company to a large company for a pretty penny (>$1B).

This is all definitely anecdata but, IMO, being backed by a PE firm forced us to focus on revenue (really EBITDA) alongside product growth - a mechanism that forced us to consider the impact of each product decision we made. I think this ultimately helped us keep a steady pulse on the market w/o chasing every shiny new trend that popped up.


Dynatrace is a monitoring solution and company that recently went public again after being initially taken private by a PE.

I believe their offering significantly improved during that period.

(I was a Professional Services employee for a few years)


Limit that to KKR and you are going to see many "good" examples...


Hilton hotels in my opinion got much better after the takeover


Exactly this, don't get squeezed.


This 100%


The data industry continues to hype this idea of “multi-cloud,” but then the “modern data stack” is centralized around a single warehouse and nobody sees any irony in that.

The big bet we’re making at Splitgraph [0] is that the next wave of data engineering will take a more decentralized, “data mesh” type approach to enterprise architecture. “Data gravity” really does exist - it’s expensive to move, in terms of both cost and operational complexity. And with increasing specialization of analytical databases, a single source of truth will become unrealistic. So instead of bringing the data to the query, why not bring the query to the data? All we need for that is a set of read only credentials. And yes, it should also be easy to warehouse your data, but it doesn’t need to be the default.

Cloudera mentions they bought DataCoral to help with data integration and connectors. They’ve correctly identified the problem - data sprawl and fragmentation will inevitably grow - but I’m not sure they have the right solution.

Data integration is important, but it’s a moving target, which is why it calls for a collaborative open source solution. This is why so many new startups, like AirByte most recently, are coalescing around the Singer taps that Stitch left behind after its acquisition by Talend.

We also support using Singer taps to ingest data into versioned Splitgraph images [1], so we’re excited to see more collaboration on maintenance of taps. For us it’s a useful feature, but it should be just that — a feature. Is there really a need to replicate all of your data before you can even query it? Or would you rather experiment by directly querying its source?
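
For context, the Singer format is just newline-delimited JSON messages on stdout (SCHEMA / RECORD / STATE), which is what makes taps easy to collaborate on and easy for any target to consume. A toy, hypothetical tap looks roughly like this:

    # Toy, hypothetical Singer tap: emits SCHEMA/RECORD/STATE messages as
    # newline-delimited JSON on stdout; any Singer target can consume it via a pipe.
    import json
    import sys

    def emit(message):
        sys.stdout.write(json.dumps(message) + "\n")

    emit({"type": "SCHEMA", "stream": "users",
          "schema": {"properties": {"id": {"type": "integer"},
                                    "email": {"type": "string"}}},
          "key_properties": ["id"]})
    emit({"type": "RECORD", "stream": "users",
          "record": {"id": 1, "email": "a@example.com"}})
    emit({"type": "STATE", "value": {"users": {"last_seen_id": 1}}})

    # Usage (shell): python tap_users.py | some-singer-target --config config.json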

[0] https://www.splitgraph.com

[1] unreleased and undocumented atm, but it does work. We’re hiring, especially on the frontend if you want to help build the web UI. See profile.


This comment seems unrelated to the original story or the comment you're responding to; it mostly reads like an ad.


I was under the impression that private equity was mainly a vehicle for the rich to get richer. Serious question: Do companies have no choice but to go this route or why are they doing this?


Not really. PE is mainly a vehicle for smart Wall Streeters to fleece credulous and incompetent pension fund managers.

But the latest boom in PE is a function of: higher than usual levels of credulity around private vs public performance (but PE funds have lower volatility???), and the wave of free money coming out of the Fed. In 2010, lots of PE funds got bailed out after making unbelievably bad bets with no economic rationale (the worst being Blackstone Real Estate, they should have gone bust, they now manage $250bn in RE).

Tech companies are hot right now because of Vista Equity's numbers. Tech companies appear nominally attractive in cash flow terms because so many tech companies pay staff non-cash. So you can acquire at an unreasonable multiple, load up with debt, and then leave employees holding your bags if it goes wrong...self-evidently though, no actual value is being created here beyond playing the capital cycle.

Most of the growth in PE is correlated to the growth in investors (both in the funds, and financing) who don't understand what they are doing. If you look at Europe, private equity activity has exploded higher with money-printing from the ECB and the growth in direct lending/leveraged loan markets (these have gone from 20bn to 150bn in five years or so...where it was in 2007, and lots of very unsophisticated private debt funds with poor incentives). Shadow banking all over again, no-one knows where the money is coming from, no-one knows where it is going.


Many struggling technology companies will go private to allow them to substantially invest in R&D. This allows them to make those expenditures without the overhead and distractions of being a public company where investors are expecting short-term returns. Further, these private equity companies may then roll up several related businesses in an attempt to create synergies through sales or product.

Ultimate Software is a recent example of this: https://www.forbes.com/sites/antoinegara/2019/03/01/an-insid...


Yeah, I think there's a pretty big difference between tech companies going private (often to double down on R&D investment) and mature consumer brands going private (which is usually a way to extract profits and milk the carcass dry, see: Toys R Us, Olive Garden).


This +100. You see the latter in mature enterprise software a lot more these days, and it is not fun to be a part of it. You get to watch software development transition from a profit generator to a cost center, which changes the game completely.


Dell did this in 2013 and seems to be doing fine. https://www.dell.com/learn/us/en/uscorp1/secure/acq-dell-sil...

This changed my impression of going-private moves, although in Dell's case the buyer is more like a founder with help from the investor.


Dell was a special case .... going private allowed them to avoid shooting themselves in the foot like HP


Could you expand on this a little? What did HP do, I'm just not familiar


Absolutely a special case.


I'm not an expert so below is only my personal take on the matter.

Being private allows for less accountability to people who might not be acquainted with the business, want short-term profits, or do not share the owners' long-term vision for the company. All of these allow for more freedom of movement and limit the damage to those who understand the risks. On the other side, the company has a more limited pool of liquidity sources, and each of the investors might have a stronger voice in shareholder meetings.

As a result, the public investor might not share in the profits of the business, but might not incur losses because of it either. Therefore, private equity is just another business organization model with its own pros and cons.


Stock markets only look a quarter or two in the future. It's hard to do anything long-term like invest in R&D unless your numbers are so spectacular you can slip it in there. Very few public companies have numbers that good.


>Stock markets only look a quarter or two in the future.

How does this square with the valuations of companies like Amazon and Tesla whose share price performance is certainly not based on the next quarter or two worth of earnings.

Or pick any other top companies' stocks which are collectively driving the market, presumably because they are expected to deliver years more of growth and cash flow.


The people with the authority to make the deal get paid huge sums for pulling the trigger. After that it doesn't matter to them that the company gets digested from the inside out.


AWS, GCP and Azure ate Cloudera's lunch.


A huge portion of the growth in cloud came from VPSes being relatively cheaper than places that sold physical server rack space. Once the foothold was in there, the value-add of services on top became a huge revenue generator.


KKR! Expect half your staff to be gone next year after massive cost cutting and leveraging.


Clayton, Dubilier & Rice (CD&R) and KKR.

When I see the name of KKR mentioned...


I don't even know what they do anymore. They used to make money by just raiding their portfolio companies' pensions, but now that corporations don't offer private pensions, I just don't know. Maybe they're actually improving the companies?


For a reference:

Hadoop on Google Trends peaked in 2015: https://trends.google.com/trends/explore?date=all&geo=US&q=h...

"Hadoop is dead" seems to be a popular topic in the past few years. https://www.google.com/search?q=hadoop+is+dead


Good points of evidence; however, you can't read too much into Google Trends. A lot of technologies peak when they are new and then fall to a steady state.

For example "computers" peaked pre-2004: https://trends.google.com/trends/explore?date=all&geo=US&q=c...

Javascript: https://trends.google.com/trends/explore?date=all&geo=US&q=j...

Machine learning: https://trends.google.com/trends/explore?date=all&geo=US&q=m...


In this case it's accurate. Hadoop is largely dead.

YARN, Hive, HDFS, MapReduce have been replaced by Kubernetes, Snowflake, S3, Spark.


And even that will continue to change.

Kubernetes is overused right now, it has its place but it's not nearly universally the right tool for the job.

Snowflake will eventually fall to something else due to its poor economics.

S3 and Spark, though, I anticipate will be around for a good few years, and if they lose out it will be to imitators or evolutionary equivalents.


Kubernetes works very well for SaaS. The big problem is management of Kubernetes itself, but so far our company has had good experiences using Amazon EKS. I would not say it is perfect. However, it does allow devs to focus (mostly) on problems related to actual applications.


> Snowflake will eventually fall to something else due to its poor economics.

Can you elaborate on this point? What’s wrong with their model?


Entry-level Snowflake is $2 per hour, or $48 per day if you convert the metric.

Entry level DO compute instance (so boring, I know) is $5 per month.

There is a large gulf of pricing ranges that can undercut them in the coming years. It doesn't matter now because a lot of analysis projects are disconnected from market forces, mostly being darling-child greenfield projects or new revenue streams. The moment the next AI winter comes along, a lot of those projects will start to look like legacy code, and the original thought process turns into worrying about cost centers.

And my understanding is they jacked up the prices to boost the earnings to boost the stock price leading up to IPO. They can be disrupted much faster than they will decide to let off that pedal.


Unless you plan to throw out everything else and only ever use Snowflake, Hive Metastore is still as important as ever.

Almost every Big Data tool only works with Hive Metastore (and Amazon Glue Catalog, but the compatibility is not 100%).


Yeah, very few companies are running Hadoop clusters on premise these days the way many were at least trying to 5 years ago.



