Launch HN: Dashdive (YC W23) – Track your cloud costs precisely
121 points by ashug 8 months ago | 52 comments
Hi, HN. We (Adam, Micah and Ben) are excited to show you Dashdive (https://www.dashdive.com/), which calculates the cloud cost incurred by each user action taken in your product. There’s a demo video at https://www.dashdive.com/#video and an interactive demo here: https://demo.dashdive.com.

We talked to dozens of software engineers and kept hearing about three problems caused by poor cloud cost observability:

(1) Cost anomalies are slow to detect and hard to diagnose. For example, a computer vision company noticed their AWS costs spiking one month. Costs accrued until they identified the culprit: one of their customers had put up a life-size cutout of John Wayne, and they were running non-stop facial recognition on it.

(2) No cost accountability in big orgs. For example, a public tech company’s top priority last year was to increase gross margin. But they had no way to identify the highest cost managers/products or measure improvement despite tagging efforts.

(3) Uncertain and variable per-customer gross margins. For example, a SaaS startup had one customer generating >50% of its revenue. That customer’s usage of certain features had recently 1,000x’ed, and they weren’t sure the contract was still profitable.

(If you’ve had an experience like this, we’d love to hear about it in the comments.)

We built Dashdive because none of the existing cloud cost dashboard products solves all three of these problems; doing so often requires sub-resource cost attribution.

Existing tools combine AWS, GCP, Datadog, Snowflake, etc. cost data in a single dashboard with additional features like alerting and cost cutting recommendations. This is sufficient in many cases, but it falls short when a company (a) wants per-customer, per-team or per-feature cost visibility and (b) has a multitenant architecture.

By contrast, Dashdive uses observability tools to collect granular cloud usage data at the level of individual user actions (e.g. each API call or database transaction). We attribute this activity to the corresponding feature and the responsible customer and team, and estimate its cost based on the applicable rate. The result is more detailed cost and usage data than can be obtained with tagging. This information can be used to detect anomalies in real time and identify costly teams, features and customers. One of our customers is even using Dashdive to charge customers for their cloud usage.

We use Kafka to ingest large volumes (>100m/day) of product usage events, and our web dashboard supports real-time querying thanks to ClickHouse. This makes it fast and easy to answer questions like: “Over the past 14 days, how much vCPU time did customer X use on Kubernetes cluster A, and how much did that cost me?” You can answer such questions even when the same container or pod is shared by multiple customers, features and/or teams.

You can test drive the product with example data here: https://demo.dashdive.com/. Given the high per-customer cost of our infrastructure and the manual steps required for setup on our part, we don’t offer self-serve onboarding or a public “free tier” to monitor your own cloud usage, but this demo gives a basic view of our product.

Right now, Dashdive supports S3 and S3-compatible object storage providers. We’re working to add support for other providers and services, particularly compute services (EC2, GCP VMs, ECS, EKS, GKE, etc.).

If there’s any service in particular you want to see supported, please tell us in the comments. We’re eager to see your comments, questions, concerns, etc.




How do you aim to attribute cost for multi-tenant services like databases?

For example, if I have a database costing $100/m, and 4 equal sized customers (by data volume, compute, requests), you could say that each is costing $25/m. However if I'm over provisioned by 50%, then an incremental user is only actually $12.50. To understand this difference requires understanding resource utilisation, which particularly for compute can be really hard. There's also a difference between desired headroom (for redundancy) and undesired headroom (due to inefficiencies), and these will be nuanced per company or service deployment.

This could also be taken in reverse – rather than the cost for an incremental user, the savings from removing a user. Given 2x $100/m databases with 2 customers, one accounting for 60% and one for 40%, losing the larger one would allow a 50% reduction on cost, but losing the smaller one would not.

What's your thinking here? Are you trying to communicate the incremental cost of a user? Are you trying to communicate the cost of a user against the current infrastructure, whatever that infrastructure might be?


TL;DR - We treat the per-customer usage data as the ground truth, and the per-customer cost can vary based on parameters chosen by the Dashdive user and their preferred mental model.

At a minimum, every customer is assigned the cost resulting from the usage that is directly attributable to them. This "directly attributable" figure is obtained by "integrating" vCPU-seconds and RAM-seconds on a per-DB-query basis, per-API-invocation basis, or similar. For example, if Customer X used 5 vCPU seconds and 7 RAM-GiB-seconds due to queries over a period of 10 seconds on our RDS cluster with total capacity 1 vCPU and 2 GiB RAM, then they directly utilized 50% of vCPU capacity and [7 GiB-sec / (2 GiB * 10 sec)] = 35% of RAM capacity over that period.
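In code, the arithmetic looks roughly like this (a minimal sketch; the function and variable names are just for illustration, not our actual API):

  # Minimal sketch of the "directly attributable" utilization math described
  # above; names are just for illustration.
  def direct_utilization(vcpu_seconds, ram_gib_seconds,
                         capacity_vcpu, capacity_ram_gib, window_seconds):
      """Fraction of cluster capacity a customer directly consumed in a window."""
      vcpu_fraction = vcpu_seconds / (capacity_vcpu * window_seconds)
      ram_fraction = ram_gib_seconds / (capacity_ram_gib * window_seconds)
      return vcpu_fraction, ram_fraction

  # Customer X: 5 vCPU-seconds and 7 GiB-seconds over a 10-second window
  # on a 1 vCPU / 2 GiB cluster.
  print(direct_utilization(5, 7, 1, 2, 10))  # (0.5, 0.35) -> 50% vCPU, 35% RAM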

The question remains of how to distribute the cost of the un-utilized capacity over that period amongst the customers, perhaps distributing some portion to special "desired headroom" and "undesired headroom" values. As you mentioned, the answer is subjective and can vary between Dashdive users (or even over time for the same Dashdive user, e.g. a user decides they can reduce their desired headroom from 30% to 20%). The only sensible approach in our opinion is to make this configurable for the user, with sane and explicit defaults.

Let's go through both your examples to illustrate. In example 1, each of the 4 equally sized customers would be assigned only 12.5% of the total cost of the cluster. The dashboard would show that, by default, 30% headroom is desired, so out of the remaining 50% capacity, 30% would be marked as desired headroom, and 20% would be marked as undesired headroom. The user can override the desired headroom percentage. Although in our opinion it is most correct for all headroom to be treated as a fixed cost in multitenant scenarios, we would also provide the option to distribute both/either headroom types amongst all customers, either proportionally or equally.
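To make the defaults concrete, here's a rough sketch of that allocation for example 1 (hypothetical names, assuming the default 30% desired headroom; not our actual implementation):

  # Sketch of the default allocation for example 1: 4 equal customers on a
  # $100/m cluster, each directly using 12.5% of capacity, 30% desired headroom.
  def allocate(total_cost, direct_fractions, desired_headroom=0.30,
               distribute_headroom=False):
      used = sum(direct_fractions.values())              # 0.5 here
      idle = 1.0 - used                                   # 0.5
      desired = min(desired_headroom, idle)               # 0.3
      undesired = idle - desired                          # 0.2
      costs = {c: f * total_cost for c, f in direct_fractions.items()}
      if distribute_headroom:
          # Optionally spread all headroom across customers proportionally.
          for c, f in direct_fractions.items():
              costs[c] += (f / used) * idle * total_cost
          desired = undesired = 0.0
      return costs, desired * total_cost, undesired * total_cost

  costs, desired_cost, undesired_cost = allocate(
      100, {"A": 0.125, "B": 0.125, "C": 0.125, "D": 0.125})
  # costs == {'A': 12.5, ...}; desired_cost == 30.0; undesired_cost == 20.0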

For example 2, our model is not sophisticated enough to capture the nuance that losing only the larger customer would allow cost reduction. Assuming both customers used both databases (let's say they're replicas or shards), and 0% headroom, we would simply assign 40% of costs to the smaller customer and 60% of costs to the larger one. This is subtle, but the missed nuance is only important if $100/m is the finest-grained resource you can get. Otherwise, if you lose the 40% customer, you can switch to a 2x $60/m DB, for example.

This is a very astute callout! It has come up a couple times as a point of concern from prospective customers. Would be keen to hear if this diverges from your expectations at all.


Thank you for the very detailed response; it all sounds excellent and it's great to see you've been thinking these things through. I don't really know what my expectations were, but this seems like a good combination of flexibility and opinions based in real-world experience. You're right that my second example is unlikely to occur at scale; in general, the bigger you get the more granular you can be, so I doubt that'll be an issue.


Very cool!

Feature request: I have really struggled with turning the thing costing me money off in AWS.

If, with the right master credentials, I could consistently and easily do that somehow, that'd be a 10x feature. If you made that use case free, you'd get tons of installations at the top of your sales funnel from people who desperately need this.

edit: This used to say "in your app" and that wasn't quite what I want, so I changed that language, but jedberg's objections in the comments below were, and are, valid concerns with what I was stating and any implementation.


We solved that exact problem with our open source tool Resoto. Specifically our "Defrag" module, which cleans up unused and expired resources:

https://resoto.com/defrag
https://github.com/someengineering/resoto

The magic behind the cleanup is Resoto's inventory graph: the graph captures the cleanup steps for each individual AWS resource.

One of Resoto's users, D2iQ (now part of Nutanix), reduced their monthly cloud bill by ~78%, from $561K to $122K per month. There's a step-by-step tutorial on our blog showing how they did it.

I don't mean to hijack Dashdive's thunder here though, congrats on the launch!


This is the first time I'm hearing about your company but I am already a huge fan of the tools you're building. I've had a need for most of them and next time I am met with those needs, I will use your tools. Please keep going!


That would be a security nightmare. You don't want to give such powerful credentials to anyone, much less a 3rd party.

But a good stopgap would be a feature to spit out an API command that someone could run (or a CloudFormation or TF file) where you can put your own credentials in and run it yourself.


If it could only turn things off, then perhaps that's less of a nightmare in some settings.


The only way it would be effective is if that credential had broad abilities to destroy, and I wouldn't want such a credential to get stolen. It would be bad enough for your most trusted operator to have it, honestly.

The best way to do it would be to run the delete with no access, see what permission errors you get, and then only give those permissions until you've successfully deleted the object.

The safest way (but obviously more work) to do one-off work like this is to start permissionless and slowly open up. There are tools that can help with this, extracting the permission errors and generating the files to update the permissions.
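For illustration, a rough sketch of that loop with boto3; error codes and message formats vary by AWS service (EC2 uses UnauthorizedOperation, many others use AccessDenied), so treat this as an assumption-laden example rather than a recipe:

  import boto3
  from botocore.exceptions import ClientError

  def try_terminate(instance_id):
      ec2 = boto3.client("ec2")
      try:
          ec2.terminate_instances(InstanceIds=[instance_id])
          print("terminated", instance_id)
      except ClientError as e:
          err = e.response["Error"]
          if err["Code"] in ("UnauthorizedOperation", "AccessDenied"):
              # The message usually names the denied IAM action, e.g.
              # "... is not authorized to perform: ec2:TerminateInstances ..."
              print("missing permission:", err["Message"])
          else:
              raise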


Perhaps give Flowpipe [1] a try? It provides "Pipelines for DevOps", including a library of AWS actions [2], that can be run on a schedule (e.g. daily) or a trigger (e.g. instance start) to do things like turn off, update or delete expensive assets. It can also be combined with Steampipe [3] and the queries from the AWS Thrifty mod [4] to do common queries. We'd love your feedback if you give it a spin!

1 - https://github.com/turbot/flowpipe
2 - https://github.com/turbot/flowpipe-mod-aws
3 - https://github.com/turbot/steampipe
4 - https://github.com/turbot/steampipe-mod-aws-thrifty


What you are describing might be problematic; it's a naive approach to saving cost in the cloud. Companies that would pay big bucks for a platform like this would, I imagine, optimize cost by managing Reserved Instance commitments over 1-year/3-year terms (or sometimes monthly, via non-AWS platforms). The net savings are much higher than just turning instances off and on.

Managing cost without impacting any infrastructure is the right way to do it in my opinion.


It's very easy to do with Cloudchipr (https://app.cloudchipr.com). It allows you to sort and filter by product family (Compute, DataTransfer, Storage, etc.), then go down the tree and identify the most expensive resources in each category.

The nice part here is that when you see an instance burning $3000 a month, you can click on it and see a cost breakdown based on traffic, compute, etc.


But is there a way to stop that instance, though?


Sure thing! You can stop, delete, or back up and then delete. It supports all the actions that the cloud provider supports.


The CNCF project Cloud Custodian fits these sorts of use cases pretty well, and supports periodic or event-based triggers. https://cloudcustodian.io/docs/aws/examples/


This is an interesting point, and we could definitely consider something like this. Where exactly do you run into problems?

For example, let's say you've figured out that a particular EC2 instance or database is too costly. What is the sticking point for you in turning it off? Is it that the resource has other essential functions unrelated to the cost spike? Or is it the identification of the exact resource that's the problem?


When we evolved to SSO and subaccounts with various roles/access, there were resources running/used by different accounts. I would see the instance in the costs. But then I'd have to find that instance, started by another dev, and get access to shut it off. And even though I'm the main account owner (which, to me, means I should be able to nuke whatever, since I pay the bills), I always had trouble getting to it and getting the permissions for what I wanted.

I used Vantage, which helped me see the problem, but then taking action on it was traumatic.

The barriers are:

- who owns it?

- what service is it in (if it is logs, for example)?

- where is the screen it is on?

- how do I get the permissions to kill it?


Makes a lot of sense - in my opinion AWS doesn't do the best job with this. We had a similar problem with EKS where even as the root user I couldn't view cluster details (https://medium.com/@elliotgraebert/comparing-the-top-eight-m....).

I agree that this would be a great feature. To be honest, our product isn't currently focused on this sort of automatic management of resource lifecycle; we're much better at data collection. But thank you for flagging this! We'll definitely keep it in mind as we add support for compute services (right now we only support S3 and there's nothing to "spin down").

Edit: The part less related to permissions (how can I kill it) and more related to discoverability (which resource is it) is more adjacent to what we've already built and is something we can take a look at soon. Perhaps we can take a crack at the permissions aspect afterwards.


That our industry does basically no cost accounting is insane.

Your product is something that most managers don’t even realize they are missing.

I’m kind of obsessed with activity-based costing as applied to software firms. Hit me up at my username at gmail if you want to nerd out about managerial accounting and/or if you might need some marketing material! I could write for days on the subject!


Awesome!! We'll send you an email shortly :)


I think you are swimming adjacent to usage-based pricing products like Stigg. They also keep track of usage, and though their focus is not on what it costs you, that follows readily from usage.

Are you thinking of expanding in that direction, or integrating with such products? I would look favorably at a product that combined pricing (them) and observability (you). You're solving basically the same problem; don't make me buy two products.


Great point. We could definitely add usage-based pricing (UBP) adjacent features - for example, a Stripe integration and user-defined rules to auto-calculate invoices based on incurred cloud usage per-customer. Would that be useful?

However, it's not always possible to infer one's own costs from UBP events. In UBP products, the user defines what constitutes a "usage event" (e.g. "customer generates a PDF") and so these events can be quite disconnected from underlying cloud usage. In other words, there's nothing that prevents some "someone generated a PDF" events from incurring large amounts of EC2 usage while other "someone generated a PDF" events incur very little EC2 usage, depending on the input parameters to the workload. And in most UBP scenarios, this difference in underlying cloud usage from PDF generation to PDF generation is not taken into account; often all UBP events of a given type are billed at the same rate. In fact, we've seen this exact issue in the wild: namely, a company implementing UBP but still being unsure about profit margin because certain UBP event types had high variance in cloud usage per-event.

One company is planning to use Dashdive's S3 storage data to charge their customers based on usage, so in some cases the data we collect can serve as a substitute for UBP.

I agree that it would be more convenient if we also offered user-defined UBP events. This way, we could be a single vendor for the folks that want both usage monitoring and usage-based billing, where the UBP events don't necessarily align super well with underlying cloud usage.


Excellent points. Now that I think about it, the two products are naturally offered together, because usage-based pricing should be based on underlying costs, and monitoring those costs is what you provide. If you had better visibility into your costs, you could set usage-based prices and price tiers in a more principled manner.


Grats on the launch!

Would be very interested to see this working with GCP Cloud Run.


Thanks! It's on the roadmap :)


Congrats on the launch. I'll never forget at a previous company the CEO came to the engineering team one day to triumphantly announce "I've just signed X for $Y/m, if we give them a dedicated instance!", to which we all gasped in horror, knowing that a dedicated instance would cost at least 5x.

If that sort of understanding can be better communicated throughout a company, particularly to sales, then that's great news.


Haha, we've heard many a story like this one over the months!

An advisor suggested we offer a sort of "pricing heuristics" deliverable to sales as part of our product (e.g. "if a customer needs feature foo, add $bar to the price"), and your story makes me think he's onto something.


Looks like a good idea; my only gripe is that the base plan is too expensive for MVP products that aren't earning anything yet.


You're right; it's a bit inaccessible at the moment. We're planning to offer a more affordable tier in the next 1-2 months. A bit more context here: https://news.ycombinator.com/item?id=39178753#39186948.


Very cool and awesome stuff! Keeping track of cloud sprawl/spend can get super stressful, especially if you are multi-cloud! Excited to see what you all continue to ship!!


Do you plan to support on-prem usage (if not cost) in the future?

Also, what about utilisation rates?


Usage is what we track directly, and then we apply the cloud provider's billing rules for the given service (e.g. S3) to calculate the resulting costs. So utilization rates for something like a Kubernetes cluster are easy to derive with the data already collected - just take the usage we've tracked and divide by the total resources available to the cluster. We haven't finished the k8s offering yet, but this would be a great feature / view for us to include (the same goes for most other compute offerings, e.g. EC2, ECS).

The same goes for on-prem. We don't have any on-prem customers currently, but it would be easy to add a feature where you input the total capacity and/or monthly cost of your on-prem infrastructure, and use the collected usage data to calculate utilization rate and "effective cost" incurred by each feature, customer, etc. Thanks for the questions!
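For concreteness, the utilization-rate and "effective cost" arithmetic would look roughly like this (hypothetical inputs and names, not our actual schema):

  # Hypothetical inputs: tracked usage per customer (vCPU-seconds this month),
  # total capacity in the same unit, and the monthly cost of that capacity.
  def utilization_and_effective_cost(usage_by_customer, total_capacity, monthly_cost):
      total_used = sum(usage_by_customer.values())
      utilization = total_used / total_capacity
      effective_cost = {cust: (used / total_capacity) * monthly_cost
                        for cust, used in usage_by_customer.items()}
      return utilization, effective_cost

  # e.g. an on-prem cluster with 1,000,000 vCPU-seconds of monthly capacity
  # costing $5,000/month:
  util, costs = utilization_and_effective_cost(
      {"cust_a": 300_000, "cust_b": 200_000}, 1_000_000, 5_000)
  # util == 0.5; costs == {'cust_a': 1500.0, 'cust_b': 1000.0}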


Congratulations on the launch!

However, as a developer, the pricing seems a bit steep for me to give it a try. Any plans for more affordable options or perhaps a developer-friendly tier in the future?


Thanks! Yes, we're working on making the product more accessible. Right now, for every new customer, we have to manually provision and manage some additional infrastructure. We're worried we could quickly get overextended in both time and cost if we have to do this for lots of users in a free tier for example.

It's on our roadmap in the next 1-2 months to eliminate these manual steps and make these last parts of our infra multitenant. At that point, we plan to release a cheaper tier for individual devs.


Many engineering teams unquestionably find this challenging. Just a quick question: does it solely track usage, or can it also aid in cutting down costs?


You can use the usage and cost data Dashdive collects to identify cost spikes or ongoing inefficiencies (e.g. this particular feature is using more vCPU than should be necessary). But we won't do any automatic cost cutting for you (some products allow you to buy reserved instances or rightsize directly from within their app).


Congrats on the launch team! Love the idea!


I've been using vantage.sh - how are you thinking about differentiation there?


The key differentiation from Vantage and other similar products is the level of granularity.

Vantage is in the category mentioned above: it combines AWS, GCP, Datadog, Snowflake, etc. cost data in a single dashboard and supports tagging. For example, if I have a single tenant architecture where every customer has their own Postgres RDS instance, I can tag each RDS node with `customerId:XXX`. Then I can get cost broken down by customer ID in Vantage.

However, if my entire app (including every customer) uses the same large RDS instance, or if I'm using a DBaaS like Supabase, tools like Vantage, which rely on tagging at the resource level, cannot show a breakdown of usage or cost per customer. By contrast, we record each query to your monolith or "serverless" DB (SELECT/INSERT/UPDATE) along with the vCPU and memory consumed, tag the query with `customerId` and other relevant attributes, and calculate the cost incurred based on the billing rules of RDS or Supabase.
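For concreteness, here's a rough sketch of the kind of per-query event this implies (field names invented for illustration, not our actual schema):

  from dataclasses import dataclass

  @dataclass
  class QueryUsageEvent:
      customer_id: str
      team_id: str
      feature_id: str
      statement_type: str      # SELECT / INSERT / UPDATE
      vcpu_seconds: float
      ram_gib_seconds: float

  def estimate_cost(event, usd_per_vcpu_second, usd_per_gib_second):
      # Apply (simplified) per-unit rates to one query's measured usage.
      return (event.vcpu_seconds * usd_per_vcpu_second
              + event.ram_gib_seconds * usd_per_gib_second)

  evt = QueryUsageEvent("cust_42", "team_payments", "invoice_export",
                        "SELECT", vcpu_seconds=0.012, ram_gib_seconds=0.05)
  print(estimate_cost(evt, usd_per_vcpu_second=1e-5, usd_per_gib_second=2e-6))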


Congrats on the launch!


Congrats on launching! How does it differ from Spot.io?


Thank you! Spot.io falls broadly into the same category as Vantage/Ternary/others, so the same answer as here applies: https://news.ycombinator.com/item?id=39183504.

In a sentence, these tools display the same cost data available within AWS - where max granularity is per-database or per-EC2 instance - whereas Dashdive can accurately attribute portions of usage on the same DB or instance to different features/customers.


How are you ingesting from Kinesis to ClickHouse? Are you using some custom sink connector, or processes on EC2 or Lambda?


We actually use Kafka rather than Kinesis, although they're very similar. For writing to ClickHouse from Kafka, we use the ClickHouse Kafka sink connector: https://github.com/ClickHouse/clickhouse-kafka-connect.


We are actually trying something similar, possibly Kinesis + ClickHouse or Kafka + ClickHouse. Currently Kinesis seems easier to deal with, but there isn't a good integration or sink connector available to process records at scale from Kinesis into ClickHouse. Did you ever run into similar problems where you had to process records at huge scale to insert into ClickHouse without much delay?

One more thing: Kinesis can have duplicates, while Kafka offers exactly-once delivery.


I'm not familiar with Kinesis's sink APIs, but yes I'd imagine you'll have to write your own connector from scratch.

To answer your question, though, no: in the Kafka connector, the frequency of inserts into ClickHouse is configurable relatively independent of the batch size, so you don't need massive scale for real-time CH inserts. To save you a couple hours, here's an example config for the connector:

  # Snippet from connect-distributed.properties

  # Max bytes per batch: 1 GB
  fetch.max.bytes=1000000000
  consumer.fetch.max.bytes=1000000000
  max.partition.fetch.bytes=1000000000
  consumer.max.partition.fetch.bytes=1000000000

  # Max age per batch: 2 seconds
  fetch.max.wait.ms=2000
  consumer.fetch.max.wait.ms=2000

  # Max records per batch: 1 million
  max.poll.records=1000000
  consumer.max.poll.records=1000000

  # Min bytes per batch: 500 MB
  fetch.min.bytes=500000000
  consumer.fetch.min.bytes=500000000
You also might need to increase `message.max.bytes` on the broker/cluster side.

If you're still deciding, I'd recommend Kafka over Kinesis because (1) it's open source, so you have more options (e.g. self-hosting, Confluent, or AWS MSK), and (2) it has a much bigger community, meaning better support, more StackOverflow answers, a plug-and-play CH Kafka connector, etc.


Thanks, these configs are helpful.


How does your product differentiate from Yotascale?


Yotascale is in many ways similar to Vantage, so a similar answer to this one (https://news.ycombinator.com/item?id=39178753#39181486) applies.

When we were originally researching to see if anyone had done something like Dashdive already (i.e., specifically applying observability tools / high volume event ingestion to cloud costs), I did manage to find a Yotascale video in which they mentioned defining custom usage events in Kubernetes for a specific customer. It seemed more like a custom feature than a generally available part of their product, but I could be mistaken.


cool/nice work


How does this compare to Ternary?


Ternary is similar to Vantage in its offering so this answer (https://news.ycombinator.com/item?id=39178753#39181486) also applies here.

There are quite a few cloud cost tools out there which use AWS's cost and usage reports (or GCP/Azure equivalents) as their sources of truth, and as a consequence their data is largely based on tagging cloud resources with attributes of interest (e.g. EC2 instance XXX has tags customerId:ABC, teamId:DEF, featureId:GHI). These include Ternary, Vantage, Yotascale, and others (CloudChipr, CloudZero, Archera, Cloudthread, Finout, Spot by NetApp, DoiT, Tailwarden, CAST AI, Densify, GorillaStack, Economize Cloud). Some of these offer AI-based automatic tagging as well.

But even so, if I - for example - have lots of k8s replica Pods which serve most of my application traffic, I can't use any of these products to figure out which customers, API endpoints, code paths, features, etc. are costing me the most. At best I could tag the entire Deployment, or maybe even each Pod. But the problem is that every Pod is serving lots of endpoints, customers, and features. However, Dashdive can give you this info.

From a technical implementation standpoint, Dashdive is much closer to application performance monitoring (APM) products or usage based billing products than it is to most cloud cost dashboard products.



