From an egress + storage cost standpoint, absolutely - that ends up being a big factor for these large-scale data systems.

There’s a prior discussion on HN about that post: https://news.ycombinator.com/item?id=38118577

And full disclosure: I'm the author of both posts - I've just shifted my writing to focus more on the company blog.


Author here. The basic idea is that you want some way of defining metrics - something like "revenue = sum(sales) - sum(discount)" or "retention = whatever" - that gets generated via SQL at query time rather than baked into a table. Then you can have higher confidence that multiple access paths all share the same definitions for the metrics.
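
A minimal sketch of that idea (hypothetical metric names and helper, not the tooling from the post): each metric's SQL expression lives in one place, and every access path renders its queries from there.

    # Hypothetical sketch: each metric's SQL expression is defined once and
    # rendered at query time, so dashboards, notebooks, and APIs all agree.
    METRICS = {
        "revenue": "SUM(sales) - SUM(discount)",
        "retention": "COUNT(DISTINCT returning_user_id) * 1.0 / COUNT(DISTINCT user_id)",
    }

    def metric_query(metric: str, table: str, group_by: str) -> str:
        expr = METRICS[metric]  # single source of truth for the definition
        return (
            f"SELECT {group_by}, {expr} AS {metric} "
            f"FROM {table} GROUP BY {group_by}"
        )

    print(metric_query("revenue", "orders", "region"))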


Author here. There's nuance here, but as a rule of thumb data size is a decent enough proxy. The audience isn't everyone - the goal was to give less experienced data engineers and folks a sense of modern data tools and a possible approach.

But what did you mean by "Read the first paragraph of the `Cost` section"?


>Author here and there's nuance here but as a rule of thumb data size is a decent enough proxy.

It isn't though.

What matters is the memory footprint of the algorithm during execution.

If you're doing transformations that take constant time per item regardless of data size, sure, go for a GPU. If you're doing linear work, you can't fit more than 24 GB on a desktop card, and prices go to the moon quickly after that.

Junior devs doing the equivalent of an outer product on data is the number one reason I've seen data pipelines explode in production.
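
As an illustration (made-up sizes, pandas just for concreteness): the output of a cross join grows as n * m, so memory blows up long before the inputs themselves look big next to a 24 GB card.

    import pandas as pd

    # Illustrative only: a cross join's output grows as n * m, so modest
    # inputs explode well before the raw data looks big.
    users = pd.DataFrame({"user_id": range(1_000)})
    skus = pd.DataFrame({"sku": range(1_000)})

    pairs = users.merge(skus, how="cross")  # 1_000 * 1_000 = 1,000,000 rows
    print(len(pairs))                       # already 1000x the larger input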


Yes, but most data-heavy tasks are parallelizable. SQL itself is naturally parallelizable. There's a reason RAPIDS, Voltron Data, Kinetica, SQream, etc. exist.

Full transparency: I don't have a huge amount of experience working at this massive scale, and to your point, you need to understand the problem and constraints before you propose a solution.


There are more asterisks attached to each assertion you're making than you can shake a stick at.

There is always a 'simple' transformation the business requires that turns out to need n^2 space and kills the server it's running on, because people believe everything you said above.

Or in other words: most of the time you don't need a seat belt in a car either.


You have to revisit the assertion that SQL is naturally parallelizable. As a guide, have a look at the semantics around Spark shuffles.
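
For example (a rough PySpark sketch with made-up data), even a plain GROUP BY forces a shuffle, because rows with the same key have to be co-located before they can be aggregated:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

    # The aggregation triggers a shuffle: data moves across the network
    # between stages so matching keys end up on the same partition.
    df.groupBy("key").sum("value").explain()  # physical plan shows an Exchange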


That's when your batch jobs are running. Partially kidding, but companies will wait until demand is lower to take advantage of spot pricing.


Or routing latency-insensitive traffic there. Do I care where my backups are stored? No. (Well, I kinda do for geological/weather reasons, but put it on Mars for all I care.)


Not to change your direction, but something I've been toying around with is being able to support algebraic types when defining tables. That way you can offload a lot of the error checking to the database engine's type system and keep application code simpler.


I'd like to do something like that too, if/when I ever get to replacing the DDL. In Postgres you could create custom types for tagged unions, but it might be better to translate table-level unions to a set of constraints, for performance and flexibility (you can't create referential integrity constraints using expressions IIRC).
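
A rough sketch of that constraint-based approach (hypothetical table and columns, the DDL held in a Python string): encode the tag as a column and let CHECK constraints enforce which fields each variant may populate, which keeps the variant fields as plain columns you can still attach referential integrity constraints to.

    # Hypothetical Postgres DDL for a tagged union: a payment is either a
    # card payment or an invoice payment, enforced via CHECK constraints.
    PAYMENT_DDL = """
    CREATE TABLE payment (
        id          bigint PRIMARY KEY,
        kind        text NOT NULL CHECK (kind IN ('card', 'invoice')),
        card_last4  text,
        invoice_ref text,
        CHECK (kind <> 'card'    OR (card_last4 IS NOT NULL AND invoice_ref IS NULL)),
        CHECK (kind <> 'invoice' OR (invoice_ref IS NOT NULL AND card_last4 IS NULL))
    );
    """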


Sounds wonderful. I actually think this is the highest value thing anyone could contribute to Postgres (assuming it could handle foreign key constraints inside the sum types).


Author here, but some ideas I was thinking about:

- An open source data pipeline built on top of R2. A way of keeping data on R2/S3 but having execution handled in Workers/Lambda (rough sketch of the access pattern below). Inspired by what https://www.boilingdata.com/ and https://www.bauplanlabs.com/ are doing.

- Related to the above: taking data that's stored in the various big data formats (Parquet, Iceberg, Hudi, etc.) and generating many more combinations of the datasets, then choosing optimal ones based on the workload. You can do this with existing providers, but I think the cost element just makes this easier to stomach.

- Abstracting some of the AI/ML products out there and choosing the best one for the job by keeping the data on R2 and then shipping it to the relevant providers (since data ingress to them is free) for specific tasks.
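
A rough sketch of that first access pattern (hypothetical account id, bucket, and key; R2 exposes an S3-compatible endpoint, so standard S3 tooling works against it):

    import boto3  # R2 speaks the S3 API, so the usual S3 client works

    # Hypothetical names throughout: swap in your own account id, bucket, and key.
    r2 = boto3.client(
        "s3",
        endpoint_url="https://<account_id>.r2.cloudflarestorage.com",
        aws_access_key_id="<r2_access_key_id>",
        aws_secret_access_key="<r2_secret_access_key>",
    )

    # Read a Parquet object; since R2 egress is free, the same bytes can be
    # shipped to whichever compute or AI provider fits the job with no transfer fee.
    obj = r2.get_object(Bucket="analytics", Key="events/2023/11/01.parquet")
    parquet_bytes = obj["Body"].read()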


Author here - have you tried using R2? As others mentioned there's also Sippy (https://developers.cloudflare.com/r2/data-migration/sippy/) which makes this easy to try.


Author here. It's true that transfers within a region are free, and if you design your system appropriately you can take advantage of that, but I've seen accidental cases where someone tries to access data from another region, and it's nice to not even have to worry about it. Even that can be handled with better tooling/processes, but the bigger point is having your data available across clouds so you can take advantage of their different capabilities. I used AI as an example, but imagine you have all your data in S3 and want to use Azure due to the OpenAI partnership. That's the use case R2 enables.


Yeah, for greenfield work building on R2 is generally a far better deal than S3, but if you have a massive amount of data already on S3, especially if it's small files, you're going to pay a big penalty to move the data. Sippy is nice but it just spreads the pain over time.


> Sippy is nice but it just spreads the pain over time.

That egress money was going to be spent with or without Sippy. It's not "just spreading" the pain, it's avoiding adding any pain at all.


Author here - really cool link to Sippy. I love the idea since you're migrating data as needed, so the cost you incur is a function of the workload. It's basically acting as a caching layer.


Yes - he gives it credit at the bottom of the page.


Yes, I see it now - hadn't scrolled all the way to the bottom...

