
Pretty big caveat; 5 seconds AFTER all data has been loaded into memory - over 2 minutes if you also factor in reading the files from S3 and loading memory. So to get this performance you will need to run hot: 4000 CPUs and 30TB of memory going 24/7.




hi sdairs, we did store the data on the worker nodes for the challenge, but not in memory. We wrote the data to the local NVMe SSD storage on the node. Linux may cache the filesystem data, but we didn't load the data directly into memory. We like to preserve the memory for aggregations, joins, etc. as much as possible...

It is true you would need to run the instance(s) 24/7 to get this performance all day; the startup time of a couple of minutes is not ideal. We have a lot of work to do on the engine, but it has been a fun learning experience...


“Linux may cache the filesystem data” means there’s a non-zero likelihood that the data was in memory unless you dropped caches right before you began the benchmark. You don’t have to explicitly load it into memory for this to be true. What’s more, unless you are in charge of how memory is used, the kernel is going to make its own decisions about what to cache and what to evict, which can make benchmarks unreproducible.

It’s important to know what you are benchmarking before you start and to control for extrinsic factors as explicitly as possible.
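A minimal sketch of the kind of control being described, assuming a Linux host with root access; the benchmark command and script name are placeholders, not anything from the article:

```python
import subprocess
import time

def drop_page_cache():
    """Flush dirty pages, then ask the kernel to drop clean page/dentry/inode
    caches via the standard Linux /proc interface. Requires root."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def run_query():
    """Placeholder for the actual benchmark query (hypothetical script)."""
    subprocess.run(["./run_benchmark_query.sh"], check=True)

# Cold run: caches dropped immediately before the measurement.
drop_page_cache()
t0 = time.monotonic()
run_query()
print(f"cold: {time.monotonic() - t0:.2f}s")

# Warm run: same query again, now likely served from the page cache.
t0 = time.monotonic()
run_query()
print(f"warm: {time.monotonic() - t0:.2f}s")
```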


Thanks for clarifying; I'm not trying to take anything away from you. I work in the OLAP space too, so it's always good to see people pushing it forward. It would be interesting to see a comparison of totally cold vs. hot caches.

Are you looking at distributed queries directly over S3? We did this in ClickHouse and can do instant virtual sharding over large data sets in S3. We call it parallel replicas https://clickhouse.com/blog/clickhouse-parallel-replicas


(I submitted this link). My interest in this approach in general is about observability infra at scale - thinking about buffering detailed events, metrics and thread samples at the edge, and later extracting only the things of interest, after early filtering at the edge. I’m a SQL & database nerd, so this approach looks interesting.

With 2 modern PCIe 5.0 NVMe disks per host (15 GB/s each), it should only take about 15s to read 30 TB into memory across 63 hosts.
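Back-of-envelope check of that estimate, using only the figures given in the comment above:

```python
total_tb = 30            # dataset size, TB
hosts = 63
disks_per_host = 2
gbps_per_disk = 15       # sequential read per PCIe 5.0 NVMe disk, GB/s

aggregate_gbps = hosts * disks_per_host * gbps_per_disk  # 1890 GB/s across the cluster
seconds = (total_tb * 1000) / aggregate_gbps
print(f"{seconds:.1f} s")  # ~15.9 s
```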

You can find those disks on Hetzner. Not AWS, though.


I don’t understand why both Azure and AWS have local SSDs that are an order of magnitude slower than what I can get in a laptop. If Hetzner can do it, surely so can they!

Not to mention that Azure now exposes local drives as raw NVMe devices mapped straight through to the guest with no virtualisation overheads.


It would undercut all their higher level services - like DynamoDB, CosmosDB, etc.

Databases would suddenly go BRRR in the cloud and show up cloud-native (S3) based databases for the high latency services they are.


It does make me wonder whether all of the investment in hot-loading of GPU infrastructure for LLM workloads is portable to databases. 30TB of GPU memory would be roughly 200 B200 cards, or roughly $1,200 per hour compared to the $240/hour pricing for the CPU-based cluster. The GPU cluster would assuredly crush the CPU cluster with a suitable DB, given it has 80x the FP32 FLOP capacity. You'd expect the in-memory GPU solution to be cheaper (assuming optimized software) with a 5x growth in GPU memory per card, or today if the workload can be bin-packed efficiently.
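A rough sanity check of those numbers; the per-card HBM capacity and hourly rate below are assumptions for illustration (the comment rounds up to ~200 cards and ~$1,200/hr), not quoted prices:

```python
import math

dataset_tb = 30
hbm_per_card_gb = 192      # assumed B200 HBM capacity
usd_per_gpu_hour = 6.0     # assumption implied by ~$1,200/hr for ~200 cards
cpu_cluster_usd_per_hour = 240  # figure from the comment

cards = math.ceil(dataset_tb * 1000 / hbm_per_card_gb)   # ~157 cards to hold 30 TB
gpu_cluster_usd_per_hour = cards * usd_per_gpu_hour      # ~$940/hr at these assumptions

print(cards, gpu_cluster_usd_per_hour,
      round(gpu_cluster_usd_per_hour / cpu_cluster_usd_per_hour, 1))  # ~4x CPU cluster cost
```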

Do databases do matrix multiplication? Why would they even use floats?

That's a great question. I never worked on any cool NASA stuff that would involve large-scale number crunching. In the corpo space, that's not been my experience at all. We were trying to solve big data problems like how to report on medical claims that are in flight (which are hardly ever static until much later, after the claim is long completed and no longer interesting to anyone) and do it at a scale of tens of thousands per hour. It never went that well, tbh, because it is so hard to validate what a "claim" even is when it is changing in real time. I don't think excess GPUs would help with that.

Lots of columns are float-valued, and GPU tensor cores can be programmed to do many operations between float/int-valued vectors. Strings can also be processed in this manner as they are simply vectors of integers. NVIDIA publishes official TPC benchmark results for each GPU release.

The idea of a GPU database has been reasonably well explored; they are extremely fast, but have been cost-ineffective due to GPU prices. When the dataset is larger than GPU memory, you also incur slowdowns due to cycling data between CPU and GPU memory.
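A toy illustration of the columnar-ops-on-GPU idea, assuming CuPy as the GPU array library (my choice, not something mentioned in the thread); real GPU databases use far more sophisticated kernels:

```python
import cupy as cp  # assumes an NVIDIA GPU with CuPy installed

n = 10_000_000
# Two "columns" living entirely in GPU memory.
price = cp.random.rand(n, dtype=cp.float32)
qty = cp.random.randint(1, 10, size=n, dtype=cp.int32)

# Roughly: SELECT sum(price * qty) WHERE price > 0.5, expressed as array ops.
mask = price > 0.5
revenue = cp.sum(price[mask] * qty[mask])
print(float(revenue))
```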


What do you think vector databases are? Absolutely. I think the idea of a database and a "model" could start to really be merged this way.

Yeah, pretty misleading it feels like.

For background, here is the initial ideation of the "One Trillion Row Challenge" that this submission originally aimed to participate in: https://docs.coiled.io/blog/1trc.html


Wow (Owen Wilson voice). It's still impressive that it can be done. Just having 4k CPUs going reliably for any period of time is pretty nifty. The problem I have run into is that even big companies say they want this kind of compute until they get the bill for it.

So how would that compare to DynamoDB or BigQuery? (I have zero interest in paying for running that experiment).

In theory a Zen 5 / Epyc Turin can have up to 4TB of RAM. So how would a more traditional non-clustered DB stand up?

With 1000 k8s pods, each with 30GB of RAM, there has to be a bit of overhead/wastage going on.
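A rough framing of that comparison, using only the figures mentioned above:

```python
import math

total_ram_tb = 30     # cluster memory from the article
pod_ram_gb = 30       # per-pod memory mentioned above
ram_per_box_tb = 4    # Epyc Turin figure from the comment

pods = total_ram_tb * 1000 // pod_ram_gb               # 1000 pods
big_boxes = math.ceil(total_ram_tb / ram_per_box_tb)   # ~8 big machines for the same 30 TB
print(pods, big_boxes)
```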


Are you asking how Dynamo compares at the storage level? Like in comparison to S3? As a key-value database it doesn’t even have a native aggregation capability. It’s a very poor choice for OLAP.

BigQuery is comparable to DuckDB. I’m curious how the various Redshift flavors (provisioned, serverless, spectrum) and Spark compare.

I don’t have a lot of experience with DuckDB but it seems like Spark is the most comparable.


BigQuery is built for the distributed case while DuckDB is single CPU and requires the workarounds described in the article to act like a distributed engine.

DuckDB is not single CPU, it's single machine - big difference

Fair enough, I slipped. And single RAM.

And yeah these days you can boost a single machine to enormous specifications. I guess the main difference will be the cost. A distributed engine can "lease" a little bit of time here and there, while a single RAM engine needs to keep all that capacity ready for when it is actually needed.
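On the single-machine-but-multi-core point above, a minimal sketch using DuckDB's Python API; the Parquet path and column names are hypothetical, and the thread count is just an example setting:

```python
import duckdb

con = duckdb.connect()              # in-process, one machine
con.execute("PRAGMA threads=16")    # let this one node use 16 cores

# One aggregation over a directory of Parquet files (hypothetical path);
# DuckDB parallelises the scan and aggregation across the configured threads.
result = con.execute("""
    SELECT station, min(temp), avg(temp), max(temp)
    FROM read_parquet('measurements/*.parquet')
    GROUP BY station
""").fetchall()
print(result[:5])
```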


Ah ok. Maybe that does make sense as a comparison to ask if you need an analytics stack or can just grind through your prod Dynamo.

https://sortbenchmark.org has always stipulated "Must sort to and from operating system files on secondary storage." and thus has always felt like a more reasonable measure of overall system performance.




