Exploiting column chunks for faster ingestion and lower memory use

noleary · 2024-08-26T14:44:18 1724683458

(Read this really quickly and bookmarked for a deeper read later.)

This part really caught my attention:

    > That's an improvement of 100x for write and ingestion speed, and 35x for memory overhead!

Where'd you guys get the idea for this approach? Did you know you could get this kind of improvement?

nikonp · 2024-08-26T14:58:10 1724684290

One of the founders of Rerun here. I don't remember exactly where the idea came from, but I think it was basically two things. First off, creating chunks of columns to store or pass around is a pretty common approach in data systems. Parquet files, for instance, have the concept of row groups, which is pretty similar (the main difference is that chunks don't have to include all columns). Second, it was just quite obvious that we needed to amortize the fixed costs better for small data somehow.
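
To make the amortization point concrete, here is a minimal sketch of why batching rows into column chunks helps with small data. It is not Rerun's actual implementation: `Chunk`, `chunk_rows`, `HEADER_BYTES`, and `VALUE_BYTES` are hypothetical names and numbers chosen only for illustration, and the size model (one fixed header per chunk plus a flat cost per value) is an assumption.

```python
from dataclasses import dataclass, field

# Hypothetical costs, chosen only to make the arithmetic visible.
HEADER_BYTES = 64  # assumed fixed per-chunk overhead (schema, ids, framing)
VALUE_BYTES = 8    # assumed size of one scalar value


@dataclass
class Chunk:
    """A batch of rows holding only the columns that were actually logged.

    Unlike a Parquet row group, a chunk here may carry any subset of columns.
    """
    columns: dict[str, list[float]] = field(default_factory=dict)

    def size_bytes(self) -> int:
        # One fixed header per chunk, plus the payload of every stored value.
        payload = sum(len(v) for v in self.columns.values()) * VALUE_BYTES
        return HEADER_BYTES + payload


def chunk_rows(values: list[float], column: str, rows_per_chunk: int) -> list[Chunk]:
    """Pack consecutive values of a single column into fixed-size chunks."""
    return [
        Chunk({column: values[i:i + rows_per_chunk]})
        for i in range(0, len(values), rows_per_chunk)
    ]


def total_bytes(chunks: list[Chunk]) -> int:
    """Total size of all chunks under the assumed cost model."""
    return sum(c.size_bytes() for c in chunks)


values = [float(i) for i in range(1000)]

# One row per chunk: the fixed header is paid 1000 times.
one_row_each = chunk_rows(values, "temperature", rows_per_chunk=1)

# 250 rows per chunk: the fixed header is paid only 4 times.
batched = chunk_rows(values, "temperature", rows_per_chunk=250)

print(total_bytes(one_row_each))  # 1000 * (64 + 8) = 72000
print(total_bytes(batched))       # 4 * 64 + 1000 * 8 = 8256
```

Under these made-up numbers, the payload is the same in both cases; only the number of times the fixed header is paid changes, which is the effect the comment describes for small data.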