
The "Data Storage Internals" section[1] of the README sounds to me like it has its own column-oriented format for these tables, at least that's how I'm reading the part about segments. Is that the case? If so, have you tried using Apache Arrow or Parquet to see how they compare?

[1] https://github.com/dgllghr/stanchion#data-storage-internals




Yes, it does. I have found it easier to start simple than to try to integrate the complexity of Parquet whole hog. The Parquet format is also designed around the idea of storing multiple columns in the same "file". I'm sure there are ways to split the components of a Parquet file across multiple BLOBs and append those BLOBs together to reconstruct a valid Parquet file, but that adds complexity and does not lend itself to reusing existing Parquet code. Keeping the BLOBs separate is valuable because it means not touching data for columns that are not being read by a query.

My understanding is that Arrow has really focused on an in-memory format. It is binary, so it can be written to disk, but they are seemingly just scratching the surface on compression. Compression is a big advantage of columnar storage because really cool compression schemes like bit packing, run-length encoding, dictionary compression, etc. can be used to significantly reduce the size of data on disk.
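Of those schemes, run-length encoding is the simplest to illustrate. A minimal sketch in Python (not from Stanchion's codebase, just the general idea), showing why it pays off on columns with long runs of repeated values:

```python
# Minimal sketch of run-length encoding (RLE), one of the columnar
# compression schemes mentioned above. A column with long runs of
# repeated values (e.g. a sorted status column) collapses to a short
# list of (value, run_length) pairs.
from itertools import groupby

def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [v for v, n in pairs for _ in range(n)]

column = ["ok", "ok", "ok", "err", "ok", "ok"]
encoded = rle_encode(column)   # [("ok", 3), ("err", 1), ("ok", 2)]
assert rle_decode(encoded) == column
```

Row-oriented storage rarely gets runs like this, because values from different columns are interleaved; columnar layouts make them common, which is why these encodings are such a big win there.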

Arrow interop in particular is a great idea, though, regardless of the storage format. And if they develop a storage-friendly version of Arrow, I would certainly consider it.



