
Very cool, I stand corrected. I hope one day I have another opportunity to play with KDB.

As for the speed advantage, you'll get a similar one from Python, pandas, and a big folder of CSV files. For all of Spark's claims about "speed", it's really just reducing Hadoop's speed penalty from 500x to 50x. (Here 500x and 50x are relative to the performance of loading flat files from disk.)




Do you really mean flat CSV text files? I get the simplicity of that, but it seems really expensive (in both speed and size). But I'm used to tables with more than a dozen columns, and with kdb+ you only pull in the columns of interest and the rows of interest (thanks to on-disk sorting and grouping), which is a smaller subset, often much smaller.


By number, my data sets are usually in CSV. I could probably get some additional advantage via HDF5, but a gzipped CSV is usually good enough and simpler. By volume (i.e. on my 2 or 3 biggest data sets) I'll probably be mostly HDF5. I haven't tried feather yet but it looks pretty nice.

KDB would probably be better, but don't underestimate what you can do with just a bunch of files.
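
For anyone wondering what "just a bunch of files" can look like in practice, here is a rough sketch using only the Python standard library. It assumes a hypothetical layout of one gzipped CSV per day; the file names, the column names (date, sym, price, size), and the helper names are all made up for illustration. It keeps only the requested columns and rows in memory. Unlike kdb+, every file is still fully decompressed and parsed, so this cuts memory use rather than disk reads. With pandas, `read_csv` with `usecols` does the equivalent column selection.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def load_columns(folder, columns, keep_row=None):
    """Read selected columns from every *.csv.gz file in `folder`.

    `keep_row` is an optional predicate on the full record; rows it
    rejects are dropped before anything is stored.
    """
    rows = []
    # Sort by file name so date-named files are read in date order.
    for path in sorted(Path(folder).glob("*.csv.gz")):
        with gzip.open(path, "rt", newline="") as handle:
            for record in csv.DictReader(handle):
                if keep_row is None or keep_row(record):
                    rows.append({name: record[name] for name in columns})
    return rows


def write_demo_folder(folder):
    """Create two tiny gzipped CSV files (made-up data, illustration only)."""
    days = {
        "2016-01-04": [("AAA", "10.0", "100"), ("BBB", "20.0", "200")],
        "2016-01-05": [("AAA", "11.0", "150"), ("BBB", "21.0", "250")],
    }
    for day, trades in days.items():
        with gzip.open(Path(folder) / f"{day}.csv.gz", "wt", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["date", "sym", "price", "size"])
            for sym, price, size in trades:
                writer.writerow([day, sym, price, size])


with tempfile.TemporaryDirectory() as demo_dir:
    write_demo_folder(demo_dir)
    # Pull two of the four columns, and only the rows for one symbol.
    result = load_columns(
        demo_dir,
        ["date", "price"],
        keep_row=lambda record: record["sym"] == "AAA",
    )
```

Values come back as strings because the csv module does no type inference; converting them is left to the caller.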



