Hacker News

It's not so much that SparkSQL doesn't support gz as that gz is slow because you can't parallelize the reads. Regardless, store the data in Parquet format in HDFS so that YARN can allocate containers local to the chunk being processed. It scales nicely.
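The non-parallelizable point can be shown with nothing but Python's stdlib gzip module (a minimal sketch, no Spark involved): DEFLATE decompression state depends on every preceding byte, so a reader cannot start at an arbitrary split point the way it can with an uncompressed or Parquet file.

```python
import gzip
import zlib

# Build a gzip "file" in memory from some line-oriented log data.
data = b"\n".join(b"log line %d" % i for i in range(1000))
compressed = gzip.compress(data)

# Reading from the start works fine.
assert gzip.decompress(compressed) == data

# But a task cannot begin decompressing at an arbitrary byte offset:
# a slice starting mid-stream has no gzip header and no usable
# DEFLATE state, so the split is not independently readable.
midpoint = len(compressed) // 2
try:
    gzip.decompress(compressed[midpoint:])
    splittable = True
except (OSError, zlib.error, EOFError):
    splittable = False

print(splittable)
```

This is why a single large .gz file becomes one serial read for one task, regardless of how many executors are available; splittable formats (or many smaller files) are what let the work fan out.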



Yeah, but you are making the assumption that I've already got these 350 GB of logs in HDFS format. Which takes time to set up.


Not really. You said SparkSQL doesn't support gz, which is incorrect and was the thrust of my comment. The anecdote about Parquet is orthogonal to gz support.

Pedantic sidebar: HDFS isn't a file format, it's a distributed filesystem layered over a traditional on-disk filesystem. For example you might have JSON logs, in a gz-compressed file, tracked in the HDFS filesystem, stored on disk in an ext4-formatted filesystem.



