Hacker News

It's not so much that SparkSQL doesn't support gz as that gz is slow because you can't parallelize the reads. Regardless, store the data in Parquet format in HDFS so that YARN can allocate containers local to the chunk being processed. It scales nicely.
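The non-parallelizable point can be shown with nothing but Python's stdlib gzip module (a minimal sketch, no Spark involved): DEFLATE decompression state depends on every preceding byte, so a reader cannot start at an arbitrary split point the way it can with an uncompressed or Parquet file.

```python
import gzip
import zlib

# Build a gzip "file" in memory from some line-oriented log data.
data = b"\n".join(b"log line %d" % i for i in range(1000))
compressed = gzip.compress(data)

# Reading from the start works fine.
assert gzip.decompress(compressed) == data

# But a task cannot begin decompressing at an arbitrary byte offset:
# a slice starting mid-stream has no gzip header and no usable
# DEFLATE state, so the split is not independently readable.
midpoint = len(compressed) // 2
try:
    gzip.decompress(compressed[midpoint:])
    splittable = True
except (OSError, zlib.error, EOFError):
    splittable = False

print(splittable)
```

This is why a single large .gz file becomes one serial read for one task, regardless of how many executors are available; splittable formats (or many smaller files) are what let the work fan out.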



Yeah, but you are making the assumption that I've already got these 350 GB of logs in HDFS format. Which takes time to set up.


Not really. You said SparkSQL doesn't support gz, which is incorrect and was the thrust of my comment. The anecdote about Parquet is orthogonal to gz support.

Pedantic sidebar: HDFS isn't a file format, it's a distributed filesystem layered over a traditional on-disk filesystem. For example you might have JSON logs, in a gz-compressed file, tracked in the HDFS filesystem, stored on disk in an ext4-formatted filesystem.



