HDF5 isn't perfect, but it does this kind of job pretty well. HDF5's C and C++ APIs are definitely not fun to use, but there are wonderful and intuitive APIs available in some languages; I'm thinking of Python's h5py here.
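For instance, a minimal h5py session looks roughly like this (the file and dataset names here are just placeholders, not anything from the article):

    import h5py
    import numpy as np

    # Writing and reading feel like working with dicts of numpy arrays.
    with h5py.File("data.h5", "w") as f:
        f["features"] = np.random.rand(1000, 50)
        f["features"].attrs["created_by"] = "example"

    with h5py.File("data.h5", "r") as f:
        first_rows = f["features"][:10]  # only this slice is read from disk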
Let me add that the OP's experience that HDF5 files were less space efficient than comparable CSV files suggests that something was grossly amiss in his use of HDF5.
(I'm the author of the library) I'm almost certain you're correct - we thought we had compression enabled in our feature builder but never found the root cause. Regardless, we're happy with how BTables ended up for the other reasons detailed. For future use cases we'll definitely be re-evaluating HDF5!
The BTables discussion takes me back to memories of my first college computing class (in FORTRAN). We were asked how we might store a sparse matrix in less memory. The solution was exactly the same as BTables. We thought we'd done something novel when the professor pointed out that it had already been implemented in the 60s.
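The classic trick is just to store the coordinates of the non-zero entries instead of the full grid. A rough Python sketch of the idea (the function names are mine, not BTables'):

    # Store a sparse matrix as {(row, col): value} for non-zero entries only.
    def to_sparse(dense):
        return {(i, j): v
                for i, row in enumerate(dense)
                for j, v in enumerate(row)
                if v != 0}

    def lookup(sparse, i, j):
        # Missing keys are implicitly zero.
        return sparse.get((i, j), 0)

    m = [[0, 0, 3],
         [0, 7, 0]]
    assert lookup(to_sparse(m), 1, 1) == 7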
Great ideas never fade. They do get reinvented :).
I haven't tried the BTables format, but I agree with their criticism of HDF5. It seems to be an incredibly over-designed format with under-designed APIs.
(Why would I need a directory tree inside a file that only one process can write to anyway? Why wouldn't I just use the filesystem I already have?)
> Why would I need a directory tree inside a file that only one process can write to anyway?
> Why wouldn't I just use the filesystem I already have?
If you have multiple "tables" that belong together and you need one table to interpret the data in the other table, wouldn't you want them to be grouped together? If they are separate files on the filesystem there is always the risk of forgetting something when you share the data with somebody.
If you can put all the data of an experiment into one file, I think that is very convenient. After all, you don't have to read the complete HDF5 file if you are interested just in a subset of the data.
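As a sketch of what that grouping looks like in practice (file, group, and dataset names are made up for illustration):

    import h5py
    import numpy as np

    # One experiment, one file: related tables live together under groups.
    with h5py.File("experiment.h5", "w") as f:
        run = f.create_group("run_01")
        run.create_dataset("measurements", data=np.zeros((10000, 8)))
        run.create_dataset("sensor_labels",
                           data=np.array([b"t1", b"t2", b"t3"]))

    # Later, read just the subset you care about.
    with h5py.File("experiment.h5", "r") as f:
        chunk = f["run_01/measurements"][:100]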
This is a poor reinvention of archive formats such as .zip.
If I have multiple data files that need to go together, I would like to put them together with a widely-understood tool that has good APIs in many programming languages and can even be interacted with from the shell.
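For example, Python's standard-library zipfile does this in a few lines (the member names below are invented for the sketch):

    import zipfile

    # Bundle related data files in a format every language (and the shell) understands.
    with zipfile.ZipFile("experiment.zip", "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("measurements.csv", "t,value\n0,1.5\n1,2.0\n")
        z.writestr("sensor_labels.csv", "id,label\n0,temp\n")

    # Read one member without unpacking the rest.
    with zipfile.ZipFile("experiment.zip") as z:
        with z.open("measurements.csv") as f:
            header = f.readline()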
I wonder if it wouldn't be more practical to just use an sqlite file for data like that. I mean, it's not plaintext, but sqlite is available pretty much anywhere and provides a convenient interface for the data.
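A minimal sketch with Python's built-in sqlite3 module (the table and column names are just an assumed sparse-triplet layout, not anything from the article):

    import sqlite3

    # A single-file database with no server to set up.
    con = sqlite3.connect("features.db")
    con.execute("CREATE TABLE IF NOT EXISTS features"
                " (row_id INTEGER, col INTEGER, value REAL)")
    con.executemany("INSERT INTO features VALUES (?, ?, ?)",
                    [(0, 3, 1.5), (0, 17, -2.0), (1, 4, 0.25)])
    con.commit()

    # Query just the rows you need, from any language with a sqlite driver.
    subset = con.execute("SELECT col, value FROM features"
                         " WHERE row_id = 0").fetchall()
    con.close()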
HDF5 is much better suited for a lot of scientific data sets. How would you store multidimensional data in sqlite? Not everything is a table or matrix.
HDF5 also allows you to pick compression filters that are especially suited for the data you have.
If you are looking to replace a CSV file, then sqlite is obviously a pragmatic solution.
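To make the multidimensional point concrete, here is a hedged h5py sketch of a 3-D dataset with a per-dataset compression filter, something that doesn't map naturally onto a SQL table (the shape, chunking, and names are assumptions for illustration):

    import h5py
    import numpy as np

    # A 3-D array (e.g. time x height x width) stored directly,
    # with chunking and a compression filter chosen per dataset.
    with h5py.File("volume.h5", "w") as f:
        f.create_dataset("frames",
                         data=np.zeros((500, 256, 256), dtype="float32"),
                         chunks=(1, 256, 256),
                         compression="gzip",
                         compression_opts=4)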
It's too bad that this is for sparse data only. ML datasets have differing degrees of sparsity, and when the sparsity gets low enough, it's more efficient to use dense matrices, even when there are still missing values.
Also if you have dense data, you can use mmap, which isn't very space efficient but is very fast. I guess it could also be made to be space efficient if you use a filesystem with transparent compression.
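A rough sketch of the dense mmap approach using numpy.memmap (the file name and shape are arbitrary; this is just one way to do it):

    import numpy as np

    # Write a dense matrix once as raw float32.
    data = np.random.rand(10000, 128).astype("float32")
    data.tofile("dense.bin")

    # Later, map it without reading it all into memory;
    # pages are faulted in only for the rows you touch.
    m = np.memmap("dense.bin", dtype="float32", mode="r",
                  shape=(10000, 128))
    batch = np.array(m[5000:5064])  # copies just this slice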
If you combine mmap with a "filesystem with transparent compression" and still want the raw efficiency of mmap, mmap will only see the compressed data.
If you want your mmap to magically see the uncompressed data, your filesystem will have to decompress it, and that doesn't come for free.
I would try and aim for compression in the application, as data size will likely be the bottleneck in reading and writing such files. If your data isn't very sparse, you could delta encode the indices of the non-zero columns and use some variable-length encoding for them. Compressing each row of deltas may help after the delta encoding (especially if it is reasonably dense, because you expect the deltas to be small).
Once you go down that route, you have sacrificed simplicity, so you might just as well encode your floats, too.
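For concreteness, a rough sketch of the delta-plus-varint idea for one row's non-zero column indices, using an unsigned LEB128-style varint (all names here are mine, not from any particular library):

    def encode_varint(n):
        # Unsigned LEB128: 7 bits per byte, high bit marks continuation.
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def encode_row_indices(cols):
        # cols: sorted column indices of the non-zero entries in one row.
        out = bytearray()
        prev = 0
        for c in cols:
            out += encode_varint(c - prev)  # small deltas -> one byte each
            prev = c
        return bytes(out)

    # 5 indices, every delta fits in one byte: 5 bytes instead of 5 * 4
    # with fixed 32-bit ints.
    assert len(encode_row_indices([3, 4, 9, 10, 100])) == 5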
Interesting how it jumps from CSV to rewriting stuff without just doing SQL and being done with it. Since CSV did the job almost well enough, it seems like SQL would be just fine and dandy while being easier to manage and implement (minutes, literally).
note: after reading a little more I suspect SQL would be faster, in fact.