Random access string compression with FSST and Rust

judofyr · 2024-09-13T08:44:32.000000Z

I implemented this in Zig earlier: https://github.com/judofyr/minz

It’s a quite neat algorithm. I saw compression ratios in the 2-3x range. However, I remember that the algorithm for finding the dictionary was a bit unclear. I wasn’t convinced that what was explained in the paper found the “optimal” dictionary. With some slight tweaks I got widely different results. I wonder if this implementation improves on this.

aduffy · 2024-09-13T11:39:44.000000Z

The dictionary quality was definitely highly sensitive to some of the tricks that the original authors implemented in their C++ code, many were documented in the paper but a few were not:

1. Always promoting single-bytes by boosting their scores by a factor of 8 in candidate search

2. Boosting the calculated gains of single-byte candidates by a factor of 8 to prevent them from falling off in later generations

3. Having an adaptive threshold for which symbols are included as the rounds go on

I didn't document these in the blog post to keep the content accessible, but it's definitely something you find once you start digging into compression ratios! Perhaps they will end up in a part 2 at some point.

[1]: https://github.com/spiraldb/fsst/blob/develop/src/builder.rs...

Epicism · 2024-09-13T00:30:48.000000Z

Super interesting! I’m curious how this differs from InfluxDB’s German strings implementation https://www.influxdata.com/blog/faster-queries-with-stringvi...

aduffy · 2024-09-13T01:20:08.000000Z

German strings are cool, and we're also using them in Vortex! They're also commonly referred to as "variable-length view arrays", which is what Arrow calls them [1]. They were first published by folks at TUM as part of the Umbra database (checkout Figure 4) [2].

German-style strings/views are not a compression algorithm, they're just a way for storing string data and making it quick to compare them in-memory. You can in fact store views, while storing the corresponding full-length strings in compressed format with FSST. We don't currently do that but we're working on it.

[1] https://arrow.apache.org/docs/format/Columnar.html#variable-...

[2] https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

Epicism · 2024-09-13T01:45:52.000000Z

Thanks for the reply!

jcgrillo · 2024-09-13T16:20:40.000000Z

I really like the look of vortex[1]! One of my industry pet peeves is all the useless utf-8 server log bytes. I'd like to log data in a sane, schemaful, binary format and this looks like it could be a good way to do that. Bonus points if we can wire this up as a physical layer for e.g. datafusion[2] so I can analyze my logs with the dataframe abstraction.

EDIT: Question about FSST--lets say I build a strings table like:

  struct Strings {
      compressor: fsst::Compressor,
      compressed: Vec<Vec<u8>>
  }

Is there some optimal length for compressed given the 255 symbols limit?

[1] https://github.com/spiraldb/vortex [2] https://github.com/apache/datafusion

lor_louis · 2024-09-14T01:55:22.000000Z

I've worked on a database that considered FSST as a way to query over compressed log lines. We found that the compression ratio was highly dependent on how repetitive the data was. In the end segmenting by service (Apache, our go stuff, our rust stuff, etc) yielded pretty good results and log lengths of ~200 bytes were pretty well compressed.

We ended up not using it in production because the worst cases were absolutely terrible compared to our dumber skippable zstd.

aidenn0 · 2024-09-13T03:00:28.000000Z

What is the meaning of "Arrow" in this context?

ddulaney · 2024-09-13T03:12:01.000000Z

https://arrow.apache.org/

It’s a format for handling bulk columnar data.

chgo1 · 2024-09-13T04:04:01.000000Z

A question regarding the second generation in the example: Why is the symbol "um" (0) only counted once?

aduffy · 2024-09-13T11:53:57.000000Z

Thank you for the close reading! That’s definitely a mistake on my part, I’ll fix it shortly.

scotty79 · 2024-09-13T12:36:54.000000Z

So this lets you compress a collection of strings and cheaply decompress any of them individually?

jcgrillo · 2024-09-13T16:12:42.000000Z

Yes, you can train a compressor[1] on a corpus of text, then compress individual words in the text and store their compressed bytes in e.g. a Vec<Vec<u8>>[2] and finally for any of the Vec<u8>s you can access the decompressor[3] and decompress (a slice of) the Vec<u8>[4].

[1] https://docs.rs/fsst-rs/0.4.1/fsst/struct.Compressor.html#me... [2] https://docs.rs/fsst-rs/0.4.1/fsst/struct.Compressor.html#me... [3] https://docs.rs/fsst-rs/0.4.1/fsst/struct.Compressor.html#me... [4] https://docs.rs/fsst-rs/0.4.1/fsst/struct.Decompressor.html#...