
Deduplication can literally save petabytes.



deduplication is probably the biggest "we don't do that here" in the ZFS world lol, at this point I think even the authors of that feature have disowned it.

it does what it says on the tin, but this comes at a much higher price than almost any other ZFS feature: you have to keep the dedup tables in memory, permanently, to get any performance out of the system, so the rule of thumb is that you need at least 20GB of RAM per TB stored. In practice you only want to do it if your data is HIGHLY duplicated, and that's often a smell that building a layered image from a common ancestor using the snapshot functionality is going to be a better option.
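
to put rough numbers on that, here's a minimal sketch (not how ZFS itself does it): hash a sample of your data in fixed-size blocks to see how duplicated it really is, then apply the 20GB-per-TB rule of thumb from above. The function names and the 128 KiB default block size are my own assumptions for illustration, and the RAM figure is just that rule of thumb, not a measured number.

```python
import hashlib

# Rule of thumb quoted in the comment above; an assumption, not an official figure.
RAM_GB_PER_TB = 20


def dedup_ratio(data: bytes, block_size: int = 128 * 1024) -> float:
    """Return total blocks / unique blocks for fixed-size blocks of `data`.

    1.0 means nothing is duplicated; higher means more duplication.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        return 1.0
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)


def dedup_ram_gb(stored_tb: float, gb_per_tb: float = RAM_GB_PER_TB) -> float:
    """Estimate RAM (GB) needed to hold the dedup tables for `stored_tb` TB."""
    return stored_tb * gb_per_tb
```

if `dedup_ratio` on a representative sample comes back close to 1.0, the RAM that `dedup_ram_gb` says you'd need is buying you basically nothing.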

and once you've committed to deduplication, you're committed... dedup metadata builds up over time and the only time it gets purged is if you remove ALL references to ANY dedup'd blocks on that pool. So practically speaking this is a commitment to running multiple pools and migrating between them at some point. That's not a huge problem for enterprise, but most people want to run "one big pool" for their home stuff. All in all, even for enterprise, you have to really know that you want it and that it's going to produce big gains for your specific use-case.

in contrast, LZ4 compression is basically free (actually it's usually faster, since fewer blocks have to hit the disk) and still performs very well on things like column-oriented stores, or even just unstructured JSON blobs. It also imposes no particular limitations on the pool: it's just compressed blocks.
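
if you want a feel for how compressible "just JSON blobs" are, here's a rough sketch. Big caveat: Python's standard library has no LZ4, so this uses zlib (DEFLATE) at its fastest level as a stand-in. LZ4 will typically give a lower ratio and run faster, so treat the number as a ballpark only. The sample records are made up.

```python
import json
import zlib


def compression_ratio(payload: bytes, level: int = 1) -> float:
    """Return original size / compressed size using zlib (NOT LZ4)."""
    compressed = zlib.compress(payload, level)
    return len(payload) / len(compressed)


# Hypothetical, highly repetitive JSON records, purely for illustration.
records = [{"id": i, "status": "ok", "region": "us-east-1"} for i in range(1000)]
blob = json.dumps(records).encode()
ratio = compression_ratio(blob)
print(f"compressed to 1/{ratio:.1f} of the original size")
```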



