Hacker News new | past | comments | ask | show | jobs | submit login
Erasure Coding for Distributed Systems (transactional.blog)
284 points by eatonphil 4 months ago | hide | past | favorite | 57 comments



I'm surprised rateless fountain codes aren't mentioned! If you enjoy this sort of thing, you'll find the Luby Transform Code fascinating: https://en.wikipedia.org/wiki/Luby_transform_code

This paper is a really nice overview with more detail: https://switzernet.com/people/emin-gabrielyan/060112-capilla...

LT codes are used as the "outer code" in the linear time RaptorQ encoding specified in RFC6330: https://www.rfc-editor.org/rfc/rfc6330


I have implemented RaptorQ and RFC6330.

First, the rfc is pointlessly complex and optimized for files, not for streaming. if you want to play with it, manage blocks by yourself, ignore the asinine interleaving and block size management.

Second, the algorithm is actually split in two parts, and while the second (generation of repair blocks) is linear, the first is cubic on the number of messages that you put together in a block (~~ matrix gaussian elimination).

And while parts of both encoding and decoding can be cached, I think that "linear time" encoding for raptorq is actually just false marketing speak.


Are rateless fountain codes the better solution and if are there any systems that are using them?


Amplidata did this https://en.wikipedia.org/wiki/Amplidata

It's a great solution (fast, storage overhead of about 1.2%) iff your data is immutable.


Aren't there patent problems with fountain codes?


Luby's original paper was published in 2002. Not sure about RaptorQ though...


IIRC, Qualcomm still holds patents on RaptorQ. They do provide a blanket license exemption for implementing RFC6330.


Years back someone proposed a cute algorithm for erasure codes that depended not on spinning rust but on multipath networking.

I believe they called it network coding and the idea was in a network with multiple routes I might get a file faster by pulling an erasure code that used two parts of the file or even two files from one upstream instead of waiting for the entire file from the primary server.


This has been used in Ceph for a long time:

https://docs.ceph.com/en/latest/rados/operations/erasure-cod...

I would not be surprised if there was a lot of stuff like this behind S3 and other cloud storage systems too, at least in the lower-access tiers of storage, but I have no actual knowledge of AWS or GCP systems.


Yeah, I use it in my homelab and it is really awesome to have "RAID(5/6)" basically work over the network.


Network coding is more than that, participants in the graph can synthesize new parts on the fly from parts they just got without having the whole thing.

FWIW, freenet at least uses fec-coded files so that you can have some flexibility in what parts you get and durability against a file becoming broken just because a single part gets lost.


and usenet binaries with https://en.wikipedia.org/wiki/Parchive and even earlier with RAR's recovery block, and probably even earlier with BBSes, but my memory is failing me what I was using before it in the 1990s.

edit: I see immediately below while I was composing this, someone mentioned pararchive..


I thought about something like that to make video calls more stable. For example I'll get completely different routes (sometimes noticeably lower/more predictable latency) when I use a VPN to connect to some peer in the US (from Europe.) Would be cool to combine different routes.


One difficulty with using multiple routes is you'll probably need to spend a lot of bytes on active probing, because the quality of a connection may change during the call and when the active connection loses quality it's not apparent what the other connections will do, so you need recent probing from them as well to make a smart change.

But in theory, you should have several routing options in a well supported calling service. I'll illustrate one direction, but the same options apply in the other direction, and there's no need for both peers to use the same connection to send (although WebRTC will)

Peer A -> Peer B

Peer A -> Relay near A -> Peer B

Peer A -> Relay near B -> Peer B

Peer A -> Relay near A -> Relay near B -> Peer B

If at least two of the four hosts mentioned have IPv4 and IPv6, you can also add those permutations. It's pretty typical to have different routing for v4 and v6.


It’s called SD-WAN


can you explain? I tried looking it up but I didn't quite understand how it is called SD-WAN.


With something like https://github.com/cberner/raptorq you can do several gbits/s over high latency/lossy UDP.


Erasure coding has been around for a very long time. Remember PAR2 files on Usenet? https://en.wikipedia.org/wiki/Parchive


I was unpleasantly surprised by but thankful to have found eclecticlight.co’s findings about PAR2. When I learned about PAR2 I immediately wanted to make par files for everything because bit rot scares me. But, from https://eclecticlight.co/2020/04/20/file-integrity-5-how-wel... :

> This has serious implications for the use of Par2 with files much larger than 20 MB, and probably rules it out as a method of ECC for those larger than 1 GB.

I assumed 10% PAR file size == resistance to 10% of the input file being corrupted, but that’s not how it works. The article shows some nonlinear and non-obvious relationships between input file size, par file size, and maximum number of recoverable errors.


Bitrot is reasonably handled by erasure codes by simply having CRC32 checksums (or similar) verifying the parts.

If a piece has bitrotted away, then you throw away the whole segment.

CRC32 is closely related to ReedSolomon / Galois Fields. It's basically a repeated division + remainders in Galois Field. And as we all know: Division is very good at mixing up bits (true in normal math as well as Galois Fields).

The real benefit of cyclical codes is the guarantee to catch any burst error of size 32 or less (for a CRC32). You only get a chance of false negatives if the error region is larger than the CRC size.

------

Indeed: the whole erasure code / correction code thing has complex math constructs so that these tight guarantees can be made. (Be it CRC32 or ReedSolomon, or any old school biterror algorithm).


> ecause bit rot scares me.

Use WinRAR/RAR recovery record for the important things.

There is one site what still mandates 5% RR for the archives because before the ubiquitous HTTPS the trashed in transit archives were the norm.


It's my understanding that par2 is designed for missing files (parts of a multi part archive), not the uniform random bit rot corruption used in that article. I think it can recover a much larger corrupted or missing block, approaching the size of the parity files.

But yeah if that's your data loss model then par2 isn't the right approach. (Not sure what is.)


I think something about the test methodology in that article is severely flawed.


When I was younger, I literally thought PAR's were magic files. I had no idea how they worked, and from a distance it was magic.


"Any sufficiently advanced technology is indistinguishable from magic." -- Arthur C. Clarke

I thought the same thing when using PAR files. They're still useful today if you save things on media that can be damaged (CD, DVD, Blue-Ray) or across multiple multiple media.

Eventually, I decided to dig into the math behind it. It is a surprisingly simple principle:

Given polynomial of a degree X and an array of data points of size X, there is one and only one solution to the polynomial's coefficients such that it will pass through those data points.

So, stripe the data into bands of arrays, compute the polynomial, and compute additional data points of the curve, and save it with the original data. If you have at least the array's size of data points ( original array and/or parity values) and know the place in the list for each data point (thus which data is missing), there is one and only one solution to the polynomial equation. Once you solve the polynomial again, you can compute any point, including the missing ones. Again, because there is one and only one solution for the curve.

The devil is the math necessary solve the polynomials, which is why it is so computationally intensive.


> "Any sufficiently advanced technology is indistinguishable from magic." -- Arthur C. Clarke

"If anything seems like magic you're not asking enough questions." -- Mario Romero Vega


> They're still useful today if you save things on media that can be damaged (CD, DVD, Blue-Ray)

For writable disk media, there is also dvdisaster:

https://dvdisaster.jcea.es/


PAR files use ReedSolomon error correction which IMO might as well be magic.

Galois Fields are really awesome (and are related to CRC codes). The level of effort to learn is quite high. NASAs guide to ReedSolomon was amazing though

-----------

XOR codes are easier and sloppier. But are actually what's used today despite being imperfect.

Let's say you have A, B, C... Z as your data.

Parity#1 could be XOR(A, B, Z). If B were missing, Parity#1 XOR A XOR Z can back-calculate B.

Parity#2 can be a random set of all previous entries (including Parity#1). Etc. etc. etc.

Keep adding XOR parity codes until your probability of reconstruction is high enough to please you.

I believe this concept is called a Fountain Code.



Yes, this is it.

It's a very shallow introduction to Galois Fields but it's just barely enough to reach Reed Solomon encoding and understand error correction codes.

As I said earlier: it's a lot of math, even in this simplified form. Abstract conceptual math. But I do find this kind of abstractness very fun.


Also I’ve found that while most people (non-tech) don’t have a concept of XOR, they probably took basic algebra and understand 1+?=3

Arithmetic wouldn’t be a good implementation due to integer overflow (a problem XOR doesn’t have) but it’s helpful if you ever have to explain it to the less technical business person who you need to sign off on the purchasing decision.


that's how i used to explain what a nonce was to explain what all the computers were doing to "mine" bitcoin. and then explain "they're instead trying to get a number that has a certain number of zeros in a specific place"


Yes, also RAID5 has been in use at least since the 1980’s


If youre interested in EC you might want to consider larger multi dimensional cases. Think of encoding not just across spindles, but another failure domain like rack, room, DC, or region. The goal being to tolerate common component failures, and larger system failures (or partitions) as well. A nice intro https://chameleoncloud.org/blog/2023/12/12/design-considerat...


You also need to take into account other constraints, like recovery time. If one of your companies datacenters gets destroyed, then it's great to be able to recover from the other 6 using clever erasure codes, but if that recovery requires reading every byte of data in every other DC and sending it over the network, it's gonna take 6 months+ to transfer all that data over a cross ocean fiber which might only be 1 Tbps.


Has anybody used Wirehair in a project? https://github.com/catid/wirehair

I'm curious if it's well-defined enough to base a standard around--informally if not formally--for building a large file archiving/data recovery project I've been mulling over for nearly 10 years. It's the only large block erasure code I've found that has both the ideal (or nearly ideal) algorithmic performance and API. That makes it a nice blackbox for my use case, unlike something like RaptorQ, which leaks little details all over the place, driving up the complexity and rigidness of the rest of the stack. But Wirehair isn't a spec, it's an (experimental?) implementation of an idea. It seems stable, but unless/until I try writing a second implementation, or it's seen substantial use (exposing any sharp edges in the algorithm) I worry how easily it would translate to a reliable specification or second implementation.


We previously used it in Bitcoin Fibre (a fork of the Bitcoin node software with special enhancements for block relay). It's extremely nice.

Be aware that Qualcomm might claim that its covered by RaptorQ patents (it is conceptually related), though the earliest of those are about to expire (or just expired, haven't checked the file wrapper lately) and QC has made some commitment to not apply the RaptorQ patents outside of wireless (but that might be only for conforming implementations, I don't recall).

I've looked at what it would take to specify it-- which would be something that we'd want to do if using it in the bitcoin protocol proper and wasn't super excited about doing it-- even though myself and several other bitcoin developers are quite comfortable with number theory and error correcting codes. It's just that wirehairs structure has a fair amount of adhoc-ish details and knowing us we might get sucked into a trap of improving it. :)

There might be some renewed interest in bitcoin land at getting a fountain code into wide use, if so waiting a while might result in someone else writing a spec.

Depending on your exact application you might find https://github.com/catid/fecal interesting too... if your expected erasures count is very low it could be faster than wirehair.

Leopard is mentioned in the article-- it's not a fountain code but it has a pretty big block size. It has a nice advantage for specification: it's merely a very fast implementation of a boring RS code (so a spec would arguably only need to document the field and generator choice).


> Bitcoin Fibre

Wow, been a long time since I've heard that project named. What ever happened with that?


The bitcoin satellite broadcast stuff uses a fork of the fibre codebase with changes to the FEC to better adapt it to recovering from signal disruptions. Otherwise it's not maintained anymore.

Fibre contained a number of complementary optimizations. A couple have been merged into popular node software-- that's what compact blocks (BIP152) is. The rest are still valuable particularly now with miners frequently including never-relayed transactions, CPUs having become ever more faster relative to latency, and some of the underlying number theory toys from proposed but not ever implemented further fibre enhancements having since been implemented for other purposes. Though the unmerged parts were the harder parts to upstream.

But with many of the contributors who worked on that sort of stuff driven out by vexatious litigation and other forms of harassment, I dunno if anyone will continue that work any time soon. Although, I think there has been some interest. The remaining contributors that I think are more likely to work on things like that have been busy developing an improved combinitoric optimizer for transaction selection, but that work is almost finished.

The two major things fibre had that are missing now is using a fountain code to fill in missing data from blocks and transmission over UDP. The two can be implemented independently but they're much more powerful together. Both are independently kind of complicated to implement in a non-yolo way: the error correcting code because it seems rocketsciency and so it narrows the pool of people willing to try working on it, and the UDP because it's a whole separate P2P protocol arguably, and has a lot of connection and resource management questions that have to be competently answered (some of which were just punted on in fibre).


Great answer @nullc, thank you for the context details. I always felt like having a subnetwork that was super fast optimized for sending transactions around the mempool would be really useful.

The optimizer is a fascinating knapsack problem to me. A lot of work in ETH land went on with this, especially with the work on Flashbots/MEV. But we hear less and less about it in the public these days.

Cheers


Yep, this is the key tech behind Ceph's Erasure Code pool: https://docs.ceph.com/en/latest/rados/operations/erasure-cod...

This does not come without trade-offs though, you cannot update the coding parameters (k, m) afterwards, so you either have to be very sure that those parameters are going to work in a long time, or you have to start from scratch. This inelasticity is also the reason why replicas are still the dominant choice for HA fault tolerant data storage.


Funny story, if you're using Rook Ceph and say "I'm going to try to update these parameters to see if it will let me (and trigger a re-encoding)" it absolutely will let you change them, but it does not trigger anything to re-encode.

It just uses --force and leaves you with a corrupt filesystem.

I suppose that's only really funny in a "you had to be there, and not be me" sense.


Am I right in thinking that products made during an M of N incident are coded differently to when all N are available? If so, you might want a bitflag to denote "needs to be re-encoded when the N is restored" or else you have some files with less than stellar recovery for a random loss in the N set.


Whenever you have a stripe with missing chunks they need to be re-encoded ASAP because those stripes will be lost if they lose enough chunks. Every distributed storage system needs some kind of librarian to go around grooming the stripes to keep them out of danger.


My point was specifically to new things created during the period. Resilvering happens on addition of a replacement drive in ZFS. I am less sure that otherwise valid, complete checksum states of files made during the loss period get uplifted to the wider stripe count when that is done.

I say that because I have seen some stuff which suggests when you grow the FS with new VDEV in ZFS there are circumstances where the balance is not fixed and you can have persisting unbalanced IO state.


In distributed systems you don't need to be constrained to writing to a set of devices one of which is not available. You can just write anywhere, and remember where.


just a suggestion: let the curator's activity be called "preening" rather than "grooming"...


Reminds me also of Rabin’s Information Dispersal Algorithm, described in a paper here:

https://dl.acm.org/doi/10.1145/62044.62050


Is it the case that it's really only practical for read-only or very read intensive workloads?


This is one of the replication strategies that ceph uses for its distributed blob stores


unfortunately modern cpus are still pretty sparse on tools to make erasure codes extremely fast. E.g. no vector clmuls.


Almost all (x86) CPUs sold have GFNI. That can pretty much saturate memory bandwidth on a single core or two. You can use SSSE3 pshufb for the rest which is about half the speed.

ARM has NEON and SVE/SVE 2. They also operate very fast.

So not sure what you are thinking of.


GFNI only does 8-bits and sticks you with a single modulus. But for some reason I'd forgotten it completely, you're totally right to give me a mystified response.

(FWIW, it's possible to project elements from one field to another isomorphic field, though it takes enough operations that for fast code like RS decoding the conversion is probably performance limiting).

For hybrid codes GFNI should be sufficient, though for things like using RS at 16/32 bit sizes it's not.


The matrix multiplication needed to correct grows at n^2 despite the code only growing at n. There are asymptotic faster matrix multiplications in theory than O(n^2) but in practice every algorithm is O(n^2).

As such, large ReedSolomon codes are impractical. If you need a larger code than what GF(2^8) can offer, you grow with 2-dimension codes, slicing or other features.

In practice, this sacrifices Minimum Distance property, meaning you should use a Turbo Code (or other XOR code) which are O(n) but imperfect.

---------

CRC32 can be implemented in GFNI. And AES is also GF(2^8).

----------

I don't think there are many algorithms where GF(2^16) or bigger are needed.

And if they did, it's possible to turn 8x8 into 16x16 or 32x32 anyway.


RS codes can be done in N log N time for both encoding and erasure decoding, and sometimes achieving MDS is useful for attack resistance... e.g. that you can always decode with N blocks, vs with non-minimum distance codes you can have adversarial subsets that will fail to decode even when you have significantly more blocks than needed.. Larger word sizes for RS codes is also useful for list decoding even when the number of data blocks isn't that great.

And sure, list decoding is slow. But I think there are two distinct groups of applications for good instruction sets: one is where you're trying to make a fast thing like an 8-bit RS code or something "free" by having it run at near memory, wire, or disk speeds (or make it consume little power). but the other is where you're doing something legitimately slow, including things that are O(N^2) (or worse). In those cases sometimes a small constant factor makes a big difference between usable and very much not usable.


In those cases, I know that ARM has PMULL / PMULL2 and x86 has PCLMULQDQ, which go up to 64x64 == 128-bit multiplication.

There is even an AVX512 version of PCLMULQDQ.


In my experience, the hybrid approach is practically better due to cache locality anyway, so I don't think even native GF16 or GF32 would be of much help.

FWIW, I've ported rs-leopard to Go and found it very effective, except in the "recover one" scenario, where it is only at 50% speed of the "recover all" scenario, since it has to do several reconstructions to get one output. But even so, I am not too sure it would be much better with plain GF16, since you would still need to touch most shards to get one shard out.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: