>Are you ready to do this for your Whatsapp? I don't need to thanks to the fact ...

dmitriid · on Sept 30, 2019

Note how I said it was an incomplete list of patches only.

There's also significant tuning and optimisation, both for the Erlang VM and FreeBSD.

There are also things like (quotes from Highscalability):

"Mnesia: Using no transactions, but with remote replication ran into a backlog. Parallelized replication for each table to increase throughput."

"When Rick is going through all the changes that he made to get to 2 million connections a server it was mind numbing. Notice the immense amount of work that went into writing tools, running tests, backporting code, adding gobs of instrumentation to nearly every level of the stack, tuning the system, looking at traces, mucking with very low level details and just trying to understand everything. That’s what it takes to remove the bottlenecks in order to increase performance and scalability to extreme levels."

Or even things like "What has hundreds of nodes, thousands of cores, hundreds of terabytes of RAM? The Erlang/FreeBSD-based server infrastructure at WhatsApp". Oh, wait. Erlang's default distribution mechanism grinds to a halt when there are more than ~60-80 nodes. And Mnesia has a 2GB limit on table sizes. So you have to work around those limitations yourself.

There are no magic bullets. Erlang will only take you so far. The rest (80-90% of the way) you have to take on your own, and you have to know what you're doing, and what needs to be done: patches, tuning, workarounds, limits of the systems you work with etc.

toast0 · on Sept 30, 2019

> Erlang's default distribution mechanism grinds to a halt when there are more than ~60-80 nodes. And Mnesia has a 2GB limit on table sizes. So you have to work around those limitations yourself.

I've seen people say these, and I have no idea where they come from. If you have a decent network, dist works fine at well over 80 nodes, but everyone says it doesn't work. pg2/global has some sharp edges if you're trying to have many nodes acquire the same global lock when you have a lot of nodes (a few hundred) or a smaller number if you have a lot of latency between them. There's options though -- maybe you don't need to acquire the same lock on all nodes, or maybe you can look in pg2.erl and global.erl and wiggle the locking code until it no longer live locks.

The supposed Mnesia 2GB limit is a bunch of hooey. Yes, disc_only_copies has (or had) that limit, because dets has that limit. Yes, it's a sharp edge, because there's no warning about it. However, a 2GB dets table is awful to work with anyway. You want to use disc_copies or ram_copies for big tables. Also, mnesia_frag is well supported, so if you really wanted to, you could make your disc_only_copies table 1024 fragments, and have 2 TB of dets, if that's how you wanted to roll.
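
For anyone who wants to check the fragment arithmetic, here is a rough sketch in Python. It assumes each fragment is backed by one dets file capped at 2 GiB; the function names are made up for illustration and are not part of Mnesia. Treat the result as an upper bound, since hashing won't fill every fragment evenly.

```python
# Assumption: one dets file per disc_only_copies fragment, each capped at 2 GiB.
DETS_LIMIT_BYTES = 2 * 1024**3
TIB = 1024**4


def total_capacity_bytes(n_fragments: int) -> int:
    """Upper bound on total size if every fragment filled to the dets cap."""
    return n_fragments * DETS_LIMIT_BYTES


def fragments_needed(target_bytes: int) -> int:
    """Smallest fragment count whose combined cap covers target_bytes."""
    # Ceiling division without floating point.
    return -(-target_bytes // DETS_LIMIT_BYTES)


if __name__ == "__main__":
    print(total_capacity_bytes(1024) // TIB)  # capacity of 1024 fragments, in TiB
    print(fragments_needed(2 * TIB))          # fragments needed for 2 TiB
```

In practice you would want headroom well below the cap, because the first fragment to hit the limit is the one that hurts.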

And yes, if you're going to hyperscale, you're going to need a couple people who know how to figure out what your system is doing. Is there a language/environment where that's not true?

I claim, without real proof, that Erlang's BEAM VM and OTP standard library are easier to understand and tweak when you do hit problems. You'll note, however, that Rick Reed's first presentation was when he had been at WhatsApp for about a year, and he had zero experience with Erlang before that.

fouc · on Sept 30, 2019

Honestly, I'll worry about it when I get there.

EdwardDiego · on Sept 30, 2019

It's like open source works or something.