Behind the Scenes at Facebook: Scaling Up FBChat Using Erlang

jfarmer · on Aug 18, 2008

FWIW, this is just a republication of a 3-month-old blog post: http://www.facebook.com/note.php?note_id=14218138919&id=...

KirinDave · on Aug 18, 2008

I'm surprised to hear that they're using the alt erlang-thrift binding in production. In Powerset's application of it, it begins to rattle and break down when you push it about 90r/s. I wonder if FBChat's per-connection load is lower than that?

Thrift is a great idea and I've used it a lot, but it's frustrating to see it so tied to C++. The idea of code generation for your IDL interfaces is so last decade. We've got experimental thrift bindings that treat the IDL as a runtime-compiled source, so on the Ruby or Erlang side you can use their dynamic strong typing to avoid the necessity of code generation that the C++ side suffers.

ComputerGuru · on Aug 18, 2008

Current fbChat loads are really bad.

I believe there's some sort of prioritization going on - connecting from a "3rd world" ISP to Facebook gives me some really bad lag and disconnects, while going through a VPN (server in LA) provides much better response times. If it were just this one ISP, I'd blame it on them; but in my months of using fbChat, I've noticed some clear scaling problems for their "real-time" service compared to the other, more static parts of their site.

Obviously this isn't a scientific study in any way, but fbChat has always been a bad performer.

neilc · on Aug 19, 2008

Does anyone actually use Facebook chat? Leaving aside the details of the implementation, chatting via the FB web interface seems terribly awkward. And while I can use XMPP, most of the people you'd be chatting with won't be.

prakash · on Aug 18, 2008

"it begins to rattle and break down when you push it about 90r/s"

Can you explain more about this? What were the main bottlenecks?

KirinDave · on Aug 18, 2008

Honestly we only hit this traffic in trials, so the details aren't at the top of my mind.

There seem to be some exceptions thrown once the traffic goes above 75r/s, and they slowly degrade performance until 90r/s where we were maxing out our xen slice.

I didn't bother to isolate exactly what was going on because we stopped using thrift at that layer for other reasons (more ruby was introduced, and until Kevin Clark's second rewrite the ruby thrift bindings were pretty painful to use).

I can dredge up the logs if you're interested. I'll contact you outside this venue.

stcredzero · on Aug 18, 2008

Wow. Testing with a "Dark Launch." Basically, they used cycles from every user machine with an open Facebook page to do their load testing by distributing a version of their app with no UI and a test script. This is very powerful and kind of scary.

arockwell · on Aug 18, 2008

Google did something really similar with their gchat launch on Orkut. They started out with just 1% of the userbase running it in the background and scaled it up from there till they were confident that the implementation details were mostly worked out. It's really a great idea for testing the scalability of new features without completely breaking the production site.
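
For anyone curious what a staged rollout like that might look like in code, here is a minimal sketch in Python. It assumes users are assigned to buckets by a stable hash of their ID, which is one common way to do it and not something Google or Facebook have described; the function name `in_rollout` and the feature name are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    Hypothetical helper: the hash keeps each user's bucket stable, so
    raising `percent` only ever adds users and never drops existing ones.
    """
    # Salt the hash with the feature name so different features
    # do not all pick the same 1% of users.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999
    return bucket < percent * 100


if __name__ == "__main__":
    users = [str(i) for i in range(20000)]
    for percent in (1, 10, 50, 100):
        enabled = sum(in_rollout(u, "chat", percent) for u in users)
        print(f"{percent}% target: {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the user ID and the feature name, everyone enabled at 1% is still enabled at 10%, so the load grows step by step as the percentage is raised.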

axod · on Aug 18, 2008

I wonder what their stats are like so far. Are people using it to chat?