In Twitter’s early days, only one celebrity could tweet at a time (theoutline.com)
243 points by evanweaver on May 24, 2018 | 137 comments



If anybody is interested in random Twitter internal stuff I might bang out a Medium post one day. I was neck-deep in the infra side of things for years and have all sorts of funny stories.

Our managed hosting provider wouldn't let us use VPNs or anything that allowed direct access to the managed network they provided, but we wanted to make internal-only services that were not on the internet, so I set up a simple little system that used DNS to point to private address space in the office and an SSH tunnel to forward the ports to the right places. Worked great, but over time the internal stuff grew, and our IT team refused to let me have a server in the office, so it was all running off a pair of Mac Minis. We called them the "load-bearing Mac Minis" since basically 90% of the production management traffic went over the SSH tunnels they hosted. =)
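
A minimal sketch of that pattern (the gateway, hostnames and ports below are made-up assumptions, not the actual Twitter setup): internal DNS resolves to private address space in the office, and a box there holds long-lived SSH tunnels that forward each port into the managed network.

    # Hypothetical sketch: one long-lived ssh process carrying several port forwards.
    import subprocess

    FORWARDS = [
        # (local port, host inside the managed network, remote port) -- all made up
        ("8080", "dashboard.prod.internal", "80"),
        ("4567", "deploy.prod.internal", "4567"),
    ]

    def open_tunnels(gateway="managed-gateway.example.com"):
        """Start ssh with -N (no remote command) and one -L forward per service."""
        args = ["ssh", "-N"]
        for local_port, host, remote_port in FORWARDS:
            args += ["-L", f"{local_port}:{host}:{remote_port}"]
        return subprocess.Popen(args + [gateway])

    if __name__ == "__main__":
        open_tunnels().wait()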


Posting that here is like showing a big juicy steak to a pit full of hungry lions: of course we're interested!


Make a mailing list please. Don’t want to miss that


Extremely interested! Sounds like some great stories from the trenches.


I like the name "load-bearing Mac Minis".

Reading that, I wonder how many Mac Minis are used by actual users vs. used headless in offices for various purposes.

There is one such Mac Mini in my office (named "buildbot", because that was its original purpose) that is heavily used by the whole dev team (thankfully not connected in any way to production, apart from simple uptime monitoring).

Obviously, like anyone else, I'm interested in those funny stories you have.


Please make that post - would love to read it.


Please do! Sounds very interesting!


2010: "At any moment, Justin Bieber uses 3% of our infrastructure. Racks of servers are dedicated to him"

https://gizmodo.com/5632095/justin-bieber-has-dedicated-serv...


That was a period when the only people talking outside the company were the ones not trying to fix things inside of the company.

Twitter never had dedicated machines for individual users. It's just not how the infra ever worked (at least until I left). Requests landed on random boxes behind load balancers, those boxes talked to pools of memcache or MySQL, etc. At no point were there ever "racks" or "machines" dedicated to specific individuals like the article claims.

That being said, some users created crazy hot shards when specific tweets went absolutely madhouse, especially Bieber and those like him.

Random Twitter internals tidbit: we had a unit of measurement called an "MJ". It's the number of tweets per second we were handling when the rumors of Michael Jackson's death were circulating. It basically overloaded the system and had us running around on fire. It was 465 tweets a second. Within a year we had crossed a line where we never dropped below that number again. Hence "we are at about 12 MJs" was jokingly used a couple of years later to compare that hair-on-fire day to an ordinary one. =)


Sorry, we just kept the Bieber box hidden from you in the "special closet". It was a SPARCstation 20 and we were afraid if ops found it they would shut it down.


You know.. I found a special box (PowerMac Pro) in the "special closet" that had been moved like 5 times. It was Blane's old desktop and it had the twitter codebase checked out from like ~2007 era? Kind of nifty seeing just how much had changed since that got tucked away.


But, but, but...why does Twitter have so many engineers? I could write Twitter in a weekend!

--95% of anti-TWTR posters circa 2010-2016.


Before being acquired, WhatsApp had what, 30 employees?

How did they do it? I know they used custom BSD servers so that a single box could keep close to 1M TCP connections open. I'm sure with a fixed target to aim for and all scope known upfront a small crack team of devs could do something similar for Twitter.


One-to-one vs. many-to-many messaging. The amount of work you need to do to deliver a WhatsApp message is constant and small -- just route the message to a single recipient's mailbox. The amount of work Twitter has to do to deliver a message grows as a function of followers. One celebrity tweeting another celebrity means you have to deliver the message to the mailboxes of the followers of both -- millions of times more work than WhatsApp per message. In addition, Twitter persists all the messages while WhatsApp doesn't.
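
To make that concrete, here's a toy sketch (not Twitter's or WhatsApp's actual code) of why the delivery cost differs so much:

    # One-to-one delivery does constant work; fan-out on write does work
    # proportional to the author's follower count.
    from collections import defaultdict

    mailboxes = defaultdict(list)   # recipient_id -> delivered message/tweet ids
    followers = defaultdict(set)    # author_id -> set of follower ids

    def deliver_direct_message(recipient_id, msg_id):
        # WhatsApp-style: one mailbox append per message.
        mailboxes[recipient_id].append(msg_id)

    def deliver_tweet(author_id, tweet_id):
        # Twitter-style: one append per follower, so a celebrity with
        # millions of followers means millions of writes for one tweet.
        for follower_id in followers[author_id]:
            mailboxes[follower_id].append(tweet_id)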


This! Everyone keeps saying it's BSD and Erlang and jumping on the Erlang train. Fine tools btw, but WhatsApp is super simple compared to Twitter. rolls eyes


The beauty of Erlang/OTP!


Especially if you completely rewrite Mnesia.


I think this post is relevant here: https://danluu.com/sounds-easy/


I mean, obviously there must be a lot more going on inside Twitter besides being a massive many-to-many pubsub messaging infrastructure. And frankly Twitter could be improved by leaving a lot of stuff out and up to the client app, so it's not entirely unreasonable?

> In one incident, he wrote, an error caused every user to log in as somebody else each time they refreshed the page.

Gotta admit, this story raised my eyebrow; this wasn't the dark ages of the web. What sort of crazy experimental authentication voodoo were they running?


Instagram had 13 employees when it was sold to FB.

Did any major companies have the technology growing pains that Twitter did - e.g. Facebook, Google, Amazon, YouTube, Netflix (streaming), Dropbox, etc.?


Bieber-related conversation might reasonably have been 3%. Updating his friend count was a constant source of lock contention on a particular MySQL server.

There were never explicitly dedicated databases/servers, unless you count the hotspot caused by whatever shard he hashed into.


You would think a HyperLogLog would be approximately good enough when follower counts reach tens of millions.


HyperLogLog was only just invented, and hadn't really made its way into the popular developer consciousness.

Anyhow, the solution is easier than that (implement lock striping, i.e. https://stackoverflow.com/questions/16151606/need-simple-exp...), but I was reorg'd away before my code hit production, so I'm not sure what happened there.
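
For anyone unfamiliar with the term, here's a minimal in-process sketch of lock striping applied to a hot counter (purely illustrative; the real fix would be split rows in MySQL rather than Python locks): split one contended counter into N independently locked stripes and sum them on read.

    import random
    import threading

    N_STRIPES = 16

    class StripedCounter:
        def __init__(self):
            self._locks = [threading.Lock() for _ in range(N_STRIPES)]
            self._counts = [0] * N_STRIPES

        def increment(self, delta=1):
            i = random.randrange(N_STRIPES)   # any stripe works; collisions become rare
            with self._locks[i]:
                self._counts[i] += delta

        def value(self):
            # A read sums all stripes (and may be slightly stale, which is fine here).
            return sum(self._counts)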


Approximate counters are much much older than HyperLogLog. Here’s something from 1978: https://www.inf.ed.ac.uk/teaching/courses/exc/reading/morris.... Of course this solves a different problem because HyperLogLog is an, uh, interesting way to count followers (why would you need to do a count-distinct query? You can’t follow someone multiple times). In any case, the Flajolet Martin sketch dates back to 1985 and solves the same problem as HyperLogLog: https://en.wikipedia.org/wiki/Flajolet%E2%80%93Martin_algori...
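
For the curious, Morris's 1978 trick is tiny: keep only an exponent c, increment it with probability 2^-c, and report 2^c - 1 as the estimate. A sketch:

    import random

    class MorrisCounter:
        def __init__(self):
            self.c = 0                      # stored exponent, not the count itself

        def increment(self):
            if random.random() < 2.0 ** -self.c:
                self.c += 1

        def estimate(self):
            return 2 ** self.c - 1          # unbiased estimate of the true count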


Here is a nice presentation covering those: https://www.cs.princeton.edu/~rs/talks/AC11-Cardinality.pdf


Did they actually need to know Bieber's friend count both accurately and in real time?

I would expect that for top Twitter accounts like his, you could get a very good approximation by logging events that increment or decrement the friend count, and using the rate at which those events are occurring plus the last known exact friend count to forecast the current friend count. That would not be exact, but I bet you could get it close enough that users would not notice anything off about it.
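
A back-of-the-envelope sketch of that idea (names made up): keep the last exactly computed count plus the recent follow/unfollow rates, and extrapolate at read time instead of touching the hot row for every event.

    def forecast_friend_count(last_exact, last_exact_ts, follows_per_sec,
                              unfollows_per_sec, now):
        # Extrapolate from the last exact value using the net event rate.
        elapsed = now - last_exact_ts
        return round(last_exact + (follows_per_sec - unfollows_per_sec) * elapsed)

    # e.g. 10M followers measured an hour ago, gaining ~12/s and losing ~1.5/s:
    # forecast_friend_count(10_000_000, t0, 12.0, 1.5, t0 + 3600) -> 10_037_800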


Yes they made Storm for this


Ha, I always joke that Twitter is a good example of why one should avoid distributed systems. Every so often, you see it showing the wrong number of friends, tweets, likes, etc. I don't care, but I suppose some people care about such metrics.


That's the essence of the CAP theorem. You can have two of three - consistency, availability, and partition tolerance. Sometimes it makes sense to choose availability over consistency.


A couple years ago, Bieber tweeted out a Twilio phone number for people to text. Luckily it happened on a Friday night, or it would have caused widespread outages.

Celebrity is an edge case that you have to be prepared for.



Clicking that was already worth it for the lead picture, best laugh I had today, thank you!


If you want to read more about activity feeds there are a ton of papers listed here: https://github.com/tschellenbach/stream-framework I've been working on this stuff for years. Recently I've also enjoyed reading LinkedIn's posts about their feed tech. There are a few different posts, but here's one of them: https://engineering.linkedin.com/blog/2016/03/followfeed--li...

Scaling a social network is just inherently a very hard problem, especially if you have a large userbase with a few very popular users. StackShare recently did a nice blog post about how we at Stream solve this for 300 million users with Go, RocksDB and Raft: https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...

I think the most important part is using a combination of push and pull: you keep the most popular users in memory, and for the other users you use the traditional fan-out-on-write approach. The other thing that helped us scale was using Go+RocksDB. The throughput is just so much higher compared to traditional databases like Cassandra.
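
A toy sketch of that push/pull split (the threshold and data structures are assumptions for illustration, not Stream's or Twitter's actual values): fan out on write for ordinary authors, and skip fan-out for very popular ones so their posts are pulled at read time.

    CELEBRITY_THRESHOLD = 1_000_000   # follower count beyond which we stop fanning out

    def on_new_post(author_id, post_id, follower_count, fanout_queue, celebrity_posts):
        if follower_count >= CELEBRITY_THRESHOLD:
            # Pull path: record the post under the author; readers fetch it later.
            celebrity_posts.setdefault(author_id, []).append(post_id)
        else:
            # Push path: a fan-out worker appends post_id to every follower's feed.
            fanout_queue.append((author_id, post_id))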

It's also interesting to note how other companies solved it. Instagram used a fanout on write approach with Redis, later on Cassandra and eventually a flavor of Cassandra based on RocksDB. They managed to use a full fanout approach using a combination of great optimization, a relatively lower posting volume (compared to Twitter at least) and a ton of VC money.

Friendster and Hyves are two stories of companies that didn't really manage to solve this and went out of business. (There were probably other factors as well, but still.) I also heard one investor mention how Tumblr struggled with technical debt related to their feed. A more recent example is Vero, which basically collapsed under scaling issues.


I used to work at Hyves. Hyves overcame its scalability issues, but went out of business for other reasons. Hyves used MySQL and Memcache, similar to Facebook at that time.

By the way, RocksDB, which is now Facebook's main database (afaik), is built on top of LevelDB. So both Google and Facebook run on software written by Jeff Dean...


It's a LevelDB fork with substantial changes. For example, a non-binary-search file format option: https://github.com/facebook/rocksdb/wiki/PlainTable-Format

Pull down the code some time. Everything and the kitchen sink is in there somewhere. It's a crazy project.


> I also heard one investor mention how Tumblr struggled with technical debt related to their feed

Not sure I'd agree with that, but I suppose it depends on the context and timing of the statement.

Tumblr's solution for reverse-chrono activity feed is, at its core, <1000 lines of PHP and a few extremely heavily optimized sharded MySQL tables. It is creaky and old, but its relatively small code footprint means it isn't terrible on the tech debt scale.

Tumblr's feed is computed entirely at read-time; there's no write fan-out / no materialized inboxes. The key insights that make the system fast (under 10ms latency to identify which posts go in the feed) even at scale:

* If you are grabbing the first page of a feed (most recent N posts from followed users), the worst-case is N posts all by different authors. If you have a lookup table of (user, most recent post time) you can quickly find the N followed users who posted most recently, and only examine their content, rather than looking at content from all followed users.

* For subsequent pages, use a timestamp (or monotonically increasing post id) as a cursor, i.e. the id or time of the last post on the previous page. Then to figure out which followed users to examine for the new page of results, you only need to look at followed users who posted since that cursor timestamp (since they may have older posts on the current page) plus N more users (to satisfy the worst-case of all N posts on the current page being by different authors that were not on previous pages).

* InnoDB's clustered index PK means it is very, very fast at PK range scans. This is true even for disjoint sets of range scans, e.g. with a PK of (a,b) you can do a query like "WHERE a IN (...long list of IDs...) AND b >= x AND b <= y" and it is still extremely fast. (A rough sketch of this query shape follows the list.)

* You can optimize out the access pattern edge cases and make them fuzzier. Most non-bot users only ever scroll so far; even power users that follow a lot tend to check the feed very often, so they don't go deep each time. On the extreme edge, users who follow thousands don't comprehensively try to view every piece of content in their feed anyway. This means you can measure user activity to determine where to place limits in the system that help query performance but are completely unnoticeable by users.
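
Here's a rough sketch of the query shape those bullets describe, with hypothetical table and column names (the real schema isn't public): first narrow to at most N candidate authors via the (user, last post time) lookup, then do a single PK range scan over the posts table using the cursor.

    # Hypothetical table/column names; placeholders stand in for bound parameters.

    # Step 1: the N followed users who posted most recently
    # (worst case: all N posts on the page are by distinct authors).
    RECENT_AUTHORS_SQL = """
        SELECT followed_id
        FROM   last_post_times
        WHERE  followed_id IN %(followed_ids)s
        ORDER  BY last_post_time DESC
        LIMIT  %(n)s
    """

    # Step 2: PK on posts is (author_id, post_id), so these disjoint range
    # scans are fast; the cursor is the last post id from the previous page.
    CANDIDATE_POSTS_SQL = """
        SELECT author_id, post_id
        FROM   posts
        WHERE  author_id IN %(candidate_authors)s
          AND  post_id < %(cursor)s
        ORDER  BY post_id DESC
        LIMIT  %(n)s
    """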


I've found the Yahoo paper from 2010 to be the best distillation of the general problem of feed following and how to solve it (blending push/pull as you say): http://jeffterrace.com/docs/feeding-frenzy-sigmod10-web.pdf


Social network scaling is not a technology problem.

Technology has imposed scaling on societies and patted itself on the head for the unintended benefits derived, while burying its head in the sand on the unintended consequences.

The roman empire, Genghis Khan, the East India Company, Napoleon, Hitler etc etc etc achieved scale and then what happened?


I haven't seen anyone touch on this, but I remember reading about this in Designing Data-Intensive Applications [1]. The way they solved the celebrity feed issue was to decouple users with very high follower counts from normal users.

Here is a quick excerpt; this book is filled to the brim with these gems.

> The final twist of the Twitter anecdote: now that approach 2 is robustly implemented, Twitter is moving to a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out. Tweets from any celebrities that a user may follow are fetched separately and merged with that user’s home timeline when it is read, like in approach 1. This hybrid approach is able to deliver consistently good performance.

Approach 1 keeps a global collection of tweets; a user's timeline is computed at read time by looking up everyone they follow and merging their recent tweets.

Approach 2 fans each user's tweets out into every follower's timeline on write, cached similarly to how a mailbox would work.

[1] https://www.amazon.com/Designing-Data-Intensive-Applications...
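
A toy sketch of that hybrid read path (the data structures are illustrative, not Twitter's): merge the user's precomputed home timeline from the fan-out with tweets pulled live from the few celebrities they follow, then take the newest entries.

    import heapq

    def read_home_timeline(user_id, inboxes, celebrity_posts, followed_celebs, limit=50):
        # Both sources hold (timestamp, tweet_id) pairs.
        fanned_out = inboxes.get(user_id, [])                  # approach 2: precomputed inbox
        pulled = [post
                  for celeb_id in followed_celebs.get(user_id, ())
                  for post in celebrity_posts.get(celeb_id, ())]  # approach 1: fetched on read
        return heapq.nlargest(limit, fanned_out + pulled)      # newest first by timestamp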


It’s an oft-overlooked inequality in these systems. People get so wrapped up in some whiz-bang thing that they don’t stop to think if they should.

At the end of the day, one of the most important aspects of your information architecture is how many times is each write to the system actually observed? That answer can dictate a lot about your best processing strategies.


This. It took me a while to learn to look at this, and eventually to focus less on caching items that don't get enough reads to justify the caching or precalculation.


This isn't shocking - Twitter was notorious for being technically held together with Scotch tape.

Honestly this hands-on approach is an impressive example of doing things that don't scale.


Oh good. This is where I get to chime in. Twitter wouldn’t have existed without Rails. It was a side project inside of a failing startup. It took Jack and Florian like two weeks to have a solid version up that we could all use and love. That’s two weeks of designing and then having Rails make it super easy to prototype. I was in the management meetings, and my read was that if they’d wasted any time fighting with technology or gold-plating infrastructure then we would have moved on.

So I do see Twitter as a huge Rails success story. To use the Rails term, Rails was the scaffolding. Replacing any part of Rails later isn’t a failure of Rails.


I found it amusing that Twitter was Rails' biggest advertisement. Everyone wanted to use Rails, but Twitter turned into a franken-app with different stacks to keep it running.


Twitter was Rails' worst advertisement. They used Rails as a scapegoat to hide their bad tech. I still hear things like "Rails can't scale; remember Twitter?"


I had the same thought when I read this. I wasn't into Rails back then (or development) so I don't have a sense of context for what the framework was like at that point in time, but the more articles I read about Twitter in the early days, the more of a sense I get that maybe they didn't write the best code.


But Rails is nice to work with and whispers 99% of websites will never need to reach Twitter-scale.


At least 3 large companies that my developer friends work at have, over the last 5 years, switched from Rails to PHP. All told me the same story: "after Twitter, no one wants to work with or touch Rails anymore."


Meanwhile all 3 large companies probably host their code on Github :-)


And even if they don’t, many if not most of their dependencies do.


What will they do when they find out phpBB and WordPress are written in PHP?


Airbnb is still on Rails. Probably the largest Rails app out there for now.


Except that I'd bet Airbnb's read qps requirement is less than 1% of Twitter's. Write qps would be even smaller.


I would imagine Shopify on Black Friday is under heavier load.


True, but on an absolute scale AirBNB is still very big.


By RPS, I imagine Shopify is the largest. But then, from what I understand, Shopify operates in such a way that every shop is its own app, i.e. there are hundreds of thousands of Basecamp/Airbnb-sized apps running on the same code base and every shop is somewhat isolated.

Purely in terms of a single app, I think the largest would be Cookpad. Airbnb never shared their numbers, so I don't know.

https://speakerdeck.com/a_matsuda/the-recipe-for-the-worlds-...


I don't think so. Seriously think about it. How many active users does AirBnB have? How many put their home up for rent, and how many rent in a year? I reckon you get more tweets in a day than AirBnB rentals in a year. That's how big Twitter is or small AirBnB is. :D


Twitter was bigger, but they don’t run on Rails anymore, whereas I gather AirBNB does.


From 2008:

Scaling is fundamentally about the ability of a system to easily support many servers. So something is scalable if you can easily start with one server and go easily to 100, 1000, or 10,000 servers and get performance improvement commensurate with the increase in resources.

When people talk about languages scaling, this is silly, because it is really the architecture that determines the scalability. One language may be slower than another, but this will not affect the ability of the system to add more servers.

Typically one language could be two or three, or even ten times slower. But all this would mean in a highly scalable system is that you would need two or three or ten times the number of servers to handle a given load. Servers aren't free (just ask Facebook), but a well-capitalized company can certainly afford them.

http://www.businessinsider.com/2008/5/why-can-t-twitter-scal...


Yes, well, that's a nice idea in theory. In practice, you could get over 10x (sometimes 100x) the RPS off a box running the new JVM-based services vs. their Rails equivalents. Orders of magnitude probably matter a little less when you're well-funded and have hundreds of servers, but when you are thinking about trying to go public and your bottom line is being scrutinized and you have tens of thousands of servers, it starts to matter.


That's exactly what he said.

but a well-capitalized company can certainly afford them

But these days, when you don't have to buy servers and make a long-term capital commitment and you can use something like AWS, if you have a scalable but inefficient architecture and you have the faith of the investors, you can get enough servers to get you over the hump temporarily, slowly start replacing the most performance-sensitive parts of your architecture, and then scale down.

Look at what HN darling Dropbox did, they bootstrapped on AWS, got big and then when the time was right, they moved to a cheaper architecture - off of AWS and built their own infrastructure.


And this is exactly what Twitter did, and how Twitter replaced Ruby and Rails with the JVM.


I doubt they just "replaced" Ruby on Rails with the JVM without making any architectural changes based on the lessons they learned from their first implementations.


Did I say that? We (I spent 4.5 years there, starting with the writing of the first service extracted from the monorail and left shortly before the 'twitter' repo was deleted) absolutely went through a huge architectural transition, arguably multiple transitions. The biggest was the breakup of the monolithic Rails-based application into microservices running on the JVM. These services generally scaled ten to one hundred times better than the code they replaced. (By "scaled" here I specifically mean that RPS was 10-100X higher per physical machine for services on the JVM as compared to the RPS the Rails stack could handle before falling over).


I was replying to this:

And this is exactly what Twitter did, and how Twitter replaced Ruby and Rails with the JVM

In the context of my original post, where the contention was that languages don't scale, architectures do. Your post was that it was exactly what you did - replaced Ruby with Java. Not that you replaced Ruby with Java and rearchitected the entire stack - exactly what my original post said was the problem with Twitter: the architecture.


Well, that's true to a degree, but only to a degree. If I wrote my database in Ruby, it would be slow compared to the same database written in C++ (assuming equivalent competency in the developers and equivalent architecture). Even a database written in Java benchmarks slower than the same database with the same architecture written in C/C++. Of course architectural changes can make further improvements.

To the point of Twitter, what we _didn't_ do, despite a lot of Ruby expertise on the team, is write a lot of microservices in Ruby. The reason for that is that I don't think you can get the same RPS out of a Ruby service that you can out of a JVM service, all else being equal. In fact HTTP benchmarks for various platforms show this, if you bother to look.


I'm not disagreeing with you. Looking on the outside, Twitter had two issues.

1. Twitter wasn't built on a scalable architecture

2. Ruby didn't use resources efficiently -- it was slower than other stacks.

If Twitter had been scalable, even if it were 10x slower than Java, you could throw 10x the number of servers at it until you optimized the stack then reduce the number of servers needed and the customers would have been none the wiser. Of course the investors wouldn't have been happy. Well at least in today's world. I don't know what the state of cloud services were in 2008. Then you could focus on efficiency.

But since Twitter wasn't scalable, you had to fix the stack while customers were affected. I'm almost sure that even in 2008, with the growth of Twitter, they could have gotten the capital to invest in more servers if they needed them.

It's not completely analogous, but Dropbox is a good counterexample. Dropbox was hosted on AWS at first. Dropbox never had to worry about running out of storage space no matter how big it grew (it had a scalable architecture), but for their use case, they weren't as efficient (i.e. cost, not compute resources). Their customers were never affected by their lack of efficiency because they could operate at scale. They had breathing room to re-architect a more efficient solution.


These are totally different problems, though. Dropbox is a trivial scaling exercise compared to Twitter. (Some Dropbox engineers are going to be all up on me now, but it's simpler by far to shard than Twitter was -- and yes some functionality in Dropbox is probably harder to shard, but the core use case of Twitter generated hot spots by definition).

FWIW, Twitter did what you're describing; we had 4 or 5 thousand hosts running the Ruby stack at its peak. Unicorns and Rainbows, oh my. Then it started shrinking until it shrank to nothing. That period was actually the relatively stable period. The crazy period, the one that I wasn't there for, was probably impossible to architect your way out of because it was simply a crazy amount of growth in a really short amount of time, and it had a number of ways in which unpredictable events could bring it to its knees. You needed the existing architecture to stay functional for more than a week at a time for a solid 6 months to be able to start taking load off the system and putting it onto more scalable software.

Any startup would be making a mistake to architect for Twitter scale. Some startups have "embarrassingly parallel" problems -- Salesforce had one of these, although they had growing pains that customers mostly didn't notice in the 2004 timeframe. Dropbox is another one. If you're lucky enough to be able to horizontally scale forever, then great, throw money at the problem. Twitter, at certain points in its evolution (remember, AWS was not a thing), was literally out of room/power. That happened twice with two different providers.


10x, sure, possible. 100x?

Most of the problem lies in the database. Rails may not be the best architecture for scale, but I doubt you could even get a 100x difference if the bottleneck is in the database.

I can't think of a large JVM-based website at scale off the top of my head, but I consider Stack Overflow, written in ASP.NET, to be one of the best and most optimised sites. Nearly 700M pageviews per month with 10 front-end servers. At peak it does close to 5,000 RPS; Cookpad does 15,000 RPS with 300 Rails servers. But the SO servers are at least twice as powerful, so that works out to roughly 500 RPS/server vs. 100 RPS/server: a 5x difference.


SO probably has a lot more cacheable content, though, and their core content size is small: < 2 TB full DB size, with DB memory at 768 GB or so. Not that it doesn't require good engineering. I do love their stats.


You're making the assumption that there is a database in the mix for the 100X case. There wasn't, except in-memory. It wasn't a 100X improvement across the board, it was 10X to 100X.


Did Twitter ever upgrade past Ruby 1.8? If not, these numbers won't be particularly relevant to modern Ruby. (For anyone else reading: 1.8 was a simple interpreter and 1.9+ a modern high-performance bytecode VM.)


Only if you look at these ridiculous benchmarks serving "hello world" and JSON encoding/decoding a basic dataset. The real limit to scale is I/O: CPU I/O, storage I/O, network I/O. The JVM doesn't give you an edge on it. Once your app starts doing useful non-localized work, the advantage of the JVM is 2-3x at best.


No, I'm looking at Twitter's numbers in production. Before and after very much tells the tale -- and it was mostly the same engineers, so it wasn't "oh well you had the B team doing Rails and the A team doing JVM code" or whatever other excuse you want to look for. Microbenchmarks aren't the end-all-be-all, but you see the same story there.


Rails' strength isn't the speed of its code but its speed of development. The JVM-based service equivalent you are comparing it with probably needs 5x the amount of engineering time to build what you get for free with Rails (which also needs to be included in the engineering costs).


Rails' speed is prototyping speed. When you have a sufficiently large system in Rails, your development slows down. This is not just true of Rails, but also of languages like PHP, Python, and JavaScript.


I doubt that any language gives you 5x the productivity of another language. I don't think I would be 5x slower writing back-end web code in C than I would in C#, and C is probably the worst language to do an API in.


It's not about the language, it's about the batteries-included mindset of rails.

You have your nice web framework in language X, and then something else you need quickly is already built into Rails, or it's a very powerful feature for which in Rails you could just install a gem and call it a day.

I'm working with Express/Node now, I've worked with Symfony/Laravel in php, I've worked with Django in Python, I like them all but there's nothing which can truly replace the speed of coding with Rails.


Rails, and the way it used MySQL/Postgres, was designed for building CMSs. Very amenable to CDN caching to scale.

Twitter was a realtime messaging platform (fundamentally not edge-cacheable) that evolved from that CMS foundation. So the reason for the difficult evolution should be clear.

It's not really a coincidence that before Twitter I worked at CNET on Urbanbaby.com which was also a realtime, threaded, short message web chat implemented in Rails.

Anyway the point is: use my new project/company https://fauna.com/ to scale your operational data. :-)


I take it from a quick glance: you either own/work for Fauna and it’s a completely proprietary product? If so .. I’ll pass, I like building things atop stuff I can control. :)


Yes, it's not open source (for now). Managed cloud and on-premises edition. We will have a free download soon.


Might I suggest disclosing it when you post about it? Clicking through your profile revealed it, but it seems like it could read better for you in the future. :)

EDIT: either it was edited above, or I’m blind. I think it was the former?


Yeah I fixed it. Thanks.


Just wondering, how is the engineering stack at Twitter now? Is it still bandaging things together like everyone is talking about here (in the Rails days), or have they matured to good engineering practices? Interested, as I was considering applying for an engineering job there.


The fail whale image was retired over 4 years ago now because it almost never showed up anymore. That should tell you enough.


It's fine. Very Scala/Java/Mesos heavy. Big company integration problems now instead of startup scaling problems.

They still use some of the distributed storage we built almost a decade ago though.


If you check their Github account it looks like they built a lot of interesting Scala tech.


There was a fun High Scalability article on their 'fan-out' approaches to disseminating tweets by popular users, etc.: http://highscalability.com/blog/2013/7/8/the-architecture-tw... .

When I was working on something with similar technical requirements I also came across this paper (http://jeffterrace.com/docs/feeding-frenzy-sigmod10-web.pdf) that outlined the approach in a more 'formal' manner.


Ah, I read that Twitter thread a few days (weeks?) ago and it was much longer. As far as I remember, it started with someone asking Twitter ops people, former and current, to share some stories about things that went spectacularly wrong.

It contained a lot of Twitter ops battle stories, some very interesting. I was pretty impressed to read Twitter internals in the given level of detail, but now it seems that the thread that held them all together is protected (probably didn't expect it to be so popular, or just wanted to continue more privately).


And yet I bet nobody mentioned the "fire rain" in our first data center, the load-bearing Mac Mini, or the surprise Snoop Dogg visit/party! =)


Hah, I don't think anybody has, but I'd love to hear those stories!


Aww. If only there was an archive of it somewhere... (I'm curious if there is)


In the early days of Twitter the "fail whale" was so common it got assimilated into culture as a term to use for anytime any site gets overloaded. Nowadays it seems like that term is "hugged to death"

https://www.theatlantic.com/technology/archive/2015/01/the-s...


Everybody knows about the "fail whale".. nobody knows about the "moan cone".. The image is lost to history but it was captured by this account ages ago: https://twitter.com/moan_cone

It was thrown when the system was failing at the Rails layer rather than the Apache layer. I believe that Ryan King and I were the last people to ever see the moan cone in production. =)


Do you remember the fixit kitten?


That got nuked right about the time that I started! I never saw it live.


For us old timers the term is “slashdotted.”

https://en.m.wikipedia.org/wiki/Slashdot_effect


As slashdot's active user count dwindles and as it has gotten a bit easier to build a site that can handle decent traffic, I wonder if there was (or will be) a specific last site to get literally slashdotted.


I doubt you get many/any sites slashdotted from the actual slashdot these days. But you do see sites slashdotted from HN articles every day.

And you will going forward as well. People will not create horizontally autoscalable, cacheable sites, as that costs money and effort for the 99.9999% of sites that will never need it.

True, we are at an age where it is easier, better understood, and less and less costly to do, but it's still not free in money and time.


Missed that this went to the front page! I will answer questions if I can.

I am now CEO @ https://fauna.com/, making whales and fails a thing of the past for everybody. We are hiring, if you want to work on distributed databases.


Long time no see man! =)


You too! Hope Zookeeper isn't still keeping you up.


Funny thing is that the place I am at now uses Zookeeper.. luckily it has not been trouble for us though. =)


It's saving it up for you...remember deleting thousands of Nagios notifications on the bus? Those were the days.


I remember all those iPhone suckers having to download special apps to remove conversations in the early days since it only allowed you to delete them one at a time. =)

I also remember going on a bender to make all the fake alerting go away after I started, only to find that like 50% of the configured alerts could never succeed and had been paging every 10 minutes for over a year. We got that stuff sorted out pretty quickly. Once we started getting good visibility, it helped us make much better decisions about where the problems actually were.

The first six months felt like everything was just on fire constantly and nobody knew what was going on, but then things started falling into place and it felt like everything was still on fire, but some people actually had fire extinguishers for a change. =)


The best monitoring system was the TV in the Ops area scrolling a constantly updating search for #failwhale.


Fun fact: Twitter's feed is still kinda broken. If you visit the site after being gone for a week or so, it tells you your timeline is empty. It recovers after a few minutes, but it's still a pretty poor user experience.


I see a fair amount of stuff that's badly broken.

- Analytics.twitter.com numbers changing unexpectedly, being zeroed out temporarily, inconsistent numbers

- Deleting tweets and having them not show up on my timeline but still show up in search weeks later.

- The repeating timeline bug, you're going down the timeline and after a while it loses its place and repeats tweets you saw a couple of minutes ago

- People who have blocked me (gah) but I still see their tweets in certain apps.

tbh I see some strange stuff on Facebook and Reddit sometimes and granted it's hard to do what they do at scale. But Facebook and Google products seem to evolve a lot more over the years.


I always get a message saying the request took too long to execute, if I click a link leading to Twitter. Have to reload the page a couple of times to make it work.

Usually don't click on Twitter links for this reason.


Yes, I get "rate limited" on a single request on mobile web quite often. It's strange.


If I go to Twitter on my mobile phone (using Brave, not their interface), I consistently get an error message preventing me from seeing anything. It says I apparently hit Twitter too many times (when I try to load a single tweet).


Twitter prioritizes availability over Consistency.


Say what you will, Twitter managed to make a $25.23B company out of printing

   char whatever[140];


Should be 141 if you want to print out a string that would be 140 characters; NULL termination is a thing.


What frameworks would you use to handle such a steep growth curve? Most startups I know of start off with Rails or the like - and obviously they couldn't handle the strain. So what would you use?


The web part of it is pretty easy to scale: you simply add more web servers. The problem is in the storage/DB layer. I imagine the feed was the primary challenge in scaling Twitter.

Other components such as search were probably also quite tricky. One thing I've never figured out is how Facebook handles search on your timeline. That's a seriously complex problem.

Linkedin Recently published a bunch of papers about their feed tech:

https://engineering.linkedin.com/blog/2016/03/followfeed--li...

And Stream's stackshare is also interesting: https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...


>One thing i've never figured out is how Facebook handles search on your timeline. That a seriously complex problem.

From a user's point of view, they don't. I search for things a lot (or at least, I used to before I realized how bad it is) and things that I know I saw yesterday don't come up - even if I put in a very specific search. Sometimes it'll say "no results found", but I'll find it in a tab I hadn't closed yet.


What seems difficult about Facebook search?


The JVM. I know this is the boring answer but the tooling and performance combination is tough to beat. It's why Twitter themselves switched to the JVM (a mix of Scala and Java).


Very few sites have had and will have that kind of growth curve. Moreover, the cloud and the workloads you can run on it are substantially different now than they were 10+ years ago. So what would I use? Probably the most productive framework I could find, not necessarily the most performant one, because odds are that my app would never attain the scale I'm preparing for, but that preparation overhead may be the handicap that prevents that level of success in the first place.


The one my team knows and / or likes best. When you hit really big scale, architecture is more of a differentiator than language, but you can always move inefficiencies into their own services written in a more performant technology. The thing you need most when starting is the ability to produce, test, and iterate quickly.


I would use Rails, but I wouldn't do things like try to write an asynchronous message queue in Ruby (like Twitter did).

I mean sites like Shopify, Hulu, YellowPages.com, etc. "handle the strain" just fine.


How much of the content on those sites changes as fast as Twitter's does? Mainly static content can be displayed using just about any framework/language, more so when you include caching. I just don’t think Rails was a good choice for the workload Twitter needed to support (not that they knew it at the time).


No framework will save you; it's not about language or framework at that point. They won't necessarily help you, but they can sure hurt you. It's more about your architecture.


obviously they couldn't handle the strain

They could handle the strain, within what their audience considered acceptable. That's the important thing. The key to scaling a startup is to always be just ahead of where your audience gives up. Anything less and your precious users will abandon ship and you'll fail; anything more you're burning too much cash, you'll run out of runway and you'll fail.


WhatsApp was a pretty well-known case for Erlang [1]; given that, I'd think Phoenix would be a great fit.

[1] https://www.wired.com/2015/09/whatsapp-serves-900-million-us...


Erlang wouldn't have helped Twitter. WhatsApp's architecture is very simple: once they've sent you a message and know it has reached your device, they no longer care about it. Twitter has to keep all the tweets, likes, and retweets, and show them to all followers in real time. That's 100x more demanding than WhatsApp's simple deliver-and-forget.


Phoenix will be much better than Rails out of the box for things that need a certain level of concurrency. However, Twitter's problems were about scaling dependent writes; this is a data-store/architecture problem that really has nothing to do with the chosen language.


It's a much bigger mistake to overengineer for the highly unlikely event of such a steep growth curve than it is to get to that part of the curve and flail about semi-ineffectually for a couple of years like Twitter did at the time.


The first question should be: do I need a framework at all? There's almost always overhead; you have to ask if it's worth it.


You're getting downvoted, but this is one of the arguments for going with unikernels.


This made me laugh and feel better. Being in a startup is tough!


It would be fascinating if they returned to this capability. One celebrity gets the podium at a time.


> Jason Goldman, who served as Vice President of Product at Twitter between 2007 and 2010, responded to Weaver’s tweets with the observation that early Twitter was “held together by sheer force of will.”

I would dispute that; I don't think they can take that much credit. Regardless of their "sheer force of will", the site was down very, very frequently.


Much of that was just bad design decisions in the early days creating momentum that was unstoppable and unreplaceable at the speed we were growing.

In the early days they implemented a Ketama (consistent hashing) system for memcache. It worked great so long as nothing failed, since a node coming in or going out could lead to bad results. In the old days they would just flush caches when that happened. Later, when memcache was the only way we could serve the loads we had, it became imperative that memcache never be restarted. A single memcache restarting would overload the MySQL backend so badly that the site would be down for hours. Adding memcaches had more or less the same issue, though it was a little easier to prewarm things. We wrote a kernel module that allowed us to change the ulimits of a running process so we could increase the file descriptor limits for memcache without restarting it. Replacing it completely was damn near impossible given the growth rate and the inability to get the data out of MySQL quickly enough.
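
For readers who haven't run into it, here's a minimal sketch of Ketama-style consistent hashing (illustrative, not the actual implementation): each server gets many points on a hash ring, and a key belongs to the first server point at or after the key's hash. Adding or removing a node only remaps the slice between neighboring points, but as described above, even that slice of cold cache was enough to crush the MySQL layer behind it.

    import bisect
    import hashlib

    class KetamaRing:
        def __init__(self, servers, points_per_server=160):
            # Each server contributes many virtual points on the ring.
            self._ring = sorted(
                (self._hash(f"{server}-{i}"), server)
                for server in servers
                for i in range(points_per_server)
            )
            self._hashes = [h for h, _ in self._ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

        def server_for(self, key):
            # First ring point clockwise from the key's hash (wrapping around).
            idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
            return self._ring[idx][1]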

We reliability guys worked 16-hour days, 7 days a week, for years trying to keep things working well enough to not fail completely. Some of the crazy hacks we did just to survive were fantastic and impressive, and I am still proud of and disgusted by them to this day. =)


Indeed. "It was amazing that it worked, and terrible that we had to do it."


I believe it was James May that said: "You have come up with a clever solution to a problem you shouldn't have had in the first place!" =)



