Hacker News new | past | comments | ask | show | jobs | submit login

If you want to read more about activity feeds there are a ton of papers listed here: https://github.com/tschellenbach/stream-framework I've been working on this stuff for years. Recently I've also enjoyed reading Linkedin's posts about their feed tech. Three are a few different posts but here's one of them: https://engineering.linkedin.com/blog/2016/03/followfeed--li...

Scaling a social network is just inherently a very hard problem. Especially if you have a large userbase with a few very popular users. Stackshare recently did a nice blogpost about how we at Stream solve this for 300 million users with Go, RocksDB and Raft: https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...

I think the most important part is using a combination of push and pull. So you keep the most popular users in memory and for the other users you use the traditional fanout on-write approach. The other thing that helped us scale was using Go+RocksDB. The throughput is just so much higher compared to traditional databases like Cassandra.

It's also interesting to note how other companies solved it. Instagram used a fanout on write approach with Redis, later on Cassandra and eventually a flavor of Cassandra based on RocksDB. They managed to use a full fanout approach using a combination of great optimization, a relatively lower posting volume (compared to Twitter at least) and a ton of VC money.

Friendster and Hyves are two stories of companies that didn't really manage to solve this and went out of business. (there were probably other factors as well, but still.) I also heard one investor mention how Tumblr struggled with technical debt related to their feed. A more recent example is Vero that basically collapsed under scaling issues.




I used to work at Hyves. Hyves overcame it's scalability issues, but went out of business for other reasons. Hyves used MySQL and Memcache similar to facebook at that time.

By the way, RocksDB which is now Facebook's main database (afaik) is written on top of LevelDB. So both Google and Facebook run on software written by Jeff Dean...


Its a leveldb fork with substantial changes. For example, a non binary search file format option: https://github.com/facebook/rocksdb/wiki/PlainTable-Format

Pull down the code some time. Everything and the kitchen sink is in there somewhere. It's a crazy project.


> I also heard one investor mention how Tumblr struggled with technical debt related to their feed

Not sure I'd agree with that, but I suppose it depends on the context and timing of the statement.

Tumblr's solution for reverse-chrono activity feed is, at its core, <1000 lines of PHP and a few extremely heavily optimized sharded MySQL tables. It is creaky and old, but its relatively small code footprint means it isn't terrible on the tech debt scale.

Tumblr's feed is computed entirely at read-time; there's no write fan-out / no materialized inboxes. The key insights that make the system fast (under 10ms latency to identify which posts go in the feed) even at scale:

* If you are grabbing the first page of a feed (most recent N posts from followed users), the worst-case is N posts all by different authors. If you have a lookup table of (user, most recent post time) you can quickly find the N followed users who posted most recently, and only examine their content, rather than looking at content from all followed users.

* For subsequent pages, use a timestamp (or monotonically increasing post id) as a cursor, i.e. the id or time of the last post on the previous page. Then to figure out which followed users to examine for the new page of results, you only need to look at followed users who posted since that cursor timestamp (since they may have older posts on the current page) plus N more users (to satisfy the worst-case of all N posts on the current page being by different authors that were not on previous pages).

* InnoDB's clustered index PK means it is very very fast at PK range scans. This is true even for disjointed sets of range scans, e.g. with PK of (a,b) you can do a query like "WHERE a IN (...long list of IDs...) AND b >= x AND b <= y" and it is still extremely fast.

* You can optimize out the access pattern edge cases and make them fuzzier. Most non-bot users only ever scroll so far; even power users that follow a lot tend to check the feed very often, so they don't go deep each time. On the extreme edge, users who follow thousands don't comprehensively try to view every piece of content in their feed anyway. This means you can measure user activity to determine where to place limits in the system that help query performance but are completely unnoticeable by users.


I've found the Yahoo paper from 2010 to be the best distillation of the general problem of feed following and how to solve it (blending push/pull as you say):http://jeffterrace.com/docs/feeding-frenzy-sigmod10-web.pdf


Social network scaling is not a technology problem.

Technology has imposed scaling on societies and patted it self on the head for the unintended benefits derived, while burying its head in the sand on the unintended consequences.

The roman empire, Genghis Khan, the East India Company, Napoleon, Hitler etc etc etc achieved scale and then what happened?




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: