
I'm pretty interested in how this is implemented, actually. Naively you could search a global index and then filter results down to your friends, but that filter seems impossibly slow. Alternatively you could maintain a separate index per person, but querying 500 indexes seems unreasonable.

Maybe they partition the graph into cliques, index and search those, and then fix up the results with a second pass that adds and removes non-cliqued friends?
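
For concreteness, here's a toy Python sketch of the two naive options above (made-up in-memory structures, obviously nothing like what Facebook actually runs):

    # Toy sketch of "global index + filter" vs "one small index per person".
    from collections import defaultdict

    global_index = defaultdict(set)                       # term -> {(author, post_id)}
    per_person = defaultdict(lambda: defaultdict(set))    # author -> term -> {post_id}

    def index_post(author, post_id, text):
        for term in set(text.lower().split()):
            global_index[term].add((author, post_id))
            per_person[author][term].add(post_id)

    def search_global_then_filter(term, friends):
        # One huge posting list, then filtering almost all of it away.
        return [(a, p) for (a, p) in global_index[term] if a in friends]

    def search_per_friend(term, friends):
        # ~500 small lookups instead, one per friend.
        return [(f, p) for f in friends for p in per_person[f][term]]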




There're two posts on Facebook's engineering blog that discuss this (they're from 2013 when they first started working on the feature):

https://www.facebook.com/notes/facebook-engineering/under-th...

https://www.facebook.com/notes/facebook-engineering/under-th...


You could build an index per person, but rather than searching 500 indexes, each person's index would contain all of their own and their friends' posts, so you would only need to search one index.
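
Something like this, as a rough sketch (invented structures, just to show the shape of the write and read paths):

    # Fan-out on write: each person's index also holds their friends' posts,
    # so the read path is a single index lookup.
    from collections import defaultdict

    friends_of = defaultdict(set)                        # user -> set of friends
    index_of = defaultdict(lambda: defaultdict(set))     # user -> term -> {(author, post_id)}

    def publish(author, post_id, text):
        terms = set(text.lower().split())
        for reader in {author} | friends_of[author]:     # write into every friend's index
            for term in terms:
                index_of[reader][term].add((author, post_id))

    def search(user, term):
        return index_of[user][term]                      # read touches one index only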


That's how Twitter does it. A new tweet is not immediately visible in all "subscribed" timelines. But eventually it is.


Doesn't that make updating rather expensive - for each new post it needs to go into a lot of different search indexes?

Off-topic: Wasn't there a start-up a while back that was allowing people to build their own personal search index for their social media content?


Yes. Writes can be delayed and queued though. Reads need to be fast.
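
Roughly: the fan-out doesn't have to happen inline with the post; push it onto a queue and let a worker catch up. A minimal sketch, with the stdlib queue standing in for whatever distributed queue they'd really use:

    import queue
    import threading
    from collections import defaultdict

    friends_of = defaultdict(set)                        # user -> friends (assumed loaded elsewhere)
    index_of = defaultdict(lambda: defaultdict(set))     # user -> term -> {(author, post_id)}
    write_queue = queue.Queue()

    def publish(author, post_id, text):
        write_queue.put((author, post_id, text))         # fast: the poster doesn't wait

    def fanout_worker():
        while True:
            author, post_id, text = write_queue.get()    # slow path runs in the background
            for reader in {author} | friends_of[author]:
                for term in set(text.lower().split()):
                    index_of[reader][term].add((author, post_id))
            write_queue.task_done()

    threading.Thread(target=fanout_worker, daemon=True).start()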


Sounds about right, but the benefit is that reads from any single person's personal index would be very fast. That can be useful in many ways, and it alone was probably the goal.

In fact it was probably already developed long before this feature was deployed to the general populace, and this new feature is just a token gesture of "giving back to users."

I doubt it'll see much use from end-users themselves.

Yes, the search terms from users will be useful, but I don't see this being a replacement for Google.


Doesn't having a time-series index -- where new entries are only ever appended to the end, never inserted into the middle -- make it a little less expensive?
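
As a sketch of what I mean: if each posting list is time-ordered and only ever appended to, an update is a cheap append and the freshest results sit at the tail (toy in-memory version, assumed structures only):

    from collections import defaultdict

    postings = defaultdict(list)   # (user, term) -> [(timestamp, post_id), ...] in time order

    def append_post(user, post_id, timestamp, text):
        for term in set(text.lower().split()):
            postings[(user, term)].append((timestamp, post_id))   # O(1) append, no rewrites in the middle

    def recent_matches(user, term, limit=10):
        return list(reversed(postings[(user, term)][-limit:]))    # newest entries are at the end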


You may be thinking of Greplin. I hadn't followed their story but it looks like they rebranded to Cue and were acquired by Apple.


I'm not sure how similar this will be, but Twitter has a bit of information on how they indexed "roughly half a trillion documents" when they moved to indexing all tweets here: https://blog.twitter.com/2014/building-a-complete-tweet-inde...

Edit: I don't think Twitter does nearly the same filtering against their social graph here, so maybe it's not as close as I first thought.


It is pretty straightforward. You could denormalize as much as you want upfront, or you could parallelize. Facebook does some of both.

Indexing is an exercise in denormalization. At the time I post X, the data is also stored in an index keyed by, e.g., every phrase I wrote, perhaps even stemmed or expanded with synonyms. This makes search fast. It's a memory-for-time tradeoff that basically precomputes the answer for every query, limited to my feed.
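
A toy version of that denormalization (the stemming and synonym handling here are made up, just to show where the work happens):

    from collections import defaultdict

    SYNONYMS = {"pic": "photo", "pix": "photo"}       # hypothetical synonym map

    def normalize(term):
        term = term.lower().rstrip("s")               # crude stand-in for real stemming
        return SYNONYMS.get(term, term)

    my_index = defaultdict(set)                       # term -> {post_id}, one such index per feed

    def on_post(post_id, text):
        for term in text.split():
            my_index[normalize(term)].add(post_id)    # answer is precomputed at write time

    def search_my_feed(query):
        hits = [my_index[normalize(t)] for t in query.split()]
        return set.intersection(*hits) if hits else set()   # intersect for AND semantics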

As for searching 500 indexes in parallel: that's hardly unreasonable. The alternative -- updating N indexes for every post, where N is the number of friends -- is more unreasonable, since it does the maximum amount of work up front, and the vast majority of it is wasted: no friend is going to search for every single word. Rather, some friends will eventually search for a large number of words in YOUR index.
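
The read side is then just a scatter-gather, something like this (fetch_index_results is a placeholder for the RPC to whichever shard holds a friend's index):

    from concurrent.futures import ThreadPoolExecutor

    def fetch_index_results(friend_id, query):
        # Placeholder: really an RPC to the shard holding this friend's index.
        return []                                     # list of {"post_id": ..., "timestamp": ...} dicts

    def search_friends(friend_ids, query, max_workers=50):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(fetch_index_results, f, query) for f in friend_ids]
            merged = []
            for fut in futures:
                merged.extend(fut.result())
        # Merge/rank however you like; recency is the simplest.
        return sorted(merged, key=lambda r: r["timestamp"], reverse=True)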

Facebook can easily do parallel queries and combine the results. Even a site with a simple partitioned/sharded MySQL setup can do it; we do it at Qbix. Facebook, moreover, has improved on MapReduce to make its architecture more dynamic (http://thenextweb.com/facebook/2012/11/08/facebook-engineeri...), and then there's this: http://www.semantikoz.com/blog/hadoop-2-0-beyond-mapreduce-w...

So yeah. That's how they probably do it.



