The Implications of Facebook Indexing a Trillion Posts

lumberjack · on Dec 29, 2014

I wonder what Stallman might be thinking. He has been against Facebook since its inception, and in those days the main concern was not the state but private entities getting too much power over consumers/users. I still think that this should be the main worry for the average Joe.

https://stallman.org/facebook.html

adventured · on Dec 29, 2014

Serious question - what's the fear in terms of what Facebook can or will do with that? That is, what can they do that would qualify what they have as power over someone?

zxcdw · on Dec 29, 2014

Depends entirely on what they (and people related to them) have shared with Facebook directly and indirectly, and what Facebook actually can infer from the information.

Apart from that I have no answer. However, I find it important to realize the question shouldn't be about what Facebook/whateverelse can "do" now, but what will "they" be able to "do" in the future.

Imagine if there was some Superior Entity which knew everything about every single person born after some year X. What could that Superior Entity do? What could it entail if the knowledge of this Superior Entity got leaked to some hands? I mean, everything about politicians, press, authorities, laymen, teachers, children... Everything.

What are the actual practical worst-case implications of total ubiquitous surveillance for an individual? For society? This certainly depends on the society. Implications are certainly different for people living in, say, Iran, Sweden, USA, Canada, Brazil, Russia, China, Germany, North Korea ... For example, in some countries religious matters, although very personal, can be very serious. Try being an open atheist in, say, Indonesia.

I really have no clue, but everything seems like a huge mess with surveillance.

hnnewguy · on Dec 29, 2014

>However, I find it important to realize the question shouldn't be about what Facebook/whateverelse can "do" now, but what will "they" be able to "do" in the future.

Exactly. It's not about Facebook or the authorities mining data for criminal activity. The problem occurs once the authorities have you in their sights, for whatever reason. With reams of private data available and no context, they can (and will) use it to paint any picture of you they need to.

jrochkind1 · on Dec 29, 2014

> Most obviously, the News Feed could learn to mimic our external dialogue, showing us posts with similar content to what we spread. Never talk about sports or babies? Facebook could eventually filter those out of your feed. Just shared your thoughts on Syria, celebrity gossip, or the police state? The algorithm could pull an audible and show you more about related news.

I was pretty sure Facebook was already doing this, even before the public search. No?

pc86 · on Dec 29, 2014

Anecdotal of course, but I have a Facebook friend from political circles, and I see posts from him every day about the Dallas Cowboys. I ignore those and typically like or comment on political posts, but I still see Cowboys all the time.

TeMPOraL · on Dec 29, 2014

I thought so too, though the only case I clearly observed is that fanpages you liked tend to show more often in your feed the more you interact with them. It feels like it follows some kind of exponential curve wrt. how often you like posts.

> Never talk about (...) babies? Facebook could eventually filter those out of your feed.

That would be a very welcome change.

drewcrawford · on Dec 29, 2014

I'm pretty interested in how this is implemented, actually. Naively you could search a global index and then filter results down to your friends, but that filter seems impossibly slow. Alternatively you could maintain a separate index per person, but querying 500 indexes seems unreasonable.

Maybe they partition the graph into cliques which they index, search and then fix up with a second pass to add and remove noncliqued friends?
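A minimal sketch of the naive "search a global index, then filter down to friends" approach described above, to show where the cost comes from. The class name, the whitespace tokenizer, and the sample posts are all made up for illustration; nothing here is claimed to be how Facebook does it.

```python
from collections import defaultdict


class GlobalIndex:
    """One inverted index over every post: term -> [(post_id, author), ...]."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, post_id, author, text):
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in set(text.lower().split()):
            self.postings[term].append((post_id, author))

    def search(self, term, friends):
        # Scan the full posting list, then keep only posts written by friends.
        # The scan cost grows with how common the term is globally,
        # not with the size of the searcher's friend list.
        hits = self.postings.get(term.lower(), [])
        return [post_id for post_id, author in hits if author in friends]


index = GlobalIndex()
index.add(1, "alice", "Cowboys won again")
index.add(2, "bob", "cowboys lost")
index.add(3, "carol", "thoughts on Syria")
print(index.search("cowboys", {"alice"}))
```

The filter itself is a cheap set lookup per hit; the worry in the comment is the number of hits that must be scanned and thrown away for a common term.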

ssclafani · on Dec 29, 2014

There're two posts on Facebook's engineering blog that discuss this (they're from 2013 when they first started working on the feature):

https://www.facebook.com/notes/facebook-engineering/under-th...

yincrash · on Dec 29, 2014

You could build an index per person, but rather than search 500 indexes, the index of each person would be of all their and their friends' posts, so you would only search one index.
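A rough sketch of that idea (one index per reader, holding their own and their friends' posts), assuming a small in-memory friend graph. `PerReaderIndexes` and `friends_of` are hypothetical names used only for this example.

```python
from collections import defaultdict


class PerReaderIndexes:
    """One index per reader, containing their own and their friends' posts."""

    def __init__(self, friends_of):
        # friends_of: author -> set of friends (hypothetical in-memory graph).
        self.friends_of = friends_of
        # reader -> term -> list of post ids
        self.indexes = defaultdict(lambda: defaultdict(list))

    def publish(self, post_id, author, text):
        terms = set(text.lower().split())
        # Fan out on write: copy the post into the author's own index
        # and into the index of every friend.
        for reader in {author} | self.friends_of.get(author, set()):
            for term in terms:
                self.indexes[reader][term].append(post_id)

    def search(self, reader, term):
        # A single index lookup, however many friends the reader has.
        return list(self.indexes[reader][term.lower()])


graph = {"alice": {"bob"}, "bob": {"alice"}, "carol": set()}
idx = PerReaderIndexes(graph)
idx.publish(1, "alice", "hello world")
idx.publish(2, "carol", "hello there")
print(idx.search("bob", "hello"))
```

The read touches one index; the price is that each post is written once per friend.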

daigoba66 · on Dec 29, 2014

That's how Twitter does it. A new tweet is not immediately visible in all "subscribed" timelines. But eventually it is.

arethuza · on Dec 29, 2014

Doesn't that make updating rather expensive - for each new post it needs to go into a lot of different search indexes?

Off-topic: Wasn't there a start-up a while back that was allowing people to build their own personal search index for their social media content?

yincrash · on Dec 29, 2014

Yes. Writes can be delayed and queued though. Reads need to be fast.
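To illustrate "delayed and queued", here is a small sketch where publishing only enqueues the post and a separate drain step does the expensive fan-out. The `QueuedFanout` class and its `drain` method are invented for this example; a real system would presumably use a durable queue and background workers.

```python
from collections import defaultdict, deque


class QueuedFanout:
    def __init__(self, friends_of):
        self.friends_of = friends_of  # author -> set of friends (assumed graph)
        self.pending = deque()        # posts accepted but not yet indexed
        self.indexes = defaultdict(lambda: defaultdict(list))

    def publish(self, post_id, author, text):
        # Cheap for the writer: just enqueue, no index is touched yet.
        self.pending.append((post_id, author, text))

    def drain(self, max_items=None):
        # Stand-in for a background worker: apply queued writes to each index.
        done = 0
        while self.pending and (max_items is None or done < max_items):
            post_id, author, text = self.pending.popleft()
            for reader in {author} | self.friends_of.get(author, set()):
                for term in set(text.lower().split()):
                    self.indexes[reader][term].append(post_id)
            done += 1
        return done

    def search(self, reader, term):
        # Reads stay a single lookup; they just may not see the newest posts.
        return list(self.indexes[reader][term.lower()])


fanout = QueuedFanout({"alice": {"bob"}})
fanout.publish(1, "alice", "hello world")
before = fanout.search("bob", "hello")   # post is queued, not yet visible
fanout.drain()
after = fanout.search("bob", "hello")    # visible once the queue is drained
```

This is the "not immediately visible, but eventually it is" behaviour mentioned above for Twitter timelines.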

ihsw · on Dec 29, 2014

Sounds about right, but the benefit is that reading through every single person's personal index would be very fast. This can be useful in many ways, and it alone was probably the goal.

In fact it was probably already developed long before this feature was deployed to the general populace, and this new feature is just a token gesture of "giving back to users."

I doubt it'll see much use from end-users themselves.

Yes the search terms from users will be useful, but I don't see this as being a replacement for Google.

eastbayjake · on Dec 29, 2014

Doesn't having a time-series index -- where new entries are only being added to the end of the index, never updating the middle of the index -- make it a little less expensive?
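One way to read the question is as an append-only posting list kept in time order. The sketch below assumes posts arrive in non-decreasing timestamp order (a simplification) and uses made-up names; it shows that adds are plain appends and that a "since" query is a binary search plus a slice.

```python
import bisect
from collections import defaultdict


class AppendOnlyIndex:
    """Posting lists ordered by post time; new entries only go on the end."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> [(timestamp, post_id), ...]

    def add(self, timestamp, post_id, text):
        for term in set(text.lower().split()):
            plist = self.postings[term]
            # Assumption: writes arrive in time order, so nothing is ever
            # inserted into the middle of a posting list.
            if plist and plist[-1][0] > timestamp:
                raise ValueError("out-of-order insert")
            plist.append((timestamp, post_id))

    def search_since(self, term, since):
        plist = self.postings.get(term.lower(), [])
        # (since,) sorts before any (since, post_id), so this finds the
        # first entry whose timestamp is >= since.
        start = bisect.bisect_left(plist, (since,))
        return [post_id for _, post_id in plist[start:]]


ts_index = AppendOnlyIndex()
ts_index.add(10, 1, "a b")
ts_index.add(20, 2, "b c")
ts_index.add(30, 3, "b")
print(ts_index.search_since("b", 20))
```

It makes each individual write cheaper, but it does not reduce how many indexes a post has to be written to.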

Todd · on Dec 29, 2014

You may be thinking of Greplin. I hadn't followed their story but it looks like they rebranded to Cue and were acquired by Apple.

mcintyre1994 · on Dec 29, 2014

I'm not sure how similar this will be but Twitter have a bit of information on how they indexed "roughly half a trillion documents" when they moved to indexing all tweets here: https://blog.twitter.com/2014/building-a-complete-tweet-inde...

Edit: I don't think Twitter do nearly the same filtering with this and their social graph so maybe not as close as I first thought.

EGreg · on Dec 29, 2014

It is pretty straightforward. You could denormalize as much as you want upfront, or you could parallelize. Facebook does some of both.

Indexing is an exercise in denormalization. At the time I post X, the data is also stored in an index keyed by, e.g., every phrase I wrote, maybe even stemmed or expanded with synonyms. This makes search fast. It is a memory-time tradeoff which basically precomputes the answer for every query, limited to my feed.

As for searching 500 indexes in parallel - this is hardly unreasonable. The alternative -- updating N indexes for every post where N is the number of friends -- is more unreasonable, since it does the maximum amount of work, when the vast majority of it is wasted. No friend is going to search for every single word. On the contrary, some friends will eventually search for a large number of words in YOUR index.

Facebook can easily do parallel queries and combine them. Even a site with a simple partitioned/sharded MySQL setup can do it; we do it at Qbix. Facebook, which has improved on MapReduce in its architecture to be more dynamic (http://thenextweb.com/facebook/2012/11/08/facebook-engineeri...), certainly can. And then there's this: http://www.semantikoz.com/blog/hadoop-2-0-beyond-mapreduce-w...

So yeah. That's how they probably do it.
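As a toy illustration of the "query many indexes in parallel and combine" option, the sketch below keeps one small index per author and scatters a search across a reader's friends with a thread pool. Threads here stand in for network calls to separate shards; `AuthorIndex` and `search_friends` are hypothetical names, and ordering by post id assumes ids increase over time.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class AuthorIndex:
    """Index of a single author's posts: term -> list of post ids."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, post_id, text):
        for term in set(text.lower().split()):
            self.postings[term].append(post_id)

    def search(self, term):
        return list(self.postings.get(term.lower(), []))


def search_friends(indexes, friends, term, max_workers=8):
    # Scatter: one query per friend's index, run concurrently.
    targets = [indexes[f] for f in friends if f in indexes]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(lambda ix: ix.search(term), targets)
        # Gather: merge partial results, newest first
        # (assumes post ids increase with time).
        return sorted((pid for part in partials for pid in part), reverse=True)


indexes = {"alice": AuthorIndex(), "bob": AuthorIndex()}
indexes["alice"].add(1, "Cowboys won")
indexes["bob"].add(2, "politics today")
indexes["bob"].add(3, "cowboys lost")
print(search_friends(indexes, ["alice", "bob", "nobody"], "cowboys"))
```

Each post is written exactly once, to its author's index; the work moves to read time, where it scales with the number of friends queried.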

AznHisoka · on Dec 29, 2014

Wouldn't this be a lot more useful if we could search ALL public posts by any person or business (FB pages)? Sort of like Twitter search. There'd be no privacy issue because you'd only be searching the public posts.

getdavidhiggins · on Dec 29, 2014

The "Deactivate your account" button is still the only privacy setting you need to click. The implications are only for Facebook: now they have more rope to play with, and something else to maintain. This does nothing for users, who expressly waive their right to privacy from the outset in setting up an account.

jambo · on Dec 29, 2014

Deactivating is designed to leave your account intact in case you change your mind. In fact, some people use this as a super-logout, deactivating their account when not using it. If you want to truly delete your account, the best way to express that intention is the harder-to-find button to delete your account.

getdavidhiggins · on Dec 29, 2014

Doesn't an inactive account expire after a certain time, and there is a 'grace period' to log back in before that time and keep the account?

On a side note: accounts are persistent through time, and it's not hard to match multiple old deleted copies (re-signups) and create a 'super-account'.

nine_k · on Dec 29, 2014

I find it more useful to keep your account activated and reasonably empty. This prevents anyone from grabbing your name and impersonating you, and allows you to use Facebook login at various sites without sharing too much.

vinceguidry · on Dec 29, 2014

Depends on what your name is. The John Smiths of the world don't have anything to worry about on that front. I know of at least two other Vincent Guidrys. One's my father; the real-world consequences of that naming collision far dwarf anything Zuckerberg introduced into my life.

Dirlewanger · on Dec 29, 2014

So there's this for end users. Are there any business-specific tools for marketers that Facebook has available or is planning to make now?

pearjuice · on Dec 29, 2014

And all of this is brought to you by the friendly folks of PHP. Next time when you bash PHP because the argument order for functions is inconsistent, think about how some people overlook semantics and build great things with the tools they have at their disposal.

algorithmsRcool · on Dec 29, 2014

I would be willing to wager the shoes on my feet that this indexing system is not written in PHP.

echoless · on Dec 29, 2014

Yeah, it's all brought to us by the friendly PHP folks who found it to be so dog slow they had to write a compiler to get it running at a decent speed. Further, they found the semantics so flaky, they actually wrote a statically typed variant on top of it, just because they're stuck with PHP.

asocial · on Dec 29, 2014

Not all of it. Some of it, maybe. And given that they've forked PHP maybe even less than some.

PHP sometimes doesn't get the credit it deserves for what can be accomplished with it, particularly modern PHP... but I'm not entirely sure it deserves credit for this in any reasonable way.

ZenoArrow · on Dec 29, 2014

How are you so sure the search index functionality is written in PHP?