How I got sued by Facebook (petewarden.typepad.com)
336 points by petewarden on April 5, 2010 | 80 comments



"my lawyer advised me that it had never been tested in court, and the legal costs alone of being a test case would bankrupt me."

This is a very real problem with law surrounding emerging business practices, esp here in the US.

Ultimately you can only pioneer whatever you can afford to defend in court. Facebook, or any other BigCo for that matter, can assert that you can't do X and it's up to you to fight it in court... Even if there is prior behavior such as with this case. Clearly Google does the very same job and doesn't have an agreement with Facebook to spider their site.

But if you can't afford to defend it and bring up the prior examples in a court, then Facebook - or anyone else for that matter - can stop you.


They probably have some clause in the terms and conditions about the automated interaction with the page (e.g. point 3.2, but there might be more)... but I hope it would be very hard to enforce. Especially when you can download the public pages without accepting the terms and conditions document or even seeing it.


"Their contention was robots.txt had no legal force and they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. The only legal way to access any web site with a crawler was to obtain prior written permission."

This is ridiculous. Can someone release the data so this can be tested in court? EFF?


For obvious reasons this would never happen, but it would be pretty cool to see Google unindex Facebook entirely. They could state that Facebook has made it clear that Google did not have prior written permission to index their content (Google had simply assumed the rules outlined in robots.txt were sufficient, as they are with every other website).

They would state that Google would be happy to reindex Facebook, but it would take several weeks for their lawyers to meet with Facebook's counsel, draft documents, reindex privately, audit their cache to ensure compliance, etc.

We could then sit back and watch how fast Facebook backpedals.


I had a strangely similar problem to this in college with the network admins and filesystem permissions.

This was back in the day when everybody was using the unix network (pine for email, etc). Many people had webpages on their university accounts and would also store homework online. A very popular thing to do was to go into someone else's home directory and copy their .fvwmrc or .bashrc file (yes this was a few years ago).

The IT people came up with a new policy that said you were not allowed to look into other people's directories without explicit (possibly written) permission from the owner. They said that copying files could be considered "cheating" and possibly copyright infringement and you'd be referred to the university disciplinary court. I assume this all stemmed from fear of cheating.

In any event, I got into a long debate with the IT people about the whole point of unix file permissions. Basically: if you don't want people to look at your stuff, don't let them. As you might guess, when dealing with university IT, I was on the losing end of the argument.


It would seem fair to argue that "rwxrwxrwx" is explicit permission to read your files!
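To make the "rwxrwxrwx" point concrete, here is a minimal Python sketch of how a program can tell whether a file's mode grants read access to everyone. The helper names `is_world_readable` and `path_is_world_readable` are made up for illustration; only the standard `os` and `stat` modules are used. It only inspects permission bits and says nothing about what a site policy allows.

```python
import os
import stat


def is_world_readable(mode: int) -> bool:
    """Return True if the 'other' read bit is set in a file mode."""
    return bool(mode & stat.S_IROTH)


def path_is_world_readable(path: str) -> bool:
    """Check the read-for-others bit of an existing file or directory."""
    return is_world_readable(os.stat(path).st_mode)


if __name__ == "__main__":
    # 0o777 is the numeric form of rwxrwxrwx; 0o700 is rwx------.
    for mode in (0o777, 0o700):
        # stat.filemode expects a full mode, so add the regular-file type bits.
        text = stat.filemode(stat.S_IFREG | mode)
        print(text, "world-readable:", is_world_readable(mode))
```

A mode of 0o700 on a home directory would have kept other students out without any policy being needed.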


robots.txt is good for telling robots they have no business reading a section, or will be severely disappointed if they do. There is no way one can encode the TOS of a site into the syntax of robots.txt, so there is no reason to believe it embodies the TOS.

I've never used facebook, but Section 9.2.6 of their terms says: "You will delete all data you received from Facebook if we disable your application or ask you to do so."

I don't see how that could be encoded in robots.txt.


User-agent: *
Disallow: /directory/people/*
Disallow: /directory/pages/*

Done. Google crawls Facebook. Every social media monitoring company crawls Facebook (and every other social network). Tons of other companies do the same every day. I know this because many of these companies are our customers.

What really irks me is that Pete collected valuable data and whoever he shared it with was probably able to derive added value from that data - in ways that Facebook is not doing. Facebook is prohibiting value creation.
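For anyone who wants to see how a crawler would consult rules like these, here is a rough sketch using Python's standard `urllib.robotparser`. Assumptions: the host `example.com`, the user agent `ExampleBot` and the helper `is_allowed` are hypothetical, and the trailing `*` is dropped from the paths because, as far as I know, the standard-library parser does plain prefix matching rather than wildcard expansion. It shows what the file asks of a robot, not what is legally permitted.

```python
from urllib.robotparser import RobotFileParser

# Rules modelled on the ones quoted in this thread, without the trailing "*"
# (assumption: prefix matching is enough for this illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/people/
Disallow: /directory/pages/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/directory/people/alice", "/about"):
        url = "http://example.com" + path
        print(path, "allowed:", is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

A well-behaved crawler runs a check like this before every request and skips anything that comes back disallowed.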


Google can do it because Facebook doesn't really want to go up against money in making this a test case, and Google has baskets of cash.

Facebook doesn't give a tinker's damn about prohibiting your value creation.


Or because Google would simply comply with their demands and remove all mention of facebook from their indexes... That would be pretty interesting.


Ha! Yeah, I hadn't thought of that aspect. Half of their userbase wouldn't even be able to sign on any more!


rather more than half I think


I guess when they have direct gain, i.e. people accessing facebook profiles through google searches, they are more than happy for this to occur.


Surely ToS is only applicable if you use facebook - eg create an account.

Either facebook wants to allow people to crawl it, or they don't. robots.txt should be binary - yes or no.


That's not exactly true. The TOS usually applies to people crawling the servers and mining data. Still, there is no clear way to know how a court would rule on something like this; each case is different.

See http://en.wikipedia.org/wiki/Browse_wrap


You don't waive your copyright by having a robots.txt, and while I believe most people think Google style indexing and searching is fair use - that doesn't mean anything you do with the data is fair use.


User content on Facebook may not be copyrightable. If I make a list of my personal interests, I haven't necessarily produced a creative work by the standards of US law.

Check out this site (just found it via search): http://www.canyoucopyrightatweet.com/


Correct- lists of facts without styling aren't something you can copyright. The specific form that they are printed in is copyrightable, but there is no IP created by a list of facts. Phone numbers, game scores, colors of rocks, etc... not copyrightable.


The courts have generally only required a minimum of creativity to make something copyrightable. Also sentences are surprisingly easy to make unique- see http://go-to-hellman.blogspot.com/2009/11/uniqueness-of-sent...


I don't see any support for statement #1.

Statement #2 is a false dichotomy.

The Terms for facebook is a terrible document. It is written to be intelligible to humans but is full of ambiguity and undefined terms. Any lawsuit about the theory "robots.txt did not forbid my actions; therefore they are legal" would probably disintegrate into how the Terms language is interpreted.

Their lawyers probably leapt from high windows when it was released.


Surely if true, that means any web crawler must first locate and understand every single website's terms of service before it can be sure what it is allowed to do.

For a start though, you could easily state that facebook freely allows access to data, without requiring you to read terms&conditions.

You could argue that by freely allowing access to all of their data, without requiring you to read and agree to the terms of usage, then their claims have no basis.

If they did want to restrict access, or make sure every crawler had first agreed to T&C, it wouldn't take long for them to add that.


I think this is missing the point. Facebook's attorneys don't care about the technical details. They already know that whatever is technically possible is entirely subordinate to how they can threaten the owners of the technology.

It hasn't been tested in court, and there's a truly excellent chance Facebook would lose - either in terms of the court or in terms of what's left of their privacy reputation - but that doesn't matter one little bit. They have attorneys and money, and that's all that matters in this instance.

Really. The facile assumption that "it's possible to aggregate thus it's OK to aggregate" is exactly the way normal people think (by which I mean, tongue-in-cheek, us), but corporate attorneys see all this in terms of power relationships and contracts. As far as they're concerned, the poster took pictures through their front windows, and they're damn well going to threaten him with kneecapping until he gives them his negatives.


As a long-time (but sadly former) EFF donor, I hope they pick it up too. This case is tailor made for them.


There's nothing ridiculous about that. There's a clearly visible link on the bottom of each Facebook page that links to the Terms & Conditions of accessing Facebook pages. Some relevant parts:

"If you collect information from users, you will: obtain their consent, make it clear you (and not Facebook) are the one collecting their information, and post a privacy policy explaining what information you collect and how you will use it."

"You will only use the data you receive for your application, and will only use it in connection with Facebook."

"By "application" we mean any application or website (including Connect sites) that uses or accesses Platform, as well as anything else that receives data." - note that by this definition a Facebook crawler is an application

"You will have a privacy policy or otherwise make it clear to users what user data you are going to use and how you will use, display, or share that data."

"You will not transfer the data you receive from us (or enable that data to be transferred) without our prior consent."

"You will make it easy for users to remove or disconnect from your application."

etc.

These things are easy to look up; it took me about five minutes to find them. In fact, they have the right to limit the way even public-facing content on their site is used. Things like copyright and data protection laws come to mind. You might not agree with that but it is absolutely in their right to do so.

The FB Terms & Conditions don't mention robots.txt at all.


Doesn't apply. "If you collect information from users" is not equivalent to "if you collect information about users". Pete collected information about users that Facebook freely gave him without requiring any binding agreement from him at all.

It's probable they would lose. But it's far, far more probable that Pete would be bankrupt well in advance of that point in time, and that's how the court system works.


> You might not agree with that but it is absolutely in their right to do so.

Well, they claim it is their right. The courts will decide if it actually is.

Plenty of places make claims to rights and make you sign waivers (dojos, for example), but fail in court. This could be interesting.


As far as I can see, they weren't creating a facebook application, so I'm not sure why the Terms & conditions are relevant.

If you run a public website, you have to accept that people may crawl your website. If you want to prevent it, don't let them crawl it. That's what robots.txt is for.


As far as I understand, the problem was not crawling the website, it was the way he used the data he gained by crawling.


I think it's quite ridiculous.

If I buy a map, can I use it to be a guide, giving people directions and make a living?

If I buy a story book, can I use it to read to children and collect money from their parents?

Or, if I spend some time to index all the books in the library, can I sell the index and make money? Or did I violate the rights of either the library or the books' copyright owners?

I think crawling and making use of the crawled data is not offending the copyright law because it's not actually a copy of the data. Rather, it's a service to transform the data and help other people get to the original data in an easier way.

Technically, they can crawl the data on the spot when receiving a request from their clients. In order to speed up the process, they somehow cache the data crawled from earlier occasions. But those are technical details, which are encapsulated. Otherwise, they could have stored URLs and offsets in the pages, rather than the data itself. I don't think that breaks the law, or there will be no way to avoid breaking the law to refer to anything.
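As a rough illustration of the "URLs and offsets" idea, here is a small Python sketch of an index that records only where each word occurs, not the page text. The names `Pointer` and `build_pointer_index` and the sample URL are invented for this example; a real search index would be far more involved.

```python
from collections import defaultdict
from typing import NamedTuple


class Pointer(NamedTuple):
    url: str
    offset: int  # character offset of the word within the page text


def build_pointer_index(pages: dict[str, str]) -> dict[str, list[Pointer]]:
    """Map each lowercased word to the (url, offset) places where it occurs."""
    index: dict[str, list[Pointer]] = defaultdict(list)
    for url, text in pages.items():
        position = 0
        for word in text.split():
            # Search from the end of the previous word to find this one.
            start = text.index(word, position)
            index[word.lower()].append(Pointer(url, start))
            position = start + len(word)
    return dict(index)


if __name__ == "__main__":
    pages = {"http://example.com/a": "hello world hello"}
    for word, pointers in build_pointer_index(pages).items():
        print(word, pointers)
```

Answering a query from such an index means going back to the original URL for the content, which is the sense in which nothing is "copied".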


A couple of examples: if you rent a DVD from one of those DVD rental places, normally you can't play it in public places like bars and places like that. If you buy a CD you can't broadcast it on your radio station, you need a special licence for that.

Another possible related issue is the question of derivative works (e.g. the index of a library). There's a Wikipedia article about derivative works, it's a bit too complex to summarize here: http://en.wikipedia.org/wiki/Derivative_work


Two extraordinarily poor examples!

Both films and music (plus songwriting too) have been explicitly enshrined by US copyright law as having separate rights for home use and public performance. Each has multiple spheres of statutory licensing organizations with special exemptions from anti-trust law.


I was referring to the copyright laws of the United Kingdom which is the country where I live (Copyright, Designs and Patents Act 1988). Here, public performance infringement is not limited to solely films and music (Section 16.1c).

I'm not very familiar with US copyright legislation but I assumed that the copyright laws there are similar without checking any sources - which was a mistake apparently.


Indeed. Knowledge about social graphs is probably the core business of Facebook. No wonder they want monopoly over the knowledge about the social graphs they host.

But, really, they can't stop it. And when (not if) nominative aggregated data is out, maybe the general population will begin to actually understand how expensive Facebook actually is.


The information is copyrighted. Redistribution without prior written permission from the copyright owner is illegal.


Can you copyright someone else's name and the names of their friends?


You can't copyright somebody's name and address, but you can copyright a map showing where everybody lives. That actually sounds like what Pete was doing, but Facebook could make similar claims about some of the information they provide.

Everything from stock prices to the temperature is in theory uncopyrightable, but you can still run into trouble if you just scrape some site and republish it.


You can't copyright facts, but you can copyright collections of facts.

This could have been an interesting case.

And regarding robots.txt - it's by no means anything more than a suggestion and common courtesy.. it's a way to suggest to crawlers what to ignore and how to behave with your site, to their benefit and yours - it's not by itself a legally binding, well, anything..... it was just a convention for all parties to unofficially cooperate.


> You can't copyright facts, but you can copyright collections of facts.

It depends, re: the oft cited example of phone books being uncopyrightable,


There has to be some kind of creative component to the compilation. And arguably Facebook didn't even compile the data in the first place -- it'd be like a college claiming a compilation copyright on the band fliers, roommate requests, and lost-pet ads pinned to a bulletin board in the student union building.


No


Why don't I just sue Google for crawling my website and get this court case done with? Once precedent is set it would be much harder for Facebook to make these types of claims. "Your honor in the case of Google Canada vs Zach Aysan the courts clearly decided that..." DONE. Actually, I'm surprised this hasn't happened already.


I'd guess you'd need to burn $100k before you even saw a judge. Are you up for that?


Could someone explain why it is so ridiculously expensive? At least in the US?


(Not a lawyer)

From my point of view, there have been lots of individually useful rules that have been added to the court system, that when taken in aggregate have negative effects. There are lots of pre-trial motions that have specific uses, designed to stop a problem, or plug a loophole in the law. The problem comes when you have multiple dozens of them. Or even just a few big ones.

For instance, discovery. The idea is that fairness is not helped by surprise evidence. So, the solution is that both sides get access to the other side's evidence. But how do you do that? You have lawyer time, calendar time, etc to collect and go through it all.

That's just one (although one of the larger ones). Add in requests for everything else, and you can see how much lawyer time is "wasted".


A lawyer's initial purpose is almost always intimidation.


Time. You will need a lawyer working for you nearly fulltime for months. I imagine that when you work for someone fulltime for months, you expect to get paid well enough that you can eat, pay rent, bills and maybe have some disposable income. Even the cheapest lawyer will have the same expectations.


You'd have to show damages. Otherwise (and correct me if I'm wrong on this point) you have no standing to file suit.


But wouldn't Facebook have to show damages here as well?


If someone could show that they could extract personally identifiable information from the aggregated database that isn't immediately obvious from browsing the site, I'd expect Facebook could make the case that distributing the database could cause serious damages, much like it did for Netflix.


How is it ridiculous? The more ridiculous thing is to expect robots.txt to actually have some sort of legal force.


Implied agreement.

The law acknowledges standard practice and expectations. That's what it is built on.

If you put something up on the web, then although you have copyright, you're also implying permission for people to be able to do what they normally do with it (at a minimum, view it in a web browser).

Similarly, if you've put up a robots.txt, then you're also quite clearly giving permission to crawlers (within certain bounds). There's also reasonableness and taking into account what robots.txt is capable of, of course. You clearly aren't implying permission for me to DoS you.

Explicitly agreed terms trump this. If it could be shown that the person operating the crawler had agreed to an AUP, then that would change things. I imagine that this would make this case quite different from Google crawling your site.


Army of Facebook Lawyers:

"Their contention was robots.txt had no legal force and they could sue anyone for accessing their site ..."

Lone data collector dude:

<blink> <blink>

It turns out the lawyers are right. Huh.


Oh. I actually wasn't saying that the guy "blinked." I was trying to evoke the image of a relatively powerless person staring at an army who had just given an order; the guy would of course have no response available except OK. Wasn't deriding the guy at all.


Facebook was founded by an outlaw scraper. Game recognize game, they're just neutralizing a threat, don't worry about the legal window dressing.

Read about Zuck's wget magic here, and how he illegally scraped (guarantee it was a TOS violation) all the Harvard online facebooks in 2003: http://www.scribd.com/mobile/documents/538697

As Balzac said, behind every great fortune lies a great crime :)


Where did that diary entry come from? Was that entered into public record as a result of a lawsuit against Facebook or something?

If it's real (it's a pretty appalling read into the guy's mind), what a great example of why you shouldn't keep all your private thoughts online...


He also wrote software that scraped and archived AIM status messages. Apparently he has a long history of scraping content in various forms.


Not to mention the fact that he 'scraped' the idea of the concept of facebook in the first place.


Legal precedent: "Copiepresse (Belgian Newspaper Conglomerate) v. Google (Read more.. ). Filed in August, 2006.

Claims: to remove all the content indexed by Google's crawlers on the newspaper's websites."

http://www.infoniac.com/offbeat-news/google-list-of-class-ac...

Google's response: "Of course, if publishers don’t want their websites to appear in search results (most do) the robots.txt standard (something that webmasters understand) enables them to prevent automatically the indexing of their content. It's nearly universally accepted and honoured by all reputable search engines."

http://googleblog.blogspot.com/2006/09/about-google-news-cas...

Maybe you can get Google to file an amicus brief...

"Outcome: Google had to remove the plaintiff's newspaper content from its database within 10 days or face fines of 1,000,000 Euro per day. Google had to publish "in a visible and clear manner and without any commentary from her part the entire intervening judgment on the home pages of google.be and of news.google.be for a continuous period of 5 days within 10 days... under penalty of a daily fine of 500,000 Euro per day of delay". Google was awarded the costs of the expenses of 941.63 Euro (summons) and 121.47 Euro (costs of the proceedings)."


I've been beaten into oblivion before for speaking out about the way I see the future shaping up re Big Corporations owning our information but I'll take my chances and use this opportunity to speak out against Facebook again. In truth, I did invoke Orwell...

I'm no Facebook hater, I actually use Facebook often but when they pull these kinds of stunts it really upsets me. I would imagine it would upset most freedom lovers on this site as well. When Facebook decides to take a smattering of our data and make it "Public" how then do they decide to control that data after the fact? Since when did Facebook re-define the word "Public"? It's like your cell phone provider and ISP redefining "unlimited". This stuff has to stop. Getting back to Facebook, since when does merely browsing to a URL(I) enter you into some sort of binding contract with the publisher?


Whilst i agree with you, the devil's advocacy in me says,

"what happens when Facebook gets sued for this data exposure?"

And there are two avenues that might happen - investors suing as to loss of potential revenue, and user(s) suing over privacy violations.

Bear in mind, FB have already had class actions against them.. i don't think setting aside another 10m+ (conservative) just to defend yet another nuisance lawsuit is something they're going to be up for.

So, as Pete Warden described, they bullied him into shutting it all down with threats that they could just as easily lose in court - based on the principle that it's cheaper for them to sue Pete and lose cred amongst us than it would be for them to face a class action from users who are led to believe that their privacy has been violated by the data set via an ambulance-chasing lawyer.

Facebook should have instead invited Pete in for a chat and a job, but instead they took the full frontal lawyer bulldog approach. Sometimes that happens. Hopefully next time it won't, and i've already poked my friends at FB to raise the issue internally if possible -- and so should you.


Wait, from reading the article it appears he didn't get sued by Facebook (they just threatened to sue).

There's a difference.


Yes... a difference between a lawsuit and a "legal extortion".


Barratry.


It's not extortion, they didn't settle for anything other than him deleting all copies of the data that he had access to. Hyperbolic statements are unnecessary.


They extorted the "right" to aggregate data from their site. Also note that such right could imply monetary benefits, so this is indeed extortion.


You are right, there is a huge difference. Essentially he's whining their legal terms do not serve his purpose. And they have better paid lawyers to prove it. And then he gets all upset about robots.txt which is actually not a contract of any sort. Let's all move along.



How about crawling search engines? For example, instead of crawling directly on facebook, crawl on a search engine with "site:facebook.com".

In that case, your data is gathered from the search engine and has nothing to do with facebook any more. And I doubt the search engine will sue you for using their service.


It's still Facebook's data (by which I mean "data that originated at Facebook"), and I guarantee you that their lawyers will still threaten a lawsuit. What do they have to lose?


You could then claim that the lawsuit is against the wrong person or would they argue that the original scraper was in the right and the secondary person was in the wrong?


Well, it's your use of the data, not so much the scraping itself. But way more importantly, the search engine has money and you don't.


I'm surprised that shutting up about this wasn't part of the agreement, either that or he's violating it. I've never heard of a settlement that didn't involve keeping your mouth shut.

I'm glad he posted about it though, that legal issue regarding robots.txt is good to know.


They never asked me to keep quiet about it, and they've been willing to comment about it too, eg

http://www.newscientist.com/article/dn18721-data-sifted-from...

I'd imagine they want to deter people from following in my footsteps, so publicizing their actions serves as a warning.


> I'd imagine they want to deter people from following in my footsteps, so publicizing their actions serves as a warning.

If anything, I'm encouraged to start a publicly-available database based on crawling Facebook. Don't associate your name with it, crawl from your Ipredator account, and make sure your data is spread far and wide. Then they can sue the Entire Internet like the music companies are having so much success with.

Information wants to be free.

(Also, what's the worst that could happen? You get sued, and Facebook gets a few thousand bucks from your savings account and your used mattress? OH NOES.)


> Information wants to be free.

I think the full quote is something like "Information wants to be free. Information also wants to be expensive". As I recall, it was about the dichotomy between the fact that you could copy bits for free, and that information was very valuable.


I thought it meant that Mr. Information was locked in a dungeon, yelling "Hellooooo? Can anyone hear me!???" to any passerby inside the walled garden. He wants to be free. He wants you to free him!


Even if you get sued, how would they know it's you? ie. how would they know who to serve a lawsuit to?

By the time they get a court order to take the data down, if they even bother, people will have already distributed it, and it'll be another 09:f9:11:02-type fiasco.

I hope he releases the code he used, although I wouldn't really blame him if he doesn't.


The open source php for the google profiler is in a link on the article, very easy to modify if you wanted to do the same thing.


He is not publishing the data as is. he is publishing his work done over the data.

or am i wrong?

Now can't i publish my thesis because it contains information i got from copyrighted books?


He offered to distribute the data as-is to whoever was interested.



