As much as unethical as using adblock or disabling javascript. If you don't want...

bduerst · on Oct 17, 2015

You still have copyright to your online content and can dictate how it can be used. Websites can also have TOS for their content. If you're arguing ethics then it could be unethical to use content in a way that the copyright owner doesn't want it to be used.

That doesn't mean that it doesn't happen.

jumperjake · on Oct 17, 2015

IANAL, but copyright governs redistribution of content not consumption (That's what pirates get busted for). I aslo recall that there was a ruling that footer TOSs aren't enforceable unless the user actively and explicitly agrees to them.

I agree with the GP in that public content is fair game. How do you thing google works?

bduerst · on Oct 17, 2015

Google technically respects robots txt and noindex metatags. OP is arguing the ethics of scraping, not if people are ignoring bot meta tags.

Copyright governs how the content is used, including distribution. The reason people who download videos are not liable is because you have to download the complete content to see the copyright. File sharers have already downloaded the content and are subject to copyright. Bots that scrape can interpret meta tags in the header of the dom, which is why scraping and violating copyright is unethical.

manigandham · on Oct 17, 2015

That's not a good argument. Do you ever leave your stuff lying around? I guess we can just take it then right?

Just because you have access to something doesn't give you permission to access or access it in any manner possible.

angry-hacker · on Oct 17, 2015

You have the right to disagree, but that's the way World Wide Web was built. Feel free to use alternative service(s) or stop publishing your stuff. Put it behind password or don't answer to my scrapers or browsers requests. Fair and simple.

Web is for people from people, not solely for company(s) financial interests.

manigandham · on Oct 17, 2015

What's the WWW or the way it's built have to do with it? And the web is just technology, it's not "for" anyone or anything in particular.

Intentions matter - on both sides. This is what most of the legal framework of the entire world is based on. You can disagree with that but again the ability to do something doesn't grant permission to do it. You're saying the solution to that is to remove the ability, but I don't see how that's realistic.

fauigerzigerk · on Oct 17, 2015

>What's the WWW or the way it's built have to do with it?

What it has to do with it is that putting an HTTP server on the public Web signals the intention to serve up resources to anyone who sends an HTTP request. Any restrictions to this default must be implemented explicitly on top of the default.

Leaving my stuff lying around does not signal my intention for anyone to take it, unless I let it lying around next to the bins.

So yes, intentions matter. The question is how we learn about them. Sometimes the choice of technology implies particular intentions by default.

manigandham · on Oct 18, 2015

Putting a up a webserver that can be publically reached is not authorization to access it. I really can't say this in any other way - just because you can do something doesn't mean you are allowed to, whether it's online or offline.

We already have an explicit signal called robots.txt which major search engines use. The problem is that there's no way to enforce this and there's just very little enforcement against actions on the web in general which is why people can get away with scraping but please don't mistake it for somehow being OK or allowed by the owner of that content. It's just not that simple.

fauigerzigerk · on Oct 18, 2015

">Putting a up a webserver that can be publically reached is not authorization to access it."

This is legally incorrect. Without any further information or protection measures by the publisher it is legal to access content on a public web server.

">I really can't say this in any other way - just because you can do something doesn't mean you are allowed to"

You are allowed to do everything that is not expressly forbidden by law. Accessing a public web server is not forbidden by law unless the owner takes steps to prevent you from accessing it or at least clearly signals that intention. Terms of service do not constitute an implied contract, so you are not required to read the TOS before accessing a public page.

">We already have an explicit signal called robots.txt"

Exactly. That is part of what I meant when I said that any restrictions had to be implemented on top of the default, which is that everyone who can is allowed to access anything on a public webserver.

[Edit] But initially, this wasn't a thread about legality but about ethics. I think there are unethical reasons to scrape and there are unethical reasons to block scrapers. We simply need to know more about the purpose of any scraping before making a judgement.

manigandham · on Oct 18, 2015

1) You're not a lawyer. https://news.ycombinator.com/item?id=10339328

2) This directly contradicts your previous comment: "Leaving my stuff lying around does not signal my intention for anyone to take it, unless I let it lying around next to the bins."

3) There is already case law precedent regarding this exact type of publically accessible information not being authorized: http://www.net-security.org/secworld.php?id=14614

I agree that this is about ethics and any scraper that doesnt honor robots.txt and explicitly uses different IPs, user-agents, and other methods to mainly disguise itself as a machine service is unethical in this context.

fauigerzigerk · on Oct 19, 2015

>You're not a lawyer

And you are a lawyer?

>This directly contradicts your previous comment

Absolutely not. The default intention of putting an HTTP server online is not "letting stuff lying around", it is publishing stuff. And yes, the default can be overruled in various ways.

>There is already case law precedent regarding this exact type of publically accessible information not being authorized

You're grasping for straws here. In this specific case, it was completely obvious that this information was not supposed to be public. It was an embarrassing security failure that the defendant wanted to expose.

I think we agree on a lot. robots.txt should be honored and scraping in way or for a purpose that negatively impacts the website's viability or business model is unethical. But usually, such purposes are covered by copyright law anyway.

yxdfasdjkljasdf · on Oct 17, 2015

That is not how HTTP works; your analogy is not correct.

Nobody is taking anything. If you don't want someone to access your page, then don't respond to their request.

manigandham · on Oct 17, 2015

Since there's no easy way to always reliably identify the requester, this gets complicated.

Most scrapers - including this one - advertise how they use multiple servers/locations/ips/etc to get around this.

yxdfasdjkljasdf · on Oct 17, 2015

I fail to see a problem you are trying to present.

Even if identification was hard, which is not true because of how HTTP works, it is irrelevant because HTTP doesn't discriminate. If someone does, that is their problem, and should be solved by them, and not a committee or law.

manigandham · on Oct 18, 2015

> If you don't want someone to access your page, then don't respond to their request

> there's no easy way to always reliably identify the requester

That's the problem: you can't identify the person to block them in the first place.

Robots.txt is actually an explicit signal of intention for reputable search engines but that's all we have today and is easily ignored and does not work with these scrapers or anyone else.

Not sure what your last sentence means.

dsjoerg · on Oct 17, 2015

At a high enough frequency, scraping is indistinguishable from a DDoS attack. Do you believe DDoS attacks are OK? How do you draw the line?

cookiecaper · on Oct 18, 2015

DDoS attacks are malicious events that disrupt service. In almost 100% of cases, scrapers don't want to disrupt service, because they need the data they're scraping. They want to be able to continue to get it, so they won't do things that may harm their ability to do that (including presenting honest IPs and user agents).

Services like this one actually make scraper-related unavailability, which IMO is already greatly exaggerated, less likely, since there will be fewer amateurs trying to write their own bots and accidentally breaking things.

To the extent that a scraper harms the other business, the scraping company can be held civilly liable on several accounts without specifically bringing scraping as a practice into the picture. All that matters is that they damaged the target site's ability to operate, not that they were saving [portions of] the pages (that'd be a separate copyright claim, unrelated to the disruption of service).

yxdfasdjkljasdf · on Oct 17, 2015

There is a clear distinction in the two. You are presenting a straw-man argument.

dsjoerg · on Oct 18, 2015

You haven't quite laid out your argument so I have to guess what it is.

When you say "That is not how HTTP works" it suggests that your claim is that anything that HTTP allows is ethically OK to do. However that is clearly a ridiculous stance, since a DDoS attack is a stream of valid HTTP requests and that's clearly not OK.

So I'm left wondering what your argument actually is for why unwelcome scraping is ethically OK.

I find this an interesting question, because while I would love for protcols to also define ethics, I feel that would be scope creep for the poor protocol designers. There's a wide variety of conduct and ethics questions that a protocol cannot address.

Where I myself draw the line is at protocol behavior intentionally designed to obscure my intentions. For example, sending my requests from a wide variety of IP addresses is behavior that is specifically designed to obscure where I'm coming from; my only intent in doing so would be to circumvent the intent of the serving machine from providing lots of content to a single requestor. At that point I'm engaging in deceptive behavior; I've crossed an ethical line.

yxdfasdjkljasdf · on Oct 18, 2015

When you say "That is not how HTTP works" it suggests that your claim is that anything that HTTP allows is ethically OK to do. However that is clearly a ridiculous stance, since a DDoS attack is a stream of valid HTTP requests and that's clearly not OK.

That wasn't a response made to your comment, and you are mixing two different arguments there. You guess in not correct.

So I'm left wondering what your argument actually is for why unwelcome scraping is ethically OK.

I never even suggested such an argument.

The behavior you described in the last paragraph is only deceptive from the eyes of an information and privacy surveillant state actor. Anonymity is not unethical, it is a human right.