More

danielmiessler · on Oct 17, 2015

Probably a super dumb question, but isn't this fairly unethical? The "automatic IP rotation" feature isn't there for no reason.

angry-hacker · on Oct 17, 2015

As much as unethical as using adblock or disabling javascript. If you don't want your content to be scraped, don't put it online!

bduerst · on Oct 17, 2015

You still have copyright to your online content and can dictate how it can be used. Websites can also have TOS for their content. If you're arguing ethics then it could be unethical to use content in a way that the copyright owner doesn't want it to be used.

That doesn't mean that it doesn't happen.

jumperjake · on Oct 17, 2015

IANAL, but copyright governs redistribution of content not consumption (That's what pirates get busted for). I aslo recall that there was a ruling that footer TOSs aren't enforceable unless the user actively and explicitly agrees to them.

I agree with the GP in that public content is fair game. How do you thing google works?

bduerst · on Oct 17, 2015

Google technically respects robots txt and noindex metatags. OP is arguing the ethics of scraping, not if people are ignoring bot meta tags.

Copyright governs how the content is used, including distribution. The reason people who download videos are not liable is because you have to download the complete content to see the copyright. File sharers have already downloaded the content and are subject to copyright. Bots that scrape can interpret meta tags in the header of the dom, which is why scraping and violating copyright is unethical.

manigandham · on Oct 17, 2015

That's not a good argument. Do you ever leave your stuff lying around? I guess we can just take it then right?

Just because you have access to something doesn't give you permission to access or access it in any manner possible.

angry-hacker · on Oct 17, 2015

You have the right to disagree, but that's the way World Wide Web was built. Feel free to use alternative service(s) or stop publishing your stuff. Put it behind password or don't answer to my scrapers or browsers requests. Fair and simple.

Web is for people from people, not solely for company(s) financial interests.

manigandham · on Oct 17, 2015

What's the WWW or the way it's built have to do with it? And the web is just technology, it's not "for" anyone or anything in particular.

Intentions matter - on both sides. This is what most of the legal framework of the entire world is based on. You can disagree with that but again the ability to do something doesn't grant permission to do it. You're saying the solution to that is to remove the ability, but I don't see how that's realistic.

fauigerzigerk · on Oct 17, 2015

>What's the WWW or the way it's built have to do with it?

What it has to do with it is that putting an HTTP server on the public Web signals the intention to serve up resources to anyone who sends an HTTP request. Any restrictions to this default must be implemented explicitly on top of the default.

Leaving my stuff lying around does not signal my intention for anyone to take it, unless I let it lying around next to the bins.

So yes, intentions matter. The question is how we learn about them. Sometimes the choice of technology implies particular intentions by default.

manigandham · on Oct 18, 2015

Putting a up a webserver that can be publically reached is not authorization to access it. I really can't say this in any other way - just because you can do something doesn't mean you are allowed to, whether it's online or offline.

We already have an explicit signal called robots.txt which major search engines use. The problem is that there's no way to enforce this and there's just very little enforcement against actions on the web in general which is why people can get away with scraping but please don't mistake it for somehow being OK or allowed by the owner of that content. It's just not that simple.

fauigerzigerk · on Oct 18, 2015

">Putting a up a webserver that can be publically reached is not authorization to access it."

This is legally incorrect. Without any further information or protection measures by the publisher it is legal to access content on a public web server.

">I really can't say this in any other way - just because you can do something doesn't mean you are allowed to"

You are allowed to do everything that is not expressly forbidden by law. Accessing a public web server is not forbidden by law unless the owner takes steps to prevent you from accessing it or at least clearly signals that intention. Terms of service do not constitute an implied contract, so you are not required to read the TOS before accessing a public page.

">We already have an explicit signal called robots.txt"

Exactly. That is part of what I meant when I said that any restrictions had to be implemented on top of the default, which is that everyone who can is allowed to access anything on a public webserver.

[Edit] But initially, this wasn't a thread about legality but about ethics. I think there are unethical reasons to scrape and there are unethical reasons to block scrapers. We simply need to know more about the purpose of any scraping before making a judgement.

manigandham · on Oct 18, 2015

1) You're not a lawyer. https://news.ycombinator.com/item?id=10339328

2) This directly contradicts your previous comment: "Leaving my stuff lying around does not signal my intention for anyone to take it, unless I let it lying around next to the bins."

3) There is already case law precedent regarding this exact type of publically accessible information not being authorized: http://www.net-security.org/secworld.php?id=14614

I agree that this is about ethics and any scraper that doesnt honor robots.txt and explicitly uses different IPs, user-agents, and other methods to mainly disguise itself as a machine service is unethical in this context.

fauigerzigerk · on Oct 19, 2015

>You're not a lawyer

And you are a lawyer?

>This directly contradicts your previous comment

Absolutely not. The default intention of putting an HTTP server online is not "letting stuff lying around", it is publishing stuff. And yes, the default can be overruled in various ways.

>There is already case law precedent regarding this exact type of publically accessible information not being authorized

You're grasping for straws here. In this specific case, it was completely obvious that this information was not supposed to be public. It was an embarrassing security failure that the defendant wanted to expose.

I think we agree on a lot. robots.txt should be honored and scraping in way or for a purpose that negatively impacts the website's viability or business model is unethical. But usually, such purposes are covered by copyright law anyway.

yxdfasdjkljasdf · on Oct 17, 2015

That is not how HTTP works; your analogy is not correct.

Nobody is taking anything. If you don't want someone to access your page, then don't respond to their request.

manigandham · on Oct 17, 2015

Since there's no easy way to always reliably identify the requester, this gets complicated.

Most scrapers - including this one - advertise how they use multiple servers/locations/ips/etc to get around this.

yxdfasdjkljasdf · on Oct 17, 2015

I fail to see a problem you are trying to present.

Even if identification was hard, which is not true because of how HTTP works, it is irrelevant because HTTP doesn't discriminate. If someone does, that is their problem, and should be solved by them, and not a committee or law.

manigandham · on Oct 18, 2015

> If you don't want someone to access your page, then don't respond to their request

> there's no easy way to always reliably identify the requester

That's the problem: you can't identify the person to block them in the first place.

Robots.txt is actually an explicit signal of intention for reputable search engines but that's all we have today and is easily ignored and does not work with these scrapers or anyone else.

Not sure what your last sentence means.

dsjoerg · on Oct 17, 2015

At a high enough frequency, scraping is indistinguishable from a DDoS attack. Do you believe DDoS attacks are OK? How do you draw the line?

cookiecaper · on Oct 18, 2015

DDoS attacks are malicious events that disrupt service. In almost 100% of cases, scrapers don't want to disrupt service, because they need the data they're scraping. They want to be able to continue to get it, so they won't do things that may harm their ability to do that (including presenting honest IPs and user agents).

Services like this one actually make scraper-related unavailability, which IMO is already greatly exaggerated, less likely, since there will be fewer amateurs trying to write their own bots and accidentally breaking things.

To the extent that a scraper harms the other business, the scraping company can be held civilly liable on several accounts without specifically bringing scraping as a practice into the picture. All that matters is that they damaged the target site's ability to operate, not that they were saving [portions of] the pages (that'd be a separate copyright claim, unrelated to the disruption of service).

yxdfasdjkljasdf · on Oct 17, 2015

There is a clear distinction in the two. You are presenting a straw-man argument.

dsjoerg · on Oct 18, 2015

You haven't quite laid out your argument so I have to guess what it is.

When you say "That is not how HTTP works" it suggests that your claim is that anything that HTTP allows is ethically OK to do. However that is clearly a ridiculous stance, since a DDoS attack is a stream of valid HTTP requests and that's clearly not OK.

So I'm left wondering what your argument actually is for why unwelcome scraping is ethically OK.

I find this an interesting question, because while I would love for protcols to also define ethics, I feel that would be scope creep for the poor protocol designers. There's a wide variety of conduct and ethics questions that a protocol cannot address.

Where I myself draw the line is at protocol behavior intentionally designed to obscure my intentions. For example, sending my requests from a wide variety of IP addresses is behavior that is specifically designed to obscure where I'm coming from; my only intent in doing so would be to circumvent the intent of the serving machine from providing lots of content to a single requestor. At that point I'm engaging in deceptive behavior; I've crossed an ethical line.

yxdfasdjkljasdf · on Oct 18, 2015

When you say "That is not how HTTP works" it suggests that your claim is that anything that HTTP allows is ethically OK to do. However that is clearly a ridiculous stance, since a DDoS attack is a stream of valid HTTP requests and that's clearly not OK.

That wasn't a response made to your comment, and you are mixing two different arguments there. You guess in not correct.

So I'm left wondering what your argument actually is for why unwelcome scraping is ethically OK.

I never even suggested such an argument.

The behavior you described in the last paragraph is only deceptive from the eyes of an information and privacy surveillant state actor. Anonymity is not unethical, it is a human right.

hvs · on Oct 17, 2015

It could certainly be used for unethical purposes, but it isn't de facto unethical.

snowwrestler · on Oct 17, 2015

I don't think it's unethical to pull down a copy of public information. If you pull too fast it might be considered rude (heavy load on the server). That's why some sites reflexively block all scrapers, hence the rotating IP feature. Hopefully this tool is rate-limited so it's not rude.

In terms of copyright, what matters is what you do with the scrape. If you scrape a public website for personal use, it's no different from just browsing it for personal use. If you try to republish the content for your own benefit, you'll run afoul of copyright law.

chinathrow · on Oct 18, 2015

If you disobey robots.txt it's unethical and rude.

snowwrestler · on Oct 19, 2015

I agree, but running a scraper does not necessarily mean disobeying robots.txt.

blairanderson · on Oct 17, 2015

mcs_ · on Oct 17, 2015

how can be the research into internet unethical? or the automation of that unethical?

danielmiessler · on Oct 2, 2015

Article author here.

The irony in your comment is that I recommend Drew's book in the primer, and Drew has recommended this primer as well.

Maybe they're both good ways to learn Vim.

stevebmark · on Oct 2, 2015

Then title it "Vim Primer," which is accurate.

coryfklein · on Oct 2, 2015

This is hardly a primer. If somebody is using the motions and actions described in the article they may not be a complete wizard in obscure vim features, but they are certainly far beyond being merely "primed" to use vim. I've been an avid vim proponent for years and I still picked up new things from this article.

danielmiessler · on July 13, 2015

I think you make good points here, but I don't think it'll stop this from happening. There are ways to try to capture the intangibles, and the functionality is just too rich for us to avoid.

robbrit · on July 14, 2015

It's not the IoT that I'm arguing against, it's the reputation system. I think that it's fairly dystopian and unfair, and even if it comes into play as you've predicted eventually a significant percentage of people will just abandon or ignore it just like they do with many other reputation systems.

danielmiessler · on July 13, 2015

I count one (1).

danielmiessler · on July 12, 2015

I think this might be overcomplicating things. Why not just a secret to create the valid sessionID in the first place? Why have it as a separate process if you already have the ability to mix in a secret?

danielmiessler · on May 19, 2015

Maybe this is just like the debate around driverless cars. "They're not perfect!", they say. "They're flawed just like anything else.", they say.

Well, they don't have to be perfect; they just have to be better than humans. And it turns out that's pretty easy.

So maybe rather than beat up on internet.org because it's not free as in perfection, maybe we should be happy that a billion dollar corporation is trying to do SOMETHING to help 4 billion people who can't afford the current option.

It doesn't have to be perfect. It just has to be better, for those billions of people with no access to the internet, than having nothing at all. And I think they are meeting that and far above it.

amazon_not · on May 19, 2015

The correct solution is to make real, uncensored internet access available and affordable to the masses, not to provide a better version of the North Korean intranet for free.

evincarofautumn · on May 19, 2015

There’s an important difference between “right” and “good”. Providing free basic internet access is clearly good, and internet.org is doing that. Many people argue, validly, that the manner in which they’re doing so is not right.

But that’s as if I were to tell you that giving anti-malarial drugs to people is wrong, because the right thing to do is to eradicate malaria itself. That’s probably true, but the practical thing to do right now is what’s good, not necessarily what’s right. And you can work on both fronts at once: they’re orthogonal.

I don’t pretend to understand the massive logistical challenges involved in implementing this, and it makes me sigh when others no more knowledgeable than myself make armchair proclamations about what should or should not be done.

amazon_not · on May 19, 2015

> Providing free basic internet access is clearly good, and internet.org is doing that.

If it were only so. I would have no issue with internet.org if they provided free basic internet access. They do not, and this is a VERY important distinction. Internet.org is a gated community with a gatekeeper and no security. It is like AOL or an intranet. It is by definition limited, excluding and discriminatory. It is very much not free basic internet.

> But that’s as if I were to tell you that giving anti-malarial drugs to people is wrong, because the right thing to do is to eradicate malaria itself.

The malaria analogy is a straw man. The resources and effort required to eradicate malaria are vastly larger than the effort and resources required to distribute anti-malaria drugs to a group of infected people. If they were the same it would obviously be both good and right to eradicate malaria. However, they are not.

The effort and resources to provide a gated internet.org and the effort and resources to provide an open internet.org are the same. Thus it is both good and right to provide an open internet.org.

> I don’t pretend to understand the massive logistical challenges involved in implementing this, and it makes me sigh when others no more knowledgeable than myself make armchair proclamations about what should or should not be done.

Unlike you, I do know what I am talking about having made a career in the telecoms industry.

Feeling good about internet.org is about as smart as feeling good about price dumping. All short turn gain for long term loss. Or if you feel like a more concrete example, it's about as smart as pissing in your pants when you are cold.

nitrogen · on May 19, 2015

Providing free basic internet access

There is no "basic" Internet access with a limited number of sites in the same sense that one can get "basic" cable with a limited number of channels. That's not Internet access. Calling it Internet access is disingenuous on the part of Internet.org, and providing it in the first place is destructive to the long-term interests of the intended beneficiaries.

What's both "good" and "right" is providing bandwidth-limited access to the entire Internet, subsidized if necessary by the local ads shown on a non-prioritized, low-bandwidth-friendly localized version of Facebook.

niklasni1 · on May 19, 2015

Providing internet access in the 3rd world should be the same as anywhere else: Shut up and provide bandwidth.

danielmiessler · on April 23, 2015

Well done. Here was my attempt at the same:

https://danielmiessler.com/study/pvsnp/

vivaldifan · on April 23, 2015

Great. I think it would be better if you visualize the travelling salesman problem and watermelon problem.

danielmiessler · on March 23, 2015

This is even less honest for a simple reason: the manager keeps asking, "will this work for you?"

It's fake in the same way that the sandwich is. They're being reprimanded, and that's the fact of it.

I think the sandwich is much more honest as long as you're being honest when you execute it.

You could even merge the two and say:

"Here's what we're going to talk about. We're going to talk about what's been going well lately, talk about an issue I've seen recently, and then close out with a plan to improve things. How does that sound?"

danielmiessler · on March 18, 2015

This is a legit issue, and you can definitely expect it to be patched quite soon. Not sure how/why someone would think it wouldn't get patched.

Many, many enterprises bet their data on passcodes combined with the 10-guess wipe defense. You can bet that they've already called Apple many times about this.

It'll be patched very soon.

robmcm · on March 18, 2015

Enterprise can mandate complex pass codes, not the 4 digit pins. The 11 hour number will increase rapidly with complex codes. However I assume they will still want the wipe feature to actually work.

danielmiessler · on March 7, 2015

I hear and feel the frustration, but it's misplaced.

If a man shows up to a tech conference dressed like the guys from Jersey Shore, he's going to be looked down on by everyone there. He'll be assumed to be part of the delivery people doing setup for the booths.

If he complains that he's a programmer and that he shouldn't be judged by his clothing, he will get mixed results.

It's true that if he turns out to be a nice guy, and a great programmer, then people will change their opinions of him.

But the one thing we cannot do is demand that the entire world see signals differently than they see them out on the street.

When someone dresses like the men on Jersey Shore, they do so because they are signaling certain things. They're signaling masculine power. Strength. Sexual prowess. Fighting ability. Etc.

Women who dress extremely femininely and girlishly are also sending signals that literally BILLIONS of people already know how to receive.

Don't be surprised when people interpret signals the way that is most beneficial to them in 99.9% of cases.

This is not a message that men from New Jersey or Women in general cannot be seen as programmers. It's a message that signaling matters, and we must be aware of what messages we're intentionally sending to others that we may need to overcome.

rfrey · on March 7, 2015

I agree with everything you said, except the conclusion (that you led with.)

The OP said she wore a dress. Not "extremely femininely and girlishly". She wore a dress.

"But the one thing we cannot do is demand that the entire world see signals differently than they see them out on the street."

Apparently the entire world thinks wearing a dress is a signal that one cannot be a technical person. That's the point of the article.

danielmiessler · on March 19, 2015

You missed the other cues in the article.

stephenboyd · on March 7, 2015

"But the one thing we cannot do is demand that the entire world see signals differently than they see them out on the street."

We can work on it instead of trying to justify such a narrow conception of what a competent programmer looks like. We can and do make progress on how people interpret the appearances of others.