Not really: For whatever reason, my electric company commonly drops power for a ...

pdonis · on Aug 19, 2013

> that one second power drop also causes my cable modem to forget the IP address it was assigned via DHCP by my ISP.

Does the IP address actually change when this happens? Some ISPs have their DHCP server assign IPs according to the MAC address of the cable modem, which of course won't change if power is shut off and then turned on again.
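To make the MAC-based assignment concrete, here is a toy sketch. It is not a real DHCP server; the class name `StickyLeasePool`, the MAC addresses, and the documentation-range IPs are all made up for illustration. It only shows why a device that keeps its MAC address would get the same IP back after a power cycle under this policy.

```python
class StickyLeasePool:
    """Toy address pool that remembers which MAC address got which IP."""

    def __init__(self, addresses):
        self._free = list(addresses)   # addresses not yet handed out
        self._by_mac = {}              # MAC address -> assigned IP

    def request(self, mac):
        """Return the IP for this MAC, assigning a free one on first contact."""
        mac = mac.lower()
        if mac not in self._by_mac:
            if not self._free:
                raise RuntimeError("address pool exhausted")
            self._by_mac[mac] = self._free.pop(0)
        return self._by_mac[mac]


pool = StickyLeasePool(["203.0.113.10", "203.0.113.11"])
before = pool.request("00:11:22:33:44:55")   # modem boots for the first time
other = pool.request("66:77:88:99:AA:BB")    # a different customer's modem
after = pool.request("00:11:22:33:44:55")    # first modem reboots after a power drop
print(before, other, after)
```

An ISP that hands out addresses from a shared pool without remembering the MAC would behave differently, which is the situation graycat describes.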

> the guy with the C&D letter can tell the Web site that traffic on the banned IP address was no good evidence that the traffic was from him and, instead, could have been from any customer of the relevant ISP.

If he was a private individual getting internet access from an ISP that did that, sure. But in the particular case referred to in the OP, the "guy with the C&D letter" was a company, not an individual, and as I understand it, the IP addresses that were banned were the ones mapped to that company's domain name based on DNS records. That's a different situation.

> the ISP could assign the IP address the Web site banned to just any customer not involved in the C&D letter, etc. Then the Web site would ban that person; I hope that person would not get charged with a crime.

It's not clear how Craigslist found out that 3Taps had changed the IP addresses it was using and resumed scraping the site. However, whatever means it used to find that out was apparently accurate, since 3Taps admitted that it had changed IP addresses and was still scraping the site. The court case was based entirely on Craigslist saying that 3Taps was no longer authorized to access their site; there was no dispute about whether they had actually done so.

graycat · on Aug 19, 2013

This case is stupidity piled higher and deeper!

> Does the IP address actually change when this happens? Some ISPs have their DHCP server assign IPs according to the MAC address of the cable modem, which of course won't change if power is shut off and then turned on again.

With my ISP, yes, the address changes. I get a new IP address whenever I cycle power on my cable modem. And sometimes my ISP gives me a new IP address for whatever reason. Maybe their reason is that they want to charge a little more for a fixed IP address. Or maybe they have more paying customers than IP addresses, so they must dynamically assign IP addresses to users actually connected.

Yes, if 3taps was accessing Craigslist from a fixed IP address, then that can be fair, although not really good, evidence that 3taps was continuing to access Craigslist after the C&D letter.

If 3taps just admitted continuing to use Craigslist, then they were, just how do I say it, s.t.u.p.i.d, or some such? Or, all a 3taps person had to do was just go home and get the Craigslist data from a home computer with a different IP address. I'm not up on mobile devices, but I have to believe that they also use frequently changing, dynamically assigned IP addresses.

Also I don't like the idea that there is screen scraping as something different from ordinary usage; it's not. The Web site sends the data, and the data is nearly always stored on disk by the Web browser. Also the Web browser can write the data to an HTM file and a directory with the JS, CSS, JPG, PNG, GIF, etc. files. Then the user has essentially all the data in simple, plain unencrypted form. Nearly all the data is sent just as simple text. E.g., likely the Craigslist data is sent this way. Then if someone wants to make some new use of that Craigslist data, they can easily remove the HTTP, HTML, CSS, JS, etc. stuff, leave the simple text, analyze it, reformat it, combine it with other data, format it with Word, TeX, PostScript, PDF, etc., wrap it in some new HTML, CSS, and JS, and publish it again. Then it need not be the least bit clear just where the data came from. In this case, with little good evidence that the data came from Craigslist, it would not be fair to search the facilities of 3taps for evidence.
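As a rough illustration of how little work it takes to strip the markup and keep the plain text, here is a minimal sketch using only Python's standard `html.parser` module. The sample page, the class name `TextExtractor`, and the helper `html_to_text` are invented for this example; real pages are messier and would need more care.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page, standing in for whatever a site sends.
SAMPLE_PAGE = """<html><head><style>p { color: red; }</style>
<script>var tracking = 1;</script></head>
<body><h1>Bike for sale</h1><p>Blue road bike, <b>$200</b>.</p></body></html>"""

print(html_to_text(SAMPLE_PAGE))
```

Once the text is out of the HTML, nothing in it records where it came from, which is the point being made above.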

Broadly, the Web site offers the data to all anonymous users, as mostly just simple text. In that case, the Web site should basically just shut up about what happens to the data they sent.

pdonis · on Aug 19, 2013

> I'm not up on mobile devices, but I have to believe that they also use frequently changing, dynamically assigned IP addresses.

I believe that's correct, yes. If they are using wifi, they will appear to be connecting using the wifi router's public IP address, which will certainly be different for each wifi router. If they are using the cell phone network to connect, I'm pretty sure they get assigned a public IP address based on which cell tower they are using to connect, so that will change as well.

> I don't like the idea that there is screen scraping as something different from ordinary usage; it's not.

In terms of the data itself, you're right, screen scraping is just pulling the data, the same as a web browser does.

However, since screen scraping can be automated, it can potentially use a lot more bandwidth, since it can request multiple pages from a site much faster than a human driving a browser can. That's why sites are allowed to restrict what automated search bots can do on their sites, for example with a robots.txt file. A service like 3Taps would be expected to respect these types of restrictions just like Google does.
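For reference, checking a robots.txt file from a script is straightforward with Python's standard `urllib.robotparser`. This is only a sketch: the robots.txt content, the bot name `ExampleScraper`, and the `example.com` URLs are assumptions made up for the example, not Craigslist's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real tool would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Crawl-delay: 10
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


BOT = "ExampleScraper"  # made-up bot name
rules = build_rules(ROBOTS_TXT)

allowed_listing = rules.can_fetch(BOT, "https://example.com/listing/123")
allowed_search = rules.can_fetch(BOT, "https://example.com/search?q=bikes")
delay = rules.crawl_delay(BOT)  # seconds the site asks bots to wait, or None

print(allowed_listing, allowed_search, delay)
```

Note that robots.txt is a request, not an enforcement mechanism; the file only works if the tool chooses to consult it.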

That said, I don't think the issue in this case was the screen scraping per se; I think the issue was that Craigslist asserted copyright over their data, so that they had the right to say that 3Taps could not use the data the way they were using it.

graycat · on Aug 19, 2013

For screen scraping, it appears that some people want to say that, because some software in effect provided the browser keystrokes or mouse clicks, something was wrong.

Once I wrote a little program that gets Web pages from a Web site; if I handled all the details correctly, then there is no way for the Web site to know that it is sending the data to my program instead of a Web browser. Indeed, essentially I wrote a Web browser. That my Web browser just wrote data to files and did not provide a graphical user interface on my screen is none of the business of the Web site. It can't be illegal to write a Web browser, especially a very simple one.

For getting pages too fast, just write the software to get the pages more slowly. Done. Or if you want to use one computer to get 100 pages from each of 10,000 sites, then get one page from each of the 10,000 sites, 100 times. Done.
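The "one page from each site, then repeat" ordering can be sketched in a few lines of Python. The function names `round_robin` and `crawl`, the `fetch` callback, and the one-second default delay are all assumptions for illustration; no real network code is included.

```python
import time
from itertools import zip_longest


def round_robin(pages_by_site):
    """Yield (site, page) pairs one site at a time so no site sees a burst."""
    sites = list(pages_by_site)
    # Each batch holds the next page for every site (None once a site runs out).
    for batch in zip_longest(*(pages_by_site[site] for site in sites)):
        for site, page in zip(sites, batch):
            if page is not None:
                yield site, page


def crawl(pages_by_site, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(site, page) in round-robin order, pausing after each request."""
    results = []
    for site, page in round_robin(pages_by_site):
        results.append(fetch(site, page))
        sleep(delay_seconds)
    return results


# Usage with a stand-in fetch function and no real waiting.
plan = {"a.example": ["/1", "/2"], "b.example": ["/1"]}
print(crawl(plan, fetch=lambda site, page: site + page, sleep=lambda s: None))
```

With many sites in the plan, each individual site sees only one request per pass, even though the crawler as a whole stays busy.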

> I think the issue was that Craigslist asserted copyright over their data,

Fine. But there is an issue: Just how the heck is Craigslist to know who got the data? Not from IP address -- that's terrible evidence. Then how's Craigslist to know just what the heck the data was used for? Even if it's clear that the data was from Craigslist originally, the person using or misusing the data might have gotten the data from someone else and not directly from Craigslist.

So, to me, for Craigslist to run around with lawyers and C&D letters attacking Internet users looks like a bummer. If a user does something obvious and blatant with Craigslist data, or is dumb enough to admit getting the data after a C&D letter, then okay. But mostly the legal effort is a loose cannon on the deck that can hurt a lot of people based on really poor evidence.

pdonis · on Aug 19, 2013

> there is no way for the Web site to know that it is sending the data to my program instead of a Web browser.

Except through your User-Agent string. Which can, of course, be faked, but if you are actually running a scraper or other automated tool, you're not supposed to use a browser User-Agent string.
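For what it's worth, setting a descriptive User-Agent from a script takes one header. The sketch below uses Python's standard `urllib.request.Request` and only builds the request without sending it; the bot name `ExampleScraper/0.1`, the info URL, and the target URL are placeholders invented for the example.

```python
from urllib.request import Request

# Made-up identifier: a tool name, a version, and a page describing the bot.
BOT_USER_AGENT = "ExampleScraper/0.1 (+https://example.com/bot-info)"


def build_request(url, user_agent=BOT_USER_AGENT):
    """Build (but do not send) a request that identifies the tool honestly."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://example.com/listing/123")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

Passing a browser's string as `user_agent` is just as easy, which is why the header is a convention rather than proof of anything.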

> For getting pages too fast, just write the software to get the pages more slowly. Done.

Yes, agreed; all search bots and other automated tools are supposed to do this.

> But there is an issue: Just how the heck is Craigslist to know who got the data?

I don't know how Craigslist found out in this specific case; but the point was moot anyway because 3Taps admitted they had obtained the data; there was no dispute about that. The dispute was entirely over whether what 3Taps was doing with the data once they got it was "authorized".

> how's Craigslist to know just what the heck the data was used for?

Because 3Taps admitted what they were using it for. There was no dispute about that either, only about whether that use was authorized.

> for Craigslist to run around with lawyers and C&D letters attacking Internet users looks like a bummer.

Have they been doing that? In this particular case, as I said above, there was no dispute at all about the facts, only about the legal rights involved. I don't see any evidence that Craigslist is indiscriminately banning people and then suing them based on disputed facts; the only dispute I see is over whether Craigslist should be able to assert the rights it's asserting over its data.

graycat · on Aug 19, 2013

> your User-Agent string

Sure, my software sends a nice, simple, vanilla pure, good looking string for the user agent string.

I agree with you about essentially all the details of this specific case.

As seemingly hinted in the OP, my concern is with the more general situation -- could a Web site use lawyers, C&D letters, and IP addresses to make big legal problems for Internet users who download an unusually large number of Web pages? I hope not.

Then there's the suggestion that for a user to get a new IP address is somehow nefarious -- it's not. And there's calling getting Web pages screen scraping as if it is different, unusual, and nefarious -- it's not. Then there's the suggestion that what the user did that was bad was getting the data when the real problem was that the user republished the copyrighted data.

pdonis · on Aug 19, 2013

I don't think* this case gives any basis for a site to take legal action against someone just based on downloading a large number of web pages or accessing the site with different IP addresses. There has to be quite a bit more than that. I don't think the headline of the article really gets across all of the factors that had to be present for this ruling to go the way it did (but the body of the article does a better job of that).

* - of course, IANAL.

graycat · on Aug 20, 2013

> be faked

Be careful: The purpose of the agent string is to tell the server how to treat the client. That is, different Web browsers do different things with the same HTML, JS, CSS, etc. So, the agent string tells the Web site how the browser wants to be treated.

In my little program to get Web pages, I just tell the Web server how I want my program treated -- like a certain Mozilla browser. This is not "faking" anything. It would do no good to tell the Web server that I wrote my own Web browser because the Web server would know nothing about my browser and, thus, have no way to respond to it in any special way. So, I just tell the Web server to treat me like Mozilla.

Faking is not really the point.

We've got evil on the brain here.

I wrote my own Web browser. So what?

pdonis · on Aug 20, 2013

> Faking is not really the point.

No, but giving reasonably accurate information about what kind of user agent is being used is. If you write your own browser, yes, you're probably better off telling a website that it's, say, Firefox than telling it it's "Joe's Really Cool Browser v1.0". But if you're writing a program whose purpose is not to display pages to the user, but to do something else, your program shouldn't be telling web servers that it's a program whose purpose is to display pages to the user.