For screen scraping, it appears that some people want to say that, because some software in effect provided the browser keystrokes or mouse clicks, something was wrong.
Once I wrote a little program that gets Web pages from a Web site; if I handled all the details correctly, then there is no way for the Web site to know that it is sending the data to my program instead of a Web browser. Indeed, essentially I wrote a Web browser. That my Web browser just wrote data to files and did not provide a graphical user interface on my screen is none of the business of the Web site. It can't be illegal to write a Web browser, especially a very simple one.
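For concreteness, a minimal sketch of that kind of program in Python might look something like the following; the URL and output filename are placeholders, not anything from the actual program:

```python
# Minimal sketch: fetch a page and write the body to a file instead of
# rendering it. From the server's side this is an ordinary HTTP GET.
import urllib.request

def fetch_to_file(url, path):
    with urllib.request.urlopen(url) as response:
        body = response.read()
    with open(path, "wb") as f:
        f.write(body)

if __name__ == "__main__":
    # Placeholder URL and filename, for illustration only.
    fetch_to_file("https://example.com/", "page.html")
```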
For getting pages too fast, just write the software to get the pages more slowly. Done. Or if you want to use one computer to get 100 pages from each of 10,000 sites, then get one page from each of the 10,000 sites, 100 times. Done.
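A rough sketch of both ideas, with a made-up site list, a hypothetical URL scheme, and an arbitrary delay:

```python
# Sketch: sleep between requests, and round-robin across sites so no
# single site sees a rapid burst of requests.
import time
import urllib.error
import urllib.request

# Placeholder base URLs; in the scenario above this would be 10,000 sites.
sites = ["https://example.com", "https://example.org", "https://example.net"]
PAGES_PER_SITE = 100
DELAY_SECONDS = 2.0  # pause between consecutive requests

def fetch(url):
    with urllib.request.urlopen(url) as response:
        return response.read()

# One page from each site, then a second page from each site, and so on.
for page_number in range(1, PAGES_PER_SITE + 1):
    for site in sites:
        try:
            fetch(f"{site}/page/{page_number}")  # hypothetical URL scheme
        except urllib.error.URLError:
            pass  # a real tool would log the failure and maybe retry
        time.sleep(DELAY_SECONDS)
```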
> I think the issue was that Craigslist asserted copyright over their data,
Fine. But there is an issue: Just how the heck is Craigslist to know who got the data? Not from the IP address -- that's terrible evidence. Then how's Craigslist to know just what the heck the data was used for? Even if it's clear that the data was from Craigslist originally, the person using or misusing the data might have gotten it from someone else and not directly from Craigslist.
So, to me, for Craigslist to run around with lawyers and C&D letters attacking Internet users looks like a bummer. If a user does something obvious and blatant with Craigslist data, or is dumb enough to admit getting the data after a C&D letter, then okay. But mostly the legal effort is a loose cannon on the deck that can hurt a lot of people based on really poor evidence.
> there is no way for the Web site to know that it is sending the data to my program instead of a Web browser.
Except through your User-Agent string. Which can, of course, be faked, but if you are actually running a scraper or other automated tool, you're not supposed to use a browser User-Agent string.
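For example, an automated tool can identify itself in the request headers instead of borrowing a browser string; a sketch, with a made-up tool name and contact URL:

```python
# Sketch: send a descriptive User-Agent so the server can tell this is a
# bot and knows whom to contact about it. Name and URL are illustrative.
import urllib.request

req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "ExampleFetcher/1.0 (+https://example.com/bot-info)"},
)
with urllib.request.urlopen(req) as response:
    body = response.read()
```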
> For getting pages too fast, just write the software to get the pages more slowly. Done.
Yes, agreed; all search bots and other automated tools are supposed to do this.
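In practice that usually means checking robots.txt and honoring any Crawl-delay it declares before fetching; a sketch, with an illustrative agent name:

```python
# Sketch: consult robots.txt and respect its Crawl-delay, if any.
import time
import urllib.robotparser

AGENT = "ExampleFetcher"  # illustrative bot name

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/page"
if rp.can_fetch(AGENT, url):
    delay = rp.crawl_delay(AGENT) or 1.0  # fall back to a polite default
    time.sleep(delay)
    # ... fetch the page here ...
```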
> But there is an issue: Just how the heck is Craigslist to know who got the data?
I don't know how Craigslist found out in this specific case; but the point was moot anyway because 3Taps admitted they had obtained the data; there was no dispute about that. The dispute was entirely over whether what 3Taps was doing with the data once they got it was "authorized".
> how's Craigslist to know just what the heck the data was used for?
Because 3Taps admitted what they were using it for. There was no dispute about that either, only about whether that use was authorized.
> for Craigslist to run around with lawyers and C&D letters attacking Internet users looks like a bummer.
Have they been doing that? In this particular case, as I said above, there was no dispute at all about the facts, only about the legal rights involved. I don't see any evidence that Craigslist is indiscriminately banning people and then suing them based on disputed facts; the only dispute I see is over whether Craigslist should be able to assert the rights it's asserting over its data.
Sure, my software sends a nice, simple, plain vanilla, good-looking user agent string.
I agree with you about essentially all the details of this specific case. As seemingly hinted in the OP, my concern is with the more general situation -- could a Web site use lawyers, C&D letters, and IP addresses to make big legal problems for Internet users who download an unusually large number of Web pages? I hope not.
Then there's the suggestion that for a user to get a new IP address is somehow nefarious -- it's not. And there's calling getting Web pages "screen scraping" as if it were something different, unusual, and nefarious -- it's not. Then there's the suggestion that what the user did that was bad was getting the data, when the real problem was that the user republished the copyrighted data.
I don't think this case gives any basis for a site to take legal action against someone just based on downloading a large number of web pages or accessing the site with different IP addresses. There has to be quite a bit more than that. I don't think the headline of the article really gets across all of the factors that had to be present for this ruling to go the way it did (but the body of the article does a better job of that).
Be careful: The purpose of the agent string is to tell the server how to treat the client. That is, different Web browsers do different things with the same HTML, JS, CSS, etc. So, the agent string tells the Web site how the browser wants to be treated.
In my little program to get Web pages, I just tell the Web server how I want my program treated -- like a certain Mozilla browser. This is not "faking" anything. It would do no good to tell the Web server that I wrote my own Web browser, because the Web server would know nothing about my browser and, thus, have no way to respond to it in any special way. So, I just tell the Web server to treat me like Mozilla.
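To make that concrete, a toy, hypothetical server (not any real site's behavior) that varies its response based on the User-Agent header might look like this:

```python
# Toy illustration: the server inspects User-Agent and decides how to
# treat the client. Purely hypothetical behavior, for illustration only.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if "Mozilla" in agent:
            body = b"<html><body>Full page served to browsers</body></html>"
        else:
            body = b"<html><body>Plain page for other clients</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```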
No, but giving reasonably accurate information about what kind of user agent is being used is. If you write your own browser, yes, you're probably better off telling a website that it's, say, Firefox than telling it it's "Joe's Really Cool Browser v1.0". But if you're writing a program whose purpose is not to display pages to the user, but to do something else, your program shouldn't be telling web servers that it's a program whose purpose is to display pages to the user.