Well, the article is right. But what happens in practice in cases like these is that everything has been made illegal and it's just arbitrarily enforced. Anything grandpa judge can't understand, say anything beyond using a browser (wget?), is a hacking tool, and as long as the person you annoyed has deep pockets or connections, you'll go to prison.
The problem isn't with the details of any particular scraping or subtlety. The problem is with the law, the CFAA (18 U.S.C. § 1030), and how it can be interpreted however the particular DA wants. The judges are too ignorant of technology to stop them.
This is unconscious bias in action, I guess. The statement was likely not intended to be harmful, but rather to sound witty while making a point - yet it is indeed offensive if you think about the problem of age discrimination (especially in the tech industry).
You mean if all the judges were old white men and they were somehow unable to understand what seemed obvious to young women of color so their rulings were discriminatory? Ok, I'm seriously considering it.
I watch a lot of tv. I thought all judges were black women in their 30s.
In reality you see all genders, races, and backgrounds. Appointed judges generally have a good balance of genders, races, and backgrounds; elected judges tend to reflect the makeup of the voting population. They tend to be older because experience really does help.
I thought about this sequence of comments seriously for quite a while. I accept that "grandpa judge" was inappropriate. But how would you phrase the problem of our judges not understanding technology? Is it a judge thing? Lawyers understand. But lawyers tend to be younger. There are few young judges. I think age, and the lack of life experience with tech because they grew up before it, probably is the factor that matters here.
>Lawyers understand. But lawyers tend to be younger.
Lawyers also tend to be specialists to a greater degree than judges. I know quite a few IP/technology licensing/etc. lawyers who are very technically knowledgeable and many of them are not especially young.
But it's what they do for a living. And I'm pretty sure I'll pick one of them to help me out on most technology-related matters rather than a random 30 year old criminal defense lawyer.
Judges tend to be much more generalist. They have to rule on all sorts of arcane points of law which they probably haven't made a career out of studying. But some of them are pretty astute on matters of technology (and at higher levels of the court, they have clerks too.) Look at Judge Alsup in the Oracle-Google case https://www.theverge.com/2017/10/19/16503076/oracle-vs-googl... or Judge Kimball in the SCO case.
Actually, few do. However you can choose your lawyer, and therefore pick one well-versed in the relevant area of technology and who also has a network of expert witnesses and advisors.
I appreciate this more thoughtful response, but bear in mind that it is the role of domain experts to (hopefully) clarify complicated matters for those on the outside, whatever their demographic background.
Age has positives and negatives associated with it, and ascribing only the latter to it with regard to judges and technology would mean we should set an upper age limit to all court proceedings dealing with any complicated subject - technology is hardly unique in that respect.
There might be an argument for that (or at least some competency test beyond a certain advanced age), but not in the crass, prejudiced way implied in the OP.
>I appreciate this more thoughtful response, but bear in mind that it is the role of domain experts to (hopefully) clarify complicated matters for those on the outside, whatever their demographic background.
I'm familiar with one particular subset of the SCO case. Suffice it to say that a great deal of legal and expert witness time (from both sides) went into creating detailed reports that were intended to be consumed by a non-specialist that laid out history, discussed various claims, and so forth.
A judge could of course choose not to spend the time and effort to digest all this information--but that would be the case whether you're talking technology contract and IP claims or you're talking some complicated and arcane case of real estate contracts.
It's not ageism when it is true. Ism would imply a bias. We all know old people don't have tech chops like young people. It's literally scientific fact that brains get less plastic, lazier, and less adept at learning new things as they age. There's even an adage!
"You can't teach an old dog new tricks!"
The average age of newly appointed judges is 50. Yeah, keep on smokin that identity politics shenanigans instead of statistics and science.
I would agree with you if 100% of older judges were less adept at technology than all younger people. But the fact is, that while the stereotype of older people being worse at tech than younger people is true on a statistical level, it is not true on every individual level. And that is the very definition of -ism. Assuming someone must not be good at something because of their age or group or whatever.
Statistically, black people are less likely to have a college degree than white people. Does that mean it is ok to not consider black candidates for hire since statistically, they are less educated? Of course not.
If the problem is that judges don't understand technology, then the solution is to get more judges that do. Not, get younger judges and hope that they understand tech better than their older counterparts because it will probably play out that way statistically.
I agree to a point, but I think it's more a lack of training than purely an age thing. Yes, "grandpa judge" might not be able to learn new things now, but they could have put effort into learning things then. Even if hardly anyone uses FORTRAN any more, being familiar with the field then has got to count for something now.
I'm not saying a judge should know every field - far from it - but if they could read around just one other subject while in law school and specialise in law surrounding that subject, we may have a hope of having $field literate judges in the future.
Who do you think built most of the tech you’re using? Bear in mind that HTTP (relevant in this particular article) was invented almost 25 years ago. And TCP/IP is far older.
I don't think it's about ageism, but about the pace of technology. Many (if not all?) of the concepts being debated were likely invented after most current judges were born: the internet; HTTP; web scraping; browsers. It's not easy to understand something you've not spent much time learning about. And the rule book on laws regarding the internet is not fully written yet.
How much do you think the average 25 year old knows about networking protocols, HTTP, and web scraping--much less how those areas intersect with legal precedent and practice? Especially given (maybe barring patent and technology-related IP lawyers) they probably have a liberal arts undergrad degree of some sort.
>And the rule book on laws regarding the internet is not fully written yet.
Yes, it is certainly true that there's a lot of technology-related law which is pretty unsettled relative to areas of law like contracts and property that have at least centuries of jurisprudence behind them.
ADDED: And one last thing I'd add to the last point is that, when technical people complain about judges (or whoever) not understanding technology when they make some decision, at least some of the time, they understand perfectly well but the existing laws don't lead to the outcome that the technical people want.
Although it sometimes can seem like they do, courts don't usually just make up things out of whole cloth to get the result they personally think is best.
But I've witnessed some particularly aggressive scrapers bring a shopping site to its knees.
I've watched as multiple scrapers were launched at this site from different blocks of IP addresses. I spent some hours blocking these.
After spending even more time, I tracked down these IP blocks - from multiple countries - back to the one scraping company, who had obviously been hired by a competitor to scrape the entire site for its prices for the ranges of goods sold.
I consider such aggressive scraping to be a DDoS attack.
It’s more like service slaughter (which isn’t a term AFAIK) than denial of service (since denial isn’t the goal).
It’s a tough problem though. In an ideal world, I think scrapers should pay for the compute/resources required to serve their requests. I think that’s one of the only ways to scale for this fairly. (This assumes you agree that scraping of public content should be allowed in the first place.)
That's an interesting idea. Sometimes a lot of the (service provision) cost depends on how efficient the data access is at the host.
For example, let's take an API that includes a 'GET /data?start=0&end=100' endpoint that retrieves items with key between 0 and 100.
If items are stored in sorted key order, that could be an efficient and inexpensive query that retrieves exactly the 100 relevant items and doesn't perform any other unnecessary work.
If, on the other hand, the database contains a billion items and they aren't stored in key order, then with a caller-pays pricing model, it's (probably, for the sake of argument) not worth running that query.
In some cases it might be reasonable to continue to pay for the expensive queries. But what if all you need to do is convince the host to 'CREATE INDEX ... ON (key)'?
(all very hypothetical - I don't have any explicit answers. if the host service were open source, the answer might be 'open a pull request and talk to them, and hopefully reduce everyone's costs')
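To make the hypothetical a bit more concrete, here's a rough sketch in Python (sqlite3, with made-up table and column names) of how much the host's cost for that same 'GET /data?start=0&end=100' query can hinge on one index:

    import sqlite3, time

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (key INTEGER, value TEXT)")
    # A million rows standing in for the billion in the example above.
    conn.executemany("INSERT INTO data VALUES (?, ?)",
                     ((i, f"item-{i}") for i in range(1_000_000)))

    def range_query():
        # The query behind the hypothetical 'GET /data?start=0&end=100'
        return conn.execute(
            "SELECT key, value FROM data WHERE key BETWEEN 0 AND 100").fetchall()

    t = time.perf_counter(); range_query()
    print("no index:  ", time.perf_counter() - t)   # full scan of every row

    conn.execute("CREATE INDEX idx_key ON data (key)")
    t = time.perf_counter(); range_query()
    print("with index:", time.perf_counter() - t)   # seeks roughly the 100 matching rows

Without the index the query scans every row; with it, it touches about as many rows as it returns - which is the whole difference between a query the host can afford to serve to scrapers and one it can't.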
If I was evil and the law was that scrapers had access BUT they had to pay their fair share of the cost. I would just make their costs so absurdly high that it wouldn't be beneficial to them.
Yeah, you’d almost need some way to generically price this. You’d almost want websites to opt into it. You can either accept the $x/request price or not participate in the program.
Though, I guess at that point you might as well just sell API access.
Yeah, a hard part of this is that it depends on the efficiency of the implementation by the website creators/maintainers. At least at that point you’re on the same side. You’re both trying to be fair and you’re both trying to help the website work well.
The term "hacking" long ago shifted to mean "illegal activity" unless otherwise specified or obvious from the context. In the context of this title it's clear what was meant. I'm an ancient, pedantic nerd, and even I don't cling to the archaic 1970s meaning of "hacking".
It's not that archaic; PG, among others, talks about the meaning of "hacking" [1]. Not surprising that a HN crowd might default to this interpretation.
Note, I'm talking about the title alone--from the actual article it's obvious they're concerned with legality.
I think he meant he understood the title as "if you just scrape a website, you're not [meeting the threshold to be a] hacker", which would fit with 'no true scotsman'[0]. At least that's the way I understood it at first. Whether you consider a hacker to be a criminal or a nerd doesn't really affect _this_ particular misunderstanding ;)
Was "hacking" as a lay person term ever the same as "hacking" as used by makers? If it was a shift, then yeah you're right. If it was a demonization as the public learned the term at all, then I'd say there's more reason to continue pushing to protect the term as a part of our community.
Until about the early 1980s, it exclusively referred to benign or prankish tinkering. It started to drift increasingly toward the prankish end, and then to the nefarious. Most early computer hacking—in the illegal sense—was also fairly benign (I wardialed my area and broke into the local supermarket computers to teach myself Unix and C). From the 90s on, its primary use (especially in the mainstream) has referred to illegal activity, whether gray- or black-hat. There was a concerted effort in the 80s and early 90s to use "cracking" for the illegal activities, but it never caught on.
"The Markup believes [...]" is underselling it. The 9th circuit, for one, agrees with them [1].
In the case at question here the 11th circuit ruled the other way. When circuit courts come to opposite conclusions the Supreme Court will sometimes step in to provide a ruling so that there's unified precedent. Hopefully they'll agree with the 9th circuit.
It's funny how scraping can be illegal yet Google and Microsoft and Yahoo and DuckDuckGo, and every other search engine have a monopoly on it. It's legal for them, because .....? Money money money! Google even hotlinks images for Google images... pretty hilarious stuff.
This is a tough problem because if you asked most sites/admins if you could scrape their site, they don’t really have much motivation to say yes. For some it feels like you are stealing their hard work. For others they don’t want to pay for the requests your scraper will make to their site, etc.
Regardless, in an ideal world, some polite things would be to:
- ask for permission before scraping, explaining how it’s neutral or positive for their business (so they are more likely to support your continued scraping or even potentially provide a more cost-effective format/API for you)
- scrape at a reasonable pace
- scrape off hours
- make requests with a clear user agent so they know where the requests are coming from
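For instance, a minimal sketch of the last two points using Python's requests library (the user-agent string, contact address, and delay are placeholders, not recommendations):

    import time
    import requests

    # Identify yourself and give the operator a way to reach you (placeholder values).
    HEADERS = {"User-Agent": "example-scraper/0.1 (+mailto:scraper@example.com)"}
    DELAY_SECONDS = 2  # fixed, modest pacing; tune per site

    def fetch_politely(urls):
        with requests.Session() as session:
            session.headers.update(HEADERS)
            for url in urls:
                resp = session.get(url, timeout=30)
                resp.raise_for_status()
                yield url, resp.text
                time.sleep(DELAY_SECONDS)  # keep the request rate reasonable

Scheduling that job off-peak covers the "scrape off hours" point; the rest is just keeping the pace modest and being reachable if something goes wrong.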
Start by assuming good faith from the website regarding scrapers and clearly identify yourself in your user-agent (provide a contact email or something) and do it from a single IP.
Submit a reasonable number of requests per second and ramp it up slowly depending on the response time and the size/popularity of the website (if it's a national e-commerce brand you can hit it with hundreds per second with no expected ill effects because that's a drop in the bucket compared to their overall traffic, but a very small website on shared hosting might warrant a more cautious approach).
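One rough way to do that ramp-up (a sketch only; the thresholds and user-agent string are arbitrary placeholders) is to treat slow responses and 429/503 statuses as a signal to back off:

    import time
    import requests

    def adaptive_crawl(urls, min_delay=0.5, max_delay=10.0):
        """Start cautious; shrink the delay only while responses stay healthy."""
        delay = max_delay
        session = requests.Session()
        session.headers["User-Agent"] = "example-scraper/0.1 (contact: you@example.com)"
        for url in urls:
            start = time.monotonic()
            resp = session.get(url, timeout=30)
            elapsed = time.monotonic() - start
            if elapsed > 2.0 or resp.status_code in (429, 503):
                delay = min(delay * 2, max_delay)    # server looks stressed: back off
            else:
                delay = max(delay * 0.9, min_delay)  # healthy response: ramp up gently
            yield url, resp
            time.sleep(delay)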
Of course, if they are hostile to it and don't like the "being nice" approach then fight fire with fire and try to get what you want as fast as possible while blending into their overall traffic so they can't identify nor block it. It might sometimes even involve paying real people (on Mechanical Turk or similar) to just browse the website as they normally would and scrape it manually.
Something that I rarely see mentioned: only scrape what you need. If the site provides some means of limiting what it returns to you (a particular path, category, or search), take advantage of that. If you know that the information is on a limited number of pages and you only need to acquire it once, then it is perfectly fine to download those pages manually and have your program extract the required information automatically. (That may even save you some work.)
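For the "download once, extract locally" case, something like this works (a sketch using BeautifulSoup; the directory name and CSS selector are hypothetical):

    from pathlib import Path
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    # Pages were downloaded once (manually or with a single careful pass);
    # extraction then runs entirely offline, so re-running it costs the site nothing.
    for path in Path("saved_pages").glob("*.html"):
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        for row in soup.select("table.prices tr"):  # hypothetical selector
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if cells:
                print(path.name, cells)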
Scrapy (a popular python scraping framework) has an auto-throttling feature that lets you limit concurrent requests and requests per minute for a domain. I tweak these values to what I estimate the website could handle (depending on how large it is).
The page also outlines how the throttling algorithm works, which you can use as a starting point for your own work.
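If I remember the settings correctly, the relevant knobs in a project's settings.py look roughly like this (values are illustrative, not recommendations):

    # settings.py (illustrative values)
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
    AUTOTHROTTLE_MAX_DELAY = 60.0          # ceiling when the server responds slowly
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
    AUTOTHROTTLE_DEBUG = False             # True logs every throttling decision

    # Hard caps that apply regardless of AutoThrottle
    CONCURRENT_REQUESTS_PER_DOMAIN = 2
    DOWNLOAD_DELAY = 1.0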
I might be thinking about this the wrong way, but, scrapers as described by this article are more-or-less targeted web crawlers, right? Don't we have a de-facto standard for handling this?
It is, but what I'm suggesting is that a scraper identifies itself as such in the user agent and complies with the robots exclusion standard. Google's crawlers recognise that `googlebot` - among other strings - refers to them, and a scraper should behave similarly if it doesn't want to be seen as malicious. I think scrapers should be legal, but just like crawlers, they should comply with a directive that says they aren't welcome.
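Honouring that directive is only a few lines with the Python standard library (a sketch; the user-agent string and URLs are placeholders):

    from urllib import robotparser

    USER_AGENT = "example-scraper/0.1"  # placeholder identifier

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/some/page"
    if rp.can_fetch(USER_AGENT, url):
        pass  # fetch it, sending the same User-Agent string in the request headers
    else:
        pass  # the site has said bots like ours aren't welcome on that path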
And on the topic of trivial workarounds, it's even more trivial to just ignore the robots.txt and scrape away, but that's what malicious bots performing questionable activities do. One would hope a newsbot is being civil and not behaving in a shady manner, but alas, I don't for one moment think any of these news publishers have even heard of a robots.txt