Considering the kind of scraping and selling of private data that LinkedIn has been chronically guilty of (and not just the ordinary "growth hack" stuff: "LinkedIn violated data protection by using 18M email addresses of non-members to buy targeted ads on Facebook" [1]), it's satisfying to see LinkedIn lose this.
This is a theme I've seen several times: something like "Music lyric site X sues Google for embedding their lyrics directly in the results," which is funny because site X got the lyrics by scraping them from other sites.
Plus, Google only exists because it scrapes content, yet I believe its TOS includes "don't scrape our content".
I find it really funny that scrapers are battling scrapers: guys, you only exist because you do THE EXACT SAME THING.
Regardless, there is legitimate value in the collection, cleaning, interlinking, and presentation of existing data. How the law interprets that is one thing, but the mere fact that the data came from a variety of other public/private sources doesn't mean it derived all of its value externally.
For sure, but they shouldn't be hypocritical about it. If they don't consider themselves content parasites, they shouldn't consider people scraping their site to be content parasites, either. (Some sites really are just parasites, though.)
There's nothing hypocritical about it. Googlebot respects the robots.txt of the sites it crawls, and Google in turn expects its own robots.txt to be respected. What's the issue?
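To make that concrete, here's a minimal sketch of how a compliant crawler consults robots.txt before fetching anything, using only Python's standard library. The directives in the example are illustrative, not copied from Google's actual robots.txt:

```python
# Minimal sketch of a compliant crawler honoring robots.txt, using only
# the Python standard library. The directives below are illustrative,
# not copied from any real site's robots.txt.
from urllib.robotparser import RobotFileParser

ILLUSTRATIVE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ILLUSTRATIVE_ROBOTS_TXT.splitlines())

# A well-behaved bot checks before every request.
print(parser.can_fetch("Googlebot", "/search?q=anything"))  # True
print(parser.can_fetch("RandomBot", "/search?q=anything"))  # False
print(parser.can_fetch("RandomBot", "/about"))              # True
```

A site that wants to welcome one crawler and refuse all others only has to publish directives like these.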
Can I politely point out that the conversation is not about respecting robots.txt?
If you want to talk about this in terms of robots.txt: Google thrives on the fact that other companies don't block Googlebot in their robots.txt, but at the same time Google blocks scrapers from its own content in its robots.txt.
> If you want to talk about this in terms of robots.txt: Google thrives on the fact that other companies don't block Googlebot in their robots.txt, but at the same time Google blocks scrapers from its own content in its robots.txt.
It seems like you're stating this as though to cast some sort of moral aspersion. I don't get it. If other companies don't want Googlebot to scrape them, they just have to say so. Most companies want Googlebot to scrape their content. Google doesn't want other people's scrapers to scrape Google's content. Nobody involved in any of this has done anything unreasonable or morally objectionable.
> Plus, Google only exists because it scrapes content, yet I believe its TOS includes "don't scrape our content".
Yes. This is EXTREMELY frustrating.
Of all companies to prevent scraping, Google doing so is the most ironic.
Especially since their stated goal is to organize the world's information, it shocks me that there's no machine-to-machine way to access this organized information.
I think it’s important to distinguish between types of barriers to entry: some are “real” while others are “artificial”. For example, a real barrier to entry would be institutional knowledge about an industry, while an artificial one would be an arbitrary TOS clause.
And disallowing scraping, or making it difficult while refraining from providing an API for the same data, is the arbitrary kind. The default state of the web is that it's trivially scrapable; you have to go out of your way to make it harder.
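To illustrate just how low that default barrier is, here's a minimal sketch that pulls every link out of a page using nothing but Python's standard library (https://example.com is a stand-in URL):

```python
# Minimal sketch: extracting every link from a page with nothing but
# the Python standard library. https://example.com is a stand-in target.
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

page = urlopen("https://example.com").read().decode("utf-8", errors="replace")
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)
```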
[1] https://techcrunch.com/2018/11/24/linkedin-ireland-data-prot...