Amazon has a way to scrape GitHub and feed its AI model (dataconomy.com)
65 points by doubtfuluser 7 months ago | hide | past | favorite | 59 comments



Is it git pull?

>> "In response, Amazon proposed a workaround: encouraging its employees to create multiple GitHub accounts and share their access credentials."

Ah, no, it's git pool.


Ethically Microsoft has about as much claim to be able to use the data for co-pilot as anyone else.

On the other hand, maybe a MSFT v Amazon lawsuit over this could be the wake up call the world needs that maybe we should stop centralising critical infrastructure in the hands of a single company. Which is why I think they wouldn't do it - at most I could see Microsoft tightening request limits on accounts associated with Amazon.


> maybe we should stop centralising critical infrastructure in the hands of a single company

Managing your own on-prem or in-colo infrastructure sucks: it's expensive and a source of risk, which is why we moved things like source servers to a centralized model.


Well, do that right when distributed computing finds a workable model.


Yeah, I guess building a distributed version control system is basically an intractable problem.


GitHub offers more than version control. A decentralized version is about as realistic as a decentralized Facebook.


The GitHub user-base cares more about decentralization than the Facebook user-base. Maybe there is hope.


Right, but the past decade has shown distributed models are incompatible with public discovery and use.


There are solutions out there though. Mostly it's a lack of - financial - incentive to standardize and iterate on them IMO. But GitLab at least is currently working on adding ActivityPub support.


Of course - I was being facetious. But I disagree with your assessment of how realistic it is. We already have it on a small scale today in the form of some projects using their own GitLab/Gitea instances. If GitHub were to enshittify, I expect we would see a push for this from a lot more communities.

I don't think it's that much of a stretch to design a system that keeps, for instance, issue, wiki, and PR metadata in git alongside code. This could then support simple import/export between instances. You could also support cross-instance forking and PR's.
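A minimal sketch of that idea (the directory layout and field names here are made up for illustration): store each issue as a JSON file inside the repo itself, so ordinary clone/fetch/push already moves the metadata between instances.

```python
import json
from pathlib import Path

META_DIR = "meta/issues"  # hypothetical layout; a real design would need to standardize this


def write_issue(repo: Path, number: int, title: str, body: str, state: str = "open") -> Path:
    """Store an issue as a JSON file inside the repo, so plain git sync carries it."""
    issue_dir = repo / META_DIR
    issue_dir.mkdir(parents=True, exist_ok=True)
    path = issue_dir / f"{number}.json"
    path.write_text(json.dumps(
        {"number": number, "title": title, "body": body, "state": state},
        indent=2,
    ))
    return path


def list_issues(repo: Path) -> list[dict]:
    """Read issues back; another instance would do the same after a fetch."""
    issue_dir = repo / META_DIR
    if not issue_dir.exists():
        return []
    return [json.loads(p.read_text()) for p in sorted(issue_dir.glob("*.json"))]
```

Since the files live in history like any other tracked content, cross-instance forks and merges of issue state come along for free (modulo conflict resolution, which is the hard part).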

The biggest problems I think you would still have are 1. third-party integrations and 2. abuse/spam prevention. Having been the system-owner for GitHub at a large engineering org before, I can say that for us, switching away would have been virtually impossible because of all the integrations we would have to replace. But, this is a consequence of the centre of gravity being Github currently and not an immutable law of nature.

As for 2, well I expect that'll remain one of the hard unsolved problems of computer science for the time being.


>I was being facetious

You may want to re-read Poe's Law[0].

Poe's law technically refers to sarcasm but tends to cover facetious comments as well.

[0] https://en.m.wikipedia.org/wiki/Poe%27s_law


I'm surprised Amazon's legal team signed off on this. It's clearly against the GitHub terms of service[0], and Amazon employees acting on instructions from Amazon had to approve those terms. It seems pretty much identical to the LinkedIn vs. hiQ scraping case, where, as I understand it, the fake account creation was the key point.

[0] E.g. no API key sharing for the purposes of evading rate limits, only a single free account per person or organization.


When you pay your legal teams as much as Amazon's, they probably tell you "Yeah, you'd probably lose any case, but the fine will be a couple of million dollars and you won't have to pay it for a decade, and by then you'd have cemented your market leadership".


What if they're not free accounts?


Is the cover image itself generated via some ML model? The old guy in the middle is missing substantial parts of his arm. The box right by him also has some artifacting in the corner.


No, this depicts exactly the nightmarish nature of a job at an Amazon warehouse.


Yes, the Amazon brand arrow at centre top is also broken. In fact, all of the people look wrong in some way.


Yeah. It is likely an edited AI image. You can confirm by looking at the text on the box at the top left. "BMGOMa"


and the guy on the right.. umm.. what's with his face? Or is he an alien maybe? Image credit goes to https://linkmedya.com - it doesn't say it is AI-generated content, but yep, it certainly looks like it


"Featured image credit: Eray Eliaçık/Bing"


And the "Bing" just links to Bing's Dalle3 functionality


This just rekindled my desire to self-host my git repos. The whole idea that a platform provider can use the IP I host there is obscene. That thieves steal by bounty from each other is not the story.


Separate from the courts, Microsoft could send a message to the AI gold rush field, about "abuse of Microsoft's resources", via ToS:

* All Amazon domain names could be banned from accounts on GitHub, or face annoying restrictions, implemented with trivial technical changes. And lawyers could send a letter to Amazon legal, about how Amazon may and may not use GitHub, including Amazon personnel having to disclose their affiliation (not hide it with GMail), and craft some language about how those employee accounts may and may not be used.

* More harshly, but fear-instilling to individuals throughout industry, the individuals who let their accounts be used for the scraping could be banned from GitHub, for ToS violation. Not only those particular accounts, but any accounts the individuals might use. (This would hurt, not only for genuine open source participation, but also given how open source is sometimes used for job-hunting appearances, and all the current employers that ask for candidate's "GitHub" specifically rather than open source in general.) If banning would have undesired effects of projects GitHub wants to host being pulled, or public reaction as too harsh and questioning why GitHub has so much power, there could instead be annoying restrictions.


> the individuals who let their accounts be used for the scraping could be banned from GitHub, for ToS violation.

That would work, assuming GH doesn't make mistakes and ban someone else with the same name. That would then be embarrassing for GH. I can already see the news headline: "GitHub banned my account because my name matches that of a web scraping account from Amazon"


Microsoft could sabotage Amazon's AI model by returning poisoned code to accounts registered with @amazon.com email addresses.


The way git works means that you can check that you have an un-doctored clone of a repo just by checking that the commit hash matches. Which in this instance is quite unfortunate, because it would be very funny.

(barring a SHA-1 collision, of course)

EDIT: I suppose another approach could be to invent poisoned repos out of whole cloth and only show them to Amazon, but I suspect that'd be even easier to detect.
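The integrity check described above falls out of how git names objects: every object ID is a SHA-1 over the object's content plus a small header, and commits hash their tree and parent IDs, so a single matching commit hash vouches for the entire history. A sketch of the simplest case, the blob:

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Replicate git's blob hashing: SHA-1 over a 'blob <size>\\0' header plus the content."""
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


# Same ID that `git hash-object` would print for this content;
# change a single byte and the ID no longer matches.
print(git_blob_id(b"hello\n"))
```

Trees hash the blob IDs they contain, and commits hash the tree ID, so any doctored file would change every hash above it in the chain (barring the SHA-1 collision the parent comment mentions).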


Language in this article smells like it's written or rewritten by AI.


Agree. Looks like we have a good ear: https://youtu.be/zbo6SdyWGns?t=78


Microsoft is probably one of the few companies that can sue Amazon without worrying about retaliation from Amazon.

For example, GitLab would need to think twice before suing because they offer deployment on AWS.


Can anyone share a Fermi estimation of the size of poison-pill training data required to impact code interpreter models? (of the size that AMZN might be building with this data)

I expect it would vary by language/platform popularity (size of available training code). Is it infeasible to create or generate enough code, pushed to enough repositories, to impact the correctness of a model that includes the code in its training data set?


MS only provides the infra; everything else is others' hard work under the trojan horse of open source or whatever. If they introduce limits, it's time to leave GitHub. This will evolve into an Elsevier vs. researchers kind of situation.


This article doesn’t make any sense. Why would Amazon make their employees do all this when they can easily pay for a service like crawlbase or similar and easily scrape github without having to create employee accounts?


If GitHub cared enough about this, it would have already sued Amazon. I don't think the author needs to worry about any of this.


MSFT's LinkedIn scraping was also a thing about 10 years ago until the magic method was taken away. :'(


You can still scrape Linkedin today, can you not?


No, I don't think so. Not without an account and not completely as was possible in the past.


They should make this data available for everyone on AWS.


I couldn't care less about these huge tech companies stealing from one another. Let them sue themselves to extinction.


How is it even stealing? It's taking copies of public information, mostly under open source licences.

Amazon is causing a bit of extra server load for MS to handle.


> its taking copies of public information

Yet the same companies will be first to tell you that scraping their public information is against ToS or even illegal.

See the whole drama about LinkedIn scraping, etc.


It's simple: AI companies are allowed to scrape whatever they want, but if you scrape an AI company's data then you are a copyright terrorist and you will never see the light of heaven.


This double standard from Amazon actually predates AI by decades. They scrape their e-commerce competition but don't want anybody to scrape them back.


I have clients who have done it. The last one was a service for brands who wanted to see how their products were placed and described on Amazon.


> a bit of extra server load

One of my sites has been spammed by scrapers (Bytedance's Bytespider, Googlebot, Bingbot) several thousand times within just an hour, to the point of making it break. They do this without notification or asking for consent of the users creating the content they ingest and possibly use to train AI models with, and also without credit or compensation. I think the world needs strict regulation against this kind of parasitic, likely illegal behavior.


1994 called and wants your opinion on (hot)linking.

If you email them, they'll usually respond quickly and stop. Otherwise, what's stopping you from blocking / rate limiting their ranges?

I set them up over a decade ago, but I still have some honeypot crawler traps just to keep their cores busy ingesting junk, with autoupdating rules for some of my domains. If they're not obligated to ask, I'm not obligated to give them anything useful.
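A toy version of the honeypot idea (the bot list and page shape here are made up): match known scraper user agents and feed them deterministic pages of self-referential junk links, so an impolite crawler spends its crawl budget going nowhere.

```python
import random

# Hypothetical trap list; a real deployment would maintain this from logs.
BOT_UAS = ("Bytespider", "GPTBot", "ClaudeBot")


def looks_like_bot(user_agent: str) -> bool:
    """Crude substring match against known scraper user agents."""
    return any(ua.lower() in user_agent.lower() for ua in BOT_UAS)


def junk_page(seed: int, n_links: int = 25) -> str:
    """Deterministic page of junk links back into the trap, cheap to generate."""
    rng = random.Random(seed)
    links = "".join(
        f'<a href="/trap/{rng.randrange(10**9)}">page</a>\n' for _ in range(n_links)
    )
    return f"<html><body>\n{links}</body></html>"
```

The seed would come from the request path, so the same trap URL always renders the same page and the crawler can't detect the trap by re-fetching.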


I think that is very different from cloning repos from Github in accordance with the licences.

I agree crawling sites to the extent it causes problems is a problem.

Googlebot and Bingbot do follow robots.txt and respect HTTP 429 responses and usually have reasonable default crawl rates.
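For reference, a minimal robots.txt along those lines (note that Crawl-delay is nonstandard - Bing honors it, Google ignores it - and none of this binds a scraper faking a legitimate UA):

```
# Block one crawler outright, slow down the rest.
User-agent: Bytespider
Disallow: /

User-agent: *
Crawl-delay: 10
```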

Is it possible that these are scrapers using fake UA strings?


> Is it possible that these are scrapers using fake UA strings?

I now believe this is the case



Did they ignore your robots.txt file?


I changed it a few days ago, still getting hit by over 14800 requests within 24 hours


From google etc? Sounds like an abuse case


19010 requests from user agents "Googlebot" and "bingbot" combined within the last 24h. Bytespider has died down to 74. 17 from Claudebot. But I've heard from a HN user that Bytedance just change their user agent when they are blocked. I'm blocking all of them with Cloudflare.


You do know Microsoft has also used all private repositories to train its models, right? Especially for Copilot


Yes, and I do not like that. On the other hand, I do not have any private GitHub repos of my own, and it's a good reason for not having any.

I am beginning to feel that if you really want repos to be private you should probably self host.


What is being scraped is not GitHub's data. It's other people's.


Exactly. I permissively license my code, but not because I want to improve mega-corp’s bottom line. I’m annoyed.

I felt exactly like this when I learned that some of my Goodwill donations - the good stuff - are marked up and sold online, instead of going to low-income folks at low-income prices. It might be even worse, given that the capability they are building intends to compete with me directly as a developer. It's like if Goodwill started funding domestic terrorists or the local burglars' union.


Disappointing that a large mega-corp does the exact same thing broke developers do to get around rate limits.


This is a large mega corp that prides itself on acting like it is broke.


Except they have access to one of the largest "botnets" on the planet.



