Amazon has a way to scrape GitHub and feed its AI model (dataconomy.com)
65 points by doubtfuluser 7 months ago | hide | past | favorite | 59 comments



Is it git pull?

>> "In response, Amazon proposed a workaround: encouraging its employees to create multiple GitHub accounts and share their access credentials."

Ah, no, it's git pool.


Ethically Microsoft has about as much claim to be able to use the data for co-pilot as anyone else.

On the other hand, maybe a MSFT v Amazon lawsuit over this could be the wake up call the world needs that maybe we should stop centralising critical infrastructure in the hands of a single company. Which is why I think they wouldn't do it - at most I could see Microsoft tightening request limits on accounts associated with Amazon.


> maybe we should stop centralising critical infrastructure in the hands of a single company

Managing your own on-prem or in-colo infrastructure sucks: it's expensive and a source of risk, which is why we moved things like source servers to a centralized model.


Well, do that right when distributed computing finds a workable model.


Yeah, I guess building a distributed version control system is basically an intractable problem.


GitHub offers more than version control. A decentralized version is about as realistic as a decentralized Facebook.


The GitHub user-base cares more about decentralization than the Facebook user-base. Maybe there is hope.


Right, but the past decade has shown distributed models are incompatible with public discovery and use.


There are solutions out there though. Mostly it's a lack of - financial - incentive to standardize and iterate on them IMO. But GitLab at least is currently working on adding ActivityPub support.


Of course - I was being facetious. But I disagree with your assessment of how realistic it is. We already have it on a small scale today in the form of some projects using their own GitLab/Gitea instances. If GitHub were to enshittify, I expect we would see a push for this from a lot more communities.

I don't think it's that much of a stretch to design a system that keeps, for instance, issue, wiki, and PR metadata in git alongside code. This could then support simple import/export between instances. You could also support cross-instance forking and PR's.
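A minimal sketch of that idea (the directory layout and field names here are made up for illustration): store each issue as a JSON file inside the repo itself, so ordinary clone/fetch/push already moves the metadata between instances.

```python
import json
from pathlib import Path

META_DIR = "meta/issues"  # hypothetical layout; a real design would need to standardize this


def write_issue(repo: Path, number: int, title: str, body: str, state: str = "open") -> Path:
    """Store an issue as a JSON file inside the repo, so plain git sync carries it."""
    issue_dir = repo / META_DIR
    issue_dir.mkdir(parents=True, exist_ok=True)
    path = issue_dir / f"{number}.json"
    path.write_text(json.dumps(
        {"number": number, "title": title, "body": body, "state": state},
        indent=2,
    ))
    return path


def list_issues(repo: Path) -> list[dict]:
    """Read issues back; another instance would do the same after a fetch."""
    issue_dir = repo / META_DIR
    if not issue_dir.exists():
        return []
    return [json.loads(p.read_text()) for p in sorted(issue_dir.glob("*.json"))]
```

Since the files live in history like any other tracked content, cross-instance forks and merges of issue state come along for free (modulo conflict resolution, which is the hard part).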

The biggest problems I think you would still have are 1. third-party integrations and 2. abuse/spam prevention. Having been the system-owner for GitHub at a large engineering org before, I can say that for us, switching away would have been virtually impossible because of all the integrations we would have to replace. But, this is a consequence of the centre of gravity being Github currently and not an immutable law of nature.

As for 2, well I expect that'll remain one of the hard unsolved problems of computer science for the time being.


>I was being facetious

You may want to re-read Poe's Law[0].

Poe's law technically refers to sarcasm but tends to cover facetious comments as well.

[0] https://en.m.wikipedia.org/wiki/Poe%27s_law


I'm surprised Amazon's legal team signed off on this. It's clearly against the GitHub terms of service[0], and Amazon employees acting on instructions from Amazon had to approve those terms. It seems pretty much identical to the LinkedIn vs. hiQ scraping case, where, as I understand it, the fake account creation was the key point.

[0] E.g. no API key sharing for the purposes of evading rate limits, only a single free account per person or organization.


When you pay your legal teams as much as Amazon's, they probably tell you "Yeah, you'd probably lose any case, but the fine will be a couple of million dollars and you won't have to pay it for a decade, and by then you'd have cemented your market leadership".


What if they're not free accounts?


Is the cover image itself generated via some ML model? The old guy in the middle is missing substantial parts of his arm. The box right by him also has some artifacting in the corner.


No, this depicts exactly the nightmarish nature of a job at an Amazon warehouse.


Yes, the Amazon brand arrow at centre top is also broken. In fact, all of the people look wrong in some way.


Yeah. It is likely an edited AI image. You can confirm by looking at the text on the box at the top left. "BMGOMa"


and the guy on the right.. umm.. what's with his face? Or is he an alien maybe? Image credit goes to https://linkmedya.com - it doesn't say it is AI-generated content, but yep, it certainly looks like it


"Featured image credit: Eray Eliaçık/Bing"


And the "Bing" just links to Bing's Dalle3 functionality


This just rekindled my desire to self-host my git repos. The whole idea that a platform provider can use the IP I host there is obscene. That thieves steal by bounty from each other is not the story.


Separate from the courts, Microsoft could send a message to the AI gold rush field, about "abuse of Microsoft's resources", via ToS:

* All Amazon domain names could be banned from accounts on GitHub, or face annoying restrictions, implemented with trivial technical changes. And lawyers could send a letter to Amazon legal, about how Amazon may and may not use GitHub, including Amazon personnel having to disclose their affiliation (not hide it with GMail), and craft some language about how those employee accounts may and may not be used.

* More harshly, but fear-instilling to individuals throughout industry, the individuals who let their accounts be used for the scraping could be banned from GitHub, for ToS violation. Not only those particular accounts, but any accounts the individuals might use. (This would hurt, not only for genuine open source participation, but also given how open source is sometimes used for job-hunting appearances, and all the current employers that ask for candidate's "GitHub" specifically rather than open source in general.) If banning would have undesired effects of projects GitHub wants to host being pulled, or public reaction as too harsh and questioning why GitHub has so much power, there could instead be annoying restrictions.


> the individuals who let their accounts be used for the scraping could be banned from GitHub, for ToS violation.

That would work, assuming GH doesn't make mistakes and ban someone else with the same name. That would then be embarrassing for GH. I can already see the news headline: "GitHub banned my account because my name matches that of a web scraping account from Amazon"


Microsoft could sabotage Amazon's AI model by returning poisoned code to accounts registered with @amazon.com email addresses.


The way git works means that you can check that you have an un-doctored clone of a repo just by checking that the commit hash matches. Which in this instance is quite unfortunate, because it would be very funny.

(barring a SHA-1 collision, of course)

EDIT: I suppose another approach could be to invent poisoned repos out of whole cloth and only show them to Amazon, but I suspect that'd be even easier to detect.
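The integrity check described above falls out of how git names objects: every object ID is a SHA-1 over the object's content plus a small header, and commits hash their tree and parent IDs, so a single matching commit hash vouches for the entire history. A sketch of the simplest case, the blob:

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Replicate git's blob hashing: SHA-1 over a 'blob <size>\\0' header plus the content."""
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


# Same ID that `git hash-object` would print for this content;
# change a single byte and the ID no longer matches.
print(git_blob_id(b"hello\n"))
```

Trees hash the blob IDs they contain, and commits hash the tree ID, so any doctored file would change every hash above it in the chain (barring the SHA-1 collision the parent comment mentions).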


Language in this article smells like it's written or rewritten by AI.


Agree. Looks like we have a good ear: https://youtu.be/zbo6SdyWGns?t=78


Microsoft is probably one of the few companies that can sue Amazon without worrying about retaliation from Amazon.

For example, GitLab would need to think twice before suing because they offer deployment on AWS.


Can anyone share a Fermi estimation of the size of poison-pill training data required to impact code interpreter models? (of the size that AMZN might be building with this data)

I expect it would vary by language/platform popularity (size of available training code). Is it infeasible to create or generate enough code, pushed to enough repositories, to impact the correctness of a model that includes the code in its training data set?


MS only provides the infra; everything else is others' hard work under the trojan horse of open source or whatever. If they introduce limits, it's time to leave GitHub. This will evolve into an Elsevier vs. researchers kind of situation.


This article doesn’t make any sense. Why would Amazon make their employees do all this when they can easily pay for a service like crawlbase or similar and easily scrape github without having to create employee accounts?


If GitHub cared enough about this, it would have already sued Amazon. I don't think the author needs to worry about any of this.


MSFT's LinkedIn scraping was also a thing about 10 years ago until the magic method was taken away. :'(


You can still scrape Linkedin today, can you not?


No, I don't think so. Not without an account and not completely as was possible in the past.


They should make this data available for everyone on AWS.


I couldn't care less about these huge tech companies stealing from one another. Let them sue themselves to extinction.


How is it even stealing? It's taking copies of public information, mostly under open source licences.

Amazon is causing a bit of extra server load for MS to handle.


> its taking copies of public information

Yet the same companies will be first to tell you that scraping their public information is against ToS or even illegal.

See the whole drama about LinkedIn scraping, etc.


It's simple: AI companies are allowed to scrape whatever they want, but if you scrape an AI company's data then you are a copyright terrorist and you will never see the light of heaven.


This double standard from Amazon actually predates AI by decades. They scrape their e-commerce competition but don't want anybody to scrape them back.


I have clients who have done it. The last one was a service for brands who wanted to see how their products were placed and described on Amazon.


> a bit of extra server load

One of my sites has been spammed by scrapers (Bytedance's Bytespider, Googlebot, Bingbot) several thousand times within just an hour, to the point of making it break. They do this without notification or asking for consent of the users creating the content they ingest and possibly use to train AI models with, and also without credit or compensation. I think the world needs strict regulation against this kind of parasitic, likely illegal behavior.


1994 called and wants your opinion on (hot)linking.

If you email them, they'll usually respond quickly and stop. Otherwise, what's stopping you from blocking / rate limiting their ranges?

I set them up over a decade ago, but I still have some honeypot crawler traps just to keep their cores busy ingesting junk, with autoupdating rules for some of my domains. If they're not obligated to ask, I'm not obligated to give them anything useful.
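A toy version of the honeypot idea (the bot list and page shape here are made up): match known scraper user agents and feed them deterministic pages of self-referential junk links, so an impolite crawler spends its crawl budget going nowhere.

```python
import random

# Hypothetical trap list; a real deployment would maintain this from logs.
BOT_UAS = ("Bytespider", "GPTBot", "ClaudeBot")


def looks_like_bot(user_agent: str) -> bool:
    """Crude substring match against known scraper user agents."""
    return any(ua.lower() in user_agent.lower() for ua in BOT_UAS)


def junk_page(seed: int, n_links: int = 25) -> str:
    """Deterministic page of junk links back into the trap, cheap to generate."""
    rng = random.Random(seed)
    links = "".join(
        f'<a href="/trap/{rng.randrange(10**9)}">page</a>\n' for _ in range(n_links)
    )
    return f"<html><body>\n{links}</body></html>"
```

The seed would come from the request path, so the same trap URL always renders the same page and the crawler can't detect the trap by re-fetching.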


I think that is very different from cloning repos from Github in accordance with the licences.

I agree crawling sites to the extent it causes problems is a problem.

Googlebot and Bingbot do follow robots.txt and respect HTTP 429 responses and usually have reasonable default crawl rates.
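For reference, a minimal robots.txt along those lines (note that Crawl-delay is nonstandard - Bing honors it, Google ignores it - and none of this binds a scraper faking a legitimate UA):

```
# Block one crawler outright, slow down the rest.
User-agent: Bytespider
Disallow: /

User-agent: *
Crawl-delay: 10
```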

Is it possible that these are scrapers using fake UA strings?


> Is it possible that these are scrapers using fake UA strings?

I now believe this is the case



Did they ignore your robots.txt file?


I changed it a few days ago, still getting hit by over 14800 requests within 24 hours


From google etc? Sounds like an abuse case


19010 requests from user agents "Googlebot" and "bingbot" combined within the last 24h. Bytespider has died down to 74. 17 from Claudebot. But I've heard from a HN user that Bytedance just change their user agent when they are blocked. I'm blocking all of them with Cloudflare.


You do know Microsoft has also used all private repositories to train its models, right? Especially for Copilot


Yes, and I do not like that. On the other hand, I do not have any private GitHub repos of my own, and it's a good reason for not having any.

I am beginning to feel that if you really want repos to be private you should probably self host.


What is being scraped is not GitHub's data. It's other people's.


Exactly. I permissively license my code, but not because I want to improve mega-corp’s bottom line. I’m annoyed.

I felt exactly like this when I learned that some of my Goodwill donations - the good stuff - are marked up and sold online, instead of going to low-income folks at low-income prices. It might be even worse, given that the capability they are building intends to compete with me directly as a developer. It's like if Goodwill started funding domestic terrorists or the local burglars' union.


Disappointing that a large mega-corp does the exact same thing broke developers do to get around rate limits.


This is a large mega corp that prides itself on acting like it is broke.


Except they have access to one of the largest "botnets" on the planet.



