Just about two years ago, a long-dormant project surged back to life, becoming one of the best crawlers out there. Zimit was originally made to scrape MediaWiki sites, but it can now crawl literally anything into an offline archive file. I have been trying to grab the data I am guessing will soon be put under much stricter anti-scraping controls, and I am not the only one hoarding data. The winds are blowing towards a closed internet faster than I have ever seen in my life. Get it while you can.
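If you want to try it yourself, a rough sketch of a Zimit run via Docker looks something like this. The image path and flag names here are from memory and have changed between releases, so treat them as assumptions and check the openzim/zimit docs:

    # Rough sketch: crawl a site into a ZIM archive with Zimit via Docker.
    # Flag names (--seeds vs. --url, --name) differ between releases; verify
    # against the current openzim/zimit documentation before relying on this.
    docker run -v $PWD/output:/output ghcr.io/openzim/zimit zimit \
      --seeds https://example.org \
      --name example_archive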
The winds may be changing, but those who don't fear resorting to piracy will always be sailing smoothly. We will just have more walled gardens and more illegal offerings, leaving normal people stranded and thirsty.
Unfortunately, things like a 500 GB dump of the sacred-texts website are not often found on torrent trackers or other 'warez' sites. Anna's is pretty great for written material that has been published offline, but even the Wayback Machine and archive.org have limited full scrapes aside from the ones published by the Kiwix team.
Such megadumps are only really interesting for people who train LLMs anyway. There is no way you could ever consume that much for personal development. Individual sources are much more prevalent and useful to everyday life.
Curious what people think is an appropriate request rate for crawling a website. I have seen many examples where the author will spin up N machines with M threads and just hammer a server until it starts returning more than a certain failure rate.
I have never done anything serious, but have always tried to keep my hit rate fairly modest.
I don't do much crawling or scraping either, but when I have, I go to the opposite extreme. There's no reason I need the data right that second, so I set it up to pause for a random number of seconds between requests.
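A minimal sketch of that approach in Python; the URLs, user agent, and delay bounds are made up for illustration:

    import random
    import time
    import urllib.request

    # Hypothetical list of pages; in practice this would come from a crawl frontier.
    urls = [
        "https://example.org/page1",
        "https://example.org/page2",
    ]

    for url in urls:
        req = urllib.request.Request(url, headers={"User-Agent": "polite-mirror-bot"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = resp.read()
        # ... save `body` somewhere here ...
        # Sleep a random 5-30 seconds so the server never sees a burst of requests.
        time.sleep(random.uniform(5, 30))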
*Except YouTube, I'll yt-dl an entire playlist, no probs*
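Even then you can throttle it; here's a sketch using yt-dlp's Python API (the option names are from memory, so double-check them against the yt-dlp docs):

    import yt_dlp  # pip install yt-dlp

    # Hypothetical playlist URL; the sleep/rate options keep the download polite.
    opts = {
        "ratelimit": 1_000_000,    # cap download speed at roughly 1 MB/s
        "sleep_interval": 5,       # wait at least 5 seconds between videos
        "max_sleep_interval": 15,  # ...and at most 15 seconds
    }

    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download(["https://www.youtube.com/playlist?list=EXAMPLE"])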
One of the best use cases for "serverless" functions like AWS Lambda is easily proxying web-crawling requests from the comfort of your codebase. To the developer, it's just a function in a file, but it runs in isolation from a random IP address on each invocation - independent of app state. Like Puppeteer, one of Big Tech's little gifts to indie hackers.
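A minimal sketch of such a Lambda in Python; the payload shape and handler name are assumptions, and a real deployment still needs IAM, retries, and error handling:

    import urllib.request

    def handler(event, context):
        """Hypothetical Lambda handler that fetches one URL from the payload.

        Each invocation typically egresses from a different IP, though AWS
        makes no guarantee about rotation.
        """
        url = event["url"]  # assumed payload shape: {"url": "https://..."}
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            return {
                "status": resp.status,
                "body": resp.read().decode("utf-8", errors="replace"),
            }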
It's part of the job at this point - it's public data, not like I'm stealing credit cards. Are you a more ethical developer because you "try to keep the hit rate low"?
Can I pat myself on the back for only ever having 1 browser tab open at a time? Doesn't my minimalist take on browsing the web "keep my hit rate low" - I think I deserve a tax credit for helping the environment too, wdyt?
Would it make you feel better if I said the pages I'm scraping are also hosted on AWS? That is - I'm effectively paying for the data by paying for lambdas. Or is it the poor hardware itself you worry for?
I think one other side effect of this is the increasing restrictions on VPN usage for accessing big websites, pushing users towards logging in or using (mobile) apps instead. Recent examples include X, Reddit, and more recently YouTube, which has started blocking VPNs.
I'm also concerned that free and open APIs might become a thing of the past as more "AI-driven" web scrapers/crawlers begin to overwhelm them.
I think it is helpful for public mirrors to be made, in case the original files are lost or the original server is temporarily down, or if you want to access the content without hitting the same servers all the time (including when you have no internet connection at all but do have a local copy); you can make your own copies from mirrors, and from those too, and so on. Cryptographic hashing can be used as well, to check that a copy matches by a code much shorter than the entire file. However, mirrors should not make an excessive number of access attempts, so I block everything in robots.txt (though it is fine if someone wants to use curl to download files, or clone a repository, and then mirror them; I also mirror some of my own stuff on GitHub). What I do not want others to do is claim that I wrote something I did not actually write, or claim copyright on something I wrote in a way that restricts its use by others.
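As a sketch of the hash-checking idea mentioned above, here is how you might compare a mirrored file against a published digest in Python (the file name and digest are placeholders):

    import hashlib

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        """Hash a file in chunks so large mirrored dumps don't need to fit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compare the local copy against the digest published alongside the original.
    expected = "..."  # the published sha256 hex digest goes here
    if sha256_of("mirror/archive.zim") == expected:
        print("mirror matches the original")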
Nonetheless, dark actors are still going to get through and sell this data later.
There are probably tons of those crawlers, with the intent to later launder that data, probably using LLMs to change the content just enough to fall outside of copyright.
Another way of stating it: now that worthless text and images can be made to be worth money, people no longer want to have public websites and are trying to change the longstanding culture of an open, remixable web to fit their new paranoia over not getting their cut.
I welcome all user agents and have been delighted to see OpenAI, Anthropic, Microsoft, and other AI-related crawlers mirroring my public website(s), just the same as when I see Googlebot or Bingbot or Firefox-using humans. For information that I don't want to be public, I simply don't put it on public websites.
Another way of stating it: how dare all those greedy web publishers want to eat! /s
The first couple decades of the web were built on an implicit promise: publishers could put their content out on freely accessible websites, and search engines would direct traffic to those sites. It was a mutually beneficial arrangement: publishers got traffic that they could monetize in different ways if they saw fit (even hosting ads from the search engines a la AdSense), and Google et al. could earn mountains in AdWords revenue. I'm not saying it was always a fair tradeoff on both sides, but both sides had an incentive to share content openly.
AI breaks that model. Publishers create all the content, but then the big search engines and AI companies can answer user questions without giving any reference at all to the original sites (companies like Google are providing source citations, but I guarantee the click-through rates go WAY down). This breakdown in the web's economic model has been happening for a while even before AI (e.g. with Google hosting more and more "informational content" directly on SERP pages - see https://news.ycombinator.com/item?id=24105465), but with AI the breakdown is really complete.
So no, I don't fault people at all who don't want all the fruits of their labor to be sucked up by trillion-dollar megacorps.
That’s their expectation, but not the original design. The design of HTTP is that, if you run a web server, you are willing to send content to anyone who sends you an HTTP request. Anyone putting content online with a system like that should expect the data to go out to as many people as request it.
Alternative systems, like paywalls with prominent terms and conditions, exist to protect copyrighted content. These publishers intentionally avoid them. So they’re using a system designed for unrestricted distribution to publish content they hope people will use in restricted ways. I think the default, reasonable expectation should be that anyone can scrape public data, at least for personal use.
> they’re using a system designed for unrestricted distribution to publish content they hope people will use in restricted ways
HTTP is not designed for unrestricted distribution - if that were the case, we wouldn't have the HTTP status codes 401 (Unauthorized), 402 (Payment Required), and 403 (Forbidden).
I don't really disagree with you that, if I understand you correctly, it's not really possible to make web content open for some purposes but not for others, and I commented as much here: https://news.ycombinator.com/item?id=41418603.
But there is currently a lot of content that is published openly solely because the pre-AI economic model of the web made it viable. That economic model is now going away, so you'll see a lot more content put behind paywalls (or just not published at all) because AI means that it no longer makes sense for some publishers to spend time and money to produce content they can't get a return on.
I activated the Block AI Scrapers and Crawlers feature on Cloudflare months ago. When I checked the event logs, I was surprised by the persistent attempts from GPTBot, PetalBot, Amazonbot, and PerplexityBot to crawl my site. They were making multiple requests per hour.
Considering my blog's niche focus on A/V production with Linux, I can only imagine how much more frequent their crawling would be on more popular websites.
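For comparison, the robots.txt equivalent would look something like the following, using the user-agent strings those bots advertise, though nothing forces a crawler to honor it:

    User-agent: GPTBot
    Disallow: /

    User-agent: PetalBot
    Disallow: /

    User-agent: Amazonbot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /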
I feel like talking about robots.txt in this context is kind of a pointless enterprise given how there's no guarantee it will be followed by crawlers (and TFA fully acknowledges this). Before AI, there was a mutually (not necessarily equal, but mutual) beneficial economic arrangement where websites published open content freely, and search engines indexed that content. That arrangement fundamentally no longer exists, and we can't pretend it's coming back. The end game of this is more and stronger paywalls (and not ones easily bypassed by incognito mode), and I think that's inevitable.
> It’s also the case that preferences shouldn’t be respected in all cases. For instance, I don’t think that academics or journalists doing prosocial research should necessarily be foreclosed from accessing data with machines that is already public, on websites that anyone could go visit themselves.
Screw that. Research is about ethical consent. Also, this is very much "well, everyone should let me (a researcher) access whatever I like."
This is a shortsighted view. There are plenty of counterpoints, from things like documenting slaughterhouse practices to conservatives implementing policies while simultaneously limiting research into the outcomes of those policies.
Obviously most of the 'AI researchers' right now are not altruistic, but it is possible to take the position that advancing AI will be sufficiently valuable to society that it overrides corporate preferences against bulk scraping.
What a senseless, selfish, petty thing to attempt. To not only deprive systems ultimately benefitting humanity of training content for no possible tangible benefit to yourself, but to try to sabotage them because you think it's theft for a computer to learn from patterns in an image no differently from a human artist (who will often outright trace existing works to train their neural nets, but it's ok when they do it because uh... uh, humans are special!).
> What a senseless, selfish, petty thing to attempt.
I would apply those adjectives to the companies (or researchers out for accolades and citations) who attempt to build their systems based on the non-consensual data extraction in the first place.
> To not only deprive systems ultimately benefitting humanity ...
By and large, these systems are built to finance the lush early retirements of the founders, investors and high-level ICs of these companies -- not to benefit humanity. Whether they even benefit humanity as a side effect is very much open to question.
If these companies won't pay, or even condescend to ask permission for access - fuck 'em.
No one needs consent to learn from public information. If an AI can look at your drawing and code, then later draw a correct elbow and make the right API call for someone, they have benefitted without taking anything away from you. To try to stop this process because you think you deserve royalties and consent every time someone uses a fact they learned from you would be absurdly entitled.
> If an AI can look at your drawing and code, then later draw a correct elbow and make the right API call for someone, they have benefitted without taking anything away from you.
This is like saying if a publishing company takes excerpts from your work, and builds products from it -- without your permission, and of course without paying you royalties of any kind -- they have benefitted without taking anything away from you.
The tech companies call this "learning" of course, but that's just subterfuge.
> What a senseless, selfish, petty thing to attempt
This is clearly not selfishness. Selfishness is believing you're entitled to other people's work for free, which is the implicit assumption made by people claiming that they should be able to train their models on your work.
Selfishness is trying to take other people's work and turn it into an AI model without compensating them in return. This demonstrates extreme entitlement, a lack of a consistent moral system, and a fundamental ignorance of basic economics.
> no differently from a human artist
Yes, humans are objectively different from computers, including neural networks. Humans are sentient, and existing AI is not. It's not even accurate to say that humans have "neural nets" in their brains, because human cognition is not understood.
The AI poisoning technique is very moral - if someone doesn't explicitly contact me and ask me for permission to train on my data, then they absolutely deserve a corrupt model. I think I'll do this.
Entitlement is expecting a royalty from someone who saw your publicly posted art and learned anatomy from it. Selfishness is attempting to sabotage it so the learner is misled into learning wrong anatomy. Lack of a consistent moral system is believing these are variably wrong or right to do when it's a meat neural net or a digital neural net in question. And finally, thinking any such loom-stomping tantrum will meaningfully halt progress and let you keep art creation in exactly the same state as the past is a fundamental ignorance of economics.
Your comment demonstrates a complete lack of both reading comprehension and basic logic, as every single claim made was already either refuted by me, or isn't a logical argument at all.
> Entitlement is expecting a royalty from someone who saw your publicly posted art and learned anatomy from it.
As I already pointed out, it's extremely clear that AIs are not humans, nor are they remotely comparable to them (objectively, from both a physical and a functional perspective, and morally, from the perspective of the vast majority of people holding wildly different moral systems, both inconsistent and consistent), so this is completely irrelevant, and additionally not what I claimed (also known as a "strawman fallacy").
Additionally, it is objectively not entitlement to expect payment from someone who consumes your content. If someone posts a learning resource as a paid course on a platform like Udemy, then paying them is morally required - you have to obey the license under which the creator makes the content accessible. Furthermore, the vast majority of content posted publicly on the internet is not posted under the assumption that it will be used as AI training data - if someone publishes their work for other humans to view for free, that does not extend to a license for AI developers to train on it for free, which is exactly what the article is about.
> Selfishness is attempting to sabotage it so the learner is misled into learning wrong anatomy.
See above.
> Lack of a consistent moral system is believing these are variably wrong or right to do when it's a meat neural net or a digital neural net in question.
Objectively incorrect. I can describe a moral system that's non-arbitrary and consistent and perfectly delineates the boundaries between humans and artificial intelligences, and the rights ascribed to both. You cannot. (if you think you can, you're welcome to put it here, and I'll show you why it's inconsistent and/or arbitrary)
Further demonstrating the issues with reading comprehension, you missed/ignored the statement that I made that "It's not even accurate to say that humans have "neural nets" in their brains, because human cognition is not understood," which neuters your claims.
> And finally, thinking any such loom-stomping tantrum
The only person making emotionally manipulative fallacies is you. I've articulated my points logically - you've made multiple fallacies, logical mistakes, and your entire first comment was emotional pleading without a shred of logic or reason.
> will meaningfully halt progress and let you keep art creation in exactly the same state as the past
Yet another strawman argument and/or reading comprehension failure. I never claimed that it was possible, necessary, or desirable to keep art creation in the same state. You really didn't read my comment before replying.
> is a fundamental ignorance of economics
...and so this isn't valid. However, I can point out the mistake that you initially made - you believe it's economically viable for those training AI to steal the work of artists and use it to replace them. It isn't.
Your entire comment reads like an AI response that attempted to mirror the structure of mine without any of the understanding, or the ability to make coherent arguments.
Very interesting article. I'm guessing this will remain a cat-and-mouse game for a while.
However, any such moves will increase the cost of training on unlicensed internet data. This will shift the balance slightly closer towards AI companies licensing data as opposed to using scraped data. Emphasis on slightly.
I wonder how long until something like Copilot+, but also with the ability to leak these images into the training data, is released by somebody (probably not by MS, they appear to be trying to maintain some appearance of not being entirely malware authors).
I wonder if “affirming the consequent” is still ok (not that you’ve done so, your post just brought it to mind).
Affirming the consequent is always an interesting one to me when it overlaps with a tragedy of the commons situation.
If publicly available content can be used in training sets, any one model willing to use them will have an advantage. Is it reasonable to assume then that all publicly available content is in the most popular training sets, or is that falling into the affirming the consequent fallacy?
There's nothing stopping anyone here from coming to your public website and downloading your images for use in a training set. If someone did that, why would they be posting online where they pulled data from?
It wouldn't be the most outrageous or illegal thing they have done. In 2022 a Kenyan data labeling firm suddenly cut ties mid-contract because OpenAI was asking them to gather and label CSAM images.[0]
>In February 2022, Sama and OpenAI’s relationship briefly deepened, only to falter. That month, Sama began pilot work for a separate project for OpenAI: collecting sexual and violent images—some of them illegal under U.S. law—to deliver to OpenAI. The work of labeling images appears to be unrelated to ChatGPT. In a statement, an OpenAI spokesperson did not specify the purpose of the images the company sought from Sama, but said labeling harmful images was “a necessary step” in making its AI tools safer. (OpenAI also builds image-generation technology.) In February, according to one billing document reviewed by TIME, Sama delivered OpenAI a sample batch of 1,400 images. Some of those images were categorized as “C4”—OpenAI’s internal label denoting child sexual abuse—according to the document. Also included in the batch were “C3” images (including bestiality, rape, and sexual slavery,) and “V3” images depicting graphic detail of death, violence or serious physical injury, according to the billing document. OpenAI paid Sama a total of $787.50 for collecting the images, the document shows.
>Within weeks, Sama had canceled all its work for OpenAI—eight months earlier than agreed in the contracts. The outsourcing company said in a statement that its agreement to collect images for OpenAI did not include any reference to illegal content, and it was only after the work had begun that OpenAI sent “additional instructions” referring to “some illegal categories.” “The East Africa team raised concerns to our executives right away. Sama immediately ended the image classification pilot and gave notice that we would cancel all remaining [projects] with OpenAI,” a Sama spokesperson said.