The actual problem seems to be that a large number of entities now want a full copy of the entire site.
But why not just... provide it? Charge however much for a box of hard drives containing every publicly-available tweet, mailed to the address of the buyer's choosing. Then the startups get their stupid tweets and you don't have any load problems on your servers.
What do you even charge for that? We might never make a repository of human-made content with no AI postings in it ever again. Seems like selling the golden goose to me.
Substantially higher loads than Twitter gets today were not "melting the servers" until Musk summarily fired most of the engineers, stopped paying data center (etc.) bills, and then started demanding miscellaneous code changes on tight deadlines with few if any people left who understood the consequences or how to debug resulting problems.
In other words, the root problem is incompetent management, not any technical issue.
Don't worry though, the legal system is still coming for Musk, and he will be forced to cough up the additional billions (?) he has unlawfully cheated out of a wide assortment of counterparties in violation of his various contracts. And as employee attrition continues, whatever technical problems Twitter has today will only get worse, with or without "scraping".
Scraping has a different load pattern than ordinary use because of caching. Frequently accessed data gets served out of caches and CDNs. Infrequently accessed data results in cache misses that generate (expensive) database queries. Most data is infrequently accessed but scraping accesses everything, so it's disproportionately resource intensive. Then the infrequently accessed data displaces frequently accessed data in the cache, making it even worse.
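A toy simulation of that effect (every name and number here is invented, purely to illustrate the mechanism): under a skewed "normal" workload an LRU cache absorbs most requests, but interleave a scraper walking the whole catalogue and the hit rate collapses while far more requests fall through to the origin.

```python
# Toy illustration only: an LRU cache under normal (skewed) traffic vs.
# normal traffic interleaved with a scraper walking the whole catalogue.
import random
from collections import OrderedDict

CATALOGUE = 1_000_000   # distinct items (made-up scale)
CACHE_SIZE = 10_000     # cache holds ~1% of the catalogue

def hit_rate(requests):
    cache, hits = OrderedDict(), 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # refresh LRU position
        else:
            cache[key] = True               # cache miss -> origin/DB query
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(requests)

# "Normal" traffic: a small hot set accounts for most requests.
normal = [int(random.paretovariate(1.2)) % CATALOGUE for _ in range(200_000)]
# Scraper traffic: each item requested once, in arbitrary order.
scan = random.sample(range(CATALOGUE), 200_000)
mixed = [x for pair in zip(normal, scan) for x in pair]

print("hit rate, normal traffic:      ", round(hit_rate(normal), 3))
print("hit rate, normal + scraper mix:", round(hit_rate(mixed), 3))
```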
Caches are only so large. Expanding them doesn't buy you much, and increases costs greatly.
The key benefit to a cache is that a small set of content accounts for a large set of traffic. This can be staggeringly effective with even a very limited amount of caching.
Your options are:
1. Maintain the same cache size. This means your origin servers get far more requests, and that you perform far more cache evictions. Both run "hotter" and are less efficient.
2. Increase the cache size. Problem here is that you're moving a lot of low-yield data to the cache. On average it's ... only requested once, so you're paying for far more storage, you're not reducing traffic by much (everything still has to be served from origin), and your costs just went up a lot.
3. Throttle traffic. The sensible place to do this IMO would be for traffic from the caching layer to the origin servers, and preferably for requesting clients which are making an abnormally large number of non-cached object requests (a rough sketch follows this list). Serve the legitimate traffic reasonably quickly, but trickle out cold results to high-demand clients slowly. I don't know to what extent caching systems already incorporate this, though I suspect at least some of it is implemented.
4. Provide an alternate archival interface. This is its own separately maintained and networked store, might have regulated or metered access (perhaps through an API), might also serve out specific content on a schedule (e.g., X blocks or Y timespan of data are available at specific times, perhaps over multipath protocols), to help manage caching. Alternatively, partner with a specific datacentre provider to serve the data within given facilities, reducing backbone-transit costs and limitations.
5. Drop-ship data on request. The "stationwagon full of data tapes" solution.
6. Provide access to representative samples of data. LLM AI apparently likes to eat everything it can get its hands on, but for many purposes, selectively-sampled data may be sufficient for statistical analysis, trendspotting, and even much security analysis. Random sampling is, through another lens, an unbiased method for discarding data to avoid information overload.
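For option 3, here's a rough sketch of what per-client throttling of cache misses could look like. Everything in it (names, rates, structure) is illustrative, not anything Twitter or any particular CDN actually runs: cache hits pass through untouched, while a client generating an abnormal stream of cold-object requests is slowed by a token bucket.

```python
# Rough sketch of option 3 (illustrative only): a per-client token bucket
# applied to cache misses, so ordinary, mostly-cached browsing is untouched
# while a client that keeps requesting cold objects is slowed down.
from __future__ import annotations

import time
from dataclasses import dataclass, field

@dataclass
class MissBucket:
    rate: float = 2.0       # cold-object requests allowed per second
    burst: float = 20.0     # short bursts tolerated
    tokens: float = 20.0
    updated: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, MissBucket] = {}

def fetch(client_id: str, key: str, cache: dict, origin_lookup):
    if key in cache:                        # cache hits are never throttled
        return "hit", cache[key]
    bucket = buckets.setdefault(client_id, MissBucket())
    if not bucket.allow():                  # too many cold requests: back off
        return "throttled", None
    value = origin_lookup(key)              # expensive origin/database query
    cache[key] = value
    return "miss", value
```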
Twitter feels more stable today, with less spam, than one year ago. There's of course parts that have been deliberately shut down, but that's not an argument about the core product.
Pandemic lock downs are 99% over. People are getting back outside and returning to office. These effects have little to do with Twitter's specific actions.
I see more spam these days, particularly coming from accounts that paid for the blue check mark. IIRC, Musk said that paid verification would make things better since scammers wouldn't dare pay for it (I would find where he said this but I hit the 600 tweet limit), but given how lax their verification standards are, it seems to be a boon to scammers, much the same way that Let's Encrypt let anyone get a free TLS cert at the cost of destroying the perceived legitimacy that came with having HTTPS in front of your domain.
(And IMO, that perceived legitimacy was unfounded for both HTTPS and the blue check before both were easy to get, it's just that the bar had to drop to the floor for most people to realize how little it meant.)
The "massive layoffs" was just twitter returning to the same staffing level they had in 2019, after they massively overhired in 2020-2021. This information is public, but this hasn't stopped people from building a fable around doomsday prophecies.
I mean, it’s clear that Musk overcorrected. The fact that managers were asked to name their best employees, only to then be fired and replaced by them; that Musk purposefully avoided legal obligations to pay out severance, health insurance payments (I forget the exact name), and other severance; and that the site has had multiple technical issues that make it feel like there’s no review/QA process, all show that he doesn’t know what he’s doing.
He got laughed out of a Twitter call thing with lead engineers in the industry for saying he wanted to “rewrite the entire stack” and not having a definition for what he meant.
Doomed or not, Musk is terrible at just about everything he does and Twitter is no exception
I think that’s always been known, but the tacit agreement between users and Twitter has always been “I’ll post my content and anyone can see it, if they want to engage they make an account”. From a business perspective this feels like a big negative to me for Twitter. I’ve followed several links the last few days and been prompted to login, and nothing about those links felt valuable enough to do so.
It's about $1 per thousand tweets and access to 0.3% of the total volume. I think the subscription is 50M "new" tweets each month? There are other providers who continually scrape Twitter and sell their back catalogue.
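Taking those figures at face value (they may well be off or out of date), the implied cost is easy to work out:

```python
# Back-of-envelope using the figures above (treat them as approximate):
price_per_thousand = 1.00       # dollars per 1,000 tweets
monthly_quota = 50_000_000      # "new" tweets per month in the subscription

print(f"${monthly_quota / 1_000 * price_per_thousand:,.0f} per month")  # -> $50,000
```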
Researchers are complaining that it's far too high for academic grants. Probably true, but that's no different from other obscenely priced subscriptions like access to satellite imagery (can easily be $1k for a single image which you have no right to distribute). I'm less convinced that it's impossible for them to do research with 50 million tweets a month, or with what data there is available.

Most researchers can't afford any of the AI SaaS company subscriptions anyway. Data labelling platforms - without the workers - can cost 10-20k a year. I spoke to one company that wouldn't get out of bed for a contract less than 100k. Most offer a free tier a la Matlab in the hope that students will spin out companies and then sign up. I don't have an opinion on what archival tweets should cost, but I do think it's an opportunity to explore more efficient analyses.
Honestly I think that's why reddit is closing itself up too. Everyone sitting on a website like this might be sitting on an AI training goldmine that can never be replicated.
Too little too late. Anything pre-ChatGPT is already scraped, packaged, and mirrored around the Internet; anything post-ChatGPT-launch is increasingly mixed up with LLM-generated output. And it's not as if the most recent data has any extra value: you don't need the most recent knowledge to train LLMs, and they're not good at reproducing facts anyway. Training up their "cognitive abilities" doesn't need fresh data; it just needs human-generated data.
Precisely, which brings us back around to the question: why are social media companies really doing this?
I think "AI is takin' ooor contents!" is a convenient excuse to tighten the screws further. Having a Boogeyman in the form of technology that's already under worried discussion by press and politicians is a great way to convince users how super-super-serious the problem must be, and to blow a dog whistle at other companies to indicate they should so the same.
It's no coincidence that the first two companies to do this so actively and recently are both overvalued, not profitable, and don't actually directly produce any of the content on their platforms.
I've seen that work with self-driving cars. Simulated driving data is actually better, since you can introduce black swan events that might not happen often in the real world.
Are you really sure it's legal? In theory it's no different from providing the same information via the API or website... but do people working in law think so?
Twitter purchased Gnip years ago, and it's a reseller of social media data. Companies that want all the public tweets, nicely formatted and with proper licensing, can just buy the data from Twitter directly.
I'm assuming their terms give them permission to redistribute everybody's tweets, since that's kind of the whole site. I don't know why they'd restrict themselves to doing it over the internet and not the mail, but do you have any reason to think that to be the case?
So, I'd just made that suggestion myself a few moments ago.
That said, there are concerns with data aggregation, as patterns and trends become visible which aren't clear in small-sample or live-stream (that is, available close to the time of its creation) data. And the creators of corpora such as Twitter, Facebook, YouTube, TikTok, etc., might well have reason to be concerned.
This isn't idle or uninformed. I've done data analysis in the past on what were, for the time, considered large datasets. I've been analyzing HN front-page activity for the past month or so, which is interesting. I've found it somewhat concerning when looking at individual user data, though; here, that means the submitters of front-page items. It's possible to look at patterns over time (who does and does not make submissions on specific days of the week?) or across sites (which accounts heavily contribute to specific website submissions?). In the latter case, I'd been told by someone (in the context of discussing my project) of an alt identity they have on HN, and could see that the alternate was also strongly represented among submitters of a specific site.
Yes, the information is public. Yes, anyone with a couple of days to burn downloading the front-page archive could do similar analysis. And yes, there's far more intrusive data analytics being done as we speak, at vastly greater scale and precision, yielding far deeper insights. That doesn't make me any more comfortable taking a deep dive into that space.
It's one thing to be in public amongst throngs or a crowd, with incidental encounters leaving little trace. It's another to be followed, tracked, and recorded in minute detail, and more, for that to occur for large populations. Not a hypothetical, mind, but present-day reality.
The fact that incidental conversations and sharings of experiences are now centralised, recorded, analyzed, identified, and shared amongst myriad groups with a wide range of interests is a growing concern. The notion of "publishing" used to involve a very deliberate process of crafting and memorialising a message, then distributing it through specific channels. Today we publish our lives through incidental data smog, for the most part utterly without our awareness or involvement. And often in jurisdictions and societies with few or no protections, or regard for human and civil rights, let alone a strong personal privacy tradition.
As I've said many times in many variants of this discussion, scale matters, and present scale is utterly unprecedented.
This is a legitimate concern, but whether the people doing the analysis get the data via scraping vs. a box of hard drives is pretty irrelevant to it. To actually solve it you would need the data to not be public.
One of the things you could do is reduce the granularity. So instead of showing that someone posted at 1:23:45 PM on Saturday, July 1, 2023, you show that they posted the week of June 25, 2023. Then you're not going to be doing much time of day or day of week analysis because you don't have that anymore.
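A minimal sketch of that reduction (the week-starts-on-Sunday convention is just an assumption to match the example dates):

```python
# Collapse an exact post timestamp to the Sunday starting its week, so
# time-of-day and day-of-week analysis is no longer possible on this field.
from datetime import date, datetime, timedelta

def week_of(ts: datetime) -> date:
    # weekday(): Monday=0 ... Sunday=6; step back to the preceding Sunday
    return ts.date() - timedelta(days=(ts.weekday() + 1) % 7)

print(week_of(datetime(2023, 7, 1, 13, 23, 45)))   # -> 2023-06-25
```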
Yes, once the data are out there ... it's difficult to do much.
Though I've thought for quite some time that making the trade and transaction of such data illegal might help a lot.
Otherwise ... the trap I see many people falling into is treating their online discussions amongst friends as equivalent to, say, a conversation in a public space such as a park or cafe: possibly overheard by bystanders, but not broadcast to the world.
In fact there is both a recording and distribution modality attached to online discussions that's utterly different to such spoken conversations, and those also give rise to the capability to aggregate and correlate information from many sources.
Socially, legally, psychologically, legislatively, and even technically, we're ill-equipped to deal with this.
Fuzzing and randomising data can help, but has been shown to be stubbornly prone to de-fuzzing and de-randomising, especially where it can be correlated to other signals, either unfuzzed or differently-fuzzed.
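A toy example of that de-fuzzing problem (invented numbers): if the same underlying value is released more than once with independent random noise, simply averaging the releases converges back on the true value.

```python
# Toy illustration: repeated noisy releases of the same value can be
# averaged to recover it, defeating naive fuzzing.
import random

true_value = 37                                   # some sensitive count
releases = [true_value + random.gauss(0, 10) for _ in range(200)]

print(round(releases[0], 1))                      # one fuzzed release: noisy
print(round(sum(releases) / len(releases), 1))    # average of 200: close to 37
```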
I despise Musk as much as anyone else, and charging for API access has hurt a lot of valuable use cases like improving accessibility, but … how about not massively scraping a site that doesn’t want you to?
Scraping isn’t illegal, and to be honest, I’m not even sure it’s unethical. I’m assuming you think it so — if so, why? I’m not disagreeing, but haven’t given it much thought.
Having been on Twitter mostly through the most recent prominent war, man, the signal-to-noise ratio is really low, even when being careful about who to follow and who to block. There is so much disinformation, bad takes, uninformed opinions presented as facts, pure evil, etc.
So I guess it could be used for training very specific things or cataloging the underbelly of humanity but for general human knowledge it’s a frigging cesspool.
OK, not gonna argue with that. There is, I guess, a perception that it matters because policy-makers, and the wonks and hacks that influence them, are hooked. The value for me (and ergo the public, some classic NGO thinking there for you) lies in understanding those dynamics.
I do not use the Twitters myself, and actively discourage others from doing so. Sends people bonkers.
I mean, we have found election manipulation, like large-scale inauthentic activity by out-of-staters explicitly targeting African Americans, in projects here that went as far as the perpetrators getting indicted. Other projects were tracking vaccine side-effect self-reports faster than the CDC, along with other disaster intelligence.
We were actually gearing up to switch to paid accounts as we found use cases that could subsidize these efforts... And then the starting price for reasonably small volumes shot up to like $500k/yr.
So, are we saying it's unethical for Google and other search engines who make money off of ad revenue to scrape sites like Twitter? Or are they paying a large sum to Twitter to do this?
When there is a relatively even value exchange between the two entities, I think it is ethical. People trade Google making money on ads for their site being found when people search. It is also possible to opt out.
But is it ethical for the site owner to block random people and companies on the internet from accessing _my_ data? I posted that tweet with the expectation that it would be publicly available. Now the owner of the site is breaking that expectation. I would say that this part is also unethical.
Especially since they're not moderating things or anything.
Agreed. However, it's probably covered by their terms of service.
Same thing with the recent reddit kerfuffle. I'd have much preferred a Usenet 2.0 instead of centralizing global communications in the hands of a handful of private companies with associated user-hostile incentive structures.
Being indexed by Google is optional. Twitter could stop it at any time if they thought it was a bad deal for them. That's not comparable to a startup company trying to scrape the entire site to train their AI, using sophisticated techniques to bypass protections Twitter has put in place.
Except with modern software, some wannabe genius programmer will think they can get a bunch of money or cred or whatever by infantilizing the process down to something your grandma could use. Then, suddenly, everyone is scraping. The net effect is largely the same -- server operators see an overwhelming proportion of requests from bots. Still ethical?
Yes, it is ethical. In many countries it is legal for humans to walk around the public square and overhear all conversations.
It is NOT legal to install cameras that record everyone's conversations, much less sell the laundered results.
Pre-2023 people went on Twitter with the expectation that their output would be read by humans.
A traditional search engine is different: It redirects to the original. A bastardized search engine that shows snippets is more questionable, but still miles away from the AI steal.
Many countries have freedom of panorama, which means it is legal to video record the public square. I'm not aware if anywhere has specific laws on mounting the camera on a robot.
If the background of the issue is as Musk described, then it certainly is not allowed by twitter’s robots.txt, which allows a maximum of one request per second.
I do a lot of data scraping, so I’m sympathetic to the people who want to do it, but violating the robots.txt (or other published policies) is absolutely unethical, regardless of the license of the content the service is hosting. Another way of describing an unauthorised use case taking a service offline is a denial-of-service attack, which (again, if Musk’s description of the problem is accurate) seems to be the issue Twitter was facing, with a choice between restricting services or scaling forever to meet the scrapers’ requirements.
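For what it's worth, honouring a published crawl delay from the scraping side is trivial. A rough sketch with Python's standard library (the user-agent string is a placeholder, and whether the live robots.txt actually sets a one-second delay should be checked rather than assumed):

```python
# Sketch: read robots.txt and respect its rules and crawl delay.
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("https://twitter.com/robots.txt")
rp.read()

url = "https://twitter.com/someuser"                  # example target
if rp.can_fetch("ExampleScraper", url):
    delay = rp.crawl_delay("ExampleScraper") or 1.0   # fall back to 1 s if unset
    time.sleep(delay)                                 # pause before each request
    # ... fetch url here ...
```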
Personally I would have probably tried to start with a captcha, but all this dogpiling just looks like low effort Musk hate. The prevailing sentiment on HN has become so passionately anti-Musk that it’s hard to view any criticism of him or Twitter here with any credibility.
This isn't going to make them stop either. Musk is about to see a spike in account creations using the method of lowest resistance. I expect "sign in with Apple" will disappear as an option soon, given its requirement to support "hide my email", which makes it trivial to create multiple Twitter profiles from one Apple ID.
And yet people do. Predicting how various people will react, including scammers, bots, scrapers, and whatnot, is, like, the job of management at a company like this.
He holds views that were the progressive norm 15 years ago but are now considered bigoted, and that is deemed unacceptable today. There's a lot I don't agree with him on, like Ukraine, but "despise" is a word I reserve for the likes of Putin.
I don’t think Putin is the epitome of evil that the West portrays him to be either. War is hell, and he surely started the larger-scale war, but just remember that as a Western citizen you’ve probably been introduced to less than 1% of his side of things. The Western world has gone to war many, many times in history for lesser reasons.
What do you know about Putin’s motives? What propaganda do you think you’re under?
You’re probably smart enough to understand that, out of spite and regret over your country’s history with the Russians, your countrymen have more motivation than most to judge the Russian efforts without any further investigation into the matter.
The same applies to myself, since I’m Finnish. It’s almost sad to see how people abandon all reason and critical thinking because of some ingrained belief that “Russia bad”. All of my knowledge of human nature leads me to believe that they’re no more bad than the next people, and that they probably have motives for entering a taxing war that we don’t really understand here in the West, seeing as the first casualty in war is the truth.
>Yeah, he started one of the deadliest wars of the 21st century and threatens to destroy the entire planet with nuclear weapons, but he is not that evil because there were other wars started by the West
I’ll rephrase your argument for you: “Why don’t you listen to the rapist’s opinion? The victim is surely not blameless. Besides, your cousin is a shoplifter”.
Would his side of the story matter to you? I don’t think it’s a particularly nuanced point to you since you’ve already made up your mind, however ignorant it might be.
Putin already gave his side of the story. He declared Ukraine an invalid country, said there were nazis there and then went into full out war to destroy the country while committing countless atrocities.
>I don’t think it’s a particularly nuanced point to you
What point and why do you keep saying 'nuance' over and over while giving zero actual information? What are you trying to say and what evidence is there?
Let's think about this super hard. What is the justification for an unprovoked genocidal war? Why are you defending Putin?
>however ignorant it might be.
Show me where you get your information; let's see the source of this nonsense.
1. As already mentioned, it's hardly a nuanced point.
2. If you actually want to hear my opinion, then the realm of geopolitics plus good old-fashioned hatred of the US government does a number on people's logic, so we get what I can only charitably describe as a parade of non-sequiturs, whataboutisms, and other fallacies. And so it can be useful to frame it in simpler terms: for example, you could hardly find anyone even on this site who would condone the forced takeover of part of someone's home. Literally the same thing is happening at the scale of countries.
API rate limits are more easily enforceable. If they keep scraping, there are methods to detect and thwart that behaviour. I don't think Twitter has the appropriate talent and work environment to allow proper solutions to be implemented; it's all knee-jerk reaction to whatever Elon decides.
It's more easily enforced, except that when you don't give them enough, they just go back to scraping. Or they create a million fake developer accounts and pool the free quota, if that's possible. These are not hypotheticals; loads of companies have done both against all kinds of APIs over the years, Twitter's included.
But they were too stingy with the tiers and too greedy with their prices. Even for minor use cases where you need to make, say, 100 API calls a day, you’ll need to pay $100/month.
I'm not going to pay $100 just to fetch 3000 records for a hobby project. I'll either skip the project, or I'll just abuse my scraping tool.
If they'd made some more reasonable pricing tiers, I would have been happy to pay.
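For scale, using the numbers mentioned above (the exact tier details may differ): 100 calls a day is about 3,000 a month, so even fully used, that plan works out to a few cents per call.

```python
# Back-of-envelope from the figures above (tier details may be out of date):
monthly_price = 100.00       # dollars per month for the cheapest paid tier
calls_per_month = 100 * 30   # ~100 API calls a day

print(f"{monthly_price / calls_per_month * 100:.1f} cents per call")  # ~3.3
```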
Fetching something as simple as a total follower count from an API shouldn't be (exorbitantly) more expensive than fetching data from, say, GPT-4. No reasonable person can make an argument for 10¢/call pricing.
Did you actually read that comment? I think the point is very clear: given a reasonable price, people would want to use the API instead of scraping the data themselves. If you instead ask for an exorbitant amount of money, it only forces people to scrape, because there is no business model that would make it possible to pay.