Sorry I don’t buy it. Hundreds of millions of people use Twitter, and we are to ...

dbbk · on July 1, 2023

Not only that but unauthenticated access is the easiest thing to cache. There is no need to "bring large numbers of servers online". He's lying.

gibrown · on July 1, 2023

A bot scraping content will tend to go deep into the archives and hit all content systematically. Caching isn't as effective if you hit everything, whereas real users tend to hit the same content over and over again.

It can add nontrivial load.
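To make the caching point concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: an LRU cache, a 1/rank popularity curve for real users, and made-up sizes (10,000 items, a cache holding 1,000). It says nothing about how Twitter's caches actually work; it only shows why a systematic crawl can defeat a cache that serves skewed human traffic well.

```python
import random
from collections import OrderedDict


def hit_rate(requests, capacity):
    """Replay a request stream through an LRU cache; return the hit fraction."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in requests:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / total if total else 0.0


def user_traffic(n_items, n_requests, rng):
    """Assumed human pattern: item popularity falls off as 1/rank."""
    weights = [1.0 / (rank + 1) for rank in range(n_items)]
    return rng.choices(range(n_items), weights=weights, k=n_requests)


def scraper_traffic(n_items, n_requests):
    """Assumed scraper pattern: walk every item in order, then start over."""
    return [i % n_items for i in range(n_requests)]


rng = random.Random(0)  # fixed seed so runs are repeatable
N_ITEMS, CAPACITY, N_REQUESTS = 10_000, 1_000, 50_000

users = hit_rate(user_traffic(N_ITEMS, N_REQUESTS, rng), CAPACITY)
scraper = hit_rate(scraper_traffic(N_ITEMS, N_REQUESTS), CAPACITY)
print(f"user hit rate:    {users:.2f}")
print(f"scraper hit rate: {scraper:.2f}")
```

With these assumptions the sequential crawl never hits the cache, because each item is evicted long before the crawl comes back around to it, while the skewed user traffic is mostly served from cache.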

sebzim4500 · on July 1, 2023

They could, but signing in, even in Selenium, means agreeing to Twitter's ToS. See the LinkedIn scraping case.

smcl · on July 1, 2023

The same way these AI code completion tools respected GPL-licensed code?

sebzim4500 · on July 1, 2023

You don't generally need to accept licenses in order to scrape something, only if you want to distribute it.

The legal ambiguity comes from the question of whether LLM outputs are a derivative work of the training data. I expect that they aren't, but anything can happen.

naasking · on July 1, 2023

> Hundreds of millions of people use Twitter, and we are to understand that there are enough people scraping to the extent that they had to suddenly take drastic action by shuttering unauthenticated access

Suppose 1 million people are accessing Twitter at any given time. An actual person might only be making 1 request / second. That's 1 million requests / second.

Suppose there are 100 AI companies scraping Twitter. A bot like this can make thousands to tens of thousands of requests per second. At the high end (100 × 10,000), that's an additional million requests / second.

There are probably more than 100 "AI" companies now, trying to train their own bespoke LLMs. They're popping up like weeds so I can totally see Twitter's load doubling or tripling recently. So sorry, I just don't get the skepticism. Sure it could be a cover for something else, but his actual stated reason seems totally possible.
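For anyone who wants to play with the numbers, the back-of-envelope estimate above can be written out as a small sketch. The inputs are the guesses from this comment (1 million concurrent users at 1 request/second, 100 scrapers at 1,000 to 10,000 requests/second each), not measured figures, and the function name is made up for illustration.

```python
def load_multiplier(humans, human_rps, scrapers, scraper_rps):
    """Total load relative to human-only load, all in requests per second."""
    human_load = humans * human_rps
    scraper_load = scrapers * scraper_rps
    return (human_load + scraper_load) / human_load


# Guessed inputs from the comment, not real Twitter numbers.
HUMANS = 1_000_000
HUMAN_RPS = 1

for scraper_rps in (1_000, 10_000):
    m = load_multiplier(HUMANS, HUMAN_RPS, 100, scraper_rps)
    print(f"100 scrapers at {scraper_rps:>6} req/s each -> {m:.1f}x load")

# Tripling would need about 200 scrapers at the high-end rate.
print(f"{load_multiplier(HUMANS, HUMAN_RPS, 200, 10_000):.1f}x")
```

So under these guesses the doubling only happens at the high end of the per-bot rate; at 1,000 requests/second per scraper the extra load is closer to 10%.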

saltminer · on July 1, 2023

> A bot like this can make thousands to tens of thousands of requests per second.

You don't need a bot to do this. Twitter literally did this to themselves through their own buggy code: https://sfba.social/@sysop408/110639474671754723

If Silicon Valley was still being produced, this would make for a great episode.

jaggirs · on July 1, 2023

Yeah, no, you can't just 'use selenium'. To keep the same scraping volume you might need thousands of accounts and 10x the compute.

smcl · on July 1, 2023

It’s not a little “use selenium” switch you can click, but it absolutely is an option (and there are others) if the barrier is simply to have an authenticated account and be logged in.

If these data scraping operations are as sophisticated and determined as he claims, this measure is insufficient, and it actually hurts Twitter far more than it helps. Case in point: we stopped sharing Twitter links because when you click them in most iOS apps it opens up an unauthenticated web view and presents you with a login screen. So we just collectively decided “ah ok no sharing Twitter” and moved on.

I’m sure there are companies scraping Twitter. I just don’t buy that it’s as big of an issue as he claims it is, or that preventing people from viewing tweets without logging in is a way to mitigate it (personally, I’d first look at banning problematic IP addresses).

To me it’s either:

1) a very poor and very temporary mitigation against scraping, that could be bypassed with a bit of effort

2) an experiment in optimising metrics - Musk sees lots of unauthenticated users consuming Twitter, tries to steer them into signing up

3) it’s all just a big mistake

Option #2 makes the most sense to me, but frankly none of them are good
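As a rough idea of what "banning problematic IP addresses" could look like in practice, here is a minimal sliding-window sketch. The class name, thresholds, and the idea of flagging on raw request count are all assumptions for illustration; a real system would have to deal with shared IPs, proxies, and rotating addresses, which this ignores.

```python
from collections import defaultdict, deque


class IpRateTracker:
    """Hypothetical helper: flag IPs exceeding max_requests per window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, now):
        """Record a request at time `now` (seconds).

        Returns True if the IP is over the limit within the current window.
        """
        timestamps = self._hits[ip]
        timestamps.append(now)
        cutoff = now - self.window_seconds
        # Drop requests that have fallen out of the window.
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()
        return len(timestamps) > self.max_requests


tracker = IpRateTracker(max_requests=3, window_seconds=1.0)
for t in (0.0, 0.1, 0.2, 0.3):
    flagged = tracker.record("203.0.113.7", t)  # documentation-range IP
    print(t, flagged)
```

The fourth request inside the one-second window is the one that gets flagged; what to do with a flagged address (block, throttle, challenge) is a separate policy decision.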