Personal and social information of 1.2B people discovered in data leak (dataviper.io)
1439 points by bencollier49 on Nov 22, 2019 | 419 comments



I was at an Elasticsearch meetup yesterday where we had a good laugh about several similar scandals in Germany recently involving completely unprotected Elasticsearch running on a public IP address without a firewall (e.g. https://www.golem.de/news/elasticsearch-datenleak-bei-conrad..., in German). This beats any of that.

Out of the box it does not even bind to a public internet address. Somebody configured this to 'fix' that and then went on to make sure the thing was reachable from the public internet on a non-standard port, which on most OSes would require you to disable the firewall or open a port. The ES manual section for network settings is pretty clear about this, with a nice warning at the top: "Never expose an unprotected node to the public internet."
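The default versus the dangerous override look roughly like this in elasticsearch.yml (a sketch; exact defaults have shifted between ES versions):

```yaml
# elasticsearch.yml
# Default behaviour: bind to loopback only, unreachable from outside.
#network.host: 127.0.0.1

# What somebody has to deliberately write to expose the node:
network.host: 0.0.0.0   # bind to all interfaces, including public ones
http.port: 9200
```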

Giving read access is one thing. I bet this thing also happily processes curl -X DELETE "http://<ip>:9200/*" (deletes all indices). Does it count as a data breach when a member of the general public cleans up your mess like that?
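For illustration only, that request can be built with nothing but the Python standard library. The IP is a hypothetical TEST-NET address; the sketch deliberately constructs the request without sending it, since actually sending it to a live unsecured node would wipe every index:

```python
from urllib.request import Request

host = "203.0.113.5"  # hypothetical exposed node (TEST-NET address)

# Build, but do NOT send, the index-wiping request an open node
# would happily accept from anyone on the internet.
req = Request(f"http://{host}:9200/*", method="DELETE")
print(req.get_method(), req.full_url)
```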

In any case, Elasticsearch is a bit of a victim of its own success here and may need to act to protect users against their own stupidity, since clearly masses of people who arguably should not be making technical decisions now find it easy enough to fire up an Elasticsearch server and put some data in it (given the number of companies that seem to be getting caught with their pants down).

It's indeed really easy to set up. But setting it up like this still requires RTFMing, dismissing the warning above, and having some clue about what IP addresses and ports are, and why having a database with full read-write access on a public IP & port is a spectacularly bad idea.


I've been using ES off and on since before 1.0 came out. It has always baffled me that ES doesn't require a username and password by default.

ES is a database that has to exist on a network to be usable. Heck, it expects that you have multiple nodes, and will complain if you don't. So one of the first things you do is expose it to the network so you can use it.

Yes, it takes some serious incompetence to not realize you need to secure your network, but why in the world would you not add basic authentication into ES from the start? I'd never design a tool like a database without including authentication.

I am serious about my question. Could anyone clue me in?


It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only. Running things on a public ip address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).

Running MySQL or Postgres on a public IP address would be equally stupid and irresponsible, regardless of the useless default password that many people never change, unless you also set up TLS properly (which would require knowing what you are doing with e.g. certificates). The security in those products is simply not designed for being exposed on a public IP address over a non-TLS connection. Pretending otherwise would be a mistake. Having basic authentication in Elasticsearch would be the pointless equivalent: base64-encoded plaintext passwords (i.e. basic authentication over HTTP) are not a form of security worth bothering with. Which is why they never did this. It would be a false sense of security.
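To make the "base64 is not security" point concrete, here is what HTTP basic auth actually puts on the wire (hypothetical credentials; Python used for brevity):

```python
import base64

# What a client sends over plain HTTP for basic auth:
creds = b"elastic:s3cret"  # hypothetical username:password
header = "Basic " + base64.b64encode(creds).decode()
print("Authorization:", header)

# Anyone who can see the traffic reverses it instantly:
print(base64.b64decode(header.split()[1]).decode())  # -> elastic:s3cret
```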

At some point you just have to call out people for being utter morons. The blame is on them, 100%. The only deficiency here is their poor decision making. Going "meh, HTTP, public IP, no password, what could possibly go wrong?! Let's just upload the entirety of LinkedIn to that." That level of incompetence, negligence, and indifference is inexcusable. I bet MS/LinkedIn is considering legal action against the individuals and companies involved. IMHO they'd be well within their rights to sue these people into bankruptcy.


Software should be secure by default. Don't blame the user.

MySQL, in comparison, won't even let you install without setting a root password. And it only listens on localhost/unix-socket by default. Then you need to explicitly add another user if you want to allow logins from a non-local IP. I don't think it's even possible to both set a blank root password and allow logins from a public IP.

So you really think the solution is to blame some low-level worker, and sue him/her? The blame should always be on the people in charge, usually the CEO, who set the bar for engineering practices, proper training, etc., or the lack thereof.


While I don't think blaming labor is constructive or ethical, it seems like most tools pose danger to users in proportion to utility. For example, cars can squish people, electricity can fry people, and power tools can remove limbs.

Typically, people start out using knives and bicycles as children, learn through experience that crashing and getting cut hurt, and carry those lessons forward when they start using tablesaws and cars later in life. How does this apply to elasticsearch? I have no idea.


We could teach our children that software is very dangerous, especially databases. Or we could make software secure by default. But we also need to teach the user how to use the software properly. Learning by getting hurt is effective, but then we also need to have playgrounds.


That MySQL stuff is all quite recent... up until 5.7 (?, one of the most recent releases, anyway) there's no root password by default and running `mysql_secure_installation` is a common (but not mandatory) step to, well, secure the installation and set a root password. I think MariaDB still works this way? Not sure.

I'm not aware of "bind to localhost" being the default, either. The skip-networking setting to only allow local socket connections is definitely not the default, and I'm pretty sure the default is still to bind to all interfaces.
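For reference, the restrictive setup looks something like this in my.cnf (a sketch; packaged defaults differ between versions and distros, which is exactly the point here):

```ini
[mysqld]
# Accept TCP connections from the local machine only:
bind-address = 127.0.0.1

# Or go further and disable TCP entirely (unix socket only):
# skip-networking = ON
```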


I installed MySQL a couple of months ago on an Ubuntu server, and got asked to set a root password. I've also installed MySQL many times on Windows. Secure install is the default. And it doesn't annoy me one bit. I like my software to be secure by default.


This is ridiculous.

Software should be built in whatever way delivers maximum value to its users. A trade-off in favor of usability can be made in certain cases, like ease of use for new software. Redis went through this debate a while ago: http://antirez.com/news/96.

Engineers should know their tools before using them. It's a huge part of our jobs. You could introduce a ton of other vulnerabilities into software: XSS, SQL injection, insecure cryptography. Security is part of our job, and a matter we must know.

You don't blame a plane for a pilot mistake that was meant to be part of his training. Engineers in every other sector are responsible for their mistakes, we should be too.

Also, you don't sue the worker, you sue the company.


"Software should be built in the best method of delivering maximum value to its users."

Yes, and defaulting to insecure, thus repeatedly causing huge data breaches, is the exact opposite of delivering maximum value to users. It's delivering maximum liability.


I would argue that the single command to begin using the application and the ease of onboarding / querying data was a huge factor in expanding its usage. Elastic optimized for initial spin-up and getting things running fast. It works really well! Until you load it full of data on a public IP, that is.


That single command to spin up the application could easily generate and show a copyable random secret required to use it, so that spinning it up stays easy but there's no option to run it that insecurely.
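A minimal sketch of that idea in Python, assuming the server would then refuse all requests until this generated token is presented:

```python
import secrets

# First-run bootstrap: generate a one-time secret, show it once,
# and require it on every subsequent request.
token = secrets.token_urlsafe(24)  # 24 random bytes, URL-safe encoding
print(f"Generated access token (copy this now): {token}")
```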


Onions. You need layers and defense in depth, because even the best humans make mistakes and it is inhuman to assume perfection. Never rely on just one engineering feature.


> You don't blame a plane for a pilot mistake that was meant to be part of his training

Did you miss that Boeing is right now risking bankruptcy for doing exactly this?


Honestly a lot of the problem is: people aren’t studying systems engineering OR security. Look at all the “learn to code in 21 days” BS and all the code academies.

There’s so much emphasis on abstracting away the systems with cloud-this and elastic-that and developers don’t know much about general systems engineering.

My recommendation to software developers: take the Network+ and Security+ exams at the bare minimum.

Honestly, as much as people complain about process getting in the way of things, there should be checks and balances at any business that deals with personal information. Financial institutions are heavily regulated; these fkers should be held accountable.


> "Engineers"

Maybe the hint is right there in your comment. Nearly all the people deploying these nodes aren't engineers in the slightest, despite someone having given them such a title.


It's not always engineers that use them.

Sometimes software managers have the sudden need to show statistics and other things.

Yeah, that was fun...


If security is so important, why should we accept database developers who don't understand that?


Because... they dance the devops dance with their devop hats on! Security problems can be swiftly danced around until they actually surface, and can then be handled in the next round of "continuous delivery". It's also smart to postpone solving most issues until after they occur, so sales can continue bragging about "continuous improvement".


So, after some thought, here's why I don't consider it pointless to have basic auth built in.

It would keep ES from being completely open. If you wanted to get in, you'd have to compromise some part of the network that would let you read the username and password.

The way it is now, anyone can do a scan for port 9200 and get full access right away.
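That scan is trivial to automate; a port check is a few lines of stdlib Python (a sketch, to illustrate what attackers run across whole IP ranges):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Attackers run exactly this kind of check for port 9200 across
# address ranges, then probe every hit for an unauthenticated node.
```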

It is also important to have a username and password, even on secured networks. My test instance is on an internal network, and protected by both network and host firewalls, but I still make sure to secure it beyond that.

Basic auth would not provide a false sense of security. It is simply a very basic part of overall security. Not having it is a mistake.


> At some point you just have to call out people for being utter morons. The blame is on them, 100%. [...]

Your attitude is a symptom of a broader issue that plagues this industry: indifference to risk*probability. If you don't ship software with "secure defaults" (depending on the threat/attack model), you essentially are handing out loaded shotguns, then blaming the "dumb" user when they inevitably point one at their foot and pull the trigger. Easy solution: don't hand out the gun loaded -- make the user perform specific actions to enable the usage. Yeah, it creates some friction for first-time deployment, but that's a secondary concern next to having your freaking DB leaking all over the place.


But ES doesn't hand over a loaded gun. Someone went out of their way to load the gun up.


Bullshit.

If firing up a piece of software creates an unauthenticated, unprotected (non-TLS) endpoint to read-write data, that's a loaded gun. That is PRECISELY the default behavior of ES.

ES has jacked around for years by making TLS and other standard security features premium. To that, I say this: Screw ES and their bullshit business model. Their business model is a leading cause to dumbasses dumping extremely sensitive PII data into a DB that is unprotected - those same folks aren't going to go the extra mile to secure the DB, either by licensing or 3rd party bolt-ons.

Thus why it must be shipped secure by default. Anything less is a professional felony, in my eyes. Also, screw ES again, in case I wasn't clear.


Is it a secondary concern, though? As a startup, uptake is as vital as oxygen.


Tort law is going to catch up to software soon enough and people will be held accountable for negligently creating or deploying software that they should have known would cause harm.

The fact that someone else down the chain should have known better is not a perfect defense. If that misuse was foreseeable and you didn’t do enough to prevent or discourage it, then you can still be held liable.


If startups prioritize their growth over the good of society, isn't the logical conclusion that startups are a threat to society?


They're not a startup.


maybe. but there's always this....

http://www.team.net/mjb/hawg.html


There's something called defense in depth.

Even with ES deployed in an environment with proper network firewall rules...etc, I'd still want some sort of authentication/RBAC


"Defense in depth" sounds, to me, like a phrase to justify multiple layers of imperfect security.

A single layer of cloth might not hold water, adding more layers of cloth may hold water for longer, but it's probably more cost effective to start with the right material.


> "Defense in depth" sounds, to me, like a phrase to justify multiple layers of imperfect security.

That’s absolutely correct! But you seem to be missing the fact that _all_ layers of security are always imperfect.


This is a fallacy of distributed systems. Never trust the network. Best case you get packets destined for somewhere else, worst case your supposedly segmented network wasn't actually segmented.


I agree with GP here. ES is to blame here. Not long ago, Apache Airflow had a similar vulnerability discovered around not having sensible authentication defaults. The reasoning on their mailing list was eerily similar to those defending ES here. Same arguments (IIRC).

History is our greatest teacher. I think ES will end up doing what that team did: they agreed to provide sensible & secure defaults.


Security in depth. If I compromise one part of your network, I shouldn't compromise it all.


PostgreSQL does the following things by default to prevent this:

    1. Only listen to localhost and unix sockets
    2. Not generate any default passwords
So the only way to connect to a default configured fresh installation of PostgreSQL is via UNIX sockets as the postgres unix user. Where PostgreSQL is lacking is that it is a bit more work than it should be to use SSL.
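Concretely, those defaults look roughly like this (paraphrased; distro packages such as Debian's ship "peer" auth, while a bare upstream initdb may differ):

```
# postgresql.conf
listen_addresses = 'localhost'   # TCP only from the local machine

# pg_hba.conf
# "peer" auth: connect over the unix socket as the matching OS user,
# so a fresh install has no password to leak in the first place.
local   all   all   peer
```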


> It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only.

Have you ever heard of the end-to-end principle, IPv6, or number 4 of the eight fallacies? http://nighthacks.com/jag/res/Fallacies.html


> It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only. Running things on a public ip address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).

I've met at least one cloud provider in the past (small Dutch thing) that provides _only_ public IP addresses. They do have customers, though one fewer now. Clustering over the public Internet is a thing. It shouldn't be, but I could say the same about this website, and yet here we are.


Heroku does the same in non-enterprise tiers. Their databases are accessible by the public internet with no option to limit it to your own dynos.


Well, let's agree it's a sad thing. Very sad.


Oh sure, but sad things happen. And they can be even messier: I had a Jenkins instance "made" public because a sysadmin new to a hosting provider forgot to remove the public IP that gets automatically assigned to new things. We were lucky, being fairly sure nothing found it before I realised, but it was a strong lesson learned:

Any network may become public by accident unless you go to great lengths to make sure it doesn't. Configurations change and mistakes are made even by seasoned people. People bring devices. Unless there's an air gap, people's devices may be hacked and let stuff through. Put authentication and anti-CSRF on _all_ your stuff, always.


> Clustering over the public internet is not a thing with Elasticsearch

It is, sort of, https://www.elastic.co/guide/en/elasticsearch/reference/curr...

But it's not a feature you'd be using without a really good reason IMO.


That does give me some food for thought. Not sure I agree a username and password is pointless though.


>Having basic authentication in Elasticsearch would be the pointless equivalent.

Instead of that they could implement a PAKE. That would provide security with no certificates.


Honestly, I as a user don't give a shit what a good engineer should do. All I see is that my personal data gets leaked left and right by Elasticsearch, and not MySQL or Postgres. But its fanbois just keep shifting blame instead of reflecting on reality and going "hey, yeah, maybe we should try to do something about it on our end". So fuck ES.


I agree. Every anti-moronic default adds friction. I love that I can play with ES quickly via a simple URL without any auth.


That's how we got PHP, Javascript, Visual Basic, MySQL (before version 5), Mongo.

You'd think that at some point we'd understand that there's way more morons out there than sensible people.


It can still bind to localhost or a local socket without auth.


> It has always baffled me that ES doesn't require a username and password by default.

Because auth was part of their paid service (and by paid I mean "very goddamned expensive") until about half a year ago, when they made it free in response to Amazon's freshly emerged Open Distro free auth plugin.


They offer security as a paid feature.


Actually it comes for free now with the standard ES distribution. https://www.elastic.co/blog/security-for-elasticsearch-is-no...


>Security for Elasticsearch is now free

What a horrific title. Even simply typing that should have been a blinking neon sign to them that they had their priorities in the wrong order.


That's incorrect.

The usual way of using this service is to have a backend network configured that connects your services and is not available from outside (i.e. you have to traverse through services to reach it).

The so-called "security" is just a paid feature for companies that want to use Elasticsearch in the "legacy" way because, presumably, they don't have people to design it correctly.


That's still really insecure, because it means that as soon as someone gains any access to that network, or any service on that network has a security issue, your database is wide open.

I'd say the public internet with proper (encrypted) password auth is more secure than that.


If an attacker has access to the app server, it is already game over. The app server typically already has access to all of the data.

The pods are akin to localhost networking where there is only one externally available application with multiple networked components.


That's true, but there are usually multiple ways to compromise protected networks. You still need to protect the database against attacks that don't go through the app server.


If an attacker gets a hold of your app server, they will be able to get the connection details for that DB, including the username/password.

Having a password adds a small layer of protection to databases that the affected app wasn't meant to connect to.

It adds some protection in that case, but the user should use best judgement if it's worth doing.


If you set up elasticsearch on a cloud service like AWS, by default your firewall will prevent the outside world from interacting with it, and no authentication is really necessary. If you do use authentication, you probably wouldn't want username+password, you would probably want it to hook into your AWS role manager thing. So to me, username+password seems useful, but it isn't going to be one of the top two most common authentication schemes, so it seems reasonable that it should not be the default.

MongoDB also by default does not have username+password authentication turned on.

I think defaulting to username+password is a relic of the pre-cloud era, and nowadays is not optimal.


I don't see why, though. It's much safer to start with a secure setup and then have the user disable the security explicitly (hopefully knowing what they're doing). Yes, username/password auth is not that common, but isn't it better than having no auth at all?


Ok, let's say username/password is mandatory and enabled by default. I see two options.

Option one: generate a unique password for every installation. Non-trivial to do, because at what point do you do it? It can't be before a cluster is formed, as you'd have a split brain generating a bunch of credentials. If you do it afterwards, then there is a period of time when your cluster is not yet protected. Worse yet, unprotected and handshaking authentication. So you don't do that.

Option two: make the user input the credentials. What is to prevent them from creating weak credentials? And worse, they'd have to do that for every node (or at least the masters). Not a good experience, and lost credentials will probably be the subject of a good many support calls.

So most products don't do that. What they do is default passwords. Which is arguably no security at all and doesn't protect anything. It may make it just a tiny bit easier to do the right thing afterwards (by changing to better credentials). Still, there's a period of time while the cluster is unprotected (default credentials are as good as no credentials).

Authentication does little to protect against the sort of people who are exposing databases to the public. If it is easily disabled, then they will be doing just that. Because they are already doing that by forcing databases to bind to publicly accessible interfaces.


I'd say option two is the only one viable. You deny access to the service until credentials are set by the user. You print huge warning labels while the credentials are set by the user to remind them of the possible consequences of setting weak credentials.

Yes, lost credentials will be subject of many support calls. Then, it boils down to your priorities. If you care about minimizing support calls, then sure, leave everything open to everyone. It will surely result in fewer access problems.

On the other hand, if your motivation is actually preventing your end-users from doing stupid things, it makes sense to just do the most conservative thing as default. Let the user change to the more liberal option, but not before informing them of all dangers that might befall them in that case.

I refuse to believe in this narrative of the end-user just being a stupid automaton who does not have any agency, and that any default imposed upon them will just result in them overriding the default with their terrible practices and ideas. I think there is a possibility of education and risk reduction.


I'd argue that the "pre-cloud" era is still going strong. And that is a good thing. My workplace has its own data center. There are some downsides, but I prefer it.

So username+password really is needed. And should be included by default.

Also, I'd expect the same of something like MongoDB. That it doesn't have that by default is just baffling.


Password auth over HTTP is horrible. Short of binding a public IP address to your instance, basic auth without HTTPS setup is probably the worst thing you can do.


It's a marketing ploy by ES.

They aggregated the data and published it so that the viral breach would spread their name around because all publicity is good publicity.

Just riffing of course.


This addresses entirely the wrong question. By looking at it as a technical problem you're completely missing the broader ethical problem. Why was anyone allowed by law to amass this amount of data? And why did PDS not take the security and privacy concerns of 1.2 billion people seriously enough to ensure the data was handled correctly? They obviously thought it was valuable enough to amass a huge database. Do they sell this to just anyone? If not, who can buy access to this data? How much does it cost, and what steps are involved in doing so?

This makes me want to talk to a lawyer.


> Out of the box it does not even bind to a public internet address.

Bind to all interfaces used to be the default in 1.x - it changed pretty much because people were footgunning themselves.

Coupled with lack of security in the base/free distribution, that made for a dangerous pitfall. At least now security is finally part of the free offering, but the OSS version still comes with no access control at all.


You typically use these in pods which share networking but are not available from outside.

It doesn't matter then if you bind it to 0.0.0.0.


At the time it was common to deploy on bare hosts. Deploying ES into a network namespace isn't even the most common use case today.


That still puts you a single firewall mistake away from disaster. It also places a lot of trust into the applications and hosts that can access ES on a network level: They get full access with no control at all.

To add to that: no security also means no TLS, neither in cluster communication nor when speaking to clients, etc.


I've come across several such ES instances that are 100% exposed to the world without even trying, and ES is by no means the first tool to have this problem. People are never going to stop doing this. Making it annoyingly difficult within ES just weakens them such that some other "wow it's so easy" search product will be better positioned to eat their lunch.


ES, Mongo, Redis used to be some of the easiest targets for production data (security vuln wise). Deployed by SWE's usually, with products that were early versions, and didn't have access control by default.


ES's practice of making its security a proprietary, paid-for product is the cause of these kinds of things. It's a shitty practice, and this is one of the reasons I'm glad AWS forked it.


Other databases learned that not requiring a user/password upon install is completely irresponsible. ES and other dbs need to catch up ASAP, it's ridiculous.

Documentation is not security. If you need to "RTFM" to not be in an ownable state it's ES's fault.


Trusting software you install to be secure is ridiculous and completely irresponsible, especially if you did not pay for someone else to take the blame.

The only thing you can do to secure your software is to restrict its communication channels. Once you've secured the communication channels, the software auth is decorative at best.


That doesn't absolve ES of providing basic security defaults.


Wasn't this exact same thing a huge scandal just a few years ago for Mongo on Shodan?

I can't believe anyone shipping a datastore could let it happen after that. Doesn't PostgreSQL still limit the default listen_addresses to local connections only? Seems like the best approach. On a distributed store, consistency operations between nodes should go over a different channel than queries, and should be allowed on a node-by-node basis at worst. At least at that point, it requires someone who should know better to make it open to the world. Even when just listening for local connections, passwordless auth should never be a default.


Yes, and similar issues still exist with public MongoDB instances even though the defaults are secure.


This assumes it was incompetence and not done intentionally.

My understanding is that neither company owns this data set, and there is an assumption that a third company has either legally or illegally obtained the data and is using it for their own services.

Another option is that the data was exfiltrated by a loose group of people who wanted this to be freely available on a random ip. Know the ip, get sick access to a trove of PII. No logins, no accounts, no trace.

Welcome to the early 90s internet.


> It's indeed really easy to set up. But setting it up like this still requires RTFMing, dismissing the warning above

I would bet that in a lot of cases, people that configure their servers like in the OP just don’t read the official docs at all.

Stack Overflow, Quora, etc. are great places to get answers, because of the huge amount of questions that have already been asked and answered there.

But when people rely solely on SO, Quora, blog posts and other secondary, tertiary, ..., nth-ary sources of information, Bad Stuff will result, because of all the information that is left unsaid on Q&A sites and in blog posts. (Which is fine on its own – the problem is when the reader is ignorant about the unsaid knowledge.)

> and having some clue about what IP addresses and ports are, and why having a database with full read-write access on a public IP & port is a spectacularly bad idea.

Again, not necessarily, for the same reason as above.

But even if they did, it is a sad fact that a lot of people dismiss concerns over security with the kinds of “counter-arguments” that I am sure we are all too familiar with. :(

Thankfully though, we are beginning to see a shift in legislation being oriented towards protecting the privacy of people whose data is stored by companies.

Ideally, the fines should cause businesses to go bankrupt if they severely mishandle data about people. Realistically that is not what happens. For the most part they will get but a slap on the wrist. But it’s a start.

Companies that can’t handle data securely, have no business handling data at all.


My favourite was Bitomat.pl's loss of 17k bitcoins in 2011 because they restarted their EC2 instance.

I understand that the "ephemeral" nature of EC2 was in the documentation, but ESL speakers may have glossed over the significance of a word they didn't fully comprehend.

https://siliconangle.com/2011/08/01/third-largest-bitcoin-ex...


Not to say this is what people are doing, but I don't think it requires much knowledge to run under Docker, and it's pretty easy to expose it to the public internet that way.
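For example, a compose file like this sketch (image tag pinned arbitrarily) publishes the port on all host interfaces, so on a VM with a public IP and an open firewall the node is world-readable:

```yaml
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.2
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"   # binds 0.0.0.0; "127.0.0.1:9200:9200" would stay local
```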


Incompetence and indifference will be the ruin of us all.

This is just another symptom of the Principal-agent problem writ large.


It's a tragedy that all of this data was available to anyone in a public database instead of.... checks notes... available to anyone who was willing to sign up for a free account that allowed them 1,000 queries.

It seems like PDL's core business model is irresponsible regarding their stewardship of the data they've harvested.


If you're in Europe or California, I suggest sending both companies an erasure request: https://yourdigitalrights.org/?company=peopledatalabs.com https://yourdigitalrights.org/?company=oxydata.io

Disclaimer: I'm one of the creators of yourdigitalrights.org.


Can I use this on behalf of my @company users HIBP has just emailed me about?


This is great. Thanks


Would it be better if this was a paid service? If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.


> If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.

Outlawing the collection of data would be hard and is unlikely to work, but the fact that companies like AT&T are allowed to sell your data, as they did with OP's (where else would that unused phone number come from), is an angle new legislation can use.

The EU now already has a piece of legislation aimed at stifling these practices. The US and other economies just need to follow suit.


I'm more thinking that not all data is equal, but we treat it like it is, at least from the public's perspective (it clearly isn't from the perspective of those gathering data; there's a clear disparity in how these groups view things). Some data is actually necessary to give up to have a well-functioning internet (what browser you're using) and some is not (canvas fingerprinting).

There's a tough question here, because the people deciding what data gets used are not us; it's the websites we visit. I would argue that no consent is being given here, and everything is assumed to be "common consent" (which I'm using for lack of a better term: things like the fact that if you walk out in public, people can see you. But conversely, someone can't run up to you and measure your height with a tape measure).

There has to be some balance here. What that is, I don't know. But really the only people who can figure that out are us computer nerds who at least kinda understand these things. We have to be having these discussions, or else it becomes "fuck silicon valley" (a conversation that is becoming national). If we don't think about these things, then we clearly live in a bubble, and bubbles burst. If we do think about them, maybe we don't.


I was recently told how private detectives from a national agency would actually go door-to-door (over a minimal area) under the pretext of AT&T store / sales employees. They’d try to convince their target (and some incidental neighbors as cover) to switch their bundled services to AT&T.

The private agents were armed with the latest available discounts (which you could find for yourself if you tried). But their skills made them particularly more successful than a typical front-line sales employee.

The catch? It wasn’t a scam, and they really were trying to get their targets to switch. It seems that AT&T was more willing to sell consumer data than the general public is aware of. Converting their targets to AT&T granted their agency access to additional data, which they then passed on to their clients. And the target gets a discount, too. Win-Win-Win? :)


It seems like that is starting to happen with California's new data privacy law. I'm starting to get a lot of privacy policy update emails like I did when GDPR took effect.


That is OP's point.


I found a vulnerability in LinkedIn a few years back that allowed anyone to access a private profile (because client-side validation was enough for them, I guess?)

They didn't take my report seriously (still not completely patched) and I feel like that told me all I needed to know about their security practices.


I reported an issue to the LinkedIn competitor https://about.me two years ago where signing in with my Google credentials gives me access to the account of some random other person with a similar name to me. I think that during registration, I attempted to register about.me/johnradio (except it's not "johnradio"), but he was already using it, and then the bug occurred that gave me this access.

I randomly check every 6 months or so and yep, still not fixed.


My Gmail is my first initial followed by my last name. There are other people on this planet with the same first initial and last name, some of whom seem to think that must be their email too, because I keep getting emails where they used it to sign up for things.


I had a lady send me a zip file that contained a VPN client, certificate and a word document with usernames and passwords to the VPN and a number of industrial control systems at the factory she was a manager of.

She sent it religiously, every 90 days.


Every few months I get scans of X-rays from random clients' teeth from some dentist in South America. I've tried so many times to respond and/or unsubscribe but never hear anything back.


Do you have any clue who she thought you were?


Oh yes, she was emailing a copy of her stuff to “herself”.


Seriously?

How the hell could she think that your email address was hers? I mean, wouldn't she notice that she never got the messages?


Totally serious. There are about a dozen people who regularly do this. One guy has missed 4-5 job interviews.


So is it typos? Like one letter off?

I can imagine someone mistyping an address, and then reusing the "to" link.


I faced the same problem (though my name is not very common at all). Banks and mobile companies never did anything, even after I repeatedly told them on the phone and on Twitter (and I have kept records of it).

One day, after I had received a person's bank and mobile statements and many other bills for a few months, I decided to call him (his number was easily visible in many emails) and inform him of his mistake. He turned out to be a lawyer, and he said he would "decide" what to do about it. The next thing I knew, he sent a carefully drafted email (as a legal notice) saying that I should hand over my email address to him without further delay, and all that.

I didn't do that. I talked to a lawyer friend and he just told me to reply with a "G F Y" card. I didn't do that either. But it pushed me to finally move my email to my personal domain, since it was/is a Gmail account, and if someone had complained, Google might have just terminated my account, and I don't know anyone who works at Google.


That lawyer sounds like a douchebag. I super agree with your point too: I'm also slowly moving all my emails to my personal domain and it feels liberating.


I get several on a weekly basis. It's amazing how many services do not verify emails and just trust their users to own the email they claim to own.


It’s a common “growth hack” to postpone email verification.


Even more baffling are the ones who use it to fill out job applications.


I get bank statements, job offers, party invitations, and lately a bunch of, let's say, very questionable email verifications from euro 'dating' sites. I've identified the guy in the UK, but it's too much (and getting embarrassing now) to keep forwarding his stuff to him.

Downside of getting in early on popular email services.


I went through several rounds of conversation with somebody's wedding planner over email.


> but it's too much (and getting embarrassing now) to keep forwarding his stuff to him

What amazes me is when I get misaddressed email and reply to say it's misaddressed (and I'm not talking about automated services, I'm talking about obviously manually sent stuff), and my reply just gets ignored and the misaddressed email just keeps on coming.


Somebody keeps phoning me and leaving messages. They don't answer their own phone (or messages clearly). I even have a sarky voicemail now, you'd think they'd notice. Nope!

Lady, whoever you think is going to be at that funeral isn't getting that message.

I've no idea if they'll get disconnected now that I've blocked their number. Hope so; maybe they'll notice then.


That's the most surreal, when you try to fix it and the behavior never changes.


My gmail is two initials and last name, so theoretically less susceptible to such errors. Yet I get misaddressed mail all the time—and a surprising amount of it is job applications!


Trust me, I used my full first name, it's not enough to stop these people. One is a UK doctor, one is a US teacher, and I think there are one or two more. Been sent a few baby pictures from their relatives too.


This happened to me and I keep getting the guy's notifications on instagram and all. So annoying!



I actually had a similar thing happen with Facebook, though we didn't share names.


For a while, our Comcast billing account accessed some other person’s account. Comcast didn’t take it seriously, and just told us to create a new account and not use the old one. (!!!)

We had full access. I could have signed this person up for the most expensive package, or even canceled their service.


Let's be realistic here. Everyone knows it's not possible to cancel Comcast service.


I managed to cancel my dad's after he died. They STILL tried to upsell me! One of my favorite phrases ever uttered: "He's dead, you asshole, he doesn't need more channels!" And that actually did it. Felt sorry for the salesperson, who didn't have much of a choice in the matter...


Surely by making it difficult to cancel they’re really just making it easier for people to get discounts. If I were a Comcast customer I’d be calling up to cancel every few months.


He's dead, he doesn't need discounts.


Obviously. Which is why I used a plural—I was referring to Comcast’s overall customer base.


Nice one. However, I cancelled in person a couple years ago (because I had equipment to return).

The first thing I said at the counter was "I know it's really hard to cancel Comcast, and I'm not going to accept anything but a cancel."

The girl at the counter smiled and said "We know ..." and immediately cancelled my account.


"Ah yes, cancelling requires a call because of security. A feature for the user!"


To be fair, the internets would have been equally outraged if there weren't such a requirement, because sure as hell somebody would have found an exploit and cancelled a bunch of accounts, just for funzies.


That sounds like white hat hacking from all I've heard of Comcast...

Maybe that's how we drive their customer count and revenue down and put them out of business.


I signed up for a disposable Gmail account using my real name at one point, and accepted the randomly suggested address it offered. Gmail loaded with someone else's obviously-in-use mailbox.

IIRC I logged out and back in again; same thing, my credentials worked. When I went back to it a few days later, the password no longer worked.


Hash collisions most likely.


Have heard this so many times about Gmail...

How have they not resolved this?


I think it's like EC2 instance IDs. When they first came up with it, they never thought there would be literally billions of unique email addresses/EC2 instances eventually.


I can only imagine about.me mass-creating profiles for names found on other web pages, and opening a way for someone to "claim" those profiles with a matching Google account sign-in.

About.me's business model was quite unsettling to me and they have made little to no effort to protect the user data from scrapers.


I had a similar experience. In 2014 I reported an issue where you could take over someone's account by adding an email you control to it and having them complete the flow by sending them a link (which, unless they looked very carefully, looked exactly like the regular log-in flow at the time - especially if they used a public email service and you registered a similar-looking account).

I tried it on a friend and it worked, but LinkedIn's response was basically "meh".

My life has only gotten better since I deleted LinkedIn a few years ago. I know I'm in a privileged position to be able to do that, but I strongly recommend everyone here consider whether what they gain from their account is worth the crap and spam they have to put up with.


LI is terrible if you actually try to use it, but it's harmless enough if you just use it as a profile hosting service, where people are likely to look. I just auto-archive their emails and only visit the site a couple of times per year.


While not good, what's the connection to this story?

The article says some LinkedIn data was scraped, but I don't see anywhere that it specifically says a LinkedIn security flaw was used in the scraping. Although it is vague about what data was scraped and how, so it doesn't preclude that either.

In other words, are you saying a LinkedIn vulnerability was exploited here, or suggesting that it probably was, or are you just mentioning LinkedIn because it's tangentially related?


I signed up for an API key to see what they have on me, and the data it returned looks awfully close to what I have on LinkedIn.


A few years of heads up is sufficient to disclose publicly. Full disclosure helps keep companies honest about security.


I deleted my linkedin a few years back when they had some bug where I would randomly get page views as some other person, with all their connections and account details and whatnot. It would only last a few minutes then switch me back to my account, but they aggressively ignored my attempts to reach out to them about this bug so I just gave up.


[flagged]


Could you please stop posting unsubstantive comments to Hacker News? We're trying for a bit better than internet default here.


No it is not.


The number in the HN headline was changed from 1.2 billion to 1 billion (despite the original source's headline saying 1.2). It is kind of amazing that leaking the personal data of 200 million people is now just a rounding error that can be dropped from headlines.


Imho, it's more impressive that it's basically a non-story outside of IT security news.


The general public just shrugs upon hearing such news. They still think there is nothing dangerous if their data gets leaked.


I think the solution here is laws which require anonymity, and that includes in banking (where it will never happen).

That is because a couple days ago, I got a text message from tmobile (which seemed genuine) basically saying that my account was one of a larger subset of prepaid phone accounts which had been compromised and that my personal information had been potentially taken by "hackers".

To which I got a good chuckle, because tmobile is one of the few phone companies that will let you create completely anonymous prepaid accounts using cash and without filling out any information. AKA you buy a sim card for $$$ and that is it. So, basically the only information they lost of mine as far as I can tell, is the phone number and type of phone I'm using (which they gather from their network). If they got the "meta" data about usage/location/etc that would have been different but it didn't sound like the hacker got that far.

Had this been a post-paid account they would have my name/address/SSN/etc.


Do you think it’s reasonable to believe your name / address / SSN / DOB / etc is already out there?

I’m of the opinion it’s too late for prevention and we need, instead, mitigation.


Exactly. The very reason for the existence of the two companies, PDL and Oxy, is to tie n pieces of data to m pieces of data.

So depending on how the "anonymous" phone number was used, it's plausible that the number can be connected with other PII.

In fact I wonder if there is any such thing as non-PII, given the existence of such companies.


Companies need to stop treating knowledge of this information as proof that you are who you say you are. I would have no problem publicly posting my name, social security number, birthday, mother's maiden name, etc., if not for the fact that someone can actually use this information to open a bank account or take out a loan in my name. It's ridiculous that this is all it takes in most cases.


> Companies need to stop treating knowledge of this information as proof that you are who you say you are.

If we assume that isn't happening in the very immediate future due to the latency of introducing new legislation...

Do we have any other options to protect ourselves?

I've personally worked myself into a bad credit rating. I have a home loan and a credit card, but any new credit applications auto-reject. Not the ideal scenario, though!


> Analysis of the “Oxy” database revealed an almost complete scrape of LinkedIn data, including recruiter information.

"Oxy" most likely stands for Oxylabs[1], a data mining service by Tesonet[2], which is a parent company of NordVPN.

It is probably safe to assume that LinkedIn was scraped using a residential proxy network, since Oxylabs offers "32M+ 100% anonymous proxies from all around the globe with zero IP blocking".

[1] https://oxylabs.io/

[2] https://litigation.maxval-ip.com/Litigation/DetailView?CaseI...


The article says it is "Company 2: OxyData.Io (OXY)" (http://oxydata.io)


OxyData and OxyLabs seem to be sister companies[1]: the former sells data as a product, the latter sells scraping as a service.

[1] https://vpnscam.com/wp-content/uploads/2018/08/2018-08-24-09...


Tesonet is true cancer. I am amazed how unethical (and successful) they are.

Knowing how quickly it's expanding: are the employees just as unethical, or do they just not connect the dots (has the company got too big)?

I hate fb et al. as much as any other person here, but most people know that "if it's free, you are the product". With NordVPN, though, users are paying money and getting stabbed in the back.


> are the employees just as unethical

Most people's ethics are easily bought. Does working for a company that operates with questionable integrity outweigh providing a stable income for your family?

Remember Facebook is still a very highly desirable company to work at.


> NordVPN users are paying money and are getting stabbed in the back.

could you please expand on this claim?


From the comment they replied to: https://vpnscam.com/


"My name is Ripoff Reporter." For all that their schtick is about how they're "educating" the public about how shady VPN services are this could be anyone, including a front for a VPN service that isn't mentioned on the site.


How is that possible? LinkedIn blocked mining the data this way several years ago.

Is it still possible if you pay LinkedIn enough? Or is this old data?


It is strictly impossible to "block mining data" on the public web. Double that if the miner has free access to a pool of residential IPs.

[source: experience]


A large number of residential proxies and fake LinkedIn accounts would look the same to LinkedIn as normal browsing.


There's information in the leak that wouldn't be widely available without accessing LinkedIn data through their APIs. Phone numbers and emails, for example.


The article mentions it is a blend of data from http://oxydata.io/ and https://www.peopledatalabs.com/

Both are aggregators that get data from many sources, correlate them, and sell it. The phone numbers and emails could have come from anywhere.

See this screenshot from PeopleDataLabs: https://d1ennknj6q36vm.cloudfront.net/images/cblead.png


I'm a NordVPN user. Practices like this scare me, though. I guess it's time to switch to a new VPN?



Ah... but that is very inconvenient :( I guess comfort comes at a cost.

Is there at least a less shady provider, if I'd like to compromise myself, but a bit less than with NordVPN? How far do we go in assuming all of them are bad?


Mullvad seems trustworthy (I used to share an office with one of their IT infrastructure staff), but it is impossible to say for sure.


You could set up your own VPN on a server you run.


Yes. This. And it's free to set up on the big cloud services. Like, free 24/7 with whatever amount of data. Guides are online.


All the way. It isn’t as if all VPN providers are part of a shadowy cabal to steal your data from an otherwise valuable service; the very premise of commercial VPNs is flawed. Any VPN service is inherently harmful.


Out of curiosity how do you guys think they managed to scrape LinkedIn on such a large scale?

I've been wanting to do some social graph experimentation on it (small scale, say 1000 people near me) but concluded I probably couldn't scrape enough via raw scraping without freaking out their anti-scraping measures. (And the API is a non-starter, since it basically says everything is verboten.)


I've crawled a popular social network on a large scale, and I'm currently doing the same for dating services as a hobby. God, I wish I still got paid for web scraping.

Here are some tricks which may or may not work today:

- Have an app where user logs in through said website, then scrape their friends using this user's token. That way you get exponential leverage on the number of API calls you can make, with just a handful of users.

- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

- Scrape the mobile website. Even Facebook still has a non-js mobile version. This single WAP/mobile website defeats every anti-scraping measure they may have.

- From a purely practical perspective, start with a bare-metal, transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.

- Don't be too kind to the big websites. They can afford to keep all their data in hot pages, and as one man you will never exhaust them.
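The IPv6 trick can be sketched in a few lines. This is a toy illustration of the idea, assuming you control a routed /64; the prefix below is the reserved documentation range, and the function name is mine:

```python
import ipaddress
import random

def random_ipv6_in_subnet(subnet: str) -> ipaddress.IPv6Address:
    """Pick a random host address inside the given IPv6 subnet.

    A /64 contains 2**64 addresses, so a naive per-address rate
    limiter will effectively never see the same client twice.
    """
    net = ipaddress.IPv6Network(subnet)
    return net[random.randrange(net.num_addresses)]

# 2001:db8::/64 is the documentation prefix; substitute your own routed block.
addr = random_ipv6_in_subnet("2001:db8::/64")
```

You would then bind the outgoing socket to `addr`; whether that actually works depends on your provider routing the whole block to your host, which is why the subnet-based rate limiter the parent mentions is the real countermeasure.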


> - Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

Nice tip!!

> -- From a purely practical perspective, start with a baremetal transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql", those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in hurry.

Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).


>Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).

Sure, it will work, but I personally don't like Elasticsearch for anything high-intensity because of its HTTP REST API and the overhead it carries. Take a look at Cassandra's [1] "CQL binary protocol"; it's simple and always on point.

[1] https://github.com/apache/cassandra/blob/trunk/doc/native_pr...


You forgot the part about exposing your finished database to unprotected elasticsearch http endpoint ;)

In all seriousness, does anyone know why you can even host an Elasticsearch database over plain HTTP and without credentials? It seems to be the default. What is the use case for this?


Tbh I'm still selling that data.

For a while I've had recurring nightmares that my DB had been stolen and published, together with an article on how stupid and incompetent I am.


If I've understood you right, you break the ToS of other websites to collect users' personal info, and then you have nightmares about people taking that data from you? Doesn't that raise ethical concerns in your eyes?


>You forgot the part about exposing your finished database to unprotected elasticsearch http endpoint ;)

I'll cut straight to the chase and post it on hn. This intermediate step of waiting for someone to discover it takes too long


The use case is in a local datacenter, with a NAT-ed IP not exposed to the main web


A firewalled IP would be much more appropriate, and NAT is not a firewall or a security mechanism.


Same thing, more-or-less. And NAT is effectively a firewall for inbound traffic, even if a lot of people say it isn't.


> Have an app where user logs in through said website, then scrape their friends using this user's token.

That's an extremely shady thing to do.


Welcome to the internet!


> Don't be too kind on the big websites.

I usually recommend latency-based dynamic load control for that. Once the website starts to reply 500-1000 ms slower than the average one-thread latency, it is time to back off a bit. It is also a co-operative strategy between fellow scrapers, even if they don't know about the other ones putting load on the servers.
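A minimal sketch of that strategy. The class name, thresholds, and back-off factors here are my own illustration, not a standard recipe:

```python
class LatencyThrottle:
    """Co-operative load control: back off multiplicatively when the site's
    replies get slower than a known one-thread baseline, recover gently."""

    def __init__(self, baseline_s, slack_s=0.5, min_delay=0.1, max_delay=30.0):
        self.baseline = baseline_s   # average latency with a single polite thread
        self.slack = slack_s         # how much slowdown we tolerate before reacting
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.delay = min_delay

    def observe(self, latency_s):
        """Feed in the latency of the last request; returns the pause to take."""
        if latency_s > self.baseline + self.slack:
            self.delay = min(self.delay * 2.0, self.max_delay)   # ease off
        else:
            self.delay = max(self.delay * 0.9, self.min_delay)   # speed back up
        return self.delay
```

Between requests you would call `time.sleep(throttle.observe(last_latency))`, so a struggling site automatically sheds your load first.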


1000ms is a massive slowdown when revenue-noticeable impacts are far, far smaller. I don't know the legality, but hitting a site hard enough to cause 1000ms slowdowns seems like it's approaching DOS legality issues.


Don't you consider this unethical -- if not against the site itself, than against the other users of the site whose data you're scraping?


Wow these are some hot tips!

YMMV, and cloud providers would hate you for this, but you can automate IP rotation with a cloud provider that bills you by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then another in Singapore for the next hour.

Pretending to be Googlebot also helps.


>- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

Clever. VMs with IPV6 are cheap as a bonus :)

Same for non-js mobile. Thanks for the tips


> - Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

How would someone do that using node.js? Asking for a friend.


So far, the answers have contained non-technical answers like "Distributed Scraping." Well, yes, obviously.

A more useful answer is: I did this once, many years ago. Back then it was a matter of hooking up PhantomJS and making sure your user-agent string was set correctly. Since PhantomJS was – I think – essentially the same as what headless Chrome is today, the server couldn't determine that you were running a headless browser.

It's not so easy nowadays. There are mechanisms to detect whether the client is in headless mode, but most websites don't implement advanced detection and countermeasures. And in the ideal case, you can't really detect that someone is doing automated scraping. Imagine a VM that's literally running Chrome, with the script set up to interact with the VM using nothing but mouse movements and key presses. You could even throw some AI into the mix: record real mouse movements and key presses over time, then hook the AI up to your script so that it generates movements and key presses that are impossible to distinguish from real human inputs. Such a system would be almost impossible to differentiate from your real users.

The other piece of the puzzle is user accounts. You often have to have "aged" user accounts. For example, if you tried to scrape LinkedIn using your own account, it wouldn't matter if you were using 500 IPs. They would probably notice.

It's hard to counter a determined scraper.


I wrote a chrome headless framework that types using semi-realistic key presses (timing, mistakes, corrections) and does semi-realistic scrolling / swiping and clicking / tapping.

It's not very hard to get something that would be too hard for almost every website besides Google and Facebook to bother with. If it's a 1 on a 0-9 scale of difficulty, most websites just don't have the resources to detect it.

It took me like ~3 hours to write it, but I guarantee it would take months for someone to detect it, and even then, they'd have a lot of false positives and negatives.
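The typing part of such a framework can be sketched in a few lines. This is a toy version of the idea, not the parent's code; the key names, delay ranges, and error rate are made up:

```python
import random

def humanized_keystrokes(text, rng=None, error_rate=0.03):
    """Return (key, delay_seconds) events that roughly mimic a human typist:
    jittered inter-key timing plus occasional wrong-key-then-backspace fixes."""
    rng = rng or random.Random()
    events = []
    for ch in text:
        if rng.random() < error_rate:
            # Fat-finger a random key, "notice" it, then correct it.
            events.append((rng.choice("abcdefghijklmnopqrstuvwxyz"),
                           rng.uniform(0.05, 0.25)))
            events.append(("<backspace>", rng.uniform(0.10, 0.40)))
        events.append((ch, rng.uniform(0.05, 0.25)))
    return events
```

The event stream itself is just data; replaying it through CDP or Selenium key events is the part that actually touches the browser.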


I think there's also a lot of bot-detection-as-a-service around here that can be used by sites smaller than Google and Facebook, like WhiteOps or IAS anti-fraud.


These are highly questionable under GDPR, many of them rely on tracking users wherever they go (e.g. Recaptcha is known for this).


> These are highly questionable under GDPR

How many fines has GDPR resulted in?


Not many yet, general consensus is to first warn and get companies to implement better compliance - only those who really openly shit on GDPR get the fines.


then release it!

Headless chrome cat and mouse game is a lot of fun. We need more players.


LinkedIn's protection doesn't seem to be that sophisticated at the moment. Someone I know maintains ~weekly up-to-date profiles of a few million users via a headless scraper that uses ~10 different premium accounts and a very low number of different IPs.


That is a violation of the ToS (using registered accounts to scrape) and could carry legal implications.


So is leaking PII? A ToS isn't a legal contract: it's not signed by anyone and it's changed every other week without the users' consent. A ToS is just a formal excuse for why someone's account may be suspended.


As long as you are able to source more than one provider, this can work well enough. If you're dependent on a single data source, e.g., because that source is the only possible source of said data, you'll get nuked from orbit by legal rather than technical means.

I had a business that was generating more money than my full-time job for a while. We helped and greatly simplified matters for several thousand independent proprietors while having a positive effect on the load of the data source, since we were able to batch/coalesce requests, make better use of caches, and take notification responsibilities on ourselves.

Once in a while someone would get worried and grumpy at the data source and there were a couple of cat-and-mouse games, but we easily outwitted their scraping detection each time. When they got tired of losing the technical game, they sent out the lawyers, which was far more effective. We were acquiring facts about dates and times from the place that issued/decided those dates and times, so there wasn't really any reliable alternative data source, and we had to shut down.

The glimmer of hope on the horizon is LinkedIn v. HiQ, which seems poised to potentially finally overturn 4 decades of anti-scraping case law, but not holding my breath too hard there.


The US courts decided that scraping is legal, even if against EULA:

> In a long-awaited decision in hiQ Labs, Inc. v. LinkedIn Corp., the Ninth Circuit Court of Appeals ruled that automated scraping of publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). This is an important clarification of the CFAA’s scope, which should provide some relief to the wide variety of researchers, journalists, and companies who have had reason to fear cease and desist letters threatening liability simply for accessing publicly available information in a way that publishers object to. It’s a major win for research and innovation, which will hopefully pave the way for courts and Congress to further curb abuse of the CFAA.

https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...


That is a blatant misrepresentation of that decision. That decision was upholding a lower court's preliminary injunction that prevents LinkedIn from blocking hiQ while the main case between the two is litigated. It is not a final decision and it doesn't purport to say that scraping is legal (it even points out other laws besides the CFAA that might be used to prohibit scraping.)


LinkedIn Sales Navigator is a paid tool which allows you to search their whole database. Then depending on how much you pay you can get all their personal details (Email address, phone number, even their address sometimes.) https://business.linkedin.com/sales-solutions/sales-navigato...


I've always been a little confused how this works. If I got all that info for free, it's a "data leak", but if I pay to get the same detailed personal information it's...

In either case my personal data is given away without my consent, but there's this implication that it's only an issue when someone doesn't pay for it.


You're right. My take on this is that a company scraped a bunch of publicly available information that people left open (consciously or not). That's why only a subset have phone numbers. The profile URLs and emails, most people don't even try to protect those.

Normally the company sells this data, but now they've given it away. It's not good this data got out because the curation has some value to spammers or whoever. But using the word "leak" here undermines the severity of a real leak where passwords and social security numbers are exposed. Data that was never meant by anyone to be open.

Everyone likely has (technically) provided consent for every piece of information here being shared with partners. Buried in fine print that it wasn't really expected they'd read, of course. It's the cost of being online, and that sucks, but it seems only a leak of what had already been given out.


> In either case my personal data is given away without my consent

You gave that consent when you put your info in Linkedin in the first place, according to their ToS.


I think everyone is confused. Everyone just wants their slice of the pie (aka $$$).


If you get drivers info by hacking a DMV database, it's prison. If you got the same details by paying a few millions for FOIA requests, you're a good citizen and a model tax payer.


Unless you're the state of Florida, and you make millions by selling the DMV database to private buyers... [0]

[0] https://www.abcactionnews.com/news/local-news/i-team-investi...


Jokes aside, can you really file FOIA requests to get personal driver details from DMV? I thought FOIA would only apply for stuff that is meant to be public, but isn't due to difficulties of hosting, putting it up, etc.

Mind you, I didn't research the topic of what can or cannot be requested with FOIA, so I might be totally wrong.


LinkedIn gives away your email address and phone number (even if you only provided it for 2FA) to all your contacts. I checked PDL; it has all my information from LinkedIn except for the phone number, which I promptly removed once I identified the 2FA issue (TOTP is available now).


'Mobile proxies' like https://oxylabs.io/mobile-proxies (no affiliation) let you use large pools of mobile or domestic IPs to scrape. It's expensive, but not prohibitively so. Once you've got a mobile IP you become incredibly hard to throttle, since you're behind a mobile NAT gateway.
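On the client side, using such a pool usually amounts to rotating the gateway per request. A trivial sketch; the gateway endpoints below are made up, and real providers often rotate the exit IP for you behind a single gateway:

```python
from itertools import cycle

# Hypothetical gateway endpoints; a real list would come from the provider.
_pool = cycle([
    "http://user:pass@gw1.example.net:7777",
    "http://user:pass@gw2.example.net:7777",
    "http://user:pass@gw3.example.net:7777",
])

def next_proxy() -> str:
    """Round-robin through the proxy pool, one endpoint per request."""
    return next(_pool)

# e.g. requests.get(url, proxies={"http": p, "https": p}) with p = next_proxy()
```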


You probably have to be highly distributed. At least that’s what I did when I tried to scrape a large site some years ago. I had around 100 machines in different countries and gave each of them random pages to scrape.


Distributed bot and scraper networks. Thousands of IPs geographically dispersed throughout the world. There is only so much you can do with rate limiting.


They asked about LinkedIn, where the content is gated behind a login. If it was a rate limiting problem, that would be trivial.

Needing to be logged in as the same user defeats the purpose of proxying to hide your physical origin.

Registering thousands of different users to use in a distributed way is hard now that they require a text message verification for new accounts.


Public LinkedIn profiles (which is many of them) are open to scrapers and they lost a court case about it.

https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...


I go to LinkedIn without being logged in and nearly always get a login gate instead of the profile.

They were ordered to unblock hiQ specifically, they were not ordered to open up content to scrapers generally.

They can still throttle high volume traffic and put up captchas. I think the only specific thing the court ordered was for them to unblock hiQ IP ranges.


Proxies can also work well for cheaper than buying distributed compute.


Scraping LinkedIn is so common you can usually hire people with years of experience in it. It is not as complicated as you might think. There are at minimum hundreds of companies that sell LinkedIn data they have scraped.


You use a proxy botnet and route your scraping requests through that. Use something like hola proxy or crawlera for example.


I scraped 10 million records from linkedin a few years ago from a single ip by using their search function. I got a list of the top 1000 first names and top 1000 last names and wrote a script to query all combinations and scrape the results.

This may or may not still work.
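A toy sketch of the combinatorial approach described above (the name lists and URL pattern here are placeholders, not the original script's; the real lists had 1,000 names each, for 1,000,000 combinations):

```python
from itertools import product

# Hypothetical stand-ins for the top-1000 first/last name lists.
first_names = ["james", "mary", "john"]
last_names = ["smith", "johnson", "lee"]

def search_urls(firsts, lasts,
                base="https://www.linkedin.com/pub/dir/?first={0}&last={1}"):
    # One search URL per (first, last) pair; the real script would
    # fetch each page and parse profile links out of the results.
    for first, last in product(firsts, lasts):
        yield base.format(first, last)

urls = list(search_urls(first_names, last_names))
print(len(urls))  # 3 x 3 = 9 for this toy input
```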


It looks like the purpose was data enrichment, so maybe it was pieced together over time from multiple sources. My linkedin from PDL only had 1 bit of wrong info. I wasn't able to find anything on my personal email addresses which is good.


I once worked on a project that tried to do just that, but at the time the LinkedIn API was already limited to seeing the authenticated user's connections' connections, which was too limited for what we wanted to do. I can only imagine it has gotten worse. It's also the reason recruiters really want to connect with you on LinkedIn: even if you are not interested, your connections might be.


A very large distributed network of machines.


Hey - not related to your comment (apologies) but wanted to get in touch . You left a note on a previous post of mine about wanting to simplify FTP. I'd love to work on this project and wanted to see if you'd be willing to connect so I can understand the problem better. Feel free to email me at kunal@mightydash.com, and thanks in advance!


People data labs's data is pretty accurate. Here is mine: https://api.peopledatalabs.com/v4/person?api_key=9c6a1382204...

You can try it for yourself by changing the email. All of the information is public, so I don't mind. They are basically doing data integration.
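A lookup like the one linked above boils down to a single GET request (sketch; `YOUR_KEY` and the email address are placeholders, and the query parameters are assumed from the URL in the parent comment):

```shell
# Hypothetical lookup against the endpoint linked above; substitute a
# free API key from their site and the email you want to check.
curl "https://api.peopledatalabs.com/v4/person?api_key=YOUR_KEY&email=someone@example.com"
```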


Haha, when I was a kid and scared to use my real name for things, for some reason I used my email... which had my real name in it, to open a Github account with a fake name

So the api knows me as the famous architect, Art Vandelay


Reminds me of when I used to get free magazine subscriptions (and the subsequent junk mail/robocalls) addressed to Santos L. Halper.


There is a way to get every developer’s email on github thanks to git commits adding it :))
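A minimal illustration of the point above, using a throwaway repo (the name and address here are made up): every commit carries its author's email in plain sight, so `git log` on any public repo lists them.

```shell
# Build a disposable repo with one commit, then list every author
# address embedded in its history.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.name "Jane Dev"
git config user.email "jane@example.com"   # hypothetical address
echo hi > file.txt && git add file.txt && git commit -q -m "first"
git log --format='%ae' | sort -u   # prints jane@example.com
```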


In your github account you can add a new email address that doesn't even exist or have a valid TLD, like "name@mail.fake". Don't use it as your primary email and it won't require confirmation. You can now set your git user.email to this fake address and any commits you make will be attributed to your account without exposing your actual email address.


You can use yourgithubusername@users.noreply.github.com instead of adding a fake email, and your commits will still show up on your contribution graph and be linked to your username.
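A quick sketch of that setup in a throwaway repo ("octocat" stands in for your actual GitHub username):

```shell
# Configure git to attribute commits via GitHub's noreply address
# instead of a real mailbox, then verify what ends up in the history.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.name "octocat"
git config user.email "octocat@users.noreply.github.com"
echo hello > README.md
git add README.md && git commit -q -m "initial commit"
git log --format='%ae' -1   # prints octocat@users.noreply.github.com
```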


That must have been a long time ago, Boorish Bears.


Wow.. I checked with an email address I use for disposable purposes. The only thing they had on it was a blank LinkedIn profile -- meaning that LinkedIn cancer has trawled some pretty questionable sites, harvesting email addresses as placeholders for their accounts. WTF.


Ah, looks like everyone's using that API key, I got 2 queries for my addresses and got a "rate limit exceeded" message.

Strangely it only says I work in real estate (no I don't) when I looked up the email address I use for LinkedIn...


You, and others can use my api key, just signed up.

e75ac28b25480e60071b24d819d4692a0b315c037046b9ff6ec9dfb1e99a895c


Status 429, Rate limit error.


yours gone too now. very curious about this API lol


Try changing v4 to v3 in the URL.


Yup, that worked for me.

Indeed they do have a profile on me - a bare minimum, scraped from GitHub. That makes sense, since that's about the only social platform I use, aside from HN.

EDIT: My GMail address has the most information gathered, which makes sense. It's gathered Facebook, LinkedIn, Pinterest, GitHub...

It lists my skills as: firefighting and emergency planning/management/services. I suppose, with a stretch of imagination..


Here's mine eaca37c25ca1a9c5d85efb8cbaf1742b4fbfeee0054d713961176ab9500c2f2b


It returned a 404 for my personal email account, so that appears to be sufficiently protected.

More surprisingly, it had data such as my name, title and work email address connected to an old work email account (Okta-managed GSuite) that I never associated with external services, and absolutely never used on a social networking site like LinkedIn.


That API key is now public, too! Rate limited.


Yeah no kidding. Though if you wait until it flips to a new minute and refresh, that helps. Though it takes all of a minute to register a free key, so probably no big deal.


Your API key is now permanently public. Even after a few days, people will still be able to use it for their own purposes.


a few days? it's already hit its limit :)


I'm actually a bit surprised at how little data they have on me. They've associated my main email with an old junk email, they've got my first and last name, and know that I'm male, but there's little more.


Nothing for most of my accounts, except one which somehow was falsely attributed to someone else. Odd, given I do have a LinkedIn profile; their scraping must be far from perfect.


Wait, so is this mostly just Linkedin data in JSON form?


My personal email seems to be based on Github and Gravatar, while my job search and work emails got linked together and appear to be based on LinkedIn.


This seems exceptionally unethical


Displaying public information publicly, or sharing your API key?


I would be really surprised if this were compliant with the GDPR. I live in the US, but I tried email accounts of relatives in Europe and they had data in there.


It looks like it's a US-based company without enough of a European presence to fall under their jurisdiction.


https://gdpr.eu/companies-outside-of-europe/ it looks like it would? I'm no expert though.


Right, they can say it applies... but if a company does no business in Europe, how can a judgement be enforced?


> The whole point of the GDPR is to protect data belonging to EU citizens and residents. The law, therefore, applies to organizations that handle such data whether they are EU-based organizations or not, known as "extra-territorial effect."

They can say this all they want, but if you have no presence in the EU, and your jurisdiction does not have any agreement to apply GDPR regulations to you, then this is at most a strongly worded request.

Barring explicit agreements to the contrary (treaties, extradition agreements, etc), by definition a country's laws are only enforceable there.

If PDL has no business in Europe, no plans to expand there, and there's no treaty or other agreement making the provisions enforceable against them, the EU can say whatever it wants but PDL has no legal obligation to do anything about it.


One obvious answer in that case would be to establish who is buying the data from them and treat any PDL data as potentially tainted. If you find a downstream customer who does have a presence, then investigate accordingly. You might not be able to fine PDL directly, but you could certainly make the offending data risky or unprofitable...


Sure, but how do you propose doing that? Send another strongly worded letter to PDL demanding their customer list?


Usually you'd either track known errors in the dataset (implying that the companies had either bought it from PDL or copied the leak), or you'd ask the banks (who do have a presence) which accounts were paying them and who owned the accounts. If Bitcoin's involved at all, you assume there's something fishy going on and investigate accordingly.

(Assuming anyone were bothered enough to actually do this, of course.)


I’m also not an expert, but my understanding is that it applies but would be hard for the EU to take action against them


A law isn't a law if you can't enforce it, so "applies" has kind of a strange meaning in this context then, doesn't it?


A law always has a jurisdiction. EU laws generally don't apply to the US, even if the EU wants them to. There are exceptions, of course.


Theoretically, if it were egregious enough, the EU could say to the owners or management of the company that if they went to the EU they would be arrested. That’s enough of a threat that it might convince them.


Legal jurisdiction is a separate matter from the specific text of laws. The "this applies to non-European companies" clause just means that if you fall under the jurisdiction of European courts, you can't absolve yourself of responsibility for complying with this law simply by being a foreign-registered company.

On the other hand, if you never fall under European jurisdiction in the first place, you're free to ignore them, just as you can ignore Thai laws against insulting their king. One very important thing to note is that setting foot on European soil will expose you to their jurisdiction, so you've significantly limited your freedom of movement, but if GDPR compliance is a bigger deal than that, then "just never go to Europe" can be a viable strategy.


Oh yes, I'm going to try and see if they have data on me and send a number of GDPR requests if they do. For others from the EU, it's very easy to do using: https://www.mydatadoneright.eu/request


So... if the owner is known, it will be quite costly ;-)


It's no secret who is behind that website [1].

Good luck to the EU on enforcing their law against an American company, though.

[1] https://angel.co/company/peopledatalabs/people


I don't know how accurate the coordinates of your address in India are, but it's 5 minutes away from me. Small world, huh?


I'm glad they don't have jack shit on me besides my email, is there a list of their data source(s) ?


It should be illegal for any company to store my private information like this. The 'anonymous' sharing of my information is easily de-anonymized. Sites asking for your phone number for "security purposes" are a joke.

You just have to accept that absolutely everything you've done online is public information. If it isn't now, it is being stored and future tools / databases will make what is either difficult to access or difficult to interpret very easy to use in the future.


Using phone number as an example of private information is pretty hilarious. Remember when the phone company used to literally print your name and phone number in a book and send it to everyone in your town? Man, their security was terrible!

But it works perfectly fine as a two-factor auth mechanism to prove that whoever set up the account is the same person trying to log into it at some later time.


Birthday is commonly used to verify people despite the practice of broadcasting it to people on Facebook.


What private information? If you give a random website your email address or phone number, it's not private anymore and you're the one who released the secret. Unless they promised to keep it private in a legally binding way, in which case, your wish is already true.


Such a cavalier attitude to the storage of EU citizens' personal data is illegal under Article 32 of the GDPR: https://www.gdpr.org/regulation/article-32.html

Citizens of the US really need similar protections.


Firefox monitor can tell you if your information was leaked in data breaches. I don't think they have this data set though.

https://monitor.firefox.com/


Highly recommended. You can put in multiple email addresses, so you can help monitor your non-technical family members’ info as well.


At this point practically everything about me's available either for free or a few dollars. The only interesting thing left is whether a given password has been compromised. The answer to everything else is "yes, it's been leaked". Been that way for most of a decade at this point, guessing it's the same for most other folks with any modern digital or banking presence whatsoever.


I'm sure there are search engines for it too but I noticed that credit karma can tell you which of your passwords have been associated with your email addresses in data breaches.

Credit Karma is free but the CEO appears to be transparent in how they make money (recommending financial products to you based on what they see in your credit profile).


I am a very suspicious and wary internet user, hardly sign up for any services, but been using Credit Karma for my taxes and light financial monitoring for the last 3 years. Tax Filing was totally free and I got the tax refunds I was expecting. No issues with them whatsoever. I have never gotten any email or other spam as a result of using their service. I am a happy customer, though technically speaking I have never actually given them any money directly.


I agree. At first I got a few emails over a long period of time recommending financial products (credit cards, savings accounts, etc) but I unsubscribed and haven't seen any of those since. The only emails I get now are when something changes on my credit profile (new account, closed account, etc).


This looks like a wrapper around Have I Been Pwned.


That's precisely what it is. From[0]:

> Through our partnership with Troy Hunt’s “Have I Been Pwned,” your email address will be scanned against a database that serves as a library of data breaches. We’ll let you know if your email address and/or personal info was involved in a publicly known past data breach.

https://blog.mozilla.org/blog/2018/09/25/introducing-firefox...


It works with that service. They are pretty transparent about that in their documentation.


Mozilla are now reporting on this data set. Top marks for getting it online so quickly.


Does this cover more leaks than haveibeenpwned.com?


Maybe in the future it will, but it uses Have I Been Pwned. From the FAQ[0]:

How does Firefox Monitor know I was involved in these breaches?

Firefox Monitor gets its data breach information from a publicly searchable source, Have I Been Pwned. If you don’t want your email address to show up in this database, visit the opt-out page.

[0] https://support.mozilla.org/en-US/kb/firefox-monitor-faq#w_h...



IIRC, they use that.


1Password has a similar feature too.


chrome://flags/#password-leak-detection - same thing


can't I search for my phone number?


> 400 million+ phone numbers. 200 million+ US-based valid cell phone numbers.

Sounds like a nightmare in the making for those cell phone users and their carriers when those begin to get SIM jacked.


I think cold calling could be an even bigger nuisance. My DNS provider published my phone number by mistake in a whois record when I registered a domain; I spotted it immediately and it was corrected within hours. Over a year later I still receive cold calls from India selling me web services at least once or twice a week.

Imagine if you could match everyone's position with a mobile phone: a dream for telemarketers, tailors, scammers, etc...


Is that all you need to SIM jack a phone? The phone number?


Yes and no. You need a phone number, but you still need to carry out a variation of an attack that replaces the SIM associated with that phone number. Sometimes this is carrier-specific. Sometimes it's trivial, sometimes it requires a modest amount of work, and in extreme cases you might have to access an actual network. Most of the time there is nothing stopping the attack if they have your personal information.


Yet another Elasticsearch server wide open. This is going to make the flurry of open mongodb servers look trivial.


I wouldn't be surprised if the starting point for this vulnerability wasn't ES, but Docker. Docker by default modifies iptables, and if you hack together a system that mixes software running directly on the host with containers, the published container ports end up exposed to the Internet - which you might not expect, since Docker's iptables rules bypass host firewalls like ufw. It's always a good idea to have a separate firewall running outside of your system - that's the one Docker can't fool.
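The difference comes down to how the port is published (sketch; `myapp` is a hypothetical image, and this assumes Docker's default iptables integration is enabled):

```shell
# The common form binds 0.0.0.0 and punches through the host's
# iptables rules, so the service is reachable from the Internet:
docker run -d -p 9200:9200 myapp

# Binding the published port explicitly to loopback keeps it
# reachable only from the host itself:
docker run -d -p 127.0.0.1:9200:9200 myapp
```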


I had this issue last week. I was pulling my hair out as to how my brand new Linode got hacked even though I had setup ufw within minutes.

It is downright ridiculous that this was ever approved as a default behavior.


They mentioned this is on Google Cloud, which blocks almost all incoming ports by default. They had to have chosen to expose this through the project firewall, and not put in a source filter.


No. It's not Docker's fault you did not read the manual and exposed the ports wrong: you can bind the published port to specific IPs, and that address should be 127.0.0.1.


I see where you're coming from, but I disagree. I believe that good software and abstractions should take little training to use - everything unintuitive is a design failure and should be fixed. "Reasonably secure" should be the implicit default, not something you need to explicitly add. E.g., it's better to force authentication and force the administrator to add an account than to let everyone in by default. Or it's better to bind to 127.0.0.1 than to 0.0.0.0 by default, like most web servers built into frameworks I've seen do.

Unfortunately, instead of good intuition, Docker is built on caveats, be it networking, storage, caching, image sharing, container/image distinction, authentication, deployment or building a cluster. Every subsystem I experimented with "works", but fails in weird ways in some situations. In my opinion, that means that Docker is a good idea, but has terrible UX/functionality/error handling. I kind of think the same way of Git.
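A minimal sketch of the "safe default" distinction, using a raw socket (this is an illustration of the bind-address choice, not how any particular framework implements it):

```python
import socket

def open_listener(host):
    # Port 0 asks the OS for any free port; what matters here is the
    # bind address: loopback is invisible to other hosts, while
    # 0.0.0.0 accepts connections on every interface.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))
    s.listen(1)
    return s

local_only = open_listener("127.0.0.1")   # the reasonable default
everywhere = open_listener("0.0.0.0")     # should be explicit opt-in
print(local_only.getsockname()[0])        # 127.0.0.1
local_only.close()
everywhere.close()
```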


Great point. Depressing how such high-profile projects can have such insane defaults.


I believe Elasticsearch doesn't allow restricting access by requiring login unless you pay for the enterprise version, which is just straight up stupid.


The basic license is free now, so you can get basic authentication.

But I still wonder why that isn't part of the open source version, and why it isn't turned on by default....


They're everywhere. Just ask Shodan.


I remember there was some brouhaha a while back about how Shodan was able to discover services on IPv6 even though the address space is so sparse. Apparently they were running enough of their own NTP servers to reliably map out lots of devices on IPv6.


Not being able to map ipv6 space is a myth. There's plenty of workarounds.


Such as? Not too familiar with the subject, would love to know how.


It's pretty old news these days; my guess is that what has leaked from private into public sector tooling is just the beginning, really.

I'd suggest starting at arXiv. This is not a hidden field for the active and/or keen researcher.


> Shodan is a search engine that lets the user find specific types of computers connected to the internet using a variety of filters. Some have also described it as a search engine of service banners, which are metadata that the server sends back to the client.

Interesting.


Since I learned about Shodan, I'm convinced that the (subjective) increase in reported data breaches is just due to an increasing number of people looking through Shodan results, and doesn't have anything to do with any trends in security.

Security standards at any company have always been low, but now it's easy even for a layman to find leaked data.


Not sure about Google Cloud, but Elasticsearch on AWS doesn't support x-pack security. You can only secure your instance via IP restriction, otherwise you have to sign your requests, which is not always supported on Elasticsearch DSL libraries that are commonly used.


It's not hard to secure ES. If people need help, please just ask, happy to help.


This is why I lie about my birthdate by a couple of days on anything where it's not something like a medical record or where I am required to tell the truth for whatever reason. I also never provide my social security number unless it is required by law.


One of my coworkers generates a fake middle name for every service they sign up with. According to him, this serves as a unique identifier allowing them to determine when a service is selling their data to a third party (or data is being leaked).


Fastmail has subdomain addressing, so if your email is jondoe@example.com, you can use hn@jondoe.example.com to sign up for HN.

That way you'll know for sure who leaks your data, and nobody's going to strip it away like some services strip plus addressing (as in, jondoe+hn@example.com).


I have excellent results with a subdomain. Even though PDL probably has a lot of data on me, they have (not yet?) been able to glue it to my primary mail address. That one only has my name, gender, github, country and name of my employer. They can't seem to map the remainder to anything else.


From what I could see the data returned on me was all derived from publicly available sources (eg: my "public" LinkedIn page, my public github page etc). Perhaps others have more but this looks more like an aggregator of public information than a breach of non-public information.

Having said that, I find these companies unspeakably evil - their intent is to make money by harming people (eroding their privacy by making otherwise private personal information easier to get, obviously a gold mine for identity thieves etc).


In retrospect, it would have been interesting to have a bunch of accounts each containing a unique "map trap", at all of the larger services. Then years later, when the aggregator/broker guys get hacked/sold/leaked, you'd have some picture of the genealogy involved.


The problem is that you often can’t find access to the actual “password” used in the breach. Does anyone know where I can see if it was an actual password or just some made up thing?


I was suggesting something different. Specifically open an account on every service as a tracking canary, say with the same email to help them tie them all together. But on each one, vary something slightly like phone. Then years later, when looking at a leaked aggregator entry, all the phones on the record should tell you all the places they bought/stole data from.


There was no password on the original ES instance; it was open to the web.


I meant my password.


There’s a torrent going around


Do you know where I can find a torrent for this leak?


I don't think it's wise to add a magnet link here.

But as I recall looking for Breach Compilation may help finding the requisite gist on GitHub.


This is all scraped public social media data. No credentials or govt information. It's very easy to download or buy this data legally.


It appears to also contain information possibly acquired from other companies. For example, the author notes that a phone number AT&T assigned to him, which he never used or shared, was attributed to him.


Guarantee the fine print in the AT&T contract authorized them to share information with a third party.


Back in the day (maybe it is still this way) your landline was default "listed" and you had to pay a monthly fee to be an unlisted number. So AT&T most likely listed his number in some kind of phone book / directory.


Yeah, I'm wondering if this is all scraped public data or a breach of some kind. Are landline phone numbers published in a directory (like a phone book) in the USA?


Yeah. Email addresses? Phone numbers? All of that is practically public at this point anyway. This article is crying wolf. Someday there is going to be a massive credential compromise.


Are you sure? How did you come to that conclusion? Thanks for the info though, very glad to hear it.


Yes, there are dozens of these data enrichment companies. They scrape public sites and use browser extensions, SaaS tools, inbox addons, etc. They mix it together into profiles, and pretty much have the same dataset by now.

Clearbit is one of them and even a YC company.


Yep! And as someone who has worked with these data sets and worked on the scraping tools on services like LinkedIn, a lot of the data is outdated, incorrect, or mixing together different entities with the same name into one person or splitting the same person into separate entities incorrectly.


Look at the personal record that was in the article. It looks like aggregated public information. And look at what the companies referenced in the DB do.

It's possible there's someone selling them some not-quite-public info, too, but it's probably more like phone numbers and less like private messages on Facebook or LinkedIn.

The title reads like data from 1.2B profiles was leaked by Facebook and Linkedin, but this looks like scraping public profiles from them.


I mean, this is literally just a leak of data that People Data Labs is selling to anyone who signs up to their service. The 'leak' is just bypassing their payment requirements, so by definition all the data leaked is available for purchase.


Genuinely hope somebody goes to prison for this, but not gonna hold my breath.


This data is accessible at small scales just by registering for a free api key at People Data Labs and making a GET request, and if you want more robust access you could just pay PDL for it.


Sorry, I should have been clearer, I'm talking about whoever is responsible for leaving it completely open to the public internet.


I mean it is INTENTIONALLY exposed to the public... the only mistake is they are giving it away instead of charging for it. If you don't like it when they give out all the information for free, it doesn't make it better if they charge money.


The only person harmed directly seems to be PDL since they may find it harder to charge subscriptions for bulk access to the data.

I am not sure why this stuff being online in bulk is so much worse than being online behind a paywall that someone should actually go to jail for it.


Depending on the countries the data is hosted in and the attacker lives in, it's unclear any law has been broken that would land a person in jail.

If PDL had a flaw in their implementation that allowed someone to scrape them (or they didn't and someone did the hard work of creating 1.2 million fake accounts to register for 1,000 free API calls), it might be an uphill battle to prove even "unauthorized access."


Where can i download the data?


LinkedIn is the last social media membership I have. I've been mulling over whether to delete my account because I'm not sure how it will look to prospective employers.



Thank you for writing this. Much like the fear you expressed, I'm going to delete my account as soon as I lock in my next job.


Good thing I just updated my LinkedIn profile. Wouldn’t want hackers to think I have gaps in my resume.


I've gotten some strange spam phone calls this last week, including like 3 from Egypt. Wonder if this is why.


Probably unrelated. These security researchers found this open database, it doesn't necessarily mean someone else found it.


I guess it's time to start leaking billions of records of junk data to pollute the waters.


There are estimated to be 4.4 billion internet users in 2019, so this is over 25% of the people on the internet.


Maybe I am missing something here, but I do not really see the scandal with this "leak", and I rather think the term is misleading in this context.

What happened?

As far as I understand, there are companies who search the web for public data of people like me, without my consent.

Then they sell that data. Also without my consent.

So that data was available anyway, almost for free. If this data contained sensitive information, then I would see this business practice as a scandal.

But the mere fact that all this data, which was gathered without consent, is now available for free because of a possible db misconfiguration.. is not a scandal to me.

And a leak is usually when a company loses sensitive data of its customers, who expected that data to remain confidential, like emails. Not what happened here. Feels more like PR.


I don't know about other people, but I have zero personal info with LinkedIn and Facebook.

The only info they have about me is info I don't mind being public. If I want something to be private, I don't tell it to them. It's as simple as that.

Google, on the other hand, knows lots of private things.


Through shadow profiles, third-party submissions, cross-site cookie tracking, and integration of offline data records, this almost certainly is absolutely false.

Unless you've directly pursued all legal (or otherwise) mechanisms to ascertain this directly, the best you can say is that you're unaware of any information that's been acquired, and that you didn't knowingly or intentionally contribute any yourself.

The article here describes precisely this practice, in its fourth paragraph and following, in the section titled "Data Enrichment":

For a very low price, data enrichment companies allow you to take a single piece of information on a person (such as a name or email address), and expand (or enrich) that user profile to include hundreds of additional new data points of information. As seen with the Exactis data breach, collected information on a single person can include information such as household sizes, finances and income, political and religious preferences, and even a person’s preferred social activities.

Please let's put this canard to rest.


Facebook has a lot of personal information about you even if you have never had a Facebook account. For example: your GPS location data, approximate age, gender, ethnicity....

Welcome to the future komrade. Sadly, it's not a matter of just "not giving them" your location data. Your devices supply it.


And your friends too. I dutifully kept a new number out of FB until a friend messaged me with, is this your number right? Xxx-xxx-xxx. They can also tag you and auto tag you through face recognition.


Cyber alarmists would call a telephone directory 'a verified threat incident'. Yet these are the same companies selling OSINT data. These alarmist groups need to put down the buzzwords, step down from their high horse, and take a look at the hypocrite in the mirror.

If you use social networks, you don't have a reasonable expectation of privacy. You've published your data publicly. If you want to keep this information private, then don't publish it on the Internet.

From: http://www.dmlp.org/legal-guide/publication-private-facts

>2. Private Fact: The fact or facts disclosed must be private, and not generally known.


> In order to test whether or not the data belonged to PDL, we created a free account on their website which provides users with 1,000 free people lookups per month.

Well that's very generous of them. Now I know what I'm gonna do next.


Is there a way for an individual to check if he's in the dataset? I am curious about what kind of data they'd have aggregated about me.


When this type of leak happens, where does this data actually appear? On the dark web? Who has access to this and how does one get it?


Seems like the ball is with Google at the moment, the exposed data is on their GCP servers. So, they can figure out next steps.


Imagine the equivalent in another industry:

“Hello, Bank of America? There’s an ATM of yours that’s spitting out cocaine.

Yes, I understand that it’s probably not your cocaine and that’s not your business, but don’t you think you should maybe shut it down?”


But would you call VendingMachinesCo because there is a vending machine outside the local supermarket, operated by said supermarket, that spits out cocaine? Pretty sure that whatever you put in there is the machine owner's responsibility, not the manufacturer. GCP does not put content in their VPSes themselves the way that a bank operates an ATM.

I think it's more like the responsibility of an ISP to poke their noses in what they transfer, since it might be illegal content (similar to whether Google should poke their noses into people's VPSes). I'm not sure if we should want to require them to do that.


Get a court order. No infrastructure company on its own should be making value judgements about what it hosts.


Why are there people running anything publicly accessible?

If you are running on the cloud, there is no need for any VMs to have any public IPs at all. Exception for your Bastion host, and even that should be restricted to known networks.

All incoming traffic needs a layer of indirection. On cloud providers that's usually their load balancers.
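
To sketch what the above looks like on GCP (all resource names, tags, and IP ranges below are placeholders for illustration, not anything from the article):

```shell
# 1. Create the database VM with no external IP at all (--no-address).
gcloud compute instances create es-node-1 --no-address --tags=es

# 2. Allow SSH only to the bastion, and only from a known office range.
gcloud compute firewall-rules create bastion-ssh \
    --direction=INGRESS --action=ALLOW --rules=tcp:22 \
    --source-ranges=203.0.113.0/24 --target-tags=bastion

# 3. Allow only the internal subnet (e.g. your app servers or the
#    load balancer backends) to reach Elasticsearch on 9200.
gcloud compute firewall-rules create es-internal \
    --direction=INGRESS --action=ALLOW --rules=tcp:9200 \
    --source-ranges=10.0.0.0/8 --target-tags=es
```

With that setup there is simply no route from the public internet to port 9200, regardless of how the software on the VM is configured.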


I wonder whether FB/Linkedin can manipulate the timing of negative news like this, for strategic reasons...


Facebook/LinkedIn are not implicated in the breach at all; it was some random third-party data enrichment service. The Facebook/LinkedIn in the title refers to the fact that people's FB/LI accounts were one of the fields in the database. So were their Github, and basically any other public-facing account that these scrapers can gather.


ES is the new Mongo. If you make software this easy to use, then people with little or no experience are going to use it. Just have secure defaults, like authentication. How many times do we have to learn this lesson...
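
For what it's worth, locking it down is only a couple of lines in elasticsearch.yml (a sketch; basic authentication has been in the free tier since 6.8/7.1):

```yaml
# elasticsearch.yml
# The out-of-the-box default already binds locally; someone had to
# change network.host to expose the node in the first place.
network.host: 127.0.0.1

# Require authentication. Built-in user passwords are then set with:
#   bin/elasticsearch-setup-passwords interactive
xpack.security.enabled: true
```

The problem is that none of this is *required*: a node with security left off still starts and serves requests without so much as a warning banner on the API itself.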


Isn't it creepy that People Data Labs, "a data aggregator and enrichment company", collected data on 1.2 billion people?

Isn't it exactly what GDPR came to prevent? Are there no Europeans among this group?


Welp, time to change all my passwords, maiden names, and friendships.


The IP address in question does not seem to be working at this time. Clearly whoever runs the server has shut off access. I wonder if someone managed to save a data dump somewhere?


Unless we go after every customer who used the services of PDL, nothing is going to change. We will see a $3 fine per individual after 1 or 2 years of talking about this.


When will people start going to prison for stuff like this?


For what? For scraping public data?


When there's a law against it and there's evidence of breaking that law.


"including close to 260 million in the US."

So basically, _everyone_ in the USA minus those not online. And I bet this will go unreported by the mainstream media.


Is there any way to see what data they had on me?


"1B" is a surprisingly bad abbreviation here, considering its resemblance to a much ... less impressive number.


Ugh. To whoever is currently wasting their time and effort on differential privacy, take a good long look.


Why? Interested in why you think differential privacy would make any difference... The fault here seems to be an open es server.


That is precisely my point. Differential privacy would NOT make any difference, and I was pointing the many folks who are working on it to the much simpler issues that are in fact being encountered in the field. This past IEEE S&P had quite a few theoretical privacy talks.


Is this public facing information that's been crawled, collected, and categorized?


Is it illegal to download/scrape data from a wide open database like this one?


I just had a look at “my” data on this and it is almost hilariously wrong.


Where can we look up our data?


To check if you're affected, use haveibeenpwned.com


does anyone know how we can search the data to find info about our (more than likely) entries in this database? or did they simply find it but not release the info?


Not the aggregate data set, but one of the two data sources (People Data Labs) offers free access for under 1,000 searches per month.


Does this mean we don't need to do a census any more?


It would be a shame if someone corrupted these ES indexes.


>According to their website, the PDL application can be used to search:
- Over 1.5 billion unique people, including close to 260 million in the US
- Over 1 billion personal email addresses
- Work email for 70%+ of decision makers in the US, UK, and Canada
- Over 420 million LinkedIn URLs
- Over 1 billion Facebook URLs and IDs
- 400 million+ phone numbers
- 200 million+ US-based valid cell phone numbers

Too bad there aren't any laws regulating this sort of private data aggregation and sale. Well, besides GDPR (which apparently isn't enforced) and CCPA (which won't be enforced either.)


Let me make sure I understand: If I take gigabytes of “enriched” personal information and make it available to the public for free, then I’m an irresponsible, idiotic, incompetent buffoon. But if I put a paywall in front of it and sell that same data for a fair price, then I’m a business genius?

Seems to me that if the data is legally acquired and can be legally distributed, doing so at a cost of zero does not constitute a data leak. It may be bad business, but since when is that a crime?


that IP:9200 address is down, any mirrors?


where can I download the leaked data?


Data Enrichment Companies. Marketing speak for highly vulnerable privacy eradication service.

Vote #1 for some sort of global GDPR where these businesses are no longer profitable.


Can I do a GDPR request for the data about myself? How?


It's weird because for oxydata you have to contact their sales team... but peopledatalabs has an opt out form.

https://www.peopledatalabs.com/opt-out-form

People Data Labs privacy policy:

3. ACCESS TO AND CONTROL OVER INFORMATION

A person may do any of the following at any time by contacting People Data Labs at support@peopledatalabs.com. People Data Labs will reply to a person’s request within five business days.

A. Access any information we have on them, if any.

B. Change, correct, or delete any information we have them, if any.

C. Express any concerns about People Data Labs using their information.

People Data Labs' team will act swiftly upon a person’s email request to change, correct, provide, delete, or explain anything a person query.

People Data Labs understands if a person would like to opt out of People Data Labs' database. Opting out will stop all data sharing and enriching of all PII in People Data Labs servers for that person. Click here, if you would like to opt-out, or choose to have all data about you removed from People Data Labs' database.

For https://www.oxydata.io/: Review and changes to your information Contact us at sales@oxydata.io to find out what information we have collected about you, and to request any changes to or deletion of it.


I want someone to start an opt-out service, where I send them $20, and they send a book of names by registered mail for opt-outs every month.

An online opt-out system is too easy for them. I want each one to get a phone-book sized list of opt-outs every month.

And the same for data requests. Someone that curates the data collectors, and sends them requests every month.

Do you know which country’s “do not call” list I want to be on? All of them! Get my number on the AU list, the UK list, the DE list...

Let’s crash the system.


Would be a great idea and I bet it could be successful, but maybe at a lower price point and using lots of automation of opt-out forms. As far as opting out of many credit reporting agencies, check out https://www.consumer.ftc.gov/articles/0262-stopping-unsolici... https://www.optoutprescreen.com/?rf=t

Also it seems like there's a service like this called Delete Me, but it also seems like they're a manual opt-out shop. Would be cool if you could find a way to not have humans doing it. Bet they're just having people on Amazon Mechanical Turk fill these out or something like that. https://joindeleteme.com/how-we-work/


Easier to just send them your own template on paper instead of using theirs.

It should be like a doctor’s prescription in a lot of places: as long as it’s on paper and has the right elements, it’s valid.


Well then that's the trick. A legal research team that develops the form for as many sites as you could find, and then a mechanism to send that form, filled with each user's data, to those sites.


Like, what more do they need than disambiguating identity info and a declaration that I'm opting out? E.g. name and DOB?

My only fear is that you're now sending this all to them, but in 2019, we can safely say your name+DOB+address isn't a secret. Or national identity number if that's a thing in your jurisdiction.

It's the metadata around it we want wiped out.


> I want someone to start an opt-out service, where I send them $20, and they send a book of names by registered mail for opt-outs every month

This exists but it's not cheap: https://www.abine.com/deleteme/


I'm not rich but $129/year isn't bad. I'd hesitate mostly because I assume such services are scams.


It's a legit service. I use them and they did ensure that my data was removed from the services they specified. Obviously I'm just some person on the internet so my statement has no intrinsic credibility, but I believe they were also validated in a nyt article awhile back.


“DeleteMe experts find and remove your personal information.”

Blargh, let the data broker figure out if I’m in their DB or not.

Trying to determine that myself seems risky. Better to send the request to every broker in existence.


Actually working on that project right now - www.thekanary.com. Super early stage but have a big list of brokers and opt out links that I'm automating. Would love early feedback.


Thanks a bunch for compiling those links/emails. I've unsubscribed myself and alerted my family.


They list 2 companies as owners of the data in the article. I guess that would be a good place to start. I'd love to do that but I'm not in the EU.

But the article says it's possible the actual leak comes from a customer or former customer of these companies, and the actual ownership is so far a mystery.


>Can I do a GDPR request for the data about myself? How?

And send it where? It's unclear who owns this server


Google are jointly liable for this service, so if you can't find a contact point, then you can email google with the service IP. They will more than happily point you on to the customer to avoid being taken to court.


Seems like OxyData and PDL directly have more up-to-date records anyway.

Could the server owner (Google) have to fulfill the request? Probably not, but interesting to think about.


Start with Google, they will need to figure out and know who the actual owner is and who paid for the resources to host it.


People Data Labs?


From the article it seems that you can just create a free account and query your own name.

> In order to test whether or not the data belonged to PDL, we created a free account on their website which provides users with 1,000 free people lookups


I wonder how high the GDPR fine will be.


Is Elastic going to be punished under GDPR especially given that it's a Dutch company?


Really interesting legal question - "Seems like the ball is with Google at the moment, the exposed data is on their GCP servers. So, they can figure out next steps." is a comment above. How will the chain of insecure infrastructure + the data scrapers + the people responsible for configuration react?


That is a terrifying thought with terrible chilling effect should somebody official would even voice this thought in any way.


Was this an AI-generated sentence?


There's a video at https://www.youtube.com/watch?v=VNLEEogFo18 where People Data Labs' chief executive speaks at an insurance conference this year about their business.

They describe the data as being sourced from a 'data co-op' of over 1k companies which share data. It wasn't clear whether that means that those companies are collaborating and pooling data, or whether it's a roundabout/wordy way of saying that they scrape public personal information from thousands of sites.

They also claim that they're GDPR and CCPA compliant; I'm no expert but I do find one or two references that seem to suggest that scraping EU citizens' personal data without consent hasn't been GDPR-compliant for some time.

It does also raise another question: even if PDL themselves aren't GDPR-compliant, would any resulting fines against them reclaim a significant portion of the utility captured from the distribution of that data? As per comments on this thread, PDL API keys seem to be free to create.

Hypothetically speaking it could be within the interests of a group of businesses to provide a small amount of funding towards operation(s) that harvest and redistribute personal data: if the revenue base is low, the operation(s) can eventually fail (once legal proceedings catch up with them) and the group as a whole incurs little cost.

The speaker also takes a question from the audience regarding potential use-cases for this kind of personal data, and answers that knowing about an individual's life events (such as marriage) can be an opportunity to sell products to them, as can differentiating pricing if they'd just started smoking cigarettes.

Although I'm no expert, my understanding of insurance has been that risk is spread across a large pool of customers, allowing them each to pay similar premiums despite potentially slightly different backgrounds, with the understanding that they mutually benefit by paying into a shared fund so that the (random, potentially high-cost) risk of loss to each member is greatly softened.

We're seeing a situation here where more precise, per-individual data is being collected across large populations and could potentially be used for price differentiation.

If the insurance industry doesn't defend itself, this could lead to premiums which are essentially calculations based on 'pre-existing data' -- information which the consumer may not have consented to sharing, and which an insurance company might not be able to collect from application forms.

We don't seem to be particularly good, collectively, at escaping from cycles which seem to introduce or further wealth disparity at the moment and I worry that this kind of tech-driven attempt to optimize revenue efficiency of the insurance industry would only lead to further inequality.



