I was at an Elasticsearch meetup yesterday where we had a good laugh about several similar scandals in Germany recently, involving completely unprotected Elasticsearch instances running on public IP addresses without a firewall (e.g. https://www.golem.de/news/elasticsearch-datenleak-bei-conrad..., in German). This beats any of that.
Out of the box it does not even bind to a public internet address. Somebody configured this to 'fix' that and then went on to make sure the thing was reachable from the public internet on a non-standard port, which on most OSes would require you to disable the firewall or open a port. The ES manual section for network settings is pretty clear about this, with a nice warning at the top: "Never expose an unprotected node to the public internet."
Giving read access is one thing. I bet this thing also happily processes curl -X DELETE "http://<ip>:9200/*" (which deletes all indices). Does it count as a data breach when a member of the general public cleans up your mess like that?
In any case, Elasticsearch is a bit of a victim of its own success here and may need to act to protect users against their own stupidity, since clearly masses of people who arguably should not be making technical decisions now find it easy enough to fire up an Elasticsearch server and put some data in it (judging by the number of companies getting caught with their pants down).
It's indeed really easy to set up. But setting it up like this still requires RTFMing, dismissing the warning above, and having some clue about what IP addresses and ports are and why having a database with full read-write access on a public IP and port is a spectacularly bad idea.
I've been using ES off and on since before 1.0 came out. It has always baffled me that ES doesn't require a username and password by default.
ES is a database that has to exist on a network to be usable. Heck, it expects that you have multiple nodes, and will complain if you don't. So one of the first things you do is expose it to the network so you can use it.
Yes, it takes some serious incompetence to not realize you need to secure your network, but why in the world would you not add basic authentication into ES from the start? I'd never design a tool like a database without including authentication.
I am serious about my question. Could anyone clue me in?
It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only. Running things on a public ip address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).
If you run MySQL or Postgres on a public IP address, it is equally stupid and irresponsible, regardless of the useless default password that many people never change, unless you also set up TLS properly (which would require knowing what you are doing with e.g. certificates). The security in those products is simply not designed for being exposed on a public IP address over a non-TLS connection, and pretending otherwise would be a mistake. Having basic authentication in Elasticsearch would be the pointless equivalent: base64-encoded plaintext passwords (i.e. basic authentication over HTTP) are not a form of security worth bothering with. Which is why they never did this. It would be a false sense of security.
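To make that concrete: HTTP basic auth without TLS is just an encoding, and anyone on the path can reverse it. A quick illustration in Python (the credentials are made up):

    import base64

    # What an HTTP Basic auth header looks like on the wire over plain HTTP.
    # "admin:hunter2" is a made-up credential pair for illustration.
    header = "Basic " + base64.b64encode(b"admin:hunter2").decode()
    print(header)  # Basic YWRtaW46aHVudGVyMg==

    # Anyone who can observe the traffic recovers the password in one line;
    # base64 is an encoding, not encryption.
    print(base64.b64decode(header.split()[1]).decode())  # admin:hunter2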
At some point you just have to call out people for being utter morons. The blame is on them, 100%. The only deficiency here is with their poor decision making. Going "meh, http, public IP, no password, what could possibly go wrong?! let's just upload the entirety of LinkedIn to that." That level of incompetence, negligence, and indifference is inexcusable. I bet MS/LinkedIn is considering legal action against the individuals and companies involved. IMHO they'd be well within their rights to sue these people into bankruptcy.
Software should be secure by default. Don't blame the user.
MySQL, in comparison, won't even let you install without setting a root password. And it only listens on localhost/unix socket by default. You then need to explicitly add another user if you want to allow logins from a non-local IP. I don't think it's even possible to both set a blank root password and allow login from a public IP.
So you really think the solution is to blame some low-level worker, and sue him/her? The blame should always be on the people in charge, usually the CEO, who set the bar for engineering practices, proper training, etc., or the lack thereof.
While I don't think blaming labor is constructive or ethical, it seems like most tools pose danger to users in proportion to utility. For example, cars can squish people, electricity can fry people, and power tools can remove limbs.
Typically, people start out using knives and bicycles as children, learn through experience that crashing and getting cut hurt, and carry those lessons forward when they start using tablesaws and cars later in life. How does this apply to elasticsearch? I have no idea.
We could teach our children that software is very dangerous, especially databases. Or we could make software secure by default.
But we also need to teach the user how to use the software properly. Learning by getting hurt is effective, but then we also need to have playgrounds.
That MySQL stuff is all quite recent... up until 5.7 (?, one of the most recent releases, anyway) there's no root password by default and running `mysql_secure_installation` is a common (but not mandatory) step to, well, secure the installation and set a root password. I think MariaDB still works this way? Not sure.
I'm not aware of "bind to localhost" being the default, either. The skip-networking setting to only allow local socket connections is definitely not the default, and I'm pretty sure the default is still to bind to all interfaces.
I installed MySQL a couple of months ago on an Ubuntu server and got asked to set a root password. I've also installed MySQL many times on Windows. Secure install is the default. And it doesn't annoy me a bit. I like my software to be secure by default.
Software should be built to deliver maximum value to its users. A trade-off favoring usability can be made in certain cases, like ease of use for new software. Redis went through this debate a while ago: http://antirez.com/news/96.
Engineers should know their tools before using them. It's a huge part of our jobs. You could introduce a ton of other vulnerabilities into software: XSS, SQL injection, insecure cryptography. Security is part of our job and something we must know.
You don't blame the plane for a pilot's mistake that was meant to be part of his training. Engineers in every other sector are responsible for their mistakes; we should be too.
Also, you don't sue the worker, you sue the company.
"Software should be built in the best method of delivering maximum value to its users."
Yes, and defaulting to insecure, thus repeatedly causing huge data breaches, is the exact opposite of delivering maximum value to users. It's delivering maximum liability.
I would argue that the single command to begin using the application and the ease of onboarding / querying data was a huge factor in expanding its usage. Elastic optimized for initial spin-up and getting things running fast. It works really well! Until you load it full of data on a public IP, that is.
That single command to spin up the application could easily generate and display a copyable random secret required to use it, so that it stays easy to use but there's no option to run it that insecurely.
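A minimal sketch of what that could look like, assuming a hypothetical data store that refuses all connections until a secret exists on disk (the file name and token format are my own inventions, not anything Elasticsearch does):

    import secrets
    from pathlib import Path

    SECRET_FILE = Path("bootstrap_token")  # a real service would use its data dir

    def get_or_create_bootstrap_token() -> str:
        if SECRET_FILE.exists():
            return SECRET_FILE.read_text().strip()
        token = secrets.token_urlsafe(32)   # ~256 bits from the OS CSPRNG
        SECRET_FILE.write_text(token)
        SECRET_FILE.chmod(0o600)            # readable only by the service user
        return token

    if __name__ == "__main__":
        # Printed once at first startup; clients must present this token.
        print("Connect with token:", get_or_create_bootstrap_token())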
Onions. You need layers and defense in depth, because even the best humans make mistakes and it is inhuman to assume perfection. Never rely on just one engineering feature.
Honestly a lot of the problem is: people aren’t studying systems engineering OR security. Look at all the “learn to code in 21 days” BS and all the code academies.
There’s so much emphasis on abstracting away the systems with cloud-this and elastic-that and developers don’t know much about general systems engineering.
My recommendation to software developers: take the Network+ and Security+ exams at the bare minimum.
Honestly, as much as people complain about process getting in the way of things, there should be checks and balances at any business that deals with personal information. Financial institutions are heavily regulated; these fkers should be held accountable.
Maybe the hint is right there in your comment. Nearly all the people deploying these nodes aren't engineers in the slightest, despite someone having given them such a title.
Because... they dance the devops dance with their devop hats on! Security problems can be swiftly danced around until they actually surface, and can then be handled in the next round of "continuous delivery". It's also smart to postpone solving most issues until after they occur, so sales can continue bragging about "continuous improvement".
So, after some thought, here's why I don't consider it pointless to have basic auth built in.
It would keep ES from being completely open. If you wanted to get in, you'd have to compromise some part of the network that would let you read the username and password.
The way it is now, anyone can do a scan for port 9200 and get full access right away.
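For anyone who hasn't seen it, this is roughly all it takes; a sketch using an address from the documentation range rather than a real host, and the standard _cat/indices endpoint:

    import requests

    host = "198.51.100.7"  # example address; scanners just iterate the IPv4 space
    try:
        # _cat/indices is a standard Elasticsearch endpoint; an unprotected
        # node happily lists every index with document counts and sizes.
        resp = requests.get(f"http://{host}:9200/_cat/indices?v", timeout=5)
        print(resp.text)
    except requests.RequestException:
        print("closed, filtered, or not Elasticsearch")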
It is also important to have a username and password, even on secured networks. My test instance is on an internal network, and protected by both network and host firewalls, but I still make sure to secure it beyond that.
Basic auth would not provide a false sense of security. It is simply a very basic part of overall security. Not having it is a mistake.
> At some point you just have to call out people for being utter morons. The blame is on them, 100%. [...]
Your attitude is a symptom of a broader issue that plagues this industry: indifference to risk*probability. If you don't ship software with "secure defaults" (depending on the threat/attack model), you are essentially handing out loaded shotguns, then blaming the "dumb" user when they inevitably point one at their foot and pull the trigger. Easy solution: don't hand out the gun loaded; make the user take specific actions to enable that usage. Yeah, it creates some friction for first-time deployment, but that's a secondary concern to having your freaking DB leaking all over the place.
If firing up a piece of software creates an unauthenticated, unprotected (non-TLS) endpoint to read-write data, that's a loaded gun. That is PRECISELY the default behavior of ES.
ES has jacked around for years by making TLS and other standard security features premium. To that, I say this: screw ES and their bullshit business model. Their business model is a leading cause of dumbasses dumping extremely sensitive PII into a DB that is unprotected; those same folks aren't going to go the extra mile to secure the DB, either by licensing or 3rd-party bolt-ons.
That's why it must be shipped secure by default. Anything less is a professional felony, in my eyes. Also, screw ES again, in case I wasn't clear.
Tort law is going to catch up to software soon enough and people will be held accountable for negligently creating or deploying software that they should have known would cause harm.
The fact that someone else down the chain should have known better is not a perfect defense. If that misuse was foreseeable and you didn’t do enough to prevent or discourage it, then you can still be held liable.
"Defense in depth" sounds, to me, like a phrase to justify multiple layers of imperfect security.
A single layer of cloth might not hold water, adding more layers of cloth may hold water for longer, but it's probably more cost effective to start with the right material.
This is a fallacy of distributed systems. Never trust the network. Best case, you get packets destined for somewhere else; worst case, your segmented network wasn't actually segmented.
I agree with GP here. ES is to blame. Not long ago Apache Airflow had a similar vulnerability discovered, about not having sensible authentication defaults. The reasoning on their mailing list was eerily similar to that of those defending ES here. Same arguments (IIRC).
History is our greatest teacher. I think ES will end up doing what that team did: agreeing to provide sensible & secure defaults.
PostgreSQL does the following things by default to prevent this:
1. Only listen to localhost and unix sockets
2. Not generate any default passwords
So the only way to connect to a default configured fresh installation of PostgreSQL is via UNIX sockets as the postgres unix user. Where PostgreSQL is lacking is that it is a bit more work than it should be to use SSL.
> It has to exist on a private network behind a firewall with ports open to application servers and other es nodes only. Running things on a public ip address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).
I've met at least one cloud provider in the past (a small Dutch outfit) that provides _only_ public IP addresses. They do have customers, though one fewer now. Clustering over the public Internet is a thing. It shouldn't be, but I could say the same about this website, and yet here we are.
Oh sure, but sad things happen. And they can be even messier: I had a Jenkins instance "made" public because a sysadmin new to a hosting provider forgot to remove the public IP that gets automatically assigned to new things. We were lucky, being fairly sure nothing found it before I realised, but it was a strong lesson learned:
Any network may become public by accident unless you go to great lengths to make sure it doesn't. Configurations change and mistakes are made even by seasoned people. People bring devices. Unless there's an air gap, people's devices may be hacked and let stuff through. Put authentication and anti-CSRF on _all_ your stuff, always.
Honestly, as a user I don't give a shit what a good engineer should do. All I see is that my personal data gets leaked left and right by Elasticsearch, and not MySQL or Postgres. But its fanbois just keep shifting blame instead of reflecting on reality and going "hey, yeah, maybe we should try to do something about it on our end". So fuck ES.
> It has always baffled me that ES doesn't require a username and password by default.
Because auth was part of their paid service (and by paid I mean 'very goddamned expensive') until about half a year ago, when they made it free in response to Amazon's freshly emerged Open Distro free auth plugin.
The usual way of using this service is to have a backend network configured that connects your services and is not reachable from outside (i.e. you have to traverse your services to reach it).
The so-called "security" is just a paid feature for companies that want to use Elasticsearch in the "legacy" way because, presumably, they don't have people to design it correctly.
That's still really insecure, because it means that as soon as someone gains any access to that network, or any of the services on that network has a security issue, your database is wide open.
If someone manages to get access to the network, it's game over. I'd say the public internet with proper (encrypted) password auth is more secure than that.
That's true, but there are usually multiple ways to compromise protected networks. You still need to protect the database against attacks that don't go through the app server.
If you set up elasticsearch on a cloud service like AWS, by default your firewall will prevent the outside world from interacting with it, and no authentication is really necessary. If you do use authentication, you probably wouldn't want username+password, you would probably want it to hook into your AWS role manager thing. So to me, username+password seems useful, but it isn't going to be one of the top two most common authentication schemes, so it seems reasonable that it should not be the default.
MongoDB also by default does not have username+password authentication turned on.
I think defaulting to username+password is a relic of the pre-cloud era, and nowadays is not optimal.
I don't see why, though. It's much safer to start with a secure setup and then have the user disable the security explicitly (hopefully knowing what they're doing). Yes, username/password auth is not that common, but isn't it better than having no auth at all?
Ok, let's say username/password is mandatory and enabled by default. I see two options.
Option one, they generate a unique password for every installation – non-trivial to do, because at what point do you do it? It can't be before the cluster is formed, as you'd have a split brain generating a bunch of credentials. If you do it afterwards, then there is a period of time when your cluster is not yet protected – worse yet, unprotected and handshaking authentication. So you don't do that.
You could make the user input the credentials. What is to prevent them from creating weak credentials? And worse, they have to do that for every node (or at least the masters). Not a good experience and lost credentials will probably be the subject of a good many support calls.
So most products don't do that. What they do is default passwords. Which is arguably no security at all and doesn't protect anything. It may make it just a tiny bit easier to do the right thing afterwards (by changing to better credentials). Still, there's a period of time while the cluster is unprotected (default credentials are as good as no credentials).
Authentication does little to protect against the sort of people who are exposing databases to the public. If it is easily disabled, then they will be doing just that. Because they are already doing that by forcing databases to bind to publicly accessible interfaces.
I'd say option two is the only one viable. You deny access to the service until credentials are set by the user. You print huge warning labels while the credentials are set by the user to remind them of the possible consequences of setting weak credentials.
Yes, lost credentials will be the subject of many support calls. Then it boils down to your priorities. If you care about minimizing support calls, then sure, leave everything open to everyone. It will surely result in fewer access problems.
On the other hand, if your motivation is actually preventing your end-users from doing stupid things, it makes sense to just do the most conservative thing as default. Let the user change to the more liberal option, but not before informing them of all dangers that might befall them in that case.
I refuse to believe in this narrative of the end-user just being a stupid automaton who does not have any agency, and that any default imposed upon them will just result in them overriding the default with their terrible practices and ideas. I think there is a possibility of education and risk reduction.
I'd argue that the "pre-cloud" era is still going strong, and that is a good thing. My workplace has its own data center. There are some downsides, but I prefer it.
So username+password really is needed. And should be included by default.
Also, I'd expect the same of something like MongoDB. That it doesn't have that by default is just baffling.
Password auth over HTTP is horrible. Short of binding a public IP address to your instance, basic auth without HTTPS setup is probably the worst thing you can do.
This addresses entirely the wrong question. By looking at it as a technical problem you're completely missing the broader ethical problem. Why was anyone allowed by law to amass this amount of data? And why did PDS not take the security and privacy concerns of 1.2 billion people seriously enough to ensure the data was handled correctly? They obviously thought it was valuable enough to amass a huge database. Do they sell this to just anyone? If not, who can buy access to this data? How much does it cost, and what steps are involved in doing so?
> Out of the box it does not even bind to a public internet address.
Binding to all interfaces used to be the default in 1.x; it changed pretty much because people were footgunning themselves.
Coupled with lack of security in the base/free distribution, that made for a dangerous pitfall. At least now security is finally part of the free offering, but the OSS version still comes with no access control at all.
That still puts you a single firewall mistake away from disaster. It also places a lot of trust into the applications and hosts that can access ES on a network level: They get full access with no control at all.
To add to that: no security also means no TLS, neither in cluster communication nor when speaking to clients, etc.
I've come across several such ES instances that are 100% exposed to the world without even trying, and ES is by no means the first tool to have this problem. People are never going to stop doing this. Making it annoyingly difficult within ES just weakens them such that some other "wow it's so easy" search product will be better positioned to eat their lunch.
ES, Mongo, and Redis used to be some of the easiest targets for production data (security-vuln-wise): usually deployed by SWEs, often as early versions, and without access control by default.
ES's practice of making its security a proprietary, paid-for product is the cause of these kinds of things. It's a shitty practice, and this is one of the reasons I'm glad AWS forked it.
Other databases learned that not requiring a user/password upon install is completely irresponsible. ES and other dbs need to catch up ASAP, it's ridiculous.
Documentation is not security. If you need to "RTFM" to not be in an ownable state, it's ES's fault.
Trusting software you install to be secure is ridiculous and completely irresponsible, especially if you did not pay for someone else to take the blame.
The only thing you can do to secure your software is to restrict its communication channels. Once you've secured the communication channels, the software auth is decorative at best.
Wasn't this exact same thing a huge scandal just a few years ago for Mongo on Shodan?
I can't believe anyone shipping a datastore could let it happen after that. Doesn't PostgreSQL still limit the default listen_addresses to local connections only? Seems like the best approach. On a distributed store, consistency operations between nodes should go over a different channel than queries and should be allowed on a node-by-node basis at worst. At least at that point, it requires someone who should know better to make it open to the world. And even when just listening for local connections, passwordless auth should never be a default.
This assumes it was incompetence and not done intentionally.
My understanding is neither company is owning this data set and there is an assumption that it is a third company that has either legally or illegally obtained the data and is using it for their own services.
Another option is that the data was exfiltrated by a loose group of people who wanted this to be freely available on a random ip. Know the ip, get sick access to a trove of PII. No logins, no accounts, no trace.
> It's indeed really easy to set up. But setting it up like this still requires RTFMing, dismissing the warning above
I would bet that in a lot of cases, people that configure their servers like in the OP just don’t read the official docs at all.
Stack Overflow, Quora, etc. are great places to get answers, because of the huge amount of questions that have already been asked and answered there.
But when people rely solely on SO, Quora, blog posts and other secondary, tertiary, ..., nth-ary sources of information, Bad Stuff will result, because of all the information that is left unsaid on Q&A sites and in blog posts. (Which is fine on its own – the problem is when the reader is ignorant about the unsaid knowledge.)
> and having some clue about what IP addresses and ports are and why having a database with full read-write access on a public IP and port is a spectacularly bad idea.
Again, not necessarily, for the same reason as above.
But even if they did, it is a sad fact that a lot of people dismiss concerns over security with the kinds of “counter-arguments” that I am sure we are all too familiar with. :(
Thankfully though, we are beginning to see a shift in legislation being oriented towards protecting the privacy of people whose data is stored by companies.
Ideally, the fines should cause businesses to go bankrupt if they severely mishandle data about people. Realistically that is not what happens. For the most part they will get but a slap on the wrist. But it’s a start.
Companies that can’t handle data securely, have no business handling data at all.
My favourite was Bitomat.pl's loss of 17k bitcoins in 2011 because they restarted their EC2 instance.
I understand that the "ephemeral" nature of EC2 was in the documentation, but ESL speakers may have glossed over the significance of a word they didn't fully comprehend.
Not to say this is what people are doing, but I don't think it requires much knowledge to run under Docker, and it's pretty easy to expose it to the public internet that way.
It's a tragedy that all of this data was available to anyone in a public database instead of.... checks notes... available to anyone who was willing to sign up for a free account that allowed them 1,000 queries.
It seems like PDL's core business model is irresponsible regarding their stewardship of the data they've harvested.
Would it be better if this was a paid service? If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.
> If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.
Outlawing the collection of data would be hard and is unlikely to work, but the fact that companies like AT&T are allowed to sell your data, as they did with OP's (where else would that unused phone number come from), is an angle new legislation can use.
The EU already has a piece of legislation aimed at stifling these practices. The US and other economies just need to follow suit.
I'm more thinking that not all data is equal, though we treat it like it is, at least from the public's perspective (it clearly isn't from the perspective of those gathering the data; there's a clear disparity in how these groups view things). Some data is actually necessary to give up to have a well-functioning internet (what browser you're using) and some is not (canvas fingerprinting).
There's a tough question here, because the people deciding what data gets used are not us; it's the websites we visit. I would argue that no real consent is being given, and everything is assumed to fall under "common consent" (which I'm using for lack of a better term: things like the fact that if you walk out in public, people can see you; but conversely, someone can't run up to you and measure your height with a tape measure). There has to be some balance, and what that is, I don't know.
But really the only people who can figure that out are us computer nerds, who at least kind of understand these things. We have to be having these discussions, or else it becomes "fuck Silicon Valley" (a conversation that is already going national). So if we don't think about these things, then we clearly live in a bubble, and bubbles burst. If we do think about them, maybe we don't.
I was recently told how private detectives from a national agency would actually go door-to-door (over a minimal area) under the pretext of being AT&T store / sales employees. They'd try to convince their target (and some incidental neighbors, as cover) to switch their bundled services to AT&T.
The private agents were armed with the latest available discounts (which you could find for yourself if you tried). But their skills made them considerably more successful than a typical front-line sales employee.
The catch? It wasn’t a scam, and they really were trying to get their targets to switch. It seems that AT&T was more willing to sell consumer data than the general public is aware of. Converting their targets to AT&T granted their agency access to additional data which they then to passed onto their clients. And the target gets a discount, too. Win-Win-Win? :)
It seems like that is starting to happen with California's new data privacy law. I'm starting to get a lot of privacy policy update emails like I did when GDPR took effect.
I found a vulnerability in LinkedIn a few years back that allowed anyone to access a private profile (because client-side validation was enough for them, I guess..?)
They didn't take my report seriously (still not completely patched) and I feel like that told me all I needed to know about their security practices.
I reported an issue to the LinkedIn competitor https://about.me two years ago where signing in with my Google credentials gives me access to the account of some random other person with a similar name to mine. I think that during registration I attempted to register about.me/johnradio (except it's not "johnradio"), but he was already using it, and then the bug occurred that gave me this access.
I randomly check every 6 months or so and yep, still not fixed.
My gmail is my first initial followed by my last name. There are other people on this planet with same first initial and last name, some of whom seem to think that must be their email too, because I keep on getting emails where they used it to sign up for things.
I had a lady send me a zip file that contained a VPN client, certificate and a word document with usernames and passwords to the VPN and a number of industrial control systems at the factory she was a manager of.
Every few months I get scans of X-rays from random clients' teeth from some dentist in South America. I've tried so many times to respond and/or unsubscribe but never hear anything back.
I faced the same problem (though my name is not at all common). Banks and mobile companies never did anything, even after I repeatedly told them by phone and on Twitter (and I have kept a record of it).
One day, after I had been receiving a person's bank statements, mobile statements, and many other bills for a few months, I decided to call him (his number was easily visible in many of the emails) and inform him of his mistake. He turned out to be a lawyer, and he said he would "decide" what to do about it. The next thing I know, he sent a carefully drafted email (as a legal notice) saying that I should hand over my email address to him without further delay, and all that.
I didn't do that. I talked to a lawyer friend and he just told me to reply with a "G F Y" card. I didn't do that either. But it pushed me to finally move my email to my personal domain, since it was/is a Gmail account, and if someone had complained, Google might have just terminated my account; and I don't know anyone who works at Google.
That lawyer sounds like a douchebag.
I super agree with your point too: I'm also slowly moving all my emails to my personal domain and it feels liberating.
I get bank statements, job offers, party invitations, and lately a bunch of, let's say, very questionable email verifications from euro 'dating' sites. I've identified the guy in the UK, but it's too much (and getting embarrassing now) to keep forwarding his stuff to him.
Downside of getting in early on popular email services.
> but it's too much (and getting embarrassing now) to keep forwarding his stuff to him
What amazes me is when I get misaddressed email, and I reply to say it's misaddressed (and I'm not talking about automated services, I'm talking about obviously manually sent stuff), and my reply just gets ignored and the misaddressed email just keeps on coming.
Somebody keeps phoning me and leaving messages. They don't answer their own phone (or messages clearly). I even have a sarky voicemail now, you'd think they'd notice. Nope!
Lady, whoever you think is going to be at that funeral isn't getting that message.
I've no idea if they'll get disconnected now that I've blocked their number. Hope so; maybe they'll notice then.
My gmail is two initials and last name, so theoretically less susceptible to such errors. Yet I get misaddressed mail all the time—and a surprising amount of it is job applications!
Trust me, I used my full first name, it's not enough to stop these people. One is a UK doctor, one is a US teacher, and I think there are one or two more. Been sent a few baby pictures from their relatives too.
For a while, our Comcast billing account accessed some other person’s account. Comcast didn’t take it seriously, and just told us to create a new account and not use the old one. (!!!)
We had full access. I could have signed this person up for the most expensive package, or even canceled their service.
I managed to cancel my dad's after he died. They STILL tried to upsell me! One of my favorite phrases ever uttered: "He's dead, you asshole, he doesn't need more channels!" And that actually did it. Felt sorry for the salesperson, who didn't have much of a choice in the matter...
Surely by making it difficult to cancel they’re really just making it easier for people to get discounts. If I were a Comcast customer I’d be calling up to cancel every few months.
To be fair, the internets would have been equally outraged if there weren't such a requirement, because sure as hell somebody would have found an exploit and cancelled a bunch of accounts, just for funzies.
I signed up for a disposable Gmail account using my real name at one point, and accepted the randomly suggested address it offered. Gmail loaded with someone else's obviously in-use mailbox.
IIRC I logged out and back in, same thing, my credentials worked. Went back to it a few days later and the password no longer worked.
I think it's like EC2 instance IDs. When they first came up with it, they never thought there would be literally billions of unique email addresses/EC2 instances eventually.
I can only imagine about.me mass-creating profiles for names found on other web pages, and opening a way for someone to "claim" those profiles with a matching Google account sign-in.
About.me's business model was quite unsettling to me and they have made little to no effort to protect the user data from scrapers.
I had a similar experience. In 2014 I reported an issue where you could take over someone's account by adding an email you control to it and having them complete the flow by sending them a link (which, unless they looked very carefully, looked exactly like the regular log-in flow at the time - especially if they used a public email service and you registered a similar-looking account).
I tried it on a friend and it worked, but LinkedIn's response was basically "meh".
My life has only gotten better since I deleted LinkedIn a few years ago. I know I'm in a privileged position to be able to do that, but I strongly recommend everyone here consider whether what they gain from their account is worth the crap and spam they have to put up with.
LI is terrible if you actually try to use it, but it's harmless enough if you just use it as a profile hosting service, where people are likely to look. I just auto-archive their emails and only visit the site a couple of times per year.
While not good, what's the connection to this story?
The article says some LinkedIn data was scraped, but I don't see anywhere that it specifically says a LinkedIn security flaw was used in the scraping. Although it is vague about what data was scraped and how, so it doesn't preclude that either.
In other words, are you saying a LinkedIn vulnerability was exploited here, or suggesting that it probably was, or are you just mentioning LinkedIn because it's tangentially related?
I deleted my linkedin a few years back when they had some bug where I would randomly get page views as some other person, with all their connections and account details and whatnot. It would only last a few minutes then switch me back to my account, but they aggressively ignored my attempts to reach out to them about this bug so I just gave up.
The number in the HN headline was changed from 1.2 billion to 1 billion (despite the original source's headline saying 1.2). It is kind of amazing that leaking the personal data of 200 million people is now just a rounding error that can be dropped from headlines.
I think the solution here is laws which require anonymity, and that includes in banking (where it will never happen).
That is because a couple of days ago I got a text message from T-Mobile (which seemed genuine) basically saying that my account was one of a larger subset of prepaid phone accounts which had been compromised, and that my personal information had potentially been taken by "hackers".
Which gave me a good chuckle, because T-Mobile is one of the few phone companies that will let you create completely anonymous prepaid accounts using cash and without filling out any information. AKA you buy a SIM card for $$$ and that is it. So basically the only information of mine they lost, as far as I can tell, is the phone number and the type of phone I'm using (which they gather from their network). If they had gotten the "meta" data about usage/location/etc., that would have been different, but it didn't sound like the hackers got that far.
Had this been a post-paid account, they would have had my name/address/SSN/etc.
Companies need to stop treating knowledge of this information as proof that you are who you say you are. I would have no problem publicly posting my name, social security number, birthday, mother's maiden name, etc., if not for the fact that someone can actually use this information to open a bank account or take out a loan in my name. It's ridiculous that this is all it takes in most cases.
> Companies need to stop treating knowledge of this information as proof that you are who you say you are.
If we assume that isn't happening in the very immediate future due to the latency of introducing new legislation...
Do we have any other options to protect ourselves?
I've personally worked myself into a bad credit rating. I have a home loan and a credit card, but any new credit applications auto-reject. Not the ideal scenario, though!
> Analysis of the “Oxy” database revealed an almost complete scrape of LinkedIn data, including recruiter information.
"Oxy" most likely stands for Oxylabs[1], a data mining service by Tesonet[2], which is a parent company of NordVPN.
It is probably safe to assume that LinkedIn was scraped using a residential proxy network, since Oxylabs offers "32M+ 100% anonymous proxies from all around the globe with zero IP blocking".
Tesonet is true cancer. I am amazed how unethical (and successful) they are.
Knowing how quickly it's expanding: are the employees just as unethical, or do they simply not connect the dots (has the company got too big)?
I hate FB et al. as much as any other person here, but most people know that "if it's free, you are the product". With NordVPN, though, users are paying money and still getting stabbed in the back.
Most people's ethics are easily bought. Does working for a company that operates with questionable integrity outweigh providing a stable income for your family?
Remember Facebook is still a very highly desirable company to work at.
"My name is Ripoff Reporter." For all that their schtick is about how they're "educating" the public about how shady VPN services are this could be anyone, including a front for a VPN service that isn't mentioned on the site.
There's information on the leak that wouldn't be widely available without accessing LinkedIn data using their APIs. Phone numbers and emails, for example.
All the way. It isn’t as if all VPN providers are part of a shadowy cabal to steal your data from an otherwise valuable service; the very premise of commercial VPNs is flawed. Any VPN service is inherently harmful.
Out of curiosity how do you guys think they managed to scrape LinkedIn on such a large scale?
I've been wanting to do some social-graph experimentation on it (small scale, say 1,000 people near me) but concluded I probably couldn't scrape enough via raw scraping without freaking out their anti-scraping measures. (And the API is a non-starter, since it basically says everything is verboten.)
I've crawled a popular social network at a large scale, and am currently doing the same for dating services as a hobby. God, I wish I still got paid for web scraping.
Here are some tricks which may or may not work today:
- Have an app where the user logs in through said website, then scrape their friends using that user's token. That way you get exponential leverage on the number of API calls you can make with just a handful of users.
- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.
- Scrape the mobile website. Even Facebook still has a non-js mobile version. This single WAP/mobile website defeats every anti-scraping measure they may have (see the sketch after this list).
- From a purely practical perspective, start with a baremetal transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql", those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in hurry.
- Don't be too kind to the big websites. They can afford to keep all their data in hot pages, and as one person you will never exhaust them.
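For what it's worth, here's a minimal sketch of the mobile-site trick; the target host and profile slugs are made up, and a real job would need the load-control behavior discussed further down:

    import time
    import requests

    session = requests.Session()
    # Present a mobile browser user agent so the server serves the plain
    # HTML version instead of the JS-heavy desktop app.
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/78.0.3904.96 Mobile Safari/537.36"
    )

    for slug in ["alice", "bob"]:  # hypothetical profile slugs
        resp = session.get(f"https://m.example.com/profile/{slug}", timeout=10)
        print(slug, resp.status_code, len(resp.text))
        time.sleep(1)  # be polite between requests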
> - Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.
Nice tip!!
> -- From a purely practical perspective, start with a baremetal transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql", those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in hurry.
Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).
>Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).
Sure it will work, but I personally don't like Elasticsearch for anything high-intensity because of its HTTP REST API and the overhead it carries. Take a look at Cassandra's [1] "CQL binary protocol"; it's simple and always on point.
You forgot the part about exposing your finished database to unprotected elasticsearch http endpoint ;)
In all seriousness, does anyone know why you can even host an Elasticsearch database over plain HTTP and without credentials? It seems to be the default. What is the use case for this?
If I've understood you right, you break the ToS on other websites to collect users' personal info, and then you have nightmares about people taking that data from you? Doesn't that raise ethical concerns in your eyes?
I usually recommend latency-based dynamic load control for that. Once the website starts to reply 500-1000ms slower than the average one-thread latency, it is time to back off a bit. It is also a co-operative strategy between fellow scrapers, even if they don't know about the other ones pushing a larger load on the servers.
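A rough sketch of that strategy; the thresholds are arbitrary, with the 0.5s trigger following the 500-1000ms rule of thumb above:

    import time
    import requests

    def measure(url: str) -> float:
        """Time one GET request, in seconds."""
        start = time.monotonic()
        requests.get(url, timeout=30)
        return time.monotonic() - start

    def scrape(urls: list[str]) -> None:
        # Establish the single-thread baseline latency first.
        baseline = sum(measure(urls[0]) for _ in range(20)) / 20
        delay = 0.0
        for url in urls:
            elapsed = measure(url)
            if elapsed > baseline + 0.5:        # ~500ms slower than usual:
                delay = min(delay + 0.5, 30.0)  # the site is struggling, back off
            elif delay > 0:
                delay = max(delay - 0.1, 0.0)   # recovered: ramp back up slowly
            time.sleep(delay)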
1000ms is a massive slowdown when revenue-noticeable impacts are far, far smaller. I don't know the legality, but hitting a site hard enough to cause 1000ms slowdowns seems like it's approaching DoS territory.
YMMV, and cloud providers would hate you for this, but you can automate the IP rotation with a cloud provider that bills you by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then another in Singapore for the second hour.
So far, the answers have been non-technical, like "distributed scraping." Well, yes, obviously.
A more useful answer: I did this once, many years ago. Back then it was a matter of hooking up PhantomJS and making sure your user agent string was set correctly. Since PhantomJS was – I think – essentially the same as what headless Chrome is today, the server couldn't determine that you were running a headless browser.
Now, it's not so easy to do that; there are mechanisms to detect whether the client is in headless mode. But most websites don't implement advanced detection and countermeasures. And in the ideal case, you can't really detect that someone is doing automated scraping. Imagine a VM that's literally running Chrome, with the script set up to interact with the VM using nothing but mouse movements and key presses. You could even throw some AI into the mix: record some real mouse movements and key presses over time, then hook up some AI to your script so that it generates movements and key presses that are impossible to distinguish from real human input. Such a system would be almost impossible to differentiate from your real users.
The other piece of the puzzle is user accounts. You often have to have "aged" user accounts. For example, if you tried to scrape LinkedIn using your own account, it wouldn't matter if you were using 500 IPs. They would probably notice.
I wrote a chrome headless framework that types using semi-realistic key presses (timing, mistakes, corrections) and does semi-realistic scrolling / swiping and clicking / tapping.
It's not very hard to build something that would be too hard for almost every website besides Google and Facebook to bother with. If it's a 1 on a 0-9 difficulty scale, most websites just don't have the resources to detect it.
It took me like ~3 hours to write it, but I guarantee it would take months for someone to detect it, and even then, they'd have a lot of false positives and negatives.
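Roughly the idea, as a sketch; the framework itself isn't public, so the timings and typo rates here are my own guesses, and send_key would be wired to a headless browser in practice:

    import random
    import time

    def human_type(text: str, send_key) -> None:
        """Send text one key at a time with human-ish timing and occasional typos."""
        for ch in text:
            if random.random() < 0.03:                  # ~3% chance of a typo
                send_key(random.choice("abcdefghijklmnopqrstuvwxyz"))
                time.sleep(random.uniform(0.15, 0.4))   # "notice" the mistake
                send_key("\b")                          # backspace to correct it
                time.sleep(random.uniform(0.05, 0.2))
            send_key(ch)
            # Log-normal inter-key delay: mostly fast, occasionally slow.
            time.sleep(min(random.lognormvariate(-2.2, 0.6), 1.0))

    # Example: print keystrokes instead of driving a browser.
    human_type("hello world", send_key=lambda k: print(repr(k), end=" "))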
I think there's also a lot of bot-detection-as-a-service out there that can be used by sites smaller than Google and Facebook, like WhiteOps or IAS anti-fraud.
Not many yet; the general consensus is to first warn companies and get them to implement better compliance. Only those who really openly shit on the GDPR get the fines.
LinkedIn's protection doesn't seem to be that sophisticated at the moment. Someone I know maintains roughly weekly up-to-date profiles of a few million users via a headless scraper that uses ~10 different premium accounts and a very low number of different IPs.
So is leaking PII? ToS isn't a legal contract: it's not signed by anyone, and it's changed every other week without the consent of users. ToS is just a formal excuse for why someone's account may be suspended.
As long as you are able to source more than one provider, this can work well enough. If you're dependent on a single data source, e.g., because that source is the only possible source of said data, you'll get nuked from orbit by legal rather than technical means.
I had a business that was generating more money than my full-time job for a while. We helped and greatly simplified matters for several thousand independent proprietors while having a positive effect on the load of the data source, since we were able to batch/coalesce requests, make better use of caches, and take notification responsibilities on ourselves.
Once in a while someone would get worried and grumpy at the data source and there were a couple of cat-and-mouse games, but we easily outwitted their scraping detection each time. When they got tired of losing the technical game, they sent out the lawyers, which was far more effective. We were acquiring facts about dates and times from the place that issued/decided those dates and times, so there wasn't really any reliable alternative data source, and we had to shut down.
The glimmer of hope on the horizon is LinkedIn v. hiQ, which seems poised to potentially overturn four decades of anti-scraping case law, but I'm not holding my breath too hard there.
The US courts decided that scraping is legal, even if against EULA:
> In a long-awaited decision in hiQ Labs, Inc. v. LinkedIn Corp., the Ninth Circuit Court of Appeals ruled that automated scraping of publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). This is an important clarification of the CFAA’s scope, which should provide some relief to the wide variety of researchers, journalists, and companies who have had reason to fear cease and desist letters threatening liability simply for accessing publicly available information in a way that publishers object to. It’s a major win for research and innovation, which will hopefully pave the way for courts and Congress to further curb abuse of the CFAA.
That is a blatant misrepresentation of that decision. That decision was upholding a lower court's preliminary injunction that prevents LinkedIn from blocking hiQ while the main case between the two is litigated. It is not a final decision and it doesn't purport to say that scraping is legal (it even points out other laws besides the CFAA that might be used to prohibit scraping.)
LinkedIn Sales Navigator is a paid tool which allows you to search their whole database. Then depending on how much you pay you can get all their personal details (Email address, phone number, even their address sometimes.) https://business.linkedin.com/sales-solutions/sales-navigato...
I've always been a little confused how this works. If I got all that info for free, it's a "data leak", but if I pay to get the same detailed personal information it's...
In either case my personal data is given away without my consent, but there's this implication that it's only an issue when someone doesn't pay for it.
You're right. My take on this is that a company scraped a bunch of publicly available information that people left open (consciously or not). That's why only a subset have phone numbers. As for the profile URLs and emails, most people don't even try to protect those.
Normally the company sells this data, but now they've given it away. It's not good this data got out because the curation has some value to spammers or whoever. But using the word "leak" here undermines the severity of a real leak where passwords and social security numbers are exposed. Data that was never meant by anyone to be open.
Everyone likely has (technically) provided consent for every piece of information here being shared with partners. Buried in fine print that it wasn't really expected they'd read, of course. It's the cost of being online, and that sucks, but it seems only a leak of what had already been given out.
If you get driver info by hacking a DMV database, it's prison. If you get the same details by paying a few million for FOIA requests, you're a good citizen and a model taxpayer.
Jokes aside, can you really file FOIA requests to get personal driver details from the DMV? I thought FOIA applied only to stuff that is meant to be public but isn't, due to the difficulties of hosting it, putting it up, etc.
Mind you, I didn't research the topic of what can or cannot be requested with FOIA, so I might be totally wrong.
LinkedIn gives away your email address and phone number (even if you provided them just for 2FA) to all your contacts. I checked PDL; it has all the information from LinkedIn except for my phone number, which I promptly removed once I identified the 2FA issue (TOTP is available now).
'Mobile proxies' like https://oxylabs.io/mobile-proxies (no affiliation) allow you to use large pools of mobile or domestic IPs to scrape. It's expensive, but not prohibitively so. Once you've got a mobile IP you become incredibly hard to throttle, since you're behind a mobile NAT gateway.
You probably have to be highly distributed. At least that’s what I did when I tried to scrape a large site some years ago. I had around 100 machines in different countries and gave each of them random pages to scrape.
Distributed bot and scraper networks. Thousands of IPs geographically dispersed throughout the world. There is only so much you can do with rate limiting.
I go to LinkedIn without being logged in and nearly always get a login gate instead of the profile.
They were ordered to unblock hiQ specifically, they were not ordered to open up content to scrapers generally.
They can still throttle high volume traffic and put up captchas. I think the only specific thing the court ordered was for them to unblock hiQ IP ranges.
Scraping LinkedIn is so common you can usually hire people with years of experience in it. It is not as complicated as you might think. There are at minimum hundreds of companies that sell LinkedIn data they have scraped.
I scraped 10 million records from LinkedIn a few years ago from a single IP by using their search function. I got a list of the top 1000 first names and the top 1000 last names and wrote a script to query all combinations and scrape the results.
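The approach is simple enough to sketch; the search endpoint here is hypothetical and the name lists truncated:

    from itertools import product
    import time
    import requests

    first_names = ["james", "mary", "john"]   # in practice: top 1000 first names
    last_names = ["smith", "johnson", "lee"]  # in practice: top 1000 last names

    # 1000 x 1000 = one million queries, each returning a page of results.
    for first, last in product(first_names, last_names):
        resp = requests.get(
            "https://example.com/search",     # hypothetical search endpoint
            params={"q": f"{first} {last}"},
            timeout=10,
        )
        # ...parse the result page and store the profile records...
        time.sleep(1)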
It looks like the purpose was data enrichment, so maybe it was pieced together over time from multiple sources. My LinkedIn record in PDL only had one piece of wrong info. I wasn't able to find anything on my personal email addresses, which is good.
I once worked on a project that tried to do just that, but at the time the LinkedIn API was already limited to seeing the authenticated user's connections' connections, which was too limited for what we wanted to do. I can only imagine it has gotten worse.
It's also the reason recruiters really want to connect to you on LinkedIn because even if you are not interested, your connections might be.
Hey - not related to your comment (apologies) but wanted to get in touch . You left a note on a previous post of mine about wanting to simplify FTP. I'd love to work on this project and wanted to see if you'd be willing to connect so I can understand the problem better. Feel free to email me at kunal@mightydash.com, and thanks in advance!
Haha, when I was a kid and scared to use my real name for things, for some reason I used my email... which had my real name in it, to open a GitHub account with a fake name.
So the API knows me as the famous architect, Art Vandelay.
In your GitHub account you can add a new email address that doesn't even exist or have a valid TLD, like "name@mail.fake". Don't use it as your primary email and it won't require confirmation. You can now set your git user.email to this fake address and any commits you make will be attributed to your account without exposing your actual email address.
You can use yourgithubusername@users.noreply.github.com instead of adding a fake email, and your commits will still show up on your contribution graph and be linked to your username.
Wow... I checked with an email address I use for disposable purposes. The only thing they had on it was a blank LinkedIn profile, meaning that the LinkedIn cancer has trawled some pretty questionable sites, harvesting email addresses as placeholders for their accounts. WTF.
Indeed they do have a profile on me: a bare minimum, scraped from GitHub. That makes sense, since that's about the only social platform I use, aside from HN.
EDIT: My Gmail address has the most information gathered, which makes sense. It's gathered data from Facebook, LinkedIn, Pinterest, GitHub...
It lists my skills as: firefighting and emergency planning/management/services. I suppose, with a stretch of the imagination...
It returned a 404 for my personal email account, so that appears to be sufficiently protected.
More surprisingly, it had data such as my name, title, and work email address connected to an old work email account (Okta-managed G Suite) that I never associated with external services, and absolutely never used on a social networking site like LinkedIn.
Yeah, no kidding. Though if you wait until it flips over to a new minute and refresh, that helps. And it takes all of a minute to register a free key, so it's probably no big deal.
I'm actually a bit surprised at how little data they have on me. They've associated my main email with an old junk email, they've got my first and last name, and know that I'm male, but there's little more.
Nothing for most of my accounts, except one which somehow was falsely attributed to someone else. Odd, given I do have a LinkedIn profile; their scraping must be far from perfect.
My personal email seems to be based on Github and Gravatar, while my job search and work emails got linked together and appear to be based on LinkedIn.
I would be really surprised if this were compliant with the GDPR. I live in the US, but I tried email accounts of relatives in Europe and they had data in there.
> The whole point of the GDPR is to protect data belonging to EU citizens and residents. The law, therefore, applies to organizations that handle such data whether they are EU-based organizations or not, known as "extra-territorial effect."
They can say this all they want, but if you have no presence in the EU, and your jurisdiction does not have any agreement to apply GDPR regulations to you, then this is at most a strongly worded request.
Barring explicit agreements to the contrary (treaties, extradition agreements, etc), by definition a country's laws are only enforceable there.
If PDL has no business in Europe, no plans to expand there, and there's no treaty or other agreement making the provisions enforceable against them, the EU can say whatever it wants but PDL has no legal obligation to do anything about it.
One obvious answer in that case would be to establish who is buying the data from them and treat any PDL data as potentially tainted. If you find a downstream customer who does have a presence, then investigate accordingly. You might not be able to fine PDL directly, but you could certainly make the offending data risky or unprofitable...
Usually you'd either track known errors in the dataset (implying that the companies had either bought it from PDL or copied the leak), or you'd ask the banks (who do have a presence) which accounts were paying them and who owned the accounts. If Bitcoin's involved at all, you assume there's something fishy going on and investigate accordingly.
(Assuming anyone were bothered enough to actually do this, of course.)
Theoretically, if it were egregious enough, the EU could say to the owners or management of the company that if they went to the EU they would be arrested. That’s enough of a threat that it might convince them.
Legal jurisdiction is a separate matter from the specific text of laws. The "this applies to non-European companies" thing just means that if you fall under the jurisdiction of European courts, you can't absolve yourself of responsibility for complying with this law simply by being a foreign-registered company.
On the other hand, if you never fall under European jurisdiction in the first place, you're free to ignore them, just as you can ignore Thai laws against insulting their king. One very important thing to note is that setting foot on European soil will expose you to their jurisdiction, so you've significantly limited your freedom of movement; but if GDPR compliance is a bigger deal than that, then "just never go to Europe" can be a viable strategy.
Oh yes, I'm going to try and see if they have data on me and send a number of GDPR requests if they do. For others from the EU, it's very easy to do using: https://www.mydatadoneright.eu/request
It should be illegal for any company to store my private information like this. The 'anonymous' sharing of my information is easily de-anonymized. Sites asking for your phone number for "security purposes" are a joke.
You just have to accept that absolutely everything you've done online is public information. If it isn't now, it is being stored, and future tools and databases will make what is currently either difficult to access or difficult to interpret very easy to use.
Using phone number as an example of private information is pretty hilarious. Remember when the phone company used to literally print your name and phone number in a book and send it to everyone in your town? Man, their security was terrible!
But it works perfectly fine as a two-factor auth mechanism to prove that whoever set up the account is the same person trying to log into it at some later time.
What private information? If you give a random website your email address or phone number, it's not private anymore and you're the one who released the secret. Unless they promised to keep it private in a legally binding way, in which case, your wish is already true.
At this point practically everything about me's available either for free or a few dollars. The only interesting thing left is whether a given password has been compromised. The answer to everything else is "yes, it's been leaked". Been that way for most of a decade at this point, guessing it's the same for most other folks with any modern digital or banking presence whatsoever.
I'm sure there are search engines for it too, but I noticed that Credit Karma can tell you which of your passwords have been associated with your email addresses in data breaches.
Credit Karma is free, but the CEO appears to be transparent about how they make money (recommending financial products to you based on what they see in your credit profile).
I am a very suspicious and wary internet user and hardly sign up for any services, but I've been using Credit Karma for my taxes and light financial monitoring for the last 3 years. Tax filing was totally free and I got the tax refunds I was expecting. No issues with them whatsoever. I have never gotten any email or other spam as a result of using their service. I am a happy customer, though technically speaking I have never actually given them any money directly.
I agree. At first I got a few emails over a long period of time recommending financial products (credit cards, savings accounts, etc) but I unsubscribed and haven't seen any of those since. The only emails I get now are when something changes on my credit profile (new account, closed account, etc).
> Through our partnership with Troy Hunt’s “Have I Been Pwned,” your email address will be scanned against a database that serves as a library of data breaches. We’ll let you know if your email address and/or personal info was involved in a publicly known past data breach.
Maybe in the future it will, but it uses Have I Been Pwned. From the FAQ[0]:
How does Firefox Monitor know I was involved in these breaches?
Firefox Monitor gets its data breach information from a publicly searchable source, Have I Been Pwned. If you don’t want your email address to show up in this database, visit the opt-out page.
I think cold calling could be an even bigger nuisance. My DNS provider published my phone number by mistake in the whois record when I registered a domain; I spotted it immediately and it was corrected within hours. Over a year later I still receive cold calls from India trying to sell me web services at least once or twice a week.
Imagine if you could match everyone's position with a mobile phone - a dream for telemarketers, tailors, scammers, etc...
Yes and no. You need a phone number, but you still need to carry out a variation of an attack that replaces the SIM associated with that phone number. Sometimes this is carrier-specific. Sometimes it's trivial, sometimes it requires a modest amount of work, and in extreme cases you might have to access an actual network. Most of the time there is nothing stopping the attack if they have your personal information.
I wouldn't be surprised if the starting point for this vulnerability wasn't ES, but Docker. Docker by default modifies iptables, and if you hack together a system that mixes software running directly on the host with software in containers, it's going to expose the forwarded container ports to the Internet - which you might not be expecting, since Docker's forwarding rules bypass whatever host firewall rules you've configured. It's always a good idea to have a separate firewall running outside of your system - that's the one Docker can't fool.
They mentioned this is Google Cloud, which blocks almost all incoming ports by default. They had to have chosen to expose this through the project firewall without putting in a source filter.
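For illustration, a source-filtered rule might look something like this (rule name, tag, and range are made up; the real values depend on the project's network layout):

    # allow Elasticsearch only from an internal range, never from 0.0.0.0/0
    gcloud compute firewall-rules create es-internal-only \
        --allow=tcp:9200 \
        --source-ranges=10.128.0.0/20 \
        --target-tags=es-node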
No. It's not Docker's fault you did not read the manual and exposed the ports wrong: you can bind a published port to a specific IP, and that address should be 127.0.0.1.
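E.g., something like this when publishing the port (the image tag is just illustrative):

    # publish on loopback only; a plain -p 9200:9200 binds 0.0.0.0
    docker run -d -p 127.0.0.1:9200:9200 elasticsearch:7.4.2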
I see where you're coming from, but I disagree. I believe that good software and abstractions should take little training to use - everything unintuitive is a design failure and should be fixed. "Reasonably secure" should be the implicit default, not something you need to explicitly add. E.g., it's better to force authentication and force the administrator to add an account than to let everyone in by default. Or it's better to bind to 127.0.0.1 than to 0.0.0.0 by default, which is what most web servers built into frameworks I've seen do.
Unfortunately, instead of good intuition, Docker is built on caveats, be it networking, storage, caching, image sharing, container/image distinction, authentication, deployment or building a cluster. Every subsystem I experimented with "works", but fails in weird ways in some situations. In my opinion, that means that Docker is a good idea, but has terrible UX/functionality/error handling. I kind of think the same way of Git.
I believe Elasticsearch doesn't allow restricting access by requiring login unless you pay for the enterprise version, which is just straight up stupid.
I remember there was some brouhaha a while back about how Shodan was able to discover services on IPv6 even though the address space is so sparse. Apparently they were running enough of their own NTP servers to reliably map out lots of devices on IPv6.
> Shodan is a search engine that lets the user find specific types of computers connected to the internet using a variety of filters. Some have also described it as a search engine of service banners, which are metadata that the server sends back to the client.
Since I learned about Shodan, I'm convinced that the (subjective) increase in reported data breaches is just due to an increasing number of people looking through Shodan results, and doesn't have anything to do with any trends in security.
Security standards at any company have always been low, but now it's easy even for a layman to find leaked data.
Not sure about Google Cloud, but Elasticsearch on AWS doesn't support X-Pack security. You can only secure your instance via IP restriction; otherwise you have to sign your requests, which is not always supported by the commonly used Elasticsearch DSL libraries.
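If you do end up signing requests by hand, a generic SigV4 wrapper like awscurl covers ad-hoc queries - a sketch, with a placeholder domain:

    # SigV4-signed search against an AWS-hosted ES domain
    awscurl --service es \
        "https://my-domain.us-east-1.es.amazonaws.com/_search?q=test"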
This is why I lie about my birthdate by a couple of days on anything where it's not something like a medical record or where I am required to tell the truth for whatever reason. I also never provide my social security number unless it is required by law.
One of my coworkers generates a fake middle name for every service he signs up with. According to him, this serves as a unique identifier that lets him determine when a service is selling his data to a third party (or the data is being leaked).
Fastmail has subdomain addressing, so if your email is jondoe@example.com, you can use hn@jondoe.example.com to sign up for HN.
That way you'll know for sure who leaks your data, and nobody's going to strip it away like some services would strip away plus addressing (as in, johndoe+hn@example.com).
I've had excellent results with a subdomain. Even though PDL probably has a lot of data on me, they haven't (yet?) been able to glue it to my primary mail address. That one only has my name, gender, GitHub, country, and name of my employer. They can't seem to map the remainder to anything else.
From what I could see, the data returned on me was all derived from publicly available sources (e.g. my "public" LinkedIn page, my public GitHub page, etc). Perhaps others have more, but this looks more like an aggregator of public information than a breach of non-public information.
Having said that, I find these companies unspeakably evil - their intent is to make money by harming people (eroding their privacy by making otherwise private personal information easier to get, obviously a gold mine for identity thieves etc).
In retrospect, it would have been interesting to have a bunch of accounts each containing a unique "map trap", at all of the larger services. Then years later, when the aggregator/broker guys get hacked/sold/leaked, you'd have some picture of the genealogy involved.
The problem is that you often can’t find access to the actual “password” used in the breach. Does anyone know where I can see if it was an actual password or just some made up thing?
I was suggesting something different. Specifically open an account on every service as a tracking canary, say with the same email to help them tie them all together. But on each one, vary something slightly like phone. Then years later, when looking at a leaked aggregator entry, all the phones on the record should tell you all the places they bought/stole data from.
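If you wanted to automate that, a quick sketch of deriving a stable per-service variation (the secret and the naming scheme are whatever you pick):

    # derive a short per-service canary tag from a private secret
    secret="my-canary-secret"
    service="linkedin"
    tag=$(printf '%s:%s' "$secret" "$service" | sha256sum | cut -c1-8)
    echo "johndoe+${tag}@example.com"  # or bake the tag into a fake middle name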
It appears to also contain information possibly acquired from other companies. For example, the author notes that a phone number AT&T had assigned to him, which he never used or shared, was attributed to him.
Back in the day (maybe it is still this way) your landline was default "listed" and you had to pay a monthly fee to be an unlisted number. So AT&T most likely listed his number in some kind of phone book / directory.
Yeah, I'm wondering if this is all scraped public data or a breach of some kind. Are landline phone numbers published in a directory (like a phone book) in the USA?
Yeah. Email addresses? Phone numbers? All of that is practically public at this point anyway. This article is crying wolf. Someday there is going to be a massive credential compromise.
Yes, there are dozens of these data enrichment companies. They scrape public sites and use browser extensions, SaaS tools, inbox addons, etc. They mix it together into profiles, and pretty much have the same dataset by now.
Yep! And as someone who has worked with these data sets and worked on the scraping tools on services like LinkedIn, a lot of the data is outdated, incorrect, or mixing together different entities with the same name into one person or splitting the same person into separate entities incorrectly.
Look at the personal record that was in the article. It looks like aggregated public information. And look at what the companies referenced in the DB do.
It's possible there's someone selling them some not-quite-public info, too, but it's probably more like phone numbers and less like private messages on Facebook or LinkedIn.
The title reads like data from 1.2B profiles was leaked by Facebook and Linkedin, but this looks like scraping public profiles from them.
I mean, this is literally just a leak of data that People Data Labs is selling to anyone who signs up to their service. The 'leak' is just bypassing their payment requirements, so by definition all the data leaked is available for purchase.
This data is accessible at small scales just by registering for a free api key at People Data Labs and making a GET request, and if you want more robust access you could just pay PDL for it.
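Presumably something along these lines - I haven't verified the exact endpoint or parameter names, so treat both as assumptions:

    # hypothetical enrichment lookup with a free API key
    curl "https://api.peopledatalabs.com/v5/person/enrich?api_key=YOUR_KEY&email=someone@example.com"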
I mean it is INTENTIONALLY exposed to the public... the only mistake is they are giving it away instead of charging for it. If you don't like it when they give out all the information for free, it doesn't make it better if they charge money.
Depending on the countries the data is hosted in and the attacker lives in, it's unclear any law has been broken that would land a person in jail.
If PDL had a flaw in their implementation that allowed someone to scrape them (or they didn't and someone did the hard work of creating 1.2 million fake accounts to register for 1,000 free API calls), it might be an uphill battle to prove even "unauthorized access."
LinkedIn is the last social media membership I have. I've been mulling over whether to delete my account because I'm not sure how it will look to prospective employers.
Maybe I am missing something, but I do not really see the scandal with the "leak" here, and I rather think the term is misleading in this context.
What happened?
As far as I understand, there are companies who search the web for public data of people like me, without my consent.
Then they sell that data. Also without my consent.
So that data was available anyway, almost for free.
If this data contained sensitive information, then I would see this business practice as a scandal.
But the mere fact that all this data, which was gathered without consent, is now available for free because of a possible DB misconfiguration... is not a scandal to me.
And a leak is usually when a company loses sensitive data of its customers, who expected that data to remain confidential, like emails.
Not what happened here.
Feels more like PR.
I don't know about other people, but I have zero personal info with LinkedIn and Facebook.
They only info they have about me is info I don't mind being public. If I want something to be private I don't tell it to them. It's as simple as that.
Google on the other hand, knows lots of private things.
Through shadow profiles, third-party submissions, cross-site cookie tracking, and integration of offline data records, this is almost certainly false.
Unless you've directly pursued all legal (or otherwise) mechanisms to ascertain this directly, the best you can say is that you're unaware of any information that's been acquired, and that you didn't knowingly or intentionally contribute any yourself.
The article here describes precisely this practice, in its fourth paragraph and following, in the section titled "Data Enrichment":
For a very low price, data enrichment companies allow you to take a single piece of information on a person (such as a name or email address), and expand (or enrich) that user profile to include hundreds of additional new data points of information. As seen with the Exactis data breach, collected information on a single person can include information such as household sizes, finances and income, political and religious preferences, and even a person’s preferred social activities.
Facebook has a lot of personal information about you even if you have never had a Facebook account. For example: your GPS location data, approximate age, gender, ethnicity....
Welcome to the future komrade. Sadly, it's not a matter of just "not giving them" your location data. Your devices supply it.
And your friends too. I dutifully kept a new number out of FB until a friend messaged me with: "this is your number, right? Xxx-xxx-xxx". They can also tag you, and auto-tag you through face recognition.
Cyber alarmists would call a telephone directory "a verified threat incident". Yet these are the same companies selling OSINT data. These alarmist groups need to put down the buzzwords, step off their white horse, and take a look at the hypocrite in the mirror.
If you use social networks, you don't have a reasonable expectation of privacy. You've published your data publicly. If you want to keep this information private, then don't publish it on the Internet.
> In order to test whether or not the data belonged to PDL, we created a free account on their website which provides users with 1,000 free people lookups per month.
Well that's very generous of them. Now I know what I'm gonna do next.
But would you call VendingMachinesCo because there is a vending machine outside the local supermarket, operated by said supermarket, that spits out cocaine? Pretty sure that whatever you put in there is the machine owner's responsibility, not the manufacturer's. GCP does not put content in their VPSes themselves the way that a bank operates an ATM.
I think it's more like expecting an ISP to poke its nose into what it transfers, since it might be illegal content (similar to whether Google should poke its nose into people's VPSes). I'm not sure we should want to require them to do that.
Why are people running anything publicly accessible?
If you are running on the cloud, there is no need for any VMs to have any public IPs at all. Exception for your Bastion host, and even that should be restricted to known networks.
All incoming traffic needs a layer of indirection. On cloud providers that's usually their load balancers.
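On GCP, for instance, that's as simple as omitting the external IP at creation time (the instance name is a placeholder):

    # create a VM with no external IP; reach it via the load balancer or bastion
    gcloud compute instances create es-node-1 --no-address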
Facebook/LinkedIn are not implicated in the breach at all; it was some random third-party data enrichment service. The Facebook/LinkedIn in the title refers to the fact that people's FB/LI accounts were one of the fields in the database. So were their Github, and basically any other public-facing account that these scrapers can gather.
ES is the new Mongo. If you make software this easy to use, then people with little or no experience are going to use it. Just have secure defaults, like authentication, how many times do we have to learn this lesson...
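Even shipping with something like this in elasticsearch.yml would go a long way (a sketch; on the versions in question the auth half was a paid X-Pack feature, as noted elsewhere in the thread):

    # elasticsearch.yml - stay on loopback and require auth
    network.host: 127.0.0.1
    xpack.security.enabled: true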
The IP address in question does not seem to be working at this time. Clearly whoever runs the server has shut off access. I wonder if someone managed to save a data dump somewhere?
Unless we go after every customer who used the services of PDL, nothing is going to change. We will see a $3 fine per individual after 1 or 2 years of talking about this.
That is precisely my point. Differential privacy would NOT make any difference, and I was pointing the many folks who are working on it to the much simpler issues that are in fact being encountered in the field. This past IEEE S&P had quite a few theoretical privacy talks.
Does anyone know how we can search the data to find info about our (more than likely) entries in this database? Or did they simply find it but not release the info?
> According to their website, the PDL application can be used to search:
Over 1.5 Billion unique people, including close to 260 million in the US.
Over 1 billion personal email addresses. Work email for 70%+ decision makers in the US, UK, and Canada.
Over 420 million Linkedin urls
Over 1 billion facebook urls and ids.
400 million+ phone numbers. 200 million+ US-based valid cell phone numbers.
Too bad there aren't any laws regulating this sort of private data aggregation and sale. Well, besides GDPR (which apparently isn't enforced) and CCPA (which won't be enforced either.)
Let me make sure I understand: If I take gigabytes of “enriched” personal information and make it available to the public for free, then I’m an irresponsible, idiotic, incompetent buffoon. But if I put a paywall in front of it and sell that same data for a fair price, then I’m a business genius?
Seems to me that if the data is legally acquired and can be legally distributed, doing so at a cost of zero does not constitute a data leak. It may be bad business, but since when is that a crime?
People Data Labs privacy policy:
3. ACCESS TO AND CONTROL OVER INFORMATION
A person may do any of the following at any time by contacting People Data Labs at support@peopledatalabs.com. People Data Labs will reply to a person’s request within five business days.
A. Access any information we have on them, if any.
B. Change, correct, or delete any information we have on them, if any.
C. Express any concerns about People Data Labs using their information.
People Data Labs' team will act swiftly upon a person's email request to change, correct, provide, delete, or explain anything a person queries.
People Data Labs understands if a person would like to opt out of People Data Labs' database. Opting out will stop all data sharing and enriching of all PII in People Data Labs servers for that person. Click here, if you would like to opt-out, or choose to have all data about you removed from People Data Labs' database.
For https://www.oxydata.io/:
Review and changes to your information
Contact us at sales@oxydata.io to find out what information we have collected about you, and to request any changes to or deletion of it.
Also, it seems like there's a service like this called DeleteMe, but it also seems like they're a manual opt-out shop. Would be cool if you could find a way to not have humans doing it. Bet they're just having people on Amazon Mechanical Turk fill these out or something like that.
https://joindeleteme.com/how-we-work/
Well then, that's the trick. A legal research team that develops the form for as many sites as you could find, and then a mechanism to send that form, filled with each user's data, to those sites.
Like, what more do they need than disambiguating identity info and a declaration that I'm opting out? E.g. name and DOB?
My only fear is that you're now sending this all to them, but in 2019, we can safely say your name+DOB+address isn't a secret. Or national identity number if that's a thing in your jurisdiction.
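At that point each broker just needs a templated submission - the endpoint and field names here are purely hypothetical:

    # hypothetical opt-out submission for one broker
    curl "https://broker.example.com/opt-out" \
        --data-urlencode "name=Jane Doe" \
        --data-urlencode "dob=1990-01-01" \
        --data-urlencode "email=jane@example.com"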
It's a legit service. I use them and they did ensure that my data was removed from the services they specified. Obviously I'm just some person on the internet so my statement has no intrinsic credibility, but I believe they were also validated in a NYT article a while back.
Actually working on that project right now - www.thekanary.com. Super early stage but have a big list of brokers and opt out links that I'm automating. Would love early feedback.
They list 2 companies as owners of the data in the article. I guess that would be a good place. I'd love to do that but I'm not in the EU.
But the article says it's possible the actual leak comes from a customer or former customer of these companies, and the actual ownership is so far a mystery.
Google is jointly liable for this service, so if you can't find a contact point, then you can email Google with the service IP. They will more than happily point you to the customer to avoid being taken to court.
From the article it seems that you can just create a free account and query your own name.
> In order to test whether or not the data belonged to PDL, we created a free account on their website which provides users with 1,000 free people lookups
Really interesting legal question - "Seems like the ball is with Google at the moment, the exposed data is on their GCP servers. So, they can figure out next steps." is a comment above. How will the chain of insecure infrastructure + the data scrapers + the people responsible for configuration react?
They describe the data as being sourced from a 'data co-op' of over 1k companies which share data. It wasn't clear whether that means that those companies are collaborating and pooling data, or whether it's a roundabout/wordy way of saying that they scrape public personal information from thousands of sites.
They also claim that they're GDPR and CCPA compliant; I'm no expert but I do find one or two references that seem to suggest that scraping EU citizens' personal data without consent hasn't been GDPR-compliant for some time.
It does also raise another question: even if PDL themselves aren't GDPR-compliant, would any resulting fines against them reclaim a significant portion of the utility captured from the distribution of that data? As per comments on this thread, PDL API keys seem to be free to create.
Hypothetically speaking it could be within the interests of a group of businesses to provide a small amount of funding towards operation(s) that harvest and redistribute personal data: if the revenue base is low, the operation(s) can eventually fail (once legal proceedings catch up with them) and the group as a whole incurs little cost.
The speaker also takes a question from the audience regarding potential use-cases for this kind of personal data, and answers that knowing about an individual's life events (such as marriage) can be an opportunity to sell products to them, as can differentiating pricing if they'd just started smoking cigarettes.
Although I'm no expert, my understanding of insurance has been that risk is spread across a large pool of customers, allowing them each to pay similar premiums despite potentially slightly different backgrounds, with the understanding that they mutually benefit by paying into a shared fund so that the (random, potentially high-cost) risk of loss to each member is greatly softened.
We're seeing a situation here where more precise, per-individual data is being collected across large populations and could potentially be used for price differentiation.
If the insurance industry doesn't defend itself, this could lead to premiums which are essentially calculations based on 'pre-existing data' -- information which the consumer may not have consented to sharing, and which an insurance company might not be able to collect from application forms.
We don't seem to be particularly good, collectively, at escaping from cycles which seem to introduce or further wealth disparity at the moment and I worry that this kind of tech-driven attempt to optimize revenue efficiency of the insurance industry would only lead to further inequality.