" This saved costs, guaranteed better uptime, and made the site more portable and thus harder to take down "
Probably not true for " This saved costs ". From what i've seen, virtual machines usually cost more than twice the price of renting the equivalent "real" machine monthly.
They could have used dedicated servers; there are more dedicated server providers than VM providers, thus achieving the same goal, less expensively.
Probably not true for " better uptime " either; VMs are still hosted on real hardware, which fails, too. (Although distributing the work on more independent machines can improve uptime.)
They are more expensive, but they are usually quick and easy to acquire, which makes provisioning much more efficient when traffic fluctuates. Overall, sysadmins will also have less of a tendency to over-provision, i.e. to get more and beefier machines than are needed "to be safe".
From the article it sounds like they aren't provisioning for fluctuating traffic and have a fixed set of VMs. Most providers can get you hardware within a couple hours in any case.
Their historical traffic records are probably good enough to predict how much traffic will come in at different times of the year, so they could ramp the number of VMs up or down as needed, possibly saving them some money.
Also, VPS provisioning time is seconds to minutes instead of hours, so they could redeploy to another provider if they suddenly got the boot from one. And via Amazon/DigitalOcean-type APIs, this reprovisioning-on-failure could be fully automated.
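As a rough sketch of how that automation might look (just an assumption on my part, not anything TPB is known to run; the API token, the region/size/image slugs and the IP are placeholders, and it happens to use DigitalOcean's public v2 droplet endpoint because that's the one I know):

    # Minimal sketch: if the current front-end stops answering, spin up a replacement
    # VM at another provider. Placeholders throughout; error handling mostly omitted.
    import requests

    API = "https://api.digitalocean.com/v2"
    TOKEN = "YOUR_API_TOKEN"  # placeholder
    HEADERS = {"Authorization": "Bearer " + TOKEN, "Content-Type": "application/json"}

    def is_alive(host):
        """Crude health check: does the front-end answer HTTP at all?"""
        try:
            return requests.get("http://" + host + "/", timeout=5).status_code < 500
        except requests.RequestException:
            return False

    def provision_replacement(name):
        """Create a fresh droplet; with a config-management image this takes minutes, not hours."""
        body = {
            "name": name,
            "region": "ams3",             # placeholder slug
            "size": "s-2vcpu-4gb",        # placeholder slug
            "image": "ubuntu-22-04-x64",  # placeholder slug
        }
        resp = requests.post(API + "/droplets", json=body, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return resp.json()["droplet"]

    if __name__ == "__main__":
        if not is_alive("203.0.113.10"):  # placeholder IP of the current front-end
            print(provision_replacement("frontend-replacement"))

Run something like that from a tiny watchdog box and "getting the boot from one provider" becomes a few minutes of downtime rather than a scramble.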
In an ideal world, yes, or if your software already works seamlessly across datacenters. But in the real world it is rare that your hosting provider is good at both VMs and metal. At least that's the biggest problem I've always encountered, especially with budget providers.
This is clearly the ideal setup for most use cases, and I'm somewhat puzzled as to why it's not more common. I guess using only virtual servers is a tiny bit simpler, so companies will just eat the extra cost.
1) Hardware seizure expenses vs. LEOs duplicating the HDD of a virtual machine.
2) TPB needs to locate in disparate jurisdictions to take advantage of different legal situations. That would involve a ton of shipping costs, probably more lost hardware, and paying for remote hands.
3) They had been paying a premium for 'bulletproof' hosting.
For real, their dedicated hosting costs are most likely not going to be at all comparable to the ones most people commonly get quoted. Hosting costs go up when your host runs out of a former Cold War bunker.
There are almost no MPAA-proof countries; it's pretty hard to hide from billion-dollar companies with lobbyists and legal teams who are effectively above the law/make the laws. Countries that do ignore international copyright law (Russia, China, Iran, pretty much any South American country) are usually very expensive to get a dedicated server in, and have unreliable networks and piss-poor speeds. On top of that, they have their own sets of content laws (Russia and China censor anything they perceive as against their governments and jail/execute those who create and facilitate its distribution; good luck hosting anything that opposes Islam in a Muslim country, see: pretty much anything fun).
Sweden and Holland used to be considered anti-copyright havens, but the movie/recording industry mafia eventually pressured them into passing legislation that squashed this.
The only countries where you could operate and be reasonably copyright-resistant are Iceland and Switzerland, because they are non-EU members and, as of now, have great data protection laws. Dedicated servers there are quite overpriced, though.
If the load balancer is the weak point that would be discovered first, then I imagine they must have some mechanism to stop it leaving evidence that leads to the other machines if it were raided (it isn't on their hardware, so they can't prevent the files from being backed up).
Is there a way the codebase could be entirely encrypted and not even accessible to the cloud provider (with some 'boot password' needed each time the server starts up)?
I remember reading their load-balancer shuts itself off after 2 minutes if something strange happens. The whole thing is run from memory, so no traces will be left.
No need to get that fancy. If they're in the cloud, they're most likely on Xen, which natively supports full memory dumps. LE goes to the cloud provider with a warrant, they run "sudo xm dump-core", LE walks away with a memory dump without the client ever knowing anything.
Even if they are able to do that, they still need to shut down the 21 other VMs, probably at 21 other hosting providers in different countries, all at the same time. I'm sure they can, but I'm also pretty sure TPB can start new ones faster than the authorities can stop them.
It depends on what you mean by that. The only way to prevent a codebase from being seen by an adversary with physical access when the server is on is to not have the sensitive data on the server in the first place.
Encryption (with the decryption key fetched at boot from, say, a particular .onion address) would work against backups, but won't protect against an adversary with admin access to the server while the virtual server is on.
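A minimal sketch of that boot-time key fetch, purely illustrative (the .onion URL, device and mapper names are made up; it assumes a local Tor SOCKS proxy on port 9050, requests[socks] installed, and cryptsetup on the box), and the caveat above still applies: anyone who can dump the VM's RAM while it's running gets the key anyway:

    # Sketch: fetch the disk passphrase from a .onion service at boot and unlock
    # a LUKS volume, so nothing sensitive ever sits on the provider's disk at rest.
    import subprocess
    import requests

    ONION_URL = "http://exampleexample.onion/bootkey"   # hypothetical key server
    PROXIES = {"http": "socks5h://127.0.0.1:9050",       # socks5h = resolve .onion via Tor
               "https": "socks5h://127.0.0.1:9050"}

    def fetch_passphrase():
        resp = requests.get(ONION_URL, proxies=PROXIES, timeout=60)
        resp.raise_for_status()
        return resp.content.strip()

    def unlock(device, mapping):
        # Feed the passphrase to cryptsetup on stdin so it never touches disk or argv.
        subprocess.run(
            ["cryptsetup", "open", device, mapping, "--key-file=-"],
            input=fetch_passphrase(),
            check=True,
        )

    if __name__ == "__main__":
        unlock("/dev/xvdb", "appdata")  # placeholder device and mapper name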
This would be a cool application for some kind of homomorphic encryption. Server gets encrypted search request and matches it against an encrypted index and returns the encrypted results.
Homomorphic encryption isn't fast enough... yet. But it's getting faster!
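For anyone who hasn't played with it, here's a toy of the partial (additive) flavour using the python-paillier library (pip install phe). It's nowhere near the fully homomorphic scheme an encrypted search index would need, but it shows the basic shape: compute on ciphertexts, decrypt only at the end.

    # Toy demo of partially homomorphic encryption with python-paillier.
    # The "server" adds numbers it cannot read; only the key holder can decrypt.
    from phe import paillier

    public_key, private_key = paillier.generate_paillier_keypair()

    # Client side: encrypt values before handing them to an untrusted server.
    enc_a = public_key.encrypt(17)
    enc_b = public_key.encrypt(25)

    # Server side: operate on ciphertexts without ever seeing 17 or 25.
    enc_sum = enc_a + enc_b    # homomorphic addition
    enc_scaled = enc_a * 3     # multiplication by a plaintext constant

    # Client again: only the private key reveals the answers.
    print(private_key.decrypt(enc_sum))     # 42
    print(private_key.decrypt(enc_scaled))  # 51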
It will be amazing if/when we get to the point where you can have a virtual server where you know that the person with physical access to the server cannot access your data.
Even if it is two orders of magnitude slower than raw hardware that's still fast enough for some things. (For example, being able to have a username+password -> personal info database safely run on someone else's hardware.) And once it takes off there probably will start to be hardware support/accelerators for it - like vector intrinsics and AES instructions currently.
The problem with homomorphic encryption in this case is that you need access to the private key in order to interpret the results of the computation. Joe Public encrypts queries using a public key, and the query runs on a possibly compromised server without the attackers learning anything about the database or the query, but then the result of the query looks like absolute gibberish to Joe Public.
Giving Joe Public access to the private key necessary for interpreting the query result allows attackers to inspect all of the intermediate states of the query finite state machine, which allows debugging and inspection just as if homomorphic encryption wasn't in use.
I suppose the routing proxy could hold the private key and decrypt the query result for the general public. However, the location of the routing proxy is almost certainly going to be compromised before the locations of the servers executing the queries, so in the decrypting proxy scenario, the attackers will almost certainly have the secret keys before they get access to the boxes executing the queries. There's also the problem that the messages being decrypted are the final states of finite state machines that executed the queries, so the messages to be copied over the network add up in size to at least the size of the dataset being queried. (The data can be sharded into many smaller databases, and almost certainly would be in order to speed up the homomorphic computation steps, but this doesn't cut down on the amount of network traffic necessary to retrieve all search results for a single query. A simple query on 1 TB of data, split into 10,000 databases each of 100 MB would require copying and remotely decrypting 10,000 messages, each over 100 MB in size.)
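Back-of-the-envelope, using the same illustrative numbers:

    # Network cost of shipping back homomorphic "final states" when each shard's
    # result is at least as large as the shard itself (illustrative figures only).
    DATASET_BYTES = 1 * 10**12                       # 1 TB of data
    NUM_SHARDS = 10_000
    shard_bytes = DATASET_BYTES // NUM_SHARDS        # 100 MB per shard

    # A single query touches every shard, and each returns a >= shard-sized message.
    per_query_traffic = NUM_SHARDS * shard_bytes
    print(shard_bytes // 10**6, "MB per shard")          # 100 MB per shard
    print(per_query_traffic // 10**12, "TB per query")   # >= 1 TB moved and decrypted per query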
It might be possible to discover a homomorphic encryption scheme whereby knowledge of the private key allows one to devise a mapping from a higher-dimensional finite state machine to a lower-dimensional one, where the secret key for the smaller machine doesn't leak information about the secret key for the larger one. In that case, it may be possible to perform some finishing operations on the query to prepare it for conversion to the smaller state machine, and to give the public the private key to the smaller machine so that the query result could be read from it without the public being able to observe intermediate states of the query computation. However, I believe this is far beyond our current mathematical understanding.
> Giving Joe Public access to the private key necessary for interpreting the query result allows attackers to inspect all of the intermediate states of the query finite state machine, which allows debugging and inspection just as if homomorphic encryption wasn't in use.
This doesn't apply to TPB, but one could give each user of, say, an email webapp the private key to his/her own data while still facilitating server-side search.
> I suppose the routing proxy could hold the private key and decrypt the query result for the general public. However, the location of the routing proxy is almost certainly going to be compromised before the locations of the servers executing the queries.
This wouldn't be completely useless since it lets you offload much of the storage and computation onto commodity cloud providers without revealing what's on the machines, even if they're scanning your RAM. From the article it seems like TPB is getting some kind of utility out of such a scheme: "All virtual machines are hosted with commercial cloud hosting providers, who have no clue that The Pirate Bay is among their customers. All traffic goes through the load balancer, which masks what the other VMs are doing."
> There's also the problem that the messages being decrypted are the final states of finite state machines that executed the queries, so the messages to be copied over the network add up in size to at least the size of the dataset being queried.
In a homomorphic encryption scheme that supported querying, only the encrypted results would need to be relayed back from each search shard, no?
> In a homomorphic encryption scheme that supported querying, only the encrypted results would need to be relayed back from each search shard, no?
Yes, but as I stated originally, the size of the result is the size of the finite state machine which encodes all of the query data plus the search mechanics. We may in the future discover ways around this, but it's a limitation of the current state of the art.
So you're saying that, effectively, anything that Joe Public can request an adversary can request, so that by giving Joe Public access to the database you'd be giving an adversary the same access?
No. You're over-simplifying what I said and arriving at a trivial statement. I suppose I would correct your summary to be "The mechanics of current homomorphic encryption mean that by giving Joe Public _ONE_KIND_ of access to the database you'd be giving an adversary _ANOTHER_, _MORE_POWERFUL_ kind of access to the same database."
I'm responding to the GP, who was hoping that homomorphic encryption would allow TPB to hand an attacker a working copy of the database on which the attacker could run queries, but not leak information about what the database was doing.
I'm making the statement that allowing Joe Public the ability to interpret query results allows the attacker the ability to observe the database's internal state at each step of the query, nullifying any advantages of homomorphic encryption.
I explained why current homomorphic encryption doesn't allow the kind of separation of access the GP was hoping for, and outlined one way a theoretical discovery advancing the state of the art might allow what the GP was hoping for.
"At the time of writing the site uses 21 virtual machines (VMs) hosted at different providers. [...] All virtual machines are hosted with commercial cloud hosting providers, who have no clue that The Pirate Bay is among their customers."
They may "have no clue" but it seems like that's only because they don't care and haven't looked. I don't see anything in the article that would prevent the providers from figuring this out unless I'm missing something.
Apart from the external-facing proxy (which is the most exposed link in this setup), these VMs don't need any sort of public presence. Unless the provider inspects the processes running on all of their customers' machines, all they can see is a VM with opaque VPN connections to a few external IPs.
I think only the load balancer would be vulnerable to discovery. Everything behind the load balancer could be a secure connection to a completely different datacenter if needed.
"If someone is paying the bill, do they really care?"
So you could cross-reference the names of the people raided with the payment information held by the VPS providers (the usual suspects or top "n" providers, let's say). Of course, that could be hidden as well.
Another issue: how does anyone know this isn't misinformation, and that the VPS providers don't play a role, or more of a role, than is indicated? Just because someone wrote this, or because they said so?
What advantage is there for anyone (in a situation like this) to reveal anything about how they are situated security-wise, if not to lead people off the beaten track, even granting some possible marketing benefit?
It's not about the money, it's about not being able to care and survive as a business.
Because then they would also have to care about the thousands of other VMs that may run all kinds of stuff that is illegal somewhere, or questionable, or politically, socially, culturally, or commercially sensitive, et cetera.
No ISP can afford to be proactive about this. They cannot afford to care. Or even know.
Why can't people find where their servers are? I understand they have their own IP allocation, so they can use BGP tricks. But don't they need a sympathetic ISP or similar to help them get the routes in?
IIRC they have their load balancer hosted under a sovereign IP address (the IP block belongs to a political party). So attempting to mess with it could constitute an infringement of free speech.
I guess if ISPs really went sniffing to see whether they host them, they would probably be able to find out (probably!). But when you have a couple hundred VPS customers and care even a tiny bit about their privacy, then as long as you get paid and receive no complaints, why would you really go looking for them?
Interesting, so I'm presuming there's several VPNs involved between the load-balancer and all the discrete servers. I wonder if they use a VPN provider with a static IP and no-logs policy or if it's simply yet another VPS.
I'd love to hear a little more about the architecture.
If memory serves, I think TPB is somehow related to iPredator [1][2], though I'm not sure if that is the case anymore. This may give them _lots_ of experience running VPN software, which would be usable if that is indeed how they're communicating between VPS providers.
To make the legal side effects "Somebody Else's Problem", seeing as that's the service many VPN providers offer? It might save them from having to get a new load-balancer VPS quite so often, making the site that much harder and more time-consuming to knock offline.
> In total the VMs use 182 GB of RAM and 94 CPU cores. The total storage capacity is 620 GB, but that’s not all used.
That level of hardware/cores seems a bit over the top given what TPB does.
When I was a boy we had this thing called 'Alta Vista'. It was the search engine before Bing! came along. Processors did not run at gigahertz speeds back then and a large disk was 2 GB. Nonetheless most offices had the internet, and when people went searching, 'Alta Vista' was the first port of call for many.
TPB has an index of a selective part of the internets, i.e. movies, software, music, that sort of thing. Meanwhile, back in the 1990s, AltaVista indexed everything, as in the entire known internets, with everything stored away in less than the 620 GB used by TPB for their collection of 'stolen' material.
Alta Vista is a very large project, requiring the cooperation of at least 5 servers, configured for searching huge indices and handling a huge Internet traffic load. The initial hardware configuration for Alta Vista is as follows:
Alta Vista -- AlphaStation 250 4/266
  4 GB disk
  196 MB memory
  Primary web server for gotcha.com
  Queries directed to WebIndexer or NewsIndexer
NewsServer -- AlphaStation 400 4/233
  24 GB of RAID disks
  160 MB memory
  News spool from which news index is generated
  Serves articles (via http) to those without news server
NewsIndexer -- AlphaStation 250 4/266
  13 GB disk
  196 MB memory
  Builds news index using articles from NewsServer
  Answers news index queries from Alta Vista
Spider -- DEC 3000 Model 900 (replacement for Model 500)
  30 GB of RAID disk
  1 GB memory
  Collects pages from the web for WebIndexer
WebIndexer -- Alpha Server 8400 5/300
  210 GB RAID disk (expandable)
  4 GB memory (expandable)
  4 processors (expandable)
  Builds the web index using pages sent by Spider.
  Answers web index queries from Alta Vista
From what I remember, the whole of the TPB server + data could fit onto a 90 MB USB stick in 2012. Sure, we have had many episodes of really important reality TV series and other great stuff that all needs pirating, yet in 2014 I doubt that 90 MB has ballooned into petabytes. We are still in the same range - let's say a 1 GB USB stick might be a reasonable size to buy for it.
AltaVista started out with a modest-sized index of 20 million pages. Let's imagine those pages were all 1 KB in size; then 20 × 10^6 × 10^3 comes to 20 × 10^9 bytes, or 20 GB. So, in terms of stuff indexed, that is considerably larger than TPB. Agreed?
Well, maybe not. They could have used compression to get the vastness of TPB onto that USB stick. Around that time - 2012 - they had 1.6 million torrents. That is still some way off the 20 GB that AltaVista indexed, no matter how you bloat the maths. Sad to say, but in the 1990s, the internet was actually larger than your porn collection.
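In numbers, with the same assumptions as above (20 million pages at roughly 1 KB each, and the ~90 MB figure quoted for TPB's 2012 dump):

    # Rough comparison of the two index sizes, using the figures quoted above.
    altavista_pages = 20 * 10**6        # early AltaVista index: ~20 million pages
    avg_page_bytes = 1 * 10**3          # assume ~1 KB per page
    altavista_bytes = altavista_pages * avg_page_bytes   # 2e10 bytes = 20 GB

    tpb_dump_bytes = 90 * 10**6         # the ~90 MB TPB backup figure from 2012

    print(altavista_bytes / 10**9, "GB indexed by early AltaVista")               # 20.0
    print(round(altavista_bytes / tpb_dump_bytes), "x the size of the TPB dump")  # ~222x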
How useful is reqs/second anyway? By that score Google probably does very badly as a search usually returns the answer on the first page. With old-style search engines you might need to go through scores of pages before getting what you want. I found TPB to be a bit like that too, wading through results pages more than necessary.
TPB is not 'safe for work', and in a lot of jurisdictions you cannot even access it from home. In the UK (which is a small but well-populated country) it is not that easy to get onto TPB - you need hacker voodoo skills or a route through a VPN, as none of the main ISPs will let you on. Most of the civilised world has the same need to protect citizens from the evils of TPB, so places where it can be accessed are not that common. Even if you could access it, would you? Probably...
Meanwhile, back in 1998 - a year or two before the dotcom crash - plenty of people were using search engines such as AltaVista (which was the best back then) for actual work. Maybe not everyone, but enough people knew about computers and things like AOL disks, modems and what not. The internet was big.
Which reminds me of my main point, the one you thought so important to downvote rather than give kudos for being insightful. TPB uses a constellation of computers and consumes vastly more resources than the biggest search engine of the 1990s, yet the utility of TPB is limited to the few fortunate enough to live somewhere it can be accessed. What can be searched for on TPB is a mere subset of what was on AltaVista, albeit different and not-so-useful stuff. I would say that AltaVista was doing far more with what they had, reaching a wider audience, doing something more useful for the world (than serving weight-loss adverts) and altogether performing a miracle. TPB is a slouch in comparison.
I'm not sure what your point is. AltaVista probably had to put a lot of effort into tuning every part of their infrastructure to keep the site running on that hardware. Why would TPB do that when they can simply get another VM for a fraction of the cost?
Running a top 100 site[1] on 21 VMs in 2014 is quite impressive.
Probably not true for " This saved costs ". From what i've seen, virtual machines usually cost more than twice the price of renting the equivalent "real" machine monthly.
They could have used dedicated servers; there are more dedicated server providers than VM providers, thus achieving the same goal, less expensively.
Probably not true for " better uptime " either; VMs are still hosted on real hardware, which fails, too. (Although distributing the work on more independent machines can improve uptime.)