I'm a huge believer in colocation/on-prem in the post-Kubernetes era. I manage technical operations at a SaaS company and we migrated out of public cloud and into our own private, dedicated gear almost two years ago. Kubernetes and--especially--CoreOS have been a game changer for us. Our Kube environment achieves a density that simply isn't possible in a public cloud environment with individual app server instances. We're running 150+ service containers on each 12-core, 512 GB RAM server. Our Kubernetes farm--six servers configured like this--is barely at 10% capacity and I suspect that we will continue to grow on this gear for quite some time.
CoreOS, however, is the real game-changer for us. The automatic updates and ease of management are what took us from a mess of 400+ ill-maintained OpenStack instances to a beautiful environment where servers automatically update themselves and everything "just works". We've built automation around our CoreOS baremetal deployment, our Docker container building (Jenkins + custom Groovy), our monitoring (Datadog-based), and soon, our F5-based hardware load balancing. I'm being completely serious when I say that this software has made it fun to be a sysadmin again. It's disposed of the rote, shitty aspects of running infrastructure and replaced them with exciting engineering projects with high ROI, huge efficiency improvements and more satisfying work for the ops engineering team.
There is a huge difference between managing your own cloud infrastructure and letting clients run your big SaaS app in their own data centre. I think "on-prem" in the context of SaaS usually refers to the latter.
The article talks about something else: vendors should be able to run their big SaaS apps on customers' private servers as easily as they're doing it for public SaaS customers.
It effectively gives you, the SaaS vendor, the best of both worlds:
- Access to markets you can't serve otherwise.
- Development agility with frequent deployments.
- Scalable ops/support for cloud/private deployments.
Software vendors have often done some sort of limited operational management support in well-defined parameters. In a way I can see that SaaS is no different. For example, I was involved with a software product that we provided as a running system to the customer and took responsibility for maintaining: this was when everything was about "virtualization". Technically, we tightly defined the whole stack from hardware upwards.
Outside the excitement of containers and composability I'm not sure that the problem has been fundamentally simplified. The challenge is that a fully "operational" system from a customer perspective has to fully integrate with the rest of the network's services. That's primarily around all the other elements such as monitoring, intrusion detection, backups and secure remote access: granted, the article starts to address some of these. As a SaaS vendor you need absolute repeatability and a constrained set of 'additional services' to integrate with - otherwise the support matrix is going to grow massively. From a customer perspective I can see the advantages in being able to flexibly move across the boundaries depending on the circumstance.
Honest question, if you're running at 10%, why have you gone with 12-core, 512 GB RAM servers? Couldn't you start with a more reasonable 4-core, 64 GB RAM until some threshold? What value do you get from such an early overallocation of resources?
Who rips the $1,000 processors and $200 voltage control modules out of their servers to upgrade to $2,000 processors and $250 voltage control modules, then re-plans their entire infrastructure and possibly code back around that?
Symmetry between servers has a value.
The main board, RAID controller, network, and usually the storage are going to be planned out meticulously ahead of time based on the maximum load the server is going to see during its lifetime. Often, processor and memory come down to "If we needed X Feature and we didn't have it" or "If we took the server down for 1 hour to upgrade it, how expensive is that?".
I misunderstood, sorry. I was talking in the context of virtual servers that can scale resources somewhat dynamically. (Side note: isn't it awesome that we see numbers like "512GB RAM" and don't immediately assume it's not a single node in a deployment?)
I initially pictured someone picking the highest spec option they can afford when setting up a new service, rather than choosing based on actual demand of each node.
> "If we took the server down for 1 hour to upgrade it, how expensive is that?"
Putting my CI / DevOps hat on for a sec: who takes production servers down for upgrades without some level of HA to avoid downtime? ;)
Several reasons. We did some POC testing before we deployed this gear and knew that we would achieve high density. Honestly, I didn't expect this much. Hard to believe that we've basically deployed every non-database app we have on this cluster and we're only 10% utilized.
Every service gets run in two environments: test and prod. Both are co-located on the same Kube cluster in different namespaces. We also don't put any datastores in Kube. That stuff still lives in OpenStack for now. Ceph can make it possible but for fast disk I/O, it's tough to beat the local SAS bus.
> Ceph can make it possible but for fast disk I/O, it's tough to beat the local SAS bus.
Personal anecdote - we have everything on-prem using a pretty standard vSphere setup. I've got a couple of PostgreSQL databases that aren't even that heavily used (I mean, they top out around 20-40 tx/sec) backed by a hybrid Tegile array. Randomly throughout the day my IOWAIT starts spiking because, even over 8Gb Fibre Channel, our storage latency climbs when everything else is on the same storage (well over 200 full VMs). Or sometimes I need to run a table scan over a 40GB table for a one-off query and the storage bottlenecks the crap out of us because only our small active set is cached on the SSDs.
You can run databases on remote storage, people obviously do it, but there's a reason why even on AWS the best practice is to use your instance disks instead of EBS volumes if you care about performance. Noisy neighbors suck.
Because the marginal cost of hardware once the chassis is racked is trivial compared to the public cloud, which is insanely marked up for retail on exactly this concept.
I have heard from a staffer on the GCE team that they deploy three CPU cores for every user facing core. That might have something to do with the cost.
GCE provides significantly more reliability promises to us, from the PoV of user - As far as I know, DigitalOcean and RamNode don't provide things like no-downtime host maintenance (hell, AWS doesn't provide that)
Interestingly, the "3 cores for 1 user-facing one" would explain the pricing of "preemptible" instances - which can cost as little as 1/3rd of a full instance. And the main difference is that they are fully rescheduled every 24h or more often, and there's no live-migration for maintenance...
… I am now wondering if GCE runs fault-tolerant VMs. Because if it does, holy moly
Except they don't. You give up reliability and uptime and not being on oversold hosts for a tiny bit more money?? Do some real-world benchmarks between DO and GCE. Not the same ballpark.
I'm aware of one similar company that used to fit sixty $20/mo VPS in 1U. They were working on a couple hundred in 2 or 3U when I left (harder than you'd think), but they also have a cheaper tier now because of competitors pushing VPS down. Margins on VPS, even without oversubscribing RAM, are pretty decent once you get past the capital but they are decreasing in a race to the bottom just like shared hosting before it.
Yes but if you are targeting an on-prem solution toward certain customers, let's say enterprise, wouldn't you want to target the platform they are most likely to be running? Maybe that is k8s, but OpenShift and CloudFoundry seem much more enterprise-y to me.. How do you choose? How do you NOT choose?
If running your app on Cloud Foundry is an irreversible architectural decision, then those of us who work on it messed up.
CF is meant to provide a platform, not an API. The whole point is to liberate the developer from caring about how the app is brought to life so that they can focus on building the app. For operators the point is that no badly-behaved app can bring down production for everyone, so they can lower the drawbridge and let developers move faster.
I imagine that my counterparts at Red Hat would say the same about OpenShift. It's there to let you go faster.
Disclosure: I work for Pivotal, the majority donor of engineering to Cloud Foundry.
Nothing yet, but it basically borrows a pattern that we use for nearly all of our Kube-related systems: descriptive configuration files bundled by the developer with their individual applications. So, alongside their code, their Dockerfile, their Jenkinsfile, their Kube specs, and monitoring spec, the dev adds a file that describes the load balancing. "This app needs a privatenet-only VIP on port ___, speaking protocol ___. For the health check do this ____. Load balance using ___ algorithm." This file (yaml) would be parsed by our build pipeline and fed to a service that manages the load balancing. This service initiates "watches" on Kube services and nodes and configures the service appropriately on the F5 via the LB's REST API, pointing the pool at the NodePorts of a subset of Kubernetes nodes.
We would probably build this around Kube's Ingress API. We're still a few backlogged projects away from addressing this internally. For now, we manage VIPs by hand. About 1-3 new VIP requests a week.
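To make the descriptor idea concrete, here is a minimal sketch in Python of what the parsing step might look like. Everything in it is an assumption for illustration: the field names (`vip_scope`, `health_check`, and so on), the `desired_pool` helper, and the vendor-neutral output dict are hypothetical, and no real F5 or Kubernetes API calls are shown. It only turns an already-parsed descriptor plus a list of node IPs into a description of the pool a load-balancer service would then have to apply.

```python
from dataclasses import dataclass

# Hypothetical descriptor fields. The comment above only says the file names
# a VIP scope, a port, a protocol, a health check and a balancing algorithm.
REQUIRED_KEYS = ("app", "vip_scope", "port", "protocol", "health_check", "algorithm")


@dataclass(frozen=True)
class LoadBalancerSpec:
    app: str
    vip_scope: str
    port: int
    protocol: str
    health_check: str
    algorithm: str


def parse_spec(raw: dict) -> LoadBalancerSpec:
    """Validate a descriptor that has already been loaded from the YAML file."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    port = int(raw["port"])
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return LoadBalancerSpec(
        app=str(raw["app"]),
        vip_scope=str(raw["vip_scope"]),
        port=port,
        protocol=str(raw["protocol"]),
        health_check=str(raw["health_check"]),
        algorithm=str(raw["algorithm"]),
    )


def desired_pool(spec: LoadBalancerSpec, node_ips: list, node_port: int,
                 max_nodes: int = 3) -> dict:
    """Describe the pool to create: a subset of nodes, each at the NodePort."""
    members = [f"{ip}:{node_port}" for ip in sorted(node_ips)[:max_nodes]]
    return {
        "name": f"{spec.app}-pool",
        "vip": {"scope": spec.vip_scope, "port": spec.port, "protocol": spec.protocol},
        "monitor": spec.health_check,
        "algorithm": spec.algorithm,
        "members": members,
    }
```

A service watching Kubernetes nodes could call `desired_pool` again whenever the node list changes and push the difference to the load balancer; that reconciliation step is left out here.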
Recently just built and deployed an enterprise (on-prem) offering for my startup https://commando.io.
The single biggest problem by far was ripping out the 3rd party SaaS services that we use/pay for and replacing them with native code. While building a SaaS it is wonderful to use 3rd parties to save time and complexity. Examples include Stripe (payments), Rollbar (error tracking), Intercom (support/engagement), Iron.io (queue and async tasks), Gauthify (two factor auth)... However, when you go on-prem, oftentimes the servers don't have a public interface, so your entire application has to work offline.
2nd tip. While it may seem like a good idea to create a separate branch in your repo for on-prem (i.e. enterprise branch), this is a bad idea. It ends up being much better to just if/else in the master branch all over the place:
if(enterprise) {
// do something
}
If/else in the master branch is what GitHub Enterprise does (at least, so I was told).
Everybody works on master, with feature flags! Yes!
Also, if you have proper dependency injection for testing and developer sandboxes, that's the right layer to vary your back end. If statements are only needed when setting up the dependency environment.
In PHP we do this with polymorphic classes; in Haskell with type classes and the ReaderT monad.
Totally agree from experience that if/else flagging is vastly better than separate branches - you will get burned _hard_ if you take the wrong approach!
A lot of our internal architecture is driver-based. So the calls and interfaces are generally the same, and the enterprise logic, for example, is determined in the driver. This also helps when migrating 3rd party vendors, because we just write a new driver to handle that system.
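A minimal Python sketch of the driver/dependency-injection approach described in the last few comments might look like the following. The names (`ErrorTracker`, `HostedTracker`, `LocalLogTracker`, `build_tracker`) are hypothetical, and the "hosted" driver is a stand-in that makes no network call. The point is only that the edition flag is checked once, where dependencies are wired up, and the application code never sees it.

```python
from typing import Protocol


class ErrorTracker(Protocol):
    """The interface application code depends on, whatever the edition."""

    def report(self, message: str) -> None: ...


class HostedTracker:
    """Stand-in for a third-party SaaS client (hypothetical, no network I/O)."""

    def __init__(self) -> None:
        self.sent = []

    def report(self, message: str) -> None:
        self.sent.append(("hosted", message))


class LocalLogTracker:
    """On-prem driver: keeps everything on the box, so it works offline."""

    def __init__(self) -> None:
        self.lines = []

    def report(self, message: str) -> None:
        self.lines.append(message)


def build_tracker(enterprise: bool) -> ErrorTracker:
    # The only place the edition flag is checked.
    return LocalLogTracker() if enterprise else HostedTracker()


def handle_request(tracker: ErrorTracker, payload: str) -> str:
    """Application code: identical for SaaS and on-prem builds."""
    if not payload:
        tracker.report("empty payload")
        return "rejected"
    return "ok"
```

Swapping a vendor then means writing one new class that satisfies `ErrorTracker`, and tests can inject a fake driver the same way.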
I agree that Kubernetes is a game changer which makes it much easier to run your own applications. If all you have is VMs then the services (RDS, EFS, etc.) offered by AWS are more effective. With a container scheduler there is less maintenance and the decision is harder.
BTW Shoutout to Ev from Gravitational, they are proper Kubernetes experts and we appreciate their help on GitLab, especially making https://github.com/gravitational/console-demo for our k8s integration efforts.
If you have VMs, AWS is not necessarily the answer if you have a relatively static workload. You are lighting money on fire by the buckets if you pay AWS for a fixed amount of always on instances.
I agree that for just static VMs AWS is not a good deal. I meant that if you're still stuck with a VM architecture (using Chef/Puppet instead of a container scheduler) in many cases it makes sense to pay for AWS services (such as running a database for you) instead of doing it yourself. I think container schedulers will change that balance for many companies, making self managed service attractive for smaller organizations than before.
Netflix is also almost certainly far into the "contact us" price tiers, so they're a useless example for cost-effectiveness of AWS unless/until they actually publish what they pay. If it's anything near the listed tiers, they're being ripped off, based on what little I know of the kind of discounts much smaller players have negotiated.
Quick note, they define Private Cloud as "Customer’s private cloud environment on a cloud provider" but that is not the definition I go by. Just a heads up while reading.
I would define Private Cloud as a non-bare metal solution on Prem in a traditional colo setting where hardware is owned by the company in question. Hybrid Cloud would be a Private Cloud and Public Cloud bridged in some way.
To me, private cloud means a cloud like infrastructure run by your own ops people, either on-premise or in a colo, running on dedicated hardware you own or rent.
Funny how definitions diverge here.
The definition I agree with least is the one in the blog post (a secluded environment on a cloud provider).
Where I work we are building a private cloud, but the hardware will all be provided and managed by one of the main server manufacturers. It's private because no other company will use these servers and they reside within our firewalls. But we rent the infrastructure and are not responsible for maintaining it or for capacity management. We are just charged based on our consumption of it.
I would say that to be a cloud it also needs an automation layer on top like openstack or vRA
I would agree that would be called private cloud. I guess it needs a definition behind "cloud provider" according to the author. I think hardware solely for your use (including networking/disk) would be my minimum definition, but not bare-metal. I'm coming at this from a security boundary perspective and where liabilities are. Where provisioning of resources can be dynamic and moved.
The different meanings for different people are interesting..
Where I work, "private cloud" refers to a product offering in which we run the software on the customer's own AWS account. This keeps their data separate from other customers and allows extra security (access restricted to US employees, access restricted to VPN, etc.).
I'm also an ex-cloud engineer (as in building, operating, etc.), so I understand the hosting meaning of "private cloud" to be a cloud under the control of the user (software and hardware), such as a company running an OpenStack install in its own datacenter or colo.
Count me among those for whom "private cloud" effectively means a colo solution (a sibling comment[1] mentions another version of this, which I would also consider a private cloud since it's on dedicated hardware). I would call what you first referred to as a "managed cloud" service.
This is a great article, but it would be good to state up-front that the author works for a company that sells a service designed to help you go on-prem. It's not totally clear until later in the article that this is the case. It would also help put the article in context better.
Another company that does this - very little marketing - is https://www.replicated.com . I don't get paid for recommending them but they do great work (some of our clients are theirs etc).
We've had the opposite experience using Replicated for npm's on-prem npm Enterprise software.
I was originally trying to build our Enterprise solution using Ansible, targeting a few common OSes (Ubuntu, Centos, RHEL); headaches gradually began to pile up, surrounding the "tiny" differences in each of these environments -- I'm VERY happy to offload this work.
It took me a little while to wrap my head around best practices regarding placing our services in Docker containers, but once I was over this conceptual leap I was quite happy.
Sandstorm.io is another. FWIW, we've been talking about these types of startups coming for a few years now. It's probably a few years out, but on-prem (or what I call the Intercloud) is definitely going to be bigger than anything we've seen yet.
I'm the OP and author. I figured that because it's on the company blog with an about us that says what we do this was properly disclosed. But maybe it could use an additional lead in.
My first thought was about some advice I would give someone considering this (having been pushed into this situation by a large healthcare client years back).
But once I found it was a sales piece, that thought faded.
Note: This may not generalize to your use case. We mainly serve big data customers including banks, telco, and have also seen other environments similar to the "air gapped environments" listed below.
That being said:
Would just like to add some coloring to this.
>> However, not every customer that wants on-premise is a government agency with air gapped servers.
This is the bulk of our customer base and still a very large portion of the market. I deliver software via DVD (neither flash drives nor wifi are allowed).
A few notes from these kinds of customers:
They won't let you just install anything.
Docker is great but doesn't have a lot of enterprise adoption (despite the self perpetuating hype cycle) outside the companies that already have mature software engineering teams as a core competency.
They are often still running CentOS 6.5 or older.
A lot of these environments still require deb/rpm based installation.
Admins at companies that run on prem installations tend to be very reserved about their tech stack. Docker looks like the wild west to folks like that.
Our core demographic:
We do a lot of hadoop related work. We have a dockerized version of[1] that we deploy for customers.
We have also been forced to go the more traditional ssh based yum/deb route. We have automation for both.
They are right that many "on prem" accounts are now "someone else's AWS account".
We also have to run stuff on windows server as well. Docker won't fly in that kind of environment either. Microsoft still has large market share in enterprise yet and will for a long time.
K8s and co is great where I can use it, but it shouldn't be assumed that it will work everywhere let alone in most places in the wild yet. Hopefully that changes in the coming years.
Again: This is one anecdote. There are different slices of the market.
> A lot of these environments still require deb/rpm based installation.
Still? I didn't know the "dev" in devops had eaten "ops" and standard practice is to start shitting files wildly all over a server. Just because it's in a container/vm doesn't make that any less gross of a practice.
A package(s) that behaves well within the rest of the distribution, and lets sysadmins use their favorite configuration management to setup /etc/foo/bar is exactly the level of control I want as a customer, and would want to afford admins if I was a vendor.
Giving customers the option of a prepackaged virtualization/container solution is fine and all, but I'd take a wizard-installer-for-linux over a VM image any day. I deal with installers already, and while they're less than ideal, it's easy enough to wrestle their output into a .deb and throw it on our internal repo. Then I can apt install it wherever I want.
Some are on Oracle Linux, which on a few occasions is not compatible with how RHEL or CentOS work, despite the similarities and pitches. Patching OEL is more awkward in my experience, mostly because of Oracle rather than for any technical reason.
> We also have to run stuff on windows server as well. Docker won't fly in that kind of environment either. Microsoft still has large market share in enterprise yet and will for a long time.
Containers are built into Windows Server 2016, and they're powered by Docker. (or Hyper-V for full-VM containers).
No one is going to bet on containers in windows in production for at least a year or 2. And the ones that do will likely be "early adopters" even after a few years. Look no further than the companies trying to fork docker due to lack of stability:
http://thenewstack.io/docker-fork-talk-split-now-table/
Windows 2016 just became GA, so it will still take some time for bugs to be smoothed out. In addition to this, running Docker containers inside a Windows 2016 VM does not help, as you still end up running static VMs.
Unless cloud providers offer the ability to run Windows containers directly on the cloud platform, there are no efficiency gains.
I work for Pivotal, which has a slightly different horse in this race: Pivotal Cloud Foundry. It's based on the OSS Cloud Foundry, to which Pivotal is the majority donor of engineering.
Lots of customers want multi-cloud capability: they want to be able, relatively easily, to push their apps to a Cloud Foundry instance that's in a public IaaS or a private IaaS. They want to be able to choose which apps go where, or have the flexibility to keep baseload computing on-prem and spin up extra capacity in a public IaaS when necessary.
It also happens that lots of CIOs have painful lockins to commercial RDBMSes, and they don't want to repeat the experience. They want to avoid being locked into AWS, or Azure, or GCP, or vSphere, or even OpenStack.
CF is designed to achieve all of these. The official GCP writeup for Cloud Foundry[1] literally says "Developers will not be able to differentiate which infrastructure provider their applications are running in..." (can't say I completely agree, GCP's networking is pretty fast, which is I guess what happens when you have your own transoceanic fibre).
If I push an app to PCFDev -- a Cloud Foundry on my laptop -- it will run the same way on a Cloud Foundry running on AWS, GCP, Azure, vSphere, OpenStack and RackHD.
As a devops engineer working with CF daily, I definitely know where my apps are running, and that moving an app from cloud to cloud or to on-prem is not 'seamless'.
I have moved dockerized apps from public cloud CF to on-prem CF. It all worked because the apps were stateless to begin with. Pivotal appropriating all the merit of easy migration of stateless apps is dishonest at best.
Stateless apps are easy to move. They can be moved from CF to Kubernetes to ECS to Heroku.
Now, let's talk about the state, because that's where the real problem and the lock-in might be?
> Now, let's talk about the state, because that's where the real problem and the lock-in might be?
State is hard, state has momentum. We do what everyone else does: we push it out into a separate parallel space. Heroku do that with their PostgreSQL and Kafka services. We do it with services. IBM, Amazon, Google, Microsoft all do it with services.
Cloud Foundry is agnostic to the services. More agnostic if you use BOSH. Less agnostic if you use a non-BOSH service (like RDS).
Heroku isn't. You're married to their PostgreSQL unless you want to build your own or switch to RDS.
Kubernetes has more of an emphasis on attaching and detaching volumes. I can do that with BOSH; in future it'll be an app-level feature too.
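As an illustration of pushing state out into services, here is a minimal Python sketch of how an app on Cloud Foundry might look up the credentials of a bound service. Cloud Foundry exposes bound services to the app as JSON in the `VCAP_SERVICES` environment variable; the helper name `service_credentials`, the instance name `orders-db`, and the URI in the usage note are made up for the example.

```python
import json
import os


def service_credentials(instance_name, environ=None):
    """Return the credentials dict for a bound service instance.

    VCAP_SERVICES maps each service label to a list of bound instances;
    each instance carries a "name" and a "credentials" object.
    """
    environ = os.environ if environ is None else environ
    services = json.loads(environ.get("VCAP_SERVICES", "{}"))
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == instance_name:
                return instance.get("credentials", {})
    raise LookupError(f"no bound service named {instance_name!r}")
```

The app asks for `service_credentials("orders-db")["uri"]` and does not need to know whether that database is a BOSH-deployed service or something like RDS behind a broker, which is the agnosticism described above.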
Cloud Foundry is opensource, so there's "lockin" in the sense that Rails or NodeJS or PostgreSQL have "lockin". Every commercial distribution is required to have a consistent base design.
In terms of quitting Cloud Foundry entirely, you have two ways out.
The first is that you used buildpacks to stage your apps. In that case you stop pushing sourcecode to CF and replace it with whatever build pipeline makes sense for the replacement platform.
The second is that you decided to push container images instead of sourcecode. In that case you push to the replacement platform instead.
That said, you'll need to recreate the other niceties: orchestration, routing, log collection, service injection and service brokers, self-healing infrastructure and a bunch of other bits and bobs.
If you write a full infrastructure for deploying apps, monitoring apps, service-injecting apps, routing to apps on AWS, then you are probably using their APIs fairly heavily.
All the IaaSes have APIs for fundamental components like compute, storage and network. BOSH provides a relatively uniform layer for them, which is how Cloud Foundry manages to get around to so many platforms.
Can't you just, like, go dedicated first? Or better, start dedicated and turn up cloud capacity every day at 6pm when your traffic is 10x? (Really, no one has explained how they can scale their DB that fast, because you can't; only the app tier, which is probably badly designed if it is that slow.)
Is anyone else noticing that bare metal is getting unbelievably cheap? OVH is well known but there are now many vendors also offering crazy cheap bare metal hosting with <1hr setup times, etc. Compared to the popular virtualized cloud vendors it's a steal, especially when it comes to CPU, RAM, and storage. (Bandwidth tends to be about the same.)
It's not quite apples-to-apples. Bare-metal and VPS providers lump all traffic together into a single pool.
If you do that with an IaaS, you're nuts. Your static assets (which typically represent the bulk of bytes served) should be served from a blobstore or a CDN, which are priced dramatically cheaper.
If your mp4s are served from an EC2 instance instead of S3 or CloudFront, for example, you're keeping warm by setting $100 bills on fire.
Take serverloft as an example - not a low-cost provider. With your $99/month server with a 1 Gbps uplink you get "flat bandwidth":
"Fair Use Policy
Use as much bandwidth as required: With all dedicated servers from serverloft, you receive a traffic flat rate free of charge. Here our Fair Use Policy applies, which requires a fair usage of all our resources as well as a reasonable traffic usage.
Therefore, in case the bandwidth exceeds 50 TB on a 1 Gbit/s server during the current billing month, we will decrease the bandwidth to 100 Mbit/s.
You will be able to monitor you current usage in the customer panel. We will also inform you via e-mail should you approach the 50 TB maximum usage. In this case you can purchase additional bandwidth for $10 per TB in order to extend your traffic usage."
So, $99 for the server and 50 TB a month. Now if you want to max out your 1 Gbps connection, let's say at 50% of max (1 Gbps * 30 * 24 * 60 * 60 s / 2 ~ 160 TB), that's an additional commitment of 110 TB * $10 = $1,100, plus $99, for a grand total of about $1,200 a month. How much transfer does that get you out of S3? First off, 160 TB out of S3 each month clocks in at $13,227 for just the bandwidth, not including any storage. For the price of 160 TB on dedicated, you get approximately 12 TB on S3 (again without storage).
120TB out from cloudfront? 25693/month.
Now, one, low-end box might not be enough to keep everything going, but notice the price difference here. If you bought four "low-end" boxes you'd get 200TB of bandwidth included, and still be paying a lot less than for just the bandwidth on S3. It might even be cheaper to rent dedicated boxes set up as caching proxies, than to serve your data from s3/cloud-front...
Now, I'm not saying the cloud doesn't have merit, but just pointing out that the price difference is real.
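The dedicated-server side of that arithmetic can be checked with a few lines of Python. This is a sketch using only the figures quoted in this comment ($99 base price, 50 TB included, $10 per extra TB, a 30-day month, decimal terabytes); the function names are made up, and the S3 and CloudFront prices are not recomputed because they depend on tiered price lists not reproduced here.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month, as in the comment


def monthly_transfer_tb(link_gbps: float, utilization: float) -> float:
    """Decimal terabytes moved in a month at the given average utilization."""
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second * utilization * SECONDS_PER_MONTH / 1e12


def dedicated_monthly_cost(transfer_tb: float, base_price: float = 99.0,
                           included_tb: float = 50.0,
                           overage_per_tb: float = 10.0) -> float:
    """Server price plus purchased traffic beyond the included allowance."""
    overage_tb = max(0.0, transfer_tb - included_tb)
    return base_price + overage_tb * overage_per_tb


transfer = monthly_transfer_tb(link_gbps=1.0, utilization=0.5)
print(f"{transfer:.0f} TB per month at 50% of 1 Gbps")   # about 162 TB
print(f"${dedicated_monthly_cost(160.0):,.0f} for the rounded 160 TB")  # $1,199
```

So a half-utilized 1 Gbps link moves roughly 162 TB in a 30-day month, and the rounded 160 TB figure costs $1,199 under the quoted policy, which matches the "about $1,200" above.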
Indeed :-) I just felt the need to go and look at some current numbers, from a more expensive host than what I currently use (happy hetzner customer here).
Paying by the Mbps is a lot cheaper at scale than paying by the GB, so I wouldn't even dismiss the cost benefit for bandwidth either.
If all you need or want is hardware, power, and bandwidth I would say you should definitely go dedicated or colo. If your business runs on the value-added services then you don't really have that choice.
The problem is that cloud providers build services and solutions outside of simple server space. AWS for example has some killer apps with S3, RDS, ELB, Kinesis, etc. Yes, there are open source alternatives for all of those, but they bring on huge operational costs.
We've done our own hosting for twelve years now; hundreds of physical servers in a co-lo. We do continuous deployment (have for basically ever) and multiple identical shards with horizontal balancing, so there's almost no "special" server that we can't afford to lose or upgrade.
While our hardware and ops team and premise costs are a little lower than high volume EC2 to the same scale, the real win comes at the network. High quality transit through multiple redundant links, high capacity routers, our own BGP, is all much lower cost and more flexible than we can get from EC2. It's not even in the same ballpark!
(Also, we put SSDs into databases years before they were usable in EC2, so there was a win there, too!)
This is an interesting post, as Gravitational is doing something very similar to us - https://jidoteki.com - in fact that blog post reflects much of what I wrote a year ago on our company blog - http://blog.unscramble.co.jp/post/134388066008/why-we-care-a.... We focus on the more commonly known concept of on-prem (not AWS/Kubernetes), in other words on dedicated hardware in a private datacenter or hosting facility, by creating fully offline, updateable, and secure virtual appliances designed specifically for self-hosting.
Small team here... 20 customers on-prem. It takes more than a year to update all of them. In some cases "policy" requires us to connect to staging over TeamViewer, sharing the mouse with the DBA.
CI is basically impossible to accomplish.
It is hilarious discovering a bug after both sides (us and the customer) have approved going live with the latest version.
The most exotic technology we have implemented is a CDN to speed up some component's delivery. And... In some cases we did not yet have authorization for that.
Honestly... on-prem is perfect to preserve your job... but man...
The whole point of cloud, and one of the main drivers for its rapid adoption, has been that it removes the need to invest in and manage your own hardware, staff and capex.
Amazon, Google or others take care of that unless regulations require you to be on-prem.
It never really had anything to do with the lack of software or management tools. Not to take away from Kubernetes, but VMware, Cisco, Microsoft and other enterprise vendors have always had this market sealed up with relatively sophisticated capabilities.
Things like vCenter seem far ahead in terms of management, capabilities, feature set and policy, and are fine-tuned for enterprise use. Just take distributed storage as an example, where options like vSAN, Nutanix or ScaleIO do not really have any open source counterparts. But this is not surprising given the billions of dollars spent in enterprise.
I don't know of any that require you to be on-prem, but if you can't arrive at a reasonable liability agreement with your providers, cloud can be taken off the table pretty easily for any information that requires public disclosure of breaches (e.g. SSNs) or comes with hefty fines (e.g. patient information).
It's not so much that regulations require you to be on premise as that on premise is how you fulfill a combination of so many requirements emphasizing meticulous control and auditing of your entire operations, from background checking your datacenter janitors on up. FedRAMP and similar standards have historically been fulfilled by companies built around bureaucracy-ware-first business models for enterprise customer demands rather than technology-first like most technology companies. Most of these companies, such as EDS (now HP), focus on managing as much of a customer's technology as possible.
The biggest obstacle is going to be keeping yourself lean in the face of employees that want to build their own empires internally.
Since I worked at Amazon for 5 years I saw how to cut costs on networking and hardware and to build only as much as you actually need. My experience after working at Amazon is that most Enterprise IT is full of people who hide behind Best Practices in order to expand their budget.
You can very easily wind up with massive amounts of spending on VMware, EMC and Cisco. Unless you have an executive with a vision for how to contain costs on networking, storage and servers, I wouldn't recommend on-prem; you can easily wind up spending an order of magnitude more than you should.
We've been running a hybrid on-prem solution for nearly 3 years now. It's been challenging, but Kubernetes has drastically simplified it for us. It now means we can spin up a client site in 1-2 business days, provided we have a server on-site ready to go.
I think unless there is a big turn of the tide, supporting on-prem is just trouble these days. Most large enterprise customers are willing to use SaaS already.
Forget about the huge issues in the support organization for a second: the impact on-prem has on your release cycle has consequences that are hard to fully grasp. So much for "continuous development and release" if you have to keep supporting old versions of software for a year.
Build your stack so that you can easily migrate clouds (i.e. don't use all the super high-level AWS APIs). It's a good idea in general, and it should make going on-prem doable enough if you are worried about having that option at all.
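To make the "don't use the super high-level APIs directly" point concrete, here is a minimal sketch of one way to do it: the application codes against a narrow interface and each environment supplies its own backend. Everything here (BlobStore, LocalDiskStore, InMemoryStore, save_report) is a hypothetical name for illustration, not anything from the article or from a real SDK.

```python
from pathlib import Path
from typing import Dict, Protocol


class BlobStore(Protocol):
    """Hypothetical narrow interface the application codes against."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class LocalDiskStore:
    """Backend for on-prem or any plain server: files under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryStore:
    """Stand-in for a cloud object store adapter.

    A real adapter would wrap the vendor SDK behind the same two methods.
    """

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def save_report(store: BlobStore, customer: str, body: bytes) -> str:
    """Application code: knows about BlobStore, not about any cloud."""
    key = f"reports/{customer}/latest.txt"
    store.put(key, body)
    return key
```

The application never imports a vendor SDK, so moving clouds (or going on-prem) means writing one more small adapter rather than touching call sites.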
One of the points the author makes is that "on-prem" these days often means "someone else's AWS account". When we started about a year ago we suspected it would come up, but today it's turning into the most popular use case. And adopting something like Kubernetes in a smart way largely takes care of your concerns regarding agility.
Now, the story time!
When I was at Rackspace, I was trying to analyze the top reasons our startup customers would stop using some of our SaaS offerings. The most common one, unsurprisingly, was that they'd go out of business. But another top one was "they got successful". As they got bigger and more successful (can't mention names) they'd bring more and more in-house, eventually getting to a point where the only products they were interested in were servers and bandwidth.
So it depends. Some things just don't work well as SaaS, especially for customers with money who can't scale their IT fast enough.
The managed services are the killer feature in the Cloud, RDS Aurora or DynamoDB especially for AWS, and once your DB is in AWS you pretty much have to put everything else in there.
The product I work on started life as an on-prem-only software. We're mostly hosting customers in AWS these days.
The smallest on-prem customers run on a single server. We have multiple separate web applications, and a couple of separate backend services. In AWS, we have each separate web app/backend service on its own set of auto-scaling EC2 instances, and have ELB+haproxy and some scripts to manage everything. Each customer (100s) gets their own DB, but the DBs are all hosted across a few actual DB servers.
We use the same codebase for on-prem and AWS, and there's really not much difference in operation: as far as the applications are concerned, on-prem is just the same multi-tenant environment that happens to have a single tenant.
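As a rough illustration of "on-prem is just a multi-tenant environment with a single tenant", here is a minimal sketch. The Tenant and TenantRegistry names, the hostnames and the connection-string format are all made up for the example; this is not our actual code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Tenant:
    name: str
    db_server: str  # physical DB server hosting this tenant's database
    db_name: str    # each tenant gets its own database


class TenantRegistry:
    """Maps a tenant name to its own database (hypothetical helper)."""

    def __init__(self, tenants: Iterable[Tenant]) -> None:
        self._by_name: Dict[str, Tenant] = {t.name: t for t in tenants}

    def __len__(self) -> int:
        return len(self._by_name)

    def connection_string(self, tenant_name: str) -> str:
        t = self._by_name[tenant_name]
        return f"Server={t.db_server};Database={t.db_name}"


# Hosted: many tenants spread across a few DB servers.
hosted = TenantRegistry([
    Tenant("acme", "db1.example.internal", "acme"),
    Tenant("globex", "db1.example.internal", "globex"),
    Tenant("initech", "db2.example.internal", "initech"),
])

# On-prem: the same code path, with exactly one tenant on the local box.
on_prem = TenantRegistry([Tenant("customer", "localhost", "app")])
```

The application only ever asks the registry for a tenant's connection; whether the registry holds one entry or hundreds is a deployment detail.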
However, I would absolutely go all-cloud if I could. Two major reasons: AWS/cloud services, and database.
The first point is easy: we just can't really use any AWS or other 3rd party SaaS stuff ourselves, other than for our own support infrastructure. The app has to deploy to a self-contained on-prem environment. This is really both a blessing and a curse. On one hand, we have to re-invent some things that could otherwise be off-the-shelf from AWS or other offerings. On the other hand, while switching to another cloud would not be trivial, there's nothing in our core application code that relies on AWS at all.
This also makes our technology stack a bit constrained: we could benefit immensely from a time-series database, but instead we use a RDBMS (MS SQL Server) because, well, we have enough trouble with on-prem administering the database as it is.
The second point -- database -- is the cause of so, so much frustration. Very few people can run databases well. It's the same old story: when you're small, it's no problem, but as you grow the DB performance becomes critical (for reference, while our smaller customers have DBs that are 1-2GB, dozens of our customers have DBs in the 100s of GB, and a couple are over 1TB, and typically there's about 15% inbound data per day, though after processing not that much gets stored). I/O speed is always an issue. We've had customers trying to run with their I/O on a NAS that's also shared with a dozen other applications ("We spent 100k on our NAS and the sales guy said it'll do everything, so it must be your application that's slow"). I have heard our customers say things like "But I can't make a full database backup, I don't have enough disk space".
There's also the inevitable "what hardware do I need to buy?" question, that even our own sales team is frustrated by. Overspec, and they are upset at cost, or -- probably worse -- are frustrated you made them spend all that money when their machine sits mostly idle. Underspec, and it's now your problem that it doesn't work, because they bought what you told them. It took me months before I convinced even our own team the best we can do is provide a couple examples of hardware setups, and caveat the heck out of it ("with this many devices monitored, with an average of this many parameters, at this frequency, with this many alerts configured, this many concurrent users," etc).
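For what it's worth, the "couple of examples plus caveats" approach boils down to arithmetic like the sketch below: turn the customer's numbers into an inbound sample rate and match it against the setups you have actually load-tested. The Workload fields mirror the caveats above; the setup names and ceilings in REFERENCE_SETUPS are invented for illustration and would have to come from your own testing.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Workload:
    devices: int            # devices monitored
    params_per_device: int  # average parameters per device
    poll_interval_s: int    # seconds between polls

    def samples_per_second(self) -> float:
        return self.devices * self.params_per_device / self.poll_interval_s

    def samples_per_day(self) -> int:
        # Integer arithmetic to avoid floating-point rounding surprises.
        return self.devices * self.params_per_device * 86_400 // self.poll_interval_s


# Hypothetical tested ceilings, in inbound samples per second.
REFERENCE_SETUPS: Dict[str, float] = {
    "small": 500.0,
    "medium": 5_000.0,
    "large": 50_000.0,
}


def pick_reference_setup(
    workload: Workload, setups: Dict[str, float]
) -> Optional[str]:
    """Return the smallest tested setup that covers the workload, or None."""
    rate = workload.samples_per_second()
    for name, ceiling in sorted(setups.items(), key=lambda kv: kv[1]):
        if rate <= ceiling:
            return name
    return None  # beyond anything tested: needs a custom sizing conversation
```

For example, `pick_reference_setup(Workload(1000, 20, 60), REFERENCE_SETUPS)` lands on "small" (about 333 samples per second). It ignores alerts and concurrent users, which is exactly why the caveats still matter.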
It can be hard enough to isolate the performance problems under load when you operate the whole stack -- it's incredibly difficult when you can only approximate a customer's setup, and even then, you have to somehow convince them their DB config or I/O is inadequate and that they need to spend money to fix it.
Reading through the bullet points near the top, I was mostly on board until this one:
> Economies of scale: some customers have the capacity and ability to acquire and operate compute resources at much cheaper rates than cloud providers offer.
AWS is probably spending a billion dollars per year or more on hardware... they get special things that are not available to the public like chips and hard drives. You really can't compete with that.
Exactly. There's a reason multiple large IT companies have invested heavily in cloud services, and it's not so they can be in a market with hard to obtain profits.
In terms of why cloud services are becoming popular for their users, it makes things cheaper in the short term, and is an easier sell to management than on-premises resources. It's also useful for putting together solutions when you work in a company with overly restrictive local computing resources. However, if you're looking to save money in the medium to long term, and you can find people you trust with server administration, then you're better off with on-premises (perhaps only keeping a cloud-hosted load balancer if you wanted to guard against unexpected spikes in network traffic).
I work in on-prem infrastructure software. The reasoning that our customers have given us is that they care about security. We have to deploy to customers who don't air gap their systems, but have fun access control policies. For example, we regularly support customers who have issues using a web session. We can't control their computers, but we can ask them to type (or copy and paste) anything in we'd like. Often times, the commands we ask them to run are like: [K || K = {kv, [navstar, _]} <- ets:lookup(kv)]. Do you know what that does? I can almost assure you that our customers do not. We ship compiled binaries to the customer, and although we would never ever send them code that could hurt them, we link against a dozen libraries (that we ship), and who knows who could have poisoned those?
We also have a requirement for our software that time has to be synced. We also recently started to actually check if people's time was synced (using the requisite syscalls), and it quickly became our #1 support issue for a little bit. Customer environments are far too uncontrolled to simply be tamed by such a system as Kubernetes.
I think that if you can avoid on-prem software and use XaaS, you should. You probably don't need to run your own datacenters, databases, etc., because it's unlikely that you're better than GOOG, AMZN, Rackspace, DigitalOcean, and others. It's very unlikely that your application will be running at sufficient economies of scale to benefit from the kind of work Amazon and others have done to run datacenters efficiently at scale. Not only this, but Google has figured out how to run millions of servers.
Although GOOG / EC2 tend to make hardware available (NVMe, SSDs, etc.) far after the market releases it, if you compare it to enterprise hardware cycles, it's lightning fast. We still have customers who run our system on SAS 15K disks, and prefer that over SSDs, even though we recommend SSDs. On the other hand, if you control the environment, rather than spending days, if not months, on how to make your application more I/O efficient, you can simply spend a few more cents an hour and get 1000 more IOPS.
I have a technique (Checmate) to make containers 20-30% faster and expose other significant features. Unfortunately, we have customers who run Docker 1.12 and the latest version of our software and want to use containers, and yet they want to use a kernel no newer than a third of a decade old. I literally have spent months making code work on old kernels, at significant performance and morale cost. If we controlled the kernel, this wouldn't be a problem, and it's yet another problem that K8s cannot solve.
Lastly, networking on-prem tends to be an afterthought. IPv6 would make many container networking woes go away nearly immediately. Full bisection bandwidth could allow for location-oblivious scheduling, making the likes of K8s and Mesos significantly simpler. BGP-enabled ToRs give you unparalleled control. Unfortunately, I have yet to see a customer environment with any of these features.
I really hope the world doesn't become more "on-prem".
Going on-premises might have some advantages, but it comes with completely different problems that you didn't even think about. Some of them:
* You thought of all the power redundancy, ideal cooling, humidity, etc... but then your office gets robbed and all your computers stolen.
* Network wiring... some people are lousy and create an entire spaghetti mess, someone used a crappier type of cable, there's crosstalk from other equipment, someone stepped on a cable and damaged it. Which one is it? Good luck finding out.
* Hardware fails, power supplies fail, disks fail, everything can fail.
You can pick these battles, or you can focus on your software based service. I suggest the latter.
I think when people say on-prem they mean a lot of different things but generally that doesn't include a crappy server room in your office building. They're probably thinking of a real data center.
Do your first two points happen frequently in a semi-professionally run data center? I haven't even heard of such stuff. Hardware failures happen often, I agree.