How We Knew It Was Time to Leave the Cloud

boulos · on Nov 12, 2016

I appreciate that you didn't sling mud at Azure, but re-reading the commit for the move to Azure [1] there were tell tale signs then that it might be bumpy for the storage layer.

For what it's worth, hardware doesn't provide an IOPS/latency SLA either ;). In all seriousness, we (Google, all providers) struggle with deciding what we can strictly promise. Offering you a "guaranteed you can hit 50k IOPS" SLA isn't much comfort if we know that all it takes is a single ToR failure for that to be not true (providers could still offer it, have you ask for a refund if affected, etc. but your experience isn't changed).

All that said, I would encourage you to reconsider. I know you're frustrated, but rolling your own infrastructure just means you have to build systems even better than the providers. On the plus side, when it's your fault, it's your fault (or the hardware vendor, or the colo facility). You've been through a lot already, but I'd suggest you'd be better off returning to AWS or coming to us (Google) [Note: Our PD offering historically allowed up to 10 TiB per disk and is now a full 64 TiB, I'm sorry if the docs were confusing].

Again, I'm not saying this to have you come to us or another cloud provider, but because I honestly believe this would be a huge time sink for GitLab. Instead of focusing on your great product, you'd have to play "Let's order more storage" (honestly managing Ceph has a similar annoyance). I'm sorry you had a bad experience with your provider, but it's not all the same. Feel free to reach out to me or others, if you want to chat further.

Disclosure: I work on Google Cloud.

[1] https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/...

taneq · on Nov 12, 2016

> rolling your own infrastructure just means you have to build systems even better than the providers

As a cloud provider, though, you're trying to provide shared resources to a group of clients. A company rolling their own system doesn't have to share, and they can optimise specifically for their own requirements. You don't get these luxuries, and it's reasonable to expect a customised system to perform better than a general one.

boulos · on Nov 13, 2016

As a counter argument: very few teams at Google run on dedicated machines. Those that do are enormous, both in the scale of their infrastructure and in their team sizes. I'm not saying always go with a cloud provider, I'm reiterating that you'd better be certain you need to.

taneq · on Nov 13, 2016

Interesting, presumably they're very well informed and they obviously feel that Google's cloud offerings are the best way to go.

If I may ask a few questions:

- Are they charged the same as external customers or do they get a 'wholesale' rate?

- As internal clients, do they run under the same conditions as external clients? Or is there a shared internal server pool that they use?

- Do they get any say in the hardware or low-level configuration of the systems they use? (ie. if someone needs ultra low latency or more storage, can they just ask Joe down the hall for a machine on a more lightly loaded network, or with bunch more RAM, for the week?)

- Do they have the same type of performance constraints as the ones encountered by gitlab?

I feel like most of the reason to use cloud services is when you have little idea what your actual requirements are, and need the ability to scale fast in both directions. Once you're established and have a fairly predictable workload, it makes more sense to move your hosting in-house.

Pyxl101 · on Nov 13, 2016

The Google teams that he's referring to probably don't run on Google Cloud Platform, and rather run on Google's internal infrastructure that GCP is built upon. So most of your questions may not apply. However, his points about cloud infrastructure are still valid.

taneq · on Nov 13, 2016

If you're right, then Google teams using internal Google server infrastructure is literally Google rolling their own.

zbuttram · on Nov 22, 2016

He technically only said they don't run on dedicated machines, not that they run on GCP. My guess would be Google has some sort of internal system that probably uses a bunch of the same software but is technically not GCP.

remus · on Nov 14, 2016

This is mostly speculation base don having read this http://shop.oreilly.com/product/0636920041528.do

Hopefully someone who actually knows what theyre talking about will be along shortly!

> Are they charged the same as external customers or do they get a 'wholesale' rate?

Id be quite surprised if internal customers are charged a markup. Presumably the whole point in operating an internal service is that you lower the cost as much as possible for your internal customers.

> As internal clients, do they run under the same conditions as external clients? Or is there a shared internal server pool that they use?

From the above book, it seems that the hardware is largely abstracted away so most services aren't really aware of servers. I assume there's some separation between internal and external customers, but at a guess that'd largely be because of the external facing services being forks of existing internal tools that have been untangled from other internal services.

> Do they get any say in the hardware or low-level configuration of the systems they use? (ie. if someone needs ultra low latency or more storage, can they just ask Joe down the hall for a machine on a more lightly loaded network, or with bunch more RAM, for the week?)

As above, the hardware is largely abstracted away. From memory, teams usually say "we think we need ~x hrs of cpu/day, y Gbps of network,..." then there's some very clever scheduling that goes on to fit all the services on to the available hardware. There's a really good chapter on this in the above book.

> Do they have the same type of performance constraints as the ones encountered by gitlab?

Presumably it depends entirely on the software being written.

jiqiren · on Nov 13, 2016

But some workloads get all the priority while others get zero/idle priority. Not true in public cloud.

dastbe · on Nov 13, 2016

Multitenancy is a large part of what makes public cloud providers profitable, but they all understand the need to isolate customer resources as much as possible.

datadata · on Nov 13, 2016

Using a resource abstraction layer such as mesos can alleviate this downside by consolidating many of your workloads onto a pool of large dedicated machines.

mprovost · on Nov 13, 2016

In the end it doesn't really say that they rolled their own on prem solution. For the kind of money they were forking out in the cloud you could just buy a Netapp or Isilon and get something that provides enough consistent storage performance. You don't need a distributed FS for the kinds of numbers they're looking at, using one is just a complicated way of working around the underlying limitations of cloud storage. In your own datacentre getting storage that works is pretty easy.

sytse · on Nov 13, 2016

Most appliances are not geared towards many small random reads. And if you scale they start to be very expensive. And we would love to use an open source solution all our users can reuse.

mprovost · on Nov 13, 2016

I'm not an expert in Ceph but I've built many other storage solutions and typically where distributed filesystems fall down in performance is with small files. Even something like an Isilon can get into trouble with those kinds of workloads. The files are too small to be striped across multiple nodes and there's a lot of metadata overhead. Monolithic systems tend to do better with small files but even then you can run into trouble at the protocol (NFS) level with the metadata.

Disk systems do get expensive at scale but the scale that they're usually sold at these days is pretty huge. You talk about going up to a petabyte but that's a fraction of a single rack's worth of disk these days. Not everyone wants to be a filesystem expert and distributed filesystems are jumping in at the deep end.

pcarranza · on Nov 13, 2016

You are right regarding many small files. Interestingly reading from many small files didn't seem to be so much of a problem with CephFS as it was to keep a large file open while reading and writing to it from thousands of processes (the legacy authorized_keys file).

Clearly CephFS as weak spots, but for what I've seen those are sports that we can work out, rough edges here and there. The good thing is that we are much more aware of these edges.

We are already working on what the next step will be to soften out these weaknesses so we are not impacted again. And of course to ship this to all our customers, either they run on CephFS, NFS appliances, local disks or whatever makes sense for them.

mtanski · on Nov 14, 2016

FYI. The kernel Ceph client has local fscache cache. I added it to the kernel :) https://lwn.net/Articles/563146/

We started using Ceph because we wanted to be able to grow our storage and compute independently. While it worked well for us we ended up having much larger latencies is as a result of this. So we developed FSCache support.

Even better yet, if you data is inherently shareable (or has some kind of locational affinity) you can end up always serving data with a Ceph backend from the local cache with the exception of a server going down or occasional request. I'm guessing it is (repo / account)

On your API machines serving out the git content of out the DFS you can setup a local SSD drive to read only caching. Depending on you workload you can end up significantly reducing the IOPs on the OSDs and also lowering network bandwidth.

With the network / IOPs savings we've decided to run our CephFS backed by Erasure Coded pool. Now we have lower cost of storage (1.7x vs 3x replication) and better reliability because now with our EC profile we can lose 5 chunks before data loss instead of 2 like before. That's because we more the 90% of requests are handled with local data and there's a long tail of old data that rarely accessed.

If you're going to give it a try, make sure you're using a recentish kernel such as a late 3.x series (or 4+). That has all the Cephfs FSCache / and upstream FSCache kinks work out. If you're using relatively recent kernel such late 3.x series or 4+ (as in ubuntu 16.04).

pcarranza · on Nov 14, 2016

Thanks mtanski, this is great data.

We are running a recent kernel as in ubuntu 16.04.

The reason I'm framing the caching not so much at the CephFS level is because we are shipping a product, and I don't think that all our customers will be running CephFS on their infra. Therefore we will need to optimize for that use case also, and not only focus on what we do at GitLab.com.

Thanks for sharing! Will surely take a look at this.

connorshea · on Nov 12, 2016

I just want to note that the blog post you linked to was never actually published since we decided against pointing fingers when – by and large – the problems weren't their fault and their engineers were quite helpful. See the comment at the bottom of that thread[1] from Sid.

We made mistakes ourselves that were the root of the problems at the time, and based on the update[2] a few months later things definitely improved significantly.

[1]: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/... [2]: https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/...

user5994461 · on Nov 13, 2016

Judging by the posts.

Random thought: Go talk to google and get $100k of free cloud credits.

You run 100 instances of n1-highcpu-2 (2 CPU, 2GB RAM), with a local 375 SSD. That gives you 37TB of storage and you're fine for the year.

This is an absolutely ridiculous setup but it could work.

1) Servers don't need to have more memory for caching. Local SSD access is fast enough.

2) Servers have 1Gb uplink. That can be saturated easily by a single local SSD.

3) I'm a firm believer that if you put 8-16 disks in a single box, you're gonna hit a hard limit with the network bandwidth and infrastructure (among other things).

4) Do you need IOPS or do you need storage???

5) Can CephFS shard fairly across a hundred systems?

6) Should CephFS run on raw disks, manage replica itself? and can it handle failure?

Bonus question) Why are we thinking about all of this. You should have just used s3?

geofft · on Nov 13, 2016

> Bonus question) Why are we thinking about all of this. You should have just used s3?

NFS and Ceph both support the standard POSIX file access model, notably including random-access reads and writes on a file without replacing the entire file and consistency models that let you reason about directories and their contents usefully. If your goal is to run the reference implementation of git, or something that is compatible with it, on the filesystem, you absolutely need something that supports this.

You could write something new that speaks the git protocol but is designed to be backed by an eventually-consistent object store instead of a POSIX-compliant file store ... but that seems like an equally big challenge, honestly.

kyrra · on Nov 13, 2016

Fun aside. Google wrote a backend to Mercurial so it could be stored on bigtable. They did a talk about it 7 years ago.

http://m.youtube.com/watch?v=ri796Hx8las

geofft · on Nov 13, 2016

And to SVN! https://lwn.net/Articles/194667/

It's not a bad decision; getting POSIX semantics to play well with modern expectations of a highly-consistent and highly-available distributed system is very difficult. But it's a work-intensive decision, and figuring out how to deploy NFS or Ceph at scale might be less work. Especially if Google's SVN/hg developers can work directly with the Bigtable developers on features, and GitLab's developers can't work directly with the S3 team (but can own their own Ceph or NFS stack).

zaphar · on Nov 13, 2016

Their Git hosting was also running on that same backend (I was on the team managing those products). But to a previous posters point That is a lot of work both to implement as well as to maintain. If you can just use a distributed filesystem you might save yourself some trouble.

Maarten88 · on Nov 13, 2016

Yes, write a Git backend using a distributed actor system, like Akka, Orleans or Service Fabric, backed by Table Storage, Bigtable or some other distributed hashtable store. Git maps naturally to this structure, using git hashes as table index.

You'll have a really cheap and very scalable solution that uses low level storage mechanisms yet it will also be really performant because when loaded everything will be served from memory.

sytse · on Nov 13, 2016

For sure, and people are trying to do this on the basis of libgit2. But so far nobody has been able to make it work for all of the cases. Even GitHub is using filesystems as far as I know.

markelliot · on Nov 13, 2016

One of the teams I work closely with used JGit's extensibility to run git on top of a KVS: https://medium.com/@palantir/stemma-distributed-git-server-7...

Notably this doesn't require a distributed filesystem.

sytse · on Nov 13, 2016

That looks really interesting, thanks for posting. I added it to our issue https://gitlab.com/gitlab-com/infrastructure/issues/727#note...

ethomson · on Nov 14, 2016

For Visual Studio Online and TFS, Microsoft is using libgit2 and some custom ASP.NET code for the upload-pack / receive-pack endpoints, stuffing the data into SQL Azure. No filesystem, no `git` executable.

nixgeek · on Nov 14, 2016

Relevant: http://githubengineering.com/introducing-dgit/

Rapzid · on Nov 13, 2016

Others have done this with libgit2 storage backends.

dragonne · on Nov 13, 2016

Anyone in particular, or anything open source? I could use this for a work project, but the closest I've been able to find is an old project which backed JGit to Cassandra and the Dulwich Swift backend[1].

[1]: https://www.dulwich.io/apidocs/dulwich.contrib.swift.html

sytse · on Nov 13, 2016

We looked at this but it didn't have either the features or the speed we needed. I do think it is very promising.

kijiki · on Nov 13, 2016

If you're rolling your own, you can tune these things.

For example, in this case, since they're IOPs-bound first, and probably network-bound second, you can cram 16 SSDs in a box, and then put a dual-port 25Gig Ethernet NIC up to 2 leaf switches at the top of rack, and then build a CLOS out of 100G spine switches, so that there are essentially never any network bottlenecks for Ceph.

This is very similar to DreamHost's DreamCompute cluster design, though they're using 2x 10G to the servers, and 40G up to the spines, since 25G/100G wasn't available when they built the cluster.

sytse · on Nov 13, 2016

We're currently at 70TB and we're planning for a 256TB setup https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7N... . So that would be a lot of servers. I don't think SSDs for the file storage are affordable. We'll use SSDs for the journals.

nixgeek · on Nov 14, 2016

I think that's a false assumption. Especially in 2016.

Unless something changed dramatically at GitHub they're a 100% solid-state fleet since 2012 when they started down the "building their datacenter cages" journey.

Whilst you could say as the 500lb gorilla in the market they can afford that luxury, there are other examples, i.e. DigitalOcean, where SSD is a core part of their product and it is offered inexpensively ($5/mo).

Even if you're looking at 3-way replication of your data, meaning you're buying 768TB of storage, the SSD should only tot up to $200-300k for Intel S3510 (or similar).

There's a non-trivial amount of additional effort involved in any kind of "multi-layered" storage (c.f. data on 7200rpm, NVMe cache) over and above just making all of your storage moderately performant in IOPS terms. That has cost too.

boulos · on Nov 13, 2016

Sorry about that! I was on my phone, and followed the first thing from the linked post/issue; I didn't realize it wasn't published.

connorshea · on Nov 13, 2016

No problem, the thread is pretty long, and it's not necessarily easy to tell without understanding how our blog post process works. :)

objectivefs · on Nov 13, 2016

An alternative to user5994461's suggestion would be to switch from CephFS to a different distributed filesystem. If you need POSIX and can't go directly to S3 you could try ObjectiveFS[1]. Using local SSD instance store and memory for caching you can get very good performance even for small file workloads[2].

[1]: https://objectivefs.com [2]: https://objectivefs.com/howto/performance-amazon-efs-vs-obje...

jamesblonde · on Nov 13, 2016

Another option would be to move to HDFS. HopsFS, a new distribution, can scale to over 1m ops/sec: http://www.logicalclocks.com/index.php/2016/10/14/hops-smash...

p_l · on Nov 13, 2016

As someone who has run Hdfs in production on bare metal, I wonder how much you dislik gitlab...

TL;DR - HDFS is rarely the answer, mainly due to pains usually involving crashed nameservers.

jamesblonde · on Nov 13, 2016

HopsFS has multiple redundant, stateless NameNodes.

sytse · on Nov 13, 2016

ObjectiveFS looks interesting but I wonder about the latency if you do many small random reads (this is what we're seeing with git).

objectivefs · on Nov 13, 2016

Latency for small random reads (16 parallel threads), using the Linux kernel source tree, when hitting in the SSD instance store disk cache have a 95th-percentile latency of 6ms with ~50MB/s throughput. When hitting in the memory cache the 95th-percentile latency is <1ms with ~380MB/s throughput. We have more performance data available at https://objectivefs.com/howto/performance-amazon-efs-vs-obje...

sytse · on Nov 13, 2016

That sounds very good, but I think we prefer Ceph because it is better known. Do you have a comparison between Ceph and ObjectiveFS?

Dwolb · on Nov 12, 2016

I think what you're seeing in this post has product implications for Google Cloud.

The OP is asking for guaranteed IOPS/latency SLA for which they're willing to pay HUGE money for by ROLLING THEIR OWN.

Possibly think about high service pricing tiers for systems that require it. This is a standard operations problem that can co-exist within pooled service models.

boulos · on Nov 12, 2016

See my note about SLAs and ToR failures. We probably could promise something for our Local SSD offering (tail latency < 1ms!), but high-performance, guaranteed networked storage is just tricky.

As I said, rolling your own will not give you a guarantee, it will just give you the responsibility for failure. We don't offer the guarantee, because we don't want you to believe it can't fail.

ocdtrekkie · on Nov 12, 2016

At a certain scale, it's better to just take that responsibility rather than rely on someone to be able to handle that burden. When something fails, the worst thing is being out of control of it and unable to do anything about it.

Last time I had a failed storage controller, HP delivered a replacement in four hours. Last time a service provider went down, I had no visibility into the repair process, and had to explain there was nothing I could do... For five days.

It seems like you've outright admitted you can't guarantee what they need, yet you still urge them not to leave your business model. Maybe come back when your business model can meet their needs.

ckozlowski · on Nov 13, 2016

One of the things I stress when talking to customers is that when one is moving to cloud, it's not just the business model that changes, but the architecture model.

That may sound like cold comfort on the face of it, but it's key to getting the most out of cloud and exceed the possibilities of an on-prem architecture. Rule #1 is, everything fails. The key advantage to a good cloud provider (and there are many) is not that they can deliver a guarantee against failure (as boulos stated correctly) but that they'll allow you to design for failure. The issue becomes when the architecture in the cloud resembles that which was on-premise. While there's still some advantages, they're markedly fewer, and as you said, there's nothing you can do to prioritize your fix.

They key to having a good cloud deployment is effectively utilizing the features that eliminate single points of failure so that the same storage controller failure that might knock you out in your on-prem can't knock you out in the cloud, even though the repair time for the latter might be longer. That brings its own challenges, but brings huge advantages when it comes together.

Disclosure: I work for AWS.

z3t4 · on Nov 13, 2016

Most ppl just want to deal with one provider though. Meaning there has to be a middle man btw you and the customer.

boulos · on Nov 13, 2016

That is quite honestly amazing turn around. Most people aren't in a position to get a replacement delivered from a vendor that quickly, but if you are, awesome. Again though, there's a big gap between simple, local storage (we could probably provide a tighter SLO/SLA on Local SSD for example) and networked storage as a service.

As someone else alluded to downthread: anyone claiming they can provide guaranteed throughput and latency on networked block devices at arbitrary percentiles in the face of hardware failure is misleading you. I don't disagree that you might feel better and have more visibility into what's going on when it's your own hardware, but it's an apples and oranges comparison.

wtbob · on Nov 13, 2016

> Most people aren't in a position to get a replacement delivered from a vendor that quickly

Back when I worked in enterprise services, it was a requirement that we had good support contracts — 1-, 4- or 8-hour turnarounds were standard.

If you're running a serious business, support contracts are a must.

boulos · on Nov 13, 2016

Physically delivered, no matter customer location? Sign me up ;).

nixgeek · on Nov 13, 2016

I'm not sure why you're so surprised, this is standard logistics management for warranty replacement services.

Manufacturer specifically asks customer to keep it informed where the equipment is physically located, and then prepositions spares to appropriate depots in order to meet the contractual requirement established when the customer paid them often a large amount of money for 4-hour Same Day (24x7x365) coverage for that device.

This isn't how hyperscale folks operate for the same reason Fortune 100's rarely take out anything more than third-party coverage when their employees rent vehicles, it becomes an actuarial decision, the # of 'Advance Replacement' warranty contracts and the $ involved, vs. buying a % of spares and keeping those in the datacenter then RMA'ing the defective component for refund/replace (on a 3-5 week turnaround time).

tl;dr - Operating 100-500 servers you should likely pay Dell and they'll ship you the spare and a technician to install it, operating >500 servers and you should do the sums and make the best decision for your business, operating >5000 servers and you probably want to just 'cold spare' and replace yourself.

phamilton · on Nov 13, 2016

We had such a contract with Dell when I worked in HPC (at a large university). Since our churn was so high (top 500 cluster) we had common spare parts on site (RAM, drives) but when we needed a new motherboard it was there within 4 hours.

Now, these contracts aren't like your average SaaS offering. We had a sales rep and the terms of the support contract were personalized to our needs. I imagine some locations were offered better terms for 4 hour service than others.

As I learned long ago, never say no to a customer request. Just provide a quote high enough to make it worthwhile.

emmelaich · on Nov 13, 2016

As nixgeek says, this is standard -- for a price.

That said I suspect most enterprises pay too much; as in their money might be better spent buying triple mirroring on a JBOD rather than a platinum service contract with a fancy all in one high end RAID-y machine.

matwood · on Nov 13, 2016

Back when I worked in a big corp with an actual DC onsite we had similar contracts. I assume they worked anywhere in the US as we had multiple locations. IIRC they were with HP.

kijiki · on Nov 13, 2016

At the scale gitlab is talking about, it is usually better to source from a vendor that doesn't provide ultra-fast turnaround, and just keep a few spare servers and switches, so you can have 0-hour turnaround for hardware failure.

Get servers from Quanta, Wistron or Supermicro, and switches from Quanta, Edge-core or Supermicro. The $$$ you save vs a name-brand OEM more than pays for the spares. Use PXE/ONIE and an automation tool like Ansible or Puppet to manage the servers and switches, and you can get a replacement server or switch up and in service in minutes, not hours.

If you're moving out of the cloud to your own infrastructure, it makes sense to build and run similarly to how those cloud providers do.

nixgeek · on Nov 14, 2016

I kinda disagree, see my other comment on logistical models, this isn't 5000+ system scale or even 500+ system scale.

There's non-trivial cost involved in simply being staffed to accommodate the model you propose, all of the ODM's have some "sharp edges" and you need to be prepared to invest engineering effort into RFP/RFQ processes, tooling and dealing with firmware glitches, etc.

Remember that 500-rack (per buy) scale is table stakes for "those cloud providers", it is their business, whereas GitLab is a software company. Play to your strengths.

kijiki · on Nov 15, 2016

In my experience, you'll be dealing with firmware glitches even with mainstream OEMs. You can avoid a lot of the RFP/RFQ and scale issues by going to a ODM/OEM hybrid like Edge-core, or Quanta. Or Penguin Computing or Supermicro. If you already have a relationship with Dell or HP, you probably won't get quite as good pricing, but they're still options.

I am shockingly biased (I co-founded Cumulus Networks) but working with a software vendor who can help you with the entire solution is very helpful.

The scale gitlab has talked about in this thread is firmly in the range where self-sparing/ODM/disaggregation make sense. I think 500 racks is a huge overestimate, I think the cross-over point is closer to 5 racks.

smallnamespace · on Nov 12, 2016

I think you're missing the point a bit -- the claim is not that the business model prevents these guarantees, but that it's inherently difficult for any party to do.

> had to explain there was nothing I could do... For five days

The good alternative would be that you controlled the infrastructure and it didn't break. The bad alternative is you directly control infrastructure, it breaks, and then you get fired.

toomuchtodo · on Nov 12, 2016

Wikipedia, Stack Overflow, and Github all do just fine hosting their own physical infrastructure.

It's untrue that "the cloud" is always the right way. When you're on a multi-tenant system, you will never be the top priority. When you build your own, you are.

Google and AWS have vested interests (~30% margins) in getting you into the cloud. Always do the math to see if it's cost effective comparatively speaking.

nixgeek · on Nov 13, 2016

Sure and GitHub also has literally an order of magnitude more infrastructure than is being discussed in these proposals and retains Amazon Web Services and Direct Connect for a bunch of really good reasons.

Transparency from GitLab is excellent but you shouldn't really generalise statements about cloud suitability without the full picture or "Walking a mile in their shoes".

sytse · on Nov 13, 2016

Yep, and we at GitLab have Direct Connect to AWS as a requirement for picking out new colo.

Pyxl101 · on Nov 13, 2016

If you guys use AWS, have you taken a look at EBS provisioned IOPS? EBS provisioned IOPS comes to mind here since it allows you to specify the desired IOPS of a volume, and provides an SLA.

> Providers don't provide a minimum IOPS, so they can just drop you.

The reason I ask is because the blog post generalizes a lot about what cloud providers offer or what the cloud is capable of, but doesn't explore some of the options available to address those concerns, like provisioned IOPS with EBS, dedicated instances with EC2, provisioned throughput with DynamoDB, and so on.

sytse · on Nov 13, 2016

EBS goes up to 16TB, we need an order of magnitude more.

You're right that the blog post was generalizing.

BTW We looked into AWS but didn't want to use an AWS only solution because of maximum size, costs, and the reusability of the solution.

hagbarddenstore · on Nov 14, 2016

Per disk. You can attach multiple disks to a single EC2 node.

So, RAID multiple EBS volumes and you have a larger disk.

solipsism · on Nov 12, 2016

That's not a strawman. It's just a false statement.

toomuchtodo · on Nov 12, 2016

Thanks for the correction.

empath75 · on Nov 13, 2016

By point of comparison, we had storage problems that went on for months or years. We had a vendor that was responsive to slas, but the problem was bigger than just random hardware failures, it was just fundamentally unsuited to what we were trying to do with it. That's the risk you take when you try to build your own.

ocdtrekkie · on Nov 13, 2016

In this case, it sounds like the cloud was fundamentally unsuited for what GitLab was trying to do with it. So definitely still a risk!

empath75 · on Nov 13, 2016

As someone on the private cloud team for a large internet company, dealing with storage problems is a nightmare that never ends. AWS is a walk in the park by comparison.

ethbro · on Nov 13, 2016

> See my note about SLAs and ToR failures. [...] but high-performance, guaranteed networked storage is just tricky

New job title: Networked Infrastructure Actuary

jwildeboer · on Nov 13, 2016

OTOH, solving the tricky parts in scalable ways is exactly the kind of unique selling point that cloud providers should offer.

hueving · on Nov 12, 2016

A ToR failure doesn't have to mean the end if you're willing to wire each server to two ToRs and duplicate traffic streams to both. It's a waste, but it's one way to achieve high reliability if you have customers willing to pay for it.

sytse · on Nov 13, 2016

I'm not sure we want to pay huge money. Right now it looks like our cloud hosting bill for 2 months (about $250k) can pay for the hardware to host 4x as much https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7N...

prirun · on Nov 13, 2016

My business partner and I ran a website (rubylane.com) for many years, just the 2 of us, colocated at he.net. With good hardware (we used Supermicro servers bought from acme.com), it's not a big deal. We mirrored every disk drive with hardware raid1, had all the servers connected in a loop via their serial ports, had the consoles redirected to the serial ports, and it was not much of a hassle. When we were first starting, we used cheap hardware and that caused us some pain. The other very useful thing we had setup is virtual IP addresses for all of the services: search engine, database, www server, image server, etc. The few times we ever had trouble with the site or needed to take a machine out of service, we could redirect its services to another machine with fakeip.

Recently I bought an old SuperMicro server on eBay, configured it with 2 6-way AMD Opterons, 3 8-way SATA controllers, and 32GB of memory. With 4TB 5700RPM drives in an 8-way software RAID6, it could do 800MB/sec. I realize it's not small file random I/O, but a blended configuration where you put small files on SSD and large files on spinning disk would probably be pretty sweet.

My intuition is that learning all the ins-and-outs of AWS, and how to react and handle all kinds of situations, is not that much easier than learning how to react with your own hardware when problems come up. Especially consider that AWS is constantly changing things and it's out of your control, whereas with your own hardware, you get to decide when things change.

If you can colocate physically close, it's a lot easier. Our colocation was in Fremont but our office was in San Fran, so it was a haul if we had to install or upgrade equipment. But even so, there was only 1 or 2 times in 7 years that we needed to spend 2 consecutive days at the colo. One of those was during a major upgrade where it turned out that the (cheap) hardware we bought was faulty.

sytse · on Nov 13, 2016

Thanks. We plan to use a remote hands service to install new servers.

Dwolb · on Nov 13, 2016

How are you calculating cost to your organization to run your own hardware? With a cloud provider you're benefitting from their own engineering pooled across thousands of customers.

When you run your own hardware you have all the engineering you were already doing plus investing to upkeep and to improve your architecture.

As Google's boulos said, that's where the real costs are.

sytse · on Nov 13, 2016

Indeed with metal you have less flexibility and much higher engineering costs that offset your savings. We think metal will be more affordable as we scale but that is not the reason for doing it. We do it because it is the only way to scale Ceph.

nixgeek · on Nov 14, 2016

I know at least two 500TB+ clusters running on IaaS and don't think "only way to scale Ceph" is to buy and rack machines.

In an earlier comment you said EBS only goes to 16TB and that is "an order of magnitude less" than your requirement, however, thats per volume, you can attach many volumes in much the same way as servers have many disks.

Scale horizontally not vertically, add more OSD instances? With each you can attach a number of EBS or PD volumes which each IOPS characteristics that in aggregate are sufficient to service your workload?

If you want to avoid EBS or PD entirely, is there a reason you can't look at 'i2' or 'd2' instance types?

https://cloud.google.com/compute/docs/disks/performance https://aws.amazon.com/ebs/details/#VolumeTypes

At a fundamental level you're just moving the problem and trading managing metal (which is hard) for I/O guarantees.

"Why is this harder than you might expect?" - you stated elsewhere that you'll have Remote Hands do rack/stack. Providers like Equinix refer to this as "Smart Hands". Everyone who's managed a reasonable-sized environment finds this term highly ironic, as the technician can and will replace the wrong drive, pull the wrong cable, etc.

I've done an non-trivial amount of infrastructure 'stuff' (design, procurement, install, maintenance, migration) for some well-known companies, if you want to Hangout for an hour and pick my brain, gratis, my e-mail is in my profile.

user5994461 · on Nov 13, 2016

> The OP is asking for guaranteed IOPS/latency SLA for which they're willing to pay HUGE money for by ROLLING THEIR OWN.

There is no guarantee of SLA in any distributed system. The best you can do is measure things and know what you'll get most of the time.

If you want SLA, you can make a single server with 10TB of memory as storage. That's solid choice! :D

ex3ndr · on Nov 13, 2016

Ah, really? We are moved from Google Cloud, we just needed one machine with average IOPS to store our files (50-100gb only), but we are constantly hitting limits of IOPS, file upload for our users at some point became a nightmare. What's a problem with cloud when we pay for it 1k$+? Renting one server costs ~50$/mo and will cover everything. Only thing is needed is some adjustments in configuration. But, thanks to k8s we can roll out servers in a week. I am wonder now why we was needed to sit on clouds when everything works 10x times slower at 10x price?

static_noise · on Nov 13, 2016

What if your hardware fails?

What if you need to scale up (or scale down again)?

Making servers run as reliable as in cloud datacenters is really hard work and imposes additional cost. I.e. you need not one, but two or three datacenters with 2-3 times the amount of servers you actually use. Then you need admin staff who knows their work and keeps things running smoothly and reliably. You need not one but at least 3 of that staff because they might get sick or go on vacation.

Your cost savings probably comes from cutting corners. Depending on the structure of your applications, downtime requirements and recovery plans, that may even be OK. You don't always need a fleet of tanks to deliver a box of milk bottles.

ZenoArrow · on Nov 13, 2016

> "Making servers run as reliable as in cloud datacenters is really hard work and imposes additional cost."

Not really. It's easy enough to have dedicated server hardware that runs just as smoothly as cloud hosted hardware. There are many options for doing so. The key is to pay for support.

For example, if you go with Microsoft or Red Hat or Ubuntu servers, you can pay for support contracts to get help from experts in configuring systems. Such options can work out cheaper than cloud hosting (depending on the IT infrastructure requirements), with the added benefit of having hardware more directly under your control.

Pyxl101 · on Nov 13, 2016

What happens when you run into trouble with one of the servers? Like it's suddenly stopped responding and you can't SSH?

For my systems built on EC2, I might not need to do anything at all for typical hardware failures, if I've set up EC2 auto recovery. It transparently relaunches my instance on another server. If I'm not using auto recovery, then I might just stop and restart the instance, in which case I'm also migrated to another physical server if the first one wasn't working. It has the same identity, by the way: same IP address, same disk contents, and all that, and all the machine perceived was an OS reboot (or maybe kernel panic followed by reboot, depending on the failure).

For stateless systems I don't even need to bother with that. I'll just configure it to spin up a replacement instance if the desired number of servers I want to have online is not met because one of them failed. When I have 100+ servers in a fleet, this is really convenient. I don't need to keep track of them individually, I just say: "This is the image I'd like to run, and I want to have 100 of them". Hardware failures that would bring down a normal physical machine don't need to involve me at all.

The old server hardware that's broken is now in the hands of EC2 to repair or swap out, and my server's back up and running in minutes, possibly with no action on my part.

Is something like that easy to achieve with colocation and support offerings with those vendors? Genuine question - I've never operated in colo or worked with those vendors. I realize that colo facilities can take care of repairing and swapping servers for me, but the benefit I get from the cloud is that I can perform those actions in moments with an API, or they can happen automatically, and I don't need to engage other humans. I can't imagine spending my time coordinating with vendors, getting on the phone, opening support tickets, waiting for people to do things, etc... right now I can do everything at the speed of computing with APIs and automation. It'd be tough to give that up.

ZenoArrow · on Nov 13, 2016

> "What happens when you run into trouble with one of the servers? Like it's suddenly stopped responding and you can't SSH?"

You've got plenty of options. Reboot the server (using a local login, as the server is hosted where employees/support staff are physically close to it), restore the server from a backup, etc...

Let's put it like this, how do you think companies managed before cloud servers? Do you think they just had the attitude of 'Oh damn the server has gone down, nobody can do any work today'? I'd put it to you that contingency plans existed long before cloud hosts did.

static_noise · on Nov 13, 2016

If you start running EC2 clusters, you're more in the "cloud storage" dominion and not really in the "self hosting" area anymore.

Not everybody needs that and not everybody who does need it also realizes it before a major failure strikes out one of their machines and there is no working failover and their one admin is not reachable and when they finally reach him he has to be flown in, then they need to wait for replacement parts, which all in all delays recovery by several days.

brazzledazzle · on Nov 15, 2016

Out of band management gets you remote console access and fallback to a local technician if there's a hardware problem.

static_noise · on Nov 13, 2016

Do you need to store data?

If yes, how is it replicated over multiple datacenters?

If yes, does failover work flawlessly?

ZenoArrow · on Nov 13, 2016

> "If yes, how is it replicated over multiple datacenters?"

Depends on the database engine used. For example...

https://msdn.microsoft.com/en-gb/library/ms151799.aspx

>"If yes, does failover work flawlessly?"

Again, depends on the database engine used. To use SQL Server as an example again...

https://technet.microsoft.com/en-us/library/ms189590(v=sql.1...

brazzledazzle · on Nov 15, 2016

Traditional clustering with SQL Server is a huge pain in the ass. When it works (most of the time) it works well but failures can get hairy quick. There's tradeoffs but if you can make it work AlwaysOn is so much easier to deal with from an ops perspective.

ex3ndr · on Nov 13, 2016

Hardware fails? It will start on other node automagically. May be we will need to switch psql instance. What kind of additional cost? Do you ever hit something really awful with k8s on self-hosted servers? What can go wrong except losing node?

static_noise · on Nov 13, 2016

So, you have redundancy and failover at the application level. You don't need reliable servers then.

empthought · on Nov 13, 2016

Reliability is with respect to some SLA; it could very well be that one would _like_ to timeout more quickly than can be reasonably expected of cloud infrastructure, but well within the capabilities of hardware itself.

user5994461 · on Nov 13, 2016

> You don't always need a fleet of tanks to deliver a box of milk bottles.

A bottle of milk filled with Gold is worth about $800k.

By extension the 6-pack is $4.8M

If tanks can fit seamlessly in the budget, we shall give tanks a serious thoughts! :D

user5994461 · on Nov 13, 2016

If your setup can run on a $50 server, you shouldn't use AWS/GCE ^^

Out of curiosity. What were you running? What was your setup? How much disk IOPS? and disk bandwidth used?

ex3ndr · on Nov 13, 2016

Are you sure that (just example) - https://www.hetzner.de/us/hosting/produkte_rootserver/ex41ss... is a weak server? Running same VM on AWS/GCE will cost 10x more.

user5994461 · on Nov 13, 2016

OMG A server WITHOUT ECC memory. #fear

Let's not call that a server but a desktop computer please ^^

---

The point of the cloud is to manage MANY servers. If all you have is a single box, you don't need help to manage it, you don't need AWS/GCE.

ex3ndr · on Nov 13, 2016

If i can buy two desktops instead of one server then i can made a nice failover configuration instead of using just one server. Also as i know google and fb think that ecc is useless, isn't it?

Well, just order 50 of them and install k8s.

puzzle · on Nov 13, 2016

Failover is useless when you have already increased your chances of corrupting data. Even with ECC there's some degree of corruption; doing without is suicidal.

Google figured quite early in its life that RAM was the one component not to skimp on: http://research.google.com/pubs/pub35162.html

Spooky23 · on Nov 12, 2016

If you have an application where the performance and growth patterns are well known, cloud isn't a choice made for lowest cost. It may be best value for different reasons.

I've done ROI studies for several applications like that, and usually cloud has a higher total cost unless there are specific availability requirements or you don't have a facility that can meet a 99.9 SLA.

The key assumption is that you know a lot about the app and it's in a operating mode. If you're in a hyper growth mode, have fluctuating or seasonal demand, or you have no capital funds, cloud is a no brainer.

brianwawok · on Nov 13, 2016

For sure, I think you can often save a decent amount of $$ with dedicated hardware. It does have downsides of course...

If your demand is 50% stable 50% fluctuating (say some base load + a big spike at US prime time), I still think you can win with a hybrid cloud... i.e. serve base load from a COLO, and serve spike load from the cloud. That does mean you need to configure at least 2 networks, but not a terrible idea from a DR standpoint anyway (Main DC fail? Push a button and run off the cloud until it is fixed)

northrup · on Nov 14, 2016

This thinking (hybrid cloud for dynamic scale out) is going into our decision at GitLab and makes a lot of sense for us.

qaq · on Nov 13, 2016

hardware provides so much headroom that it's not even funny to compare :) a good PCIE board will do 850K IOPS for reads for 10K (4TB). You can have a sane understandable setup. In our experience our clients on AWS had more downtime than on dedicated clusters. All major outages of public clouds have being due to control plane issues because it's too complex. Unless you are a major customer have fun trying to get things resolved with the cloud provider. At gitlub scale they would need 4 ops guys and some remote hands for hardware installs considering cloud pricing on bandwidth and storage and the cost of the ops team (+/- 24K month in Ukraine tops) they will not only get much better performance but will actually save decent amount of money.

merb · on Nov 12, 2016

I sense a collaboration oppurtunity between gcloud and gitlab might be benefitical, while google helps gitlab getting git running against your object storage ;) would be a win-win for both I guess!

WhatIsDukkha · on Nov 12, 2016

"""For what it's worth, hardware doesn't provide an IOPS/latency SLA either ;)."""

Actually it really does for all practical purposes.

When it fails its SLA you replace it.

diziet · on Nov 12, 2016

On the storage front, even enterprise grade SSDs with hundreds of thousands of 4k read and write IOPs set up in over-provisioned arrays won't give you a great SLA unless you really lowball it and say the SLA is 99th percentile latency / IOPS is 1/2 benchmarked and the hardware has to sustain that latency/IOPS for 99.99% (after benchmarking out to figure out cache vs sustained performance)

That's just storage, now you need to add so many layers on top of it. Like @boulos said, it is then your fault, but your customers still see the issues.

jfim · on Nov 13, 2016

ToR failure? Tried googling it but it wasn't much help.

kevan · on Nov 13, 2016

Top of Rack = the network switch that ties the servers on a rack to each other and provides a gateway to the rest of the datacenter.

stuntprogrammer · on Nov 13, 2016

I always recommend these:

https://scalableinformatics.com/assets/documents/Unison_Peta...

Surprisingly cost effective given the massive performance and support. Cloud is great for some things but not everything.

sijoe · on Nov 13, 2016

Thanks for the shout out! Without being a commercial, this high performance storage/analytics realm is what we focus on. We were demoing units like this: https://scalability.org/images/30GBps.png for years now. Insanely fast, tastes great, less filling. Our NVM versions are pretty awesome as well.

[edit]

I should point out that we build systems that the OP wants ... they likely don't know about us as we are a small company ...

https://scalableinformatics.com

and we do use gitlab

https://gitlab.scalableinformatics.com

...

Always happy to help ...

sytse · on Nov 13, 2016

Thanks for using GitLab and posting in https://gitlab.com/gitlab-com/infrastructure/issues/727#note... This looks really interesting.

grezql · on Nov 13, 2016

I did not realize they were on Azure until I read this comment. GitLab post did not mention it, why did you then?

NetStrikeForce · on Nov 13, 2016

I'm glad I wasn't the only one to notice! Literally the first thing the Google engineer does is mention Azure, while I had to go back to Gitlab's article all confused as I didn't notice at any point they were running on Azure.

I can't say it's a bad thing to do; I just can't help but notice how Google invests in social media participation. That's why they own the conversation in places like HN and can pull off this stuff.

What I fail to understand is, how does GCE or AWS tackle the issue described in the article? As far as I understand, their problem seems difficult to work around due to the nature of the Cloud (shared).

How would GCE be better than AWS or Azure at this? I would be really interested to know and I'm sure that'll be useful for other HNers with the same worries.

illumin8 · on Nov 14, 2016

Disclaimer: I work for AWS.

To solve problems just like these, we offer EBS (elastic block storage) with provisioned IOPs guarantees. Essentially, you can get guaranteed IOPs if you need it for I/O intensive applications; up to 30,000 IOPs per EBS volume.

But, PIOPs EBS volumes wouldn't be my first recommendation. It sounds like what they really need is an elastic, scale-out filesystem with NFS semantics. We have Elastic File System, or EFS, which is exactly that. It's a petabyte scale filesystem that is highly available across multiple availability zones, and scales in IOPs and performance as it scales in size.

Their application should also look at leveraging S3 object storage, rather than NFS, because that is a highly distributed, highly available object storage system, that is likely to give better scalability, availability, and performance, than rolling your own Ceph infrastructure.

NetStrikeForce · on Nov 18, 2016

This is a great answer, thanks illumin8! (and cool nick too :D)

hitekker · on Nov 13, 2016

Technical details aside, the "buts", parentheses, and generally weird structure of this reply made it very confusing.

I personally believe that, when a prospect feels wronged or poorly served by a competitor, it's the company's responsibility to reach out to that prospect directly. Ideally 1-on-1, the company should aim to listen carefully to the prospect's concerns and not to immediately start spouting opinions or solutions. From there, an honest dialogue can take place which can form the basis of a sale. And, more importantly, a relationship.

Of course, if the goal is simply to castigate the competition or to ward off uncomfortable questions by implying the dissatisfied prospect is thinking poorly or emotionally, then a public forum post works better, I guess.

boulos · on Nov 13, 2016

Sorry if you found my writing style confusing. I'm not looking to shame anyone. Instead, my hope was to counter the overly broad "It's clear we must not be in the Cloud" with my experience that this may be particular to their provider.

I'm not here trying to sell them on coming to Google. If they're interested, my contact info is in my profile. It's quite likely that the best option for them would be to return to AWS, as they have experience running there and EBS has improved a lot over the years.

jtwaleson · on Nov 12, 2016

So Gitlab is in a bit of a strange position here. Sticking to a traditional filesystem interface (distributed under the hood) seems stupid at first. Surely there are better technical solutions.

However, considering they make money out of private installs of gitlab, it makes sense to keep gitlab.com as an eat-your-own-dog-food-at-scale environment. Necessary for them to keep experience with large installs of gitlab. If one of their customers that run on-prem has performance issues they can't just say: gitlab.com uses a totally different architecture so you're on your own. They need gitlab.com to be as close as possible to the standard product.

Pivotal does the same thing with Pivotal Web Services, their public Cloud Foundry solution. All of their money is made in Pivotal Cloud Foundry (private installs).

From a business perspective, private installs are a way of distributed computing. Pretty clever, and good way of minimizing risk.

candiodari · on Nov 13, 2016

It's not just that. The cloud is far, far more expensive than bare metal, and that is under completely optimal financial conditions for the cloud providers (extremely low, even 0%, interest rates for the companies owning them, hence required return is only a few percent)

Dedicated servers, and colocation are going to be far cheaper than the cloud, and worse, the savings directly related to the size of the infrastructure you need.

That, combined with the fact that even the very best of virtualization on shared resources still kills 20-30% of performance.

So there's 3 things you can use the cloud for:

1) your company is totally fucked for IT management, and the cloud beats local because of incompetence (this is very common). And you're unwilling or unable to fix this. Or "your company doesn't focus on IT and never will" in MBA speak.

2) wildly swinging resource loads, where the peak capacity is only needed for 1-5% of time at most.

3) you expect to have vastly increasing resource requirements in a short time, and you're willing to pay a 5x-20x premium over dedi/colo to achieve that

The thing I don't understand is that cloud has both a lower limit (cloud is (far) more expensive than web hosting, and having a VPS) that is an extremely common case, and far more expensive once you go over a certain capacity (doesn't matter which one, CPU, Network, Disk, ... all are far more expensive in the cloud). Even if you have wildly varying loads there's an upper limit to the resource needs where cloud becomes more expensive.

The thing I don't understand is why so many people are doing this. I ran a mysql-in-the-cloud instance on Amazon for years, with a 300 Mb database, serving 10-50 qps to serve as a backend to a website, and a reporting server that ran on-premise. Cost ? 105-150 euros per month. We could have easily done that either locally or on a VPS or dedicated server for a tenth of that cost.

Cloud moves a capital cost into an operational cost. This can be a boon or a disaster depending on your situation. You want to run an experiment that may or may not pan out ? Off the cloud you'll have spare capacity that you can use but don't really have to pay for. On the cloud cost controls will mean you can't use extra resources. You can't loan money from the bank ? The cloud(but also dedi providers) can still get you capacity, essentially allowing you to use their bank credit for a huge premium.

agrajag · on Nov 13, 2016

I think you're missing some use cases here. Cloud can be really helpful for prototyping systems that you may not even want running in a few months.

Another use case would be where infrastructure costs are minor compared to dev and ops staff costs. If hosting on AWS makes your ops team 2x as productive at a 30% infrastructure markup that can be a steal.

djsumdog · on Nov 13, 2016

I was thinking this too while reading the above comment. For starting up and prototypes, the aspect of a managed host far outweighs and initial performance losses. As companies scale, we see them move to their own hardware (Stackoverflow, Gitlab, etc.)

It's best to design your stuff so it can easily go from hosted solutions (this "cloud" bullshit term people keep using) to something you manage yourself. Docker containers are a great solution to this.

If you setup some ansible or puppet scripts to create a docker cluster (using mesos/marathon, kubernets, docker swarm, etc) and built it in a hosted data center; it's not going to take a whole lot of effort to provision real machines and run that same stack on dedicated hardware.

Macha · on Nov 13, 2016

I don't think stack overflow ever used cloud providers for their main compute and storage needs, just periphery stuff like cloudflare.

candiodari · on Nov 13, 2016

Look at those cost factors. If you have an ops team (ie. 2+ people) it is a near certainty that cloud is more expensive. And if you absolutely do not need an ops team, VPS is going to beat the cloud by a huge margin.

2 people doing ops work cost 7-8k $ per month. Let's assume each of them is managing at least 5x their own cost in infrastructure spend, ie 35k+ $/month. That easily buys you 20-30 extremely high spec dedicated machines, if necessary all around the world, with unlimited bandwidth. On the cloud it wouldn't buy you 5 high spec machines with zero bandwidth, and zero disk.

Let's compare. Amazon EC2 m3.2xlarge (not a spectacularly high end config I might add, 8vCPU, 30Gig ram, 160G disk, ZERO network) costs $23k per month. So this budget would buy you 2 of those. Using reserved instances you can halve that cost, so up to about 4, maybe 5 machines.

Now compare softlayer dedicated (far from the cheapest provider), most expensive machine they got: $1439/month. Quad cpu (32 cores), 120G ram, 1Tb SSD, 500G network included (and more network is about 10% of the price amazon charges for the same). For that budget it gets you 25 of these beasts (in any of 20+ datacenters around the globe). On a low cost provider, like Hetzner, Leaseweb or OVH you can quadruple that. That's how big the difference is.

It used to be the case that Amazon would have more geographic reach than dedicated servers, but that has long since ceased to be true.

There is a spot in the middle where it makes sense, let's say from $100+ to maybe $10k where cloud does work. And you are right that it lets a smaller team do more. But there's 2 things to keep in mind : higher base cost that rises far faster when you expand compared to dedicated or colo. This is not a good trade to make.

boulos · on Nov 13, 2016

Your math is off by 60x for AWS and 100x for GCE.

An m3.2xlarge is $.532/hr or $388/month, not $23k/month [1]. A similar instance on GCE (n1-standard-8) is $204/month with our sustained use discount, and then you need to add 160 GB of PD-SSD at another $27/month (so $231 total) [2].

Disclosure: I work on Google Cloud, but this is just a factual response.

[1] https://aws.amazon.com/ec2/pricing/on-demand/ [2] https://cloud.google.com/compute/pricing

jcrites · on Nov 13, 2016

EC2 m3.2xlarge can be had for $153/month as well when purchased as a reserved instance.

EC2 reserved instances offer a substantial discount over on-demand pricing. The all-up-front price for a 3-year reservation for m3.2xlarge would be an amortized monthly rate of $153/month, which is a 61% saving vs. the on-demand price of $388/month, according to the EC2 reserved instances pricing page.

Granted, using this capacity type requires some confidence in one's need for it over that period of time, since RIs are purchased up front for 1 or 3 year terms. But RIs can also be changed using RI modification, or by using Convertible RIs, and can be resold on a secondary marketplace. As a tradeoff in comparison to GCE's automatic sustained use discount, the EC2 RI discount requires deliberate and up-front action.

encoderer · on Nov 13, 2016

For the saas business I run, Cronitor, aws costs have consistently stayed around 10% total MRR. I think there are a lot of small and medium sized businesses who realize a similar level of economic utility.

Edit: I do see in another comment you concede the value of the cloud for ppl spending under $10k a month.

pritambarhate · on Nov 13, 2016

One use case where cloud shines is managed databases. Having a continuously backed up and regularly patched DB with the promise of flexible scalability is definitely worth it.

closeparen · on Nov 13, 2016

Unless

1) Your needs are very static, or

2) Your IT department can competently replicate the PaaS experience on its in-house metal (common big tech company strategy)

The cloud is likely to do wonders for velocity, as when you have a new use case, you can "just" spin up a new VM and run the company Puppet on it within an hour or so, vs. wait weeks to months for a purchase order, shipping, installation at the colo facility, etc.

If your IT department is doing Mesos or Kubernetes or something with a decent self-service control panel for developers, then you get the best of both worlds, but you also have to build and maintain that.

jtwaleson · on Nov 12, 2016

A risk is of course that your public environment is so much bigger than your biggest private install, that it doesn't make sense anymore to keep the same technology for both tiers. I think both PWS and gitlab.com suffer from this.

cryptarch · on Nov 12, 2016

Does Gitlab support some kind of federation between instances?

If so, they could (in theory) split Gitlab.com into a bunch shards which as a whole match the properties of N% of enterprise users. That'd be a pretty cool way to avoid the different-in-scale problem (although you might still run into novel problems as you're now the Gitlab instance with the most shards..).

woot01 · on Nov 13, 2016

this is what Salesforce does - they have publicly visible shards in the url (na11, na14, eu23, etc) and login.salesforce.com redirects you to your shard when it figures out who you are.

londons_explore · on Nov 13, 2016

For anyone considering this, reconsider. It bit me many times.

In general, over time, load on some shards will increase while others decrease. Migrating a customer from one shard to another will likley cause a short outage for them, and many bugs down the line when they've bookmarked all kinds of things.

brazzledazzle · on Nov 15, 2016

You need to preserve the name<->customer association and maintain another key that you can use to split traffic at the LB in case a customer outgrows their shard. But personally I think it looks hokey and should not be something a customer sees anywhere but perhaps a developer tool or sniffer.

eonw · on Nov 13, 2016

could that not be resolved with a reverse proxy or load balancer trickery? ie; hide the shard name externally.

user5994461 · on Nov 13, 2016

> However, considering they make money out of private installs of gitlab, it makes sense to keep gitlab.com as an eat-your-own-dog-food-at-scale environment.

Depends on how they bill.

If they bill on a few private installs while giving unlimited storage & projects for free on the cloud, they're setting up themselves for bankruptcy.

Gotta take care of the accounting. Having an unsustainable pricing is a classic mistake of web companies.

merb · on Nov 12, 2016

well they can probably include a better storage layer.

at the moment you can just setup multiple mount points, but I guess it would be superior to actually setup a moint point or object storage. I'm pretty sure that you can put git into something like s3.

jtwaleson · on Nov 12, 2016

s3fs? ;)

A) I think it would be kind of difficult to store git on S3 (the magic would be in the caching / consistency layer to keep it performant)

B) it would make sense for their customers that run on AWS. However, many of them just run on a single VM / physical server.

merb · on Nov 12, 2016

well I just said s3 since it's the best known, but I guess it would be feasible to run on object storage. I heard that libgit2 can actually change his storage layer..

btw here is the alibaba approach: http://de.slideshare.net/MinqiPan/how-we-scaled-git-lab-for-...

Edit: and I heard github uses gitrpc.

sytse · on Nov 13, 2016

Indeed this is a big factor. We want our customers and users to be able to use open source to host their repositories. That is one of the reasons to use Ceph instead of a proprietary storage solution such as Nutanix or NetApp.

brongondwana · on Nov 13, 2016

Having never really gone to the cloud, thankfully, FastMail can strongly recommend New York Internet if you're looking for a datacentre provider. They've been amazing.

https://blog.fastmail.com/2014/12/10/security-availability/

Actually, we don't appear to blog quite enough about how awesome NYI are. They're a major part of our good uptime.

And some stuff about our hardware. I'd strongly recommend hot stuff on RAID1 SSDs and colder stuff on hard disks. The performance difference between rust and SSD is just massive.

We're looking at PCI SSDs for our next hardware upgrades:

http://www.intel.com.au/content/www/au/en/solid-state-drives...

(there are two types of SSD on the market. Ones that lose your data, and SSDs from Intel) - we're currently running mostly DC3700 SATA/SAS SSDs.

We customise the layout of our machines very much to match the heat patterns of our data:

https://blog.fastmail.com/2015/12/06/getting-the-most-out-of... https://blog.fastmail.com/2014/12/15/dec-15-putting-the-fast... https://blog.fastmail.com/2014/12/04/standalone-mail-servers...

blorgle · on Nov 12, 2016

Coming from OpenStack land where Ceph is used heavily, it's well known that you shouldn't run production Ceph on VMs and that CephFS (which is like an NFS endpoint for Ceph itself) has never been as robust as the underlying Rados Block Device stuff.

They probably could have saved themselves a lot of pain by talking to some Ceph experts still working inside RedHat for architectural and other design decisions.

I agree with other poster who asked why do they even need a gigantic distributed fs and how that seems like a design miss.

x0x0 · on Nov 12, 2016

Also -- if you look at the infra updated from the linked article, they mention something about 3M updates/hour to a pg table ([1], slide 9) triggering continuous vacuums. This feels like using a db table as a queue which is not going to be fun at moderate to high loads.

[1] https://about.gitlab.com/2016/09/26/infrastructure-update/

pcarranza · on Nov 13, 2016

It's not updates, it's just querying, updates are not that bad, the main issue there was that the CI runner was keeping a lock in the database while going for the filesystem to get the commit, this generated a lot of contention.

Still this is something we need to fix in our CI implementation because, as you say, databases are not good queueing systems.

sytse · on Nov 13, 2016

For sure there is much we can improve to reduce DB updates, but we do use Redis for queues.

YorickPeterse · on Nov 13, 2016

> They probably could have saved themselves a lot of pain by talking to some Ceph experts still working inside RedHat for architectural and other design decisions.

We have been in contact with RedHat and various other Ceph experts ever since we started using it.

> I agree with other poster who asked why do they even need a gigantic distributed fs and how that seems like a design miss.

Users can self host GitLab. Using some complex custom block storage system would complicate this too much, especially since the vast majority of users won't need it.

sytse · on Nov 13, 2016

You're right. We talked to experts and they warned us about running Ceph on VMs and we tried it anyway, shame on us.

You do need either a distributed FS (GitHub made their on with Dgit http://githubengineering.com/introducing-dgit/, we want to try to reuse an existing technology) or buy a big storage appliance.

camkego · on Nov 13, 2016

Bingo! Seasoned developers and architects with 15-20+ years of experience would very likely question using software stacks like CephFS with warnings on it's website about production use! You really want no exotic 3rd party stuff in your design, and plain-Jane components like ext3 and Ethernet switches. Choosing a newer exotic distributed filesystem may really come back to bite you in the future.

wandernotlost · on Nov 12, 2016

It sounds like they didn't design for the cloud and are now experiencing the consequences. The cloud has different tradeoffs and performance characteristics from a datacenter. If you plan for that, it's great. Your software will be antifragile as a result. If you assume the characteristics of a datacenter, you're likely to run into problems.

Rapzid · on Nov 12, 2016

This got me curious again about the pluggable storage backends in Git(I assume AWS code commit is using something like this). I've looked at Azures blob storage API in the past and found it incredibly flexible..

Here is an article from a few years ago: http://blog.deveo.com/your-git-repository-in-a-database-plug...

In any case, GitLab is amazing and I can see how it's tempting to believe that GitLab the omnibus package is the core product. However, HOSTED GitLab's core product is GitLab as a SERVICE. That might require designs tailored a bit more for the cloud than simply operating a yoooge fs and calling it a day.

solatic · on Nov 13, 2016

Gitlab is limited by the fact that they need their hosted product to run the same code as their on-prem product in order to avoid forking the codebase.

If on-prem customers can't get AWS/GCE/Azure SuperFastCloudStorage™, then it can't be part of their codebase.

sytse · on Nov 13, 2016

Exactly, we want our users to be able to scale with us using open source technologies. Some are at 20k+ users so they are close to needing something like Ceph themselves.

Rapzid · on Nov 13, 2016

From looking at the comments here, thinking about what I would do/want, and what others have done to scale git GitLab is in the minority of wanting to solve this issue with a Ceph cluster.

> Some are at 20k+ users so they are close to needing something like Ceph

Or they will scale it themselves ala Alibaba: http://www.slideshare.net/MinqiPan/how-we-scaled-git-lab-for... . They appear to have written a libgit2 backend for their object store(among other things).

I don't see a good reason why solutions using different storage backends could not make it into the OSS project. Many companies run their own Swift cluster, which is OSS.

If you're using CephFS and everyone else wants to be using other Cloud storage solutions, that would actually put you at a disconnect with your users and leave room for a competitor with the tools and experience to scale out on Cloud storage to come in offering support. I would at least consider all the opinions in this thread and maybe reach out to that Minqi Pan fellow from Alibaba with questions..

I actually really like GitLab and wish we could be using it at my company; this is why I'm spending so much effort on this topic(and scaling git is interesting). Hopefully my opinions are not out of place.

sytse · on Nov 13, 2016

Thanks, we're in touch with Minqi Pan I think we all agree the Ceph solution is great if we can make it work.

jghn · on Nov 12, 2016

Can you go more into the difference in the tradeoffs and how one should design differently?

wandernotlost · on Nov 12, 2016

A blunt summary would be that everything is unreliable and disposable. You have to design for failure, because most things will almost certainly fail (or just slow down) at some point.

I learned a lot when I was first moving to the cloud from Adrian Cockroft. He has a ton of material out there from Netflix's move to the cloud. I recommend googling around for his conference talks. (I haven't watched these, but they're probably relevant: http://perfcap.blogspot.com/2012/03/cloud-architecture-tutor...)

edejong · on Nov 12, 2016

I'm trying to understand 'antifragile'. Are you trying to say: 'robust'? If not, what is the difference?

nicky0 · on Nov 12, 2016

There's a whole book[0] about it. But to summarise (poorly):

'robust' - resilient against shocks, volatility and stressors

'antifragile' - thrives on shocks, volatility and stressors (ie. gets better in response)

Antifragile is a step beyond robust. Examples of antifragility are evolutionary systems and true free market economies (as opposed to our too-big-to fail version of propped-up, overly interconnected capitalism).

0: https://en.wikipedia.org/wiki/Antifragile

dsr_ · on Nov 12, 2016

Can you provide any examples of working, real-world antifragile systems which were designed and built by humans and accomplish their purposes? Preferably in terms of software and hardware?

So far you've named "evolutionary systems", which are quite fragile in the real world, and an imaginary thing called a "true free market economy".

woot01 · on Nov 13, 2016

Taleb's book often came back to the example of the relatively free market that is a city's restaurants. Any one restaurant is fragile, but the entirety of the restaurant business in a city is antifragile - failure of the worst makes the marketplace better.

nicky0 · on Nov 14, 2016

Just trying to define it. I'm not advocating for anything.

That said, if you are calling the entire multibillion year phenomenon of life on this planet "fragile" then we are not going to get on well.

wandernotlost · on Nov 12, 2016

Netflix with their chaos monkeys is a great and relevant example.

hueving · on Nov 12, 2016

But their system doesn't thrive on chaos monkey. It's just resilient to it.

jerf · on Nov 13, 2016

In this case, the anti-fragile system is the entire system, including Netflix engineers and the cloud over time. The cloud is stressed, maybe even goes down, but in response, becomes stronger and more reliable because engineers make changes.

Point-in-time human-engineered systems still can't really be anti-fragile, except perhaps in some weird corner cases, but the system as a whole with the humans included, over time, can be.

It should also be pointed out that "anti-fragile" was always intended to be a name for things that already exist, and to provide a word that we can use as a cognitive handle for thinking about these matters, not a "discovery" of a "new system" or something. There are many anti-fragile systems; in fact it's pretty hard to have a long-term successful production project of any kind in software without some anti-fragility in the system. (But I've seen some fragile projects klunk along for quite a while before someone finally came in and did whatever was managerially necessary to get someone to address root causes.)

hueving · on Nov 13, 2016

Ah, true when you include the engineers in the loop I suppose. But then that becomes a vague term for any system where engineers fix problems after some load/failure testing.

When I think of anti fragile systems, truly adaptive algorithms come to mind that learn from a failure. For example, an algorithm that changes the leader in a global leader election system based on the time of day because one geographic region of the network is always busier depending on time of day and latency to the leader impacts performance.

wandernotlost · on Nov 13, 2016

Yes. It is stronger because it is attacked by it.

smallnamespace · on Nov 12, 2016

Isn't that bad because now you're depending on having stressors and shocks to have the best performance?

If you can't control the amount of stressors and shocks, you want a system that is neither antifragile or fragile, but strictly indifferent to the level of shocks.

nicky0 · on Nov 14, 2016

Good questions. It is a very interesting subject. If you haven't read the book, I recommend it. I think you will find it interesting if you read it with an open mind.

Guzba · on Nov 12, 2016

I don't think antifragile works in the parent comment, robust would be better.

As for a definition, antifragile things get stronger from abuse--like harming muslces during a workout so they grow back stronger. If they were just robust, it would be like machinery (no healing / strengthening).

wandernotlost · on Nov 12, 2016

I suppose I had Netflix in mind...as part of their move to the cloud, they developed chaos monkeys to actively attack their own software to make sure it is resilient to the failures that will inevitably, but less frequently, happen as a result of operating in the cloud.

solipsism · on Nov 12, 2016

Stop using the term, you don't understand what it means. Chaos monkey is a form of testing. Testing is not antifragility.

wandernotlost · on Nov 13, 2016

Chaos monkey attacks their production environment. They must make their software stronger/more resilient/less fragile in response. It would be more helpful if you clarified how you think that's different from antifragility.

solipsism · on Nov 13, 2016

Antifragility describes a system that becomes more resilient automatically, because it is built such that it must, in response to attack or damage. Such a system doesn't require management. It doesn't require people to actively test, to think about how to make the system better, to implement the fixes.

wandernotlost · on Nov 13, 2016

I think that's an overly narrow definition of the term antifragility and interpretation of the system in this case.

While I can imagine software that is in itself antifragile, I think it's entirely reasonable to include the totality of the people and process that make up an operational system, in which case even your narrow definition applies here.

solipsism · on Nov 13, 2016

If you are including the developers in the system you're calling antifragile, then Chaos Monkey is also part of the system. Antifragility refers to a system that benefits from attacks or disorder from outside itself.

You are watering down the word so much that it wouldn't have reason to exist. Is every chair-making process antifragile since chairs get stress tested before being sold?

wandernotlost · on Nov 13, 2016

Chaos Monkey is just a tool to accelerate/simulate the disorder inherent to the cloud. The original point was that the cloud is unstable and hostile, so software designed for it benefits from that disorder. Granted, it's not doing so all on its own, but exhibits the effect nonetheless, and I think Taleb's book is full of similarly impure examples.

There's a world of difference between stress testing before something is sold or released and welcoming ongoing hostility throughout its lifecycle, and this difference is absolutely in line with the concept of antifragility.

user5994461 · on Nov 13, 2016

They designed for a small operations with limited resources. That might be fair given their budget/funding which we don't know.

At the moment, the discussion in the GitHub issues looks like people who are buying servers to put and run in their garage ^^

solipsism · on Nov 12, 2016

Planning for the cloud doesn't make your software antifragile. antifragile != robust

zzzcpan · on Nov 12, 2016

> when you get into the consistency, accessibility, and partition tolerance (CAP) of CephFS, it will just give away availability in exchange for consistency.

Not their fault, POSIX API cannot fit into eventual consistency model to guarantee availability (i.e. latency). Moving to your own hardware doesn't actually solve the problem, just gives some room to scale vertically for some time. After that the only way to keep global consistency but minimize impact of unavailability is to shard everything, at least this way unavailability will be contained in its own shard without any impact on every other shard.

It's better to avoid POSIX API in the first place.

chetanahuja · on Nov 13, 2016

This is the correct answer to run github, the SAAS product at scale. Probably not practical for github the on-prem installable software. They may have to bite the bullet and separate out the two systems at some point anyway though.

edit: it's gitlab not github. The point still stands though.

diziet · on Nov 12, 2016

I priced out the hardware that they specced out to be around $1.4 million:

https://gitlab.com/gitlab-com/infrastructure/issues/727#note... The R730xd are probably around $50k after dell discounts - depends a bit on what they ended up configuring with regards to support, exact network configuration, etc

The R830s are about $50k as configured - 1.5gb ram is expensive as the R830 only has 48 DIMMs and they need the relatively expensive 32GB RDIMMs

The R630s should be about 15k each:

The switches say they are 48x 40G QSFP+ which are very expensive (I'd put them at 30k each from Dell)

50k * 20 + 50k * 4 + 15k * 10k + 2 * 30k

1m + 200k + 150k + 60k ~= $1.4m invested

updated with better R830s pricing

From my perspective consumer Samsung 850 EVO drives which can be under-provisioned to match the 1.8tb (and get better performance characteristics) would give Gitlab cheaper and more reliable storage in terms of IOPS/Latency when compared to 10k 1.8tb drives.

connorshea · on Nov 13, 2016

Unfortunately I don't believe we'll be able to comment on the accuracy of this since the quotes from companies are private information, but I do love math like this :D

(Community Advocate at GitLab)

Rapzid · on Nov 13, 2016

I would take a look at Supermicro servers and compare pricing. Worked as cloud engineer at a hosting provider built entirely on Supermicro kit. Didn't see anything there I didn't see from the Dell's and HP's when I worked at Rackspace. And Rackspace sure didn't build their cloud on top of HP's or Dell's.

nixgeek · on Nov 13, 2016

Fetch a quote from Supermicro and tell your Dell or HP folks that you'll buy that if they don't match the price.

If you're large enough, then you should be talking to other folks including Wiwynn, ZT, Quanta, Celestica, et al.

They won't always get all the way there, depends on volume, but often they'll get pretty close, and it'll probably be a 25%+ additional discount vs. what they'll offer without you twisting their arm using a cheaper manufacturer as leverage.

They have to believe you're serious, obviously, which can mean visibly throwing a couple small orders to another manufacturer to tell your sales team they're not the only game in town.

In quality terms, Supermicro is worse vs. Dell/HP, which can still be fine if you've got enough scale to work through all the little issues with firmware.

Rapzid · on Nov 13, 2016

We never had issues and I never noticed a quality issue over Supermicro vs Dell's and HP's at Rackspace. There were firmware niggles and updates with both; we were always upgrading server, storage card, and storage controller firmware at the Rack. I'm not privy to studies at scales large enough to tell which is higher quality.

I'm not beholden to either though haha. If you can get Dell to come close to Supermicro prices then that might make sense. When I was last in the infrastructure game a few years ago we also ran into configuration limitations with Dell. At the time we could get more storage and RAM in the form factors we needed from Supermicro.

nixgeek · on Nov 13, 2016

This is fair and buying at end of quarter or end of year is often the little extra kick needed to get Dell/HP to compete.

One of my main beefs with Supermicro is their tooling around remote patching and configuration of BIOS and other firmware, they charge for it, on top of warranty, and the UX is awful.

With both Dell and HP if you're under a support contract all the updates and tools to apply them are included, and whilst both manufacturers have some "sharp edges" to their tools for managing large fleets, neither is "stab yourself in the eyeballs with a blunt fork" bad like Supermicro.

sytse · on Nov 13, 2016

Thanks for looking into this. I do believe that the linked comment configuration was over the top, we'll have less memory and 7.2k rpm drives. We're working on the configuration in https://docs.google.com/spreadsheets/d/1XG9VXdDxNd8ipgPlEr7N... feel free to leave comments there.

sijoe · on Nov 13, 2016

Without knowing more on the specific needs (I followed some of the threads to try to grok it), it would be hard to guess what they really need.

[commercial alert]

My company (Scalable Informatics) literally builds very high performance Ceph (and other) appliances, specifically for people with huge data flow/load/performance needs.

Relevant links via shortener:

Main site: http://scalableinformatics.com (everything below is at that site under the FastPath->Unison tab) http://bit.ly/1vp3hGd

Ceph appliance: http://bit.ly/1qiOYpy

Especially relevant given the numbers I saw on the benchmarking ...

Ceph appliance benchmark whitepaper: http://bit.ly/2fMahfJ

Our EC test was about 2x better than the Dell unit (and the Supermicro unit), and our Librados tests were even more significantly ahead.

Petabyte scale appliances: http://bit.ly/2fuTTAH

We've even got some very nice SSD and NVM units, the latter starting around $1USD/GB.

[end commercial alert]

I noticed the 10k RPM drives ... really, drop them and go with SSDs if possible. You won't regret it.

Someone suggested underprovisioned 850 EVO. Our strong recommendation is against this, based upon our experience with Ceph, distributed storage, and consumer SSDs. You will be sorry if you go that route, as you will lose journals/MDS or whatever you put on there.

Additionally, I saw a thread about using RAIDs underneath. Just ... don't. Ceph doesn't like this ... or better, won't be able to make as effective use of it. Use the raw devices.

Depending upon the IOP needs (multi-tenant/massive client systems usually devolve into a DDoS against your storage platforms anyway), we'd probably recommend a number of specific SSD variations at various levels.

The systems we build are generally for people doing large scale genomics and financial processing (think thousands of cores hitting storage over 40-100Gb networks, where latency matters, and sustained performance needs to always be high). We do this with disk, flash, and NVMe.

I am at landman _at_ the company name above with no spaces, and a dot com at that end .

redstripe · on Nov 13, 2016

What kind of dell discounts does dell typically give larger customers? We get 0 (from HP) at my workplace and I've always wondered.

I'm sure it depends on how much you buy.

Does it vary by components? They seem to charge a lot for drives so I'm guessing those can be heavily discounted.

icelancer · on Nov 13, 2016

I helped design a DC for a big data mining / science company and we got just under 20% off the higher margin packages. I felt very good about the number and Dell was great to work with on the purchase side.

late2part · on Nov 13, 2016

Please stop and read this and show it to your boss.

If you are really getting 0% discount you need to private message me and send me $10k so I can tell you this one weird trick.

You simply ask for them.

Just do that. Everyone gets discounts.

diziet · on Nov 13, 2016

Well, it depends on how much stuff you're buying and from whom. If you're buying $800 dollar workstations you won't get as many discounts as when you're buying 400+GB quad CPU 2U servers

late2part · on Nov 13, 2016

Agreed. But even a single purchase is likely to be below list price.

eonw · on Nov 13, 2016

the last time i was in a position that i oversaw hardware purchasing... you get better deals based on your overall spend. after we had spend 250k we got a new sales rep that had access to offer us lower pricing, same thing happened once we breached the $1m mark. new sales rep and pricing went lower. In fact, our first sales rep couldn't even make up a reasonable deal at $80k because he didn't have access to that level of discounting.

usually i always send everything back and tell them to do better.

my biggest gripe about dell pricing was no line items, just a bottomline price. so i couldnt tell which things i could save money on by sourcing them elsewhere(ie: larger drives, large amounts of memory)

patrickg_zill · on Nov 13, 2016

Subtract about 30% , maybe more. Dell pricing can be aggressive, if you are buying enough.

Maarten88 · on Nov 13, 2016

Pro-tip: always buy around the end of their sales quarter. Have internal approval ready. Just promise you'll sign immediately if they lower the price to XXX. Around the close of the quarter their sales guys are pushed to get deals done, and they are sometimes willing to give even bigger discounts.

I think at Dell the quarters are one month off (end of januari, april etc)

diziet · on Nov 13, 2016

Indeed, there would be discounts from Dell, but I also expect companies to buy some extra add-ons and subscription/support services that cancel them out that I didn't see in the high level specs.

yuhong · on Nov 13, 2016

32GB DDR4 RDIMMs are not that expensive anymore. 8Gbit DDR4 is close to crossover now.

diziet · on Nov 13, 2016

Yeah, you're actually right. I relied on my memory of pricing them out a bit back, and it looks like the prices are much closer to other DIMMs now.

kmf · on Nov 12, 2016

As always, the neat thing about GitLab is how open they are with their process. I enjoyed this read, and followed the trail down to a corresponding ticket, where the staff is discussing their actual plans for moving to bare metal. Very cool.

https://gitlab.com/gitlab-com/infrastructure/issues/727