Proposed server purchase for GitLab.com

gr2020 · on Dec 11, 2016

Think much harder about power and cooling. A few points:

1. Talk to your hosting providers and make sure they can support 32kW (or whatever max number you need) in a single rack, in terms of cooling. At many facilities you will have to leave empty space in the rack just to stay below their W per sq ft cooling capacity.

2. If you're running dual power supplies on your servers, with separate power lines coming into the rack, model what will happen if you lose one of the power lines, and all of the load switches to the other. You don't want to blow the circuit breaker on the other line, and lose the entire rack.

3. Thinking about steady state power is fine, but remember you may need to power the entire rack at full power in the worst case. Possibly from only one power feed. Make sure you have excess capacity for this.

The first time I made a significant deployment of physical servers into a colo facility, power and cooling was quite literally the last thing I thought about. I'm guessing this is true for you too, based on the number of words you wrote about power. After several years of experience, power/cooling was almost the only thing I thought about.

walrus01 · on Dec 12, 2016

Something you should do to understand the actual power consumption of a server (ac power, in watts , at its input):

Build one node with the hardware configuration you intend to use. Same CPU, ram, storage.

Put it on a watt meter accurate to 1W.

Install Debian amd64 on it in a basic config and run 256 threads of the 'cpuburn' package while simultaneously running iozone disk bench and memory benchmarks.

This will give the figure of the absolute maximum load of the server, in watts, when it is running at 100% load on all cores, memory IO, and disk IO.

Watts = heat, since all electricity consumed in a data center is either being used to do physical work like spinning a fan, or is going to end up in the air as heat. Laws of physics. As whatever data center you're using will be responsible for cooling, this is not exactly your problem, but you should be aware of it if you're going to try to do something like 12 kilowatts density per rack.

Then multiply wattage of your prototype unit by number of servers. This will tell you how many fully loaded systems will fit in one 208v 30a circuit on the AC side.

Also, use the system bios option for power recovery to stagger bootup times by 10 seconds each per server, so that in the event of a total power loss and an entire rack of servers does not attempt to power up simultaneously.

btgeekboy · on Dec 12, 2016

This is good advice. I'd consider not having them auto power on, however. This allows them to bring things up in a controlled manner.

That brings me to my next point: GitLab should also being mindful on what services are stored on which hardware - performing heroics to work around circular dependencies is the last thing you want to be doing when recovering from a power outage.

walrus01 · on Dec 12, 2016

Yes, this is a choice that depends on the facility, and what sort of recovery plan you have for total power failure. In many cases you would want to have everything remain off, and have a remote hands person power up certain Network equipment, and key service, before everything else. On the other hand, you may want to design everything to recover itself, without any button pushing from humans.

Depends a lot of server/HA/software architecture.

davidbrent · on Dec 11, 2016

Think much harder about power and cooling. A few points:

I've only ever built desktop machines, and this top comment drew a surprising parallel to most help me with my desktop build type posts. Granted, I'm sure as you dig deeper, the reasoning may be much different, but myself being ignorant about a proper server build, it was somehow reassuring to see power and cooling at the top!

late2part · on Dec 12, 2016

Nowadays once you get past 10 racks or 50kw, you generally only pay for power - the space/cooling/etc is "free" as your limiting rate is power and the vendor's ability to move thermal. You'll likely want a chimney cabinet like the Chatsworth GlobalFrame [1].

[1] - http://www.chatsworth.com/products/cabinet-and-enclosure-sys...

walrus01 · on Dec 12, 2016

This does depend a lot on geographical location. You're not going to get free racks in a major carrier exchange point in Manhattan, or downtown Seattle, or close to the urban core of San Francisco. There will definitely be a significant quantity discount, as you increase in numbers, and a lot of your cost will be power and not racks.

In somewhere that is very close (OSI layer 1) topologically to a major traffic interexchange point, you will definitely be paying somehow for the monthly cost of the square footage occupied. For example a colo in 60 Hudson or 25 Broadway in NYC, or in one Wilshire in LA.

Large-scale colocation pricing that is based on power only will be found in locations that are not also major traffic inter exchange points. For example quincy WA or the many data centers in suburban new jersey.

late2part · on Dec 12, 2016

I think your point is right for 111 8th or Equinix/SV1/Great Oaks.

But for other sites with high value peering, including EQIX in Ashburn, Coresite, your limiting rate is likely to be the power, not the space or power. I.e. they'll "give" you 1 cabinet for every 15kw you buy.

So my assertion assumes you're doing large number of dense cabinets.

If one is doing a large number of dense cabinets, they almost certainly should not do it at a high value peering point, and should backhaul it. You should be able to get diverse metro dark fiber for <$5k/mo if not substantially less. Put a single cabinet (or pair for redundancy) of $2k/mo cabinets in Equinix or Fisher and off you go.

spydum · on Dec 11, 2016

fully agreed here, especially leave yourself room for when you misjudged some resource utilization (and need more). Nothing worse than having a resource crunch (cpu/mem/io) and not being able to resolve it because your rack is out of power/cooling/etc - you'll come to appreciate how easy it was in cloud just clicking the button and turning out your wallet.

sytse · on Dec 11, 2016

Great points. We'll make sure to wire to separate power feeds that can both handle the entire load. Suggestions in how to calculate this? Taking the maximum rated load seems over the top.

KenoFischer · on Dec 11, 2016

I recently had to do this. The server I was putting up was rated for 3kW. To determine the expected load, I put it under a dummy load that I reasonably considered the maximum for what I would expect on the server (this was a dev machine, so I picked compiling the linux kernel as a benchmark). I ran that until the power stabilized (SuperMicro servers can measure power consumption in hardware and expose this via IPMI - very handy), because power consumption may keep creeping up for a few minutes as the fans adjust to the new operating temperature. I then repeated the same exercise with a CPU torture test (Prime95), just to see what the maximum I could possible get out of the machine was. The numbers turned out to be about 1.8kW for the linux kernel benchmark, and 2.3kW for the torture test. What I ended up doing was to provision for 2kW and use the BIOS' power limiting feature to enforce this. That would kill the machine if it exceeded its designed load, but that's usually better than tripping the breaker and killing the entire circuit. You may also want to talk to your data center provider about overrages. Some of the quotes I got wouldn't kill your circuit when you went over, but would just charge you (a crazy amount, but better than losing all your servers).

Hope that helps. Your deployment is larger than ours, so there may be other techniques, but that's what we did.

sytse · on Dec 12, 2016

I love the idea of killing the server instead of the circuit, thanks!

BuuQu9hu · on Dec 12, 2016

Might be better to just throttle the CPU/etc when too much power is being used.

toomuchtodo · on Dec 11, 2016

Prime95! I haven't seen that since 2000-2001! Solid tool for burning in a box and putting the CPU under maximum load.

zonywhoop · on Dec 12, 2016

Having designed and run colo data centers for many years my rule of thumb for calculating this for customers was to use the vendors tools if available or take 80% of power supply rating. Keep in mind that most servers will have a peak load during initial power on as the fans and components come online and run through testing. If a rack ever comes up on a single channel and the circuits are not rated right that breaker will just pop right back offline and you'll have to bring it up by unplugging servers and bringing them up in sets. Also most breakers are only rated at 80% load so you have to de-rate them for your load. So, e.g., a 20amp x 208 3 phase circuit really only has 8KW of constant power draw available.

late2part · on Dec 12, 2016

+1. As tempting as it is to say "oh this server only needs full power when it boots" this may come back to bite you. As you grow usage in CPU and Disk, the power used by the server will increase substantially.

late2part · on Dec 12, 2016

note that you can get around the 80% breaker limit by having your DC hardwire the power, if you have enough scale to have them do this for you.

DanielDent · on Dec 12, 2016

Circuit breakers aren't just there for the amusement value of watching a clustered system go into split-brain mode...

I.e. you would also want to be sure that your wiring was rated for 100% utilization, and that other circuit-breaker-like functions exist.

Fire is an actual thing, and figuring out the best way to recharge a halon system isn't exactly what you want to be doing.

late2part · on Dec 12, 2016

Yes. That's why they hardwire and use a breaker rated at 100%. In most jurisdiction, code requires breaker at 80% if a plug/receptacle is used. If you are hardwired you can use 100% of your capacity before tripping the breaker. You have incorrectly assumed that I suggested that you ignore breakers.

When you hardwire the circuit the electrical code allows you to use a 100% breaker.

DanielDent · on Dec 12, 2016

Thank you for adding this detail... what you said makes WAY more sense to me now.

Dylan16807 · on Dec 12, 2016

What exactly gets hardwired to what? This is surprisingly hard to search for details on.

late2part · on Dec 12, 2016

Instead of your plug of your PDU going into a receptacle, the wires that would go into the plug are hardwired to a panel circuit breaker.

This is less common in DCs historically but more and more as folks do 208v 3phase 100A circuits.

mohctp · on Dec 12, 2016

This really is only a concern when you are paying a monthly recurring charge (MRC) by the breaker amp with many power drops.

For a deployment of this scale it should be metered power (For example 1 (or more) 3phase a+b drops to each cabinet) where you only pay a Non-Recurring setup Charge (NRC) and then the MRC is based on actual power draw.

3phase also means fewer physical PDU's (uses less space), but more physical breakers. Over-building delivery capability will eliminate any over-draw concerns for startup cycles.

late2part · on Dec 12, 2016

Not really, although I agree with your reasoning. The other is issue is capex. When I deploy 240kW pods, if I use 80% breakers, I have to deploy 25% more PDUs than if I have 100% breakers.

Since my cabinet number is usually evenly divisible by N*PDUs, this impacts overall capital.

mohctp · on Dec 12, 2016

We are talking about 1 to 2 cab density here so capex doesn't carry that much weight.

Having a little headroom on your power circuits is also incredibility important, and not every facility will sell 100% rated breakers. It may make more sense to be in a facility with 80% rated breakers than 100%, even with the added capex of an extra PDU or two.

Goes back to my previous comment. What is important to you, at the pod / multiple pod level, isn't as important to the 1-2 cab deployment.

late2part · on Dec 12, 2016

That's fair and accurate.

Similarly, ensure spare room in the cabinet for adjustments, that thing you forgot, and small growth. Much better to have 70% full and not need the space then to have no free RU and need the space.

late2part · on Dec 12, 2016

Most DC PDUs will stagger your outlet power on to stagger the initially large power draws.

mohctp · on Dec 12, 2016

Some DCs provide PDU's and some don't.

Rapzid · on Dec 11, 2016

At RS we would burn the servers in measure the load with a clamp meter. For large scale build-outs(swift, cloud servers) we would burn-in entire racks and measure the load reported by the DC power equipment.

I would recommend verifying everything is fault tolerant/HA as expected every step of the way. We ran into issues where the power strips on both sides were plugged into the same circuit(D'oh), wrong SST's, redundant routers getting cabled up to the same power strips, etc and you name it.

After a rack is setup have people at the DC(your own employees or the DC's techs) help simulate(create) failure in power, networking, and systems to verify everything is setup correct. It sounds like you have people coming onboard with experience provisioning/delivering physical systems though, so I would expect them to be on the ball with most of this stuff.

sytse · on Dec 11, 2016

Thanks, that is very helpful. And we haven't made these hires yet but we hope we can in the coming months.

DanielDent · on Dec 11, 2016

You can certainly do some math/estimates based on looking at individual component specifications, but I like using a power meter (built in to some PDUs - which is a feature worth having, or you can buy one, they are quite inexpensive).

A system at idle vs full CPU vs full cpu + all disk will produce very different measurements.

Also keep in mind 80% derating - many electrical codes will state that an X amp circuit should only be used at 0.8X on a continuous basis (and the circuit breakers will be sized accordingly).

__d · on Dec 11, 2016

For HP gear, use HP Power Advisor utility. For Dell, see the Data Center Capacity Planner. Not sure what SuperMicro has -- check with your VAR.

late2part · on Dec 12, 2016

you want 2*N Power feeds where N = 1 or 2. I recommend ServerTech PDUs. You'll want to factor in the maximum load, but with the amount of servers and power you're doing, you should be able to negotiate metered power at under $250/kW/month. Give them a commit that's 1/3 of your TDW/reportd power load, and then pay for what you use.

gr2020 · on Dec 11, 2016

I just remembered a blog post I wrote a while back about exactly these points:

http://www.rassoc.com/gregr/weblog/2011/04/13/choosing-a-col...

sytse · on Dec 11, 2016

Thanks!

jasoncchild · on Dec 11, 2016

These are all points a competent engineer would raise :)

_euvw · on Dec 11, 2016

I've found that being considered a 'competent engineer' merely means 'never too arrogant to learn'.

I didn't know half of the stuff in grandparents' post.

jasoncchild · on Dec 12, 2016

I was attempting to confirm that these are all points are important. Being too arrogant to learn and being in incompetent are two entirely different things.

_euvw · on Dec 12, 2016

Sorry, I misunderstood. To me, your response came across as a "well, duh!".

jasoncchild · on Dec 14, 2016

No problem; as I read it back to myself I can see how it came across that way.

Spooky23 · on Dec 11, 2016

I'm a cranky old person now, I think this is a crazy approach to take and I would be having a very challenging conversation with the engineer pitching this to me.

My underlying assumption is that this is a production service with customers depending on it.

1. Don't fuck with networking. Do you have experience operating same or similar workloads on your super micro sdn? Will the CEO of your super micro VAR pickup his phone at 2AM when you call?

My advice: Get a quote from Arista.

2. Don't fuck with storage.

32 file servers for 96TB? Same question as with networking re:ceph. What are your failure domains? How much does it cost to maintain the FTEs who can run this thing?

3. What's the service SLA on the servers? Historically, supermicro VARs have been challenged with that.

If I were building this solution, I'd want to understand what the ROI of this science project is as compared to the current cloud solution and a converged offering like Cisco/NetApp, HP/3Par or something like Nutanix. You're probably saving like 20-25% on hardware.

This sounds to me like pinching pennies on procurement and picking up tens of dollars of cost on the labor side. If you're accustomed to SLAs from AWS, this will be a rude awakening.

manacit · on Dec 11, 2016

I'm happy to see this - I could not agree more with these points.

I think they are coming at this problem from the wrong perspective - instead of growing from virtual servers to their own dedicated hardware to get better CephFS performance, they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS. At some point in the future, it will bite them. Not an if, but a when.

In addition, this means that running physical hardware, CephFS and Kubernetes (among other things) are now going to be part of their core competencies - I think they are going to underestimate the cost to GitLab in the long run. When they need to pay for someone to be a 30 minute drive from their DC 24/7/365 after the first outage, when they realize how much spare hardware they are going to want around, etc.

As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

daxelrod · on Dec 12, 2016

> they should take a hard look at their application and see if they can architect it in a way that does not require a complex distributed filesystem to present a single mount that they can expose over NFS

As someone who administers GitLab for my company, yes please.

Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us. (We've found mounting NFS across different datacenters to be too unreliable and our failover scenarios include loss of a data center.)

It would also be wonderful to be able to run GitLab in any of the PaaSes that only have ephemeral disk, but that's a secondary concern.

closeparen · on Dec 12, 2016

>Any high availability scenario that involves "just mount the same NFS volume on your standby" is a nonstarter for us

What are the alternatives?

I suppose there's MySQL's "stream asynchronously to the standby at the application level."

... which, now that I think about it, should be pretty easy to do with Git, since pushing and pulling discrete units of data are core concepts...

daxelrod · on Dec 12, 2016

Yep! See GitHub's Spokes, née DGit. http://githubengineering.com/building-resilience-in-spokes/

ninkendo · on Dec 11, 2016

I don't see why it wouldn't be difficult to do a git implementation that replaces all the FS syscalls with calls to S3 or native ceph or some other object store. If all they're using NFS for is to store git, it seems like a big win to put in the up front engineering cost.

I mean, especially because git's whole model of object storage is content-addressable and immutable, it looks like it's a prime use for generic object storage.

manacit · on Dec 12, 2016

This is precisely how I would recommend doing this if I were to be asked. Because git is content-addressable, it lends itself very well to being stored in an object storage system (as long as it has the right consistency guarantees). Instead of using CephFS, they could use Ceph's rados gateway which would allow them to abstract the storage engine away to working with any object storage system.

Latency and consistency would be my concerns - S3 does not quite have the right semantics for some of this, so you'd have to build a small shim on top to work around this. Ceph's rados doesn't even have these problems, so it is quite a good contender.

haileys · on Dec 12, 2016

Latency is an issue. Especially when traversing history in operations like log or blame, it is important to have an extremely low latency object store. Usually this means a local disk.

cmn · on Dec 12, 2016

Yuup. Latency is a huge issue for Git. Even hosting Git at non-trivial scales on EBS is a challenge. S3 for individual objects is going to take forever to do even the simplest operations.

And considering the usual compressed size of commits and many text files, you're going to have more HTTP header traffic than actual data if you want to do something like a rev-list.

ninkendo · on Dec 12, 2016

I'm trying to think of the reason why NFS's latency is tolerable but S3's wouldn't be. (Not that I disagree, you're totally right, but why is this true in principle? Just HTTP being inefficient?)

I would imagine any implementation that used S3 or similar as a backing store would have to heavily rely on an in-memory cache for it (relying on the content-addressable-ness heavily) to avoid re-looking-up things.

I wonder how optimized an object store's protocol would have to be (http2 to compress headers? Protobufs?) before it starts converging on something that has similar latency/overhead to NFS.

discodave · on Dec 13, 2016

That's how AWS CodeCommit works which makes it unique amongst GitHub, Gitlab and friends.

Source: https://aws.amazon.com/codecommit/faqs/

"AWS CodeCommit is built on highly scalable, redundant, and durable AWS services such as Amazon S3 and Amazon DynamoDB."

Spooky23 · on Dec 11, 2016

> As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

This is a really good point. That's easily $1M in payroll. You could probably run a decent tiered SAN with 80-95% fewer labor dollars. Plus have the benefit of torturing the vendor when you hit scaling hiccups.

kyledrake · on Dec 12, 2016

> As someone who has run large scale Ceph before (though not CephFS, thankfully), it's not easy to run at scale. We had a team of 5 software engineers as well as an Ops and Hardware team, and we had to do a lot to get it stable. It's not as easy as installing it and walking away.

Can you give some examples of the problems you ran into?

manacit · on Dec 12, 2016

Definitely! Not going to be an exhaustive list, but I can talk about some of the bigger pieces of work.

Something that always seemed to cause nagging issues was that we wanted our cluster to have data encryption at rest. Ceph does not support this out of the box, which means that you need to use dmcrypt on top of your partitions, and present those encrypted partitions to Ceph. This requires some work to make sure that decrypt keys are setup properly, and that the machine can reboot automatically and remount the proper partitions. In addition, we ran into several issues where device mapper or otherwise would lock an OSD, which would send the entire machine into lockup, messy!

We also had to work pretty hard to build quality monitoring around Ceph - by default, there are very little tools that provide at-scale fine grained monitoring for the various components. We spent a lot of time figuring out what metrics we should be tracking, etc.

We also spent a good amount of time reaching out to other people and companies running ceph at scale to figure out how to tune and tweak it to work for us. The clusters were all-SSD, so there was a ton of work to tune the myriad of settings available, on ceph and the hosts themselves, to make sure we were getting the best possible performance out of the software.

When you run dozens-to-hundreds of servers with many SSDs in them that are doing constant traffic, you tend to hit every edge case in the hardware and software, and there are a lot of lurking demons. We went through controller upgrades, SSD firmware upgrades, tweaking OP settings, upgrading switches, finding that the write workload on certain SSDs caused problems and much more.

That's just a snapshot of some of the issues that we ran into with Ceph. It was a very fun project, but if you are getting into it for a high-throughput setup with hundreds of OSDs, it can be quite a bit of work.

Happy to chat more w/ anyone that's curious - there is some very interesting and fun stuff going on in the Ceph space.

emilburzo · on Dec 12, 2016

> This requires some work to make sure that decrypt keys are setup properly, and that the machine can reboot automatically and remount the proper partitions.

I've always wondered how automatic reboots are handled with filesystem encryption.

What's the process that happens at reboot?

Where is the key stored?

How is it accessed automatically?

StreamBright · on Dec 11, 2016

Couldn't agree more. It seems Gitlab is severely underestimating the cost of having similar infrastructure on bare metal as their AWS infra. I would probably start to re-architect their software and replace Ceph with something better, probably with S3. Much easier to scale up and operate. Also their Ruby stack has a lot of optimisation potential. Starting to optimise on the other end (hardware, networking, etc.) is a much harder job, starting with staffing questions. AWS, Google and MSFT has the best datacenter engineers you can find and there is a huge amount of effort went into engineering their datacenters. Not to mention your leverage you have being a small startup vs a cloud vendor when talking to HW vendors. Anyways, in few years we gonna know if they managed to do this successfully.

kabdib · on Dec 12, 2016

I cannot tell you how many weird-ass storage systems I've decommissioned. Typically people go cheap, and are burned. It can be a year after deployment, or three, but one thing all the cheap stuff seems to have in common is that it fails very, very badly just when you need it very, very badly. Usually at 3 in the morning.

My philosophy today is that if the data is important at all, it's worthwhile going spendy and getting a SAN. Get a good one. I like Nimble a lot right now, but there are other good ones, too. (Don't believe crazy compression numbers or de-duplication quackery; I've told more than one SAN vendor to fuck off after they said they'd get 20:1 on our data, without doing any research on what our data was).

Have everything backed up? Great! How long until you can go live again after water drips on your storage? If you spend a week waiting for a restore, that's indistinguishable from failure. If you wait a month for replacement hardware to be shipped, you might have killed your product.

JustARandomGuy · on Dec 12, 2016

Thank goodness someone said it.

Perhaps I don't understand the problem domain, but I don't understand why CephFS is being considered for this task. You're trying to treat your entire set of files across all repos as a single filesystem, but that's an entirely incorrect assumption. The I/O activity on one repo/user does not affect the I/O activity of an entirely different user. Skip the one filesystem idea, shard based on user/location/whatever.

I'd appreciate any comments explaining why I'm wrong, because this doesn't seem to be a productive design to me.

_delirium · on Dec 12, 2016

Treating the whole thing as one FS is the current architecture GitLab uses, so is more of an existing constraint than a proposed architecture. To get distributed storage you either need to rewrite GitLab to deal with distributed storage, or run it on another layer that presents an illusion of one big FS (whether that's CephFS or a storage appliance).

kijiki · on Dec 12, 2016

If you're going to spend top dollar on Arista/Cisco/EMC/NetApp, you might as well stay in the cloud.

None of the clouds use any of that super-expensive gear, so if you're going for cost savings, you'll need to use the same sort of commodity gear they use.

Gitlab is obviously Linux-savvy and comfortable writing automation, so things like Cumulus Linux and minimal-handholding hardware vendors shouldn't cause them any indigestion.

<disclaimer: co-founder of Cumulus Networks, so slightly biased>

Spooky23 · on Dec 12, 2016

But this is a couple of hundred boxes, not AWS. I've been to a Microsoft data center... the scale is infinitely larger and solutions are different well.

My point isn't to knock them down. It takes cohones to be public about stuff like this. My instinct as a grumpy engineering director type is that there are holes here that need to be filled in.

Putting a major product at risk to save $30k against an Arista switch isn't a decision to make lightly. That means pricing the labor, upside benefit and business risk. If they are going to 100x this environment, Cumulus will save millions. If it will 3x, it will save a few thousand bucks -- who cares.

gsylvie · on Dec 12, 2016

cojones

late2part · on Dec 12, 2016

I disagree. With over 300 servers in AWS, you can almost certainly build a redundant data center with hardware at less than 60% of the costing assuming 3 year depreciation.

Arista and Cisco shouldn't cost top dollar; though anyone buying EMC or Netapp for any new build should have their union card revoked. FreeNAS ftw uber alles.

Source: Did it twice.

closeparen · on Dec 12, 2016

Is FreeNAS something people actually run Serious Business, at-scale production datacenters on?

I've run it in my home a few times out of curiosity, and that was never my impression.

late2part · on Dec 12, 2016

Yes. I know of several billion dollar companies that run it in a web facing production operations capacity.

Spooky23 · on Dec 12, 2016

Any good talks or other resources about this sort of use case that you might recommend?

We're... displeased with the current solution that we're using at work for this use case. :)

late2part · on Dec 12, 2016

I've created a subreddit here [0] for discussion. Ask some questions and I'll give you the information I have that's relevant.

[0] https://www.reddit.com/r/WebScaleFreeNAS/

songco · on Dec 12, 2016

Agree, as a dev on a large distributed project(include software/hardware), I don't recommend any team to use distributed storage system in production IF they don't have lots experience on it..especially opensource system.

wmf · on Dec 11, 2016

I have built out a few racks of Supermicro twins. In general I would suggest hiring your ops people first and then letting them buy what they are comfortable with.

C2: The Dell equivalent is C6320.

CPU: Calculate the price/performance of the server, not the processor alone. This may lead you towards fewer nodes with 14-core or 18-core CPUs.

Disk: I would use 2.5" PMR (there is a different chassis that gives 6x2.5" per node) to get more spindles/TB, but it is more expensive.

Memory: A different server (e.g. FC630) would give you 24 slots instead of 16. 24x32GB is 768GB and still affordable.

Network: I would not use 10GBase-T since it's designed for desktop use. I suggest ideally 25G SFP28 (AOC-MH25G-m2S2TM) but 10G SFP+ (AOC-MTG-i4S) is OK. The speed and type of the switch needs to match the NIC (you linked to an SFP+ switch that isn't compatible with your proposed 10GBase-T NICs).

N1: A pair of 128-port switches (e.g. SSE-C3632S or SN2700) is going to be better than three 48-port. Cumulus is a good choice if you are more familiar with Linux than Cisco. Be sure to buy the Cumulus training if your people aren't already trained.

N2: MLAG sucks, but the alternatives are probably worse.

N4: No one agrees on what SDN is, so... mu.

N5: SSE-G3648B if you want to stick with Supermicro. The Arctica 4804IP-RMP is probably cheaper.

Hosting: This rack is a great ball of fire. Verify that the data center can handle the power and heat density you are proposing.

sytse · on Dec 11, 2016

Wow Wes, these are all awesome suggestions. All of them (CPU, disk, memory, network, hosting) are things we'll consider doing. I added the Dell server with https://gitlab.com/gitlab-com/www-gitlab-com/commit/34bd78d8...

Would you mind if we contact you to discuss?

wmf · on Dec 11, 2016

Go ahead.

kabdib · on Dec 12, 2016

I too have built out several racks of SuperMicro twins.

I'm never buying another SuperMicro, for many reasons. The amount of cabling for properly redundant connections is a killer; it's at least three cables per system (five in our case); a rack will have hundreds of wires to manage. Access to the blade is in the back, where the cables are, and you have to think ahead and route things cleverly if you want be able to remove blades later.

The comment about doing something other than three 48-port switches is bang on. And if you're running Juniper hardware, avoid the temptation to make a Virtual Chassis (because it becomes a single point of failure, honest, and you will hate yourself when you have to do a firmware upgrade).

19KW is still a ton of power, and I'm surprised the datacenter isn't worried (none of the datacenters we use worldwide go much above 16KW usable). Also, you need to make sure you're on redundant circuits, and that things still work one of the power legs totally off. Make sure you know what kind of PDU you need (two-phase or three-phase), and that you load the phases equally.

MichaelGG · on Dec 11, 2016

Can you elaborate on what deficiencies 10GBase-T has in server applications?

walrus01 · on Dec 12, 2016

One, category 6A cables actually cost two and a half times as much as a basic single mode fiber patch cables. Two, cable diameter. Ordinary LC to LC fiber cables with 2 millimeter diameter duplex fiber are much easier to manage then category 6A. Three, choice of network equipment. There is a great deal more equipment that will take ordinary sfp+ 10 gig transceivers, then equipment that has 10 gigabit copper ports. As a medium-sized ISP, I rarely if ever see Copper 10 gigabit.

hhw · on Dec 12, 2016

SFP+ 10Gbase-T 3rd party optics started hitting the market this year, but none of the major switch vendors offer them yet, so the coding effectively lies about what cable type it is. Just an option to keep in your back pocket, as onboard 10Gb is typically 10Gbase-T. Thankfully, onboard SFP+ is becoming more common however.

For short distances of known length, twinax cables which are technically copper can be used. They're thinner than regular cat6a but only about the same as thin 6a and thicker than typical unshielded Duplex fiber patch cables. Twinax can be handy if connecting Arista switches to anything else that restricts 3rd party optics, as Arista only restricts other cable types. Twinax is also the cheapest option.

kijiki · on Dec 11, 2016

Latency, power and cost.

The PHY has to do a lot of forward error correction and filtering, so it adds latency (for the FEC), power (for all the DSP) and cost (for the silicon area to do all of the above).

late2part · on Dec 12, 2016

The latency is almost certainly immaterial, 0.3uSec v. 2uSec. That's MICROseconds not MILLIseconds. Power draw used to be an issue, but not anymore.

Consider the power pull from copper v fiber listed here [2]. The Arista 7050TX 128 port pulls 507W while the 7056SX 128 port pulls 235W. Yes, copper is more but we're talking half a kW for 2 TOR switches. And for this you get much cheaper cabling, as the SFP+ are much more expensive (go AOC if you do go fiber, BTW) and you have to worry about clean fiber, etc.

Where fiber is REALLY nice is with the density, although the 28AWG [3] almost makes that moot.

There's few DC builds that can't do end-to-end copper any more, at least for the initial 50 racks.

[1] - http://www.datacenterknowledge.com/archives/2012/11/27/data-... [2] - https://www.arista.com/en/products/7050x-series [3] - https://www.monoprice.com/product?p_id=13510

revelation · on Dec 11, 2016

I figure it's simply the availability of datacenter level switches (and other components).

late2part · on Dec 12, 2016

I politely disagree on the 10GBaseT. I built out hundreds of nodes with TOR 10G using Arista switches and it worked great.

lykron · on Dec 11, 2016

With regards to the switches, I would argue that they should skip SDN switches all together and get some Cisco Catalysts as TOR Switches. 2x 48-port switches for each rack with redundant core routers in spine-leaf. SDN is cool, but at the current scale, it seems like it would be more resource intensive than it is worth.

wmf · on Dec 11, 2016

This is one rack of equipment; no spines are needed. http://blog.ipspace.net/2014/10/all-you-need-are-two-top-of-...

Cumulus isn't really SDN so I'm not sure what you're saying there.

Traditional networking is fine but it's totally different than Linux so you need dedicated netops people to manage it. And Cisco is the most expensive traditional vendor.

lykron · on Dec 11, 2016

I'm not super well-versed in the networking area. I misread the posting. I thought they were looking at 64Us of servers.

I've just gotten the notion that SDN is almost ready for primetime, just not yet.

bronson · on Dec 11, 2016

It's been almost ready for primetime for over a decade now.

mohctp · on Dec 12, 2016

Agreed, skip SDN. Skip MC-LAG.

Juniper, Cisco, and Arista all have solutions for this environment.

compumike · on Dec 11, 2016

The raw performance benefits of bare metal vs cloud are incredible, but why does that necessarily mean building & maintaining your own hardware when you can lease (or work out whatever financing you want, but still let the hosting company maintain a lot of the responsibility for HW)? And besides financing, taking on all the HW maint? I'm not sure your needs are so unique as to require custom hardware.

You're talking about only 64 nodes right now. Your storage and IOPS requirements are not huge. A lot of mid-size hosting companies will give you fantastic service at the 10-1000 servers range. If I were you I'd talk to someone like https://www.m5hosting.com/ (note: happy dedicated server customer for many years -- and I'm sure there's similar scale operations on east coast if that's really what you need) who have experience running rock solid hosting for ~100s of dedicated servers per customer.

I suspect you may just be able to get your 5-10X cost/month improvement (and bare metal performance gains) without having to take on the financing and hardware bits yourself.

fabian2k · on Dec 11, 2016

When I looked at how expensive a rented dedicated server was compared to AWS I expected a big difference, but I was still suprised how cheap you can get dedicated servers today. Not worrying about actually owning the hardware is nice, and you still get vastly more hardware for the same price as in the cloud.

There are many advantages to AWS and similar services, but if you can't really take advantage of all the goodies because e.g. you also need to provide a local version of your software (which is the case for Gitlab, as far as I understand), renting dedicated servers is an order of magnitude cheaper.

BJanecke · on Dec 11, 2016

Right now that makes sense yes. What about 5 years from now when it doesn't? This way they get to start building a team and culture to support that kind of infrastructure and wean the baby teeth on nice soft furniture.

twic · on Dec 11, 2016

I wonder if they could go with a hosting provider, hire a sysadmin, then second them to the hosting provider to learn how it's done, in return for a credit against their costs.

In fact, i wonder if that's a business model for hosting providers. Kind of a sysadmin incubator.

BJanecke · on Dec 11, 2016

Except the really good -> exceptional talent just ends up on staff at the operations incubator and the rest go to the world at large :'(

sytse · on Dec 11, 2016

We looked at providers such as Softlayer but while they guarantee the performance of the servers they typically can't guaranty network latency. Since we're doing this to reduce latency https://about.gitlab.com/2016/11/10/why-choose-bare-metal/ this is essential to us.

We'll be glad to look into alternatives that manage the servers and network for us although the argument in https://news.ycombinator.com/item?id=13153455 that this is the time to build a team that can handle this makes sense to me too.

pyrox420 · on Dec 11, 2016

Been a softlayer customer for 4 years now. Their network is pretty awesome. When hosted in the same datacenter it's sub-ms response time always. If there is an issue they get right on it. You can even ask them to host the stuff in the same rack to get even better response time.

sulam · on Dec 11, 2016

We are also a SL customer -- 4-figures of hosts with them. We have had networking problems in the past (latency and loss far higher than I would expect to see in a well-provisioned DC) and talked to them about it. It ended up being contention with another customer, it got fixed, and our network performance has been great since.

I would encourage you to look at what you can get without trying to do your own colo. You're not at the scale where you should be thinking about that.

sytse · on Dec 12, 2016

Thanks for the suggestion. Any idea if they can offer 40 Gbps networking?

ihsw · on Dec 12, 2016

Few will offer 40Gbps without charging a pretty penny, generally the jump will go from 10Gbps to 100Gbps but not until it becomes cost-effective (and not anytime soon).

That said, SoftLayer does provide 20Gbps access within a private rack and 20Gbps access to the public network.

rphlx · on Dec 12, 2016

Most large scale operations are (or soon will be) deploying 25GE and/or 50GE in place of 10Gbps Ethernet. 100GE to each node is unnecessary for most workloads & more importantly it's obscenely expensive and likely to remain so for at least 3 more years.

sulam · on Dec 12, 2016

Architecturally I would honestly wonder why you need that. Git, with its heavy reliance on immutable objects, should replicate very well. I would expect to be NIC-bound, but I would also expect it to horizontally scale reasonably well. Storage is obviously a concern -- you will have to make sure you don't end up needing a full copy of your data on every node -- but there are well known solutions to this problem that account for hot keys/objects well enough for you to get really really far. Even reaching as far back as the BigTable paper there is valuable stuff to look at and you can obviously look at Cassandra to see how the OSS world has tackled that problem.

aberoham · on Dec 11, 2016

Same experience here, going on seven or eight years. They offer a great solution between cloud native and 100% owned/leased/caged bare metal.

sytse · on Dec 11, 2016

Cool, I was not aware you can ask for same rack hosting.

krenoten · on Dec 11, 2016

in the past they've spread servers into different rooms that had interconnects that challenged a team I was on to spend extra time writing compression code. is this better now?

tootie · on Dec 11, 2016

None of that would convince me to lift a finger. Not unless you put dollar amounts on one vs the other. Version control isn't something I even care about performance.

a_t48 · on Dec 11, 2016

I do. We had horrible download\merge\resolve times at my place because of some poor choices (storing many copies of large binary files across all branches in perforce) and it made a huge improvement in development time when those issues were resolved. Granted, this was an extreme case, but VC performance is not irrelevant to the cost of software development.

toomuchtodo · on Dec 11, 2016

Version control is the primary tool your developers are using at any org doing software development. It's performance is key to ensure devs aren't wasting their time waiting on version control.

Performance along with reliability are the most important metrics for someone providing VCS as a service.

tootie · on Dec 13, 2016

IDE and local dev environment are primary. I'm currently saddled with a proprietary SVN-based enterprisey monstrosity that costs me 10 minutes a day at the absolute worst. It sucks, but we can easily live with it.

rhizome · on Dec 11, 2016

To be sure, I've seen benchmarks (can look up later) that found a 10x+ performance improvement over and above the cost savings, leading to I guess a 100x price/performance improvement.

theptip · on Dec 11, 2016

If you're committed to having a robust architecture (this may not be financially viable immediately) you should study the mistakes that Github have made, e.g. https://news.ycombinator.com/item?id=11029898

Geo-redundancy seems like a luxury, until your entire site comes down due to a datacenter-level outage. (E.g. the power goes down, or someone cuts the internet lines when doing construction work on the street outside).

(This is one of the things that is much easier to achieve with a cloud-native architecture).

StreamBright · on Dec 11, 2016

Exactly this. I don't think that if you take into account __all__ parameters bare metal is cheaper. My other problem is that they are moving from cloud to bare metal because of performance while using a bunch of software that are notoriously slow and wasteful. I would optimise the hell out of my stack before commit to a change like this. Building your own racks does not deliver business value and it is extremely error prone process (been there, done that). There are a lot of vendors where the RMA process is a nightmare. We will see how it turns out for Gitlab.

mohctp · on Dec 12, 2016

The RMA process is pretty much a moot point at this scale. It simply doesn't make sense to buy servers with large warranties. Save the money, stock spares, and when a component dies replace it. In the long run it will come out to cost a lot less. I'd much rather pay much less knowing a component may fail here and there and you buy a new one.

As far as moving from cloud to bare metal, another thing to take into consideration (that this was replied to), if you don't architect your AWS (or other cloud solution) to take advantage of multiple geographic regions, the cloud won't benefit you.

I 100% agree that there should be more than one region deployed for this service. As others have said, all it takes is 1 event and the site will be down for days to weeks to months (It may not happen often, but when it does, you go out of business). The size and complexity of this infrastructure will make it nearly impossible to reproduce on short order in a new facility. If I were the lead on this I would have either active / active sites, or an active / replica site.

I would also have both local (fast restores), and off-site backups of all data. A replica site protects against site failure not data loss and point-in-time recovery.

StreamBright · on Dec 12, 2016

"As far as moving from cloud to bare metal, another thing to take into consideration (that this was replied to), if you don't architect your AWS (or other cloud solution) to take advantage of multiple geographic regions, the cloud won't benefit you."

Yep, this is why scaling starts with scalable distributed design. We were moving a fairly large logging stack from NFS to S3 once, for the same reason Gitlab is trying to move to bare metal now. Moving off cloud was not an option, moving to a TCO efficient service was. NFS did not scale and there was the latency problem. I think moving to bare metal cannot help with scale as much as a good architecture can. We will see how deep the datacenter hole goes. :)

mohctp · on Dec 12, 2016

Agreed. Application Architecture is far more important than Cloud vs. Bare Metal. It is just easier and more cost effective to through more bare metal hardware at the problem than it is cloud instances. For some this does make bare metal the better option.

To add to my previous comment though, AWS (and cloud in general) tends to make much more sense if you are utilizing their features and services (Such as Amazon RDS, SQS, etc.), and if you aren't using these services I can absolutely guarantee I can deliver a much lower TCO on bare metal than AWS. (Which is why I offered to consult for them) I see this all the time. Company moves from bare metal to AWS as bare metal is getting expensive, then they quickly find out AWS can't deliver the performance they need without massive scale (because they aren't using a proper salable distributed design and can't afford to re-architect their platform)

HolyLampshade · on Dec 11, 2016

This is definitely what perked my ears up when they mentioned the US East Coast, especially when you consider the risk that a natural disaster might take out the facility.

scott_karana · on Dec 12, 2016

There's no reason you can't have a hot cloud site, and operate in lipo mode (or even scale it up temporarily) if you lose your DC. Best of both worlds.

moreati · on Dec 12, 2016

What is lipo mode? Closest I can guess is a typo for 'limp mode'.

scott_karana · on Dec 13, 2016

Haha, yep... was also autocorrected to limo. :(

dsr_ · on Dec 11, 2016

Z1: the word "monitoring" does not appear in this document.

You will need to monitor: - ping - latency - temperatures - cpu utilization - ram utilization - disk utilization - disk health - context switches - IP addresses assigned and reaching expected MACs - appropriate ports open and listening - appropriate responses - time to execute queries - processes running - process health - at least something for every bit of infrastructure

once you collect that information, you need to record it, graph it, display it, recognize non-normal conditions, alert on those, page for appropriate alerts, and figure out who answers the pagers and when.

sytse · on Dec 11, 2016

Good point, we know monitoring is very important.

Our Infrastructure lead Pablo will do a webinar of our Prometheus monitoring soon https://page.gitlab.com/20161207_PrometheusWebcast_LandingPa...

We're bundling Prometheus with GitLab https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481

Brian Brazil is helping us https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1481#not...

On January 9 our Prometheus lead will join us (who was very valuable already helping with this research behind this blog post) and we're hiring Prometheus engineers https://about.gitlab.com/jobs/prometheus-engineer/

In the short term we might send our monitoring and logs to our existing cloud based servers. In the long term we'll host them on our own Kubernetes cluster.

For our monitoring vision see https://gitlab.com/gitlab-org/gitlab-ce/issues/25387

favadi · on Dec 12, 2016

Gitlab package includes too many things! I remember Gitlab sponsors a guy to do proper packaging to include Gitlab in debian last year, not sure what is the status.

jrv · on Dec 11, 2016

Since GitLab is heavily betting on Prometheus now, my guess is that it will be used to cover almost all of those points and more. Yep, lots of work though.

Disclaimer: Prometheus co-founder and just started as a Prometheus contractor for GitLab.

EDIT: ah, didn't reload the page to see that sytse had already responded with this :)

hathym · on Dec 12, 2016

neither the word "IT salary(ies)"

nik736 · on Dec 11, 2016

Are you sure about the location? I would go with Frankfurt, Germany. Biggest IX in the world and if you want a "low-latency" solution for all users this is basically the middle of everything. NYC will have a worse connection to Asia and I don't want to begin with India or something. While Frankfurt is basically only 70ms away from NY and around 120ms to the west coast, while even south america should be < 200ms. Russia is 40-50ms and Asia should be between 100ms and 200ms. Australia will probably have the worst ping with around 200-300ms.

Just as a suggestion since you seem to be so certain about the location. I basically know every DC in Frankfurt, so if you need any help or info in that regard feel free to contact me :-)

skuhn · on Dec 11, 2016

If you can only deploy one place in the world, east coast US is generally the right fit (if you have typically distributed user traffic, of course). You can cover western Europe and the major population centers of eastern North America, and still have reasonable performance in western North America as well. While it's increasingly less true over time, keep in mind that this is the place that other countries all prioritized building connectivity to, because historically this is where early commercial websites lived.

Of course there's the issue of potential legalities for your business, but don't kid yourself that you're safe from prying eyes by deploying in a particular country.

The next step (and particularly important for a business like GitLab) is to land a second site in either western Europe or west coast US. Honestly you should be thinking about this right away, and look to sign leases on both spaces simultaneously with a 3-mo delay built in for the second site. This should help you negotiate price as well, if you're able to go with the same dc provider, but be aware that that itself is a single point of failure as well. Make absolutely sure you negotiate your MSA to the n-th degree, get a good SLA, etc. You can still get burned, but do your legal due diligence now because you won't have a chance to change terms later.

Then continue to optimize by having multiple sites per-region (so that failover doesn't involve a big performance hit), adding APAC / AUNZ regions, and so forth. For a service like GitLab, I wouldn't think that time to sync the repo is hyper important, but responsiveness of the web interface is fairly key. So that may lead to a hub/spoke design where there are a few larger sites storing the bulk of the data and more small sites to handle metadata and such to present the web views.

That's all years down the road though. For a first pass, I can't see a problem with northern Virginia as the first site.

sytse · on Dec 11, 2016

Frankfurt surely has a great IX. Even if Frankfurt would make everyone better off (which I'm not sure about) there is another problem. People in the US are used to lower latencies because most SaaS services are hosted there.

samstave · on Dec 11, 2016

I have less of a concern on latency than I do with who has taps in the lines...

I would assume that the .de lines are saturated with 5 eyes...

nikanj · on Dec 11, 2016

I remain skeptical about the existence of non-tapped lines.

samstave · on Dec 11, 2016

+1

EDIT: Funny, every single documented, proven link for the last several years has proven me and GP correct and all you nay-sayers wrong.

predakanga · on Dec 12, 2016

I believe the downvotes are due to "+1" not adding anything to the conversation, as opposed to people disagreeing with the sentiment.

samstave · on Dec 12, 2016

Isnt it ironic that a (+1) on HN will get downvotes yet thats basically what every single HNer seeks to see in their requests on git. :-)

cpach · on Dec 11, 2016

So what location would be out of reach of FVEY? And why would you put secret stuff on Gitlab.com?

Spooky23 · on Dec 11, 2016

Why would you care?

samstave · on Dec 11, 2016

What's your email address, SSN and password. If you have nothing to hide, then you shouldn't care why I want to read all your messages.

Spooky23 · on Dec 11, 2016

Where are the customers coming from that are immune to national intelligience apparatus?

I wouldn't make a business decision around a general feeling of paranoia. German spies aren't an improvement over American or British ones.

closeparen · on Dec 12, 2016

Why would you care, as in, why would you have cleartext data on the wire?

This is Gitlab. There should never be any data in or out except HTTPS and SSH.

samstave · on Dec 13, 2016

thank you. i didnt read it with that context

nkw · on Dec 11, 2016

> I basically know every DC in Frankfurt, so if you need any help or info in that regard feel free to contact me :-)

I (and others I bet) would be interested in a quick summary of the options (and your opinions of them) for facilities in Frankfurt.

nik736 · on Dec 12, 2016

There are basically a shitload of datacenters in Frankfurt so I won't talk about every DC, but more of a general overview. It doesn't matter what your budget is or what your requirements are, you will find a DC in Frankfurt.

You basically have the big brands like Equinix (7 facilities in Frankfurt), Interxion (They are currently building their 11th facility in Frankfurt), Telecity (now part of Equinix) but also some local ones like e-Shelter, First Colo or Accelerated. There is also a DC only meters away from the Interxion campus that is called Interwerk and is basically the cheapest DC in Frankfurt, but it has an awesome price/performance ratio and you have access to all peering/transit options of Interxion (via CWDM) and can get a 1/1 rack for under 200 EUR/mo. I have been in that DC for several years and never had a single issue, so if you don't rely on any certificates this is a cheap option.

I was also colocating in the First Colo DC for some time, they are also in the locality around the Interxion campus. It's a pretty small DC but it is more premium than the Interwerk and also is kinda cheap. I personally wouldn't go with Accelerated, since they had multiple issues in the past.

For the big brands I would definitely go with Interxion, they are great, have all the certificates and are premium while not going crazy with their pricing like Equinix does. DigitalOcean, Twitch, etc. they are all in one of the many Interxion facilities in Frankfurt. If Price isn't a concern I would probably go with Equinix FRA5.

DE-CIX is present in around 7 facilities, 3 of them are Interxion and I think 2 Equinix.

BJanecke · on Dec 11, 2016

So what about the data privacy in the USA, and patriot act and all that. Also think fin tech where geolocation means more things than just latency.

_jcwu · on Dec 11, 2016

Why does latency matter for fin tech for a repository / source control?

BJanecke · on Dec 11, 2016

"more things than just latency"

Some financial institutions require all sensitive* data to be stored/hosted in the same country/state(archaic yes).

It's real hard to actually define sensitive data but the IP* in some code a quant wrote can totally be considered a trade secret by a non technical person Please don't get me started on how stupid I think it is that people consider code to be IP

hueving · on Dec 12, 2016

>archaic yes

Definitely not archaic. If the FISA courts taught you anything, it should be that the country your things are living in determines which entities can tap your traffic/hardware.

HolyLampshade · on Dec 11, 2016

To that end I'd be cautious of facilities in NJ. If you are dead-set on the US, Ashburn, or even better further inland such as Chicago or Dallas, may be better suitable if you do not plan on having redundancy in another facility.

Frankfurt may be a better option solely on the basis that it is more "environmentally" stable (in terms of events that may disrupt the operation of the facility).

thejosh · on Dec 11, 2016

Australia will be 300-350ms.

olegkikin · on Dec 11, 2016

It really depends on where most of their high-paying customers are. Which is, very likely, in the US/Canada and some of EU. So US East Coast makes sense.

Spooky23 · on Dec 11, 2016

If the customer base has lots of US enterprise customers, that choice will cost you money. Ex-US data residency is an issue for many compliance standards.

darklajid · on Dec 11, 2016

Isn't that the same vice versa? Sure, if the majority of the customers sit in the US _and_ care about that: Fine. Otherwise I felt that the online community rather likes to avoid the US for data if that is an option.

(I'm from Germany, but I couldn't care less about the country per se or GitLab being co-located in Frankfurt)

rphlx · on Dec 12, 2016

It's a problem IMO that they are trying to pick 1 site. Geodiversity is necessary for any serious enterprise, and given that, I would suggest 1 site on the US west coast to please users in SF/LA/SEA (and most of Asia) plus 1 site somewhere in Europe (AMS, FRA, etc). Once you can afford 3 sites, add Asia or US east, depending on where the users are.

ninjakeyboard · on Dec 11, 2016

I'm sure that you guys have done a lot of analysis but flipping to steel is the very last thing I would consider before reviewing the tech in the world today in relationship to TCOS as things move very fast.

You could frame infrastructure cost savings in many different ways though. The most obvious solution may seem to be the spend to move from the cloud to in house bare metal but I feel like you'll have a lot of costs that you haven't accounted for in maintenance, operational staff spend, cost in lost productivity as you make a bunch of mistakes.

Your core competency is code, not infrastructure, so striking out to build all of these new capabilities in your team and organization will come at a cost that you can not predict. Looking at total cost of ownership of cloud vs steel isn't as simple as comparing the hosting costs, hardware and facilities.

You could reduce your operating costs by looking at your architecture and technology first. Where are the bottlenecks in your application? Can you rewrite them or change them to reduce your TCOS? I think you should start with the small easy wins first. If you can tolerate servers disappearing you can cut your operating costs by 1/8th by using preemptible servers in google cloud platform for instance. If you optimize your most expensive services I'm sure you can cut hunks of your infrastructure in half. You'll have made some mistakes a long the way that contribute to your TCOS - why not look at those before moving to steel and see what cost savings vs investment you can get there?

Ruby is pretty slow but that's not my area of expertise - I wouldn't build server software on dynamic languages for a few reasons and performance is one of them but that's neither here nor there as you can address some of these issues in your deployment with modern tech I'm sure. Aren't there ways to compile the ruby and run it on the JVM for optimizations sake?

Otherwise do like facebook and just rewrite the slowest portions in scala or go or something static.

Try these angles first - you're a software company so you should be able to solve these problems by getting your smartest people to do a spike and figure out spend vs savings for optimizations.

deadbunny · on Dec 11, 2016

Yeah, link aggregation doesn't work how they think it does. And not having a separate network for Ceph is going to bite them in the arse.

GitLab is fine software but fuck me, they need to hire someone with actual ops experience (based on this post and their previous "we tried running a clustered file system in the cloud and for some reason it ran like shit" post).

btgeekboy · on Dec 11, 2016

They'd save themselves a whole lot of time, effort, and money if they looked at partitioning their data storage instead of plowing ahead with Ceph. They have customers with individual repositories. There is no need to have one massive filesystem / blast radius.

comex · on Dec 12, 2016

I don't know much about this stuff, but won't that stop working if they ever decide to expand to multiple geographical sites, to reduce latency to customers in different locations? In that case, different sites can receive requests for the same repositories, and ideally each site would be able to provide read only access without synchronization, with some smarts for maintaining caches, deciding which site should 'own' each file, etc. They could roll their own logic for that, but doesn't that pretty much exactly describe the job of a distributed filesystem? So they'd end up wanting Ceph anyway, so they may as well get experience with it now.

olalonde · on Dec 11, 2016

Seems this is one of the goals of the article: "We're hiring producton engineers and if you're spotting mistakes in this post we would love to talk to you (and if you didn't spot many mistakes but think you can help us we also want to talk to you)".

deadbunny · on Dec 11, 2016

My bad, I didn't make it that far down the post before hitting rant mode.

chrido · on Dec 12, 2016

Ceph should get a separate network which is only used for re-replication in case something happens. Consider a node goes down.

Another thing I might recommend is a third network (just a simple 1GB and a quality switch) for consensus. Re-replication can max the network out and further cause consensus fails, causes more re-replication winding down everything ... If that's not possible, add firewall rules to prioritize all consensus related ports high.

Dylan16807 · on Dec 12, 2016

They said very little about how they think link aggregation works. Just that they can send packets on both and continue working with only one port. That's basically the definition of link aggregation. So what's wrong about the post?

lykron · on Dec 11, 2016

If you are looking for performance, do not get the 8TB drives. In my experience, drives above 5TB do not have good response times. I don't have hard numbers, but I built a 10 disk RAID6 array with 5TB disks and 2TB disks and the 2TB disks were a lot more responsive.

From my own personal experience, I would go with a PCIe SSD cache/write buffer, and then a primary SSD tier and a HDD tier. Storage, as it seems you guys have experienced, is something you want to get right on the first try.

Edit:

N1: Yes, the database servers should be on physically separate enclosures.

D5: Don't add the complexity of PXE booting.

For Network, you need a network engineer.

This is the kind of thing you need to sitdown with a VAR for. Find a datacenter and see who they recommend.

sytse · on Dec 11, 2016

Thanks for commenting! You're right that 2TB disks will have more heads per TB and therefore more IO. We want to increase performance of the 8TB drives with Bcache. If we go to smaller drives we won't be able to fit enough capacity in our common server that has only 3 drives per node. In this case we'll have to go to dedicated file servers, reducing the commonality of our setup. We're using JBOD with Ceph instead of RAID.

If the 8TB drives don't offer the IO we need we'll have to leave part of them empty. I assume that if you fill them with 2TB they should perform equally well. Word on the street is that Google offers Nearline to fill all the capacity they can't use because of latency restrictions for their own applications.

It is an interesting idea to do SSD cache + SSD storage + HDD instead of just SSD cache + HDD. I'm not sure if Ceph allows for this.

AnthonBerg · on Dec 12, 2016

I recommend testing and evaluating Bcache carefully before planning to use it in a production system. I found it unwieldly, opaque, and difficult to manage when testing on a simple load; Benefits were mild.

I'm sure it has its uses! But I'd emphatically advise anyone to obtakn hands-on experience with Bcache before planning around it! As I'm sure you are already doing or considering of course :)

lykron · on Dec 11, 2016

It doesn't work like that. 2TB drive will always be faster than an 8TB drive. The amount of data has no effect when compared to the physical attributes of the drive. More platters will increase the response time.

Ceph seems to offer Tiering, which would move frequently accessed data into a faster tier while the infrequent data to a slower tier.

toomuchtodo · on Dec 11, 2016

> Ceph seems to offer Tiering, which would move frequently accessed data into a faster tier while the infrequent data to a slower tier.

By "Tiering", is this moving data between different drive types? Or by moving the data to different parts of the platter to optimize for rotational physics?

lykron · on Dec 11, 2016

By Tiering, I'm talking about moving blocks of data from slower to faster or visa verse. If I have 10TB of 10k IOPS storage and 100TB of 1k IOPS storage in a tiered setup, data that is frequently accessed would reside in the 10k IOPS tier while less frequently accessed data would be in the 1k tier. In this case, the blocks of popular repositories would be stored in SSD, while the blocks of your side project that you haven't touched in 4 years would be on the HDD. You still have access to it, it might just take a bit longer to clone.

Ceph can probably explain it better than I can. http://docs.ceph.com/docs/jewel/rados/operations/cache-tieri...

sytse · on Dec 12, 2016

That is pretty awesome. Should the SSD's for the fast storage be on the OSD nodes that also have the HDD's or should it be separate OSD nodes?

illumin8 · on Dec 12, 2016

The fact that you're asking this question on Hacker News leaves little doubt in my mind that you and your team are not prepared for this (running bare metal).

I read the entire article, and while you talked about having a backup system (in a single server, no less!) that can restore your dataset in a reasonable amount of time, you have no capability for disaster recovery. What happens when a plumbing leak in the datacenter causes water to dump into your server racks? How long did it take you to acquire servers, build the racks and get them ready for you to host customer data? Can your business withstand that amount of downtime (days, weeks, months?) and still operate?

These questions are the ones you need to be asking.

In other words, double your budget because you'll need a DR site in another datacenter.

lykron · on Dec 12, 2016

They don't need another datacenter. AWS has had more issues they NYI's datacenters have had in the past year. There are many companies that are based out of a single datacenter that haven't had any major issues.

toomuchtodo · on Dec 12, 2016

Backblaze comes to mind.

lykron · on Dec 12, 2016

That would be something I would test. I don't know Ceph, so I would be taking a shot in the dark. I would guess it would not make much of a difference as everything is block level. I, personally, would do 1x PCIe SSD for cache, 1x 2/4TB SSD, and 2x 4TB HDD for each storage node.

Edit. If Ceph is smart enough, it would be aware of the tiers present on the node, and would tier blocks on that node. So a Block on node A will stay on node A.

matt4077 · on Dec 11, 2016

Though 2TB of data on an 8TB drive does mean only 1/4 as many requests hit it, right?

lykron · on Dec 11, 2016

I'm not sure I understand your question. A 2TB/2TB disk will have the same number of requests as a 2TB/8TB disk, as they both have the same amount of data.

If you are talking physically, 2TB/8TB would theoretically be faster than a 2TB/2TB disk if the performance attributes were the same. But a 2TB HDD will have a faster average seek time than an 8TB HDD due to physical design. Any performance gains of only partially filling a drive would probably be offset by the slower overall drive.

Dylan16807 · on Dec 12, 2016

> a 2TB HDD will have a faster average seek time than an 8TB HDD due to physical design. Any performance gains of only partially filling a drive would probably be offset by the slower overall drive.

I'm skeptical here. So your minimum seek time goes up on the 8TB because it's harder to align. But your maximum seek time should drop tremendously because the drive arm only has to move along the outer fifth of the drive. And your throughput should be great because of the doubled linear density combined with using only the outer edge.

lykron · on Dec 12, 2016

I'm not saying that you wouldn't see any performance gain. 2TB on the outer track will be faster than 8TBs on the same 8TB disk, but I'm saying any gains will be lost due the dense nature of the drives.

A quick google search shows that there are marginal gains on the outer track vs the inner, but that is only on sequential workloads. For something like GitLab, the workloads would be anything but.

http://superuser.com/questions/643013/are-partitions-to-the-...

https://www.pythian.com/blog/hard-drive-inner-or-outer/

Dylan16807 · on Dec 12, 2016

Ignore the part about where the partition is, then.

1. If I look at 2TB vs. 8TB HGST drives, their seek times are 8ms and 9ms respectively. But if you're only using a quarter of the 8TB drive, the drive arm needs to move less than a quarter as much. Won't that save at least 1ms?

2. The 8TB drive has a lot more data per inch, and it's spinning at the same speed. Once a read or write begins, it's going to finish a lot faster.

3. Here's a benchmark putting 8TB ahead in every single way http://hdd.userbenchmark.com/Compare/HGST-Ultrastar-He8-Heli...

alfalfasprout · on Dec 11, 2016

So, the 8TB drives are fine if you're doing sequential writes, mainly doing write-only workloads, or have a massive buffer in front.

In my experience, the PCIE drives they mention (the DC P3700) are incredible. They're blazing fast (though the 3D X-Point stuff is obviously faster), have big supercaps (in case of power failure), and really low latency for an SSD (150 microseconds or so). They're a pretty suitable alternative to in-memory logbuffers where latency is less crucial. Much cheaper than the equivalent memory too. Having a few of these in front of a RAID 10 made up of 8TB drives will work just fine.

FWIW your experience with a RAID 6 of big disks is unsurprising. Raid 6 is almost always a bad choice nowadays since disks are so big. Terrible recovery times and middling performance.

lykron · on Dec 11, 2016

True about RAID6, I was just commenting on performance. A 2TB will spank an 8TB in any configuration.

I would think that GitLab's workload is mostly random, which would pose a problem for larger drives. The SSDs are a great idea, but I've only seen 8TB drives used when there are 2 to 3 tiers; with 8TB drives being all the way on the bottom. I'm not sure how effective having a single SSD as a cache drive for 24TBs of 8TB disks will be.

BJanecke · on Dec 11, 2016

Don't have hard numbers you say?

https://www.backblaze.com/blog/hard-drive-reliability-stats-...

[EDIT]

Oops that's reliability, I can jump the gun sometimes, sorry.

sytse · on Dec 11, 2016

I'll be here all day to learn from suggestions. I'm hoping for much feedback so please reference questions with the letter and number: 'Regarding R1'.

bobf · on Dec 11, 2016

Moving to your own hardware will almost certainly improve performance, reduce incidental downtime, and cut costs substantially. Including hiring more engineers, you might expect total costs to be ~40-50% of what you would have spent on cloud-based services over the first 24 months. If your hardware lifecycle is 36-48 months, you will see large savings beyond 24 months.

A few things to watch out for, if your team doesn't have a lot of experience in direct management of dedicated servers/datacenters:

- There is an increased risk of larger/longer very rare outages due to crazy circumstances. Make sure you have a plan to deal with that. (I've had servers hosted at a datacenter that flooded in Hurricane Sandy, and another where an electrical explosion caused a >3 day outage..)

- It's easy to think you'll rely on managed services, but that rarely works out well. It also can become very, very expensive -- possibly more so than cloud-based hosting.

- Specifically, regarding H1, H2: Dedicated hardware is substantially cheaper than cloud-based hosting, but if you rely too much on managed services you negate a large portion of the savings. Consider that most service providers will be both more expensive and less competent than doing it yourselves. Also, having your own team have direct knowledge + their own documentation of the setup will be beneficial.

- I'd recommend budgeting for and ordering some extra parts to keep on hand for replacement, if you are having datacenter ops handle hardware or can have an engineer located relatively close to your datacenters. (A few power supplies, some memory, a couple drives - nothing too crazy.)

- Supermicro's twins systems are great. In the past I've gone with their 1U models vs. 2U to slightly reduce the impact of unit downtime. (Having to take one 2U Twin down affects four nodes. It sounds like you'll have to decide on balancing that against the increased drive capacity, in your case.)

viraptor · on Dec 11, 2016

> reduce incidental downtime

In a long run, probably. Immediately after deploying, unless they hire very experienced, I expect quite a few "never seen before" issues (may not result in publicly visible downtime though). Monitoring for "thermal events", weird and hard to debug issues requiring firmware updates, "bad cable" issues, etc. are not what you have to deal with in the cloud.

neom · on Dec 11, 2016

I'm curious to know if you think this work is within the core competency of GitLab. if so, how did you decide it was. If not, how do you realize the investment over time in something that isn't? Is the GitLab CEO here?

sytse · on Dec 11, 2016

GitLab CEO here. Hardware and hosting are certainly not our core competencies. Hence all the questions in the blog post. And I'm sure we made some wrong assumptions on top of that. But it needs to become a core competency, so we're hiring https://about.gitlab.com/jobs/production-engineer/

thebyrd · on Dec 11, 2016

I think you're under estimating the number of people required to run your own infrastructure. You need people who can configure networking gear, people swapping out failed NICs/Drives at the datacenter, someone managing vendor relationships, and people doing capacity planning.

I think you could probably get the IO performance you're talking about in your blog post from AWS instances or Google Cloud's local NVMe drives, but if you truly need baremetal, I'd recommend Packet or Softlayer. Don't try to run your own infrastructure or in a year you'll be: https://imgflip.com/i/1fs7it

joeemison · on Dec 12, 2016

I would challenge your assertion that you need a core competency in bare metal. AWS and GCE are performant enough--you're just not using them correctly. Invest in IaaS expertise and be successful; invest in bare metal at this day and age and live to regret it forever.

shimon_e · on Dec 12, 2016

You are ruling out providers like OVH because lack of sufficient SLA but then you are replacing it with your own solution that will have no SLA.

Lower your SLA requirements and go with multiple providers like OVH. Make your site work at multi datacenters. At the end of the day your users will be much happier.

My 2 cents.

darklajid · on Dec 11, 2016

I love that this question baited your well-known introduction out again. Wasn't there a post at some point that searched for 'GitLab CEO here' on this very site, just for kicks and giggles?

toomuchtodo · on Dec 11, 2016

> But it needs to become a core competency, so we're hiring

Impressive.

halbritt · on Dec 11, 2016

Just a few quick notes. I've experience running ~300TB of usable Ceph storage.

Stay away from the 8TB drives. Performance and recovery will both suck. 4TB drives still give the best cost per GB.

Why are you using fat twins? Honestly, what does that buy you? You need more spindles, and fewer cores and memory. With your current configuration, what are you getting per rack unit?

Consider a 2028u based system. 30 of those with 4TB drives gets you the 1.4PB raw storage you're looking for. 2683v4 processors will give you back your core count, yielding 960 cores (1920 vCPUs) across that entire set. You can add a half terabyte of memory or more per system in that case.

Sebastien Han has written about "hyperconverged" ceph with containers. Ping him for help.

The P3700 is the right choice for Ceph journals. If you wanted to go cheap, maybe run a couple M2.NVME drives on adapters in PCI slots.

I didn't really need the best price per GB in my setup, so I went with 6TB HGST Deskstar NAS drives. I'm suggesting you use 4TB as you need the IOPs and aren't willing to deploy SSD. Those particular drives have 5 platters and a higher relatively high areal density giving them them some of the best throughput numbers in among spinning disks.

If you can figure out a way to make some 2.5" storage holes in your infrastructure, the Samsung SM863 gives amazing write performance and is way, way cheaper than the P3700. I recently picked up about $500k worth, I liked them so much. They run around $.45/GB. Increase over-provisioning to 28% and they outperform every other SATA SSD on the market (Intel S3710 included).

You'll probably want to use 40GE networking. I've not heard good things about Supermicro's switches. If I were doing this, I'd buy switches from Dell and run Cumulus linux on them.

Treat your metal deployment like IaaC just like any cloud deployment. Everything in git, including the network configs. Ansible seems to be the tool of choice for NetDevOps.

sytse · on Dec 11, 2016

Thanks for the great suggestions.

We're considering the fat twins so we get both a lot of CPU and some disk. GitLab.com is pretty CPU heavy because of ruby and the CI runners that we might transfer in the future. So we wanted the maximum CPU per U.

The 2028u has 2.5" drives. For that I only see 2TB drives on http://www.supermicro.com/support/resources/HDD.cfm for the SS2028U-TR4+. How do you suggest getting to 4TB?

msandford · on Dec 12, 2016

Also whatever you do, don't buy all one kind of disk. That'll be the thing that dies first and most frequently. Buy from different manufacturers and through different vendors to try and get disks from at least a few different batches. That way you don't get hit by some batch of parts being out of spec by 5% instead of 2% and them all failing within a year, all at the same time.

If you do somehow manage to pick the perfect disk sure having everything from a single batch would be the best since that'll ensure you have the longest MTBF. But how sure are you that you'll be picking the perfect batch simply by blind luck?

halbritt · on Dec 12, 2016

I had this problem with the Supermicro SATA DOMs. Had problems with the whole batch.

That said, I bought the same 6TB HGST disk for two years.

msandford · on Dec 12, 2016

As long as you're not buying all the disks at once sticking with one manufacturer and brand should be fine. If you're buying 25% of your total inventory every year it'll all be spread out to just a few percent per month.

But when you're buying 100% of your disk inventory at once there's a serious "all eggs in one basket" risk.

halbritt · on Dec 12, 2016

Sorry, I was confused by the part numbers. I was thinking of the 6028u based system that have 12x3.5" drives. These are what I used for my OSD nodes in my Ceph deployment.

As for CPU density, I still feel like you're going to need more spindles to get the IO you're looking for.

nwilkens · on Dec 11, 2016

The estimate of 19KW gives a rough estimated requirement of ~90A @ 208v.

4 x 208v 30A circuits gives a total of 120A -- of which they can only be used at 80% capacity, so that gives you 96A usable -- without redundancy.

My initial feelings (eyeballing it) are you should be looking for 3 full racks, 2x208V@30A in each.

As a juniper shop, we implemented the 2xQFX 5100 48T (in virtual chassis) + ex4300 for remote access per rack. This will be a decision based on your local expertise though.

I also looked hard into the twin boxes -- but power to the rack in the end ruled the day, and it made not much of a difference to use the 1U boxes.

Don't forget about out of band (serial for the switching gear). We've been using OpenGear.com for this stuff with 4G-LTE builtin.

VPN access to access console devices?

Any site-to-site VPN needs?

Also, I would consider not pre-purchasing N times your required horsepower/disk if you can avoid it, but rather add in pre-planned yearly, or biannually stages.

As the CEO of a cloud hosting & server management company -- I have much more to say about this if you would like to chat via phone or email anytime.

mohctp · on Dec 12, 2016

There are two major pitfalls to crowd-sourced consulting such as this.

1) Contributors have not been vetted - Some responses are based on real world experience, and some is conjecture from arm-chair quarterbacks. (A simple example would be that nobody has mentioned with any of the SuperMicro 2U Twins that you have to be cautious about the PDU models and outlet locations of 0U PDU's to not block node service/replacement in the rack)

2) There are multiple ways to skin a cat - There are many viable solutions in this thread, but you can't simply take a little bit of this, a little bit of that, and piece together a new platform that "just works." -- Better to go with a know working solution than a little of this and a little of that. Multiple drivers tend to be less effective.

I am the owner of a bespoke Infrastructure as a Service provider that delivers solutions to sites of similar metrics, and speak with plenty of real-world experience.

Larger Providers - We find a lot of clients move away from AWS, Softlayer, Rackspace, et al. as the larger providers aren't nearly as interested in working with the less-than-standard configurations. They want highly-repeatable business, not one-off solutions.

I'd love to talk in more depth with you about how we can deliver exactly what you need based on years of experience in delivering highly customized solutions. We'll save you money and headache.

rphlx · on Dec 12, 2016

HN crowdsourcing is a pretty reasonable strategy for entities that cannot reliably identify & hire 1+ rockstar employee(s) and/or VARs to cover the compute, networking, storage, electrical, environmental, etc. If you can rationally evaluate the HN comments you should get pretty close to the best, cutting-edge advice. Whereas when you are small and you listen to 1 or 2 VARs and/or 1-2 internal employees you can expect, on average, to get average advice. Or advice that was excellent 2-3 years ago but is now out-of-date due to HW/SW progress that the employee/VAR is unaware of.

mohctp · on Dec 12, 2016

Cutting-edge and availability aren't really always best friends. For example cutting edge routing and switching (the latest products, fabric, SDN's, MC-LAG, etc) are notorious for failures and outages.

Average is not the appropriate word, however I'll use your word. I'd rather have, and so would every enterprise out there, rather have average advice based on tried and true solutions that costs 10% more with 99.99 to 99.999% availability than cutting edge, saved 10-20% with 99.8% availability. The downtime alone can (and does) kill reputations of sites like GitLab.com (and others).

hxegon · on Dec 11, 2016

I wish more companies were open like this, pretty cool to see the decision making and reasoning behind operational stuff like this.

siliconc0w · on Dec 11, 2016

It'd be interesting to see more application-level solutions to scale rather than just adding hardware. Like extending git so objects can be fetched via a remote object-store rather than dealing with a locally mounted POSIX file system. This would allow you to use native cloud object stores and might simplify your latency requirements to the point where you could consider staying in the cloud.

This definitely increases software complexity but going the other way increases other complexities (ops, capex, capacity).

sytse · on Dec 11, 2016

I thought the same and looked into this years ago but the short story is that for many git operations it needs block access to the repository. With an object store it becomes very slow.

dsr_ · on Dec 11, 2016

Z2 (no, Z is not on the list you have)

You need at least two nodes that do DNS, DHCP, NTP, and other miscellaneous services that you absolutely want to have but do not seem to have mentioned. You want them to be permanently assigned, so that you never have to search for them, and you want them to fail over each service as needed, preferably both operating at once. Three nodes would be better. Consider doing some basic monitoring on these nodes, too, as a backup to your main monitoring system.

skuhn · on Dec 11, 2016

Absolutely agree. Typically I deploy about 4-6 "tools" hosts in a site. From a computational standpoint, you could make do with fewer hosts, but there are some things that I prefer to separate out:

1&2. Authoritative DNS (internal)

1&2. NTP

1&2. Caching DNS resolver

3&4. Outbound HTTP proxy (if necessary)

3&4. PXE / installer / dhcp

3&4. Local software mirror (apt / yum / etc.)

5&6. SSH jump hosts.

Or something like that.

Make sure the second host is not just in a separate chassis from the first host, but also in a separate rack.

For external authoritative DNS, don't do it yourself -- pay for someone to run that for you (Route53, ns1, etc.). For e-mail, if you can possibly not deal with it then don't -- use Mailchimp or something.

stp-ip · on Dec 12, 2016

Sidenote: We are running coredns.io in production as authoritative internal DNS and as hidden master with NOTIFY to a secondary DNS provider (currently DNSmadeEasy).

The DNS records for the internal records are done using the kubernetes middleware (basically serving the service records). The external records are pulled in from a git repository hosting our zones as bind files. If need be zones are split into subzones per team/project. Same permission system as our code via MRs using Gitlab.

Our recommendation is build on open standards (BIND, AXFR) and use services on top of these.

I agree that using an external mail provider is usually a good idea. It mostly is your fallback communication channel and is usually easy to switch (doing replication to an offsite mail storage needs to be done to make switching easy/possible/fast). MX records \o/

sytse · on Dec 11, 2016

Good points, thanks!

lern_too_spel · on Dec 11, 2016

Since hosting git repositories is core to your business, you should take the time to do it right (https://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/fi...) instead of using vanilla git and relying on a magic filesystems and vertical scaling to solve your issues.

sytse · on Dec 11, 2016

The presentation you linked is something we've considered. But it is build on top of Google's distributed filesystem. So we consider our move to Ceph a first step in that journey.