Agree that bare metal is effective and doable at your scale, and if done right can give you a better SLA and much, much better control - especially over Internet-facing network performance - than public cloud (or combos thereof).
We are running a 50%+ gross margin, mid-stage, venture-backed startup out of Equinix facilities (we started there rather than in the cloud), have no people near our facilities, and have had zero service issues related to doing management remotely. Yes, people go out to set up cabs, etc., but we hired our ops folks as generalists with some network experience - our CEO and CTO have it as well - though AFAIK I don't have network logins active right now.
2 high-level thoughts I'd share:
1) Try not to use Ceph unless you're committed to having 2 people with deep experience at the code level.
2) I'd use Juniper QFX or EX, or Aristas. You don't seem to be running at a scale, or needing functionality, where SDN magic is required, and there is a large community of QFX, EX, and Arista users your folks can reach out to when problems happen.
The remaining comments are more tuning notes and FYI on what we do HW-wise:
Specifically re: HW, at Kentik we run tens of worker nodes + flow ingest servers, all SuperMicro 1Us with a few SSDs and 256-384GB RAM, 48 logical cores (2 x E5-2650v4).
We're approaching 1PB of storage, and while we still have some 4U 36-disk 3.5" boxes, those are being phased out; all we buy now is 2U SuperMicros with 24 x 2TB Samsung 850 EVO SSDs. Procs are 72 logical cores (2 x E5-2697v4).
The EVO SSDs have been great - but our workload is largely appends or create/writes (largely but not entirely sequential), with high read IOPS. Before Samsung I was a big fan of Intel, but we have no data on the modern Intels - slower for sure, but a focus on reliability is great...
We use JBOD and ZFS on the storage nodes, with LSI 9300-8i HBAs, and have tested things so we can do TRIM.
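For anyone curious what that looks like in practice, here's a minimal sketch of standing up a JBOD + ZFS pool with TRIM enabled. The disk IDs, pool name, and mirror layout are placeholders (not our actual config), and it assumes OpenZFS with the autotrim pool property:

    # Hypothetical sketch: one ZFS pool over JBOD SSDs behind an IT-mode HBA,
    # with TRIM enabled. Disk IDs, pool name, and vdev layout are placeholders.
    import subprocess

    # Whole disks by stable ID (placeholder device names)
    disks = [f"/dev/disk/by-id/ata-Samsung_SSD_850_EVO_2TB_{i:02d}" for i in range(24)]

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Pair the disks into mirrors (illustrative only; raidz is just as valid)
    vdevs = []
    for a, b in zip(disks[0::2], disks[1::2]):
        vdevs += ["mirror", a, b]

    # ashift=12 for 4K sectors, autotrim=on so ZFS issues TRIM as blocks free up
    run(["zpool", "create", "-o", "ashift=12", "-o", "autotrim=on", "tank", *vdevs])
    run(["zfs", "set", "compression=lz4", "tank"])
    run(["zfs", "set", "atime=off", "tank"])
    # A one-off manual pass is also possible: zpool trim tank

Mirrors vs. raidz and the dataset properties are workload-dependent; the point is just that whole-disk JBOD behind an IT-mode HBA lets ZFS see the SSDs directly, so TRIM actually reaches the drives.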
SuperMicro does make SuperServers in roughly those configs, but we go through SM resellers who assemble and burn-in for +10-15%. I had 50+ SuperServers that were great at my Usenet company, but we'd rather have our ops folks work on things other than burn-in.
Happy to explain why we went to SSD vs. spinning at 2x the cost, but basically it made enough of a difference at the 95th and 99th percentiles of our query times, and we had access to venture debt on great terms (which you should too, and I'm happy to discuss, since we're both funded by August).
Last note re: gear - when we were doing spinning disk, we found a screaming deal on new 2TB enterprise SATA drives (Hitachi, I think) for $50 each and took the power/space hit for the extra IOPS and extra compute we got from firing up the additional machines. Not sure if those deals are still out there, or whether the IOPS from that kind of approach would be needed.