NVMe over Fabrics Explained (westerndigital.com)
93 points by nickysielicki on June 1, 2021 | 31 comments



There is a wheel-of-life story here: everything winds up being used as an IP protocol carrier. Working backwards from USB-C, I can't think of a modern-day plug which hasn't wound up having a MAC address and a bootstrap-to-IP path added to it.

NVMe to ethernet is simply making the SSD the endpoint with a functional NFS/SMB and IP stack. Soon it will move out of the backplane into a stand-alone cage, and be a NAS...


> NVMe to ethernet is simply making the SSD the endpoint

Confused by this comment. NVMeoF is about tunneling the NVMe command set over a network, instead of tunneling SCSI over IP or Fibre Channel, when communicating with a NAS-like device; not about plugging an Ethernet cable into the SSD.


There are also native Ethernet SSDs that speak NVMeoF. Marvell is a big supporter of that idea since they make Ethernet switch chips but not PCIe switch chips. They make an NVMe/PCIe-to-Ethernet converter chip that can be used on small interposer boards between drives and the backplane, or integrated into the drive itself.


NVMe-oF is already supported in Linux, with both a target (the machine exporting storage) and an initiator (the client). I've configured and use this in my homelab with Mellanox network cards and RDMA.

Performance is awesome and I have not had any problems. It is a bit messy to configure on the target, but very easy to configure on the initiator. There is also a free NVMe-oF client driver for Windows, https://www.starwindsoftware.com/starwind-nvme-of-initiator but I have not tested it yet.

I use NVMe-oF both for block devices (ZFS volumes on traditional hard drives) and for Intel Optane. It also works with SATA/SAS SSDs. It is not as fast as native access, you get some additional latency, but it is still much, much better than the alternatives.

What is really cool is that Optane is so fast that I can easily create 4+ partitions and share the Optane device with 4+ different systems.
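For anyone wanting to try this: the Linux target side is driven through configfs under /sys/kernel/config/nvmet (which is what nvmetcli manipulates under the hood). Below is a minimal sketch in Python, not a production script; the NQN, backing partition and addresses are placeholders, and it assumes the nvmet and nvmet-rdma modules are already loaded.

    # Minimal sketch: create a subsystem, attach a namespace backed by a
    # local partition, and expose it on an RDMA port via the nvmet configfs.
    # The NQN, device path and addresses below are placeholders.
    import os

    CFG = "/sys/kernel/config/nvmet"
    NQN = "nqn.2021-06.io.example:optane0"

    def write(path, value):
        with open(path, "w") as f:
            f.write(str(value))

    # 1. Create the subsystem; allow any host to connect (homelab only).
    subsys = os.path.join(CFG, "subsystems", NQN)
    os.mkdir(subsys)
    write(os.path.join(subsys, "attr_allow_any_host"), 1)

    # 2. Add namespace 1 backed by a local block device or partition.
    ns = os.path.join(subsys, "namespaces", "1")
    os.mkdir(ns)
    write(os.path.join(ns, "device_path"), "/dev/nvme0n1p1")
    write(os.path.join(ns, "enable"), 1)

    # 3. Create an RDMA port and link the subsystem to it.
    port = os.path.join(CFG, "ports", "1")
    os.mkdir(port)
    write(os.path.join(port, "addr_trtype"), "rdma")
    write(os.path.join(port, "addr_adrfam"), "ipv4")
    write(os.path.join(port, "addr_traddr"), "192.168.1.10")
    write(os.path.join(port, "addr_trsvcid"), "4420")
    os.symlink(subsys, os.path.join(port, "subsystems", NQN))

On the initiator it is then a single nvme-cli command along the lines of: nvme connect -t rdma -a 192.168.1.10 -s 4420 -n nqn.2021-06.io.example:optane0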


It seems to me you might have a hard time adding 12 NVMe drives to a single CPU then?

SATA might be slower, but I much prefer a slower interface that can break but supports 12 drives over a single NVMe drive!

Also, I'd need to see some CPU overhead graphs for those 12 SATA drives vs. a single NVMe drive.

My suspicion is that the overhead will be lower if the SATA controller chip is buffering data to avoid the CPU waiting.


You can connect many NVMe drives to a single CPU, but of course the PCIe bandwidth becomes the bottleneck much earlier, just as with SATA. Think of the SATA HBA as an adapter that bridges between PCIe and SATA; for NVMe the counterpart is a PCIe switch, with PCIe on both sides.


Supermicro sells single-CPU servers with 24 NVMe bays. Not a big problem if your CPU has 128 PCIe lanes.

You could get 12 quite easily even on consumer hardware. AMD Threadripper gives you 64 PCIe 4.0 lanes (up to 72 with the TRX40 chipset), so you just use three PCIe x16 expansion cards, each providing four x4 NVMe slots.
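The lane math works out: 12 drives × 4 lanes each = 48 lanes, which three x16 slots bifurcated into x4/x4/x4/x4 provide exactly, leaving the remaining lanes for networking and the chipset link.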


> It seems to me you might have a hard time adding 12 NVMe drives to a single CPU then?

Attachment is a solved issue: all you do is buy a PCIe switch and hang more NVMe drives off your CPU. The issue I see is that there is no clear standard for NVMe form factors or connectors.

M.2 is a consumer standard aimed more at mobile than at desktops or servers, yet it appears to be the most common standard in all three. U.2 puts NVMe in the classic 2.5" spinning-rust form factor, but I don't see it used in desktops. Then there are numerous "ruler" form factors designed to plug into 1U servers.

Connectors are also a mixed bag: M.2, U.2, a 36-pin quad-channel SFF motherboard connector, SATA Express, and OCuLink.

And further, this whole article is more old-is-new-again blogspam, as all it talks about is yet another SAN technology.


I had a very hard time finding an efficient way to add more NVMe slots to my personal storage server. If you want more than one NVMe drive per PCIe slot (even if the slot is x16), it seems you need an expensive card that only works with certain motherboards. What are these PCIe switches you mention?


I don't know why you were downvoted because this is a valid question.

Most NVMe expansion cards ('carrier cards') do not integrate any extra IC (or at most a retimer) and instead rely on a motherboard feature called PCIe bifurcation, which essentially turns an x16 slot into four 'virtual' x4 slots. These can be fairly inexpensive ($50-$80), e.g. the Supermicro AOC-SLG3-2M2, which provides two NVMe x4 connectors, or the Linkreal LRNV95NF, which provides four.

If your motherboard does not support bifurcation, then you need a more expensive ($150-$200) card with a PCIe switch (e.g. the Supermicro AOC-SHG3-4M2P or the Linkreal LRNV9524, which use PLX PEX switches).


> I had a very hard time finding an efficient way to add more NVMe slots to my personal storage server.

I think this has more to do with the form factor/connector issue than with the interface. M.2 is an awkward form factor even though it is the most popular. An M.2 device is a low-cost bare PCB which relies on a cheap board-to-board connector and is secured via a screw located away from the connector. You can't locate these off-board without a carrier and cable, and you can't mount them vertically because of the way they are secured. Great for laptops though...

Another thing to consider is that NVMe uses a PCI Express x4 link, meaning you need eight high-speed differential pairs in a single cable, along with clocking signals and so on. Cables aren't as simple or cheap to make as SATA, which only has two differential pairs. Hopefully OCuLink will become the standard.

And don't forget that disk form factors and interfaces are tied to desktop standards like ATX, which are not the driving force behind computer design anymore. So does it make sense to keep promulgating these form factors, e.g. U.2? I personally say keep what works, but some will argue otherwise.

Until we get this whole physical side of NVMe sorted, we're going to be stuck in an awkward transition phase for a while.


They're talking about a PCIe card with a PCIe-to-PCIe switch chip, like the ones available from PLX Technology (now Broadcom). In the absence of chipset- and firmware-level PCIe bifurcation support, a PCIe switch is your only option.

https://en.wikipedia.org/wiki/PCI_Express#SWITCH

https://en.wikipedia.org/wiki/PLX_Technology

https://www.anandtech.com/show/13511/highpoint-releases-the-...

https://www.ebay.com/p/18021887628


Does anyone know of a good NVMe-of storage chassis where the pricing is listed on their web site?


https://www.siliconmechanics.com/systems/storage/storform

They mainly use Supermicro. They have been around for decades and have top-notch service and pricing.

Also note, check out the Servers section as well, since there are a few more systems there with high NVMe capacity, like this one: https://www.siliconmechanics.com/systems/servers/rackform#fe...


Oh hey, Silicon Mechanics in the wild! My first software dev job was there. They really do try to have the best prices, and they test their stuff better than Supermicro does.


If you're one of the people who created their web price configurator, then THANK YOU a million times over. It's so easy to use and understand, as well as being informative.


No, that was a different group of people; the backend system that runs it is a CRAZY complex PHP app ;) It's pretty impressive how many configurations/checks it can handle; Supermicro probably sells as many different SKUs as Dell in total? I worked on a new stress-test suite and bare-metal OS deployment automation. No idea if they still use it.


Does Silicon Mechanics license out their web configurator?

I ask because ThinkMate.com's web configurator is identical to SiliconMechanics.com's.

EDIT: I'm going to answer my own question. It appears Silicon Mechanics was acquired by Source Code Corporation, who also owns ThinkMate.


From reading the article, it sounds like NVMe-oF exposes a drive directly to a network, but I don’t really get how that works. Does the drive still get connected to a host? How does this compare to other technologies for having networked storage, like NFS?


It still needs to be connected to a host. This is an alternative to iSCSI: instead of tunneling the SCSI protocol, it tunnels the NVMe protocol. Unlike NFS, which exports a filesystem, both iSCSI and NVMe-oF export raw block devices.


And the use case is when you have a sufficiently high-performance storage controller that the SCSI protocol (which you are tunneling over IP or Fibre Channel) becomes the bottleneck.


SCSI still has some tricks (e.g. VAAI) which I don't think have NVMe counterparts yet?

Still, even just being able to get rid of the huge complexity and historical baggage of SCSI is a good thing. One of my favourite examples is how the spec defines the LUN number to not be a number.


I think NVMe is pretty close to feature parity with SCSI, or at least the subset of SCSI that is actually relevant. NVMe includes thin provisioning, some copy offload, atomic operations and of course deallocate/trim—which I think covers all the VAAI features.


> it sounds like NVMe-oF exposes a drive directly to a network

The drive is connected to a storage device. Storage is exposed by that device to the network. The exposed storage and the drives do not correspond one-to-one (redundancy, etc.).


I'm curious how NVMe-oF compares to iSER?

From a quick search it seems that NVMe-oF is still in its early days with limited implementations, whereas iSER is more established. Both rely on RDMA.

I could not find any performance comparisons. Any insight here?


I don't have any formal comparisons handy, but there are lots of NVMe-oF benchmarks here: https://spdk.io/doc/performance_reports.html

NVMe-oF can do both TCP (akin to iSCSI) and RDMA (akin to iSER). There are likely several reasons NVMe-oF is faster, but one big one is that in iSCSI all connections within a session share state, so you either process all of a session's connections on a single thread or you take locks. In NVMe-oF, it's possible to keep every connection entirely independent, so NVMe-oF implementations scale better.
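A toy sketch of that structural difference (illustration only, not real iSCSI or NVMe-oF code; iSCSI's session-wide command sequence number is the kind of shared state meant above, whereas an NVMe submission queue's tail is private to its queue):

    # Toy illustration: shared per-session state forces every connection
    # through one lock, while per-connection state needs no coordination.
    import threading

    class SharedSessionState:
        """iSCSI-like: one command sequence counter shared by all connections."""
        def __init__(self):
            self.lock = threading.Lock()
            self.cmd_sn = 0

        def submit(self):
            with self.lock:        # every connection in the session contends here
                self.cmd_sn += 1
                return self.cmd_sn

    class IndependentQueue:
        """NVMe-oF-like: each connection (queue pair) keeps private state."""
        def __init__(self):
            self.sq_tail = 0

        def submit(self):
            self.sq_tail += 1      # nothing shared, no lock taken
            return self.sq_tail

    def drive(submit, n=100_000):
        for _ in range(n):
            submit()

    shared = SharedSessionState()
    queues = [IndependentQueue() for _ in range(4)]
    threads = [threading.Thread(target=drive, args=(shared.submit,)) for _ in range(4)]
    threads += [threading.Thread(target=drive, args=(q.submit,)) for q in queues]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

In a real target, the independent-queue model maps naturally onto one thread per connection with no shared locks, which is where the scaling advantage comes from.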


Ask HN: Do you people also see high performance SSDs vanishing from the shelves, and going up in price at a rapid pace these days?


Not sure if it’s the cause but Chia is a new-ish cryptocurrency which uses Proof-of-Space: https://en.wikipedia.org/wiki/Chia_(cryptocurrency)?wprov=sf...

Might be related


Chia uses high-performance SSD/NVMe drives to generate ('plot') its plots, which is a very read/write-intensive process.

Once a plot is generated, it can be stored on a slower HDD for the proof-of-space farming to take place.


Yes, I have a script that searches eBay for the lowest price per GB; currently, used prices are _equal_ to buying new for many higher-performance drives. There are also far fewer enterprise drives listed than there used to be, but I don't have a hard number for that. Just no more large Fusion-io drives, for example.


This is not related to cloth or some sort of smart sheets like I thought it was going to be.



