Nice compiled list of stats. I'm not sure what they mean by H100s requiring pre-approval on LambdaLabs? Maybe I was grandfathered in since I had an account prior to their rollout, but I never had to do anything special to rent H100s there.
Somewhat related and hopefully helpful: My experience so far using the H100 PCIes to train ViTs:
Pros: A little over 2x performance compared to A100 40GB. 80GB by default. fp8 support, which I haven't played with but supposedly is another 2x performance win for LLMs. For datacenters and local workstations it's twice as power efficient as the A100s. Supposedly they have better multi-GPU and multi-node bandwidth, but my workload doesn't stress that and I can only rent 1xH100s at the moment.
Cons: I had trouble using them with anything but the most recent nVidia docker containers. The somewhat official PyTorch containers didn't work, nor did brewing my own. Luckily the nVidia containers have worked fine so far. I just don't like that they use nightly PyTorch. In addition to that, because of their increased performance, they really start to push the limits on feeding data fast enough to them with existing system configurations. I was CPU-limited on the LambdaLabs 1xH100 machines because of this.
Overall the pricing has worked out equal for my use-case, but I'm sure fp8 would make them more affordable. If fp8 were available out-of-the-box on PyTorch I'd play with it, but it's only available right now from some nVidia codebase specific to LLMs.
Even at equal pricing, having twice the power per GPU and per node is a big win. That increases experiment iteration across the board.
Side note: If I recall correctly, nVidia is heavily differentiating the H100 products by their interface this go around, which I find quite odd. The SXM version of the card is supposed to be something like twice as beefy as the PCIe version? Not sure why they're doing that; gonna make comparing rentable instances all the more difficult if you overlook that little detail. A100 had a little bit of this, but the difference was never much in practice except between specifically the A100 40 GB PCIe and the 80 GB SXM, which was something like 10% faster.
Anyone else having fun with the new toy our overlords have allowed us to play with?
Can you guys expand the number of regions with storage? I don’t know what I’m doing wrong but the only storage available (Texas) never overlaps with available compute.
The melting was caused by a cheap power adapter design sold by Nvidia; it was Nvidia's own doing, selling a cheap cable on a $1,600 card. Using proprietary interfaces has everything to do with monopolistic behavior and nothing to do with faulty standards.
If you don't mind answering, what's your use case? Often I feel you either need to do something large scale, like 100 A100s, or you're better off with 8 3090s to run many experiments in parallel.
My biggest project right now is training a multi-label ViT-L/16 model for a few hundred million samples. Mostly a big experiment, so not something I want to invest serious money into.
I have a 2x3090 rig as my local machine, which has been useful for early experimentation, but I'm at the stage now where my runs are at 200 million samples which would take ages on that rig. 8xA100 can do it in tens of hours, which allows me to iterate faster.
An 8x 3090 or 4090 machine locally would be great, but is a huge hassle to build. Last I looked into it there really wasn't a lot of knowledge available online on how to even do it. I did find an EPYC server motherboard and such that I could theoretically use, but couldn't find a great source for a >3,200 Watt server power supply. Everything in that domain is geared towards either B2LargeCorp, or B2ServerBuilders2SmallBusinesses.
I could of course buy an 8xA100 rig no problem for some ungodly amount of money, but as noted above that's not appropriate for this project.
In both cases, I now have to figure out what to do with 3,200 W (3.2 kW) of heat output in my office, which I'm trying to avoid turning into a sauna. Or co-lo it for more money out of pocket.
So renting off the cloud has worked well enough at this scale.
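For anyone sizing cooling for a rig like that, here is a rough sketch of what a power budget of about 3,200 W means in air-conditioning terms. It assumes essentially all electrical power ends up as heat in the room; the function names are made up for illustration, and the conversion factors are the standard ones (1 W is about 3.412 BTU/h, 1 ton of refrigeration is 12,000 BTU/h).

```python
# Rough cooling-load estimate for a multi-GPU box.
# Assumption: all electrical power drawn is eventually dissipated as heat.

BTU_PER_HR_PER_WATT = 3.412142   # standard conversion factor
BTU_PER_HR_PER_TON = 12_000.0    # one "ton" of air-conditioning capacity


def watts_to_btu_per_hr(watts: float) -> float:
    """Convert a steady electrical load in watts to heat output in BTU/h."""
    return watts * BTU_PER_HR_PER_WATT


def cooling_tons(watts: float) -> float:
    """Air-conditioning capacity (in tons) needed to remove that heat."""
    return watts_to_btu_per_hr(watts) / BTU_PER_HR_PER_TON


if __name__ == "__main__":
    load_w = 3200  # the power-supply size discussed above
    print(f"{watts_to_btu_per_hr(load_w):,.0f} BTU/h")
    print(f"{cooling_tons(load_w):.2f} tons of cooling")
```

That works out to a bit under one ton of cooling running continuously, on top of whatever the room already needs.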
You might be interested in http://nonint.com/. He made a number of quite detailed blogposts on his 2 gpu machines (both 8x 3090's), including racks, power delivery etc
Thank you for that. Google utterly failed to bring...any relevant results when I was searching for what others were doing for ML rigs.
I'll stick with 2x 3090s for now, but maybe I'll go for an 8x monster in a few years when the hardware shakes out a bit if there is a good reason to do so.
> The SXM version of the cards are supposed to be something like twice as beefy as the PCIE version? Not sure why they're doing that
I can't help but feel it's another cloud/enterprise cash grab. The SXM baseboards are way more expensive than server motherboards and NVIDIA makes them.
It’s really not (well, no more than PCIE) - SXM has nvlink integrated, and more power delivery built in. Thus, they can crank the max wattage way higher, though it definitely gets into the diminishing return zone quickly past 300W.
SXM baseboards are made by more than just Nvidia; Dell, HP, and Supermicro all have their own designs.
(Disclaimer - I work for MS Azure but have no internal knowledge about the costs/designs/capex whatever of these systems)
The outcome of Nvidia's monopoly hold on AI/GPU computing is that consumer level devices that might otherwise be perfectly effective for this sort of stuff are prevented by Nvidia from being used for such purposes.
If there was real competition -like two or more other suppliers on par in terms of capability - then artificially constraining devices just to plump up prices would not be a thing.
> consumer level devices that might otherwise be perfectly effective for this sort of stuff are prevented by Nvidia from being used for such purposes.
Citation?
To the contrary, millions of consumer-level Nvidia customers have access to datacenter-grade HPC APIs because of their vertical integration. Nvidia's "monopoly hold" on GPGPU compute exists because the other competitors (eg. AMD and Apple) completely abandoned OpenCL. When the time came to build a successor, neither company ante'd up. So now we're here.
CUDA is not a monopoly. If Apple or Microsoft wanted, they could start translating CUDA calls into native instructions for their own hardware. They don't though, because it would be an investment that doesn't make sense for their customers, costs tens of millions of dollars, and wouldn't meaningfully hurt Nvidia unless it was Open Source.
While OpenCL was simply not equivalent to CUDA, I think you're correct that those other enterprises (Apple, AMD and similar) that could challenge Nvidia on the high-end GPU front simply choose not to. The thing is, the reason is that if there were competition in this market, prices would sink much closer to costs and no one would be making bank, whereas a large enterprise wants a higher return.
Also, a consumer-grade GPU can be used for neural net training at the researcher level, but large corporate use requires H100/A100 and that is what's getting traction.
> prices would sink much closer to costs and no one would be making bank
For Apple and AMD, that's not really a problem. Both of them drive considerable (40%+) margins on their products and can afford to drive things closer to the wire.
I also think more competition here would be good (and I do love lower prices) but Nvidia charges more here because they know they can. It's value-based marketing that works, because their software APIs aren't vaporware.
> large corporate use requires H100/A100 and that is what's getting traction.
I guess... you really need a strict definition of "requires" for that to hold true. For every non-"competing with ChatGPT" application, you could probably train and deploy with consumer-grade cards. You're technically right here though, and it invites the conversation around what actually constitutes abusive market positioning. Nvidia's actions here really aren't much different than AMD and Intel separating their datacenter and PC product lines. It's a risky move from a "keeping both users happy" standpoint, but hardly anticompetitive.
> Both of them drive considerable (40%+) margins on their products and can afford to drive things closer to the wire.
They could do that - but the reason they command these margins is exactly because they don't do that. I think they could do something like that with some investment, but producing products that would compete with Nvidia would require a significant amount of capital for any company - those dealing in chunks of tens of billions of dollars expect above-commodity revenues.
It's not like I like the situation. I wish things were like 90s with a lot of competition making sure individual end-consumers got most of the benefits of Moore's law.
But, putting on my economist hat, not all market-structures naturally generate large-scale competition in the fashion of white box PC clones. Some market structures are naturally monopolies (energy), some are naturally oligopolies (automobiles) and some naturally have a dominant player plus marginal players arrayed around them.
There's just no easy solution to this. That said, it's not like we don't have GPUs of unprecedented power available at a variety of price levels.
Also, it is starting to become the case that CUDA isn't that important anymore: both PyTorch and TF have numerous other backends, and the programmer doesn't need to know what it runs on. And the GGML project has shown that you can come a long way with a good CPU, large "normal" RAM and 4/8-bit weights, with no CUDA in sight. You can definitely enter this domain without having a full-fledged CUDA replacement from the start.
Consumer GPUs are very good for single-GPU inference (or training small models), but Nvidia deliberately made GPU networking slower. E.g. the 3090 has support for NVLink, but it was removed in the 4090.
SLI was never well-supported in the first place. It's a shame it's gone, but compared to multi-GPU tiling solutions I don't think it's much better, at least for AI.
Nvidia is certainly hostile to Open Source and not the kindest hardware vendor to boot, but that alone does not make a monopoly.
I don’t blame Nvidia. I think caring about blame when discussing this topic is missing the forest for the trees anyway. The bigger problem is that Nvidia can and will abuse their monopoly power. Whether it’s Nvidia’s fault or not, once upon a time governments would step in when that happened. Thing is, I don’t think that solution even exists for something as high tech as GPGPU stacks and vertical integration. I don’t see a solution at all really, which worries me a little in the medium-long term.
Contrary to the dogma of certain politico-economic camps, a monopoly (the actual presence of market power and absence of substitution effect marking an absence of actual competition in some space) can exist without competition being illegal. So, “Any…competitor could even legally re-implement [CUDA] for their hardware” is not a counter-argument to CUDA being the basis for an actual existing monopoly.
It might be an argument that, to the extent that that is the sole basis for the monopoly, the monopoly is unlikely to be a long-term stable condition, but it's not a counterargument to it existing.
Then let's not mince words. This behavior is not illegally anticompetitive. Nvidia's advantage is fair, and they only monopolize GPU compute APIs because their competitors literally abandoned their own solutions.
No one said it was. Having a monopoly isn’t illegal in the US at all (leveraging it in certain ways is.)
The claim was that (1) NVidia has a monopoly, and (2) the effect of that monopoly has been consumer devices getting worse for this use in specific, well-defined ways. Legality of NVidia’s actions and fairness of how their market position arose are not particularly relevant.
It’s true they price-partition their products basically perfectly (as in there isn't a magical good deal anywhere in the range).
But these specific criticisms don't really ring true. Lower VRAM bandwidth lets them use lower-binned VRAM, and there isn't really a need for more RAM than the 24 GB in the 4090 in gaming.
The naming screw-up they did with the 4080 was dumb and fortunately corrected quickly. But it doesn't seem related to the OP's point.
This is what we've done at Salad (www.salad.com). SaladCloud (officially launching on July 11th) has 10k+ Nvidia GPUs at the lowest prices in the market. Not conducive for training but for serving inferences at massive scale.
The only thing I am happy about in all this AI hype is that Infiniband is getting some love again. A lot of people are using RoCE on Connect-X HBAs, but there are still a lot of folk doing native IB. If HPC becomes more commonplace maybe we get better subnet managers, IB routing, i.e. all the stuff we were promised ~10+ years ago that never had a chance to materialise because HPC became so niche and the machines had different requirements (availability etc.) than OLTP systems, so there was no demand for that stuff getting built out. Especially the subnet managers, as most HPC clusters just compute a static torus or Clos-tree topology.
There was a time I was running QDR Infiniband (40G) at home while everyone else was still dreaming of 10G at home because the adapters and switches were so expensive.
QDR infiniband at home was such a great hack, and I was doing the same thing with my small network a little while ago. As long as you run connections point-to-point (with no switch) and keep the computers near each other (in twinax range), it's incredibly cheap and blazing fast.
Yeah I ran it for years with no issues. I modified a Mellanox switch to use quieter fans and eventually switched to fibre SFPs so I could make longer cable runs (also twinax is bulky/heavy). Was absolutely epic to use my ZFS NAS at essentially native speeds from my desktop and media PC. I was using the IB SRP target to export ZVOLs so with fast all-flash storage you really couldn't tell it wasn't running on a (giant) local SSD.
It's not for general use but if you build large infrastructure stuff IB can be a very interesting transport choice. Naturally you will only do this if you want to either use RDMA directly (via IB Verbs) or use protocols like IB SCSI RDMA Protocol which do so under the hood.
Things like "Datacentre Ethernet" and RDMA over Converged Ethernet (RoCE) are basically Infiniband anyway (you will find that such features are only available on Connect-X HBAs from Mellanox which can generally run in 100GE or IB mode).
The main reason these things matter is they are circuit switched and generally combined with network topologies that ensure full bisectional bandwidth (or something that is essentially good enough for the communication pattern the machine is designed for). This means the network becomes "lossless" once it's properly verified as online and circuits are set up. For storage this is essentially unmatched: it's equivalent to Fibre Channel but way faster and converged, so you can also run your workload over the same links rather than having separate storage and network links.
I haven't kept up with the pricing for the last decade; back when I was working with it, it was a fraction of the cost of comparable Ethernet. Maybe things have changed drastically though.
On the used market I'm sure you can get some decent kit for a homelab. But if you want a production environment, with anything close to the latest speeds (≥100G), you can't get gear for love or money: all the Big Players are taking all the inventory, so lead times are slightly ridiculous (assuming anyone has stock).
We were looking at IB in Feb 2022, and the earliest delivery for IB switches was Nov-Dec 2022. We ended up going a different route for other reasons, so I haven't kept up on lead times, but it would not be surprising if things were roughly the same.
Also hard to get from vendors: wait lists for >100G switches are long. No one has them in stock, probably because LLMs are what all the cool kids are doing, so all inventory/production has been spoken for.
It’s hard to have a serious conversation about black swan events because you’re a fool until they happen. Anyway, if China invades Taiwan, which appears increasingly likely, the availability of GPUs will plummet. If this occurs in about 2 years, it will hit at peak AI (IMO) before supply chains have adapted to new extreme levels of demand. In fact peak demand may be a catalyst for invasion. Hoarding GPUs at this point may seem laughable, but having capacity during and immediately after an invasion will be a competitive advantage of note.
China implementing the National Security Law in Hong Kong, as forcibly as they did, was a trial run for what they want to do in Taiwan.
China's military has been built with the explicit long term goal of invading Taiwan. They drill for it, with pretty provocative displays - e.g. flying ballistic missiles over the island last year.
I think they will continue to build their military until success is assured, winning the battle before it's fought, per Sun Tzu. That's the best way to fight: make resistance obviously pointless.
Russia, to take a different example, never had enough troops to occupy Ukraine. They had ~130k troops on the borders. Per [1], crushing the Czechoslovakia 1968 uprising and maintaining order afterwards required 500k troops for 14 million people, and that was without state military resistance. Ukraine had 3x that, Russia would have needed an army on the order of 1.5 million. So the plan was to fly in and take Kyiv in a decapitation attack, but it failed, and here we are.
China has an army of 2 million people actively serving. Taiwan is 23 million people. I think it's a matter of when, not if, as long as Xi is in the ascendant, and isn't too distracted by the growing problems of the lopsided debt-laden, export-focused, manufacturer subsidizing mercantilist Chinese economy.
Why did Putin attack Ukraine instead of retiring on a superyacht? Why is he risking it all for what would, in a best-case scenario, become a decade-long guerrilla war against the Ukrainian resistance, even if Russia managed to capture all of Ukraine (which probably was the assumed/likely outcome as the war broke out)?
Clearly the incentives and motivations of people running oppressive regimes are slightly different.
> The leaders of the CCP have everything they could possibly want in this world. They won't risk it all for an uncertain outcome attacking Taiwan.
They don't have Taiwan. Historically, they think Taiwan is not a distinct country, but a splinter off of China that needs to be brought back into the fold again.
Just the same bullshit Putin thinks about Ukraine (and most probably also about Belarus). The only thing saving Taiwan from a Chinese invasion is just how bloody Putin's nose got in Ukraine - and that's without the defense pact Taiwan has with the US.
In modern times major powers invade other countries to distract from domestic issues. Putin did it, Bush did it, etc.
The official reasons given differ but generally are just an excuse to drive the nationalism which is the goal of the war in the first place. China's government seems to be doing well enough right now to not need that trump card.
While I agree that the Iraq war was completely out of any legal boundaries (unlike Afghanistan), it had never been the intention of the US to pull a land-grab war and turn Iraq into the 51st US state. The invasions of Putin however clearly were, and so will be a potential Chinese invasion of Taiwan, assuming they stick to their actions in Hong Kong.
I've only thought about this from the perspective of TSMC fabs getting hit with cruise missiles, but I've never gone down the long-term impacts of the trade freeze that would happen with China. Decades of globalization having to be torn down and reconstructed in a very, very short time period.
Am I a bad person for wishing the AI bubble would burst already so Nvidia would go back to allocating their capacity towards affordable consumer GPUs?
ChatGPT is cool, but not cool enough for me to be OK with the ridiculous prices Nvidia can get away with on GPUs nowadays.
Consumer GPU sales are at the lowest point they've been in decades, but Nvidia doesn't care. Their profits are higher than ever because everyone and their dog is paying out the nose to build up their AI nonsense.
Strong demand for faster and more capable GPUs is the best thing that can happen for GPU prices in the long run.
It'll incentivize GPU development, and the more new/more-capable GPU models come out, the cheaper older/less-capable models will get.
We've seen this cycle again and again with all sorts of computer hardware, from hard drives to monitors, CPU's, to memory, to graphics cards themselves. With some notable exceptions, the power and capacity of most hardware consumers can afford today dwarfs what was affordable a decade or two ago.
The same will happen to GPUs if demand remains high. So bring it on!
Used 1080 Tis are going for less than $200 right now and can do solid 1080p gaming and light 1440p gaming, if you don't care about raytracing, DLSS, etc.
If you do care about those things, then you can pay for them by buying a 3000 or 4000 series card. They're a bit more expensive.
I have a 3080, I paid MSRP ($699) for it in 2020 because I got lucky. It will last me a while, but I'm dreading seeing the 5080 next year and having to pay an MSRP that is twice that for the same class of product.
I don't think Pascal cards will be relevant for AAA games much longer now that AAA games are starting to skip last-gen consoles. Ubisoft showed off their Avatar and Star Wars games this week, and both are shipping with ray-traced global illumination.
Crazy to see stuff like the H100 at $2.40 per GPU hour. It feels like not that long ago that K80s would cost more than that. Moore's law still seems alive in dollars per FLOP.
Or 2-3 A100s. The H100s are allegedly more budget friendly, but I didn't find that to be true on the "black market". And I think the H100 draws a bit less power, but I have absolutely no experience with the software suites NVidia offers.
I think the price is quite high to be honest, although probably realistic. Wouldn't expect competitors to be able to offer something better and data centers need to calculate for maybe newer models with even better performance.
What a great resource. I've been hoping someday for AWS to expose A100s 1 GPU at a time, but you can only rent the 8-bangers. A couple of vendors on this list will rent me one at a time.
Your price quotes for AWS support are wrong: Business support is 10% of your usage, or $100 USD whichever is higher. Enterprise support is 10% of your usage or $15k USD, whichever is higher. It's actually more complicated if your spend is high. Better point people to the info page here: https://aws.amazon.com/premiumsupport/plans/
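To make the "percentage or minimum, whichever is higher" rule above concrete, here is a small sketch. It takes the flat 10% figure and the $100 / $15k minimums from the comment at face value and deliberately ignores the more complicated pricing at high spend that the comment mentions; the function names are invented for the example, and the AWS page linked above is the authority on real pricing.

```python
# Simplified AWS support fee model, exactly as described in the comment above.
# Assumption: a flat 10% rate with a monthly minimum; real plans differ at
# higher spend, so treat this as a lower-spend approximation only.

def support_fee(monthly_usage_usd: float, minimum_usd: float, rate: float = 0.10) -> float:
    """Fee is rate * usage or the plan minimum, whichever is higher."""
    return max(rate * monthly_usage_usd, minimum_usd)


def business_support_fee(monthly_usage_usd: float) -> float:
    return support_fee(monthly_usage_usd, minimum_usd=100.0)


def enterprise_support_fee(monthly_usage_usd: float) -> float:
    return support_fee(monthly_usage_usd, minimum_usd=15_000.0)


if __name__ == "__main__":
    for usage in (500, 5_000, 200_000):
        print(usage, business_support_fee(usage), enterprise_support_fee(usage))
```

The practical upshot is that at small GPU spend the minimum dominates, so support can be a large fraction of the bill.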
Great summary of the industry. I do want to point out that we actually have the lowest price for H100s: 1.89/hr. Coreweave’s pricing requires long term commitment and you’re comparing our on demand PCIe cloud to their ‘reserved’ h100 cloud. Our reserved H100 + InfiniBand cloud is priced as follows and is the lowest in the world:
People talk about cards not being worth the electricity vs cloud. Seems like an a100 pulls 300w, costs $1.50/hr ish to rent, and costs $12,000 to buy, meaning it pays for itself with 1 year of constant use.
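The break-even claim above is easy to sanity-check. This sketch uses the figures from the comment ($12,000 purchase, $1.50/hr rental, 300 W draw); the electricity price is my own placeholder assumption, and it ignores the host machine, cooling, resale value and idle time.

```python
# Buy-vs-rent break-even for a single GPU under constant use.
# Assumptions: 100% utilisation, no host/cooling/maintenance costs, and a
# hypothetical electricity price supplied by the caller.

def break_even_days(purchase_usd: float,
                    rental_usd_per_hr: float,
                    power_kw: float = 0.0,
                    electricity_usd_per_kwh: float = 0.0) -> float:
    """Days of constant use until avoided rental fees cover the purchase."""
    # Each hour you own the card you avoid the rental fee but pay for power.
    net_saving_per_hr = rental_usd_per_hr - power_kw * electricity_usd_per_kwh
    if net_saving_per_hr <= 0:
        raise ValueError("electricity costs more than renting; never breaks even")
    return purchase_usd / net_saving_per_hr / 24


if __name__ == "__main__":
    # Ignoring electricity: 12000 / 1.50 = 8000 hours.
    print(round(break_even_days(12_000, 1.50)))              # 333
    # With 300 W at an assumed $0.30/kWh.
    print(round(break_even_days(12_000, 1.50, 0.3, 0.30)))   # 355
```

So "about a year" holds up, and electricity only moves the answer by a few weeks at typical prices.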
Sounds like a business opportunity. Of course, the risk is that by the time you make your money back, the cards you've been renting to customers are obsolete and you need to recapitalize...
It is much easier to transfer X amount of heat with a liquid than with air. Liquid travels in pipes and has an enormous heat capacity (per unit of volume, water has about 3200 times the heat capacity of dry air at 77°F). Air needs wide straight ducts and fans every 10 meters to keep air flowing.
Sure, but removing that heat to the outside still takes the same amount of energy. Unless they just pour the water down the drain it has to be transferred somewhere, no?
Wouldn't that depend on the sink? If they can reject the heat into water, then running a water-to-water heat exchanger would require less power (due to better heat transfer) than simply rejecting it to the air.
Same I guess if they could use evaporation to do some of the cooling.
Just seems like a really complicated setup to use unless you were going to be saving enough power to make all the complexity worth it. Remember you have to put a water block on each card and design a pump and piping system that will run through all of them.
It's surprisingly simple to run all of this at home.
I just took a bunch of GPUs out of storage and racked up a machine yesterday. From 0 to Linux/Docker install to Stable Diffusion / llama in about 2 hours.
Many old consumer gaming GPUs will run an implementation of Stable Diffusion. But this page seems to be about getting use of H100 and A100, such as one might want for running or training decent-sized LLMs.
Training anything resembling a current LLM from scratch is so far beyond the ability/capital of anyone using a site like this.
Facebook/Meta used 8,000 A100s to train LLaMA (for example).
If you’re doing anything with an LLM it’s fine-tuning and there’s a new approach nearly weekly to do it better, faster, cheaper, and easier on 24GB cards which can be had for less than $1000.
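As a back-of-the-envelope check on what fits on a 24GB card, here is a sketch that estimates the memory needed just to hold model weights at a given precision. The parameter counts are illustrative LLaMA-class sizes, the function names are made up, and this counts weights only: activations, KV cache, and any optimizer or adapter state for fine-tuning come on top, so treat the result as a lower bound.

```python
# Weights-only memory estimate: parameters * bits per weight / 8 bytes.
# Assumption: no overhead for activations, KV cache, or optimizer state.

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Gigabytes (1e9 bytes) needed to store the weights alone."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_vram(n_params: float, bits_per_weight: int, vram_gb: float = 24.0) -> bool:
    """True if the weights alone fit; real workloads need extra headroom."""
    return weight_memory_gb(n_params, bits_per_weight) <= vram_gb


if __name__ == "__main__":
    for n_params in (7e9, 13e9, 65e9):       # illustrative model sizes
        for bits in (16, 8, 4):
            gb = weight_memory_gb(n_params, bits)
            verdict = "fits" if fits_in_vram(n_params, bits) else "too big"
            print(f"{n_params / 1e9:.0f}B @ {bits}-bit: {gb:.1f} GB ({verdict} on 24 GB)")
```

This is why quantized weights matter so much here: a 13B model is 26 GB at 16-bit but 6.5 GB at 4-bit.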
Is the earlier point that people should see what they can do with "common household ingredients", before they assume they need to pay cloud providers for bigger/more iron?
I agree, and I have a 3090 for that purpose, and once wrote a tutorial for others wanting to do ML stuff on a GPU at home rather than rent from a cloud provider.
But a consumer GPU (or eBay older Tesla card) can't do everything that a rental pool of H100 and A100 can do, and I and other readers here will sometimes want to do those other things.
I didn't want to confuse people that all they'd need was to buy up a bunch of random retired Ethereum miner GPUs, no matter what they wanted to do with ML.
Don't get me wrong - websites like this still have a lot of value. My overall point is the assumption that you need A100/H100 to make productive use of LLMs isn't accurate. You can go a very long way with a single 3090 in a workstation for $1000 (as frequently noted on HN, including this thread, r/localLLamA, etc). Or you can rent nearly anything you want on various platforms (most of the consumer RTX stuff is usually only available on platforms like Vast.ai). Whatever works for you in your situation but the somewhat common belief (especially in the more mainstream press) that you need at least 40-80GB of VRAM to do anything LLM related is flat out wrong.
The other benefit I'd add with buying your own GPUs is availability. They are yours and always yours, in a commercial application with deadlines, etc it's a real risk to depend on being able to get the necessary on-demand GPU compute on various cloud platforms at any point in time. There is nothing worse than logging into a cloud provider console and seeing "no availability" when you really need to get something done. For me personally this is what pushed me to buying vs cloud because I ended up in scenarios where Vast.ai was the only option left and I haven't had the best experiences with Vast.ai in terms of reliability and performance (I'm pretty sure many of the benchmarks are gamed, although I'm not sure how).
Speaking of performance, I've also seen very real issues with virtualized CPUs, what I assume is network-attached storage, etc. feeding data to high-end GPUs fast enough (again, noted elsewhere in this thread). In benchmarking that I've done with various cloud providers, unless you go for the much more expensive options on GCP and elsewhere with directly attached NVMe storage, a single NVMe drive and decent CPU in a workstation will run circles around many of these cloud providers.
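One way to check whether a given box is input-bound in the way described above is to time the data pipeline on its own, with no model attached, and compare that rate with what the GPU consumes during training. A minimal sketch follows; `loader_samples_per_sec` and `fake_loader` are hypothetical helpers written for this example, and the fake loader just sleeps to stand in for decoding and augmentation. In practice you would pass in your real data loader instead.

```python
# Measure how fast a data pipeline can produce samples, independent of the GPU.
import time
from itertools import islice


def loader_samples_per_sec(batches, batch_size: int, max_batches: int = 50) -> float:
    """Pull up to max_batches from any iterable of batches and return samples/sec."""
    start = time.perf_counter()
    n_batches = sum(1 for _ in islice(batches, max_batches))
    elapsed = time.perf_counter() - start
    return n_batches * batch_size / elapsed if elapsed > 0 else float("inf")


def fake_loader(n_batches: int, delay_s: float):
    """Stand-in loader: each batch costs delay_s of simulated CPU/disk work."""
    for _ in range(n_batches):
        time.sleep(delay_s)
        yield None


if __name__ == "__main__":
    rate = loader_samples_per_sec(fake_loader(20, 0.005), batch_size=64, max_batches=20)
    print(f"pipeline alone: {rate:,.0f} samples/sec")
```

If the pipeline-only rate is not comfortably above the rate you see during training, the GPU is waiting on data, and more CPU cores or local NVMe will help more than a faster GPU.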
A friend of mine is on the LLaMA team at FAIR. They had 8,000 Nvidia A100s at the time. The only reference to your 2048 number was a report where someone at Facebook "estimated" they used 2048 GPUs for five months. My understanding is they used that number as an average over time/power to attempt to calculate carbon usage and the real number varied quite a bit - my friend had an interesting anecdote about detecting uncorrected and otherwise undetectable (even with ECC) GPU VRAM memory errors at that scale.
In any case I think when you're talking A100s in the thousands my point remains. No one is just showing up cold to a cloud provider from a website link and spending at least tens of millions of dollars.
It's only four words and I'm not understanding what you're saying here as all of your other comments (including the one I replied to) seem to be very much in agreement with my position.
I have an RTX 3060 on my desktop, and my friend has an RTX 4090 on his desktop turned server.
We are both running Stable Diffusion with Dreambooth etc., Llama (well, he is running Llama, I am running 4-bit Llama), Deep Floyd and so on. Both machines are running Ubuntu, don't know what distros are good for servers.
Incidentally, NVDA closed at 429.97 today, an all time high (it was 108.13 last year, after they were banned from selling A100s/H100s to China - it has almost quadrupled in price in a year).
We're a distributed GPU cloud with 10k+ consumer GPUs on our network.
We can crank out 4500+ stable diffusion images per dollar (highest in the market), all on consumer 3060s.
The high-end GPUs definitely have a role, but the bulk of production inference serving can be done on consumer GPUs at 80-90% of the cost.
Who are the companies renting GPUs at scale? There is obviously GAFAM (which basically have their own clouds), OpenAI and other startups with big contracts for said cloud services, but aside from them, which companies actually rent dozens of GPUs to train their networks? And what do they bring to the table that makes them think they can achieve better results than OpenAI and GAFAM?
Serious question. I think most companies without the top talent would do better using an API product.
We just dropped our H100 on-demand prices to 1.99/hr. It's now the lowest in the world, you don't need to talk to a sales person, and it's available now: https://t.co/AcqRIITS1s
When capacity or cost constrained (or both) consider running ML on “lesser” GPUs like A40, A10s and even consumer grade 3090/4090s. With compiler optimizations these uncover surprising price-performance ratios.
Cloud prices are very competitive: you can get 2x RTX 3090 located in Canada for $0.31 per hour or so. Here in Germany the electricity alone would cost more than that (0.36 €/kWh × 0.9 kW ≈ 0.32 €/h).
I wonder how much data transfer costs (for training or fine-tuning) might add on top of that. I am assuming ingress is relatively cheap on most services (as on AWS).