
It turns out that high-altitude balloons don’t pop when you put small holes in them: https://apnews.com/article/268893fddde785d029d5a51b136951eb


I’m not sure I understand what you mean. I’ve run HPC clusters for a long time now, and node failures are just a fact of life. If 3 or 4 nodes of your 500-node cluster are down for a few days while you wait for RMA parts to arrive, you haven’t lost much value. Your cluster is still functioning at nearly peak capacity.
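To put a number on "nearly peak", a quick sanity check (cluster size and outage count taken from the example above):

    # Remaining capacity with a handful of nodes down
    nodes, down = 500, 4
    print(f"{(nodes - down) / nodes:.1%} of peak capacity")  # -> 99.2%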

You have a handful of nodes that the cluster can’t function without (scheduler, fileservers, etc.), but you buy spares and 24x7 contracts for those nodes.

Did I misunderstand your comment?


Not in the context the person you responded to meant it. Yes, you can very easily get 50GB/s from a few NVMe devices on a single box. Getting 50GB/s on a POSIX-ish filesystem exported to 1000 servers is very possible and common, but orders of magnitude more complicated. 500GB/s is tougher still. 5TB/s is real tough, but real fun.
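A rough back-of-envelope makes the gap concrete. The ~3 GB/s per NVMe figure and the linear scaling are assumptions; real parallel filesystems lose plenty to the network, metadata, and POSIX semantics:

    # Back-of-envelope: aggregate bandwidth targets, per-client share,
    # and a floor on device count. ~3 GB/s sustained per NVMe is assumed;
    # linear scaling is assumed too, which is exactly what you don't get.
    NVME_GBPS = 3.0
    CLIENTS = 1000

    for target in (50, 500, 5000):  # GB/s
        per_client = target / CLIENTS
        drives = target / NVME_GBPS
        print(f"{target:>5} GB/s -> {per_client:6.2f} GB/s per client, "
              f">= {drives:5.0f} drives before any overhead")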


This is tangential to your point, but I’ll just mention that Azure has some properly specced-out HPC gear: IB, FPGAs, the works. You used to be able to get time on a Cray XC with an Aries interconnect, but I never had occasion to use it, so I don’t know if you still can. They’ve been aggressively hiring top-notch HPC people for a while.


That's the Sentinel system. I worked on it when I was at Cray, and we did some COVID stuff[1][2] with a researcher at UAH. We accelerated a docking code using some cool tech I created (in Perl, so there!) and some mods my teammates made to the queuing system.

The work won some award at SC20[3] (fka the Supercomputing conference). I had considered submitting for the Gordon Bell Prize, which had been specifically requesting COVID work, but I thought the stuff we had done wasn't terribly sexy. We were getting ~250-500x better performance than single-CPU runs.

Looking back over these, I gotta chuckle, as this (press releases) is pretty much the only time I'm called "Dr.". :D

Back to the OP's points: they are right. In most cases, cloud doesn't make sense for traditional HPC workloads. There are some special cases where it does; those tend to be large ephemeral analysis pipelines, as in bioinformatics and related fields. But for hardcore distributed (mostly MPI) code, running for a long time on a set of nodes interconnected with low-latency networks, dedicated local nodes are the better economic deal.

During my stint at Cray, I was trying (quite hard) to get supercomputers, real classical ones, into cloud providers, or to become a supercomputing cloud provider ourselves. The Met Office system, a Cray Shasta, is in Azure, but that was more of a special case. I couldn't get enough support for this.

Such is life. I've moved on. Still doing HPC, but more throughput maximized.

[1] https://www.uah.edu/science/departments/math/news/14954-uah-...

[2] A whole marketing writeup was done here: https://www.hpe.com/us/en/newsroom/journey-to-accelerate-dru... I tried very hard to correct the errors in the writeups. Sadly, I wasn't successful.

[3] https://baudry-lab.uah.edu/news#h.121c63ayp0k0


The Azure Met Office win left me very conflicted. As someone who is relatively positive about cloud adoption for science it was good to see some forward thinking. On the other hand, what I've heard about how the procurement was run plus my taxpayer-based views on where critical national infrastructure should be housed makes me rather less happy about the outcome.


So in Azure it's possible to get access to an InfiniBand cluster somehow? Bare metal?


I don't speak for them (never have), but I believe it to be possible. MSFT do a number of things right (and a few really badly wrong), but you can generally spin up a decent bare-metal system there. IO is going to be an issue with any cloud; real performance will cost you. Between that and networking, clouds could potentially throw in the compute for free ...
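To make "IO will cost" concrete, here's a toy model of the capacity-tied-throughput pricing most managed parallel filesystems use. Every number is an assumption for illustration, not any provider's actual pricing:

    # Toy model: managed parallel filesystems usually tie throughput to
    # provisioned capacity, so buying bandwidth means buying capacity.
    # All numbers are hypothetical.
    mb_per_s_per_tb = 250       # assumed throughput per provisioned TB
    usd_per_tb_month = 150.0    # assumed storage price

    target_gb_per_s = 500
    tb_needed = target_gb_per_s * 1000 / mb_per_s_per_tb
    print(f"{tb_needed:,.0f} TB provisioned just to hit {target_gb_per_s} GB/s "
          f"-> ${tb_needed * usd_per_tb_month:,.0f}/month before egress")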

Reminds me of a quip I made back in my SGI-Cray (first time) days: a Cray supercomputer (back then) was a bunch of static RAM that was sold along with a free computer ... Not really true, but it gave a sense of the costs involved.

This said, Azure had (last I checked) real Mellanox networking kit for RDMA access. At Cray we placed a cluster in Azure for an end user (who shall remain nameless), and used several of Mellanox's largest switch frames for 100Gb InfiniBand across >1k nodes, each with many V100 GPUs. The system would have landed in the mid single digits on the Top500 list that year.

AWS is doing their own thing network-wise. Not nearly as good from a performance standpoint (latency or bandwidth) as the Mellanox kit. I don't know if Google Cloud is doing anything beyond TCP.

You can do bare metal at most/all of these. You can do some version of NVMe/local disk at all of them. Some/most let you spin up a parallel file system (network charges, so beware), either their own Lustre flavor or one of BeeGFS, Weka, etc.


The Azure FPGAs are a bit tangential from a customer perspective; they are just the equivalent of the AWS Nitro smart-NIC. Azure IB is interesting in that I originally expected it to be a killer feature, but for customers I work with it just isn't enough to overcome the multitude of downsides of having to use Azure for everything else. In the end, hardly any commercially relevant codes absolutely need IB, and work well enough with the low-latency ethernet both AWS and GCP offer.


That was definitely one of the weirdest things about working in academic IT: “Hey, can you buy me a workstation that’s as close to $6,328.45 as it is possible to get, and can you do it by 4pm?”
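That request is basically a toy subset-sum problem. A sketch, with an entirely made-up parts catalog:

    # Brute-force the combination of parts whose total lands closest
    # to the target budget. Catalog and prices are hypothetical.
    from itertools import combinations

    TARGET = 6328.45
    parts = {
        "workstation": 4899.00, "RAM kit": 612.50, "NVMe": 389.99,
        "monitor": 449.00, "dock": 279.00, "keyboard": 129.00,
    }

    best = min(
        (c for r in range(1, len(parts) + 1) for c in combinations(parts, r)),
        key=lambda c: abs(TARGET - sum(parts[p] for p in c)),
    )
    total = sum(parts[p] for p in best)
    print(f"{best} -> ${total:,.2f} (off by ${abs(TARGET - total):.2f})")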


Same thing happens in the government sector here (US). If you don't spend all of the budget you requested last year, you might not get it next year. There is an entire ecosystem of bottom-feeder GSA companies that apparently exist to spend year-end money that would otherwise go to 'waste'.


I must not be understanding your point, because I think the data show that some doctors pay much more than $71k/yr.[1]

[1] specifically the table on the last page of https://www.ama-assn.org/sites/ama-assn.org/files/corp/media...


Medscape is widely regarded as the reputable source for physician salaries.

They do not consider hospital-issued/covered malpractice insurance as part of the TC listed in their yearly averages, which is why I don't think it should be made into as big a deal as it is being here.

I should say, these numbers affect private practices almost exclusively, and private practices are dying out / being bought out in droves.



What does this have to do with the topic of discussion?


At the time of the last list, there were two systems in China that had recently broken this barrier, but their owners have chosen not to submit benchmarks to the Top500.


> there were two systems in China that had recently broken this barrier

Allegedly broken this barrier. There is a reason that science is conducted in the open, in a reproducible and traceable manner. Those systems might not function properly at scale, or might not have run at an exaflop in double-precision compute.

Frontier is certainly the first publicly verified system to achieve exascale on the internationally accepted standard measurement (the HPL benchmark).


There is also an article about how they did submit a score from one of these Chinese "exaflop" systems for a different benchmark, and it turns out it can only achieve the claimed performance at half precision:

https://www.tomshardware.com/news/chinese-exascale-supercomp...
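For a sense of scale: hardware FP64 throughput is typically a fixed fraction of FP16 throughput, so a half-precision headline number says little about an HPL (FP64) score. The claimed figure and the ratios below are assumptions, purely for illustration:

    # Sketch: translating a claimed half-precision figure into plausible
    # double-precision estimates under assumed FP16:FP64 throughput ratios.
    claimed_fp16_ef = 1.3  # hypothetical claimed FP16 exaflops

    for ratio in (2, 4, 8, 16):  # assumed FP16:FP64 hardware ratios
        fp64 = claimed_fp16_ef / ratio
        verdict = "exascale" if fp64 >= 1.0 else "sub-exascale"
        print(f"FP16:FP64 = {ratio:>2}:1 -> ~{fp64:.2f} EF/s FP64 ({verdict})")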


Which would still make it faster than the previous number-one machine, no?


Maybe not. Fugaku also had super-exascale performance at reduced precision.


I suppose we can discuss what the source of the souring of scientific cooperation is. But it does appear that China has at least two computers that are faster than its current best-performing entry on the Top500. And that basically invalidates the list.


China claims a lot of things that would invalidate common knowledge, but that doesn't mean any of it is real, reproducible, or novel. There is a reason why scrutiny exists.


It certainly does—my apologies. This is where I saw it first. Unfortunately I can’t change it.


BI runs content syndicated from elsewhere, so appearing on BI doesn’t indicate that it’s their reporting (although someone who can see past the paywall may be able to say for sure). Assuming that the linked source is legally running the piece, it’s a better link for HN because it isn’t paywalled.


Unrelated, but I clicked your name and noticed you joined in 2007, the year HN started. That's pretty cool! How did you hear about it back then?

Also, drift.space/jamsocket seem pretty interesting (the jamsocket site seemed to lag when scrolling).


I think it must have been from coming across PG’s writing, but I can’t remember exactly.

Thanks for the site feedback. Which platform are you on? I’ll see if I can reproduce it.


Awesome. I'm on an MBP, macOS 11.6.5. I tried again, and it seems that scrolling feels a bit laggy/jumpy, but otherwise the site looks great.

Curious what a realtime chat app w/stateful backend might look like!


> Alex Morrell is a correspondent at Business Insider covering Wall Street at large

OP's article is trendsquatting.


The quoted segment of your comment answers my question, but what is trendsquatting?


It's a term that felt appropriate to describe what OP's article was achieving.


This is certainly not universally the case, even in very well-regarded departments. The University of Chicago, for example, does not require it: http://collegecatalog.uchicago.edu/thecollege/mathematics/.


For those of you interested in the Chicago approach, a bibliography of textbooks used in Chicago UGrad math is maintained here:

https://github.com/ystael/chicago-ug-math-bib

