A new programming language for high-performance computers (news.mit.edu)
173 points by rbanffy on Feb 11, 2022 | 102 comments




Meanwhile, HPC software will continue to be written with C+MPI, or Fortran+MPI.

Source: I was involved in the development of multiple HPC frameworks and attended multiple HPC conferences; despite wailing and gnashing of teeth, all actual software continues to be written the same way it always has been.


It'd be quite a boring world if observations of the past could reliably predict the future.


In this space, it is more that you don't want to lose all of the progress made in these older libraries. Coming up with a new ecosystem is an immediate race to parity with the older one. And a lot of smart people were involved with that. (Such that you aren't competing with a single idea or implementation, but with an ecosystem of them.)

Hats off for making a good shot. But don't be surprised to see reluctance to move.


There's very promising work on upgrading older libraries/legacy code with a technique called verified lifting. The technique has been used successfully at Adobe to automatically lift image processing code written in C++ to use Halide. The technique also guarantees semantic equivalence so users can trust the lifted code.

Paper: https://dl.acm.org/doi/pdf/10.1145/3355089.3356549


Before reading the paper: is "verified lifting" kind of like "deterministic decompilation" — where you ensure, with every modification to the generated HLL source, that it continues to compile back into the original LL object code?

(See e.g. the Mario 64 decompilation, https://github.com/n64decomp/sm64, whose reverse-engineering process at all times kept a source tree that could build the original, byte-identical ROMs of the game, despite being increasingly [manually] rewritten in an HLL.)


To clarify, I’m not in any way involved with the paper described in the article. Just interested in the space :)


I'm not an expert. But with current CPUs, doesn't HPC critically depend on memory alignment? (For two reasons: Avoiding cache misses, and enabling SIMD.) Sure, algorithms matter, probably more than memory alignment. But when you've got the algorithm perfect and you still need more speed, you're probably going to need to be able to tune memory alignment.

C lets you control memory alignment, in a way that very few other languages do. Until the competitors in HPC catch up to that, C is going to continue to outrun those other languages.

(Now, for doing the tweaking to get the most out of the algorithm, do I want to do that in C? No. I want to do it in something that makes me care about fewer details while I experiment.)
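(For contrast, here is a rough sketch of the kind of alignment control meant here, done the hard way from the Python side by over-allocating and slicing; the helper name is made up, and note that nothing in this approach tells a compiler or JIT to assume the alignment.)

  import numpy as np

  def aligned_zeros(n, dtype=np.float64, alignment=64):
      # Over-allocate a byte buffer, then slice so the data pointer
      # lands on an `alignment`-byte boundary.
      itemsize = np.dtype(dtype).itemsize
      buf = np.zeros(n * itemsize + alignment, dtype=np.uint8)
      offset = (-buf.ctypes.data) % alignment
      return buf[offset:offset + n * itemsize].view(dtype)

  a = aligned_zeros(1 << 20)
  assert a.ctypes.data % 64 == 0  # 64-byte aligned: cache-line and AVX-512 friendly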


Nothing C does is even remotely special anymore. If anything it's total crap by modern standards because it makes performant abstractions harder. C code uses a lot of linked lists because linked lists are easy.

C isn't the Lingua Franca of fast software anymore. The reason C programs can sometimes be faster today, however, is not an intrinsic property of the language but rather its inability to allow incompetent programmers to hide their bad data structures.

C also encourages programming styles that are harder to optimize so some loop optimizations are no longer possible.


> C code uses a lot of linked lists because linked lists are easy.

Strong disagree. I don't see any reason why a competent programmer would hand-code a linked list, which is more of a hassle to make than a simple array (e.g. using buffer pointer + capacity and/or length field), unless the linked list makes actual sense for performance.


I'm not saying you're duty bound to do it but rather giving an example of something which in my experience is encouraged by C since it's the path of least friction.

Try writing a type that can automatically switch layout from AOS to SOA in C, you just can't do it.
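(For anyone not familiar with the distinction, here's a rough NumPy sketch of the two layouts being discussed; it only shows the layouts themselves, not the automatic switching being asked about.)

  import numpy as np

  # Array of structures (AoS): fields interleaved, one record per element.
  aos = np.zeros(1_000_000, dtype=[('x', 'f4'), ('y', 'f4'), ('z', 'f4')])

  # Structure of arrays (SoA): one contiguous, SIMD-friendly buffer per field.
  soa = {name: np.zeros(1_000_000, dtype='f4') for name in ('x', 'y', 'z')}

  # Same logical computation; the AoS version strides over interleaved
  # records, while the SoA version streams through contiguous memory.
  r2_aos = aos['x']**2 + aos['y']**2 + aos['z']**2
  r2_soa = soa['x']**2 + soa['y']**2 + soa['z']**2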


You can't hide AOS vs SOA in C and in most other languages, but that has nothing to do at all with linked lists.

(you could use a macro to encapsulate the layout though, or in C++ could code a function that returns a reference)

In what situations are linked lists the path of least friction??


I picked linked lists as a real example of something I've seen people do. It doesn't really make any difference to my point.

C is a bad programming language in 2022


Well, picking an invalid example (that C encourages the use of linked lists in situations where they aren't a good idea) and a very questionable example (automatic conversions between different SoA/AoS representations are still mostly at an experimental stage, not an established feature of performance programming in 2022) is not a good way to support a claim that isn't empirically evident.


It's empirically evident in that no one is picking C for new projects anymore.

Anything fast is in C++ or C++ like languages.


Out of curiosity, what are some languages that can automatically switch layout between struct-of-arrays and array-of-structs?


The only language I know for sure to do it for you (as in you don't have to write the type) was Jai a while back (I'm told Blow removed that feature).

The only language I've actually done it in, is D. It's probably doable in many other nu-C languages these days, but D at very least can make it basically seamless as long as you do some try-and-break-shit testing to make sure nothing is relying on saving pointers when they shouldn't. This obviously constrains the definition of automatic ;)

I don't have my implementation to hand because it grew out of a patch that failed due to the aforementioned pointer-saving in code that I'm not paid enough to refactor, but here's one someone else made https://github.com/nordlow/phobos-next/blob/master/src/nxt/s... and there's another one in that repository too. I've never used those particular implementations, but they're both by people I know, so hopefully they're not too bad.

A more subtle thing, which I haven't used in anger but would like to try at some point, is to use programmer annotations (probably in the form of user-defined attributes) to group data so that temporal locality <=> spatial locality, but I've never bothered to actually do it.

There are some arrays of structs in an old bit of the D compiler that are roughly the size of a cacheline, and aren't accessed particularly uniformly. I profiled this and found that something like 75% of all LLC misses (hitting DRAM) were due to 2 particularly miserable lines... inside an O(n^2) algorithm.


ISPC has some support for this using the "soa<>" qualifier. For example,

  struct Point { float x, y, z; };
  soa<8> Point pts[...];
  for (int i = 0; i < 8; ++i) {
    pts[i].x + pts[i].y;
  }
behaves as if originally written like

  struct Point { float x[8], y[8], z[8]; };
  Point pts;
  for (int i = 0; i < 8; ++i) {
    pts.x[i] + pts.y[i];
  }
See https://ispc.github.io/ispc.html#structure-of-array-types

I've not yet had an excuse to use ISPC myself, unfortunately :(


Zig has a compile time annotation that can do it.

https://zig.news/kristoff/struct-of-arrays-soa-in-zig-easy-i...


Jovial 73 has the PAR qualifier which converts an AoS to a SoA.


> C isn't the Lingua Franca of fast software anymore.

What is it, then?

> The reason C programs can sometimes be faster today, however, is not an intrinsic property of the language but rather its inability to allow incompetent programmers to hide their bad data structures.

I'm not sure I manage to parse that properly. What's wrong with not being able to hide bad data structures? And how does this make C faster?


C++ naturally.


Increasingly less so over time I presume. Sneering at C++ is rapidly overtaking insulting Javascript as the world's conversational standby.


“There are only two kinds of languages: the ones people complain about and the ones nobody uses.”


Sour grapes.


It never was.


You're correct that low level details very much matter in the HPC space. The types of optimizations described in this paper are exactly that! To elucidate, check out Halide's cool visualizations/documentation (which this paper compares itself to) https://halide-lang.org/tutorials/tutorial_lesson_05_schedul...


Memory alignment is important, and languages that let you control it and/or align automatically given sensible assumptions do have an edge here.

Just a random example: I wrote a two-line Python function using the Numba jit, because I knew that some Python behavior was going to make it less performant, and it is a high-performance kernel, so we need it fast (and also with a lower memory footprint). Compiling with the Numba jit is a no-brainer because I just add one more line with about 10 seconds of effort.

But the code base I am merging into has a policy against Numba (reasonably so, as we are targeting an HPC platform where Numba has some performance problems related to oversubscription if not set up carefully).

So I ended up rewriting it in C++ and wrapping it with pybind11, and the result is around 30% faster.

Since the algorithm is entirely trivial, the only explanation I have is memory alignment: I can control it in C++, but with the Numba jit there's no way either to guarantee the allocated array is aligned or to tell the compiler to assume that it is.

(The 30% number is also reasonable in textbook examples.)

P.S. But it has a cost:

> Showing 14 changed files with 230 additions and 36 deletions.

> from https://github.com/hpc4cmb/toast/commit/a38d1d6dbcc97001a1ad...

where the 36 deletions are mostly just documentation. So it's a ~200-line effort compared to a 4-line effort...
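(For context, the "one more line" is literally just the decorator; a rough sketch with a made-up kernel body, not the actual function from that commit:)

  import numpy as np
  from numba import njit

  @njit(fastmath=True, cache=True)  # the "one more line"; the body is plain Python
  def kernel(x, w):
      acc = 0.0
      for i in range(x.size):
          acc += w[i] * x[i] * x[i]
      return acc

  x = np.random.rand(1_000_000)
  w = np.random.rand(1_000_000)
  print(kernel(x, w))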


Most compiled languages have allowed control over alignment, even before C was an idea.


That's pretty literally the framework of science.


I'm helping to organize a workshop about alternatives to MPI in HPC. You can see a sample of past programs at [1].

But you're right: today, MPI is dominant. I suspect this will change, if only because HPC and data center computing are converging, and the (much larger) data center side of things is very much not based on MPI (e.g., see TensorFlow). Personally, I find it reassuring that many of the dominant solutions from the data center side have more in common with the upstart HPC solutions than they do with the HPC status quo. I'd like to think that at some point we'll be able to bring around the rest of the HPC community too.

[1]: https://sourceryinstitute.github.io/PAW/


I still remember when PVM vs MPI articles were common.


Python and Julia are a (maybe distant) 3rd and 4th.

Using Python on an HPC system is, depending on who you ask, kind of cheating, as the main work is always done in C/C++ (or Fortran). So it is still really just C/C++ and Fortran.

Julia is really promising as it is the first new language that can efficiently run on HPC and can use other forms of message passing.

Also, we should not forget UPC, a special kind of C that has its own parallel programming model (not MPI). It has been used on HPC systems as well, but perhaps only by a select few. (This probably reveals where I come from.)


Speaking from how we were already doing things at CERN 20 years ago: Python was plainly more comfortable to use than the UNIX shell for program orchestration, that was all, and the CMT build system used Python as well.

Real work was always C++ and Fortran.


Oh, CERN? I think they made PyROOT, which is horrible? (And even a Jupyter kernel for interactive use.)

In terms of scripting, a professor once showed us a workflow script from his old days, written entirely in bash, that was very, very long and very complex. He basically said that if your script gets that complicated, it should be written in Python nowadays.

But Python's HPC presence is deeper than scripting in that sense. Machine learning frameworks are definitely a contributing factor, but even more traditional HPC workflows benefit when things have a Python API, especially with mpi4py.

I once used something called MPI Bash that I needed to compile myself on an HPC platform. It is much less well known and used, perhaps because of what the professor said: if it needs MPI, it should probably be scripted in Python instead, even for scripting.
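(The kind of mpi4py scripting meant above looks roughly like this; the work list and the per-rank task are made up:)

  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank, size = comm.Get_rank(), comm.Get_size()

  # Each rank takes its own slice of a (hypothetical) work list.
  work = list(range(100))
  partial = sum(work[rank::size])  # stand-in for the real per-rank task

  total = comm.reduce(partial, op=MPI.SUM, root=0)
  if rank == 0:
      print("total:", total)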


No mention of Chapel? Just asking; it's been in development for years, and I just saw its name on the Benchmarks Game, so some people still have interest in it.


I too was surprised.


I don't see why such a specialized language couldn't have a good C or Fortran interface. Also, MPI even allows units written in different languages to talk via messages, AFAIK.

I even suppose that the orchestration layer can be in basically any convenient language, even Python, while the underlying compute modules may be written in highly optimized compiled languages; see TensorFlow and PyTorch for highly successful examples.


The staying power of Fortran is amazing.


Well it does keep evolving. It's also pretty close to perfect for writing fast numeric code without too much trouble.


I actually use MPI on a Raspberry Pi cluster, using the clusterHAT with 4 Raspberry Pi Zeros controlled by a Pi 3. The whole unit is about the size of a Big Mac.

Obviously it's not very performant, but the lessons learned on how to program with MPI in mind are where the value lies.


Because it works. Nobody wants to bet a big budget on creating something new from scratch, with the risk that it will not work.


Do you see use of coarrays, introduced in Fortran 2008?


It's kind of a good idea, but it has a few fatal flaws. The biggest is that they don't work well when you want a combination of GPU and CPU parallelism in the same codebase. The other is that most current Fortran compilers don't compile them well (see https://fortran-lang.discourse.group/t/coarrays-not-ready-fo...).


C++ and MPI in what concerns high energy physics.


Shame that reporting for "normal" earthlings is unlikely to include a few sample lines of code of this new language.

Draft paper: https://arxiv.org/pdf/2008.11256.pdf

Sample code (looks like a DSL embedded in Python): https://github.com/gilbo/atl/blob/master/examples/catenary.p...


One of the innovations in the paper is that they find a method for implementing reverse-mode autodiff in a stateless manner, making optimisations easier to verify.

I might be wrong but they take the principle of dual numbers: a + b epsilon, and change the b component into its linear dual. Therefore, b is a function, and not a number. If b is a function instead of a number, then there's more flexibility regarding flow of control, and that flexibility allows you to express reverse-mode autodiff.
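(For reference, the plain forward-mode "a + b epsilon" starting point looks roughly like this in Python; the paper's move of turning the b component into a linear map is exactly the part this sketch does not show.)

  class Dual:
      def __init__(self, val, eps=0.0):
          self.val, self.eps = val, eps

      def __add__(self, other):
          other = other if isinstance(other, Dual) else Dual(other)
          return Dual(self.val + other.val, self.eps + other.eps)
      __radd__ = __add__

      def __mul__(self, other):
          # product rule: (a + b e)(c + d e) = ac + (ad + bc) e
          other = other if isinstance(other, Dual) else Dual(other)
          return Dual(self.val * other.val,
                      self.val * other.eps + self.eps * other.val)
      __rmul__ = __mul__

  def f(x):
      return 3 * x * x + 2 * x

  print(f(Dual(4.0, 1.0)).eps)  # derivative of f at 4: 6*4 + 2 = 26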


a pure functional language is trivially differentiated, no? View a program as a data flow graph and then append the differentiation data flow graph. How this is executed (reverse-mode vs forward-mode vs some exotic mix) is just a leaked abstraction. Ideally, it wouldn't need to be exposed to the user.
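(A toy version of "append the differentiation data flow graph", reverse mode over a tiny expression graph in Python; purely pedagogical, nothing like a production system:)

  class Node:
      def __init__(self, value, parents=()):
          self.value = value
          self.parents = parents  # list of (parent_node, local_gradient)
          self.grad = 0.0

  def add(a, b):
      return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

  def mul(a, b):
      return Node(a.value * b.value, [(a, b.value), (b, a.value)])

  def backward(out):
      # Reverse topological order: each node's grad is complete before
      # it is propagated to its parents.
      order, seen = [], set()
      def visit(n):
          if id(n) not in seen:
              seen.add(id(n))
              for p, _ in n.parents:
                  visit(p)
              order.append(n)
      visit(out)
      out.grad = 1.0
      for node in reversed(order):
          for parent, local in node.parents:
              parent.grad += local * node.grad

  x = Node(3.0)
  y = mul(add(x, Node(2.0)), x)  # y = (x + 2) * x, dy/dx = 2x + 2
  backward(y)
  print(x.grad)  # -> 8.0 at x = 3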


My understanding is that trivial versions of anything are wonderful for pedagogy, but fall far short of what is needed in an implementation.

Just consider garbage collection. Trivial reference counting is easy to explain and implement. Not going to win too many points in use, though.


It's unfortunate that ATL is such a generic name. The actual code to which you meant to link is here:

https://github.com/ChezJrk/verified-scheduling

OTOH, a Coq-like DSL written in Python would be exciting!


I think it's a reference to APL, which is an old but extremely cool and relevant language and stands for "A Programming Language". APL was all about matrices and vectors too.


Yes, clearly an homage to APL.


Ironically, some (all?) of the same developers appear to be creating a Python DSL based on the work done on the Coq DSL.

https://github.com/ChezJrk/SYS_ATL


Ah, it seems that the repos mentioned in this thread are all related after all:

The Coq repo is covered by the paper, but it is a verified version of only the scheduler (a part of the parent Python repo, I presume).

Then the repo you linked to seems to be a rewrite/generalization of another(?) lower-level part of the main ATL.


Meanwhile, loads of compute hours get wasted because of accidentally overwriting output data in structures like /data/final/new_conf9/last_try2/JOBID.txt or due to wrong dependency setup.

IMHO the main issue of HPC today is not performance per se, but rather not embracing software best practices...


Ashamed to admit this happened to me once. The job output files did have unique jobid-based names, but I didn't account for the poorly-documented fact that multiple clusters shared a filesystem (and I was running jobs on a few) so multiple jobs ended up with the same jobid. Thankfully, I had some sentinel start/stop output that allowed me to automatically detect when multiple processes had overlapping output, and I was able to re-run fractional jobs and only wasted about 5% of my compute time.

On one hand, I hated that there wasn't some fancy framework (like, a database?) to hide the footguns behind... on the other hand, the learning curve was very low, and I'm happier about that on balance.
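(One low-tech mitigation is to bake more than the job ID into the output path and refuse to overwrite; a rough sketch, assuming Slurm's SLURM_JOB_ID environment variable and made-up paths:)

  import os, pathlib, socket, uuid

  jobid = os.environ.get("SLURM_JOB_ID", "nojob")
  # Hostname + job ID + random suffix stays unique even when several
  # clusters share a filesystem and job IDs collide.
  tag = f"{socket.gethostname()}-{jobid}-{uuid.uuid4().hex[:8]}"
  outdir = pathlib.Path("results") / tag
  outdir.mkdir(parents=True, exist_ok=False)  # fail loudly instead of overwriting
  (outdir / "output.txt").write_text("...\n")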


This happened to all of us in the early days... :) As a side note, a good framework does not necessarily mean increasing the learning curve; rather, it should protect users from themselves.


I think you are referring to workflow managers? Yes, rolling your own workflow manager is a bad practice, but many do it anyway.


Not only workflow managers... also:

- Usually no testing on a small test case / data set before going to the big one;

- Very old fashioned workload management systems with CLI-level job setup and automation prone to trivial errors, such as not handling escape sequences;

- Poor environment setup, little to no use of containerisation, and when it is in use, HPC-specific solutions (such as Singularity or Slurm containerisation support) tend to be palliatives rather than addressing the root issue;

- A generic "not invented here" syndrome plus a strong belief that standard software best practices cannot be adopted in the HPC space (for no reason).

This, more or less.


> Usually no testing on a small test case / data set before going to the big one;

I don't understand how "no testing on a small test case" would result in "loads of compute hours get wasted because of accidentally overwriting output data"?

> Very old fashioned workload management systems with CLI-level job setup and automation

I think this means they are rolling their own workflow manager? By workflow manager, I mean like those listed in https://docs.nersc.gov/jobs/workflow-tools/ and https://docs.nersc.gov/jobs/workflow/other_tools/

> ...standard software best practices cannot be adopted in the HPC space (for no reason)

I agree that "for no reason" is bad, but sometimes there is a reason. Just a random example: using Snakemake with Slurm has the following caveat at NERSC:

> Astute readers of the Snakemake docs will find that Snakemake has a cluster execution capability. However, this means that Snakemake will treat each rule as a separate job and submit many requests to Slurm. We don't recommend this for Snakemake users at NERSC.

> from https://docs.nersc.gov/jobs/workflow/snakemake/

It could depend on how large the HPC system is. The case mentioned above is NERSC, which I think always has a top-10 system at release time on top500.org. When dealing with such a state-of-the-art HPC system, some "standard software best practices" don't scale well enough for that many nodes.

In the particular example above, it is related to the Slurm scheduler: when there are many more nodes for it to schedule, some usage patterns from smaller HPC platforms can break it.

> Poor environment setup, little to no use of containerisation, and when it is in use, HPC-specific solutions (such as Singularity or Slurm containerisation support) tend to be palliatives rather than addressing the root issue;

But how would this lead to "loads of compute hours get wasted because of accidentally overwriting output data"?

While containers can solve some problems in setting up the environment, they are not the final answer. E.g., many HPC centers (if not all) do not use Docker itself but roll their own Singularity or Shifter solution because they cannot give you root access. Also, using a container does not automatically give you optimized binaries, meaning anyone who wants to run HPC applications at peak performance (say `-march=native -mtune=native`, or even the -O flags) will need to build their own container. (Compiling specifically for the native arch is very important in HPC. E.g., for NERSC Cori II using Intel Xeon Phi KNL, AVX-512 is essential for performance, and in this specific case we really need the Intel compilers for peak performance, as KNL is odd and the GNU compilers can't optimize for it as well, given they don't have much incentive to optimize for such a niche product, unlike Intel, which has to deliver that to sell the product.)

And if we now agree that we need to compile specifically for the HPC system, then containers are not that attractive compared to just compiling in the HPC environment (with the Cray compiler wrappers, for example). They do have other advantages related to IO, as shown in https://docs.nersc.gov/development/shifter/images/shifter-pe... from https://docs.nersc.gov/development/shifter/

Going back to another statement in the parent thread,

> IMHO the main issue of HPC today is not performance per se, but rather not embracing software best practices...

The main issue of HPC really is performance. Again, it might depend on how big the HPC system is (in terms of top500.org ranking), but for really high performance computing, which is always at the cutting edge, they are really testing new technologies that bring us to the next level (say, exascale at the moment). Think about the tech used in upcoming exascale or near-exascale machines: some use AMD CPUs + NVIDIA GPUs, some AMD CPUs + AMD GPUs, some Intel CPUs + Intel GPUs, and some pure ARM CPUs... (and the previous generation's GPU-like CPU, such as Intel Xeon Phi). They are all quite exotic (compared to previous ones), and it is very difficult to squeeze out peak performance. One simple metric is to compare the theoretical peak performance to that realized in https://www.top500.org/lists/top500/2021/11/ ; it is very difficult for these large systems to achieve high realization, especially as the number of nodes increases at this speed (the communication bottleneck becomes more apparent).

(Just to mention, that is a simple metric that no longer tells the whole story: it measures only FLOPS, whereas nowadays IO can be a bottleneck in "big data" applications, and there are less well known metrics targeting that.)

E.g., compared with the current-generation NERSC system, Cori, some systems with lower theoretical peak rank higher, essentially because of the headroom from Intel KNL (everyone who uses that system hates Intel KNL... even Intel themselves hated Xeon Phi so much that they gave it up, and their replacement solution is, well, just copying NVIDIA and giving you a GPU).

The next issue, or the same issue in disguise, is energy. It is well known that we could have built exascale machines a few years back. What prevented it is the amount of energy (or power, as in MW) they need. The main thing we need to solve to get to exascale is basically FLOPS per watt, whereas in the old days it was maybe more FLOPS per dollar. (There's a rule of thumb that a machine drawing a megawatt needs a million dollars per year in energy costs.)

Frankly, I think many users of HPC continue to have bad software practices and continue to be successful. Bad software practices are going to bite them when, as you say, (generated) data are overwritten, which basically just wastes compute hours. (It is rare to lose original data, but I think there was recent news that a lot of actual data was lost at an HPC center / university.) Or bad software practice results in a lot of manual labor babysitting jobs. But in today's science world, that's entirely irrelevant. While I was speaking with scientific computing in mind, it is probably also true in, say, financial computing, etc. Well, people just care about the outcome, and who cares how messy you get there.


This sounds like Halide [1], which I think is also from the same group at MIT. However, Halide is an embedded DSL in C++. It sounds like ATL, being based on Coq, would be more likely to always produce correct code.

[1] https://halide-lang.org/


Indeed. Notice the last author.

Halide has had transparent autodiff for some time now. I am guessing this is follow-on work in the usual academic style. (Not trivializing it at all.)

I've been watching Halide for a while as I really like it, intuitively. Also it wasn't that hard to bring up on FreeBSD with a few patches, where it performs more or less exactly as on Linux when configured cpu only. That's solid generic programming, to say the least.

However... I have no HPC experience from recent decades, but I would be surprised if dense tensor operations were a significant component of current workloads. Could be though!


Is "ATL" "A Tensor Language" a homage to "APL" "A Programming Language" from Iverson?


Of course ATL is a homage to Ken Iverson's APL. His language treated multidimensional matrices up to 64 dimensions as single variables to be manipulated in single steps, thus creating a language that performed many steps in a short space and time. He was a farm-bred Canadian, first in his town to ever go to college, whose thinking was brilliant!


Yep!


I would say that ATL is a domain-specific language (DSL). To be more accurate, I would refer to it as a horizontal DSL (as opposed to typical DSLs, which are usually focused on some vertical subject domain or industry). Another thought is that, if I were writing HPC-focused software today (and, especially, tomorrow), I would definitely much prefer Julia to ATL or similar niche solutions, thanks to (likely) little if any loss of performance and an incomparably wider spectrum of potential applications and covered domains. Plus the additional benefit of avoiding the two-language problem.


If you are interested in building programs with proofs of correctness, one of the authors of A Tensor Language (ATL), Adam Chlipala, has a book called "Certified Programming with Dependent Types" that can be read at http://adam.chlipala.net/cpdt/


Hmm, I would've expected them to double down on differentiation within Julia (via Zygote etc)


It reads to me like ATL is actually more of a tensor compiler than a general-purpose language. In fact, I'd actually be curious if we could lower (subsets of) Julia to ATL, let it optimize the tensor expressions and transpile everything back into Julia/LLVM IR for further optimizations.

In Julia, there is already Tullio.jl [1], which is basically a front end for tensor expressions that can target both CPUs and GPUs with AD support and automatically parallelizes the computation. It doesn't really optimize the generated code much right now though, so something like this could be interesting.

[1] https://github.com/mcabbott/Tullio.jl


MIT is a big place, there's lots of people there who don't use julia, and I don't think there's any structures in place to stop MIT people from making new languages (nor should there be).


> I don't think there's any structures in place to stop MIT people from making new languages (nor should there be).

And just a reminder that this is one of the very few cases where it is OK to create a new language.


Yeah, based on the title and where it's from I half expected it to be some kind of DSL based on Julia. Instead it's based on Coq which gives them formal verification capability.


Basing it on Julia would make sense for certain reasons, basing it on Coq would make sense for different reasons but basing it on Python, as appears to be the case from the only code sample I've seen, doesn't make a lot of sense to me.


According to the paper it is based on Coq. Perhaps there's a Python interface to it, which kind of makes sense: a lot more people have experience with Python than with Coq, and giving them some kind of bridge so they can use Python as a frontend would help with adoption.


I think formal verification makes sense for programs like optimising compilers. You have a reasonable hypothesis to prove: That the optimisations don't change the behaviour of the program in any way, except for speed.


It's not "in any way" but "in any way we care about" - people love optimizations like -ffast-math that are not well founded.


There is also this:

https://anydsl.github.io/

They have a framework that achieves high performance from high-level code!


In mathematics, there are two ways to deal with tensors:

1. You come up with notation for tensor spaces, and tensor product operations, and a way to specify the indexes of pairs of dimensions you would like to contract, either using algebra or many-argument linear maps.

2. You write down a symbol for the tensors and one suffix for each of the indices. For an arbitrary rank tensor you write something like $T_{ij\cdots k}$. You write tensor operations using the summation convention: every index variable must appear exactly once in a term (indicating that it is free) or twice (indicating that it should be contracted – that is you should take the sum of the values for each possible value of the index). Special tensors are used to express special things.

So for the product of two matrices C = AB, you could write:

  c_{ik} = a_{ij} b_{jk}
Or a matrix times a vector:

  v_i = a_{ij} u_j
Or the trace of A:

  t = a_{ij} \delta_{ij}
(delta is the name of the identity matrix, aka the Kronecker delta). For the determinant of a matrix you could write:

  d = a_{1i} a_{2j} \cdots a_{(n)k} \epsilon_{ij\cdots k}
(Epsilon is the antisymmetric tensor, a rank-n, n-dimensional tensor where the element at i, j, …, k is 1 if those indices are an even permutation of 1, 2, …, n, -1 if they are an odd permutation, and 0 otherwise).

The great advantage is that you don’t spend lots of time faffing around with bounds for the indices or their positions in the tensor product. But to some extent this is only possible because the dimension (ie the bounds) tends to have a physical or geometric meaning. This is also why when the notation is used in physics you also have subscripts and superscripts and you must contract one subscript index with a corresponding superscript index.

I wonder if it is possible to achieve a similarly concise convenient notation for expressing these kinds of vector computations.

You can imagine abusing the notation a bit but leaving most loops and ranges implicit, e.g.

  let buf[i] = f(x[i]);
  let buf' = buf or 0;
  let out[i] = buf'[i-1] + buf[i];
  out
(With this having a declarative meaning and not specifying how iteration should be done).

A convolution with 5x5 kernel could be something like:

  /* in library */
  indicator(p) = if p then 1 else 0
  delta[[is]][[js]] = indicator(is == js)
  windows(radii)[[is]][[js]][[ks : (-radii..radii)]] = delta[[is]][[js + ks]]
  
  /* in user code */
  out[i,j] = in[p,q] * windows([2, 2])[p,q,i,j,k,l] * weight[k,l]
  /* implicit sum over all k, l, p, q, but only loops over k, l as p, q values are implied */
Where I’ve invented the syntax [[is]] to refer to multiple indices at once.

I don’t know if it would be possible to sufficiently generalise it and avoid having to specify the size of everything everywhere, or whether a computer could recognise enough patterns to convert it to efficient computation.
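(For the flat cases above, NumPy's einsum already implements the summation convention; a rough sketch, though nothing like the extended [[is]] notation proposed here:)

  import numpy as np

  a = np.random.rand(4, 4)
  b = np.random.rand(4, 4)
  u = np.random.rand(4)

  c = np.einsum('ij,jk->ik', a, b)         # c_{ik} = a_{ij} b_{jk}
  v = np.einsum('ij,j->i', a, u)           # v_i   = a_{ij} u_j
  t = np.einsum('ij,ij->', a, np.eye(4))   # t = a_{ij} delta_{ij}, i.e. the trace

  assert np.allclose(c, a @ b)
  assert np.allclose(v, a @ u)
  assert np.isclose(t, np.trace(a))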


Basically feasible, my previous company wrote a compiler that did that. There are other efforts I'm less familiar with but here's a link for ours:

https://arxiv.org/abs/1903.06498


It's nice to see this.

The idea that speed, safety and usability are mutually exclusive has always been wrong. Trivially so, IMO... but clearly not, given how long the myth has persisted.


QBasic had tensor type support circa the 1980s.


If this is a joke, I don’t get it.


Read the description for REDMAT.BAS. Apparently this started with Microsoft Professional Basic 7.0, which appeared in 1989. So technically, "the 1980s," but only the last one.

https://www.cambridge.org/core/books/abs/crystal-field-handb...


It had arrays of any n-arity. Come to think of it, C++ does too.


Since when did we start calling multidimensional arrays "tensors"? I see it's shorter and sexier, but tensors, as I remember them from my general relativity course, are something different.


Is this in direct competition with JAX from Google?


This is not comparable with ML frameworks (PyTorch or JAX); its closest analogue is Halide (which they mention in the paper).


I would say it is somewhat comparable with XLA, which is the compiler backend for JAX (and was heavily inspired by Halide).


I'd be curious to read up on the parts of XLA design that were inspired by Halide if you have any links. I haven't familiarized myself with XLA/MLIR source code much, but I was under the impression it took a more conventional approach to optimizing compilers (pattern matching rather than generic separation of schedule and compute).


>speed and correctness do not have to compete ... they can go together, hand-in-hand

Is Rust considered slow these days outside of compilation speed?


For the type of computation described in this article, Rust is considered slow. I recently stumbled across a site that tracks its progress in changing that: https://www.arewelearningyet.com


Have we duplicated X yet?

I wonder how this kind of thing will fare 100 years in the future, when X is drawn from a set of functionalities that is much, much larger. We would essentially get stuck on a single language.


Not fast in this sense, but definitely not slow.


Not fast in which sense? HPC? Or single/multithreaded processing on one machine?


In the sense that, where Rust can do some amazing program transformations because it's quite demanding about memory safety, ATL purports that if the math is all tensors, then there's potential for program transformations based on automated reasoning about tensor formulae. Which no one needs for FizzBuzz, but it sounds like something worth exploring for HPC.


Back in the day the compile times were legendarily slow, and that reputation has stuck around, but it is no longer true today.


Rust code is not slow to compile today? I'm currently teaching myself some Rust; I have a project with some dependencies + ~1500 lines of code in my directory, and my debug build with cargo takes 20 seconds every time I change just a line. Specs: Intel i7-7600U (4) @ 2.800GHz, Samsung NVMe SSD SM961/PM961, 16GiB.

Considering the small size of the project, I wouldn't consider that fast, and in fact it makes me very turned off by the prospect of the project growing any bigger.


That sounds quite bad to me. It can happen if you have a virus scanner going crazy on your build target during the build (a common complaint on Windows; you might have to add your Rust projects to an exceptions list). It can also happen if you have really heavy proc macros or other compile-time magic going on that needs to reprocess and reexecute often (serde famously can crank up compile times).

I doubt your compile times would really ramp up that much from your project itself getting much bigger. If you are at 1.5k lines of code, the ridiculous compile time is probably due to other causes.


These horrible compile times happen on both Windows and Linux. I am using Serde, but only on two out of 15 structs in total. I haven't really tried to investigate yet whether these compile times are just because of my project, but I am using some "standard" crates (like tokio and serde) that the community seems to use for most user-facing software, so hopefully they won't "infect" the compile times too much.

Thanks for the pointers though!


No, it's quite fast.


[flagged]


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html

Edit: you've unfortunately been posting a lot of unsubstantive and/or flamebait comments. We ban accounts that do that. I don't want to ban you again, so could you please review the site guidelines and fix this? We'd be grateful.


ok, sorry, my bad



