In case anyone is wondering why this project puts so much importance on things being inlineable: GPUs often don't support function calls, or have an extremely shallow return stack.
In this way it reminds me of http://futhark-lang.org/ somewhat, though Futhark is explicitly targeted at producing extremely efficient GPU code.
As also described by the fourth to sixth paragraphs in the readme:
Abstraction by heap allocation is a dead end. It works moderately well on the current generation of computers where CPU is still the dominant driver of computation.
It cannot work for devices like GPUs and the rest coming down the line. Many of the most important computational devices of the future won't support heap allocation, so an alternative is needed to draw out their full power. It is of absolute importance that a language for that task have excellent control over inlining. Inlining therefore must come as a guarantee in the language and be a part of the type system.
Inlining is a trade-off that expresses the exchange of memory for computation. It should be the default instead of heap allocating.
I guess whoever is wondering about the why hasn't read the introduction.
There are multiple perspectives on this. Without heap allocation during kernel execution, GPUs might not even become the dominant driver of computation in the first place.
If you haven't clicked through to read the above commit message, please do. It is amazing. This is a glimpse into the mind of a mad scientist (and I respect and love mad scientists more than anything, and aspire to be one).
> It is a tragedy, but I cannot imagine a language better than Spiral. Not within real world constraints facing me. A language better than it would pretty much have to be magic.
> So, in this respect, my programming skill has reached its limit and the most I will be able to do is refine it.
> And it is embarrassing to me because I want to do better. I am ashamed, but the one to come will have to follow up on what I could not manage with my meager human abilities.
This reminds me of the fictional inspirational process of dwarves in the Drizzt Do'Urden series of books [0] by R. A. Salvatore [1]. In that setting a dwarf craftsman gets only one moment of exceptional divine inspiration in his life to forge a legendary artifact. It is both a blessing and curse for them, because they realize that they will never be able to surpass that work for the remainder of their long lives.
I wonder if other people here have had that moment in their life as well?
No, he uses commit messages as a log of his thoughts, and commits need to have changes, thus each "log only" commit deletes or adds the same whitespace in a never-ending cycle.
In between you find some actual code changes (also with long commit messages).
It's not that he's finished - I hardly think so, since he complains that he'd need 10 years to polish everything out perfectly (in his opinion) and he only has one - it's just his style of keeping a logbook...
I can't actually program all the time. Sometimes I need to study machine learning or think about design. During those times I use commits much like journal entries.
I actually keep a separate journal and sometimes paste from it. I've been using LibreOffice Writer for it and decided to drop it recently because it would take so long to save a file - like 5s or more. Every 3 months or so I'd fill about 1000 pages, and it was forcing me to move to a new file every time that happened. Now I just use VS Code for this sort of thing. Raw text is the best after all, but it is too bad I can't paste images into the journal anymore.
If you do wish to use commits to simply journal your thoughts, and would like to do so without this requiring code or file changes, you may find the --allow-empty option useful when committing.
This allows you to create a new commit, even if no files have changed.
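For example (the message is whatever you like):

    git commit --allow-empty -m "2018-05-20: studied RL all day, no code to show for it"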
Or, write your journal entries to a file and commit that? That's what I do. Most of my personal projects have a "notes.txt" at the top level for this purpose. Sometimes I just need to write freeform about the problem I'm trying to solve, from different angles, to work out my strategy. Sometimes I just write the despair that I'll never figure something out.
You could set up an external tool that did something like the following:
Title: Commit with Selected Text as Message
Command: path\to\git.exe
Initial Directory: $(SolutionDir)
Arguments: commit --allow-empty -m "$(CurText)"
This will allow you to create an empty commit, with the selected text becoming the commit message. (Just bear in mind if you have any files staged, they will be committed as well)
I unfortunately do not have Visual Studio currently installed, so please take the above as a rough sketch, and my apologies if it doesn't work as is. But, I hope it is of use to you.
Ah, yes, I remember wanting to be able to commit the identity transformation (_i.e._ no change) at some point. I don't remember why. It might have been a matter of principle.
PS: I just learned about the `--allow-empty` flag. [0] I'm happy again.
I’m quite happy to see random projections getting some love, but I hope more people start using Choromanski et al.’s 2016 Structured Orthogonal Random Features, which has provably higher accuracy while reducing runtime to linearithmic and memory to linear (or constant) from quadratic for each. I’ve verified this experimentally in my implementation here [0]. As a shameless plug, it’s quite fast, is written in C++, and comes with Python bindings for both kernel projections and orthogonal JL transforms.
I am using random projections in the latest test because I am testing an algorithm that iteratively calculates the inverse Cholesky factor of the covariance matrix, and am testing it on MNIST images. The covariance matrices made from raw MNIST images are non-invertible, but projecting them to a much smaller dimension allows me to actually test the algorithm on non-synthetic data.
I do not actually need more than I have, but I'll keep your link in mind if I ever need random projections.
That makes sense. Thank you for explaining! I hadn’t fully pieced it together from the commit.
For dimensionality reduction, if one were to use my library, the way to go would be the orthogonal JL transform. The key to all these methods is that multiplication by a random diagonal matrix followed by a fast Hadamard (or Fourier) transform is equivalent to multiplication by a dense random matrix, which is what lets it provide the effect of random matrix multiplication without ever instantiating or storing the full matrix.
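To sketch the core trick (a toy Julia version purely for illustration - my library is C++, and every name here is made up):

    # Subsampled randomized Hadamard transform: behaves like multiplying x by a
    # k x n random matrix, but runs in O(n log n) time and O(n) extra memory.
    function fwht!(x::Vector{Float64})   # in-place fast Walsh-Hadamard transform
        n, h = length(x), 1
        while h < n
            for i in 1:2h:n, j in i:i+h-1
                a, b = x[j], x[j+h]
                x[j], x[j+h] = a + b, a - b
            end
            h *= 2
        end
        return x
    end

    function srht(x::Vector{Float64}, k::Int)
        n = length(x)
        @assert ispow2(n)             # pad with zeros otherwise
        d = rand([-1.0, 1.0], n)      # the random diagonal matrix D
        y = fwht!(d .* x) ./ sqrt(n)  # H*D*x without ever forming H
        idx = rand(1:n, k)            # subsample k coordinates (with replacement, for brevity)
        return sqrt(n / k) .* y[idx]  # rescaled so norms are preserved in expectation
    end

The production versions stack several of these blocks and subsample more carefully, but this is the heart of why the full matrix never needs to be instantiated.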
As an aside, I admire your work and think it’s both very exciting and highly valuable.
I think it's worth pointing out that the Spiral compiler seems to be written in F#, which is itself a terrific functional programming language. (And it looks like the Spiral compiler actually emits F#, which is then compiled?)
The author states that Spiral "was made for the sake of making a deep learning library which was too difficult to do in F# itself." I wonder what he thinks about DeepML (http://www.deepml.net/), which is a deep learning library written in F#.
Before I started work on Spiral I spent way too long - about a year - working on an ML library before throwing in the towel. I think I skimmed through the source of DeepML over a year ago and my impression was that the author was struggling against the limitations of F#'s type system. There were a bunch of places where he was doing the equivalent of telling the compiler to fuck itself by downcasting to `System.Object`.
The way these projects go is that you start off with an AD library, and then when the type system and the language start getting in the way you either do a lot - and I mean a LOT - of engineering with inferior tools, or you make something better.
The 'make something better' option is always picked and can take various forms. Generally, people would not decide to work on a language for an ML library for almost a year before writing down its first line. They instead take the middle road of starting with an AD library and gradually extending it.
First they realize that they cannot really express tensors properly within the confines of the language, so they make everything symbolic; then they build engines to run such symbolic ASTs; then they realize that they need JITs and other optimizers to make everything run fast. Tensorflow and PyTorch are currently at this stage. There are more stages after this.
Of course, the task of working with ASTs is what a compiler does. I think it is a great pity that Tensorflow and PyTorch are written in C++ under the hood. C++ is probably the worst language I can imagine for working on compilers, so I applaud the authors of DeepML for picking F# instead. Had the TF team picked a statically typed functional language for this they could have cut the size of TF by over 10x and saved themselves over a million lines of code.
Thanks for the response. I always thought that the point of "making everything symbolic" is that you can differentiate a symbolic expression and thus implement backpropagation automatically. No? How do you handle this in Spiral?
Yes, it is, but there exists an alternative approach called automatic differentiation which is more imperative and flexible in nature than symbolic differentiation.
The idea is to keep a tape (you can think of it as a list) and then record all the operations on it as you step forward through the program. Then at the end you execute the operations backwards from the tape.
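A toy version of the idea, sketched in Julia since it reads close to pseudocode (this is just an illustration - every name in it is made up):

    # Reverse-mode AD with a tape: record operations going forward,
    # then execute them backwards to accumulate gradients.
    mutable struct Node
        value::Float64
        grad::Float64
        backward::Function   # pushes this node's gradient back to its inputs
    end

    const TAPE = Node[]      # the tape: operations in the order they ran

    variable(v) = Node(v, 0.0, () -> nothing)

    function Base.:*(a::Node, b::Node)
        out = Node(a.value * b.value, 0.0, () -> nothing)
        out.backward = () -> (a.grad += b.value * out.grad;
                              b.grad += a.value * out.grad)
        push!(TAPE, out)
        return out
    end

    function Base.:+(a::Node, b::Node)
        out = Node(a.value + b.value, 0.0, () -> nothing)
        out.backward = () -> (a.grad += out.grad; b.grad += out.grad)
        push!(TAPE, out)
        return out
    end

    function backprop!(result::Node)
        result.grad = 1.0
        foreach(n -> n.backward(), reverse(TAPE))   # the backwards pass
    end

    x, y = variable(2.0), variable(3.0)
    backprop!(x * y + x)
    (x.grad, y.grad)   # (4.0, 2.0), i.e. d/dx = y + 1, d/dy = x

The real thing records tensors and operations on them rather than scalars, but the mechanism is the same.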
As someone not particularly familiar with F#, could someone go into what the author means by "first-class types"? Do they mean that data types themselves can be treated as values (first-class citizens)?
Indeed. But F# cannot actually do this, and neither can even dependently typed languages to the degree Spiral can. In Spiral types act much like values, meaning you can bind them to variables and reflect on them without cost, just like values - though if you tried something like `(type 1) + 2` you'd just get an error during the last phase of compilation. It is rather useful, and without it type inference would be impossible in Spiral.
Probably the most mind-bending idea I've had the pleasure of being exposed to via wayward link clicking in quite some time. I didn't burn through the entire thing but I went pretty far, and the syntax primitives and the inlining-by-default type guarantees seem like a natural extension of language design in the history of programming languages.
My apologies for getting a bit philosophical here at the end, but this kind of treatment is among my favorite things to encounter, having had the privilege of being born at the cusp of the information age. During the first 50 years of the internet we get to casually encounter foundational new ways of expressing software, with ever more guarantees and expressiveness from the languages themselves, and reap the benefits of the plumbing and infrastructure that inevitably arises from such. Rust's compile-time memory-tracking borrow system comes to mind as another such set of compiler-based guarantees that we have now that we just didn't have when I was a kid.
tldr: Exciting times to be alive and clicking links on the internet.
That caught me by surprise. Especially when it has its entry-center class to indicate “this is the actively focused message”, not loading the page at that point, or at least not having an ID to jump to, seems surprising.
(A pure CSS solution for centering it on the page with an anchor: make the .entry-center `position: relative`, and give it a new child <div id=jump> with CSS `position: absolute; top: calc(50% - 50vh)`. The downside is that if the entire .entry-center is taller than the viewport, it will still be centered on the page, rather than sitting at the top. I suspect that defect could be solved if you used two nodes instead of just one, combining margin and padding, since padding can’t be negative.)
It is fascinating how much a single person can achieve. Since the author seems to be aware of this thread, Marko, do you mind telling a little bit about your background? This project does not seem like something that was done on the side, and also not something that a non-expert in programming languages just decides to have a go at.
I wish there was a language that had the syntax of ReasonML built on top of F#, because the ecosystem of F# is much better.
It's very easy to get lost in syntax when transferring from C-like languages like JavaScript to F#; some concessions (like Reason has) to semicolons and braces might help those programmers (myself included) to feel at home.
This. I really like functional programming, and FP proponents keep telling me syntax isn't a big deal, but I'm constantly stubbing my toe on it. I guess my deferred-gratification mechanism isn't that strong, because I get bored quickly of battling syntax when I could do something interesting in an imperative language now.
I think it's more interesting that they're in the type system. C has had inline pragmas for decades now, and one might also consider memory allocation semantics to be optimization semantics. Or more generally, one might consider any low-level semantics to be optimization semantics. The neat thing about Spiral isn't that it has 'optimization semantics', but rather that they integrate neatly into the type system.
“Inline”, though, is a request, like a hint. It doesn’t _enforce_ it. And this is my problem with most optimisations: you can’t guarantee they’ve happened. Putting them in the type system is definitely the right way of doing this.
Frankly, to me the specifics of the syntax are not nearly as important as what the language does.
This:
> Abstraction by heap allocation is a dead end. It works moderately well on the current generation of computers where CPU is still the dominant driver of computation.
> It cannot work for devices like GPUs and the rest coming down the line.
tells me way more about the language (and gets me much more interested) than a "hello world" would.
I think syntax is by far the easiest part of a new language to learn, so it's ultimately not that important.
EDIT: Regardless, it turns out it actually gives code samples like a page or two down.
I just want to mention that code samples are about more than syntax. E.g. a code sample of Elixir might spawn concurrent processes, or a code sample of Idris might show type-safe vector concatenation.
I am not a developer beyond hobby scripts and such, and I find these threads and comments like this extremely interesting. Other people's domains are frequently fascinating.
I don't trust that it actually achieves those goals without seeing code. Show an illustrative example of what you are trying to do and how it is accomplished, then wax philosophic.
It does even better, it has a table of contents right at the top, so you can skip directly to whatever you want. "Intro" and "Design Philosophy" sound like manifestos, so pass over those and click on "Tutorials."
I actually prefer to read the philosophy of a language, before diving deeper into the language itself. It gives you an idea of the author's motivations for creating the language. I feel like a code sample would make it too easy to judge its syntax, and not tell you enough/much about its semantics.
The 'sample' thing is a result of the era, where people wanted to streamline language diffusion. I think it became a bit too much of a reflex for people to expect the same first piece of information about a language.
I too need more than a sample and prefer a larger explanation.
I don't think it aligns with the philosophy of the language either, to essentially chunk ideas into large discrete blocks of particular categorical intents. It seems like the intent is to distribute programming concepts on a finer granularity than most languages tend to retain flexibility for, as structure builds.
Yes I agree that the language semantics and the philosophical motivation are more important. But there is a reason academic papers have abstracts. The core ideas should be summarized crisply at the beginning of a technical document. A few sentences and a code sample are the equivalent of an abstract for programming languages.
I do not think there is a language with capabilities similar to Spiral. If it existed it would have saved me a lot of time. The closest you could get would be [Scala LMS](https://scala-lms.github.io/) and the work done on staging such as in the [context of Ocaml](http://ocamllabs.io/iocamljs/staging2018.html).
Personally I think macro-style staging in statically typed languages is an atrocity and want nothing to do with it. I kind of like the type driven version as is done in LMS, but compared to Spiral the LMS is less expressive.
A lot of work is being done on macros/partial evaluation/staging in Scala and I assume it will continue to be done as the authors are aware of the need for this in a language - without it a functional language would be dog slow, but my impression is that they are having significant difficulties making it fit with Scala's much more complex type system. One consequence of that is that they seemingly rewrite their macro system every few years. This time they will surely get it right despite being at it for decade(s).
The reason why what Spiral has works so well is because its type system is so simple, so staging can be completely interwoven with how it does type inference.
My view, based on the last 50 years of language design, is that staging is really difficult to bolt onto an already established design. It tends to degenerate into having two languages - the core language and a scripting language on top of it. C++'s template system is the worst example of this. Another example, which actually embraces this dichotomy, would be [Terra](http://terralang.org/), which could be described as bolting Lua on top of C. I rather dislike this sort of design.
A language needs to be designed around staging from the ground up and it needs to be accepted that in the absence of magic, the programmer will have to mold his style to fit with the new paradigm.
If a language is to have staging it should be everywhere and at all times much like regular type inference is in F# and other ML styled languages.
I have a suspicion that you might like digging into a language called Felix. Sadly it has sort of disappeared from the internet because felix-lang.org is no longer up. The documentation now exists in a poorer form on readthedocs. The development is alive and kicking on GitHub though.
You can think of it as a codegen for C++ along the lines of what Scala is to Java. The author was involved in Barry Jay's language called Fish, which would also be of interest to you.
Which you have to scroll past pages of manifesto to find, and all the examples seem to be basic illustrations of syntax. Where are the examples showing how this approach to a language actually results in superior code bases or more efficient programs or whatever?
This is a one-person project. This clearly took a lot of time and effort. You can afford to look at the TOC and jump directly to the various examples.
When I looked at Julia last time I do not think it had the capability to partially apply tensors. In Spiral, indexing into a 3d tensor gives you back a 2d tensor. It might be possible to do this with a function in Julia, but that would depend on its inlining capabilities, and unlike Spiral it does not provide guarantees with regards to that.
Its type system is definitely less powerful than Spiral's - some of the things you could express naturally in Spiral would require macros in Julia.
All in all, the language strikes me as a better Matlab which I think it succeeds at, but Spiral is made to be a better F#. Some of the tradeoffs I had to make in language design means that I could not succeed at this across the board.
The fact that it has a more powerful type system makes it take an immediate hit in terms of ergonomics, and the language is less pleasant to code in than F# because the IDE is not providing immediate and constant feedback.
On the other hand, for systems programming I can't imagine what C (and even C++) could possibly do better than it except maybe compile times. Despite that, Spiral is also nothing like Rust - it does not try to check that pointers are safe. It is rather an extremely expressive static language that has no type annotations anywhere and looks like a dynamically typed functional language.
> When I looked at Julia last time I do not think it had the capability to partially apply tensors. In Spiral, indexing into a 3d tensor gives you back a 2d tensor. It might be possible to do this with a function in Julia, but that would depend on its inlining capabilities, and unlike Spiral it does not provide guarantees with regards to that.
That's exactly how slicing a tensor works in Julia:
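For instance:

    julia> A = rand(2, 3, 4);

    julia> size(A[:, 2, :])   # indexing one dimension with a scalar drops it
    (2, 4)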
Inlining seems irrelevant since the slicing is done eagerly. Unless you're talking about lazy slicing, in which case you could do `f(T) = T[:,2,:]` and your inlining/function comment makes more sense. If you use StaticArrays.jl (the closest thing in Julia to Spiral's tensors) then the machine code for this operation on a 2×3×4 Float64 tensor is just:
> Its type system is definitely less powerful than Spiral's - some of the things you could express naturally in Spiral would require macros in Julia.
The macro comment is a bit perplexing since macros are about syntax and are about as unrelated to the type system as it's possible to get. I'm unaware of any way that the presence or absence of macros in a language can affect the expressiveness of its type system.
I'd be very curious to hear what can be expressed in Spiral's type system that cannot be in Julia's. Julia's type system is generally more expressive than state-of-the-art dependent static type systems since it punts on type checking entirely. This allows it to have, among other things, parametric types where the parameters can be other types, integers, floats, symbols and tuples of same (recursively). Parametric types even support fancy features like using an earlier type parameter to constrain later type parameters, e.g. `X{T<:Number, S<:Vector{<:T}}` — i.e. `T` is a possibly abstract number type and `S` is a vector whose element type is a subtype of `T`.
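Spelled out (a throwaway example):

    # the later parameter S is constrained by the earlier parameter T
    struct X{T<:Number, S<:AbstractVector{T}}
        scale::T
        data::S
    end

    X(0.5, [1.0, 2.0])   # X{Float64, Vector{Float64}}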
The only major type system feature I'm aware of Julia lacking (aside from the obvious omission of static type checking) is intersection types. There's been some thought of adding those in order to support multiple inheritance among other things, but it hasn't proven necessary yet.
I recall an example where Julia had to use macros in order to achieve loop unrolling. I'll dig it up if you want. This is something that can be done quite naturally in Spiral due to its type system.
I find that people often say macros are about syntax, but what they really are is some combination of a parser and a partial evaluator - in other words, their main use seems to be doing compile-time function evaluation.
Spiral does not have arbitrary CTFE, but it can perform any functionally pure, non-side-effecting computation in its type system, plus memoization of function calls.
One thing of practical interest that makes me doubt Julia's inlining capabilities would be monads.
"Monads.jl provides a powerful, if relatively slow"
The key word here is slow. I haven't seen a single language as of yet apart from Spiral that can inline all their overheads away.
They are pervasive in Haskell, and Haskell cannot do it. Scala wants to do it, but it can't. F# and Ocaml cannot optimize their overheads away. If Julia could then it would advertise it. This is something that is subject of academic research in the field of partial evaluation.
Speaking as a functional programmer, as a language Julia feels barely functional to me. It has first class functions, but it does not have pervasive partial application and syntax that makes such a style comfortable like functional languages tend to have. It feels more like an imperative language.
And speaking as a systems programmer, I know that without inlining guarantees you cannot have optimized functional constructs. It is not just monads. By tracking all the functions and their bodies exactly and forgoing all heap allocation unless explicitly stated, language interop becomes much easier. It is possible to make functions that capture variables from lexical scope and then pass them as arguments to a function that passes them through to the GPU. Julia talks about speed, but barely talks about inlining, so I am dubious about its claims.
I could go on, but I think this is a large enough sample of my views on Julia. If Julia wants me to use it, then it needs to speak more to my values as a programmer.
I agree that while Julia has a lot of functional programming elements, it's not strictly a functional programming language. It also is built for things where heap-allocated arrays are necessary: you're not going to be doing scientific computing with 100GB stack-allocated arrays. Having the base arrays be heap-allocated and then having StaticArrays for stack-allocated arrays works well, though that means you do want to use imperative styles a lot when managing the large array operations to cut down on allocations. Spiral and Julia are just very different and cater to different paradigms and applications.
>There is no need to consider how tensors are made in a language made for numeric computation like Julia.
Saying that and then proceeding to compare to PyTorch instead of julia is incredibly misleading and unfair to julia. I fail to see how anything you've brought up here excuses that.
I might be persuaded to see things from your point of view if you can show that Julia's tensors meet my criteria. I might be wrong about this - I do not know much about Julia and it might very well be capable of this. But these are the kinds of things that if it were capable of, it would advertise. If it can do it I'll apologize and change the line to show Julia in a more positive light.
The reason why I started talking about inlining of monads here is because ultimately they are just functions, and Spiral's tensors are not part of the language, but just functions as well. So if it cannot do monads then it probably cannot do tensors to satisfaction either.
It was pointed out to me that tensor slicing could serve as a substitute for partial application. What I am wondering next is the degree of flexibility Julia's tensors have in the context of GPU kernels. Let me illustrate this with a short example. The following is how the in-place map CUDA kernel is made in Spiral. It seems simple, but I'll go at it from the top to show the characteristics and flexibility that Spiral's tensors have.
met map' w f in out =
inl in, out = zip in, zip out
assert (in.dim = out.dim) "The input and output dimensions must be equal."
inl in = flatten in |> to_dev_tensor
inl out = flatten out |> to_dev_tensor
inl in_a :: () = in.dim
inl blockDim = 128
inl gridDim = min 64 (divup (s in_a) blockDim)
w.run {
blockDim gridDim
kernel = cuda // Lexical scoping rocks.
grid_for {blockDim gridDim} .x in_a {body=inl {i} ->
inl out = out i
inl in = in i
out .set (f in.get out.get)
}
}
The first line of the body is `inl in, out = zip in, zip out`. Since the inputs to the map function can be arbitrary data structures, the `zip` iterates over them and zips the tensors within into a single one.
In Spiral the tensors have a tuple-of-arrays layout by default, and it is absolutely trivial to switch between AOT (array-of-tuples) and TOA (tuple-of-arrays) representations. In fact you can mix and match tensors that have both layouts internally. Zipping tensors in Spiral just merges them into one with the TOA layout, though the subtensors could be AOT.
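To illustrate the two layouts in plain Julia terms (just a sketch, nothing Spiral-specific about it):

    aot = [(1.0, 2.0f0), (3.0, 4.0f0)]      # array-of-tuples: one array holding tuples
    toa = ([1.0, 3.0], Float32[2.0, 4.0])   # tuple-of-arrays: one array per field
    toa_get(t, i) = map(a -> a[i], t)       # indexing the TOA form still yields a tuple
    toa_get(toa, 2)                         # (3.0, 4.0f0) - the same element as aot[2]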
Dealing with these layout transformations is a significant issue in numerical computation that Spiral solves completely. The reason I have the impression that Julia cannot do this is not that I know for sure. Rather, I do not think this can be done within the confines of a parametric type system; I think it would require an intensional type system like the one Spiral has.
inl in = flatten in |> to_dev_tensor
inl out = flatten out |> to_dev_tensor
`flatten` checks that the tensor is contiguous and flattens it into a 1d tensor. The `to_dev_tensor` is just some formality that I could not get rid of. It takes the pointers out from the references holding them which allows the tensors to be captured lexically before being passed into the kernel. Without this a type error would occur.
kernel = cuda // Lexical scoping rocks.
grid_for {blockDim gridDim} .x in_a {body=inl {i} ->
inl out = out i
inl in = in i
out .set (f in.get out.get)
}
The actual kernel itself is quite small and trivial. `cuda` is just a parser abbreviation for `inl {blockDim gridDim} ->`. In other words it is not a special language feature, but a regular function.
When I was doing this in F#, I had to pass in every single argument of the function by hand - meaning the tensor pointers, their offsets, their sizes, and their dimension ranges for each and every tensor - and I also had to make wrapper classes on top of that. If a map operation needed more arguments, then I would have needed to copy-paste a kernel, adjust it and the wrapper, and repeat the testing process all over again, all this with very little type safety.
Spiral is not quite on the level of Futhark in terms of ergonomics, but it is a vast improvement over what came before. Note that there is no need for type annotations anywhere - the function is as generic as it could possibly be in a statically typed language. Because of inlining guarantees all of this has zero overhead.
Compared to Spiral, where do Julia and its tensors stand on the axis of generality? Would it be possible to write a simple map as simply as this?
This code works with any multi-dimensional array, independent of layout considerations, though you may of course want to specialize based on layout for performance. Note also that this code is slightly more general than the one in your example, because it handles arrays that do not have efficient linear indexing.
Sorry about not seeing your reply yesterday; HN is not as helpful as Reddit, so I have to scan for replies manually, which can be error-prone. I just went through your talk. It was definitely interesting. Despite my confidence in the high-performance abstractions that Spiral allows, I do not have much experience in doing high-performance optimizations myself, so such field reports motivate my efforts.
So Julia can definitely do flexible data structure layouts, but looking at your code I am starting to understand why it took an expert such as yourself to show me this example. Had you not told me that the link is to a map kernel I would have had difficulty figuring it out on my own.
This shifts my objections from 'Julia cannot do it', to 'macro heavy code is quite hard to grasp'. In Spiral you could have written all of this without the need for macros. Spiral does have (text) macros, but their purpose is to do language interop and not abstraction.
The use of macros is a typical anti-pattern in languages whose type systems are too weak for the problem at hand - I am decently sure that with sufficient macro magic any language that has them would have been capable of performing these same optimizations.
One immediate benefit of a powerful type system that I'd like to point out is that Spiral allows reflection over tuples, which is a much more elegant solution to the problem of a function having variable arguments than the C-style varargs mechanism that I see Julia using in the examples you've shown me. Also, in Spiral reflection is done using pattern matching rather than if statements.
Another benefit is that Spiral has no need for separate static array machinery. Literals can be a part of a variable's type, and unless explicitly prevented they tend to be propagated through function call boundaries (join points). Assuming the tensor dimensions are known at compile time, inlining them in loops is actually the default behavior in Spiral. I've yet to do benchmarking to see how helpful that is, but I am sure it would help reduce register pressure.
Your examples are not quite enough to get me to apologize for my rude behavior, but are enough to get me to shift my views a little so I'll change the offending sentence so it reflects reality more accurately.
On Julia 0.7, the broadcast kernel will look much more like this (and actually the version used in CuArrays/CLArrays currently also doesn't use those macros):
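Something along these lines (a sketch; the exact names may differ):

    function broadcast_kernel!(dest, bc)
        I = @cuda_index dest          # compute the index; return early when out of range
        @inbounds dest[I] = bc[I]
        return
    end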
1) The macro @cuda_index doesn't really need to be a macro, but like this it's the shortest way to check the index and return when out of range (in CUDA/OpenCL you often need to schedule a loop with more iterations than you need).
2) The `dest[I] = bc[I]` does all the unzipping of arguments and dispatches to the correct fast getindex for a certain layout of dest/bc via the type system + tuples of the arguments (not using C-style varargs).
This simple function makes it possible to get all the advantages of julia's lazy fused broadcasting.
Feel free to check out this great blog article to see what's possible with the new lazy broadcasting:
You can also look directly at the broadcasting code in Julia Base, which is already a bit bloated to deal with lots of edge cases, but is still pretty readable in parts:
I like how the broadcasting comes out in the end in Julia. Besides that, Julia has quite a few niceties that I could only wish Spiral had, and I do think they matter significantly even if they are not vital. This is definitely praiseworthy work.
Since I haven't done so and I really should have: if you or anyone else still looking at this thread is curious what GPU programming without macros looks like, take a peek at this:
Comparing Spiral and Julia code that has been linked so far, I feel that no one would ever guess without knowing ahead of time that Spiral is a static language and Julia dynamic based on these examples.
As an example of this in action, here is how layer normalization is implemented using the 'seq_broadcast' kernel which does a sequence of reductions and broadcasts in registers. It is also used to implement softmax for example.
>GPU programming without macros looks like then take a peek at this:
I'm not sure if you missed my examples, or if there is some other misunderstanding going on ;)
GPU code in Julia works 100% without macros, but you are free to use them, so people do ;) Also, you can pretty much write the GPU kernels like pseudocode; not sure how much simpler it can get.
With your linked spiral code, I can't even really tell where the algorithm starts and where the setup code ends - which of course might easily be attributed to my unfamiliarity with Spiral ;)
I wrote an article some time ago about generic programming with Julia's gpu infrastructure:
Do you have any benchmarks for the softmax kernel? If that kernel has optimal performance, it would be quite interesting. If it's sub-par, it looks much longer than a simple version.
> With your linked spiral code, I can't even really tell where the algorithm starts and where the setup code ends - which of course might easily be attributed to my unfamiliarity with Spiral ;)
Heh, I've noticed that the setup code tends to come out longer than the actual kernels. It is inside `kernel = cuda`. Spiral is indentation sensitive like Python and F#.
The stuff inside {} is just module creation, think of it like tuples with named fields.
> Do you have any benchmarks for the softmax kernel? If that kernel has optimal performance, it would be quite interesting. If it's sub-par, it looks much longer than a simple version.
No, I've yet to actually benchmark it. It really depends on how good of a job NVCC does with the generic sequential reduce kernel. I'll do an in depth analysis when I am done with all the neural network work that I am doing currently which might take a while.
You can in fact get loop unrolling with just standard functions in Spiral. I go into it in the context of this chapter. It is achieved by pattern matching over tuples and recursion.
> If you can make concrete examples of how things can be simpler than this, I'd be delighted to hear them :)
Yes, by replacing meta programming with intensional polymorphism and inlining guarantees. Also, reflection should be done using pattern matching. I think this last one could be taken entirely seriously as it would not involve a replacement of the entire type system.
All the claims in the article about inlining and specialization that make it sound like magic are what in general make me dubious about languages pretending to be speed kings. Yes, I am aware that GPU kernels do not require optimizers capable of having monads for breakfast, and that in the context of GPU programming where the claims were made they are probably true, but inlining is the sort of thing that matters more the more high-level a language is. For very high-level languages that desire speed, there isn't much choice but to make it a part of the language semantics despite the added burden it puts on the user.
Since we are still at it, I have a question I need to ask.
Recently I've been informed that Julia is capable of GCing GPU memory. If this is fully integrated that would be a major feature which is not possible in say .NET or Racket. I really wanted this in Spiral and could not get it in .NET.
By fully integrated, I mean much like for regular memory for which the GC takes note of the state of the system for when to do collection and defragmentation. If it can only make a thin wrapper with a finalizer (much like in .NET) around an unmanaged resource then it is not a big deal.
Is it fully integrated or is it wrapper-style memory management?
If it is the latter, then that is too bad, but I'd suggest to the Julia devs that they work on making it fully integrated as it would be a really good feature. Obviously, I can't do it in Spiral as I would need to write my own VM and I have only so many years in my life.
As a matter of fact, it's what the compiler devs keep recommending. As a developer I must say that it's pretty relaxing to take a meta programming shortcut from time to time ;)
In general, we plan to offer a toolbox that employs these kinds of patterns for loop unrolling, tiling etc., to make it easier to write GPU code without meta programming and perform certain optimization/scheduling patterns based on a more trait-like system.
> Also reflection should be done using pattern matching
I feel like what Julia does is just more low level right now - and you get pretty far with just multiple dispatch! If we needed more than that, we could put the effort into extending the language in that direction. But I think your use cases must be pretty advanced until you need something more complex. Can you recommend a nice article that shows off beautiful pattern matching for reflection? I see how it sounds nice, but I can't really imagine right now how it would improve my life.
Concerning inlining, I do think we actually force codegen to always inline when compiling for the GPU. I don't remember 100% anymore, but there might have been some problems with that. But it's definitely the goal to have a flexible compiler that you can fine-tune for specific domains like GPU programming. And if we don't get it as part of the compiler, we can definitely do things like that with https://github.com/jrevels/Cassette.jl/
Apart from that, I'm pretty happy that our GPU compilation infrastructure isn't making Julia less useful as a general purpose language ;)
>Is it fully integrated or is it a wrapper style memory management?
It's wrapper style. We're playing around with different kinds of hacks to make it less bad, but those are hacks you could do in any GC language. The good news is that the Julia compiler devs take GPU computing seriously and have promised to work on a full integration that is aware of the GPU hardware.
> As a matter of fact, it's what the compiler devs keep recommending.
I definitely agree with them in this matter. Macros might have a role in language development, but they should not be a stand in for compiler optimizations. For safety and speed, the type system should be there.
Here is how the example you've shown could be done in Spiral. Maybe I should add express support for literal testing in pattern matching, but it is not a pattern that comes up too often by itself.
> I feel like what Julia does is just more low level right now - and you get pretty far with just multiple dispatch!
What you say is exactly right: pattern matching compiles down to those low level operations, so there is no reason at all why those low level operations should be done by hand. Pattern matching does not depend on a particular type system or on whether the language is dynamic or static. There are only so many good ideas in programming languages and this is one of them.
Though there is some overlap between pattern matching and multiple dispatch, the roles are different. The purpose of multiple dispatch is extensibility, but the purpose of pattern matching is destructuring.
Here is quite a complex example of how it is used in action. I won't go into detail of this here, but you can see how I repeatedly match on the contents of a module at different times in order to get more generic functionality for the kernel.
The particular kernel shown here is the most complex one that exists in the library right now - I am yet to get to things like generic matrix multiplication and convolution, but I'll get there eventually.
> I see how it sounds nice, but I can't really imagine right now how it would improve my life.
Let me just say that it is really difficult to know ahead of time how a particular language feature would affect your programming life. I could have said the same thing about first class functions back in 2015. I am sure in the future there will be such features I can't even imagine right now.
>The purpose of multiple dispatch is extensibility, but the purpose of pattern matching is destructuring.
Agreed! ;) Just wanted to point out that I was able to use multiple dispatch elegantly in the unroll example, where you are using pattern matching.
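For reference, that dispatch version looks roughly like this (a sketch):

    unroll(f, args::Tuple{}) = nothing
    unroll(f, args::Tuple) = (f(first(args)); unroll(f, Base.tail(args)))

    unroll(println, (1, 2.0, "three"))   # specializes on the tuple type and unrolls fully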
>pattern matching compiles down to those low level operations
So I guess that means Julia could indeed satisfy you, if we gave those low level operations some more syntactic sugar?
Have you seen: http://kmsquire.github.io/Match.jl/latest/ ?
>Let me just say that it is really difficult to know ahead of time how a particular language feature would affect your programming life
True story. I'll try to be more observant when writing code and see if I could actually solve things more elegantly with better pattern matching.
Btw, I'm writing an article about GPU programming in Julia - do you have a feature you'd like to have explained, or a killer feature you believe we can't do and that would impress you if we did?
You keep making assertions based on completely false assumptions about Julia and its type system. It might come across better if you posed these assertions as questions instead. A few counterpoints...
> One immediate benefit of a powerful type system that I'd like to point out is that Spiral allows reflection over tuples, which is a much more elegant solution to the problem of a function having variable arguments than the C-style varargs mechanism that I see Julia using in the examples you've shown me.
* Julia's type system can reflect on tuples just fine at both compile time and run time and we do so all the time. When tuple size and contents are statically knowable, there is no run time overhead.
* Julia's varargs are represented as tuples, which can be and are reflected on by the compiler. Julia's varargs has nothing to do with C varargs except superficial usage similarity.
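For example:

    f(args...) = args    # varargs arrive as an ordinary tuple
    f(1, 2.0, :x)        # (1, 2.0, :x) :: Tuple{Int64, Float64, Symbol}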
> Another benefit is that Spiral has no need for separate static array machinery. Literals can be a part of a variable's type, and unless explicitly prevented they tend to be propagated through function call boundaries (join points).
* The fact that Julia's default array types don't include static dimensions is a design _choice_ not a type system limitation. Specializing on the exact dimensions of an array is _not_ usually what you want to do in general computational work. If it _is_ what you want to do, you can use the StaticArrays package.
* Literals can be type parameters in Julia too—this is done all the time. Indeed, as I said elsewhere in this thread, type parameters can be other types, integers, floats, symbols, tuples, and recursive constructions thereof.
* The StaticArrays package is implemented in pure Julia, demonstrating that the type system is perfectly capable of reflecting on tuples and having dimension sizes and tuples of them as type parameters—that's how the static array types are implemented. The compiler generates completely unrolled, fully inlined code for static array operations since the dimensions are known at compile time.
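For instance (a small sketch):

    using StaticArrays
    v = SVector(1.0, 2.0, 3.0)   # the length 3 is part of the type: SVector{3,Float64}
    w = 2v + v                   # fully unrolled and stack-allocated, no heap traffic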
> Assuming the tensor dimensions are known at compile time, inlining them in loops is actually the default behavior in Spiral. I've yet to do benchmarking to see how helpful that is, but I am sure it would help reduce register pressure.
I understand that Spiral is targeted at a context (GPUs) where static arrays are a good default. The fact that Julia does not take that approach as a default is not a type system limitation (as proven by the existence of StaticArrays), it's a choice based on the fact that static arrays are not the right default for general numerical computing. There are many situations where being forced to know array dimensions at compile time is not helpful or even actively problematic.
> Your examples are not quite enough to get me to apologize for my rude behavior, but are enough to get me to shift my views a little so I'll change the offending sentence so it reflects reality more accurately.
I'm unclear on why any examples are necessary to apologize for rudeness. Being right doesn't justify being rude.
As far as I can tell, the relationship between Spiral and Julia's approaches is roughly:
* Spiral is static and primarily focused on statically known, fixed sized arrays and generating good code for GPUs.
* Julia is dynamic, and primarily focused on flexibility along with C-level performance on CPUs.
* Julia partially covers Spiral's target territory with StaticArrays and GPU code generation capabilities.
* Spiral does not, on the other hand, address (or aim at) the more general array computing that Julia supports.
Given the substantial overlap in capabilities, it seems like dismissing Julia out of hand for the application areas you're interested in is both unwarranted and unwise. You may want to be a bit less dismissive and understand it and learn from it instead. In particular, Spiral's type system is not more powerful than Julia's. They are different, since Spiral's is static and forces types to be known and checked at compile time (as far as I understand). Julia's type system relaxes this, allowing types to be unknown until much later — and as a tradeoff, checked much later. I have yet to see anything that Spiral's type system can express that cannot be expressed in Julia's. Of course, we're also cheating since Julia doesn't do type checking and therefore having a highly expressive type system doesn't really cost us much.
> Julia's type system can reflect on tuples just fine at both compile time and run time and we do so all the time.
Then why do I never see that actually being done? Reflection on tuples should be done much like pattern matching on lists in functional languages. What is the point of that VarArgs nonsense? I know that Julia has pattern matching - now it needs to take the next step and actually make use of it.
Is this Julia's way of making itself familiar to C programmers?
I see those `...` ellipses used as some kind of operator in both C++ and (ironically) Racket, and they are a horrible idea as they are completely implicit and non-obvious in their function.
Using type membership tests + if statements is so 90s. Is Julia also trying to draw in the Java crowd here by trying to be closer to that language?
> <all those other points>
> There are many situations where being forced to know array dimensions at compile time is not helpful or even actively problematic.
It is not that Spiral enforces that dimensions be static. Rather, if they are known at compile time, that information is merely propagated forward, including through function call boundaries.
You seem to miss what it really means to have first-class types together with staging in a language. Spiral's tensors can be arbitrarily static or dynamic in their dimension and can even allow some dimensions to be static (known at compile time) while the others are dynamic. No friction results from this.
All this does not actually require separate implementations like in Julia. Both Spiral and Julia have Turing complete type systems, but based on this I can conclude that Spiral's is more expressive.
It would be trivial to force it so all the dimensions of a tensor are dynamic, but why would one want to propagate less information during compilation? It is not like dimensions of tensors change that often or arbitrarily like common variables do.
Also, let me just state for the record that Spiral is intended to be more than a GPU language. A language with the capability of doing GPU programming elegantly was my motivation, but Spiral's feature set makes it uniquely suited for both very high level and very low level programming.
A language saying it wants to be as fast as C counts for very little; the question is how fast it will be once the code starts being really abstract. This is why the one benchmark that currently exists in Spiral's documentation is for parser combinators.
GPU kernels are not a good test bed for language speed because they do not use that many high level features apart from the ones needed for tensors. I'd be more concerned with all the scaffolding needed to set up the kernel. And in fact that was one of the major concerns for me back when I was doing an ML library in F#.
In the field that Julia is aiming for, I'd rather see a benchmark for a CPU-based AD library that works on scalars, which is a use case in scientific computing. Optimizing this is actually difficult given that none of the mainstream functional languages can do it. I'd expect the same situation as with monads for Julia, where the inliner just gives up.
> Of course, we're also cheating since Julia doesn't do type checking and therefore having a highly expressive type system doesn't really cost us much.
I do not know whether the code that was linked in these threads is a representative sample for Julia, but Julia's code to me looks much more like it was written in a static language than Spiral's does.
I think that language wars such as these are necessary in order to come to the truth. It is not that I am being rude on purpose, it is that rudeness in a battle is to be excused and seen as unavoidable. Being right is a process rather than a fact because being able to get it down to a fact is rare and getting the fact to be accepted is even rarer. Cooking a good meal requires flames.
Language semantics have indivisible algorithmic properties that set them apart from each other and hence they cannot ever be equal. The best thing therefore is to enjoy the division. I see this as different than celebrating diversity.
I actually don't know anything about these sorts of things. My complaint is merely that you claimed julia is not even worth comparing with when you don't actually know how tensors are implemented in julia or if whatever it is you're doing here is possible in julia. Instead you compare to PyTorch and assume julia has as shit semantics as PyTorch which is false.
Perhaps someone who knows more can comment on whether or not julia meets your criteria here but either way, I think your presentation is misleading.
I cannot change my attitude otherwise I would never get anything done. The only principle that I hold dear, and the only way I can truly be fair is that if I am actually wrong I will change my mind. I would not be so rough on Julia if it didn't present itself as a direct competitor to Spiral. Since it presents itself as a numerical computation language capable of replacing C++ and dominating machine learning and GPU programming I will hold it to the same standard I would take Spiral.
Any numerical computation language should have tensors generic in dimension, layout and type. That is the point I am trying to hammer home. If you think that Julia's tensors are better than PyTorch's and feel that my equal treatment of them is unfair - I honestly feel that is the wrong way to think about this. Rather than fighting to be the king of a very small hill, it would be better to try climbing a much larger mountain instead. Look inward and be unhappy at what the language has now.
Envy Spiral's wonderful tensors. Wouldn't Julia be a much better language if it had its inlining guarantees? And maybe add to that the intensional polymorphism that would allow it to implement them.
Why if that happened, I'd lose my reason for working on the language and have to do something else like study reinforcement learning for example. How horrible would that be? :)
>Any numerical computation language should have tensors generic in dimension, layout and type. That is the point I am trying to hammer home. If you think that Julia's tensors are better than PyTorch's and feel that my equal treatment of them is unfair - I honestly feel that is the wrong way to think about this. Rather than fighting to be the king of a very small hill, it would be better to try climbing a much larger mountain instead. Look inward and be unhappy at what the language has now.
It's still silly to compare Julia's tensors to PyTorch's because they are generic in dimension, layout, and type unlike PyTorch's and these seem to be exactly the pieces that you're ragging on. This means that your beef against PyTorch is fine, but it doesn't necessarily apply to Julia. I am curious if you can go into more detail about Julia's implementation instead of talking about possible Julia problems through an explanation of PyTorch.
>Wouldn't Julia be a much better language if it had its inlining guarantees?
Not necessarily. Julia is a dynamic interactive language which is made to be used through a REPL. Although you can increasingly statically compile packages, the standard way to use it is interactive and in this case you do have to watch your compile-times closely.
Julia does have opt-in inlining guarantees through the `@inline` macro. It's opt-in instead of opt-out in order to stop unnecessary compile-time growth (inlining isn't always that helpful). To decide when inlining should occur, Julia uses a cost model:
This then tries to only inline what's necessary to do for performance. Of course, using the fact that Julia can inline what it needs allows things like CUDANative.jl to force inlining and get the kernels it needs.
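Usage is just an annotation, e.g.:

    @inline   fast(x) = 2x + 1          # opt in: inline regardless of the cost model
    @noinline slow(x) = sum(sin, 1:x)   # opt out: never inline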
>And maybe add to that the intensional polymorphism that would allow it to implement them.
This sounds like a good idea. We should take this over to our community and see whether it would work well in the language.
I think the last thing to address is:
>Since it presents itself as a numerical computation language capable of replacing C++ and dominating machine learning and GPU programming I will hold it to the same standard I would take Spiral.
Not really. Julia is dynamic and interactive. C++ with its full templating is very powerful but can allow its compile-times to explode because its main use case is not with a REPL (and templated C++ code compile times are...). While Julia's generics and speed have increased the domain of what's possible with a dynamic language, it is still made for a much different domain than C++. I would say it's not really in direct competition with Spiral either since the interactive workflow is just so different. Julia's static binary generation is growing in power but will never be as strong as something made for static compilation, and C++/Spiral won't feel like an on-demand calculator where most users are punching commands and getting instant feedback.
To me, Spiral looks like a cool alternative to C++/Rust/Go in the scientific computing domain, and package ecosystems that mix Spiral+Julia could be quite powerful.
> It's still silly to compare Julia's tensors to PyTorch's because they are generic in dimension, layout, and type unlike PyTorch's and these seem to be exactly the pieces that you're ragging on. This means that your beef against PyTorch is fine, but it doesn't necessarily apply to Julia. I am curious if you can go into more detail about Julia's implementation instead of talking about possible Julia problems through an explanation of PyTorch.
Well, no, I can't go into more detail. It is the responsibility of those who take up the challenge I made, so why don't you go into detail instead, in the context of a GPU map kernel? I am genuinely curious about this, but not so much that I actually want to dive deep into the language.
Apart from that, I do like the rest of your reply. It is true that the best performance lies somewhere between full inlining and full heap allocation, and that full inlining will kill compile times. One of the points I want to make with Spiral is that starting from the position of full inlining and then adding boxing and join points is a lot easier and much less error-prone than starting from the other end and inlining by hand. I feel it is important that other programmers realize this, along with the fact that the only alternative is sophisticated compiler heuristics for when to inline.
> This sounds like a good idea. We should take this over to our community and see whether it would work well in the language.
I know that Odersky thinks that pattern matching on types directly only really works in languages with simple type systems such as Spiral's. If a language uses abstract types then that becomes more difficult. I definitely see a language having both intensional and extensional (parametric) polymorphism as a significant challenge. I'd love to see it met, because extensional polymorphism is a lot easier on the IDEs, and the way Spiral does it will make it difficult to create good tooling for it.
If Julia could support intensional polymorphism then inlining guarantees would just fall out from that. It is quite powerful - the inlining guarantees Spiral gives does not just apply to top level functions, but any function passed as an argument and captured by other functions.
> Not really. Julia is dynamic and interactive. C++ with its full templating is very powerful but can allow its compile-times to explode because its main use case is not with a REPL (and templated C++ code compile times are...).
If C++'s templating system was a car, then it would be missing wheels and a steering wheel. And possibly the brakes. How can one possibly do staging without join points, and all the other things Spiral has? As a language it misses several key concepts that would actually make such a system work well.
With the last paragraph I half agree with you.
It is true that right now Spiral uses a compile-run loop and acts as a sophisticated code generator, but it could be made into a caching interpreter with some work. It probably will have to be if it gets bigger, for the sake of compile times and library support.
Getting instant feedback in the REPL is hardly the domain of dynamic languages anymore - F# can do it well for example and so can various other static functional languages. Even C# can do it these days.
> To me, Spiral looks like a cool alternative to C++/Rust/Go in the scientific computing domain, and package ecosystems that mix Spiral+Julia could be quite powerful.
Julia does a lot of multi-stage compilation. It probably has had a lot of work done on its JIT by now. And language mixing, especially of languages with different strengths and weaknesses is a great idea. Staging gives the user access to it in a more direct manner so why not bring it out?
Spiral is quite a small language so if JITing abilities are flexible enough, it might be a good idea to implement it in Julia as a DSL and merge it with main language. That would combine all the advantages of Julia and Spiral as languages.
This kind of syntax is great for small code samples or to play in a CLI, but in larger programs more traditional (verbose) syntax works better for me. YMMV.
I'll admit this caused the first non-gaming related slight twinge of nostalgia for the windows ecosystem I've had in a long time since I went gnu/linux only. Good on the author.
It really reminds me of Forth. The only Forth program I've interacted with is the BIOS prompt for the OLPC laptop, which used Forth to bootstrap the CPU from the GPU.
Author here. The language was made for CUDA programming and until something comes along to displace it, I have no plans of moving from it.
That having been said, the language is quite small and it would be possible to create a new backend in about 500 lines of code, so if you really want it, you could probably do it yourself. I'd help you integrate it into the language. I know nothing about SPIR-V at the moment though.