I might be persuaded to see things from your point of view if you can show that Julia's tensors meet my criteria. I might be wrong about this - I do not know much about Julia and it might very well be capable of this. But these are the kinds of things that if it were capable of, it would advertise. If it can do it I'll apologize and change the line to show Julia in a more positive light.
The reason why I started talking about inlining of monads here is because ultimately they are just functions, and Spiral's tensors are not part of the language, but just functions as well. So if it cannot do monads then it probably cannot do tensors to satisfaction either.
It was pointed out to me that tensor slicing could serve as substitute for partial application. What I am wondering next is the degree of flexibility Julia's tensors have in the context of GPU kernels. Let me illustrate this with a short example. The following is how the inplace map Cuda kernel is made in Spiral. It seems simple, but I'll go at it from the top to show the characteristics and flexibility that Spiral's tensors have.
met map' w f in out =
inl in, out = zip in, zip out
assert (in.dim = out.dim) "The input and output dimensions must be equal."
inl in = flatten in |> to_dev_tensor
inl out = flatten out |> to_dev_tensor
inl in_a :: () = in.dim
inl blockDim = 128
inl gridDim = min 64 (divup (s in_a) blockDim)
w.run {
blockDim gridDim
kernel = cuda // Lexical scoping rocks.
grid_for {blockDim gridDim} .x in_a {body=inl {i} ->
inl out = out i
inl in = in i
out .set (f in.get out.get)
}
}
The first line is `inl in, out = zip in, zip out`. Since the inputs to to the map function can be arbitrary data structures, the `zip` iterates over it and zips the tensors in it into a single one.
In Spiral the tensors have a tuple of arrays layout by default, and it is absolutely trivial to switch between AOT and TOA representations. In fact you can mix and match tensors that have both layouts internally. Zipping tensors in Spiral just merges them into one with the TOA layout though the subtensors could be AOT.
Dealing with these layout transformations is a significant issue in numerical computation that Spiral solves completely. The reason why I have an impression that Julia cannot do this is not because I know for sure. Rather I do not think this can be done within the confines of a parametric type system. Rather I think it would require an intensional type system like the Spiral has.
inl in = flatten in |> to_dev_tensor
inl out = flatten out |> to_dev_tensor
`flatten` checks that the tensor is contiguous and flattens it into a 1d tensor. The `to_dev_tensor` is just some formality that I could not get rid of. It takes the pointers out from the references holding them which allows the tensors to be captured lexically before being passed into the kernel. Without this a type error would occur.
kernel = cuda // Lexical scoping rocks.
grid_for {blockDim gridDim} .x in_a {body=inl {i} ->
inl out = out i
inl in = in i
out .set (f in.get out.get)
}
The actual kernel itself is quite small and trivial. `cuda` is just a parser abbreviation for `inl {blockDim gridDim} ->`. In other words it is not a special language feature, but a regular function.
When I was doing this in F#, I had to not pass in every single argument of the function by hand - meaning the tensor pointers, their offsets, their sizes, their dimension ranges for each and every tensor, I also had to make wrapper classes on top of that. If a map operation needed more arguments, then I would have needed to copy paste a kernel, adjust it and the wrapper and repeat the testing process all over again, all this with very little type safety.
Spiral is not quite on the level of Futhark in terms of ergonomics, but it is a vast improvement over what came before. Note that there is no need for a type annotations anywhere - the function is as generic as it could possibly be in a statically typed language. Because of inlining guarantees all of this has zero overhead.
Compared to Spiral, where do Julia and its tensors stand on the axis of generality? Would it be possible to write a simple map as simply as this?
This code works with any multi-dimensional array, independent of layout considerations, though you may of course want to specialize based on layout for performance. Note also that this the code is slightly more general than the one in your example, because it handles arrays that do not have efficient linear indexing.
Sorry about not seeing your reply yesterday, HN is not as helpful as Reddit so I have to scan for them manually which can be error prone. I just went through your talk. It was definitely interesting. Despite my confidence in the high-performance abstractions that Spiral allows, I do not have much experience in doing high performance optimizations myself so such field reports motivate my efforts.
So Julia can definitely do flexible data structure layouts, but looking at your code I am starting to understand why it took an expert such as yourself to show me this example. Had you not told me that the link is to a map kernel I would have difficulty figuring it out on my own.
This shifts my objections from 'Julia cannot do it', to 'macro heavy code is quite hard to grasp'. In Spiral you could have written all of this without the need for macros. Spiral does have (text) macros, but their purpose is to do language interop and not abstraction.
The use of macros is a typical anti-pattern in languages whose type systems are too weak for the problem at hand - I am decently sure that with sufficient macro magic any language that has them would have been capable of performing these same optimizations.
One immediate benefit of a powerful type system that I'd like to point out is that Spiral allows reflection over tuples, which is a lot more elegant solution to a problem of a function having variable arguments than the C-style varargs mechanism that I see Julia using in the examples you've shown me. Also, in Spiral reflection is done using pattern matching rather than if statements.
Another benefit is that Spiral has no need for separate static array machinery. Because literals can be a part of a variable's type and if not, unless explicitly prevented the literals tend to be propagated through function call boundaries (join points). Assuming they are known at compile time the inlining of tensor dimensions in loops is actually the default behavior in Spiral. I've yet to do benchmarking to see how helpful that is, but I am sure it would help reduce register pressure.
Your examples are not quite enough to get me to apologize for my rude behavior, but are enough to get me to shift my views a little so I'll change the offending sentence so it reflects reality more accurately.
On Julia 0.7, the broadcast kernel will look much more like this (and actually the version used in CuArrays/CLArrays currently also doesn't use those macros):
1) The macro @cuda_index doesn't really need to be a macro,
but like this it's the shortest way to check the index and return when out of range (in CUDA/OpenCL you often need to schedule a loop with more iterations than you need).
2) the `dest[I] = bc[I]` does all the unzipping of arguments and dispatch to the correct fast getindex for a certain layout of dest/bc via the type system + tuples of the arguments (not using C-style varargs).
This simple function makes it possible to get all the advantages of julia's lazy fused broadcasting to work.
Feel free to checkout this great blog article to see what's possible with the new lazy broadcasting:
You can also look directly at the broadcasting code in Julia Base, which is already a bit bloated to deal with lots of edge cases, but is still pretty readable in parts:
I like how the broadcasting comes out in the end in Julia. Besides that Julia has quite many niceties that I could only wish Spiral had and I do think that they matter significantly even if they are not vital. This is definitely praiseworthy work.
Since I haven't done so and I really should have, if you or anyone else still looking at this thread are curious how GPU programming without macros looks like then take a peek at this:
Comparing Spiral and Julia code that has been linked so far, I feel that no one would ever guess without knowing ahead of time that Spiral is a static language and Julia dynamic based on these examples.
As an example of this in action, here is how layer normalization is implemented using the 'seq_broadcast' kernel which does a sequence of reductions and broadcasts in registers. It is also used to implement softmax for example.
>GPU programming without macros looks like then take a peek at this:
I'm not sure if you missed my examples, or if there is some other missunderstanding going on ;)
GPU code in Julia works 100% without macros, but you are free to use them, so people do ;) Also, you can pretty much write the gpu kernels like pseydo code, not sure how much simpler it can get.
With your linked spiral code, I can't even really tell where the algorithm starts and where the setup code ends - which of course might easily be attributed to my unfamiliarity with Spiral ;)
I wrote an article some time ago about generic programming with Julia's gpu infrastructure:
Do you have any benchmarks for the softmax kernel? If that kernel has optimal performance, it would be quite interesting. If it's sup par, it looks much longer than a simple version.
> With your linked spiral code, I can't even really tell where the algorithm starts and where the setup code ends - which of course might easily be attributed to my unfamiliarity with Spiral ;)
Heh, I've noticed that the setup code tends to come out longer than the actual kernels. It is inside `kernel = cuda`. Spiral is indentation sensitive like Python and F#.
The stuff inside {} is just module creation, think of it like tuples with named fields.
> Do you have any benchmarks for the softmax kernel? If that kernel has optimal performance, it would be quite interesting. If it's sup par, it looks much longer than a simple version.
No, I've yet to actually benchmark it. It really depends on how good of a job NVCC does with the generic sequential reduce kernel. I'll do an in depth analysis when I am done with all the neural network work that I am doing currently which might take a while.
You can in fact get loop unrolling with just standard functions in Spiral. I go into it in the context of this chapter. It is achieved by pattern matching over tuples and recursion.
> If you can make concrete examples of how things can be simpler than this, I'd be delighted to hear them :)
Yes, by replacing meta programming with intensional polymorphism and inlinining guarantees. Also reflection should be done using pattern matching. I think this last one could be taken entirely seriously as it would not involve a replacement of the entire type system.
All the claims in the article about inlining and specialization that make it sound like magic is what in general makes me dubious about languages pretending to be speed kings. Yes, I am aware that GPU kernels do not require optimizers capable of having monads for breakfast and that in the context of GPU programming where they were made they are probably true, but inlining is the sort of thing that matters more the more high level a language is. For very high level languages that desire speed, there isn't much choice but to make them a part of language semantics despite the added burden it puts on the user.
Since we are still at it, I have a question I need to ask.
Recently I've been informed that Julia is capable of GCing GPU memory. If this is fully integrated that would be a major feature which is not possible in say .NET or Racket. I really wanted this in Spiral and could not get it in .NET.
By fully integrated, I mean much like for regular memory for which the GC takes note of the state of the system for when to do collection and defragmentation. If it can only make a thin wrapper with a finalizer (much like in .NET) around an unmanaged resource then it is not a big deal.
Is it fully integrated or is it a wrapper style memory management?
If it is the later, then that is too bad, but I'd suggest to Julia devs that they work on making it fully integrated as it would be a really good feature. Obviously, I can't do it in Spiral as I would need to write my own VM and I have only so many years in my life.
As a matter of fact, it's what the compiler devs keep recomending. As a developper I must say, that it's pretty relaxing to take a meta programming shortcut from time to time ;)
In general, we plan to offer a tool box that employs these kind of patterns for loop unrolling, tiling etc, to make it easier to write GPU code without meta programming and perform certain optimizations/scheduling patterns based on a more trait like system.
> Also reflection should be done using pattern matching
I feel like what Julia does is just more low level right now - And you get pretty far with just multiple dispatch! If we needed more than that, we could put the effort into extending the language into that direction. But I think your use cases must get be pretty advanced untill you need something more complex. Can you recommend a nice article that shows off beautiful pattern matching for reflection? I see how it sounds nice, but I can't really imagine right now how it would improve my life.
Concerning inlining, I do think we actually force codegen to always inline when compiling for the GPU. I don't remember a 100% anymore, but there might have been some problems with that. But it's definitely the goal, to have a flexible compiler that you can fine tune to specific domains like GPU programming. And if we don't get it as part of the compiler, we can definitely do things like that with https://github.com/jrevels/Cassette.jl/
Appart from that, I'm pretty happy that our GPU compilation infrastructure isn't making Julia less useful as a general purpose language ;)
>Is it fully integrated or is it a wrapper style memory management?
It's wrapper style. We're playing around with different kind of hacks to make it less worse, but those are hacks you could do in any GC language. The good news is, that the Julia compiler devs take GPU computing seriously and promised to work on a full integration that is aware of the GPU hardware.
> As a matter of fact, it's what the compiler devs keep recomending.
I definitely agree with them in this matter. Macros might have a role in language development, but they should not be a stand in for compiler optimizations. For safety and speed, the type system should be there.
Here is how the example you've shown could be done in Spiral. Maybe I should add express support for literal testing in pattern matching, but it is not a pattern that comes up too often by itself.
> I feel like what Julia does is just more low level right now - And you get pretty far with just multiple dispatch!
What you say is exactly right as pattern matching compiles down to those low level operations, so there no reason at all why those low level operations should be done by hand. Pattern matching does not depend on a particular type system or whether the language is dynamic or static. There are only so many good ideas in programming languages and this is one of them.
Though there is some overlap between pattern matching and multiple dispatch, the roles are different. The purpose of multiple dispatch is extensibility, but the purpose of pattern matching is destructuring.
Here is quite a complex example of how it is used in action. I won't go into detail of this here, but you can see how I repeatedly match on the contents of a module at different times in order to get more generic functionality for the kernel.
The particular kernel shown here is the most complex one that exists in the library right now - I am yet to get to things like generic matrix multiplication and convolution, but I'll get there eventually.
> I see how it sounds nice, but I can't really imagine right now how it would improve my life.
Let me just say that it is really difficult to know ahead of time how a particular language feature would affect your programming life. I could have said the same thing about first class functions back in 2015. I am sure in the future there will be such features I can't even imagine right now.
>The purpose of multiple dispatch is extensibility, but the purpose of pattern matching is destructuring.
Agreed! ;) Just wanted to point out, that I was able to use multiple dispatch elegantly in the unroll example, where you are using pattern matching.
>pattern matching compiles down to those low level operations
So I guess that means Julia could indeed satisfy you, if we gave those low level operations some more syntactic sugar?
Have you seen: http://kmsquire.github.io/Match.jl/latest/ ?
>Let me just say that it is really difficult to know ahead of time how a particular language feature would affect your programming life
True story. I'll try to be more observant when writing code and see, if I could actually solve things more elegantly with better pattern matching.
Btw, I'm writing on an article about GPU programming in Julia - do you have feature you'd like to have explained, or a killer feature you believe we can't do and that would impress you if we did?
You keep making assertions based on completely false assumptions about Julia and its type system. It might come across better if you posed these assertions as questions instead. A few counterpoints...
> One immediate benefit of a powerful type system that I'd like to point out is that Spiral allows reflection over tuples, which is a lot more elegant solution to a problem of a function having variable arguments than the C-style varargs mechanism that I see Julia using in the examples you've shown me.
* Julia's type system can reflect on tuples just fine at both compile time and run time and we do so all the time. When tuple size and contents are statically knowable, there is no run time overhead.
* Julia's varargs are represented as tuples, which can be and are reflected on by the compiler. Julia's varargs has nothing to do with C varargs except superficial usage similarity.
> Another benefit is that Spiral has no need for separate static array machinery. Because literals can be a part of a variable's type and if not, unless explicitly prevented the literals tend to be propagated through function call boundaries (join points).
* The fact that Julia's default array types doesn't include static dimensions is a design _choice_ not a type system limitation. Specializing on the exact dimensions of an array is _not_ usually what you want to do in general computational work. If it _is_ what you want to do, you can use the StaticArrays package.
* Literals can be type parameters in Julia too—this is done all the time. Indeed, as I said elsewhere in this thread, type parameters can be other types, integers, floats, symbols, tuples, and recursive constructions thereof.
* The StaticArrays package is implemented in pure Julia, demonstrating that the type system is perfectly capable of reflecting on tuples and having dimension sizes and tuples of them as type parameters—that's how the static array types are implemented. The compiler generates completely unrolled, fully inlined code for static array operations since the dimensions are known at compile time.
> Assuming they are known at compile time the inlining of tensor dimensions in loops is actually the default behavior in Spiral. I've yet to do benchmarking to see how helpful that is, but I am sure it would help reduce register pressure.
I understand that Spiral is targeted at a context (GPUs) where static arrays are a good default. The fact that Julia does not take that approach as a default is not a type system limitation (as proven by the existence of StaticArrays), it's a choice based on the fact that static arrays are not the right default for general numerical computing. There are many situations where being forced to know array dimensions at compile time is not helpful or even actively problematic.
> Your examples are not quite enough to get me to apologize for my rude behavior, but are enough to get me to shift my views a little so I'll change the offending sentence so it reflects reality more accurately.
I'm unclear on why any examples are necessary to apologize for rudeness. Being right doesn't justify being rude.
As far as I can tell, the relationship between Spiral and Julia's approaches is roughly:
* Spiral is static and primarily focused on statically known, fixed sized arrays and generating good code for GPUs.
* Julia is dynamic, and primarily focused on flexibility along with C-level performance on CPUs.
* Julia partially covers Spirals target territory with StaticArrays and GPU code generation capabilities.
* Spiral does not, on the other hand address (or aim) to the more general array computing that Julia supports.
Given the substantial overlap in capabilities, it seems like dismissing Julia out of hand for the application areas you're interested is both unwarranted and unwise. You may want to be a bit less dismissive and understand it and learn from it instead. In particular, Spiral's type system is not more powerful than Julia's. They are different since Spiral's is static and forces types to be known and checked at compile time (as far as I understand). Julia's type system relaxes this, allowing types to be unknown until much later—and as a tradeoff, checked much later. I have yet to see anything that Spiral's type system can express that cannot be expressed in Julia's. Of course, we're also cheating since Julia doesn't do type checking and therefore having a highly expressive type system doesn't really cost us much.
* Julia's type system can reflect on tuples just fine at both compile time and run time and we do so all the time.
Then why do I nowhere see that actually being done? Reflection on tuples should be done much like pattern matching on lists in functional languages. What is the point of that VarArgs nonsense? I know that Julia has pattern matching - now it needs to take the next step and actually make use of it.
Is this Julia's way of making itself familiar to C programmers?
I see those `...` elipses used as some kind of operator in both C++ (and Racket ironically) and they are a horrible idea as they are completely implicit and non-obvious in their function.
Using type membership tests + if statements is so 90s. Is Julia also trying to draw in the Java crowd here by trying to be closer to that language?
> <all those other points>
> There are many situations where being forced to know array dimensions at compile time is not helpful or even actively problematic.
It is not that Spiral enforces that dimensions be static. Rather if they are known at compile time then that information is merely propagated forward including through function call boundaries.
You seem to miss what it really means to have first-class types together with staging in a language. Spiral's tensors can be arbitrarily static or dynamic in their dimension and can even allow some dimensions to be static (known at compile time) while the others are dynamic. No friction results from this.
All this does not actually require separate implementations like in Julia. Both Spiral and Julia have Turing complete type systems, but based on this I can conclude that Spiral's is more expressive.
It would be trivial to force it so all the dimensions of a tensor are dynamic, but why would one want to propagate less information during compilation? It is not like dimensions of tensors change that often or arbitrarily like common variables do.
Also let me just state for the record that Spiral is intended to be more than a GPU language. A language with the capabilities of doing it elegantly was my motivation, but Spiral featureset makes it uniquely suited for both very high level and very low level programming.
A language saying it wants to be as fast as C counts for very little, the question is how fast it would be once the code starts being really abstract? This is why the rare one benchmark that currently exists in Spiral's documentation is for parser combinators.
GPU kernels are not a good test bed for language speed because they do not use that many high level features except for the ones needed for tensors. I'd be more concerned with all the scaffolding needed to set up the kernel. And in fact that was one of the majors concerns for me back when I was doing a ML library in F#.
In the field that Julia is aiming for, I'd rather see a benchmark for a CPU based AD library that works on scalars which is a use case in scientific computing. Optimizing this is actually difficult given that none of the mainstream functional languages can do it. I'd expect the same situation as with monads for Julia where the inliner just gives up.
> Of course, we're also cheating since Julia doesn't do type checking and therefore having a highly expressive type system doesn't really cost us much.
I do not know whether the code that was linked in these threads is a representative sample for Julia, but Julia's code to me looks much more like it was written in a static language than Spiral's does.
I think that language wars such as these are necessary in order to come to the truth. It is not that I am being rude on purpose, it is that rudeness in a battle is to be excused and seen as unavoidable. Being right is a process rather than a fact because being able to get it down to a fact is rare and getting the fact to be accepted is even rarer. Cooking a good meal requires flames.
Language semantics have indivisible algorithmic properties that set them apart from each other and hence they cannot ever be equal. The best thing therefore is to enjoy the division. I see this as different than celebrating diversity.
I actually don't know anything about these sorts of things. My complaint is merely that you claimed julia is not even worth comparing with when you don't actually know how tensors are implemented in julia or if whatever it is you're doing here is possible in julia. Instead you compare to PyTorch and assume julia has as shit semantics as PyTorch which is false.
Perhaps someone who knows more can comment on whether or not julia meets your criteria here but either way, I think your presentation is misleading.
I cannot change my attitude otherwise I would never get anything done. The only principle that I hold dear, and the only way I can truly be fair is that if I am actually wrong I will change my mind. I would not be so rough on Julia if it didn't present itself as a direct competitor to Spiral. Since it presents itself as a numerical computation language capable of replacing C++ and dominating machine learning and GPU programming I will hold it to the same standard I would take Spiral.
Any numerical computation language should have tensors generic in dimension, layout and type. That is the point I am trying to hammer home. If you think that Julia's tensors are better than PyTorch's and feel that my equal treatment of them is unfair - I honestly feel that is the wrong way to think about this. Rather than fighting to be the king of a very small hill, it would be better to try climbing a much larger mountain instead. Look inward and be unhappy at what the language has now.
Envy Spiral's wonderful tensors. Wouldn't Julia be a much better language if it had its inlining guarantees? And maybe add to that the intensional polymorphism that would allow it to implement them.
Why if that happened, I'd lose my reason for working on the language and have to do something else like study reinforcement learning for example. How horrible would that be? :)
>Any numerical computation language should have tensors generic in dimension, layout and type. That is the point I am trying to hammer home. If you think that Julia's tensors are better than PyTorch's and feel that my equal treatment of them is unfair - I honestly feel that is the wrong way to think about this. Rather than fighting to be the king of a very small hill, it would be better to try climbing a much larger mountain instead. Look inward and be unhappy at what the language has now.
It's still silly to compare Julia's tensors to PyTorch's because they are generic in dimension, layout, and type unlike PyTorch's and these seem to be exactly the pieces that you're ragging on. This means that your beef against PyTorch is fine, but it doesn't necessarily apply to Julia. I am curious if you can go into more detail about Julia's implementation instead of talking about possible Julia problems through an explanation of PyTorch.
>Wouldn't Julia be a much better language if it had its inlining guarantees?
Not necessarily. Julia is a dynamic interactive language which is made to be used through a REPL. Although you can increasingly statically compile packages, the standard way to use it is interactive and in this case you do have to watch your compile-times closely.
Julia does have opt-in inlining guarantees through the `@inline` macro. It's opt-in instead of opt-out in order to stop unnecessary compile-time growth (inlining isn't always that helpful). To decide when inlining should occur, Julia uses a cost model:
This then tries to only inline what's necessary to do for performance. Of course, using the fact that Julia can inline what it needs allows things like CUDANative.jl to force inlining and get the kernels it needs.
>And maybe add to that the intensional polymorphism that would allow it to implement them.
This sounds like a good idea. We should take this over to our community and see whether it would work well in the language.
I think the last thing to address is:
>Since it presents itself as a numerical computation language capable of replacing C++ and dominating machine learning and GPU programming I will hold it to the same standard I would take Spiral.
Not really. Julia is dynamic and interactive. C++ with its full templating is very powerful but can allow its compile-times to explode because its main use case is not with a REPL (and templated C++ code compile times are...). While Julia's generics and speed have increased the domain of what's possible with a dynamic language, it is still made for a much different domain than C++. I would say it's not really in direct competition with Spiral either since the interactive workflow is just so different. Julia's static binary generation is growing in power but will never be as strong as something made for static compilation, and C++/Spiral won't feel like an on-demand calculator where most users are punching commands and getting instant feedback.
To me, Spiral looks like a cool alternative to C++/Rust/Go in the scientific computing domain, and package ecosystems that mix Spiral+Julia could be quite powerful.
> It's still silly to compare Julia's tensors to PyTorch's because they are generic in dimension, layout, and type unlike PyTorch's and these seem to be exactly the pieces that you're ragging on. This means that your beef against PyTorch is fine, but it doesn't necessarily apply to Julia. I am curious if you can go into more detail about Julia's implementation instead of talking about possible Julia problems through an explanation of PyTorch.
Well, no, I can't go into more detail. It is the responsibility of those who take up the challenge I made, so why don't you go into detail instead, in the context of a GPU map kernel? I am genuinely curious about this, but not so much that I actually want to dive deep into the language.
Apart from that, I do like the rest of your reply. It is true that best performance lies somewhere between full inlining and full heap allocation, and that full inlining will kill compile times. One of the points I want to make with Spiral is that starting from the position of full inlining and then adding boxing and join points is a lot easier and much less error prone than starting from the other end and inlining by hand. I feel that this is important other programmers realize along with the fact that sophisticated compiler heuristics for when to inline are the only alternative.
> This sounds like a good idea. We should take this over to our community and see whether it would work well in the language.
I know that Odesky thinks that pattern matching on types directly only really works on languages with simple type systems such as Spiral's. If a language uses abstract types then that becomes more difficult. I definitely see a language having both intensional and extensional (parametric) polymorphism as a significant challenge. I'd love to see it met because extensional polymorphism is a lot easier on the IDEs and the way Spiral does it will make it difficult to create good tooling for it.
If Julia could support intensional polymorphism then inlining guarantees would just fall out from that. It is quite powerful - the inlining guarantees Spiral gives does not just apply to top level functions, but any function passed as an argument and captured by other functions.
> Not really. Julia is dynamic and interactive. C++ with its full templating is very powerful but can allow its compile-times to explode because its main use case is not with a REPL (and templated C++ code compile times are...).
If C++'s templating system was a car, then it would be missing wheels and a steering wheel. And possibly the brakes. How can one possibly do staging without join points, and all the other things Spiral has? As a language it misses several key concepts that would actually make such a system work well.
With the last paragraph I half agree with you.
It is true that right now Spiral uses a compile-run loop and acts as a sophisticated code generator, but it could be made into caching interpreter with some work. It probably will have to be if it gets bigger for the sake of compile times and library support.
Getting instant feedback in the REPL is hardly the domain of dynamic languages anymore - F# can do it well for example and so can various other static functional languages. Even C# can do it these days.
> To me, Spiral looks like a cool alternative to C++/Rust/Go in the scientific computing domain, and package ecosystems that mix Spiral+Julia could be quite powerful.
Julia does a lot of multi-stage compilation. It probably has had a lot of work done on its JIT by now. And language mixing, especially of languages with different strengths and weaknesses is a great idea. Staging gives the user access to it in a more direct manner so why not bring it out?
Spiral is quite a small language so if JITing abilities are flexible enough, it might be a good idea to implement it in Julia as a DSL and merge it with main language. That would combine all the advantages of Julia and Spiral as languages.
The reason why I started talking about inlining of monads here is because ultimately they are just functions, and Spiral's tensors are not part of the language, but just functions as well. So if it cannot do monads then it probably cannot do tensors to satisfaction either.
It was pointed out to me that tensor slicing could serve as substitute for partial application. What I am wondering next is the degree of flexibility Julia's tensors have in the context of GPU kernels. Let me illustrate this with a short example. The following is how the inplace map Cuda kernel is made in Spiral. It seems simple, but I'll go at it from the top to show the characteristics and flexibility that Spiral's tensors have.
The first line is `inl in, out = zip in, zip out`. Since the inputs to to the map function can be arbitrary data structures, the `zip` iterates over it and zips the tensors in it into a single one.In Spiral the tensors have a tuple of arrays layout by default, and it is absolutely trivial to switch between AOT and TOA representations. In fact you can mix and match tensors that have both layouts internally. Zipping tensors in Spiral just merges them into one with the TOA layout though the subtensors could be AOT.
Dealing with these layout transformations is a significant issue in numerical computation that Spiral solves completely. The reason why I have an impression that Julia cannot do this is not because I know for sure. Rather I do not think this can be done within the confines of a parametric type system. Rather I think it would require an intensional type system like the Spiral has.
`flatten` checks that the tensor is contiguous and flattens it into a 1d tensor. The `to_dev_tensor` is just some formality that I could not get rid of. It takes the pointers out from the references holding them which allows the tensors to be captured lexically before being passed into the kernel. Without this a type error would occur. The actual kernel itself is quite small and trivial. `cuda` is just a parser abbreviation for `inl {blockDim gridDim} ->`. In other words it is not a special language feature, but a regular function.When I was doing this in F#, I had to not pass in every single argument of the function by hand - meaning the tensor pointers, their offsets, their sizes, their dimension ranges for each and every tensor, I also had to make wrapper classes on top of that. If a map operation needed more arguments, then I would have needed to copy paste a kernel, adjust it and the wrapper and repeat the testing process all over again, all this with very little type safety.
Spiral is not quite on the level of Futhark in terms of ergonomics, but it is a vast improvement over what came before. Note that there is no need for a type annotations anywhere - the function is as generic as it could possibly be in a statically typed language. Because of inlining guarantees all of this has zero overhead.
Compared to Spiral, where do Julia and its tensors stand on the axis of generality? Would it be possible to write a simple map as simply as this?