State of Machine Learning in Julia (julialang.org)
256 points by agnosticmantis on Jan 12, 2022 | 95 comments



I get the sense that the Julia guys are doing very well on solving interesting, hard technical problems like AD. But maybe less well at “dull” problems like good documentation and API design. That might bias them towards being a language for research rather than for production. Still very interested to see how things pan out.


I sat next to an active Julia dev from MIT for a semester, not long ago. He was a lot of what you would expect - accurate, on-time results, tending towards overwork, but on track. I like the Julia project. In this guy's defense, the environment is competitive by design. Some people do get grants and rewards for "dull" work, but you know they always get those high marks for solving known domain problems, for real. "Docs lag behind" is fair to observe, but the forge that creates this correct, disciplined work produces the results, too. Other "soft" academically-rooted projects ultimately did produce docs; there is no reason to believe that Julia is different there.

Personally, I think you do need to pay some living stipend to translators and core authors, within the context of their physical lives. The "libre" movement is real, and has to mature too.


> I think you do need to pay some living stipend to translators and core authors

I agree. Institutions like MIT have the monetary resources to support things like this, but they just don't want to. Universities have still not figured out a great model for hiring people from industry to help them with their problems. Some places, like the Broad Institute, have.


Exactly. To me they have a rather big blind spot when it comes to reaching out to users with docs and understanding their user needs, precisely because they focus so much on these hard challenges.

You can see it in pt 2: "where is the Julia ML ecosystem currently inferior?". They identify a gap in using CUDA-optimized kernels, where Julia doesn't make use of built-in CUDA magic; a solution is proposed where users might manually override compiler behavior (if I understand correctly).

However, no mention at all about the gap in dev experience. Some examples:

- the TERRIFIC efforts of various groups in porting ML models from research to easy-to-use torch code

- the super ergonomic data wrangling tools in torch. Practically speaking, ML ecosystems need quality interop: do I get my data in and out of the ML framework in a sensible way?

- the ease of getting started building your first toy ML model in torch because the docs and the API are so great

Getting the ball rolling, adoption-wise, means overcoming the cost of switching to something new. And I don't see that happening soon because python+torch is just so comfy.


I think it's more complicated than that. The projects that are getting the funding are usually the hard, technical ones, but that funding also supports better docs + more time for API design. This doesn't apply to bleeding edge stuff, but look back through the core SciML libraries and there's no shortage of effort directed towards "dull" stuff like docs + improving compile times. Likewise for the core language: a lot of recent work is bread and butter engineering like (again) improving compile times, filing rough edges off of APIs and (gradually) tackling the deployment story.

Now, one area where this dull problem work isn't as noticeable is on the "core" deep learning libraries (Flux and Zygote). AFAICT those two haven't received any significant funding for a couple of years, and there is at most 1 full time, active contributor for both of them. Compare with JAX or even higher-level wrapper libraries like Flax, Haiku or PyTorch Lightning, which have 5-10+ full time core devs. Given this, is it surprising that progress on anything (including docs + interface design) is slow?


Documentation is a hard problem. Very hard.

Julia actually does quite OK on that front. Here's the 1400-page Julia manual:

https://raw.githubusercontent.com/JuliaLang/docs.julialang.o...

What other language do you have in mind that has better documentation?


I don't think they're talking about the language manual itself, which is certainly very good (though I have my quibbles with it), but the quality of documentation for packages/package ecosystems. Those vary a lot in quality: some good, some incomplete, some hard to access (another subthread here mentions cases where you have to know the implementation details of a package to know where to look up the docs), some poorly organized, some outdated, etc. It's not an insurmountable problem, just adds additional friction to the process and can make the experience more frustrating than it needs to be.


> But maybe less well at “dull” problems like good documentation and API design.

I'm inclined to agree with you here. To my mind, this means that there's ample scope for developers from many backgrounds to make positive contributions to the Julia ecosystem.


Yes and yes. Julia has a pretty bad developer experience. My biggest concern is that when you bring this up to the Julia crowd they just argue that it’s actually great and you just don’t get it.

The Julia community feels a lot like the Scala community to me. Lots of proving why the theories are right, but in production it’s a mess


Mostly because, given where they are coming from, the pain points lie elsewhere.


I took a quick look at it a while back and this seems correct. They are very hard to actually use, and the tooling seems to be lacking severely.


I feel similarly. I love playing around with Julia, and it’s nice for academic one offs, but I wouldn’t choose to use it for a serious deployment at my job. And the interesting part about ML is that industry is very close to academia relative to other fields. So that predisposes a lot of academia to using SKLearn/Pytorch/Jax etc., because that’s what industry is using, and it’s getting the most development.


I'm really rooting for Julia to improve its ML ecosystem. Julia is a JIT-compiled language that has the potential to become the all-encompassing solution for AI/ML/Data Science. I think their motto was "Walk like Python, Run like C." That would let all of its libraries be implemented in Julia itself, unlike what is happening in Python (e.g., PyTorch is written in C++). I want to be clear: Python will be the go-to language for AI/ML for the foreseeable future, and it's a great language, but Julia is a breath of fresh air.


Why is it a problem that python calls out to other languages as needed? It's not like it's obvious while you're using the library and the result can be, in fact, very fast.

The main benefit I can think of is that Julia could theoretically do whole-program optimization, but it doesn't and it doesn't model side effects so even if it wanted to it would be quite limited.


It matters to package developers. If you can develop packages in Julia rather than in C++, you can move much faster. Also, in Julia, users of a package can much more easily contribute. A Python programmer cannot easily jump in and fix or add things to a package mostly written in C++.

Writing packages in C++ also creates a fair amount of friction. In Julia, combining packages becomes easier. The ML world in Julia allows a lot more code sharing, while in Python PyTorch and TensorFlow are large monoliths. E.g. different ML packages in Julia can use the same library of activation functions; each ML library is not a completely different island of functionality.
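
As a minimal sketch of that kind of sharing (assuming NNlib.jl as the shared activation library, which Flux builds on):

  using NNlib   # shared library of activation functions

  x = randn(Float32, 5)
  NNlib.relu.(x)        # elementwise ReLU
  NNlib.sigmoid.(x)     # elementwise logistic sigmoid

  # Any hand-rolled "layer" (or another ML package) can take these as arguments:
  dense(W, b, σ) = v -> σ.(W * v .+ b)
  layer = dense(randn(Float32, 3, 5), zeros(Float32, 3), NNlib.relu)
  layer(x)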


There are innumerable benefits that the compiler gets from being able to see into functions: codegen, AD, fusion, inlining, user-defined arrays and element types being first class, etc.


One benefit from the user's side that I've experienced is that, when debugging, I can go as far into the call stack as I want or need, using the same tools I use to inspect my own Julia code and read things in the same language throughout - no jarring context switch. This helps significantly with building a complete mental model of what you're debugging, and with ruling out a whole class of potential issues at or beyond the language border.
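
As a minimal sketch, the same reflection tools work on your own code and on library internals alike (Debugger.jl here is my assumption, not something the parent named):

  # Locate and open the exact method a call will hit, even deep inside a package:
  @which sum(rand(3, 3))   # prints which method of `sum` applies
  @edit  sum(rand(3, 3))   # opens that method's Julia source in your editor

  # Debugger.jl can then step through that same source, frame by frame:
  # using Debugger
  # @enter sum(rand(3, 3))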


There are a lot of cases in scientific computing where using "glue code" of any sort can kill performance. You miss compile-time optimizations that can perform whole-program analysis.

As a result, despite Julia being very FFI-friendly, the Julia community is trying to move towards a "pure Julia" ecosystem. Sometimes this attitude makes sense (Julia-native BLAS), other times not (Julia has no HTTP/2 packages).


If it is all one language, you can have very nice code sharing. Imagine a function that takes some matrices and does some operation on them. If you now want to execute this function on a GPU, you can just swap the matrix type to GPU matrices, and as long as you implement kernels for all the operators the function uses, the function can be executed on the GPU without any modification. This modularity removes the requirement for each package that wants to use GPUs to recreate and ship their own numpy-like implementation (like TensorFlow, PyTorch, etc.).
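
A minimal sketch of what that looks like in practice (assuming CUDA.jl and a CUDA-capable GPU):

  using CUDA

  # Generic code, written only against the AbstractMatrix interface:
  normalized_dot(A, B) = sum(A .* B) / length(A)

  A, B = rand(Float32, 1024, 1024), rand(Float32, 1024, 1024)
  normalized_dot(A, B)              # runs on the CPU

  Agpu, Bgpu = CuArray(A), CuArray(B)
  normalized_dot(Agpu, Bgpu)        # same function, unchanged, now runs on the GPU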


> Why is it a problem that python calls out to other languages as needed?

There is no problem if a) someone has written that code in the other language already and b) it works correctly.

But it's harder to build, modify and debug these things, generally. Glue code is useful, but it definitely has a cost.


Add me to the club. I'm rooting for Julia for ML as well.


> I think their motto was "Walk like Python, Run like C."

I was also seduced by the concept of "octave, but with fast loops". This is really all that was needed. But I'm a bit disenchanted by the "advanced" additional features of Julia that add (unnecessary, in my view) complexity to the language. All that multiple-dispatch stuff, all those arbitrary types. I would love a compilation option that forced everything to be of the same type (a multi-dimensional array of numbers), and then it could really run like C, including compilation time.


Multiple dispatch is the main feature that makes everything else possible, not some additional feature that spoils things.


I disagree with the word "everything" in your sentence. If all you want to do is numeric computations, you just want a sort of "modern fortran" that compiles and runs fast. No need for multiple dispatch here. Julia almost is this language, if it wasn't for the multiple dispatch thing whose only effect is to slow the compilation (when all you have is the same numeric type).


> you just want a sort of "modern fortran" that compiles and runs fast

Well...if you want Fortran, you know where to find it? Fortran still compiles and runs fast. Just use Fortran, then? It's a perfectly fine language.

> if it wasn't for the multiple dispatch thing whose only effect is to slow the compilation (when all you have is the same numeric type).

How does multiple dispatch in Julia slow things down when in Julia (unlike, say, in CLOS) it's basically restricted to generating monomorphic functions on demand? If "all you have is the same numeric type" then only one version of a polymorphic function will get generated and that's as fast as generating code, say, for an equivalent monomorphic function in C.
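
You can see this directly in the REPL (a minimal sketch):

  f(x, y) = x * y + one(x)      # one generic definition

  # Only the specializations you actually call get compiled, and each is fully
  # static. With a single numeric type there is exactly one, with no dynamic
  # dispatch left in it:
  @code_typed f(1.0, 2.0)       # monomorphic Float64 code
  @code_llvm  f(1.0, 2.0)       # a handful of machine instructions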


Exactly. Fortran is fine and it has many advantages over Julia. It's not only the legacy code.

For one, Julia needs a humongous runtime. I can see in the future a compiler from a Fortran number-crunching routine to a WebAssembly SIMD-ready kernel. Julia in the browser is still a failure, and not for lack of effort. Doing that with Fortran is much simpler.

I tried to work in Julia and it has a whole lot of hidden annoying things, undocumented nuances, etc...

I find myself wishing for a language between bare-bones Fortran and big-but-buggy Julia, but I know it won't happen since the community is too small.


> For one Julia needs an humongous runtime.

To be fair, Julia needs that runtime precisely to generate compiled versions of polymorphic functions at run time, so as not to cause a combinatorial explosion of n-ary functions. Also, for interactivity, which includes adding new functions at run time. These are problems that Fortran isn't even trying to solve, so it's not quite fair to complain that Fortran doesn't require a runtime that solves them. But of course there seems to be a market for a way to generate a static binary for a Julia program under a closed world assumption that wouldn't require embedding a complete JIT compiler into the process.


Ask and you shall receive: https://github.com/JuliaLang/julia/pull/41936

That giant runtime enables all sorts of features that are used by unparalleled Julia packages.

Soon TM you'll be able to emit runtime-free static binaries if you don't need the dynamism. In fact, that's what the GPU compiler does already.


Huh? Multiple dispatch is what makes the language really shine. I miss it in pretty much every other language I use. I am utterly puzzled by how you can see this killer feature as a downside.

Single dispatch is an arbitrary restriction which should not exist. It only exists in most languages as a performance optimization because until Julia came along nobody could do multiple dispatch fast.


> until Julia came along nobody could do multiple dispatch fast

Is that correct/can you give a reference? Reading https://en.wikipedia.org/wiki/Multiple_dispatch#Use_in_pract..., Julia uses it more in its standard library, but I don't find any indication that Julia is doing multiple dispatch better than, say, CLOS (https://en.wikipedia.org/wiki/Common_Lisp_Object_System) or Cecil (https://en.wikipedia.org/wiki/Cecil_(programming_language)).


How about Dylan? :-)

I think one of the nice thing about Julia's "just ahead of time" monomorphization/devirtualization is that it allows a level of dynamism that also works on GPUs/TPUs. This post and linked paper help me understand a little bit of it: https://discourse.julialang.org/t/julia-inference-lattice-vs...

Is this level of dynamism required for conventional ML? Probably not. For physics-informed ML and probabilistic languages? Probably more likely.


TL;DR: Julia does a much better job than previous systems of de-virtualizing multiple dispatch. Most earlier systems (definitely CLOS, not sure about Cecil) allowed types of dispatch that were very hard to remove at compile time. Combined with Julia's willingness to generate lots of code, this means that most programs in Julia don't have significant amounts of dynamic dispatch, which is new.


So, where’s the long version? I can believe generating more code is cheaper now that we have gigabytes of memory, but does that make things faster, given that more code ≈ more cache misses?

Also, what types of dispatch does Julia disallow that others allowed?

Just curious.


> given that more code ≈ more cache misses

More code isn't randomly jumped around in, so there are not necessarily (or in practice) more cache misses. Hot paths in Julia compile down better than in many other languages (including C/C++/FORTRAN, where aliasing is much more of a problem), specifically since multiple dispatch gets resolved as much as possible (and often completely) at compile time.

If anything, by making per-type versions of things, there are fewer cache misses, since things that are similar stay together, needing fewer instructions on hot paths to look up types and do the other bookkeeping that other systems must do.

The proof is in the pudding - write (or find) some good code doing similarly complex things, and test them. Julia has C/FORTRAN speeds (and better) for a lot of important tasks, with the development flexibility of Python.
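
For example (a minimal sketch, assuming BenchmarkTools.jl), a hand-written hot loop has no dispatch left on the hot path and compiles down to vectorized machine code:

  using BenchmarkTools

  function axpy!(y, a, x)
      @inbounds @simd for i in eachindex(x, y)
          y[i] += a * x[i]
      end
      return y
  end

  x, y = rand(10^6), rand(10^6)
  @btime axpy!($y, 2.0, $x)    # comparable to the equivalent C/Fortran loop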


CLOS has :before, :after, and :around qualifiers, which Julia doesn't support, and CLOS supports multiple inheritance. Both of these make devirtualization much harder.

https://arxiv.org/pdf/1411.1607.pdf is a pretty good general resource for info about Julia's design decisions and has a section on multiple dispatch.

The reason that generating more code can be faster is that more code can actually result in fewer cache misses. Julia is so fast because it is really good at inlining, constant prop, and automatic vectorization. These optimizations all increase the amount of generated code, but remove conditionals that decrease code locality.


Addressing the first point: It certainly does make it faster on the happy path. Instruction cache performance is not a major concern when you're running in a tight loop.


> I am utterly puzzled by how you can see this killer feature as a downsides.

It depends on the usage that you make of the language. I don't use Julia as a general-purpose programming language, but only for doing numeric computations. For that usage, the multiple dispatch feature is not necessary (all my variables are matrices of the same numeric type), and indeed it has downsides. For example, the infamous "time to first plot" is so large in Julia precisely due to the need to support multiple dispatch. If this benchmark is becoming faster, it is only due to a huge and complex optimization effort. I would prefer if this effort could be spent on improving support for sparse matrices, for example.

I acknowledge that "time to first plot" may be a useless benchmark for many, even for most, people. But you can also acknowledge that multiple dispatch is a useless feature for a few graybeards like me, for whom multiple dispatch is the main reason why Julia feels slow. Indeed, when you use Julia as an interpreted language (e.g., write a Julia script), execution is very slow due to the complexities introduced by multiple dispatch, which essentially force all libraries to be recompiled upon each execution of my tiny script.


Multiple dispatch isn't useless for you. It's the reason the numerics are so good even when everything is the same type. Multiple dispatch makes it way easier to write generic code which means that all the libraries you use are about 10x easier to write.


Their point is that they're not trying to write generic code. In that case, Julia should act like an interactive Fortran, which would be nice, but then it needs to fix latency and binary building for the "this is pure Float64 and we know it" case. The only gain one really gets here is that type inference means you can write somewhat simpler code, but that's not a huge win. That's a valid criticism that should hopefully be addressed soon, with https://github.com/JuliaLang/julia/pull/41936 being one of the biggest steps to getting there, along with the further precompilation work.


Chris,

Not to second-guess you (you know far more about Julia than I) but my experience is that people who think that their code is completely uniformly typed still win massively from Julia.

It starts with "just dense matrices with Float64" and then, like the poster above, "but with some nice sparse matrix support" and then "oh and special handling for Vandermonde matrices", "oh and tridiagonals" and eventually "my matrices are hyper-<mumble>-<mumble>-symmetric and I can compute vector dot products really fast". At that point, they have been using multiple dispatch for months without knowing it.

In my own case it was "I don't need multiple dispatch" until it was "oh... it surely is nice that the H3 geo-hashes work transparently with LibGEOS even though the underlying C libraries don't work together."

The point is that Julia makes things work together far better than anything else I have seen. It even makes completely asocial C libraries talk to each other.


Thanks for your answer! I'm the one who sees everything as "just float matrices". Do you have a simple example of a numeric algorithm where multiple dispatch really makes a difference? (as opposed to simply allowing you to pass different types of numbers to your functions). I have trouble imagining that.

In Octave/Matlab I can already use the "sum" function over dense and sparse matrices, but you wouldn't say that the language has multiple dispatch. It doesn't seem like a big deal. More importantly, I want the "sum" function to have exactly the same meaning regardless of the data type.

I abhor the idea of hiding algorithms into types, on a very fundamental level. Allowing an algorithm to behave differently depending on the type of the input data seems totally wrong to me. The famous Stefan Karpinski talk at JuliaCon 2019 is one of the most horrifying videos I've ever seen. I still wake up at night trembling and with cold sweat when I dream about this talk.


> I want the "sum" function to have exactly the same meaning regardless of the data type.

You really don't. If you have a sparse matrix, you don't want to spend 99% of your time adding zeros.

In general, the advantage of multiple dispatch is that you automatically get optimal algorithms for a variety of type combinations. To show why this matters, look at matrix multiplication. If you have a high-level function that multiplies matrices, you want to call the appropriate BLAS function (for dense inputs). That function will be one of SGEMM, SSYMM, STRMM, DGEMM, DSYMM, DTRMM, CGEMM, CSYMM, CTRMM, ZGEMM, ZSYMM, or ZTRMM depending on the type (and element type) of matrix (this is a simplified example; in the real world you might also have diagonal, banded, CSR, CSC, block, or any of 20 or so different matrix types). Without multiple dispatch, you have to either write a bunch of if-else statements to choose the appropriate one, or you just convert everything to a dense (and probably double-precision) matrix first. The first one is totally unmaintainable, and the second one will make your program an order of magnitude slower when you ignore structure inherent to your problem. With multiple dispatch, you just call * and it does all the hard stuff for you.
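
A minimal sketch of what that looks like from the user's side, using the standard-library LinearAlgebra wrapper types:

  using LinearAlgebra

  A = rand(1000, 1000)
  S = Symmetric(A + A')       # wrapper type encoding the structure
  D = Diagonal(rand(1000))
  B = rand(1000, 1000)

  A * B    # dispatches to the dense GEMM path
  S * B    # dispatches to the symmetric (SYMM) path
  D * B    # O(n^2) row scaling; the diagonal is never densified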

The proof that multiple dispatch is necessary is that most numerics libraries that aren't in Julia make ad-hoc and slow implementations of it internally. For example, here is PyTorch's implementation: https://pytorch.org/tutorials/advanced/dispatcher.html


Thanks, here you presented an example that actually makes sense to me.

In the case of Octave/Matlab, of course computing the sum() of a sparse matrix does not traverse all the zeros. But the result is the same as if it did, just much faster. There are also many types of sparse matrices; I wonder how it works internally, it must have a similar mechanism. Probably, since it is interpreted in real time, each time the "sum" function is called, it checks the type of the arguments and calls the appropriate function.

> most numerics libraries that aren't in Julia make ad-hoc and slow implementations

The word "slow" has a relative meaning here... if you take into account the time of the first compilation. For example, calling an Octave script that performs a matrix product is faster than the equivalent Julia script.


> the result is the same as if it did

This isn't quite true. Differently ordered operations can lead to different results due to floating point math, but in general, the point of multiple dispatch isn't to produce different results for different types. The example from Stefan's talk does this when contrasting with operator overloading because it's easier to show a difference in results than a difference in performance.

I don't know the specifics of how Matlab/Octave do this, but you're probably right about runtime checks.

It's true that compilation time will make Julia slower than Octave for a single multiplication, but if you are doing a bunch of them (especially if they are small), Julia can perform the computations with much lower overhead since it doesn't have to do type checks for stable programs. Also, Julia lets you define specialized matrix types which can be asymptotically faster than Octave for specific circumstances. A great example is BandedBlockBandedMatrix from https://github.com/JuliaMatrices/BlockBandedMatrices.jl, which is extremely useful for solving PDEs quickly but isn't implemented in any of the other languages, because they have to write all their dispatch systems manually for each algorithm.


Oh I definitely agree with you, and I think the OP's response to this post starts to go into subjective ergonomic-feel territory where, well, okay, everyone feels differently. But there is a technical point that precompiling and building binaries for the simple Float64 case is a bit harder than it should be compared to C or Fortran, so there is still a reason to grab Fortran every once in a while. It's not unfixable, but it's something we should acknowledge we can improve.


Can you say more about making H3 and LibGEOS work together?


I don't think latency is multiple dispatch's fault. Certainly, compilation caching failures aren't.


It kind of is. Multiple dispatch means that a static binary has to deal with most of the downsides of heavily templated C++ code (but much worse since everything is templated). Fixing this is highly non-trivial.


If that's seriously what you're looking for, you should really consider Fortran. There is a reason it is still very widely used in scientific domains.


That's terribly interesting, given that this popped up very recently too: https://www.stochasticlifestyle.com/engineering-trade-offs-i...

However, while they speak of ML they refer only to deep learning, and I was wondering where Julia stands in terms of scikit-learn equivalents and if they ever took off.


There is ScikitLearn.jl if you want a literal port, and MLJ.jl if you want a spiritual one.


Do you know how much MLJ.jl is used out in the wild and how well it is maintained? I don't see it coming up much in discussions in, e.g., the Julia Zulip or Discourse forums, not nearly as much as Flux for example. It may just be that it's outside my visibility bubble, but the only context I see it mentioned in is similar to this, where someone brings it up as an available option.


I think a lot of the answer is that MLJ is pretty stable and works, so there's a lot less discussion of it. Discussion indicates a combination of interest and complexity.


Thanks, that's really good to know.


I've used MLJ for some small projects at work without issue.


Having played around with Julia recently there are two main reasons I would not recommend using it for ML:

1. Plotting takes too long. Forget fast iterations. You'll be left waiting a non-trivial amount of time for your first plot. The latency between tests is simply not good enough.

2. Libraries are just not mature. Plotting anything not too standard yields broken graphs, and you have to look through a billion backends to find one that does not break (hello twinx). Matplotlib is slow for some graphs, but more often than not you get the right visualization (except for that suptitle-being-cut-out-of-the-graph bug that will never be fixed).

The requests library is not robust enough. This one is quite possibly my fault. I have received "bad headers" errors for some websites, but doing the exact same "get" with Python works just fine.

I also get the impression that libraries die all the time. Could be just an impression, but it is quite common to find top Stack Overflow answers that point you to dead libraries (with the common "use this other library instead" hint).

I like Julia's speed and well-done memory management (especially from what I saw of the DataFrames.jl high-memory benchmarks). But right now I'd not use it in a critical environment. (And I also suck at reading and understanding the error messages thus far, but that's likely a lack of experience on my side.)


The Julia REPL starts in less than a second, importing the Plots module takes a couple of seconds, and the first plot appears in a few more seconds (on my weak computer). If you keep the REPL open, subsequent plots are quite speedy. This may indeed not be snappy enough for some workflows that involve starting a fresh Julia process for each iteration—in which case, it’s pretty simple to create a sysimage with the plotting functions and anything else you use routinely precompiled in: https://www.matecdev.com/posts/julia-package-compiler.html
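
A minimal sketch of that, assuming PackageCompiler.jl (the file names here are placeholders; the precompile script is just a Julia file exercising the plotting calls you care about):

  using PackageCompiler

  # Bake Plots and its compiled code into a custom system image:
  create_sysimage([:Plots];
                  sysimage_path = "plots_sysimage.so",
                  precompile_execution_file = "precompile_plots.jl")

  # Then start Julia with:
  #   julia --sysimage plots_sysimage.so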


For concrete numbers, here's my experience on a super-crappy laptop:

* `add Plots` on a new temp environment takes about four and a half minutes, most of which is precompilation.

* `using Plots` took 16 seconds, and

* a `plot` call after that is 7 seconds.

For comparison, on 1.6, the times are seven and a half minutes for precompilation, 17 seconds for `using` and 10 seconds for the first `plot`. I'm not patient/masochistic enough to try the same on 1.5 too, but the trend is definitely one of continuous improvement.

This is certainly not the sort of machine any production Julia code will be run in, but it is in the category of machines that a lot of students and many others trying out the language will have access to, so it gives an idea of their experience.

The times on 1.6 and 1.7 have been a qualitative change for me. plot being sub-10 seconds means that instead of it being "run it, switch to something else while I wait (potentially losing my mental context), come back and hope it's done", it's now "run it, stare at my dog for a few seconds, the plot is ready".


My dog loves Plots.jl


I think it depends on the version of Julia the OP was using. If you're using 1.6/1.7 then startup and time to first plot are vastly improved. Earlier versions are quite a bit slower. There's been a lot of improvements in these areas in recent releases.


Quite right. Since I’ve been keeping track, at about v1.5, there has been significant speedup with every release.


That's true, though they said "recently".

I don't think time to first plot is that bad anymore.

Time to first gradient can be bad in Zygote/Flux.


As a longtime user of Python for ML/AI, I love Julia... but agree that most people will find Julia challenging to use for everyday data munging, or for developing production code, say, in a business environment. The Python ecosystem (people, resources, libraries, frameworks, etc.) is superior for most common, plain-vanilla use cases.

However, if you are developing compute-intensive scientific software, I would recommend switching to Julia as soon as possible. IMHO, Julia is superior to the likes of Python, Fortran, C++, and C for those use cases.

--

PS. By the way, you can fix matplotlib's suptitle issue by calling this Figure method after the fact:

  fig.tight_layout(rect=(0, 0, 1, 0.95))  # leaves 5% of top space for suptitle


For many compute-intensive tasks Julia beats the competition hands down. I got to use for loops again and it feels wonderful (in Python you either write it as some numpy operation or meddle with numba to run at some speed).

But many of the "main" ML packages are written in actually fast languages and are well optimized, so Python's inefficiencies can be mostly ignored (pytorch/tensorflow/catboost are really fast; sklearn is decent and statsmodels is dead slow).

But again, if you need some specific image processing algorithm, you either have to hope for it to be implemented in C++ with Python bindings (like opencv) or you are in for a world of pain.

With Julia you can just count on it being fast and implement it yourself natively.


> you either have to hope for it to be implemented in C++ with Python bindings (like opencv) or you are in for a world of pain.

Yes, I agree -- Julia is superior for scientific research -- including developing or working with new, cutting-edge algorithms that are not yet implemented in any third-party library.

That said, if one is willing to do a bit of upfront work, in many and perhaps a majority of cases, it's possible to implement new algorithms/ideas in terms of tensor/matrix/vector operations, i.e., using fast library primitives instead of resorting to slow Python loops.


Actually

  using PyCall
  plt = pyimport("matplotlib.pyplot")
and just using it as in Python works very seamlessly as an alternative.
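
For instance (a minimal sketch of that workflow):

  using PyCall
  plt = pyimport("matplotlib.pyplot")

  x = range(0, 2π; length = 100)
  plt.plot(x, sin.(x))
  plt.title("matplotlib via PyCall")
  plt.show()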


It’s worth noting that PyPlot.jl is a nearly seamless way to use matplotlib from Julia. JIT compilation of the middle layer does mean that it suffers from time to first plot problems though.


That "Unreliable gradients" part is what I worry about most. Why would I use a product when it is just "returning incorrect gradients"?


I'd be interested to know whether that occurred pre- or post- Zygote.jl starting to use ChainRules.jl internally. Zygote's own rules used to be tested with a half-baked finite differences scheme, and it was easy to miss bugs. All rules in ChainRules are tested using the testing helpers in ChainRulesTestUtils.jl, which are very thorough.

Then, some bugs are not actually bugs. For example, the gradient of some function wrt the elements of a symmetric matrix is and should be different depending on whether the matrix is structurally symmetric or not. Or, when a function is not differentiable in the calculus sense, different ADs often have different conventions for what they will return. These cases come up all the time.


> For example, the gradient of some function wrt the elements of a symmetric matrix is and should be different depending on whether the matrix is structurally symmetric

Ugh. Yes and no, but mostly no.

A gradient is a collection of partial derivatives. The normal meaning of partial derivative with respect to symmetric matrix elements does not actually exist, because you can't vary them independently. This is actually a generic problem for any overparameterized constrained system. People still want to be able to use the standard tools of calculus though. You can reasonably do things like use Lagrange multipliers to do constrained optimizations. Or you can express a symmetric matrix in terms of only the n(n+1)/2 variables, treat them as a vector and get the gradient over that. (This is often what is done without actually describing the underlying method. This is the so-called "symmetric gradient"). But it's generally not what you actually want, even though it can lead to the same stationary points.

The best you can do is the unconstrained gradient restricted to the manifold of symmetric matrices (or whatever constrained object you are interested in). But even this is a (subtly) different thing than the gradient in the ambient space, even though it works in as large a set of cases as possible. Happily for symmetric differentials applied to symmetric matrices, the naive formulas of symmetrizing the gradient just work.

See, e.g. https://arxiv.org/abs/1911.06491 for some tracking down of the history and cogent analysis of what went wrong and https://saturdaygenfo.github.io/posts/symmetric-gradients/ for why it matters in practice (e.g. wrong and changing direction of descent for gradient descent, though still "downwards").


My point is I think simpler than what you're talking about, so let's replace "symmetric" with "diagonal." Suppose I have a structurally diagonal matrix, represented by a type `Diagonal` that can only represent diagonal matrices, and then a non-structurally diagonal matrix, represented by a normal dense matrix filled with zeros except on the diagonal.

I pass this as input to some function that uses more than just the diagonal and compute the gradient (here a matrix). In general, for the dense matrix, the resulting gradient will not be diagonal. Why should it be? If your function doesn't care if the matrix is diagonal, neither does the AD.

But what about for `Diagonal`? Well, the AD could handle this different ways. First, it could interpret all matrices as embedded in the dense, unconstrained matrices, in which case it would return a dense non-diagonal matrix. Or it could ignore that `Diagonal` represents a matrix entirely and instead return some nested structure that contains gradients of its stored fields, here the gradient of a stored diagonal vector. Similarly, it could represent this nested structure as just another `Diagonal`.

There are various reasons why an AD might do any one of these; none of them is right or wrong. But note that the only case that gives you the same answer as the dense one is the one that promotes to a dense input, which is basically discarding all structure to begin with.

Most ADs don't ever have to work with matrix polymorphism, so their user bases don't need to think about this, but Julia has _many_ polymorphic matrix types, so Julia ADs have to take this seriously, and if new users naively pass a polymorphic matrix type as input to some `gradient` function, they might be perplexed at what they get back and think the AD engine has a bug. See the ChainRules talk at EuroAD 2021 for more details: https://youtu.be/B3bC49OmTdk?t=761
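
A minimal sketch of the situation (using Zygote here; exactly what comes back for the structured input depends on the AD and its version, which is the point):

  using Zygote, LinearAlgebra

  f(A) = sum(abs2, A)

  Adense  = Matrix(Diagonal([1.0, 2.0, 3.0]))   # "non-structurally" diagonal
  Dstruct = Diagonal([1.0, 2.0, 3.0])           # structurally diagonal

  gradient(f, Adense)[1]    # a plain dense 3x3 matrix, equal to 2A
  gradient(f, Dstruct)[1]   # may come back as a Diagonal, a dense matrix, or a
                            # structural tangent, depending on the AD's conventions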


Ah, yes. That is a different point than the one I thought you were making.

Both the symmetric and diagonal cases are lucky ones where the constrained derivative is the same shape and size as the original (as the constrained manifolds are flat). Keeping it as a Diagonal in code then just works.

This is in marked contrast to "unit vectors", "orthogonal matrices", or similar non-flat manifolds. The derivatives in these cases, although locally of the same dimension as the manifold, usually must be represented as something that spans the entire containing space once you deal with derivatives at multiple points.


Yes, precisely. It gets complicated very quickly. Cotangents for structural matrices in reverse-mode AD tend to wander away from the cotangent space at the corresponding point into the tangent space of the embedding for various reasons. Any time they wander away, there's the potential for huge performance losses. We use projections to routinely project them back to the correct cotangent space, but this is at best a band-aid, and we could certainly use a better solution.

For the interested, this arises from the combination of implicitly embedding in the forward pass (e.g. multiplying an orthogonal matrix by a vector is implicitly the same as making the matrix dense first and then multiplying), where you would then need some sort of projection in the reverse pass. If your AD operates on the scalar level, you get this for free. But such ADs don't make full use of optimized BLAS routines. You can get that by writing custom AD rules, but then in the reverse pass you usually end up in some embedded cotangent space and need a projection to set things right. There are definitely trade-offs.


Thanks for the references. We're trying to write up a paper about all this work, and (to us) it seems obvious that differential geometry is the right framework.

What reverse mode AD talks informally about as "gradients" are cotangents, and if you are constrained to a submanifold of another manifold (such as the space of all symmetric matrices, embedded in the space of all NxN matrices) then your cotangent is a projection of that in the larger space. This reduces to the obvious thing for symmetric matrices, as you say, but there are less obvious AD-relevant cases (like the space of orthogonal matrices, or the space of "ranges", vectors with uniformly spaced elements) where getting it right with your bare hands is tricky.

I don't see any of the relevant terminology in these links, but they do seem to be thinking about related problems. Perhaps they have re-invented the wheel? Amused to read that standard cookbooks seem to have faithfully reproduced the recipe for a square wheel, for decades.


Julia is great for NEW ML research. But it has serious downsides for mundane or even above average production work...

Flux has historically been more of a research project than a production library. That's OK; it holds a lot of promise. But time to first gradient can be excessive - minutes to discover a small bug is excessive. Last I checked, all of the adjacent libraries around it were broken due to incredibly frequent changes to the API.

MLJ doesn't impress me. There was a serious opportunity to change the way people do routine modelling using the strengths of Julia. They basically made a half-baked sklearn instead. Missed opportunity in my mind.

In general the components for statistical libraries are exceptional. However, the ecosystem falls short of letting you do things with them in a stable way.

The best libraries imo for data science work stem from the excellent DB adapters, optimization libraries, DataFrames.jl, Turing.jl, etc. Anything beyond that and you're usually better off just rolling your own or using another language altogether... The current state of production (i.e. not code running in a notebook) is also OK at best...


> Julia is great for NEW ML research. But it has serious downsides for mundane or even above average production work...

This was how PyTorch started out as well.


What are your concerns with MLJ?


MLJ doesn't concern me in a bad way. At its heart it's an abstraction over ML components across many packages, with some machinery on top for parameter tuning, etc. That's fine.

What it's missing is... impactful improvements on the failures of libraries like this. Docs need more examples and fewer pointers to packages that wrap other packages (albeit they are written well). Despite being a multi-year effort by very big names, it still feels incomplete, and if someone wasn't sold on Julia there is technically no reason not to use R or Python or Matlab instead. Missed opportunity is all.


The responses confuse "machine learning" with neural networks so far as I can see. I do lots of statsy things: good luck with that outside of R.



Until you've used: https://r-nimble.org/


looks good!

The ecosystem in Julia is quite strong for MCMC libs because people do not have to lower something to C++ to develop such a library: https://discourse.julialang.org/t/mcmc-landscape/25654/

Of course, users might prefer something like Turing, but I think Julia tends to blur the differences between user and developer more than most other languages do (for good or worse), since everything is in one language.
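
As a minimal sketch of what a pure-Julia probabilistic program looks like (assuming Turing.jl, which re-exports the Distributions used below):

  using Turing

  # A coin-flip model: infer the bias p from observed flips y.
  @model function coinflip(y)
      p ~ Beta(1, 1)
      for i in eachindex(y)
          y[i] ~ Bernoulli(p)
      end
  end

  chain = sample(coinflip([1, 1, 0, 1, 0, 1]), NUTS(), 1_000)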


Thanks! Didn't know this exists


> good luck with that outside of R.

There's always RCall for R inside of Julia, the best of both worlds.


Even professors doing trendy "AI" research conflate the two terms.


> Even professors doing trendy "AI" research conflate the two terms.

Only because doing "AI" gets you bigger grants.


Tibshirani's Glossary

   Machine learning               Statistics

   network, graphs                model
   weights                        parameters
   learning                       fitting
   generalization                 test set performance 
   supervised learning            regression/classification
   unsupervised learning          density estimation, clustering

   large grant = $1,000,000       large grant = $50,000

   nice place to have a meeting:  nice place to have a meeting:
    Snowbird, Utah, French Alps    Las Vegas in August


Statistics is usually considered a subset of ML, except maybe by statisticians.

While R definitely leads the pack on variety of useful stats packages, there are excellent tools for certain tasks in other languages. I do a lot of probabilistic programming, and PyMC in Python and Turing in Julia are excellent packages. And both languages have official Stan interfaces. Not knowing R has not been a problem for me.


I'm sure it depends on your vantage point, but I would argue that ML is more likely to be considered a subset of Statistics than the other way around.



It just means that you have problems you're not aware of :-)


or that the tooling for the problems I do have is adequate in multiple languages.


When will Enzyme and Diffractor hit prod? The default right now for Flux is still Zygote.


Absolutely adore prototyping in Julia, absolutely heinous to work with in production. Its memory usage is insanely high and its calc speeds are slow when you don't have the insanely high-end processing equipment needed. If they could just singularly focus on reducing memory consumption, then the language would likely become _the de facto_ language for ML and analysis, but until then it really gets bogged down by just how much memory is needed for the simplest of functions.


Paraphrase: ML in Julia is for Serious Scientists Doing Very Sciencey Things. Things you wouldn’t understand. But your desire to train deep learning models on GPUs and deploy them as apps elsewhere? ...Aw, c’mon, that’s boring. You can just use PyTorch for that. Now, who's up for a new round of Neural ODE benchmarks?!



