
You don't need induction for (x + x'ε)^(n+1). The binomial formula can be applied once the arithmetic on dual numbers is introduced.
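To spell that out (a quick sketch, using nothing beyond $\epsilon^2 = 0$): $(x + x'\epsilon)^{n+1} = \sum_{k=0}^{n+1} \binom{n+1}{k} x^{n+1-k} (x'\epsilon)^k = x^{n+1} + (n+1)\,x^n x'\epsilon$, since every term with $k \ge 2$ carries a factor of $\epsilon^2 = 0$.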

> We can use this result to prove the same property for any smooth function f. Examining the Taylor expansion of f at zero (also known as its Maclaurin series):

That's not exactly true. For example, arctan = tan^-1 is smooth on R, but its Taylor series at 0 only converges for |x| <= 1. However, power series expansions about other centers exist for |x| > 1, where similar arguments would apply. Also, Taylor series and Maclaurin series aren't the same thing: a Maclaurin series is a Taylor series centered at 0.
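For concreteness, the series in question is $\arctan x = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \cdots$, which has radius of convergence 1 (the singularities of $\frac{1}{1+z^2}$ at $z = \pm i$ limit it), even though arctan itself is smooth, in fact real-analytic, on all of R.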

Lastly, is "differential programming" a new term? It seems weird to me that the machine learning community continues to reinvent terminology for already existing things. What is wrong with automatic differentiation? (Not necessarily a question for the author.) It's not really a new paradigm in the sense of "logic programming", "object-oriented programming", "functional programming", etc., and is instead just a technique.




The idea behind the term "differential programming" is that many traditional autodiff systems force you into a highly restricted subset of the language or a DSL (e.g. no loops, no if statements, etc.). Differential programming is a term used to describe systems that let you take derivatives of your entire source code. This lets you do things like AD through a simulation or optimization algorithm, which can be really powerful for applying ML techniques to domains that are less typically suited to ML approaches (circuits, physics, differential equations, etc.)


But I would wager that those are poor implementations of automatic differentiation, as the property that automatic differentiation works with for loops, if statements, etc. is inherent to automatic differentiation. So differential programming seems like automatic differentiation just implemented properly.

I've written some simple forward-mode automatic differentiation implementations in a few languages, and it's akin to just using a library. It doesn't seem like a paradigm to me.

Reverse-mode automatic differentiation is basically backpropagation. So again, ML seems to just like to rename things. I studied mathematics and have recently been trying to learn some ML. It's pretty annoying that a lot of mathematical terms seem to have been renamed or coopted for something else.


Autodiff does not work with for loops or if statements. The current solutions effectively pick a few promising traces through the program and then assume that nothing else exists. To handle it more elegantly (for things like preserving equational reasoning or avoiding exponential blowup) you need to address it at the level of language semantics.


> Autodiff does not work with for loops or if statements.

Is that necessarily true? Here is an incomplete automatic differentiation implementation that handles if statements just fine in a function definition. Unless you mean something else.

    type Dual = {Real: float; Epsilon: float} with
        static member (~-) (x: Dual) = {Real = -x.Real; Epsilon = -x.Epsilon}

        static member (+) (x: Dual, y: Dual) = { Real = x.Real + y.Real
                                                 Epsilon = x.Epsilon + y.Epsilon }
        static member (+) (x: Dual, c: float) = {x with Real = x.Real + c}
        static member (+) (c: float, y: Dual) = {y with Real = c + y.Real}

        static member (-) (x: Dual, y: Dual) = x + (-y)
        static member (-) (x: Dual, c: float) = x + (-c)
        static member (-) (c: float, y: Dual) = c + (-y)

        // product rule on the dual part: (a + b*eps) * (c + d*eps) = a*c + (a*d + b*c)*eps
        static member (*) (x: Dual, y: Dual) = { Real = x.Real * y.Real
                                                 Epsilon = x.Real * y.Epsilon + x.Epsilon * y.Real }
        static member (*) (c: float, y: Dual) = {Real = c; Epsilon = 0.0} * y
        static member (*) (x: Dual, c: float) = x * {Real = c; Epsilon = 0.0}

    let dcos (x: Dual) = {Real = cos x.Real; Epsilon = -(sin x.Real) * x.Epsilon}
    let dsin (x: Dual) = {Real = sin x.Real; Epsilon = (cos x.Real) * x.Epsilon}

    let differentiate (f: Dual -> Dual) a =
        let x = f {Real = a; Epsilon = 1.0}
        x.Epsilon

    let testFunction (x: Dual) = if x.Real < 0.0 then dcos x else dsin (x*x - 3.0*x)
Using that gives:

    > differentiate testFunction (-1.0)
      0.8414709848078965

    > differentiate testFunction 1.0
      0.4161468365471424

    > differentiate testFunction 0.0
      -3.0
Now, of course, one needs to be careful interpreting the result at a = 0.0, because testFunction is not differentiable at that point (there's a jump discontinuity there), yet we still get a value back. As far as I know, this is simply an issue with automatic differentiation: it only correctly tells you what the derivative is if the derivative exists at the given point.
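To make the non-differentiability at 0 concrete: $\lim_{x \to 0^-} \cos x = 1$ while $\lim_{x \to 0^+} \sin(x^2 - 3x) = 0$, so the two branches disagree at 0 and there is no derivative there; the -3.0 we get back is just the derivative of the branch the input happens to land in.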


This "discretizes then differentiates" to borrow terminology from [1] which is one of the more accessible presentations and papers. The program might evaluate correctly, but equational reasoning (like you might want for any kind of automated optimizations) is broken. In a toy example like this where you're doing everything manually then you probably don't care, but for larger systems, it gets tiring to do the mental equivalent of assembly programming.

[1] https://people.csail.mit.edu/sbangaru/projects/teg-2021/


This isn't a toy example, though. It's the start of a library. Once you've developed the dual numbers and differentiate function and defined the dual number versions of all elementary functions, then you have a full (forward-mode) automatic differentiation library that can just be used. You wouldn't have to do anything manually. You'd just define your functions using this library instead of the built-in functions, since you can use the dual number functions both to differentiate or simply to evaluate (setting the dual part to 0).
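For instance, extending the snippet above is just more of the same (a sketch; dexp, evaluate, and g here are my own illustrations, not part of the snippet above):

    // exp on dual numbers: d/dx exp(x) = exp(x)
    let dexp (x: Dual) = {Real = exp x.Real; Epsilon = (exp x.Real) * x.Epsilon}

    // plain evaluation: seed the dual part with 0.0 and read off the real part
    let evaluate (f: Dual -> Dual) a =
        let x = f {Real = a; Epsilon = 0.0}
        x.Real

    // the same user-defined function can then be evaluated or differentiated
    let g (x: Dual) = dexp (x * x) - 2.0 * dsin x
    // evaluate g 1.0       -> exp 1.0 - 2.0 * sin 1.0
    // differentiate g 1.0  -> 2.0 * exp 1.0 - 2.0 * cos 1.0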

> This "discretizes then differentiates"

Not sure what you mean. It defines dual numbers, then defines elementary functions on dual numbers (I only did two as an example). From there, you get differentiation for free (i.e., automatically). The only thing that was done manually was defining the testFunction. Everything else would be part of a library that you'd consume.

I'm not sure what you mean by "equational reasoning is broken".

Thank you for the link to the paper. Seems interesting, and I'll read through it more. Although, it is discussing differentiating integrals, which is where their language "discretize-then-differentiate" comes from. From this paper, I sort of get a sense of why differentiable programming might make sense as a concept, but I've only ever seen the term introduced with automatic differentiation, which is what I was balking at (given the content of the original post here). I'll keep reading this paper, but I think what you've mentioned before hasn't convinced me. Thanks for the discussion.


> In a toy example like this where you're doing everything manually then you probably don't care, but for larger systems, it gets tiring to do the mental equivalent of assembly programming.

But by this argument (which sounds plausible to me) you have defeated your previous claim that differential programming is really a new paradigm, as it seems you have adopted what bmitc wrote earlier, that differential programming is not a new paradigm but "seems like automatic differentiation just implemented properly".


There's no contradiction: autodiff is a method of implementing differentiable programming. In this example, it is implemented as a type that handles a trace of a program, but everything else is left to the programmer. This is a problem because most of the code I would want to write is not a single trace!

Analogously, I could write a program in C that does message sends and organizes code in a design pattern called "objects" and "classes". Incredibly painful, but workable sometimes. Some people even call it "object oriented C" and go on to create a library to handle it like [1]. Is object orientation not a paradigm because I've implemented a core piece as a library?

No, because that misses the intangible part of what makes a paradigm a paradigm: I structured my code this way, for a reason. In OOP, that reason is the compartmentalization of concerns. The underlying OOP mechanism gives me a way to reason about composition and substitution of components to minimize how much I have to reason about when writing code. Similarly, in differentiable programming, the differentiability of all things gives me a way to reason about the smooth substitution of things because it more easily lets me reason about how the machine writes code.

[1] https://en.wikipedia.org/wiki/GObject


Seems we're arguing about definitions. Currently differentiable programming seems to be this vaguely defined term (I don't get what you mean by smooth substitution), with autodiff being its only (proper) instantiation.

You say autodiff is actually not representative of differentiable programming. But if there aren't any other good examples that illustrate differentiable programming, how is differentiable programming (currently) more than autodiff?...


@bmitc: Reading your replies (some of which seem to have been written at the same time I wrote mine), it seems we are on the same page; I'm also a mathematician and I also have some qualms with how people invent new names for automatic differentiation :) I had a look at your bio and couldn't find an email address. Would you perhaps be interested in having a longer, scientific discussion about AD?


Julia's AD is compatible with control flow. They have their own issues, but Zygote + ChainRules actually work pretty well


Mathematician here.

> Differential programming is a term used to describe systems that let you take derivatives of your entire source code.

I think the way this is stated is incorrect. What would the derivative of a program that outputs the reverse of its input string be? It would be meaningless.

It seems rather to be the case that it describes systems that let you take derivatives of your entire source code where that source code describes a numerical function. Perhaps it is obvious to state this, but I keep seeing differentiable programming being described as "taking a derivative of the source code" and it seems annoyingly imprecise.

> This lets you do things like AD through a simulation or optimization algorithm

This is also, taken literally, not a correct statement. It's not the algorithm which you want to differentiate, but a hard-to-differentiate function within the algorithm (which is always a gradient-descent type algorithm, as it otherwise makes no sense to consider derivatives; there is the entire domain of derivative-free optimization BTW). So, in case of (stochastic) gradient descent to train a neural network, that would be the neural network that you want to use AD on, not the gradient descent algorithm. Obviously.
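To make the division of labour explicit, here is a toy sketch using the Dual type and differentiate function from the F# snippet elsewhere in this thread (the loss function and the hyperparameters are made up for illustration): AD supplies the gradient of the function; the descent loop itself is ordinary code.

    // gradient descent on a scalar "loss", with the derivative supplied by forward-mode AD
    let loss (w: Dual) = (w * w - 2.0) * (w * w - 2.0)   // minimised at w = +/- sqrt 2

    let rec descend w steps rate =
        if steps = 0 then w
        else descend (w - rate * differentiate loss w) (steps - 1) rate

    // descend 1.0 100 0.05 converges to roughly 1.4142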


> So, in case of (stochastic) gradient descent to train a neural network, that would be the neural network that you want to use AD on, not the gradient descent algorithm. Obviously.

The specific proposal of differential programming as a paradigm that goes beyond simple applications of gradient descent for optimising NNs is exactly to apply gradient descent to optimise learning (and other) algorithms themselves. This may be to select hyperparameters, or to select among a family of algorithms, or to include differentiable constraints in the solver, etc.

I wonder, are you familiar with the now-classic paper, Learning to learn by gradient descent by gradient descent?

[0] https://arxiv.org/abs/1606.04474


I don't really see the distinction. In the basic case of using gradients to optimize a NN, what's really being optimized are the parameters of a function expressing the NN error. We're just following the gradient downhill to find whatever minimum of this function the gradient leads us to (hopefully the global minimum, as is likely to be the case with lots of parameters).

In this "learning to learn" paper the optimization function (used to update the gradients at each step of the gradient descent algorithm) has been parameterized, and we're instead using gradient descent to find a minimum of that parameterized function.

So, yes, differentiable programming isn't limited to minimizing the losses of neural nets, but as the grandparent poster pointed out, the whole concept obviously only applies to parameterized programs calculating some numerical output that we want to minimize.


Wow, it’s interesting that a paper from 2016 is already classic. This field moves fast!


Well, I'd say it's a "classic" within the field of differentiable programming, which is a new field, so... everything is relative.


Agreed about the binomial formula, but don't you need induction to prove the binomial theorem? That is, if you have some sort of set that's like the naturals except that Peano's axiom of induction doesn't apply (such as the naturals plus Alberto and Cristina, who are one another's successors) is the binomial theorem necessarily true in it?

Agreed about Maclaurin series.

I didn't know there existed smooth functions with divergent Taylor series. I don't understand how that happens; if we're looking at term $n$ of the Taylor series around some point $a$, it's $f^{(n)}(a) \frac{(x - a)^n}{n!}$, where $f^{(n)}$ is the nth derivative, right? I guess you could have a function whose derivatives eventually start increasing so fast that even if (x - a) is 10^{-100}, the exploding derivative eventually wins?

PARI/GP helpfully informs me that deriv(atan(x)) is 1 - x^2 + x^4 - x^6 + x^8 - x^10 + x^12 - x^14 + O(x^16), which sure sounds like the sort of thing that would diverge when |x| > 1. But presumably that's based on the Maclaurin series for arctan, and you'd get a different result if you were taking a Taylor series around 0.9 or something? In response to taylor(atan(x-9/10), x) PARI/GP says things I will not repeat here, so I guess blindly pounding on the keyboard isn't going to get me very far. Some guidance could be helpful.

I think differentiable programming is a new paradigm in the sense of logic programming and functional programming, and I do think it goes beyond just autodiff (though the linked article only explains autodiff). To the extent that a programming system is differentiable, you can use gradient descent to search for programs that minimize a loss function, not just inputs that do. But even just searching for inputs that minimize a loss function is a pretty different way to program than the conventional approach, even though Ivan Sutherland proposed it as a general approach in SKETCHPAD.

An example of a differentiable programming system is in https://arxiv.org/abs/1605.06640 "Programming with a Differentiable Forth Interpreter", 02016.


> Agreed about the binomial formula, but don't you need induction to prove the binomial theorem?

Yes, that's the point. :) In that, they're basically re-proving the binomial theorem (or more accurately, using the same method) rather than just using its results with the new dual number arithmetic.

Michael Spivak's Calculus has some great chapters on Taylor polynomials, Taylor's remainder theorem, and Taylor series. I've been looking into this stuff lately, to try and justify mathematically why automatic differentiation works (I've never read anything that does so, including several published papers), and that's where I was reminded of the fact that not everything is gold with Taylor's polynomials on real numbers. Maybe (?) the introduction of dual numbers gives something akin to complex analysis' concept of analytic. Not sure and not there yet.

> I think differentiable programming is a new paradigm in the sense of logic programming and functional programming, and I do think it goes beyond just autodiff (though the linked article only explains autodiff).

After some of the responses, maybe that's where I'm balking at. In that, I've only ever seen differentiable programming mentioned in the context of automatic differentiation. One of the papers posted led me to be more comfortable with the concept of differentiable programming as a distinct thing, in a sort of Mathematica sense, but I'll definitely need to read some more.


> I've been looking into this stuff lately, to try and justify mathematically why automatic differentiation works.

What do you mean by that? Isn't the theory (at least the mathematical, abstract version of it) already fully represented in books like Griewank that I mentioned in another comment here in this post?

I remember seeing somewhere a rather well-developed theory using dual numbers (but this was a purely mathematical development). I think there is a bit of a division between computer scientists and mathematicians, so it could be that the whole machinery has already been developed mathematically and is now being re-developed under the mantle of theoretical computer science, recast in terms of a programming language/semantics that supports the mathematics "natively". But that is more of a guess on my part.

> One of the papers posted led me to be more comfortable with the concept of differentiable programming as a distinct thing.

Which paper was that?


> Isn't the theory (at least the mathematical, abstract version of it) already fully represented in books like Griewank that I mentioned in another comment here in this post?

I guess it's in there somewhere, but like many of the papers, I find that book overcomplicates things and just isn't to my liking. I also don't see any mention of dual numbers, although they're probably somewhat implicit in the development (maybe). I'm mainly interested in the dual number approach. With dual numbers and forward-mode automatic differentiation, there's really no reason to overcomplicate the development by discussing "propagating tangents". The development can be done much more intuitively because, using dual numbers, automatic differentiation just falls out without any discussion of propagation.
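The identity doing the work is $f(a + b\epsilon) = f(a) + f'(a)\,b\,\epsilon$, which holds for polynomials (by the binomial expansion with $\epsilon^2 = 0$) and is taken as the definition for the dual-number versions of the elementary functions (that's exactly how dcos and dsin were written in the snippet upthread). Evaluating at $a + \epsilon$ then carries the value and the derivative along in a single pass, with the product and chain rules built into the arithmetic.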


I think autodiff is mostly interesting because of optimization by gradient descent, and gradient descent is only efficient using reverse-mode autodiff, which can't be done with dual numbers.
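Roughly: for $f : \mathbb{R}^n \to \mathbb{R}$, forward mode gives one directional derivative per pass, so a full gradient costs on the order of $n$ evaluations of $f$, while reverse mode gives the entire gradient for a small constant multiple of one evaluation. With $n$ in the millions of parameters, that difference is what makes gradient-based training practical.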

I think optimization by gradient descent (and its variants like Adam) is interesting because it's the only optimization algorithm family I know of whose runtime scales linearly or even near-linearly with the number of dimensions, so for large-dimensionality problems it's the only computationally feasible algorithm. Also, though it's only guaranteed to find the optimum for differentiable convex problems, in practice it seems to work well enough to be interesting for nonconvex problems.

I think optimization is interesting because it allows you to solve arbitrary inverse problems, or at least continuous relaxations of them.


Yes, the book is hard to read. I just went through a few chapters of it.

> One of the papers posted led me to be more comfortable with the concept of differentiable programming as a distinct thing.

I'd still be interested in knowing which of the papers that was ;)


Rather than repeating various arguments, I think it's best to just mention this link, which provides explanations and illuminating examples of the phenomenon of smooth functions with diverging Taylor series: https://mathoverflow.net/questions/72/whats-an-example-of-a-...

Quote by Pietro Majer from the link: "it is more useful to communicate the (historically non-trivial) idea that there is a difference between a function and a representation of it by means of a formula"


Thanks! I wonder why the example of a smooth function with a diverging Taylor series here is so much more complicated than arctan.


Object oriented programming, for example, doesn't let me have a variable hold half of one object and half of another or let the language derive the code that gave me that object at runtime, but object oriented + differentiable programming does. It's no less of a paradigm than logic, quantum, or probabilistic programming. If you want to, you can view differentiable programming as extending logic programming with a product and chain rule (+ some additional constraints) that allows (smooth, if you want it) interpolation between data and code.

That said, most discussion of differentiable programming is at the level of syntax sugar for reverse mode differentiation, so I can't blame you for that conclusion.


You don't need OOP plus another paradigm to do automatic differentiation though. I've implemented automatic differentiation, albeit "simple" versions and only forward-mode at this point, but there's really nothing special about the implementations.

Logic programming, on the other hand and for example, needs something much more substantial to be implemented as a library in an existing language, such as backtracking, unification, or the full-on Warren Abstract Machine.

If someone has a clear example of differential programming that is different than just using automatic differentiation as a technique or library, then that might help.

> doesn't let me have a variable hold half of one object and half of another or let the language derive the code that gave me that object at runtime

I'm not sure what you mean here. Could you elaborate?


I don't need OOP to do a hash table lookup and then an indirect function call with the receiver as the first argument either but that ignores that there's more to a paradigm than the algorithm I use for facilitating it. You can embed unification of expression trees as a library in C++. People implement backtracking all the time in almost every language. Talking about differentiable programming as if it's just autodiff is missing the point of what a programming paradigm is.

There's a mechanism, yes, but that's just a means to an end of efficiently enabling a different way of approaching programming. In the case of differentiable programming, that's continuous code and continuous data enabling program search that doesn't have to use purely discrete methods (like logic programming). If that sounds like autodiff and backprop, then yes, that's because that's a good way to implement it. Tensorflow and PyTorch are DSLs embedded in Python and C++ both useable and used for more than just implementing neural networks, but most people aren't happy calling a library a language until it has a parser and a file extension.

> I'm not sure what you mean here. Could you elaborate?

Most programming languages assume that a variable can only contain one value, or a composite value of values. Differentiable programming lets code be smoothly transformed from one to the other while being meaningful at all points between. In an object oriented case, this would be like having a variable contain an object that behaves like some known object A or object B selectively depending on which choice maximizes the success of the program at any given moment.
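A toy way to read that with the forward-mode snippet from earlier in the thread (blend, fA, fB, and the logistic gate are my own illustrations): a learnable parameter smoothly mixes two behaviours, and because the mix is differentiable, gradient descent can push the parameter toward whichever behaviour lowers the loss.

    // a differentiable "choice" between two behaviours fA and fB,
    // gated by a logistic function of a learnable parameter t
    let dsigmoid (t: Dual) =
        let s = 1.0 / (1.0 + exp (-t.Real))
        {Real = s; Epsilon = s * (1.0 - s) * t.Epsilon}

    let blend (t: Dual) (fA: Dual -> Dual) (fB: Dual -> Dual) (x: Dual) =
        let w = dsigmoid t
        w * fA x + (1.0 - w) * fB x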


Seems a bit of conflation of notions going on here: Just because logic/quantum/probabilistic all end with "programming", it doesn't mean that they all are what is called a "programming paradigm" in the typical sense [0]. Where does a paradigm end and a library implementation begin?

For example, varying how a program arrives at the answer induces the imperative/OOP vs. declarative paradigms [0]. On the other hand, quantum programming assumes a radically different type of "CPU" (i.e. instruction set) on which your program runs - which in turn obviously changes everything. But this can safely be implemented within existing paradigms, e.g. ProjectQ [1] is implemented in an OOP language: Python. Thus, I would not call this a new programming paradigm (unfortunately [0] does that, but I think that is bad form).

> you can view differentiable programming as extending logic programming [...] that allows interpolation between data and code.

Reference? I have browsed one of the standard references [2] on automatic differentiation and googled a bit and could not find anything that supports your statement. Even more so, it seems that even defining semantics for differential programming is barely in its starting stages [3].

> differentiable programming is at the level of syntax sugar for reverse mode differentiation, so I can't blame you for that conclusion.

By the same argument you could say that probabilistic programming is just syntax sugar for painless specification of statistical models.

Care to provide a reference where differential programming is presented as something more than "syntax sugar"?

[0] https://en.wikipedia.org/wiki/Programming_paradigm#Further_p...

[1] https://projectq.ch/

[2] Griewank A., Walther A., Evaluating Derivatives, SIAM 2008

[3] https://arxiv.org/pdf/1911.04523.pdf


The existence of a different kind of CPU isn't a meaningful distinction at the level of discussing paradigms. The semantics are different, so the abstract machine is different. The fact that I need a different set of atoms in my desktop to use it doesn't change the programming language part of the discussion.

The main paper to read is [1] which introduces a syntactic notion of differentiation in the lambda calculus connecting substitution and nondeterministic choice to differentiation in the calculus of infinitesimals sense and also introduces a meaningful notion of Taylor expansion of arbitrary programs. This paper is mostly of academic interest, though. The resulting expansion is wildly uncomputable meaning that more modern, practical papers like [2] cite it wistfully as a dream of what could be achieved. How to computably handle most of the constructs we care about in a general programming sense is very active, open research. At the time the paper was introduced, it was more influential on (and influenced by) work on probabilistic and quantum programming through their related models of linear logic [3]. There are only a few slight axiom differences that separate differential, logic, probabilistic, and quantum programming though, so if you're willing to accept one as a "paradigm", then you should accept the others.

[1] https://www.sciencedirect.com/science/article/pii/S030439750...

[2] https://arxiv.org/abs/1911.04523

[3] https://ncatlab.org/nlab/show/differential%20category


It seems you haven't read my references? As your [2] is my [3] from above!

> The existence of a different kind of CPU isn't a meaningful distinction at the level of discussing paradigms.

Well, that was my point above: You can't really lump quantum programming together with probabilistic programming, as they are paradigms on different "levels".

> practical papers like [2] cite it wistfully as a dream of what could be achieved

Are you sure about that? I skimmed [1], as I hadn't read it, and it seems to describe a rather restricted set of functions ("types are interpreted as vector spaces and terms as functions defined by power series on these spaces"), as there are many differentiable functions that cannot be defined as power series.

Moreover, in [2] it is only claimed: "Ehrhard and Regnier {i.e. your reference [1]} do not give an operational semantics but they do give rules for symbolic differentiation and it should not be too difficult to use them to give an operational semantics. However their language with its convenient vector space semantics only supports total functions. It therefore cannot be extended to include recursive function definitions or conditionals (even with total predicates, as continuous functions from R^n to the booleans are constant)." So I would not say they cite [1] as a wistful dream ...

> There are only a few slight axiom differences that separate differential, logic, probabilistic, and quantum programming though.

Give me the axioms and their differences and I believe you. :) (Honestly, I'm not even sure if the discussion on which axiomatization captures the existing developments has been settled; it seems to me you have some kind of category theoretic approach in mind where you just change the category and get a new paradigm - I'd be happy to accept this as well, if there is a clear reference, though I'm doubtful one exists ...)


I don't usually respond to old comments, so I don't know if you'll read this, but I hope I can encourage you to think more broadly about what "differentiable programming" means.

Different fields have a different perspective on the same set of tools because those tools have different pathological cases in different areas. Context really matters.

> Well, that was my point above: You can't really lump quantum programming together with probabilistic programming, as they are paradigms on different "levels"

This distinction is not useful. If I write in a functional or logic programming language, it gets translated into imperative commands for an underlying architecture that is some mix of dataflow, event driven, automata-based, concurrent, etc... that is then further built on top of some physical atoms where an engineer worried about quantum effects. If I write in a quantum programming language, it will probably go through the same process for at least another 5 years. You might argue that quantum is somehow more dictated by the underlying physical model the way that people argue that imperative programming is closer to the physical world than functional programming. But the "level" doesn't change the usefulness of viewing all of these as "paradigms" worthy of study and analysis on their own terms with their own tools. At the level of studying a programming language, the "level" is a useful thing to be aware of for implementations and motivation but usually not for a theory.

> Even more so, it seems that even defining semantics for differential programming is barely in its starting stages

This is also not a useful distinction. OOP was also, infamously, a point of contention between the academic and outside worlds because it was developed and incredibly prevalent without a rigorous theory abstracting it beyond procedural programming. It became a "paradigm" despite that because there was a set of (informal) tools for reasoning about it on its own terms [1].

Likewise, differentiable programming has largely developed to formalize what makes programs written for machine learning frameworks different from programs written in the imperative/object oriented/functional language they are built on. Autodiff has mostly developed in practical usage, so the use cases are front-running the theory. There's increasingly hardware tailored to the execution model and software developers attempting to program it. There are approaches to problems like discontinuities that people have found solutions for without a rigorous theory justifying their use. There's a structure to why and how people are writing code for these applications as well as an operational theory for how to reason about it, but there's very little compositional, equational theory for these choices.

To most people in machine learning, "differentiable programming" is just autodiff with pretty syntax because the term sprung from attempting to put what they are already trying to accomplish with that implementation on more solid theoretical footing as a computable model of a more general logic. That, hopefully, lets us more efficiently explore what a better domain-specific theory might be and if there are better execution models or logical frameworks. Autodiff itself is increasingly used as an umbrella term for many other methods of differentiation with different edge cases, so this is already happening in practice.

To reduce "differentiable programming" to just its implementation ignores that aspect. It would be equivalent to equating machine learning to matrices. Not unreasonable computationally and not a terrible place to start for a theory, but deeply unsatisfying as a mature domain-specific theory.

The main paper I linked [2] is not about autodiff at all. It's an attempt to establish a connection between differentiation in an analysis sense to models of (not otherwise obviously differentiable) program evaluation. The (unrealized) promise is that the centuries of understanding we have for the calculus of infinitesimals can be applied to the less-mature study of lambda calculus and nondeterministic computation. Papers like [3] cite it because it addresses (discrete) structures that analysis is less interested in and potentially provides a way to connect computation, calculus, and whatever it is that we're doing with machine learning.

> Are you sure about that? I skimmed [1], as I hadn't read it, and it seems to describe a rather restricted set of functions ("types are interpreted as vector spaces and terms as functions defined by power series on these spaces"), as there are many differentiable functions that cannot be defined as power series.

Because it's a PL theory paper, it's not concerned with whether all differentiable functions can be represented, but whether all computable functions can be differentiated. And PL theorists are generally more comfortable than most to accept that most functions cannot be computed and choose a more restrictive model that enables more reasoning power. The category [4] is really the better place to start since it lets us also consider models that aren't vector spaces and [2] is best thought of as a prototype that left many gaps in the theory (for example the requirement on coefficients for convergence is wrong, though I can't remember which paper by Lionel Vaux proved this). It can be thought of as computable in finite instances, but it's unsound even when typed due to the zero term and result of sums.

The quote you cite from [3] is easily misunderstood without that context. As a practice-focused paper, it cares very much about computability. Conditionals and loops are possible in [2] since it allows church numerals and fixed point combinators but it introduces a nondeterministic sum which is exponential in the number of evaluation steps and may diverge (doubly so since it's the untyped lambda calculus...) and is difficult to operationalize. That's what I meant by "wildly uncomputable". So, to them, [2] offers a useful mental framework for higher order features, but is not practical. The theory isn't there yet.

The connections between logic, quantum, probabilistic, and differentiable programming can be understood by how the model treats the exponential modality (!) which converts the otherwise linear term to an analytic one. Differentiation decomposes this to give a sum of linear terms. Differential lambda calculus doesn't put any (more) structure on the sum. Probabilistic programming gives the added structure of a probabilistic sum where coefficients are weights. Quantum programming can be modeled via a Fock space [5] for (!) ([5] predates [2] so is not directly discussed as a differential category here). However, it's unclear what the right model for differentiable programming should be if we want something practical for the resulting derivative (much less antiderivative). Daniel Murfet et al [6] have some related work more directly in the context of machine learning.

[1] https://www.cs.cmu.edu/~aldrich/papers/objects-essay.pdf

[2] https://www.sciencedirect.com/science/article/pii/S030439750...

[3] https://arxiv.org/abs/1911.04523

[4] https://ncatlab.org/nlab/show/differential%20category

[5] https://www.researchgate.net/publication/2351750_Fock_Space_... (sadly a researchgate link)

[6] http://therisingsea.org


> don't know if you'll read this

I did read it ;) Because I'm very much interested in this entire topic.

> I hope I can encourage you to think more broadly about what "differentiable programming" means

I'm trying to, but I find it hard. My stance was that differentiable programming seemed like a theory for which only a single example (namely autodiff) existed, as you also said ("autodiff has mostly developed in practical usage, so the use cases are front-running the theory"). But this entire comment of yours really clarified some things for me.

> The main paper I linked [2] is not about autodiff at all.

> The quote you cite from [3] is easily misunderstood without that context

You finally convinced me to have a detailed look at this. Thank you for providing the context.

> Conditionals and loops are possible in [2] since it allows church numerals and fixed point combinators but it introduces a nondeterministic sum [...] and is difficult to operationalize. That's what I meant by "wildly uncomputable".

I think I may have misunderstood some of your previous comments (and perhaps vice versa), as it now dawns on me that you use a vocabulary that comes (I guess?) from PL theory and is very different from the one I'm used to as a mathematician versed in analysis. I'll re-read them.

> Daniel Murfet et al [6] have some related work more directly in the context of machine learning.

I'm actually aware of Daniel Murfet but haven't read his work from the last few years. Did you have a specific paper of his in mind?


P.S. (In case you return once more to read comments.)

It seems like these are problems that could benefit from a PL-theory <-> mathematical analysis cross-collaboration.

I would have sent you a private message via email, but couldn't find any info on your HN profile; my email is there.


> That's not exactly true. For example, arctan = tan^-1 is smooth on R, but its Taylor series only converges for |x|<=1.

This issue creeps in a lot, but I think it comes from complex analysis, where a differentiable function is always smooth and analytic (holomorphic).


But as far as I know, there is no corresponding dual analysis. The dual numbers do not form a field.

The issue primarily revolves around elementary functions, compositions of them, and polynomials.


Yeah, I was only trying to explain where the issue (assuming smooth=analytic, that you rightfully pointed out) might have come from


Ahh, gotcha. I'm trying to figure this out myself. I've always been intrigued by automatic differentiation but have never been satisfied by any article addressing its mathematical justification.


Neither have I ...



