Zero Cost Abstractions (ruudvanasseldonk.com)
366 points by Bognar on Nov 30, 2016 | 110 comments



>The only proper way to reason about the cost of these abstractions is to inspect the generated machine code.

To me, that's a big problem with a lot of these Heldencompilers. They may generate truly optimal machine code. Then again, they may not, and the difference in runtime efficiency between the optimizations working and not working is so great (I've measured 1000x for Swift) that they might as well be completely different languages.

For reference, 1000x means that 1 second turns into 16 minutes, and having that type of difference in something that's completely opaque is not a useful performance tool for me, because predictability is at least half the game in performance. So something like Knuth's transformation systems that turn optimization into a dialogue between programmer, compiler and instrumentation seems like a better idea[1].

[1] https://www.cs.sjsu.edu/~mak/CS185C/KnuthStructuredProgrammi...


The problem is immature tooling. There is no feedback loop from the compiler's generated assembly back to the IDE. We finally have libclang, which sorta does some stuff (I'm not entirely sure how far it can go) - but I'm honestly not seeing any work being done in this direction at the IDE level. After all these years of C++ development, why the IDE doesn't do something as simple as telling me whether a function is being inlined or not is beyond me (and that's just the tip of the iceberg in terms of what I want to know).

When I asked people at CppCon about it I just got some shrugs and was told "just go look at the assembly".

Another solution is profiling - but that has a slow turnaround, and it can be hard to narrow down problem areas.


For Lisp, SBCL is known for giving optimization notes:

    ; note: forced to do GENERIC-+ (cost 10)
    ;       unable to do inline fixnum arithmetic (cost 2) because:
    ;       The first argument is a NUMBER, not a FIXNUM.
    ;       The second argument is a (INTEGER -19807040623954398379958599680
    ;                                 19807040623954398375663632385), not a FIXNUM.
    ;       The result is a (VALUES NUMBER &OPTIONAL), not a (VALUES FIXNUM &REST T).
    ;       unable to do inline (signed-byte 64) arithmetic (cost 5) because:
    ;       The first argument is a NUMBER, not a (SIGNED-BYTE 64).
    ;       The second argument is a (INTEGER -19807040623954398379958599680
    ;                                 19807040623954398375663632385), not a (SIGNED-BYTE
    ;                                                                        64).
    ;       The result is a (VALUES NUMBER &OPTIONAL), not a (VALUES (SIGNED-BYTE 64)
    ;                                                                &REST T).
    ;       etc.
Note also that the approach is different there: integers are not modular, but you can still perform modular arithmetic if the range of values is adequate (http://www.sbcl.org/manual/#Modular-arithmetic).


Also SBCL has instruction level profiling.


Starting with VS 2015, Visual Studio does show execution time for each function call during a debugging session, while stepping.

Also, switching between the source view and an assembly view alongside the original source is a single keypress away (F12) for all Microsoft languages.


I'm not familiar with the term "Heldencompilers", but this describes compilers for all modern languages. :P C++ more so than C, but even C can need cajoling to get it to conform to the specific structures that modern optimizers recognize (and that structure can even differ from compiler to compiler), especially for stuff like autovectorization.

Also note that Swift doesn't have the same performance goals that Rust does, preferring to trade some off in favor of ergonomics.


> this describes compilers for all modern languages

Yes, this is the current trend. And the results don't really justify the cost:

http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_201...

http://proebsting.cs.arizona.edu/law.html

Which is why I (roughly) agree with Daniel Bernstein's polemic "The Death of Optimizing Compilers" https://cr.yp.to/talks/2015.04.16/slides-djb-20150416-a4.pdf

(I frequently make code 10x to 1000x or more faster, and compilers' contributions to that total tend to be fairly minimal, though larger than zero and usually worthwhile. But not worth the current shenanigans and not worth having no idea how the code is going to turn out)


This also explains the shift in functional programming from languages with good reasoning about correctness (Haskell, but also SQL) to languages which additionally allow good reasoning about performance (OCaml, Rust, etc.).

Moreover, this explains the NASA coding rule that all loops shall have an explicit upper bound.

The deeper issue seems to be: if you write code in a fashion that lets you reason easily about its performance, then you might not get optimal performance in all cases, but you establish a minimum level of performance.

So performance stability is more important than reaching optimal performance, because the latter may be easily destroyed accidentally in future versions of the software.

The only exceptions are well-defined tasks with stable input/output definitions (e.g. numeric primitives like matrix multiplication, Fourier transforms, etc.) where the whole point of newer versions is performance improvements and nothing else.


I don't doubt that you make code 10x to 1000x faster, but could this be selection bias? I.e., you are focusing on the code that is slow and not the code that the compiler has already had substantial impact on?


You make code 10x to 1000x faster by changing the algorithm or data layout that you use, not just by emitting different instructions for the same higher-level code.


I don't think anyone who mattered ever argued that an optimizing compiler could or should attempt to turn bad algorithms into good ones.

It seems clear to me that, at the present state of the technology, most developers are better at picking algorithms, while compilers are better at picking low-level instructions. Clearly there are some exceptions, but things like GotoBLAS are clearly exceptions and not the rule.


No, but there can be this myth that certain things will be "optimised away" by the compiler, which is why you can get single digit or fractional speedups by just tuning the optimisation level.


I would much rather deal with those "myths" than with the class of myths surrounding i++ being faster/slower than ++i, or any of the nonsense around divisions.

At least claims that something will be optimized away can readily be tested empirically, in a way that most people will listen to. I have no problem testing the / operator, but I have never had holders of such beliefs accept my results; by contrast, people claiming something like "virtual function calls can be optimized away" can handle it when I demonstrate that inside a single binary they can be, but at places like library boundaries they cannot.


> I would much rather deal with those "myths" than with the class of myths surrounding i++ being faster/slower than ++i, or any of the nonsense around divisions.

This is one area I dislike in C and C++ culture: there is this tendency to micro-optimize code like that without even profiling it, or without it having any actual impact on the application being built.

So one ends up with endless bike shedding discussions about how to write code, instead of writing it.


That is a facet of programming culture. So many programmers in so many languages have their silly myths.

A few years ago some Ruby dev said that declaring methods without parentheses would let the Ruby parser parse method declarations faster. The dev never even measured; he just thought the parens were ugly, and a ton of gem devs removed parens claiming a speedup. It wasn't until a few years later that someone benchmarked it and determined that there either was no difference or that it cost just the tiniest amount to remove them.


Those are the things that the compiler can do to optimise your program by a few x. The sorts of optimisations that it can't do are things like picking a different algorithm, or reorganising the layout of a class so that it or multiples of it can fit within a cache line.
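As a rough sketch of that second kind of change (the types, field names and numbers here are invented for illustration, not taken from any real codebase):

    #![allow(dead_code)]
    use std::mem::size_of;

    // Hot and cold data mixed together: wide fields plus padding make each
    // record bigger than the data it actually carries.
    struct Loose {
        id: u64,
        active: bool,     // one byte of data, but the struct still pays padding
        score: u64,
        label: Box<str>,  // pointer + length
    }

    // Only the fields the hot loop touches, in just-wide-enough types, so
    // several records fit in one 64-byte cache line; the label moves to a
    // side table addressed by index.
    struct Dense {
        id: u32,
        score: u32,
        active: bool,
        label_index: u16,
    }

    fn main() {
        println!("Loose: {} bytes", size_of::<Loose>());
        println!("Dense: {} bytes", size_of::<Dense>());
    }
The compiler won't shrink your field types or split hot and cold data for you; that is a change to the data model, not to the instructions.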


Swift's long term goals are to handle the exact same space as Rust, and that means the exact same performance/control goals. We're just starting higher up, and working our way down.

Rust is a lot more mature! RIP libgreen.


Can't wait to hear more about Swift's opt-in borrowing!


The solution is quite simple: you write using the abstractions and then profile; inspect the assembly for the hot parts of the code and tweak if needed.


I don't think I've seen someone look at the assembly of production code. Ever.


Well, we do it, even if you don't see us. Tell your friends to do it too!


Well, maybe they should? Don't really see how else you're going to figure out why a line of code is as fast/slow as it is.


I can work out why stuff is going fast or slow all the time without going near assembly, but I do higher-level web / db stuff. For the majority of poorly performing code it is blindingly obvious why.


I do it at least once a week. Certainly C programmers do this, so inasmuch as Rust is working to be a better C, one might expect Rust programmers to do it.


Swift does not have many free abstractions - what specifically are you referring to?

(Also Swift's pervasive atomic refcounting means that typical idiomatic code often has pretty atrocious performance.)


Keep in mind that Swift's emphasis on value types means that idiomatic code can have very little refcounting. Obviously that's not the case when you're doing iOS/macOS applications, since you're using the ObjC runtime.

Refcounting can also be dropped to non-atomic by the optimizer when it can prove that an object hasn't escaped.


Value types only avoid refcounts if they contain only primitive fields, which is not all that often in interesting code.

For example, count the number of atomic refcounts on each iteration of the following loop:

  // given
  struct S {
    var x: [Int]
    var y: [String]
    var z: NSView
  }

  var array: [S]

  for element in array {
    // maybe do something with the NSView
  }
That's three atomic increments and three decrements per loop iteration, as each array entry is copied into `element`, because the compiler can't prove that the NSView operation won't somehow cause the array's reference count to drop to zero.

Given atomic memory operations are some of the most expensive things on a modern CPU, this is brutal, and a place where Rust-style memory ownership would be a big improvement.

I haven't seen the optimizer convert atomic refcounts to non-atomic refcounts - do you have an example situation where that happens?

(By the way, I think Swift is a delightful language, but the pervasive reference counting threw some cold water on my passion as it means today we're not really close to replacing C/C++/Rust for machine-level performance.)


That's exactly the point: the optimizer can drop atomicity. All these ref-counts may be optimized away. Or not. The opposite of predictability.


That is what profilers are for.

Even with pure Assembly, changing between CPU releases can have a big impact due to microcode changes or cache sizes.


Heldencompiler? Google is not turning up much.


"helden" is German for "heroes", concatenating words is fine in German grammar, I'm guessing he means compilers that try to be the hero and write their own code, rather than dumbly transform code into instructions.


"helden" is also Dutch for "heroes" and concatenation like that is also fine in Dutch.


Because, well, Dutch is kind of a glorified German dialect ;)

Just kidding, but German and Dutch share a close, common "ancestor" language, closer than any other Germanic languages. I used to have a professor at uni who would always repeat that a language is a dialect with an army, which kind of hits the spot.


My term for stuff like this is "software that suffers from overly-clever syndrome".


Probably supposed to be Heisencompiler as a play on Heisenberg's uncertainty principle.


From context, no, it seems to be "Heroic Compiler".


Is messing up branch prediction "Breaking Bad?"


No. See Heisenberg uncertainty principle.


Nope. But I like it :-)


Helden is German for "heroes"


Heroic compiler?


Might be a typo of heisencompiler?


If you're saying 1000x for Swift, my immediate guess is you've got some pathological code that is triggering a massive number of copies of CoW structures (e.g. arrays, dictionaries, sets).


I am pretty sure it wasn't, and I wasn't trying to construct a pathological case either. I haven't run into the CoW issues myself yet, though now I am curious enough to try to construct a case.


Why are you pretty sure it wasn't? Swift is usually pretty efficient. Massive numbers of CoW copies are by far the most likely culprit for any kind of huge slowdown. And without actually profiling it, it's hard to detect when CoWs are happening; you mostly just have to be aware of when you're writing code that keeps multiple references to a CoW data structure that you're mutating. It's very common to get single accidental CoW copies. Getting a large number of them tends to require slightly odder code. For example, saying something like `col.reduce([], { $0 + $1.elements })`, which is a normal idiom for Haskell, will introduce a CoW copy on the accumulator array at every single step. And this is true even if you modify it to something like `col.reduce([], { var ary = $0; ary.append(contentsOf: $1.elements); return ary })`, because behind the scenes the reduce method will still have a reference to the accumulator while the closure is executing.


How much speedup was that "1000x" actually though? Like 2x?


This is pretty awesome. One key bit of information that the compiler has is that the coefficients are a constant array of length 12, which makes the loop unrolling possible and also means that the register magic is in play - it's seriously awesome that the compiler does this.

That said, I'd expect something similar to happen with a well-written C program. Would equivalent abstractions in C++1{1,4,7} be "costly"?


All of the optimizations were done by LLVM, so there's nothing stopping this from working in C++ with closures, or in C with functions annotated for inlining.

With C the lack of generics means that writing composable iterators is hard, though.


At least it's hard to write composable iterators that don't circumvent the type system completely.


Since ranges are not yet standard, the "equivalent" abstractions would seem to be std::algorithms, but those are not zero-cost, as they are not iterator adaptors but consumers. This is extremely inefficient.

However, Boost offers iterator adapters which, although weaker than true iterator adaptors (what C++ calls ranges), will suffice in this case.


As for std::algorithms not being zero-cost: I have seen cases where std::for_each was noticeably faster or much slower. I have seen it in benchmarks, and it seems largely to be a quality-of-implementation issue.

Good implementations can cheat and use knowledge my code shouldn't have about parts of the underlying system and bad ones do extra useless crap.

Benchmarking and profiling seem to be the only real way to gain confidence in your results reliably.


His snippet is very close to what one would do in Scala. And I suspect it is very similar in other functional languages. I have always wanted to see a comparison of the generated code of different functional languages for a similar program. One day...


Yes, Rust feels a bit like a mix of Scala and C. Especially match :)


I think ML is actually the strongest influence. Of course, Scala and Haskell both have ML influence, so there are things that look familiar from all three languages.


OT and not sure if you're the author.. but what a beautifully designed website!


Not the author (Ruud), but I've been in touch with him.

The source code for his website is free software [0], so feel free to have a look or adopt it for yourself.

Ruud uses a small static site generator he's written in Haskell, and in his own words it “includes a tiny templating engine, an html and css minifier, and an aggressive font subsetter.”

[0]: https://github.com/ruuda/blog


That's exactly what I thought! One of the best-designed blogs I've ever seen. It has exactly what you need and nothing more. Looking at the HTML code, it's also incredibly simple and elegant! What a pleasure to read.


Came here to say exactly this. It's rare for me to be so motivated by design to comment on it - I wanted to read more posts just because it was so beautiful to look at!

Arguably it's missing a link to the homepage/post overview at the top, but perhaps I just expect that out of convention.


Thank you!


> But in any case, a missed opportunity for vectorisation is just that: a missed opportunity. It is not abstraction overhead.

Does the same code implemented in C manage to vectorize? If so isn't that an actual "cost" in comparison?


How difficult is it to add a "zero cost abstraction" like the ones used here? For example, what would it take to make this compile:

    for window in buffer.sliding_window(coefficients.len()) {
        let prediction = coefficients.iter()
                                     .zip(window)
                                     .map(|(&c, &s)| c * s as i64)
                                     .sum::<i64>() >> qlp_shift;
        let delta = buffer[i];
        buffer[i] = prediction as i32 + delta;
    }
with sliding_window returning an iterator over slices of the buffer?


"Zero cost abstraction" to me means a completely "static" abstraction, i.e. there are no runtime mechanisms necessary to implement the abstraction.

Your sliding window is just a series of windows, so there is no reason why it couldn't be compiled in the "most" efficient way. In fact it probably would be; just try it out by implementing sliding_window as an iterator (rough sketch below).
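A minimal sketch of such an adapter as a plain read-only iterator over overlapping subslices (this is essentially what the standard `windows` method on slices already provides; the names are made up):

    struct SlidingWindow<'a, T: 'a> {
        slice: &'a [T],
        size: usize,
        pos: usize,
    }

    impl<'a, T> Iterator for SlidingWindow<'a, T> {
        type Item = &'a [T];

        fn next(&mut self) -> Option<&'a [T]> {
            if self.pos + self.size > self.slice.len() {
                None
            } else {
                // Hand out a borrowed subslice and advance by one element.
                let window = &self.slice[self.pos..self.pos + self.size];
                self.pos += 1;
                Some(window)
            }
        }
    }

    fn sliding_window<'a, T>(slice: &'a [T], size: usize) -> SlidingWindow<'a, T> {
        SlidingWindow { slice: slice, size: size, pos: 0 }
    }

    fn main() {
        let buffer = [1, 2, 3, 4, 5];
        for window in sliding_window(&buffer, 3) {
            println!("{:?}", window); // [1, 2, 3], then [2, 3, 4], then [3, 4, 5]
        }
    }
Because each window is just a borrowed subslice, there is nothing here the optimizer can't see through; note, though, that this version only hands out shared references, so it would not let you write back into the buffer the way the snippet above does.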

However there's always this problem with cleverness: It gets harder and harder to read and maintain. Also http://wiki.c2.com/?SufficientlySmartCompiler


There is a `windows` iterator.

https://doc.rust-lang.org/std/primitive.slice.html#method.wi...

However, your code would not work as-is, since the window borrows the buffer, so `buffer[i]` cannot be written to. Further, there is no `windows_mut` to let you write to the end of the window, because that would let you get multiple mutable references to a given element.


> Fortunately these structures are not allocated on the heap, as they would likely be in a language like Java or Python.

At least in principle, escape analysis would be used to allocate them on the stack. In HotSpot, simple iterators are commonly allocated on the stack (and then optimized further). When you have a JIT (which means you can afford one), general abstractions become zero-cost based on their use-site. The upside is that you get a few, general and powerful abstractions that are then completely optimized based on how they're used. The downside is that it is not guaranteed, and a JIT requires more RAM and power (and in practice usually a warmup period, although that can be solved).


I'm not familiar with some of these operations. I don't know what a Rust "slice" is. Or what "Zip" does.

Could someone show me the most straight forward equivalent in vanilla C? I assume there is no direct equivalent as temporary storage will be needed. But that's fine and would further serve the purpose of explaining why Rust is cool.

Thanks.


A slice is a pointer + a length. Zip takes two iterators and gives an iterator that returns pairs of elements from each of them.
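For instance, a toy version of the pattern from the article, with made-up values:

    fn main() {
        let coefficients: [i64; 3] = [2, -1, 3];
        let samples: [i32; 3] = [10, 20, 30];

        // zip pairs the two iterators element by element and stops at the
        // shorter one; map and sum then fold the pairs into one value.
        let prediction: i64 = coefficients.iter()
            .zip(samples.iter())
            .map(|(&c, &s)| c * s as i64)
            .sum();

        assert_eq!(prediction, 2 * 10 - 1 * 20 + 3 * 30); // 90
    }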

For an imperative translation see https://www.reddit.com/r/programming/comments/5fpghn/zerocos...

Doing an exact translation to C is hard. Do you want C that does the same thing, or C that demonstrates the abstractions that are being boiled away?


The thing with C is also that it's hard to do metaprogramming. For example, I have some finite element code that, ideally, I would like to run for an arbitrary dimension. In C this is practically impossible to code efficiently, while in C++ (and I suppose in Rust) I believe this can be done with templates (though I don't have much experience with this).
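For what it's worth, in Rust a dimension-generic kernel can be written with const generics (a feature that arrived well after this discussion). A toy sketch of the shape of the idea, not actual finite element code:

    // Each N gets its own monomorphised copy, which the compiler can fully
    // unroll, much like a C++ template instantiated per dimension.
    fn dot<const N: usize>(a: &[f64; N], b: &[f64; N]) -> f64 {
        let mut acc = 0.0;
        for i in 0..N {
            acc += a[i] * b[i];
        }
        acc
    }

    fn main() {
        println!("{}", dot(&[1.0, 2.0], &[3.0, 4.0]));             // N = 2
        println!("{}", dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]));   // N = 3
    }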


Uh, the C equivalent might be something like

    uint32_t *buffer = ...;
    uint64_t coefficients[12] = {...};
    uint16_t qlp_shift = ...;

    uint32_t *bufp = &buffer[...];
    uint64_t sum;
    for (size_t i = 0; i < 12; i++)
        sum += coefficients[i] * bufp[i-12];
    uint64_t prediction = sum >> qlp_shift;
    *bufp += (uint32_t)prediction;
Edit: this is incorrect


This is incorrect. The C code is more like

    for (size_t i = 0; i < 12; i++) {
        sum = 0;
        for (size_t j = 0; j < 12; j++)
            sum += coefficients[j] * bufp[i + j];
        buffer[i + 12] += sum >> qlp_shift;
    }


I think this is still a little off. Maybe you used bufp[i + j] where you meant buffer[i + j]? I may have something wrong too, but I probably would go with:

    for (size_t i = 0; i < 12; i++) {
        int64_t sum = 0;
        int32_t *bufp = buffer + i;
        for (size_t j = 0; j < 12; j++)
            sum += coefficients[j] * bufp[j];
        *bufp += (int32_t)(sum >> qlp_shift);
    }
I think this will put it in a form where the compiler should vectorize it[1], probably making the loop about twice as fast as the version shown: 3 vector loads, 3 vector multiplications, 3 vector additions and a horizontal sum versus 16 scalar loads[2], 12 scalar multiplications and 12 scalar additions.

[1] I presume the "various reasons the above snippet is not as obvious to vectorise as it might seem at first sight" is because the coefficients[] are 64-bit and buffer[] is 32-bit. But a smart compiler should be able to use PMOVSXDQ (packed move sign extend double-word to quad-word) here. Unless there is some other issue, I think the author may be underestimating the benefit of vectorizing here.

[2] While the author is right that most of coefficients stay in registers, there isn't actually room for all 12, so 4 of them are loaded from the stack every iteration. Since only 3 128-bit vector registers are required, they should remain there. I'm not sure if the excess loads are actually a bottleneck, though.


> I think this will put it in a form where the compiler should vectorize it

I played around with your snippet using the compiler explorer https://godbolt.org/ . I tried a couple of versions of clang, gcc, and icc, and none of them were able to auto-vectorize it. Not sure why. Kind of a shame, since there would definitely be a speedup.


Thanks for testing it. You inspired me to try as well, and I've found that Intel ICC 16 and 17 will vectorize it if you put "#pragma vector always" before the inner loop. Without it, it thinks that the write to buffer[i] might interfere with the reads of buffer[i + j]. I think this is because it's trying to "jam" the inner and outer loop.

In addition, I discovered a couple other issues. I was embarrassed to see that I'd messed up the loop bounds in my example. But the bigger one is that AVX2 and earlier do not have a 64-bit x 64-bit -> 64-bit vector multiplication instruction. VPMULLQ was only added with AVX512, possibly because it requires a 512-bit intermediate.

This means that with 64-bit coefficients, the vector approach has to do the multiplications twice, and shift, and add. This is likely going to be slower than the scalar approach. But if you were able to switch to 32-bit (or floating point) coefficients, I think the vectorized solution looks pretty good. ICC estimates a 1.4x speedup, but I think that gets better with longer buffers.

I put the version with int32_t coefficients up here: https://godbolt.org/g/GXNVe2


Actually, the coefficients are only 16 bit; they are just widened before the loop. AVX2 and later are not widely supported, but there is pmuldq in SSE4.1, which is appropriate here. It takes 32 bit operands and produces a 64 bit result. Because its result is 64 bit (as it should be, by the way, because truncating to 32 bit will overflow on real-world FLAC files) it can only do two elements per operation. A few adds can be vectorised too then, but you still need a horizontal add to get the final value. I think it is possible to get a small speedup, but definitely not the 4× or 8× that you might hope for with SSE/AVX. But at this point I would say that this is a pretty advanced optimisation, not your average "vectorise all operations in the loop, increment the counter by 4/8 at a time" transformation that a compiler can do easily. I am impressed that the Intel compiler can actually vectorise things here.


This whole thread: C is fine. We're fine. Everything's fine here now. How are you?


Why the insult?

Memory safety without a hit in performance seems like a great thing, and this is a fine example that it's becoming possible. The point is not that C is always fine. In this case, I think the actual conclusion is something like: C and Rust perform equally here, both have about 50% of the theoretical assembly performance on modern processors, thus better vectorization would help both of them. But if all else is equal, choose the safe one.


Whoops! You're totally right. I missed the outer loop. Thanks for the correction.


Which -in my opinion- is a lot clearer than the Rust code in the article.


That's totally fair, but that may have something to do with how well you know Rust. I obviously don't know your familiarity level, but once your mental models "snap" out of thinking about moving data around, and start thinking about operations upon the data, it gets really intuitive.


I've seen Rust written in imperative style, and it was clear to me in most of the cases. I think it's more the difference between imperative and functional styles that's throwing people for a loop.


> I think it's more the difference between imperative and functional styles that's throwing people for a loop.

Pun intended? If so, nicely done.


Yeah this is clearly someone who's into FP. If you rewrite this code to replace the map/sum with a for loop it's basically identical in length, while being completely idiomatic.


It took me some time to embrace this style in Rust, but now I love it. It cuts down on boilerplate code.


I am familiar enough with functional programming patterns. I could probably understand this if I had some more experience with Rust. But I think I would be able to understand the snippet in most imperative and functional languages.

My point is not that the snippet is bad, just that Rust feels, looks, and probably is, quite complex. As opposed to, for example, Haskell, which also introduced a bunch of nice new ideas.


> I am familiar enough with functional programming patterns.

Really? I've never used Rust, but the code was pretty clear to me except for the >> shift part.

> But I think I would be able to understand the snippet in most imperative and functional languages.

In Python (ignoring the shift thingy) I'd write

  prediction = sum(c * s for c, s in zip(coefficients, buffer[i-12:i]))
To me, both are equally clear


  > Rust feels, looks, and probably is, quite complex. As opposed to, for example, Haskell
Do you really think that Haskell is less complex than Rust, or am I just reading your comment wrong?


On the other hand, the Rust code is guaranteed at compile time to be free from buffer overruns, off-by-one errors, and a whole host of other nasties.

I know, I know, the C code you write never has buffer overruns, just like the C code I write never does, well, except that one time, but then I caught it in the debugger during testing and it was easy to fix. Oh, and then that other time, but I caught that one too, and then.....


You forgot about the time you forgot to zero the memory before using it, and it took a week to track down the random junk appearing in your data.


> you forgot to zero the memory before using it

Nah, I did that on purpose, because I remember the time I accidentally swapped the 'value' and 'size' parameters of memset, trashing the stack, and so now I go with the 'safer' option of just leaving things uninitialised. Not to mention the performance increase I get from avoiding a memset that isn't necessary, so long as I'm really, really careful and never make a mistake, which I almost never do.


I love how a type-safe memset that distinguished between size types and value types would have fixed your problem (with no run-time overhead), but you take this as a reason to do even more unsafe stuff that has dubious costs for dubious advantages.

And I cannot tell if you are serious, trolling or sarcastic, because I know people who think this way, think this way is a joke and think this way is the bane of their professional lives.


Not being serious, just joking based on real-life things that have bitten me at one point or another, and on the mindset of some developers who dismiss the benefits of safety while caring about mostly non-existent speed benefits.


> On the other hand, the Rust code is guaranteed at compile time to be free from buffer overruns

No it's not. In this line:

    .zip(&buffer[i - 12..i])
you can just put any indices and it will compile.


But it will not read outside of the buffer. It will panic at runtime instead. That's not a buffer overrun.
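A trivial way to see the distinction (toy values, nothing to do with the article's code):

    fn main() {
        let buffer = [1, 2, 3, 4];
        // Slice bounds are checked at run time: this range is out of bounds,
        // so the program panics with an index error instead of reading past
        // the end of the buffer.
        let window = &buffer[2..10];
        println!("{:?}", window);
    }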


It's still an error discovered at runtime. It's one that smacks you in the face instead of silently doing the wrong thing, but it's still found at runtime. I really prefer my errors to be found at compile time, but that's often harder to do...


zip is a function/method (depending on the language used) that transforms several lists of values into a single list of (n-)tuples, ending (IIRC) when the shortest list ends.

For example: https://is.gd/hrUf7q


Would be curious to see a good C++ translation of that, which a modern compiler can unroll/inline about as well.


Given that the Rust compiler uses LLVM as its backend, and given that Rust's closure implementation was inspired by C++'s (though with extra compile-time machinery to make them memory-safe), I'm certain that C++ is just as capable of boiling these abstractions away as Rust is. :) (But admittedly I don't know if iterator adaptors like map and zip are in C++'s stdlib.)


Not part of stdlib, but it can be done using Boost. https://godbolt.org/g/QBN5zP


I don't know anything about FLAC, but is the implicit modular arithmetic the expected behavior here? What if a product or a sum overflows?


Good question! It turns out that 64 bit arithmetic does not overflow (I explain this in a comment in the source code [1]) up to the shift. The shift amount must then be large enough to fit the result in 32 (or actually, 24 or usually 16) bits. For a valid FLAC file this will be the case. For an invalid FLAC file it might truncate, but an invalid file cannot be decoded properly anyway. The shift amount is a 5 bit signed number [2], so it is never possible to shift by the integer width or more.

[1]: https://github.com/ruuda/claxon/blob/91b6af9/src/subframe.rs... [2]: https://github.com/ruuda/claxon/blob/91b6af9/src/subframe.rs...
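For a rough back-of-the-envelope version of that bound (my own numbers, reusing the 16-bit coefficient width mentioned elsewhere in this thread and 32-bit samples): each product is below 2^15 * 2^31 = 2^46 in magnitude, and a sum of 12 such products stays below 12 * 2^46 < 2^50, comfortably inside the ±2^63 range of a 64-bit signed integer.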


Thanks for the detailed answer.


I think the shift at the end (>> qlp_shift) takes care of this. The multiplication and sum are done in 64-bit, and then the sum is presumably shifted enough so as to fit in 32-bits.


Thanks.

Isn't it possible to give a "coefficients" vector with values large enough that computations in 64 bits would overflow too? I don't doubt the code works in practice, because the ranges of values are reasonable. However, this is not explicit just from looking at the code. I'd prefer if the actual domains were made obvious. Right now, 12, 32 and 64 feel like magic numbers in the code.


My guess would be that the assumption was that 64-bit arithmetic is somehow more time-efficient, and 32-bit is more memory-efficient while still being large enough to hold the results.


Why no constant for the value 12? :(


I like the idea of zero-cost abstractions very much. I think that eventually we will move to formally verified code - which is, of course, also a zero-cost abstraction, since the verification is normally done at compile time.

The snippet presented is completely unreadable to me though, and I think that in general, Rust is too hard to understand (and has some syntax which seems quite arbitrary).


> The snippet presented is completely unreadable to me though

I suspect this is just a matter of being used to things. For me, the corresponding imperative code is harder to disentangle. Whereas, knowing what zip/map/sum do, both the intent and the behavior of the code are abundantly clear to me.


I suspect that's right. I can describe what zip, map, and sum do, but I don't have an intuitive feel for what each one does or the patterns that they're usually used in because I haven't ever really done any functional programming.

I've seen side-by-side imperative and functional comparisons, written in Rust. Reading through the functional version of the code always gave me a very rough idea of what was happening, and reading the imperative version was always immediately clear to me. I think that it's almost certainly a function of familiarity.


Have you ever learned a new spoken language? At first, the learner understands by translating to their primary language. Only later, with repeated experience, does the learner stop doing so and become "fluent".

It's very similar when learning the functional style of programming. Even today, with most of this functional style being intuitive to me, when I see a complex iterator chain (usually only when constructing a `Map`) I still need to mentally unfold it into a loop.


> Have you ever learned a new spoken language? At first, the learner understands by translating to their primary language.

That seems like a good analogy. I underwent that process a few years ago while teaching myself to read assembly, and in the past, other spoken languages.


Which syntax seems arbitrary? I've been reading Rust since its primordial days, and I'm happy to help explain the ultimate reason behind any bit of syntax. :)



