Hey another article at YC frontpage about saying "DO NOT MESS WITH STATE" to sca...

colanderman · on Dec 1, 2016

Computer memory is mutable. For multiple processors to make use of the same memory requires contending for that memory somehow (else the processors are not communicating in any meaningful way).

You can mathematically try to hide that however you want, but somewhere deep in your language runtime stack will be a shared-memory concurrent mutable algorithm, be it a lock, queue, or otherwise.

The alternative is to ditch shared memory entirely and use e.g. hardware IPC to manage contention. Tilera tried that, they had 64-core processors with a hardware IPC mesh and incoherent cache. But last I checked they switched focus to coherent shared-memory concurrency. Not sure what the driver for the change was but it's telling.

EDIT: Not to mention that half the article focused on promise-style concurrency, which has nothing to do with shared memory or even multicore. Immutability does nothing to address that I/O exists and has latency, and that people try to hide it in frankly bizarre ways. (Really… not pumping the event loop to avoid deadlock? That's a design issue, you should be using queues instead of mutexes and eliminate cycles if you find you have that problem.)

eternalban · on Dec 1, 2016

> Computer memory is mutable. For multiple processors to make use of the same memory requires contending for that memory somehow (else the processors are not communicating in any meaningful way).

That's a bit of a strawman. For example, content addressable memory and message passing will certainly permit "processors ... communicating in [a] meaningful way".

colanderman · on Dec 1, 2016

That's why I called out message passing specifically in the very next paragraph.

victorcase · on Dec 1, 2016

I will search about Tilera :) Your answer open my mind cause I hadn't thought about hardware. It got me thinking.. Are GPU architecture more friendly to be immutable or It have the same "problem" of shared memory too? And.. by the way thanks to sharing your knowledge :)

adwn · on Dec 1, 2016

If anything, GPU architectures are even less suited to immutability than CPUs. To get good performance out of a GPU (otherwise, why use them?), you have to avoid thread divergence caused by data-dependent branches as much as possible (I'm simplifying a bit). The simplest and most effective way to do this is to allocate a couple of large arrays and work on them. Immutable data structures, garbage collection, and frequent allocation of small objects, in contrast, will cause a lot of thread divergence.

In addition, immutable data structures tend to be pointer-based. For GPUs (even more so than for CPUs), you want contiguous memory accesses for acceptable performance.

colanderman · on Dec 1, 2016

GPUs deal primarily with "embarrassingly parallel" worksets – those worksets where the data can be trivially divided into unrelated chunks, so mutability is not really a concern (since working memory isn't shared). In fact it's typical for an algorithm not to modify its input, and to write its output to a separate buffer. All the synchronization happens at the start and end of the algorithm, when work is passed to/from the CPU.

While it is possible for threads in a (modern) GPU to modify shared memory, it's typically very costly. Usually it takes the form of one thread running at the exclusion of all others, to do something like collecting the results from other threads after they're done running.

Re: Tilera, looks like they got bought out by Mellanox and haven't progressed past where I last kept track of them: http://www.mellanox.com/page/products_dyn?product_family=238... It was a shame, because the IPC mesh was easily twice as fast as the shared memory solutions they were pushing in the Gx series.

Athas · on Dec 1, 2016

Functional array programming on GPUs works pretty well, and is an immutable programming style. Your object granularity tends to be pretty big (entire arrays), so the low-level shared memory concerns that inevitably arise at some point in the software stack are easy to handle for the compiler and/or runtime. Basically, you have much fewer synchronisation points, so you can afford for them to be slow.

It's a more restricted programming style than common functional programming - in particular, recursion and advanced control flow is out - but that's the name of the game on GPUs, no matter what you do.

pjmlp · on Dec 1, 2016

And there are at least Haskell and F# implementations for GPGPU, not sure about other languages.

jerf · on Dec 1, 2016

Rust is probably the right cost/benefit tradeoff; mutable data structures, but strictly controlled mutability. In safe Rust code, the sort of mutability that gets you in trouble is blocked by the compiler.

I'm not sure there's an immutable language out there that impresses me with its performance at scale. In some ways, GHC works miracles for Haskell, and yet, it can still end up being 2-5x slower than C even if you try to make it fast. I'm not sure if there's anything else out there that's any faster with pervasive immutability. (There are some other "functional programming languages" that are often faster, like OCaml, but they're getting there by letting mutability in.)

As said in the article: "Rust is slightly less opinionated on the overall architecture of your system than Midori was, which means it is easier to adopt piecemeal," emphasis mine. Over the past couple of decades there have been a lot of projects like Midori (though that is one of the better ones) that could solve lots of problems, as long as lots of people dropped everything and adopted it. I've learned from both looking around and my personal experience that while eventually you do have to drop everything and try something new, that if you can to create a system that you can incrementally improve into, it will outcompete the drop-everything one every time. Even if the tradeoffs might make you sick, the reality is stark; the situation must be desparate before a "drop everything" solution can displace an existing one entirely. In Rust's case, in some ways the most exciting thing about it isn't the features its shipping or its capabilities or anything else; it's the way you're seeing people carve out a hunk of Firefox or other programs and replacing one function, one module, one subsystem at a time in Rust. That's what's missing from so many of the other efforts to address the issues Rust is addressing.

nickbauman · on Dec 1, 2016

Last I checked OCaml is single-threaded concurrency. My testing comparing Java with Clojure is they clock in about the same but Clojure pushes you away from locking where Java pushes you toward it but hard. Clojure isn't pure functional but it's enough that it makes a huge difference to achieving correctness without the performance compromise.

pjmlp · on Dec 1, 2016

The multi-core runtime for OCaml already exists, just isn't fully feature complete to be integrated into the stable version.

https://github.com/ocamllabs/ocaml-multicore

slashdev · on Dec 1, 2016

Echoing what colandrman says in a peer comment, current computers are shared mutable memory, and at least in x64, i.e. servers, coherent too. Immutability is nice, message passing is nice. It's certainly easier to reason about most of the time. But often (not always, sometimes it can be faster) there's a performance price to pay because you're layering an abstraction over the base computational model.

I specialize in lock-free everything mutable programming. It's a bloody nightmare to write correct code in an environment like that - but when done well it performs better than anything else.

greenleafjacob · on Dec 1, 2016

Care to share some resources on lock free concurrent data structures? Recently I saw HN post about a wait free stack [1].

[1] https://news.ycombinator.com/item?id=12109206

jwatte · on Dec 1, 2016

Erlang systems have been immutable, parallel, and concurrent for > 20 years.

zerofan · on Dec 1, 2016

Erlang is elegant and noteworthy, but you'll get more computation done in a single-threaded C program.

zzzcpan · on Dec 1, 2016

Yes, Erlang's implementation is a big problem, but it's not something fundamental to its architecture. Some new language with a fresh implementation could definitely solve that.

jwatte · on Dec 1, 2016

Also, Haskell has both useful and safe STM, and a runtime with built in concurrency and parallelism.

(Mentioning Erlang and Haskell because I've shipped solid production systems using both of those, and they work well!)

taeric · on Dec 1, 2016

Funny, because I read it more as learning to cope with state. Do you want to scale for performance? You need to delicately manage state.

You can do local tricks to hide change in state. Consider, many even simple addition operations in your computer are destructive. At the end of the day, though, your application's value is in the state it manages.

There may be a panacea in there some where. However, it often seems that performance is heavily sacrificed in the name of scalability. Which is scary, because performance is the main reason my scalable phone is driving me batty.

Bahamut · on Dec 1, 2016

Performance - immutability comes at a high cost sometimes.

jandrese · on Dec 1, 2016

When I saw the example in the paper of building a list by copying the entire list for every new entry and trashing the old one I knew performance would be a problem. A smart compiler can optimize that if it knows what you are doing, but they were writing a research compiler.

There is a tradeoff with this kind of immutable oriented programming that you trade off relatively easy parallelism for heinous memory thrashing--and cache misses are already a huge bottleneck on modern architectures. Rust seems to have struck a good balance here, I'm keeping a close eye on it as it continues to mature.

harpocrates · on Dec 1, 2016

Another thing to keep in mind is that with everything being immutable you end up with a pure language. Once that happens, the compiler starts being able to perform some really nice rewrites of your code really easily. The problem is that this is all or nothing - unless your language is actually pure you can't do this.

See this blog post for more <https://donsbot.wordpress.com/2008/06/04/haskell-as-fast-as-...

VodkaHaze · on Dec 1, 2016

Especially when memory constrained

victorcase · on Dec 1, 2016

Ok in the context of - critical - constrained memory I understand;

However do you agree that is more easy to solve memory (buying new hardware) than handle with race condition? kkkk

Then, Maybe we need to think about optimize our daily data structures.. I have following a few papers about this.. e.g:

http://dl.acm.org/citation.cfm?id=2658763 http://dl.acm.org/citation.cfm?id=2993251

jandrese · on Dec 1, 2016

It's easy to buy memory, but hard to buy L2/L3 cache. The whole point of the exercise is to scale more easily on multicore architectures, but it's no good if you blow out the cache thousands of times per second and bottleneck the system on memory accesses.

AstralStorm · on Dec 1, 2016

Additionally, DRAM and VRAM bandwidth are always at premium. Whenever you're making a copy (which you do a lot when objects are immutable), you use memory bandwidth.

This is especially important on mobile, FPGAs and in cases where sheer volume of data is huge. (GPU and big data)

adwn · on Dec 1, 2016

In addition to what jandrese said: You can nearly always add main memory capacity, but you cannot easily improve main memory throughput or latency.