.NET JIT and SIMD are getting married (msdn.com)
160 points by ebrenes on April 8, 2014 | 67 comments



This is great to see. While Mono already had a method for using SIMD intrinsics, this tweet[1] indicates that this approach is better. Can anyone elaborate?

[1] https://twitter.com/migueldeicaza/status/452099923157065728


Our implementation is too tightly coupled to Intel's SIMD instructions, while this approach maps to SIMD implementations on other CPU architectures as well.

Mono.SIMD has a few extra features that are missing from the current design, but they are not that important.

Microsoft has come up with a more pleasant and easier-to-use API than we did. We have shared our feedback with them, and hopefully it will keep improving.


We shared our design with Miguel for feedback ahead of the release. He gave us a boatload of very high quality and informed feedback. Thanks! We haven't been able to act on most of it yet, since it relates more to what to do next than what we released just now.

We're hoping to see this NuGet package supported on top of Mono, too. As we've done previously, we'll share our tests with Xamarin to ensure consistent behavior between implementations.


Thanks for the definitive reply!


The article starts with a falsehood. Moore's law is not finished. Much like Mark Twain's death, reports of its demise are greatly exaggerated. Transistor count is still surprisingly following Moore's curve, and single threaded performance, while having taken a hit, is still growing at a healthy 25% per year.

Oh well, opening lines are what opening lines need to be. I'm just pedantic about stepping on solid ground.


When people say Moore's law is finished they are saying that the free lunch that was the growth rate of single threaded performance is gone.

25% isn't bad, but it's nowhere near the 60% we were getting before. Compounding makes the gap worse: a 2-year wait used to mean roughly a 156% boost (1.6² ≈ 2.56) but is now about a 56% one (1.25² ≈ 1.56), and many analysts think even that is generous.


You have to be careful about what Moore's law actually says. It says that transistor density doubles roughly every 2 years. So transistor count can still be increasing while Moore's law is null and void. Most lithographers I've talked with have said that Moore's law is basically done for, and its demise is due to economics and not physics. It's getting MORE expensive to produce higher transistor densities due to the very complicated manufacturing processes. So fabs can still produce processors with smaller transistors but you probably won't buy them.


It was also careful to frame it in terms of clock speed. I remember fondly the ramp-up from 4.77 MHz to our current clock rates, and we have really plateaued at around 3.4 GHz at the top end. [1] There have been challenges in reaching higher speeds, and I think the highest clock rate achieved is an impressive 8+ GHz, but that isn't off-the-shelf stock you just pick up at Newegg. That blog post has some interesting and relevant data. It is a few years old now, but it still applies.

[1] http://csgillespie.wordpress.com/2011/01/25/cpu-and-gpu-tren...


To my knowledge the fastest (in GHz) commercial processor is a POWER7 at 5.5 GHz.

I wonder how it compares to an x86, but benchmarks are hard to come by.


> single threaded performance, while having taken a hit, is still growing at a healthy 25% per year.

That's false.

Moore's Law used to translate into increased clock rates. In consumer CPUs, clock rates have stagnated at around 3 GHz ever since about 2002! That's down to power-dissipation issues, and chasing GHz was partly a waste anyway, because performance doesn't scale linearly with clock rate: a 25% increase in GHz translates into less than a 25% increase in performance, and the difference between top-of-the-line consumer CPUs running at 3 or 4 GHz and over-clocked prototypes running at 8 GHz is much smaller than you'd think ;)

Ever since then, single-threaded performance has been improving through design changes (e.g. instruction-level parallelism, caching, etc.); however, this doesn't scale and isn't necessarily related to Moore's Law (the processor doesn't know the code's intent, so there are hard limits to how much cleverness it can pull off).

https://en.wikipedia.org/wiki/Clock_rate#Historical_mileston...


Moore's law "translates" to transistor count. Any other interpretation is your own. I couldn't care less about clock frequency. I care about operations per second, and these are following the numbers I referenced.

See: http://preshing.com/20120208/a-look-back-at-single-threaded-...


Moore's law "translates" to transistor count. Any other interpretation is your own. I don't care about "operations per second", because (1) it depends on what you're counting and (2) the numbers you referenced aren't necessarily due to Moore's Law.


Lots of interesting news in the .NET world lately. (Kind of hard to believe it took this long to get SIMD though!)

SIMD can be a huge boost for numerical workloads; for example, SIMD auto-vectorization was just added to Julia, and people saw 8x boosts on some workloads (and in some cases another 4x from inlining improvements merged the next day).


> Kind of hard to believe it took this long to get SIMD though!

Does the JVM have SIMD?


If you mean as part of the language standard, no.

Having said this, each JVM vendor is free to make use of them if it sees fit to do so. They just don't offer a standard way to exploit SIMD instructions explicitly; you have to let the JIT decide how to compile the code.

Additionally, Java 9 will have integrated GPGPU support via the Aparapi project (targeting HSAIL), which you can already use anyway.

http://developer.amd.com/tools-and-sdks/heterogeneous-comput...


Wow, that is really neat. I would imagine Java's parallel-friendly data structures help make that feasible? (vigorously discussed in this thread a few months back: https://news.ycombinator.com/item?id=6505786)


It's probably not as sexy as that. All data structures have to be copied from CPU land to GPU land, and the GPU is limited in the kind of pointer chasing it can do (so simple linked lists might work, but GPUs are mostly about vectors and matrices).


Well, this is OpenCL-backed, and the latest OpenCL is all about sharing the memory space between CPU and GPU. AMD has for some time been making a big deal of sharing the same view of virtual memory on the GPU and CPU sides, which lets you use the same pointers from both sides.

https://www.khronos.org/news/press/khronos-finalizes-opencl-...

"Host and device kernels can directly share complex, pointer-containing data structures such as trees and linked lists, providing significant programming flexibility and eliminating costly data transfers between host and devices."


IIRC they're aiming for value types and a new JNI for Java 9, so that could simplify some of those issues.


If what's needed is speed, neither Java nor C# are optimal choices. As for SSE/AVX from Java, http://stackoverflow.com/a/10809123/158026 shows a nice example of how it can be done.

OTOH, I find this specific C# implementation neat and portable. I like it a lot.


> If what's needed is speed, neither Java nor C# are optimal choices.

True, but if you're just trying to get some numerical speed out of some specific routines in a larger .NET application then having access to SSE like this could still be very helpful. Calling into native libraries from .NET isn't necessarily a performant option because of the cost of marshaling data back and forth between the managed and unmanaged memory spaces.

The end result can be pretty significant; from my own experience I'm usually pretty hard-pressed to come up with a C++ implementation that can beat the C# code it intends to replace outside of a microbenchmark. If the C# code now has the option of banging on SSE then I'm not sure it'll even be worth trying to trot out C++.


I agree. This is clearly the way to go for the most common use-case ("I don't want to write C/C++, but I want to make this thing faster"). Having support for SIMD in the standard library, even if it comes at the cost of requiring dedicated types (as with NumPy), is definitely the way to go.


You can get rid of a fair amount of marshaling overhead if you use unsafe code (if that's acceptable in your org).


You can, and I've generally had better luck with unsafe code than with C++ code. Unsafe code can create GC overhead, though (mostly through pinning), so it can also end up doing more harm than good if you're not careful. It's another spot where I've found that microbenchmarks can be misleading: the performance cost that pinning incurs is insidious and hard to measure.


That sounds interesting, could you elaborate? (what is pinning in this context?)


Usually in the managed world you have references that point to an object. The objects themselves don't live in the same place forever; they may be moved by the GC (to consolidate "holes" in memory). References reflect that movement, so you don't notice. However, when using unsafe code (which has pointers and pointer arithmetic) you need to keep the objects in place. That's pinning, and it essentially forces the GC to work around those islands of pinned objects.
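To make that concrete, here's a minimal sketch of what pinning looks like with C#'s fixed statement (the type and method names are just for illustration; compile with /unsafe):

    using System;

    static class PinnedSum
    {
        // Hypothetical helper: walks a float[] through a raw pointer.
        static unsafe float Sum(float[] data)
        {
            float total = 0f;
            // 'fixed' pins the array for the duration of the block: the GC can
            // still run, but it cannot relocate this object, so it has to
            // compact around it. Long-lived or frequent pins fragment the heap.
            fixed (float* p = data)
            {
                for (int i = 0; i < data.Length; i++)
                    total += p[i];
            }
            return total;
        }

        static void Main()
        {
            Console.WriteLine(Sum(new[] { 1f, 2f, 3f })); // 6
        }
    }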


> If what's needed is speed, neither Java nor C# are optimal choices

"Speed" is relative. If you mean throughput or scalability (2 different things), then the JVM or the CLR may be exactly what you want due to the ease with which one can juggle with multi-threading on multi-core processors.

Single-threaded performance is becoming less and less interesting, and dealing with multithreading or asynchronous I/O in lower-level languages such as C/C++ is an extreme pain in the ass, because, for example, C and C++ historically didn't come with a memory model of their own (i.e. you can get fucked even when running with a different glibc version), and the poster child for async I/O, libevent, has been plagued for years with concurrency issues, leading to a whole generation of insecure web servers.

The JVM, .NET and managed runtimes in general are great choices going forward, because not only do they come with memory-model guarantees and a sane standard library, but having higher-level bytecode under the hood that can be generated at runtime means that either the runtime or the libraries running on top can repurpose that bytecode at runtime for optimal execution. For example, you could target the GPU when dealing with parallel collections.


Yes, but there is room for improvement.

https://bugs.openjdk.java.net/browse/JDK-6340864


That's a bit different; this is about user-controlled SIMD instructions, not whether .NET can use SIMD instructions as part of its routine JIT code generation (see http://blogs.msdn.com/b/davidnotario/archive/2005/08/15/4518... for some very old commentary on where .NET took advantage of SSE instructions in its code gen).


Awesome! The other day I tried to find out how much per-CPU tuning the .NET runtime was capable of but could never find anything. Thanks for the link, this is exactly what I was looking for.


SIMD will also be huge for games.


Really? I'd have thought that most of the things that would benefit from SIMD would be down in engine code, which is still mostly C/C++. Are many people writing numerically-intensive full-stack games in managed languages?


> Are many people writing numerically-intensive full-stack games in managed languages?

It could just be that the current languages don't support it well.


I'm sure there are more than a few.

Box2D for Java/C# is very useful (see Farseer in C# land), and can be more efficient than going through lots of interop calls. You can also hit DirectX directly using SharpDX/SlimDX, which, again, is quite useful (if you want to write a game for DirectX, you are pretty much stuck in C++ unless you use these bindings).


Not to mention all the fellows using Unity (or previously XNA) to develop games with. Writeoff #1 always seems to be "not quite fast enough" and each high performance option can help take a chunk out of that population.


Unity only uses C# for application-level user code. The core engine is C++.


Unity is not an XNA replacement. MonoGame is.


Extremely arguable.

I'm not hearing nearly as much about MonoGame as I am about Unity (although I am hearing some promising things.)

I don't mean to imply Unity is a 1:1 API replacement for XNA - it isn't; MonoGame is, yes. But I'd argue Unity has replaced (and quite effectively at that) XNA/MonoGame as a strong game-building target for the masses. And it's not even arguable that, for me, Unity has absolutely replaced XNA, and MonoGame at this point is background noise.

And don't get me wrong, I wish this weren't the case - because the more I've used Unity's APIs, the more I've come to hate them. I have an extreme dislike of the out-of-date and buggy compiler Unity uses for some of those platforms. But Unity's got a broader platform set than XNA and MonoGame combined, which unfortunately for me trumps the rest. I'd unscientifically wager it has far more libraries and middleware being written with it in mind.

Maybe that will change. I would welcome such change. There have already been changes, even. But I'm not holding my breath - even if it does, it's too little and far too late for my day job.


"I don't mean to imply Unity is 1:1 API replacement for XNA" - this is what I meant. You cannot build your own engine on top of Unity, but some people don't need to.


A lot of the content tooling for games is usually written in some other, more productive language (C#, Python, ...), so there is a lot of contact between managed and native code even for AAA games, even though very little of what actually ships in the game is managed. Getting a performance boost (or not having to interop) in a level editor, texture tool, etc. will likely be welcome.

The tradeoff between the safety/convenience/productivity of C# and the performance of C++, for example, can't be justified at the moment, and probably won't be with SIMD either. Writing .NET code with deterministic performance is hard, so hard that C# is no more convenient than C++. I think anything complex and performance-sensitive (a web browser, a game, an SSL implementation...) is soon too complex to be done in C++ and too performance-sensitive to be done in managed code. Rust can't be released soon enough.


Although this technology preview only works on 64-bit Windows 8.1, Microsoft's Kevin Frei promised that this will be released for "all platforms that .NET supports" in the future: http://blogs.msdn.com/b/dotnet/archive/2013/09/30/ryujit-the...


As someone who was recently contemplating pushing some functionality out into C++ code purely for the sake of SIMD, I can only describe this development as pants-wettingly exciting.


If JS and Node will eat enterprise I might as well move into .NET game programming :).


That will never happen.

Then again, the NHS here in the UK just rewrote a ton of stuff, moving from Oracle to RabbitMQ, Riak and Erlang...


Sorry to break it to you, but .NET has been devouring game programming for years. Unity uses it extensively, XNA was until recently a big choice for indie development, MonoGame is growing, etc...


I still don't understand why XNA was discontinued. Is there a suitable replacement for creating games rather easily in C#?


MonoGame is meant to be an OSS XNA replacement. It's very good for indies with good programming skills; Fez and Bastion were implemented using XNA/MonoGame. If one is more of a game-designer/scripter or is business-oriented, then Unity3D is also a good choice.


MonoGame mirrors XNA down to the namespace names so that's a very suitable replacement. You also have Unity of course.


Completely anecdotal, but I would bet it was due to difficulty getting returns from the program. DirectX itself helps sell licenses of Windows, but XNA not so much, since it only impacted indie games.


Unity maybe, but honestly no.


I think the "never gonna happen" was about JS and Node taking over the enterprise.


He didn't say anything about the suitability of .NET for game programming. His post was entirely about enterprise programming.


I was curious about SIMD support in JavaScript, and it looks like Firefox Nightly already has an API for it, and Chrome is getting it too.

http://www.2ality.com/2013/12/simd-js.html

https://01.org/blogs/tlcounts/2014/bringing-simd-javascript


  For performance reasons, we’ve defined those types as immutable value types.

Immutability strikes again.


How would immutability help with performance? (I'm not trying to ask a leading question, just curious) Value types are obviously nice for better cache performance (less pointer chasing, more linear data) but I generally associate immutability with better correctness but worse performance (not much worse, but worse).

edit: On topic, this is really, really cool. I was playing around with SIMD intrinsics in C++ a few days ago (and realized that the compiler was in most cases generating equal or better code with the auto-vectorizer than me using intrinsics) so I have kind of a new-found interest in the topic.


Two guesses:

First, making them immutable means you don't have to worry about memory barriers. That could be huge for data that's being shared by multiple threads. While these types logically have multiple elements, in reality they all fit within a single SSE register. Meaning the performance cost associated with having to worry about mutability could easily annihilate any potential performance boost you might get from being able to fiddle with the vector unit.

(Guess one-and-a-half is that, since these values are meant to fit in a single CPU register, they're really more analogous to atomic types than they are to objects, anyway.)

Second guess is that it's more of a "pit of success" thing than a performance thing. Mutable value types in .NET are really problematic. I've seen so many bugs resulting from them that nowadays I consider them grounds for automatic rejection in code reviews.


The immutable design came from the class library folks (not my team [the JIT folks]). I believe the analogy to atomic types (integers aren't mutable) is pretty sound. The API really was cleaner by making them immutable. If you want to allow mutation, the API surface area really explodes, resulting in dramatically more JIT work to make them perform well.
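To give a feel for what that immutable surface looks like in practice, here's a minimal sketch using the Vector&lt;T&gt; type as it later shipped in System.Numerics (the 2014 preview used slightly different type names):

    using System;
    using System.Numerics;

    class ImmutableVectors
    {
        static void Main()
        {
            // The single-value constructor broadcasts the value into every lane.
            var a = new Vector<float>(1.0f);
            var b = new Vector<float>(2.0f);

            // Operators return new values and map to packed add/mul instructions;
            // 'a' and 'b' are never modified.
            var c = (a + b) * b;

            Console.WriteLine(c[0]);   // reading a lane is fine: 6
            // c[0] = 5.0f;            // won't compile: there is no setter,
                                       // the type is immutable by design
        }
    }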


I would guess this is due to the underlying ISAs. Most SIMD ISAs don't have a quick way to, say, replace the 9th byte in a 16-byte vector. Since the whole point of SIMD is to make stuff go faster, it makes perfect sense to omit operations which are slow.


There's not much in the way of contrary opinion here, so let me offer some. The approach of not tying you to a particular architecture is fundamentally wrong. The right way is to expose APIs for each processor architecture.

Here's why. SIMD offers no new capabilities (1), only more speed, and not much more at that, maybe 4x if you're lucky. It's also hard to use: it requires unnatural data layouts, and lacks many operations (e.g. integer division). None of this is specific to the .NET implementation: it's just the nature of the beast.

So successfully exploiting SIMD is not easy, and requires thinking at the level where instruction counts matter. And because the amount of parallelism is so limited, high level languages (by which I include C!) can very easily blow away any gains with suboptimal codegen. Just a handful of additional instructions can ruin your performance (2).

Here's what will go wrong with an architecture-independent SIMD API:

1. Say you invoke an operation without an underlying native instruction. The compiler is forced to implement this by transferring the data to scalar registers, performing the operation, and then transferring the result back. Game over: this exercise is likely to eat up any performance benefit.

2. To avoid this, say you limit the API to some "common subset" of all extant SIMD ISAs. The problem is, many algorithms admit vectorization only through the exotic instructions, such as POPCNT on SSE4, or the legendary vec_perm on AltiVec. If this instruction is not exposed, you can't vectorize the algorithm. Game over again.

That's why software that takes advantage of SIMD invariably has separate implementations for each supported ISA. .NET should have followed suit: expose an API for each ISA (or a mega-API that covers all ISAs), and then provide rich information about which operations are implemented efficiently, and which are not, to allow apps to choose an optimal implementation at runtime. This API would demo and market very poorly, but the engineers will love it, because it's the one that enables the most benefit from SIMD.

1: with rare exceptions, such as the new fused multiply-add support in x86

2: Several years back, VC++ generated an all-bits-1 register by loading it from memory, instead of issuing a pcmpeqd, which caused my vector implementation to underperform my scalar one. This is my fear for the .NET implementation.
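For reference, the portable API does at least pair with a coarse runtime check so callers can fall back to scalar code when vectors aren't accelerated. A minimal sketch using System.Numerics.Vector&lt;T&gt; as it later shipped (these names are from the later release, not necessarily the 2014 preview):

    using System.Numerics;

    static class SimdSum
    {
        // Sums a float[] with Vector<float> when hardware acceleration is
        // available, otherwise (and for the tail) with plain scalar code.
        public static float Sum(float[] data)
        {
            float total = 0f;
            int i = 0;

            if (Vector.IsHardwareAccelerated)
            {
                var acc = Vector<float>.Zero;
                int width = Vector<float>.Count;        // e.g. 4 with SSE, 8 with AVX
                for (; i <= data.Length - width; i += width)
                    acc += new Vector<float>(data, i);  // one packed add per chunk
                for (int lane = 0; lane < width; lane++)
                    total += acc[lane];                 // horizontal reduction
            }

            for (; i < data.Length; i++)                // scalar tail / fallback
                total += data[i];

            return total;
        }
    }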


This made me think of clang/gcc's vector extensions [1], which, together with __builtin_shuffle, can be used to get some reasonably good cross-platform (SSE/NEON/...) SIMD code going. An example of this in use is [2].

That said, you're right: usually the best performance can only be obtained by using really specific instructions. But in my experience, a decent performance increase can be obtained by using the generic vector extensions.

Moreover, if you can use the vector extensions for a large part of the code, you have to write a lot less platform-specific stuff; you increase portability anyway, since now you only have to rewrite 5 out of 20 functions instead of 20 out of 20. Even better, they allow one to write v3 = v1 + v2 instead of v3 = _mm_add_ps(v1, v2); the former is clearer, more portable (it will generate the appropriate addps or the NEON equivalent), and plain nicer to read.

Your pcmpeqd example is a good example of an optimizer flaw. In my opinion this is orthogonal to whether to expose a specific or a generic API. The compiler should have used the most efficient instruction for that simple idiom, period (without you telling it to use pcmpeqd). If we continue your line of reasoning, we're back to assembly for everything.

[1]: https://vec.io/posts/gcc-and-clang-vector-extensions (The vector extensions allow +, -, *, /, <, >, ==, != to be used naturally on SIMD types)

[2]: https://github.com/rikusalminen/threedee-simd


I don't see any mention of shuffles, which I believe are regarded as the most important part of SIMD for high performance.

Does anyone happen to know more about how/if shuffles are going to be exposed?


How would you like shuffles exposed? One of the things we really tried to do with this design is make sure it's NOT tied to one particular hardware implementation of SIMD.


I would guess that any interesting SIMD hardware would expose shuffles in one form or another since they're so powerful. I would even consider defining "interesting SIMD hardware" as one that exposes shuffles. (Example of the power of shuffles: http://software.intel.com/en-us/articles/using-simd-technolo... which uses shuffles to make super-fast 4x4 matrix multiplies (very common for games).)

Prior art:

- LLVM's shufflevector intrinsic http://llvm.org/docs/LangRef.html#shufflevector-instruction (not meant for human consumption, but is an example of a multi-architecture backend with SIMD shuffle support)

- OpenCL's shuffle https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/s... (Also note the `.s0123` "swizzle" syntax there.)
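For what it's worth, .NET did eventually expose per-ISA shuffles, but only years later, via the System.Runtime.Intrinsics namespace; they were not part of this preview. A minimal sketch of a lane reversal on SSE, with a portable fallback:

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class ShuffleExample
    {
        static Vector128<float> ReverseLanes(Vector128<float> v)
        {
            // 0b00_01_10_11 selects lanes 3,2,1,0 (a shufps with both sources = v).
            return Sse.IsSupported
                ? Sse.Shuffle(v, v, 0b00_01_10_11)
                : Vector128.Create(v.GetElement(3), v.GetElement(2),
                                   v.GetElement(1), v.GetElement(0));
        }
    }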


Couldn't you expose machine-dependent stuff in a subclass?

SSE has a bunch of other useful instructions like PMOVMSKB (useful for fetching the result of vectorized comparisons, yay!), then there are string instructions (sometimes also useful outside of string processing), etc.

New versions (AVX-512) will also have mask registers for masked operations.


It sounds like they're trying to maintain platform independence while still abstracting platform-specific low-level operations. If you utilize machine-specific code in subclasses, how do you generate MSIL that remains independent?

If you really needed something like that, could you not use C++/CLI and expose those operations in your own unmanaged library? You will of course lose portability, but that seems like a possible workaround.


And we've had vectorization years ago in the JVM.


And we've had straight marriage for years so why is gay marriage news, right?



