That's good news! About that, two days ago I modified the implementation of the neural networks inside Neural Redis in order to use AVX2. It was a pretty interesting experience and in the end, after modifying most of it, the implementation is 2x faster compared to the vanilla C implementation (which was already optimized to be cache oblivious).
I never touched AVX or SSE in the past, so this was a great learning experience. In 30 minutes you can get 90% of it, but I think that to really do great stuff you need to also understand the relative cost of every AVX operation. There is an incredibly helpful user on Stack Overflow who replies to most AVX / SSE questions; if you check the AVX / SSE tags you'll find them easily.
However I noticed that when there were many load/store operations to do, there was no particular gain. See for example this code:
#ifdef USE_AVX
    /* Broadcast the scalar error signal to all 8 lanes. */
    __m256 es = _mm256_set1_ps(error_signal);
    int psteps = prevunits/8;
    for (int x = 0; x < psteps; x++) {
        __m256 outputs = _mm256_loadu_ps(o);           /* load 8 outputs */
        __m256 gradients = _mm256_mul_ps(es,outputs);  /* scale by error signal */
        _mm256_storeu_ps(g,gradients);                 /* store 8 gradients */
        o += 8;
        g += 8;
    }
    /* Advance k past the elements handled above; the scalar loop picks up the remainder. */
    k += 8*psteps;
#endif
What I do here is to calculate the gradient after I computed the error signal (error * derivative of the activation function). The code is equivalent to:
for (; k < prevunits; k++) *g++ = error_signal*(*o++);
Any hints about exploiting AVX to its fullest in this use case? Thanks. OK, probably this was more of a thing for Stack Overflow, but too late, hitting enter.
You're really just implementing a vector*matrix multiply. You probably want to just use BLAS's sgemv routine. On macOS, link Accelerate and use cblas_sgemv(); on Linux consider installing Intel MKL or OpenBLAS.
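For reference, here is a minimal sketch of what such a call could look like, assuming a row-major weights matrix; the function and variable names (forward_sgemv, W, x, y, rows, cols) are placeholders, not names from Neural Redis:

#include <cblas.h>

/* Sketch only: y = W * x for a row-major rows x cols matrix W. */
void forward_sgemv(const float *W, const float *x, float *y,
                   int rows, int cols) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans,
                rows, cols,
                1.0f,        /* alpha */
                W, cols,     /* matrix and its leading dimension */
                x, 1,        /* input vector and stride */
                0.0f,        /* beta = 0: overwrite y */
                y, 1);       /* output vector and stride */
}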
If you're just looking to learn what a reasonably state-of-the-art SGEMV kernel looks like for a modern chip like Haswell, check out OpenBLAS's code:
That sounds like great advice, I'll check out this library. I have the feeling that the non-aligned addresses in the current weights scheme will be a problem and padding will be required. That's why I used the AVX "loadu" variant that deals with non-aligned addresses, but perhaps I'm paying a big performance penalty because of it. Thanks.
EDIT: apparently on modern CPUs that's not the case and magically LOADU can be as fast as LOAD.
Don't do unaligned memory access, whatever your cpu flags say.
Another thing you can do, if you don't plan on using the results immediately, is to use a non-temporal store (movntps in SSE). If you do plan to use the results right away, then just use them without storing to main memory.
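As a rough sketch of what that looks like with AVX (not code from Neural Redis; it assumes g is 32-byte aligned, since the streaming store requires it, and n a multiple of 8):

#include <immintrin.h>

/* Same gradient loop as above, but with a non-temporal store that
 * bypasses the cache. Only worthwhile if the gradients are not read
 * again soon. */
void scale_stream(const float *o, float *g, float error_signal, int n) {
    __m256 es = _mm256_set1_ps(error_signal);
    for (int x = 0; x < n; x += 8) {
        __m256 outputs = _mm256_loadu_ps(o + x);
        _mm256_stream_ps(g + x, _mm256_mul_ps(es, outputs));
    }
    _mm_sfence(); /* make the streamed stores globally visible */
}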
Does anybody have an example of good performance of unaligned memory access on modern cpus? And note that what matters isn't whether the cpu supports AVX, but whether it has a flag that says it can do fast unaligned memory access (i don't remember, is it misalignsse?).
Common sense says that unaligned access can't be faster than aligned. And if you have data that fits into a ymm register, then you might as well use aligned access (a neural network is usually an example of such).
I did test it a while ago. Problem is that i don't remember if it was on this, modern, cpu or the older one. I could test again if i cared enough about other people's opinions, but alas i don't (the only usage of unaligned AVX access i've found has been from newbies to SIMD). An example of the kind you request would be glibc memcpy, which uses ssse3 [0] so that it can always get aligned access (ssse3 has per-byte operations).
In other words, how about the people who claim that operations that do extra work are as fast as the ones that don't actually prove it? Instead of the burden of proof falling on people who don't share that opinion/experience? Then i will bow my head and say "You are right. Thank you for pointing that out". But alas, after googling for 10min i have found no such benchmark anywhere. And writing such a test isn't hard, not in the slightest.
I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.
I linked elsewhere in the thread to my more detailed experiments regarding unaligned vector access on Haswell and Skylake: http://www.agner.org/optimize/blog/read.php?i=415#423. This is the source of my conclusion that alignment is not a significant factor when reading from L3 or memory, but does matter when attempting multiple reads per cycle from L1.
Both of these link to code that can be run for further tests. If you find an example of an unaligned access that is significantly slower than an aligned one on a recent processor (and they certainly may exist), I'll nudge Daniel into writing an update to his blog post.
>I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.
For me it depends on the context. Here aligned access makes more sense so unaligned should be defended.
I hacked together a test, feel free to point out mistakes.
unaligned on one byte unaligned data: 0 sec, 70278354 nsec
unaligned on three bytes unaligned data: 0 sec, 70315162 nsec
aligned nontemporal: 0 sec, 42549571 nsec
naive: 0 sec, 67741031 nsec
Repeating the test only shows non-temporal to be of benefit. The difference of, on average, 1-2% is not much, that i yield. But it is measurable.
But that is not all! Changing the copy size to something that fits in the cache (1MB) showed completely different results.
aligned: 0 sec, 160536 nsec
unaligned on aligned data: 0 sec, 179999 nsec
unaligned on one byte unaligned data: 0 sec, 375108 nsec
aligned nontemporal: 0 sec, 374811 nsec // usually a bit slower than one byte unaligned
And, out of interest, i made all the copies skip every second 16 bytes; (relative) results are the same as the original test except non-temporal being over 3x slower than anything else.
And this is on a amd fx8320 that has the misalignsse flag. On my former cpu (can't remember if it was the celeron or the amd 3800+) the results were very much in favor of aligned access.
So yea, align things. It's not hard to just add " __attribute__ ((aligned (16))) " (for gcc, idk anything else).
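For example, a minimal sketch of the two usual ways to get 32-byte alignment for AVX, assuming gcc/clang and C11 (the names weights and make_buffer are just illustrative):

#include <stdlib.h>

/* Static array, aligned via the gcc/clang attribute. */
static float weights[1024] __attribute__ ((aligned (32)));

/* Heap buffer via C11 aligned_alloc; the size passed must be a
 * multiple of the alignment, so round it up. */
float *make_buffer(size_t n) {
    size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return aligned_alloc(32, bytes);
}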
PS It may seem like the naive way is good, but memcpy is a bit more complicated than that.
See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024. I think what you observed is the result of loads and stores hitting the same cache set at the same time, all while misalignment additionally increases the number of cache banks involved in any given operation. But that's just hand-waving, I don't know the internals enough to say with confidence what's going on exactly.
BTW, changing misalignment from 1 to 8 reduces this effect by half on my Thuban. Which is important, because nobody sane would misalign an array of doubles by 1 byte, while processing part of an array starting somewhere in the middle is a real thing.
Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:
>See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024.
Tested. There's a greater difference between aligned and aligned_unaligned. But that made the test go over my cache size (2MB per core), so i tested with 512kB with and without your +128. Results were (relatively) similar to the original 1MB test.
>Which is important, because nobody sane would misalign an array of doubles by 1 byte [...]
Adobe flash would, for starters (idk if doubles, but it calls unaligned memcpy all the time). The code from the person above also does, because compilers sometimes do (an aligned mov sometimes segfaults if you don't tell the compiler to align an array, especially if it's in a struct).
>Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:
Of course you did, you unrolled the loop. The whole point was to test memory access, not to write a fast copy function.
>c_is_faster_than_asm_a()
First of all, that is not in the C specification. It is a gcc/clang/idk_if_others extension to C. It compiles to something similar to what I would write if i had unrolled the loop. Actually worse, here's what it compiled to http://pastebin.com/yL31spR2 . Note that this is still a lot slower than movntps when going over cache size.
edit: I didn't notice at first. Your code copies 8 16-byte chunks to the first one. You forgot to add +n to dst.
NNPACK (https://github.com/Maratyszcza/NNPACK) is also a pretty reasonable AVX2 GEMM implementation, written in a Python-based assembler; it's a bit easier to follow than the OpenBLAS kernels and has very reasonable performance vs OpenBLAS/MKL.
I second the guidance to just use a BLAS library. I'd guess you'd see a 2-10x speedup at least. It looks like you're only implementing a classical MLP library (which is fine for learning, but not the best choice for something like this; integrating VW into Redis would be a much better choice IMO).
This should be a very fast loop. If the input and output are larger than cache, you should be limited by RAM bandwidth. And if your outputs and gradients fit in L1 cache, it should be possible to get your loop down to single cycle per iteration: load, multiply, store, increment, and test can all execute in the same cycle. It will probably be difficult to convince a compiler to produce the necessary code, though, and if you do, it will likely be specific to that particular compiler and flags.
If you want to make this as fast as it can be, the first step is to measure what you do have, and see how fast it is in terms of cycles per iteration. Then you'll want to look at the generated assembly, and what the compiler has done to your carefully crafted loop. The better you've optimized it to begin with, the more likely that it will have done something to make it worse. If you want guaranteed performance on your target platform, you need to lock down the assembly.
The fastest loop structure here will involve using a single negative offset, so that you can increment it by 8 and test if you have reached zero. This allows the addition and loop test to "fuse" into a single µop (ADD-JNZ). The loads will be (end_outputs + neg_offset), and the stores (end_gradients + neg_offset). Unrolling should not be necessary if you can get the right assembly.
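A rough C sketch of that structure, reusing the names from the snippet above and assuming prevunits is a multiple of 8 (the generated assembly still needs to be checked to confirm the compiler keeps the add/jnz form):

/* Index from the end of both arrays with a single negative offset that
 * counts up toward zero. */
__m256 es = _mm256_set1_ps(error_signal);
const float *o_end = o + prevunits;
float *g_end = g + prevunits;
for (long i = -(long)prevunits; i != 0; i += 8) {
    __m256 outputs = _mm256_loadu_ps(o_end + i);
    _mm256_storeu_ps(g_end + i, _mm256_mul_ps(es, outputs));
}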
In the absence of making it "perfect", though, you may benefit by unrolling it a few times, although your compiler is probably already doing this. You might try to encourage the compiler to generate a single "load-multiply" instruction by using a dereferenced pointer in the multiply. The processor is limited to decoding 4 instructions per cycle, and depending on what the compiler produces, saving an instruction per iteration here might avoid this bottleneck.
And as someone else mentioned, it will help if your inputs and outputs are 32B aligned. In particular, stores that cross 4KB boundaries can be expensive enough to slow down the whole algorithm (although I think this has gotten better on Skylake). It shouldn't be a big difference, though, unless everything else is extremely optimized. I explore it here: http://www.agner.org/optimize/blog/read.php?i=415#419
Thank you for the interesting advice. Unfortunately no 32B alignment. The start address of the weights vectors is aligned, but the problem is that as I jump from here to there to get the weights, the alignment is broken. I bet that if I analyze the way I loop through the weights, I can figure out a way to make this whole thing a linear aligned access. Probably that's my starting point! Thank you. I'll make sure to try different cycle counts and unrolling; I don't want to reach the level of writing the assembler, but to get at least a good speedup where possible.
About RAM bandwidth, basically it depends a lot on the neural network size: for small/medium networks it's all inside L1, for very large ones it's not, and of course the bigger they are, the slower they are to train and the more interesting it is to get the speedup. Thanks.
I would try unrolling 2-4 iterations of the loop. Multiple sequential loads isn't much slower than a single load, so batching your loads and stores together will let you do more arithmetic operations for each time you hit memory.
It might help to try and squeeze more work between the load and the store. E.g. do two of these in parallel (load/load/mul/mul/store/store). Think about latency vs. throughput: the loads and stores may have high latency but also high throughput, so you need to be issuing more of them concurrently (without any dependencies between them). Just off the top of my head.
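Something like this, reusing the names from the original snippet and assuming psteps is even (just a sketch of the interleaving, not measured code):

for (int x = 0; x < psteps; x += 2) {
    /* Two independent load/mul/store chains in flight per iteration. */
    __m256 a = _mm256_loadu_ps(o);
    __m256 b = _mm256_loadu_ps(o + 8);
    __m256 ga = _mm256_mul_ps(es, a);
    __m256 gb = _mm256_mul_ps(es, b);
    _mm256_storeu_ps(g, ga);
    _mm256_storeu_ps(g + 8, gb);
    o += 16;
    g += 16;
}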
Actually, in most cases instruction ordering isn't really a factor on modern Intel processors. They operate "out of order" (often abbreviated OoO), which means that they execute instructions (µops really) from a "reorder buffer" as soon as dependencies are satisfied. Since the loads have no dependencies, they will already execute several (possibly many) iterations ahead of the multiplies and stores.
The latency of the store doesn't matter to us, since we aren't immediately using the result. This doesn't mean that you won't see a benefit from your suggestion, but if you do it will probably be from the reduced loop overhead rather than the improved concurrency.
I'm aware of out of order execution but my experience is that at least with SSE/AVX instruction ordering does matter... The loads do have a dependency on the address for example. Anyways, some experimentation will help, even loop unrolling doesn't always behave the way you'd think.
> The loads do have a dependency on the address for example.
Sort of. They depend on the address, but there is nothing that prevents the address from being calculated ahead of time. So what happens is that both the loads and the addition get executed multiple iterations ahead of the multiplication. While we tend to think of one iteration completing before the next begins, from the processor's point of view it's just a continuous stream of instructions.
Loops (especially FP loops) are often limited by dependency chains, which limits what OoO execution can do. Unrolling (and using multiple accumulators) helps create multiple independent chains that can be executed in parallel.
Edit: for this specific loop the only dependency is on the iteration variables, which is not an issue here, as the loop should only be limited by load/store bandwidth, assuming proper scheduling and induction variable elimination from the compiler.
I'm not the most experienced, so I may be talking out of my ass.
I imagine that what you're hitting is IO. Getting stuff from lower levels of the memory hierarchy is slow. To benefit from vector ops, you want your operands to already be in registers. The question is whether you're hitting the bandwidth limit of that operation, basically. If not, I wonder if there is something that helps you help the hardware prefetcher.
You might also want to try and see if unrolling the loop helps, gets rid of the iteration conditionals, but I guess you're adding more instructions, so perhaps more instruction cache misses.
Also, are the pointers to where you're grabbing the block of vectors aligned? Looks like it could be. Try something like _mm256_load_ps perhaps.
The problem I have is exactly the lack of alignment... unfortunately, while the vectors themselves are aligned, the way I need to jump inside the vectors is not... Padding would complicate the implementation quite a bit, so I was not sure whether to implement it or not, and I'm also not sure about the effects on the cache... I wonder what the penalty of loadu vs load is, maybe that's the problem indeed. Thanks.
> However I noticed that when there were many load/store operations to do, there was no particular gain. See for example this code:
You may be running out of cache/memory bandwidth, compare your throughput with numbers from benchmarks or "Software Optimization Manual" of your CPU. If you are on Linux, try
perf stat -d ./whatever
which reports, among other things, the rate of cache and memory hits, obtained from hardware "performance counters".
> _mm256_loadu_ps
Some years ago and with SSE, unaligned was slower than aligned. On some chips only if the address happened to really be unaligned, on others in every case. Not sure how it is with AVX.
On recent processors (last few years), as long as you're not crossing a cache line, unaligned is essentially the same as aligned in terms of performance.
Oh! That's super interesting, so now I know there is no point in trying to improve the alignment too much. I'll still try to see if I can make the whole operation a linear scan, but if there is no big penalty, adding padding among the weights looks like a bad idea.
> That's super interesting, so I know there is no point in trying to improve the alignment too much.
Typical x86 cache line size is 64B, AVX registers are 32B. When your o pointer isn't 32B-aligned, you are loading across a cache line boundary every second time. If nkurz is right and this loop can run 1 iteration per cycle, misalignment probably slows it down to 2 iterations per 3 cycles. Benchmark this loop if you want to know.
Of course it only matters if the array is still in L1, maybe L2 (?).
Blogspam, kinda. Post just links to https://software.intel.com/sites/default/files/managed/69/78... and says it mentions two new instructions: AVX512_4VNNIW (Vector instructions for deep learning enhanced word variable precision) and AVX512_4FMAPS (Vector instructions for deep learning floating-point single precision)
On Intel's part, it seems kinda... late? It's like adding Bitcoin instructions when you already know everyone's racing to make Bitcoin ASICs. How could it beat dedicated hardware, or even GPUs, on ops/watt? Maybe it's intended for inference, not training, but that doesn't sound compelling either.
This aspect of their new chips is massively underrated. An FPGA is the future-proof solution here, not chip-level instructions for the soup-du-jour in machine learning.
Edit: which is not to say that I'm not welcoming the new instructions with open arms...
I'm not as hyped about FPGA-in-CPU so much as I am of having Intel release a specification for their FPGAs that will allow development of third-party tools to program them.
Right now the various vendors seem to insist on their own proprietary everything which makes it hard to streamline your development toolchain. Many of the tools I've used are inseparably linked to a Windows-only GUI application.
Your lab has some interesting publications. Doing good work. Appreciate you taking time to try to solve the FPGA problem. Here's a few related works in case you want to build on or integrate with them:
I'm not too familiar with FPGAs, but isn't the tradeoff that since they are flexible they are usually much slower than CPUs/GPUs and it is usually used to prototype an ASIC? How is FPGA-in-CPU going to be a good thing?
They're slower in terms of clock speed, but they're not slower in terms of results.
You can do things in an FPGA that a CPU can't even touch, it can be configured to do massively parallel computations for example.
If Bitcoin is any example, GPU is faster than CPU, FPGA is faster than GPU, and ASIC is faster than FPGA. Each one is at least an order of magnitude faster than the previous one.
A GPU can do thousands of calculations in parallel, but an FPGA can do even more if you have enough gates to support it.
I haven't looked too closely at the SHA256 implementations for Bitcoin, but it is possible to not only do a lot of calculations in parallel, but also have them pipelined so each round of your logic is actually executing simultaneously on different data rather than sequentially.
One of the biggest caveats of FPGAs that I don't see mentioned often is that they're slow to program. This implies some unusual restrictions about where they can be used. I.e. data centers will benefit, general purpose computing not so much.
Well, normally what you lose in terms of clock speed when using an FPGA you make up in being able to establish hardware dataflows for increased parallelism. But I don't have a sense for whether deep learning problems are amenable to that sort of thing.
It will only really take off if they can get the user experience of getting a HDL program to the FPGA to the same level as getting a shader program to the GPU. Unless they can do all that in microcode though it's going to force them to open up some stacks of Altera's closed processes. I'm hopeful but there's a lot of proprietary baggage around FPGAs that I think have kept them from truly reaching their potential.
Almost all FPGAs have entirely proprietary toolchains, from Verilog/HDL synthesis all the way down to bitstream loaders for programming your board. These tools are traditionally very poorly built as applications, scriptable only with things like TCL, terrible error messages, godawful user interfaces designed in 1999, etc etc.[1] This makes integrating with them a bit more "interesting" for doing things like build system integration, etc.
Furthermore, FPGAs are inherently much more expensive for things like reprogramming, than just recompiling your application. Synthesis and place-and-route can take excessive amounts of time even for some kinds of small designs. (And for larger designs, you tend to do post-synthesis edits on the resulting gate netlists to fix problems if you can, in order to avoid the horrid turn around time of hours or days for your tools). The "flow" of debugging, synthesis, testing etc is all fundamentally different and has a substantially different kind of cost model.
This means many of the "dream" ideas -- on-the-fly reconfiguration, "Just-in-Time"-ing your logic cells based on things like hot, active datapaths -- are a bit difficult for all but the simplest designs, due to the simple overhead of the toolchain and design phase. That said, some simple designs are still incredibly useful!
What's more likely with these systems is you'll use, or develop, pre-baked sets of bitstreams to add in custom functionality for customers or yourselves on-demand, e.g. push out a whole fleet of Xeons, configure the cells of machines on the edge to do hardware VPN offload, configure backend machines that will run your databases with a different logic that your database engine can use to accelerate queries, etc.
Whether or not that will actually pan out remains to be seen, IMO. Large enterprises can definitely get over the hurdles of shitty EDA tools and very long timescales (days) for things like synthesis, but small businesses that may still be in a datacenter will probably want to stick with commercial hardware accelerators.
Note that I'm pretty new to FPGA-land, but this is my basic understanding and impressions, given some time with the tools. I do think FPGAs have not really hit their true potential, and I do think the shit tooling -- and frankly god awful learning materials -- has a bit to do with it, which drives away many people who could learn.
On the other hand: FPGAs and hardware design is so fundamentally different for most software programmers, there are simply things that, I think, cannot so easily be abstracted over at all -- and you have to accept at some level that the design, costs, and thought process is just different.
[1] The only actual open-source Verilog->Upload-to-board flow is IceStorm, here: http://www.clifford.at/icestorm/. This is only specifically for the iCE40 Lattice FPGAs; they are incredibly primitive and very puny compared to any modern Altera or Xilinx chip (only 8,000 LUTs!), but they can drive real designs, and it's all open source, at least.
Just as an example, I've been learning using IceStorm, purely because it's so much less shitty. When I tried Lattice's iCE40 software they provided, I had to immediately fix 2 bash scripts that couldn't detect my 4.x version kernel was "Linux", and THEN sed /bin/sh to /bin/bash in 20 more scripts __which used bash features__, just so that the Verilog synthesis tool they shipped actually worked. What the fuck.
Oh! Also, this was after I had to create a fake `eth0` device with a juked MAC address (an exact MAC being required for the licensing software), because systemd has different device names for ethernet interfaces based on the device (TBF, for a decent reason - hotplugging/OOO device attachment means you really want a deterministic naming scheme based on the card), but their tool can't handle that, it only expects eth0.
As a software programmer, my mind is blown by the idea of a 3GHz CPU. That's 3,000,000,000 cycles per second. The speed of light c = 299,792,458 meters per second. Which means that, in the time it takes to execute 1 cycle, 1/3,000,000,000 seconds -- light has moved approximately 0.0999m ~= 4 inches. Unbelievable. Computers are fast!
On the other hand... Once you begin to deal with FPGAs, you often run into "timing constraints" where your design cannot physically be mapped onto a board, because it would take too long for a clock signal (read: light) to traverse the silicon, along the path of the gates chosen by the tool. You figure this out after your tool took a long amount of time to synthesize. You suddenly wish the speed of light wasn't so goddamn slow.
In one aspect of life, the speed of light, and computers, are very fast, and I hate inefficiency. In another, the speed of light is just way too slow for me sometimes, and it's irritating beyond belief because my design would just look nicer if it weren't for those timing requirements requiring extra work.
How to reconcile this kind of bizarre self-realization is an exercise left to the reader.
It's not the speed of light that's the long pole in the timing constraint, it's the "switching" time of the circuits that make up the digital logic. You can't switch from zero to one or one to zero instantaneously. It's modeled (or at least it was when I went to college) as a simple RC circuit. Your clock can't tick any faster than the time it takes to charge the slowest capacitor in your system. That's why chips have gotten faster and faster over the years, we've reduced that charge/discharge time (not because we've sped up light).
"pre-baked sets of bitstreams" is a good way to describe what I expect to be the most likely scenario. In other words, spiritually very similar to ordinary configurable hardware (PCIe cards, CPUs, DIMMs, USB devices, etc)
Don't forget the niche nature of an FPGA, too. They are incredibly slow & power hungry compared to a dedicated CPU, so similar to the CPU/GPU split you find that only particular workloads are suitable.
Your comment represents some of the best of HN (detailed, illuminating, informative), but is incredibly depressing for someone with an idle curiosity in FPGAs. This is what I've long suspected, and it seems that the barrier to entry is generally a bit too high for "software" people.
I do think there's some light at the end of the tunnel. IceStorm is a thousand times better than any of the existing EDA tools if you just want to learn, and has effectively done a full reverse engineering of the iCE40 series. It only has a Verilog frontend, but it's a great starting point.
You can get an iCE40 FPGA for ~$50 USD, or as low as $20. It'll take you probably 30 minutes to compile all the tools and install them (you'll spend at least 2x that much time just fucking with stupid, traditional EDA tools, trying to make sense of them, and you'll still do that more than once probably), and you're ready.
The learning material stuff is much more difficult... Frankly, just about every Verilog tutorial out there, IMO, totally sucks. And Verilog is its own special case of "terrible language" aside from that, which will warp your mind.
Did you just use an undeclared wire (literally the equivalent of using an undeclared variable in any programming language)? Don't worry: Verilog won't throw an error if you do that. Your design just won't work (??????).
If you're a "traditional" software programmer (I don't know how else to put it?) who's mostly worked in languages like C, Java, etc. then yes: just the conceptual change of hardware design will likely be difficult to overcome, but it is not insurmountable.
The actual "write the design" part hasn't been too bad for me - but that's because I don't write Verilog. I write my circuits in Haskell :) http://www.clash-lang.org -- it turns out Haskell is actually an OK language for describing sequential and combinatoral circuits in very concise way that embodies the problem domain very well -- and I have substantial Haskell experience. So I was able to get moving quickly, despite being 100% software person.
There are also a lot of other HDLs that will probably take the edge off, although learning and dealing with Verilog is basically inevitable, but I'd rather get accustomed to the domain than inflict self-pain.
MyHDL, which is an alternative written in Python, seems to be an oft-recommended starting point, and can output Verilog. Someone seems to even have a nice tool that will combine MyHDL with IceStorm in a tiny IDE! Seems like a great way to start -- https://github.com/nturley/synthia
HDLs such as Verilog are fine when you keep in mind what the D stands for. They are not programming languages. They are hardware description languages. Languages used to describe hardware. In order to describe hardware, you need to picture it in your head (muxes, flops, etc.). Once you do, writing the HDL that describes that hardware is reasonably easy.
What doesn't work is using an HDL to write as if it were software. It's just not.
Sure, I don't think anything I said disagrees with that. This conceptual shift in "declarative description" of the HW is a big difference in moving from software to hardware design, and no language can remove that difference. It doesn't matter if you're using MyHDL or Haskell to do it! It just so happens Haskell is a pretty declarative language, so it maps onto the idea well.
But it's not just about that... Frankly, most HDLs have absolutely poor abstraction capabilities, and are quite verbose. Sure, it isn't a software language, but that's a bit aside from the point. I can't even do things like write or use higher order functions, and most HDLs don't have very cheap things like data types to help enforce correctness (VHDL is more typed, but also pretty verbose) -- even when I reasonably understand the compilation model and how the resulting design will look, and know it's all synthesizable!
On top of that, simply due to language familiarity, it's simply much easier for me to structure hardware descriptions in terms of things like Haskell concepts, than directly in Verilog. It wouldn't be impossible for me to do it in Verilog, though -- I'm just more familiar with other things!
At some level this is a bit of laziness, but at another, it's a bit of "the devil you know". I do know enough Verilog to get by of course, and do make sure the output isn't completely insane. And more advanced designs will certainly require me going deeper into these realms where necessary.
I've recently been writing a tiny 16-bit processor with Haskell, and this kind of abstraction capability has been hugely important for me in motivation, because it's simply much easier for me to remember, describe and reason about at that level. It's simply much easier for me to think about my 2-port read, 1-port write register file, mirrored across two BRAMs, when it looks as nice and simple as this: https://gist.github.com/thoughtpolice/99202729866a865806fd6d..., and that code turns into the kind of Verilog you'd expect to write by hand, for the synthesis tool to infer the BRAMs automatically (you can, of course, also use the direct cell library primitive).
That said -- HDL choice is really only one part of the overall process... I've still got to have post-synthesis and post-PNR simulation test benches, actual synthesizable test benches to image to the board, etc... Lots of things to do. I think it's an important choice (BlueSpec is another example here, which most people I've known to use it having positive praise), but only one piece of the overall process, and grasping the whole process is still something I'm grappling with.
I haven't used PyMTL but have used MyHDL, years ago. (I'm glad for aseipp's comments because that proprietary mess is exactly why I haven't bothered keeping up with FPGA development, though IceStorm has gotten me interested...) Glancing at PyMTL it seems they're comparable -- neither require that much code (https://github.com/jandecaluwe/myhdl). I'd say without digging deeper MyHDL is older, stabler, better documented, and has been used for real world projects including ASICs. PyMTL might allow for more sophisticated modeling. I'd be concerned about PyMTL's future after the students working on it graduate, whereas MyHDL is Jan's baby.
As someone who recently got into programming FPGAs from decades of usual software development, it's not that bad. Takes a totally different mindset, but it's much more approachable than what it seems to be.
Actually a couple of YouTube videos can get you up and running pretty fast. Verilog is quite simple and doing things like writing your own VGA adapter is pretty straightforward and teaches you a lot and is a lot of fun.
Anecdotal evidence, but I am currently learning FPGAs and VHDL, and find it no more difficult than "normal" C. Way more expressive, for sure: a complete 8-channel PWM controller in under 150 lines. Parallelism without locking feels very cool.
I think VHDL makes things easier. Verilog has some fundamental nondeterminism: http://insights.sigasi.com/opinion/jan/verilogs-major-flaw.h... Still, HDL in general is probably on the order of difficulty as getting used to pointers. If you've already done a bunch of work with breadboards and logic gates it's even easier to grok. Of course a lot of people just instantiate a CPU core on the FPGA and program in C...
My understanding (and please correct me if I'm wrong!) is that it's pretty easy to burn out FPGAs, electrically, in a way that it's not particularly trivial to do with other sorts of circuitry. So one downside of exposing your bitstreams is people are going to blow things up and demand new chips.
There's no evidence so far that FPGAs come anywhere close to GPUs w/r to deep learning performance. All the benchmarks so far, through Arria 10, show it to be mediocre for inference, and the lack of training benchmark data IMO implies it's a disaster for that task. See also Google flat out refusing to define what processors they measured TPU performance and efficiency against.
FPGAs are best when deployed on streaming tasks. And one would think inference would be just that, yet the published performance numbers are on par with 2012 GPUs. That said, if they had as efficient a memory architecture as GPUs do, things could get interesting down the road. But by then I suspect ASICs (including one from NVIDIA) will be the new kings.
GPUs have GDDR5 and that is primarily what allows them to dominate in so many applications. Many of them are primarily memory-bound and not computation-bound. This means that the super-fast GDDR memory and the algorithms which can do a predictable linear walk through memory get an enormous speed boost over almost anything else out there.
> But by then I suspect ASICs (including one from NVIDIA) will be the new kings.
Yes, I suspect Nvidia has been developing/prototyping a Deep Learning ASIC for some time now. The power savings from an ASIC (particularly for inference) are just too massive to ignore.
Nvidia also seems to be involved in an inference only accelerator from Stanford called EIE (excellent paper here - https://arxiv.org/pdf/1602.01528v2.pdf).
Oh, these are entirely different approaches. New instructions are something I can use immediately (well, after I get the CPU). I know how to feed data to them, I can debug them, they are predictable, my tools support them, overall — I can make good use of them from day one.
As for FPGAs, anybody who actually used one, especially for more than a toy problem, will tell you that the experience is nowhere near that. Plus, at the moment, we are comparing a solution that could possibly perhaps materialise with something that will simply appear as part of the next CPU.
FPGAs sound amazing, but if you work with them you learn they can be a real PITA. The vision of the dynamically shapeshifting coprocessor FPGA is a long way in the future.
This is my understanding from CS. CPUs are designed to be general purpose which unfortunately also means that they are slower and less efficient than dedicated logic circuits.
The implication of FPGAs is efficient and high speed hardware-coded algorithms which can be reprogrammed.
It's almost like you're refactoring the hardware using software, giving you the ability to upgrade a chip's capabilities over time.
You could program any algorithm as a set of logic gates in hardware. Hardware can run completely in parallel so this allows insane performance gains for some tasks. FPGA are very close to ASIC (regular chips, like the CPU itself) in their performance, but are completely re-programmable. Some FPGAs I believe can be even re-programmed on the fly by the FPGA itself.
ASICs often are faster than more general FPGAs, and FPGAs use a lot more energy than a comparable ASIC. But the reprogrammable nature of FPGAs is an incredible advantage.
And there are even some phones that have FPGAs integrated nowadays, so there are lower power options (Altera) which would be appropriate for mobiles/laptops.
> On Intel's part, it seems kinda... late? It's like adding Bitcoin instructions when you already know everyone's racing to make Bitcoin ASICs.
Nah. You're thinking they have to be the best, that they have to win. That is true when the choice is buying an Intel CPU vs buying a GPU. But most developers will already have a late-model Intel CPU around.
For Intel, this is a defensive move. They're bending the performance curve for particular use cases. All this has to do is to keep a certain number of developers from bothering to go the extra mile to use special hardware and software. People working at Google's scale won't care. But there will be other people who try running something and say, "Current hardware's fine. We could do better with GPUs, but not so much better that it's worth the time and money."
I'm sure it's also the start of a learning cycle for them. Some people told them their current processors were not enough, so they added this. They'll now hear why this isn't adequate either and adapt again if the business case is there.
Not exactly. AVX512 is used on the Xeon Phi, which is Intel's floating-point co-processor for deep learning / OpenCL tasks. But that being said, GPUs blow the Xeon Phi out of the water in terms of FLOPS, local RAM, and RAM bandwidth.
Isn't Phi better at dealing with branch-heavy code? GPUs absolutely kill on straight-forward compute, but they seem to flail helplessly if the logic is highly conditional.
I've read that in some cases GPUs evaluate both branches of a condition and discard the unwanted results. A CPU doesn't do this.
>>Isn't Phi better at dealing with branch-heavy code?
This may be true. Generally, code that is infinitely parallelizable so it can be run on a GPU isn't going to be branch heavy. There are no benchmarks to support this, but I would believe it is true. GPUs typically have 1 branching unit per 64-128 threads, while the Phi has 1 branching unit per 2 threads.
The real difference is that GPUs use SIMT, while the Phi uses MIMD. Each Phi thread is a POSIX thread. It can do its own thing, live its own life. With GPUs you just specify how parallel your work-group, thread block, warp, or wave front is (the name depends on the platform: in order, OpenCL, CUDA, Nvidia DX12, AMD DX12).
>>I've read that in some cases GPUs evaluate both branches of a condition and discard the unwanted results. A CPU doesn't do this.
If you mean branch prediction, then yes, it will end up inadvertently executing the wrong instructions and discard them, but it halts execution of the wrong branch at the first opportunity and reschedules to correct its mistake.
I know very little of GPU architectures, but I thought that the last few generations of GPUs were straightforward SIMD (i.e. all lanes are run in lockstep and divergence is handled at a higher level).
It's disingenuous to use a highly charged term like "blogspam" and then qualify it as "kinda". Kettle/black. Why say "blogspam" if you don't mean it?
And while I agree that the OP is not in-depth, it is not blogspam. AVX512 is a very real, very very real, order-of-magnitude performance improvement. To give you an idea, going from standard 64-bit to AVX-256 with MKL Numpy (under ipython) you will go from 6-7 seconds to under 1 second, on an identical 2012+ processor, on this code:
import numpy as np
np.random.seed(1234)
xx = np.random.rand(1000000).reshape(1000, 1000)
%timeit np.linalg.eig(xx)
The best way to test this is to run the above on bog-standard Ubuntu 16.04 (sudo apt-get install python(3)-pip / sudo pip(3) install numpy first), and then run it again after installing Anaconda Python.
AVX512 will halve this timing again, and before you tell me that Cuda will do this much faster, I'll remind you that there is a point at which 4000 cuda stream processors operating on 64-bit data can, in many algos, be a vastly too large window. Phi KL's 72 branching cores, with 4 threads and 2x512-bit AVX each, can be a very strong competitor to the flatspace GPUs in nonlinear algos. There is clear blue sky between the scalar architectures and the massively parallel architectures, which has not been targeted yet, and this is where KL fits in. After all, the GPUs are still, first and foremost, designed to rotate and translate massive amounts of triangles in 3d shooters, a task for which a vast number of very dumb little cores is optimal. As much as Nvidia would like you to believe different, that is still their primary market and it doesn't always map well to everything.
Between the current extremes of a small number of very flexible and clever cores (Xeon/i7), and 4000 very stupid but numerous cores (GPU), there is a lot of room for something in between.
Intel isn't so much late to the party as attacking an obliquely different segment of the market, which may prove very interesting indeed.
I am not so sure if it is "late". Sure, for training models it is far too late, we have semi-dedicated hardware for that at this point. But it is possible that it will become increasingly common to run these models on end-user systems and in that case these instructions could provide a bit of a speed boost.
I am not familiar with how deep learning algorithms would benefit from these instructions in a way different from how they can just use vector instructions optimized for graphics processing. So my question is, how are these "Vector instructions for deep learning enhanced word variable precision" different from other vector instruction sets?
Good question. It doesn't say! The PDF has absolutely no detail on what it actually does. Googling the instruction brings up people updating CPUID fields, but nothing to do with actually using it.
I'm guessing it means you can choose the precision (word length) of the computations, eg 8 bits, 16 bits, 32 bits, etc. This probably only applies to integer ops, but if they could let us pack FP16 operands into 512 bit vectors, that would be awesome!
I hope they're going even farther and considering minifloats. 8-bit integers are easy but limiting and unexciting. 8-bit floats are exactly what some machine learning needs.
If you're updating your weights enough, precision can just be wasted computation. A 512-bit-wide computation could be updating 64 very coarse-grained weights at a time.
Not exactly late. There's no physical law that says a CPU can't beat a GPU for a certain family of operations, and in fact in practice, small toy networks with non-dense operations (for instance, if you are doing research on optimization algorithms for DNNs or Bayesian inference algorithms) tend to run faster on the CPU than GPU.
As the other poster mentions, I assume their view is that these types of operations will become increasingly prevalent in consumer software applications. Lots will be powered by the cloud, but it would be useful if some inference can be done on a laptop CPU.
"The cloud" still has to run the models on something, and as long as Intel roughly keeps up with the cost of specialised hardware for doing inference, many people will just keep doing inference on Intel CPUs, rather than bothering with specialised hardware.
That's true... I'd been assuming they'd use specialized hardware for inference when writing that, but totally failed to think how cost may make a CPU more desirable.
Apologies if I'm dumb, but did Intel actually tell exactly what operations these "deep learning" instructions do?
I skimmed through the linked Intel manual, but it seems that it just defines two instruction family names (AVX512_4VNNIW and AVX512_4FMAPS) without actually saying what they do.
* It almost feels like a marketing term, in the same way everything was "multimedia instructions" back in the 90s.
It's been a long time since I look at machine language but, if I understand correctly, the difference is the capacity of the operation. Instead of performing an addition operation on 1 or 2 bytes at a time the operation can be performed on 64 bytes at a time. Given that 'deep learning' requires a number of parallel simple operations expanding the volume of parallel operations in a single CPU instruction boost capacity.
Seriously, just supposition until someone with better understanding chimes in.
Really curious how Intel (others?) determined that these new instructions are for deep learning. The instructions are for permutation, packed multiply+low/high accumulate, and a sort of masked byte move. Are these super common in deep learning?
Deep learning is all about GEMM operations: matrix-matrix, or matrix-vector multiplications. And the values in these matrices are typically low precision (16 or even 8 bits is enough). So if you can pack many of such low precision values into vectors, and perform a dot product in parallel, you will get a pretty much linear speedup.
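As a concrete illustration of that kind of packed low-precision multiply-accumulate (plain AVX2, not the new AVX512 extensions; the function name dot_i16 and the layout are just assumptions):

#include <immintrin.h>
#include <stdint.h>

/* 16-bit integer dot product: vpmaddwd multiplies 16-bit lanes and adds
 * adjacent products into 32-bit lanes. n is assumed to be a multiple of 16. */
int dot_i16(const int16_t *a, const int16_t *b, int n) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    int32_t tmp[8];
    _mm256_storeu_si256((__m256i *)tmp, acc);
    int sum = 0;
    for (int i = 0; i < 8; i++) sum += tmp[i]; /* horizontal sum */
    return sum;
}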
Cool stuff, but unfortunately Intel has really delayed AVX512 instructions for their main consumer processors (ffs, it was supposed to hit on Skylake). It looks like we have another die shrink to go after Kaby Lake before we get that sweet ultra-wide SIMD:
I did tests with Torch7 a few days ago to compare the OpenBLAS library on multiple cores against CUDNN.
Single core difference was around 100x.
Using multiple cores improved the situation but there still was about 15x difference that did not improve by adding more cores (I tested with up to Xeon 64 cores on a single machine).
I did not test with Intel MKL library as I did not have time for this.
I wonder how much these instructions would improve the situation, and whether anybody here has experience with the Intel MKL library.
Not holding my breath, since they haven't gotten AVX512 into desktops yet! (Kaby Lake might be worth something if they did... but nope, just more lie7's mostly...)
How much will this help? Neural networks work so well on distributed computing systems because the operations run in parallel. With a neural network, you can leverage tens of thousands of separate computing cores. Top intel processors have around a dozen cores or so.
Let's say you want to run a service that does scene classification in images using deep convolutional networks. It takes 100ms on an Intel chip today, and 3ms on a GPU. But you can't run your entire service on a GPU...you're going to need that CPU anyway. So if Intel can close the gap even a little bit (say, down to 25ms), it might be worth it to avoid the headache of running a GPU server and just buy more Intel chips, which can run anything.
There may be a market sweet spot between dedicated high-performance machine learning that's ideal for experts who are willing to buy/lease dedicated hardware, and regular developers who want to use simpler machine learning toolkits to solve easier problems. Toolsets like Julia, NumPy, R, etc. would take advantage of the new instructions and basically give your average developer a speed boost for simply having an Intel PC.
This reminds me when you had to buy crypto accelerator cards for your webservers (does anyone remember Soekris?)...
The market stalled after Intel started adding support for crypto operations on-die.
This looks like Intel protecting its server chip business by shooting at Nvidia before they start selling the Titan-X with some other server chip (what chip - I don't know; but I bet there's a business plan on a spreadsheet somewhere).
I guess training neural nets will be a GPU / custom chip task for some time; but as soon as you have developed the classifier / predictor you have to run it on whatever hardware you have, and that means you might need less parallel computing power and it doesn't make (financial) sense buying GPUs for that.
Xeon Phi have got more cores, 50 or 60, and the cores share memory. Unlike GPUs memory space is uniform, there is no need to do explicit transfers to/from GPU's memory.
> Xeon Phi have got more cores, 50 or 60, and the cores share memory
60 cores. Furthermore, the cores have multiple (4) hardware threads per core. So on the latest Phi (Knights Landing) you can have several hundred threads (240) running concurrently without contention (in principle).
> Intel is hoping that the Xeon+FPGA package is enticing enough to convince enterprises to stick with x86, rather than moving over to a competing architecture (such as Nvidia’s Tesla GPGPUs ).
Then let it prove its performance on a few deep learning benchmarks. If it is so fast and accessible, we'll be impressed.
On the Knights Landing CPU a single core is able to perform 2 x 16 operations on single-precision numbers, as a core has two AVX512 execution units. Each core can run 4 threads, and KNL has 64 physical cores.
This. Also, trying to slow the loss of customers purchasing what are essentially Nvidia GPU-heavy systems with Nvidia CPUs tacked on to act as the controllers.
TL;DR: You can basically do 8 operations at once (but of course, it's not that simple).
Think about a normal 64-bit register: you can add and subtract 64-bit numbers, OR and AND them, etc. Now think about a 64-byte register. What could you do with that? Well, suppose you're looping through contiguous memory and changing each word AND the change that your making to each word is independent of the change you make to any other word. With a normal 64-bit register you'd have to do one operation on each word. But with a 64-byte "register", you can load 8 words into it and do just one operation for the same effect as applying that operation 8 times (once to each word). Thus, some code---vectorizable code---can be sped up nearly 8x.
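A tiny sketch of the same idea with AVX512F intrinsics (illustrative only; the arrays and the function name add_arrays are assumptions, and n is assumed to be a multiple of 8):

#include <immintrin.h>

/* One instruction adds eight 64-bit doubles at a time. */
void add_arrays(const double *a, const double *b, double *out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m512d va = _mm512_loadu_pd(a + i);
        __m512d vb = _mm512_loadu_pd(b + i);
        _mm512_storeu_pd(out + i, _mm512_add_pd(va, vb));
    }
}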