That's good news! About that, two days ago I modified the implementation of the neural networks inside Neural Redis in order to use AVX2. It was a pretty interesting experience and in the end, after modifying most of it, the implementation is 2x faster compared to the vanilla C implementation (which was already optimized to be cache oblivious).
I never touched AVX or SSE in the past, so this was a great learning experience. In 30 minutes you can get 90% of it, but I think that to really do great stuff you need to also understand the relative cost of every AVX operation. There is an incredibly helpful user on Stack Overflow who replies to most AVX / SSE questions; if you check the AVX / SSE tags you'll find them easily.
However I noticed that when there were many load/store operations to do, there was no particular gain. See for example this code:
#ifdef USE_AVX
    /* Broadcast the scalar error signal to all 8 lanes. */
    __m256 es = _mm256_set1_ps(error_signal);
    int psteps = prevunits/8;
    for (int x = 0; x < psteps; x++) {
        __m256 outputs = _mm256_loadu_ps(o);           /* load 8 outputs */
        __m256 gradients = _mm256_mul_ps(es,outputs);  /* scale by error signal */
        _mm256_storeu_ps(g,gradients);                 /* store 8 gradients */
        o += 8;
        g += 8;
    }
    /* Advance k past the elements handled above; the scalar loop picks up the remainder. */
    k += 8*psteps;
#endif
What I do here is to calculate the gradient after I computed the error signal (error * derivative of the activation function). The code is equivalent to:
for (; k < prevunits; k++) *g++ = error_signal*(*o++);
Any hints about exploiting AVX to its fullest in this use case? Thanks. OK, probably this was more of a thing for Stack Overflow, but too late, hitting enter.
You're really just implementing a vector*matrix multiply. You probably want to just use BLAS's sgemv routine. On macOS, link Accelerate and use cblas_sgemv(); on Linux consider installing Intel MKL or OpenBLAS.
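For reference, here is a minimal sketch of what such a call could look like, assuming a row-major weights matrix; the function and variable names (forward_sgemv, W, x, y, rows, cols) are placeholders, not names from Neural Redis:

#include <cblas.h>

/* Sketch only: y = W * x for a row-major rows x cols matrix W. */
void forward_sgemv(const float *W, const float *x, float *y,
                   int rows, int cols) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans,
                rows, cols,
                1.0f,        /* alpha */
                W, cols,     /* matrix and its leading dimension */
                x, 1,        /* input vector and stride */
                0.0f,        /* beta = 0: overwrite y */
                y, 1);       /* output vector and stride */
}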
If you're just looking to learn what a reasonably state-of-the-art SGEMV kernel looks like for a modern chip like Haswell, check out OpenBLAS's code:
That sounds like great advice, I'll check out this library. I have the feeling that the non-aligned addresses in the current weights scheme will be a problem and padding will be required. That's why I used the AVX "loadu" variant that deals with non-aligned addresses, but perhaps I'm paying a big performance penalty because of it. Thanks.
EDIT: apparently on modern CPUs that's not the case and magically LOADU can be as fast as LOAD.
Don't do unaligned memory access, whatever your cpu flags say.
Another thing you can do, if you don't plan on using the results immediately, is to use a non-temporal store (movntps in SSE). If you do plan to use the results right away, then just use them without storing to main memory.
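As a rough sketch of what that looks like with AVX (not code from Neural Redis; it assumes g is 32-byte aligned, since the streaming store requires it, and n a multiple of 8):

#include <immintrin.h>

/* Same gradient loop as above, but with a non-temporal store that
 * bypasses the cache. Only worthwhile if the gradients are not read
 * again soon. */
void scale_stream(const float *o, float *g, float error_signal, int n) {
    __m256 es = _mm256_set1_ps(error_signal);
    for (int x = 0; x < n; x += 8) {
        __m256 outputs = _mm256_loadu_ps(o + x);
        _mm256_stream_ps(g + x, _mm256_mul_ps(es, outputs));
    }
    _mm_sfence(); /* make the streamed stores globally visible */
}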
Does anybody have an example of good performance of unaligned memory access on modern cpus? And note that what matters isn't whether the cpu supports AVX, but whether it has a flag that says it can do fast unaligned memory access (i don't remember, is it misalignsse?).
Common sense says that unaligned access can't be faster than aligned. And if you have data that fits into a ymm register, then you might as well use aligned access (a neural network is usually an example of such).
I did test it a while ago. Problem is that i don't remember if it was on this, modern, cpu or the older one. I could test again if i cared enough about other people's opinions, but alas i don't (the only usage of unaligned AVX access i've found has been from newbies to SIMD). An example of the kind you request would be glibc memcpy, which uses ssse3 [0] so that it can always get aligned access (ssse3 has per-byte operations).
In other words, how about the people who claim that operations that do extra work are as fast as the ones that don't actually prove it? Instead of the burden of proof falling on people who don't share that opinion/experience? Then i will bow my head and say "You are right. Thank you for pointing that out". But alas, after googling for 10min i have found no such benchmark anywhere. And writing such a test isn't hard, not in the slightest.
I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.
I linked elsewhere in the thread to my more detailed experiments regarding unaligned vector access on Haswell and Skylake: http://www.agner.org/optimize/blog/read.php?i=415#423. This is the source of my conclusion that alignment is not a significant factor when reading from L3 or memory, but does matter when attempting multiple reads per cycle from L1.
Both of these link to code that can be run for further tests. If you find an example of an unaligned access that is significantly slower than an aligned one on a recent processor (and they certainly may exist), I'll nudge Daniel into writing an update to his blog post.
>I tend to the opposite view: those saying "do not do X" are in fact obligated to explain why X should be avoided. But perhaps this is just a difference in worldview.
For me it depends on the context. Here aligned access makes more sense so unaligned should be defended.
I hacked together a test, feel free to point out mistakes.
unaligned on one byte unaligned data: 0 sec, 70278354 nsec
unaligned on three bytes unaligned data: 0 sec, 70315162 nsec
aligned nontemporal: 0 sec, 42549571 nsec
naive: 0 sec, 67741031 nsec
Repeating the test only shows non-temporal to be of benefit. The difference of, on average, 1-2% is not much, that i yield. But it is measurable.
But that is not all! Changing the copy size to something that fits in the cache (1MB) showed completely different results.
aligned: 0 sec, 160536 nsec
unaligned on aligned data: 0 sec, 179999 nsec
unaligned on one byte unaligned data: 0 sec, 375108 nsec
aligned nontemporal: 0 sec, 374811 nsec // usually a bit slower than one byte unaligned
And, out of interest, i made all the copies skip every second 16 bytes; (relative) results are the same as the original test except non-temporal being over 3x slower than anything else.
And this is on a amd fx8320 that has the misalignsse flag. On my former cpu (can't remember if it was the celeron or the amd 3800+) the results were very much in favor of aligned access.
So yea, align things. It's not hard to just add " __attribute__ ((aligned (16))) " (for gcc, idk anything else).
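For example, a minimal sketch of the two usual ways to get 32-byte alignment for AVX, assuming gcc/clang and C11 (the names weights and make_buffer are just illustrative):

#include <stdlib.h>

/* Static array, aligned via the gcc/clang attribute. */
static float weights[1024] __attribute__ ((aligned (32)));

/* Heap buffer via C11 aligned_alloc; the size passed must be a
 * multiple of the alignment, so round it up. */
float *make_buffer(size_t n) {
    size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return aligned_alloc(32, bytes);
}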
PS It may seem like the naive way is good, but memcpy is a bit more complicated than that.
See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024. I think what you observed is the result of loads and stores hitting the same cache set at the same time, all while misalignment additionally increases the number of cache banks involved in any given operation. But that's just hand-waving, I don't know the internals enough to say with confidence what's going on exactly.
BTW, changing misalignment from 1 to 8 reduces this effect by half on my Thuban. Which is important, because nobody sane would misalign an array of doubles by 1 byte, while processing part of an array starting somewhere in the middle is a real thing.
Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:
>See what happens when you change HALF_OF_BUFFER_SIZE from 1M to 1M+64. Or 128 or 1024.
Tested. There's a greater difference between aligned and aligned_unaligned. But that made the test go over my cache size (2MB per core), so i tested with 512kB with and without your +128. Results were (relatively) similar to the original 1MB test.
>Which is important, because nobody sane would misalign an array of doubles by 1 byte [...]
Adobe flash would, for starters (idk if doubles, but it calls unaligned memcpy all the time). The code from the person above also does, because compilers sometimes do (an aligned mov sometimes segfaults if you don't tell the compiler to align an array, especially if it's in a struct).
>Also, your assembly isn't really that great. In particular, LOOP is microcoded and sucks on AMD. I got better results with this:
Of course you did, you unrolled the loop. The whole point was to test memory access, not to write a fast copy function.
>c_is_faster_than_asm_a()
First of all, that is not in the C specification. It is a gcc/clang/idk_if_others extension to C. It compiles to something similar to what I would write if i had unrolled the loop. Actually worse, here's what it compiled to http://pastebin.com/yL31spR2 . Note that this is still a lot slower than movntps when going over cache size.
edit: I didn't notice at first. Your code copies 8 16-byte chunks to the first one. You forgot to add +n to dst.
NNPACK (https://github.com/Maratyszcza/NNPACK) is also a pretty reasonable AVX2 GEMM implementation, written in a Python-based assembler; it's a bit easier to follow than the OpenBLAS kernels and has very reasonable performance vs OpenBLAS/MKL.
I second the guidance to just use a BLAS library. I'd guess you'd see a 2-10x speedup at least. It looks like you're only implementing a classical MLP library (which is fine for learning, but not the best choice for something like this; integrating VW into Redis would be a much better choice IMO).
This should be a very fast loop. If the input and output are larger than cache, you should be limited by RAM bandwidth. And if your outputs and gradients fit in L1 cache, it should be possible to get your loop down to single cycle per iteration: load, multiply, store, increment, and test can all execute in the same cycle. It will probably be difficult to convince a compiler to produce the necessary code, though, and if you do, it will likely be specific to that particular compiler and flags.
If you want to make this as fast as it can be, the first step is to measure what you do have, and see how fast it is in terms of cycles per iteration. Then you'll want to look at the generated assembly, and what the compiler has done to your carefully crafted loop. The better you've optimized it to begin with, the more likely that it will have done something to make it worse. If you want guaranteed performance on your target platform, you need to lock down the assembly.
The fastest loop structure here will involve using a single negative offset, so that you can increment it by 8 and test if you have reached zero. This allows the addition and loop test to "fuse" into a single µop (ADD-JNZ). The loads will be (end_outputs + neg_offset), and the stores (end_gradients + neg_offset). Unrolling should not be necessary if you can get the right assembly.
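A rough C sketch of that structure, reusing the names from the snippet above and assuming prevunits is a multiple of 8 (the generated assembly still needs to be checked to confirm the compiler keeps the add/jnz form):

/* Index from the end of both arrays with a single negative offset that
 * counts up toward zero. */
__m256 es = _mm256_set1_ps(error_signal);
const float *o_end = o + prevunits;
float *g_end = g + prevunits;
for (long i = -(long)prevunits; i != 0; i += 8) {
    __m256 outputs = _mm256_loadu_ps(o_end + i);
    _mm256_storeu_ps(g_end + i, _mm256_mul_ps(es, outputs));
}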
In the absence of making it "perfect", though, you may benefit by unrolling it a few times, although your compiler is probably already doing this. You might try to encourage the compiler to generate a single "load-multiply" instruction by using a dereferenced pointer in the multiply. The processor is limited to decoding 4 instructions per cycle, and depending on what the compiler produces, saving an instruction per iteration here might avoid this bottleneck.
And as someone else mentioned, it will help if your inputs and outputs are 32B aligned. In particular, stores that cross 4KB boundaries can be expensive enough to slow down the whole algorithm (although I think this has gotten better on Skylake). It shouldn't be a big difference, though, unless everything else is extremely optimized. I explore it here: http://www.agner.org/optimize/blog/read.php?i=415#419
Thank you for the interesting advice. Unfortunately no 32B alignment. The start address of the weights vectors is aligned, but the problem is that as I jump from here to there to get the weights, the alignment is broken. I bet that if I analyze the way I loop through the weights, I can figure out a way to make this whole thing a linear aligned access. Probably that's my starting point! Thank you. I'll make sure to try different cycle counts and unrolling; I don't want to reach the level of writing the assembler, but to get at least a good speedup where possible.
About RAM bandwidth, basically it depends a lot on the neural network size: for small/medium networks it's all inside L1, for very large ones it's not, and of course the bigger they are, the slower they are to train and the more interesting it is to get the speedup. Thanks.
I would try unrolling 2-4 iterations of the loop. Multiple sequential loads isn't much slower than a single load, so batching your loads and stores together will let you do more arithmetic operations for each time you hit memory.
It might help to try and squeeze more work between the load and the store. E.g. do two of these in parallel (load/load/mul/mul/store/store). Think about latency vs. throughput: the loads and stores may have high latency but also high throughput, so you need to be issuing more of them concurrently (without any dependencies between them). Just off the top of my head.
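Something like this, reusing the names from the original snippet and assuming psteps is even (just a sketch of the interleaving, not measured code):

for (int x = 0; x < psteps; x += 2) {
    /* Two independent load/mul/store chains in flight per iteration. */
    __m256 a = _mm256_loadu_ps(o);
    __m256 b = _mm256_loadu_ps(o + 8);
    __m256 ga = _mm256_mul_ps(es, a);
    __m256 gb = _mm256_mul_ps(es, b);
    _mm256_storeu_ps(g, ga);
    _mm256_storeu_ps(g + 8, gb);
    o += 16;
    g += 16;
}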
Actually, in most cases instruction ordering isn't really a factor on modern Intel processors. They operate "out of order" (often abbreviated OoO), which means that they execute instructions (µops really) from a "reorder buffer" as soon as dependencies are satisfied. Since the loads have no dependencies, they will already execute several (possibly many) iterations ahead of the multiplies and stores.
The latency of the store doesn't matter to us, since we aren't immediately using the result. This doesn't mean that you won't see a benefit from your suggestion, but if you do it will probably be from the reduced loop overhead rather than the improved concurrency.
I'm aware of out of order execution but my experience is that at least with SSE/AVX instruction ordering does matter... The loads do have a dependency on the address for example. Anyways, some experimentation will help, even loop unrolling doesn't always behave the way you'd think.
> The loads do have a dependency on the address for example.
Sort of. They depend on the address, but there is nothing that prevents the address from being calculated ahead of time. So what happens is that both the loads and the addition get executed multiple iterations ahead of the multiplication. While we tend to think of one iteration completing before the next begins, from the processor's point of view it's just a continuous stream of instructions.
Loops (especially FP loops) are often limited by dependency chains, which limits what OoO execution can do. Unrolling (and using multiple accumulators) helps create multiple independent chains that can be executed in parallel.
Edit: for this specific loop the only dependency is on the iteration variables, which is not an issue here, as the loop should only be limited by load/store bandwidth, assuming proper scheduling and induction variable elimination from the compiler.
I'm not the most experienced, so I may be talking out of my ass.
I imagine that what you're hitting is IO. Getting stuff from lower levels of the memory hierarchy is slow. To benefit from vector ops, you want your operands to already be in registers. The question is whether you're hitting the bandwidth limit of that operation, basically. If not, I wonder if there is something that helps you help the hardware prefetcher.
You might also want to try and see if unrolling the loop helps, gets rid of the iteration conditionals, but I guess you're adding more instructions, so perhaps more instruction cache misses.
Also, are the pointers to where you're grabbing the block of vectors aligned? Looks like it could be. Try something like _mm256_load_ps perhaps.
The problem I have is exactly the lack of alignment... unfortunately, while the vectors themselves are aligned, the way I need to jump inside the vectors is not... Padding would complicate the implementation quite a bit, so I was not sure whether to implement it or not, and I'm also not sure about the effects on the cache... I wonder what the penalty of loadu vs load is, maybe that's the problem indeed. Thanks.
> However I noticed that when there were many load/store operations to do, there was no particular gain. See for example this code:
You may be running out of cache/memory bandwidth, compare your throughput with numbers from benchmarks or "Software Optimization Manual" of your CPU. If you are on Linux, try
perf stat -d ./whatever
which reports, among other things, the rate of cache and memory hits, obtained from hardware "performance counters".
> _mm256_loadu_ps
Some years ago and with SSE, unaligned was slower than aligned. On some chips only if the address happened to really be unaligned, on others in every case. Not sure how it is with AVX.
On recent processors (last few years), as long as you're not crossing a cache line, unaligned is essentially the same as aligned in terms of performance.
Oh! That's super interesting, so now I know there is no point in trying to improve the alignment too much. I'll still try to see if I can make the whole operation a linear scan, but if there is no big penalty, adding padding among the weights looks like a bad idea.
> That's super interesting, so I know there is no point in trying to improve the alignment too much.
Typical x86 cache line size is 64B, AVX registers are 32B. When your o pointer isn't 32B-aligned, you are loading across a cache line boundary every second time. If nkurz is right and this loop can run 1 iteration per cycle, misalignment probably slows it down to 2 iterations per 3 cycles. Benchmark this loop if you want to know.
Of course it only matters if the array is still in L1, maybe L2 (?).
Blogspam, kinda. Post just links to https://software.intel.com/sites/default/files/managed/69/78... and says it mentions two new instructions: AVX512_4VNNIW (Vector instructions for deep learning enhanced word variable precision) and AVX512_4FMAPS (Vector instructions for deep learning floating-point single precision)
On Intel's part, it seems kinda... late? It's like adding Bitcoin instructions when you already know everyone's racing to make Bitcoin ASICs. How could it beat dedicated hardware, or even GPUs, on ops/watt? Maybe it's intended for inference, not training, but that doesn't sound compelling either.
This aspect of their new chips is massively underrated. An FPGA is the future-proof solution here, not chip-level instructions for the soup-du-jour in machine learning.
Edit: which is not to say that I'm not welcoming the new instructions with open arms...
I'm not as hyped about FPGA-in-CPU so much as I am of having Intel release a specification for their FPGAs that will allow development of third-party tools to program them.
Right now the various vendors seem to insist on their own proprietary everything which makes it hard to streamline your development toolchain. Many of the tools I've used are inseparably linked to a Windows-only GUI application.
Your lab has some interesting publications. Doing good work. Appreciate you taking time to try to solve the FPGA problem. Here's a few related works in case you want to build on or integrate with them:
I'm not too familiar with FPGAs, but isn't the tradeoff that since they are flexible they are usually much slower than CPUs/GPUs and it is usually used to prototype an ASIC? How is FPGA-in-CPU going to be a good thing?
They're slower in terms of clock speed, but they're not slower in terms of results.
You can do things in an FPGA that a CPU can't even touch, it can be configured to do massively parallel computations for example.
If Bitcoin is any example, GPU is faster than CPU, FPGA is faster than GPU, and ASIC is faster than FPGA. Each one is at least an order of magnitude faster than the previous one.
A GPU can do thousands of calculations in parallel, but an FPGA can do even more if you have enough gates to support it.
I haven't looked too closely at the SHA256 implementations for Bitcoin, but it is possible to not only do a lot of calculations in parallel, but also have them pipelined so each round of your logic is actually executing simultaneously on different data rather than sequentially.
One of the biggest caveats of FPGAs that I don't see mentioned often is that they're slow to program. This implies some unusual restrictions about where they can be used. I.e. data centers will benefit, general purpose computing not so much.
Well, normally what you lose in terms of clock speed when using an FPGA you make up in being able to establish hardware dataflows for increased parallelism. But I don't have a sense for whether deep learning problems are amenable to that sort of thing.
It will only really take off if they can get the user experience of getting a HDL program to the FPGA to the same level as getting a shader program to the GPU. Unless they can do all that in microcode though it's going to force them to open up some stacks of Altera's closed processes. I'm hopeful but there's a lot of proprietary baggage around FPGAs that I think have kept them from truly reaching their potential.
Almost all FPGAs have entirely proprietary toolchains, from Verilog/HDL synthesis all the way down to bitstream loaders for programming your board. These tools are traditionally very poorly built as applications, scriptable only with things like TCL, terrible error messages, godawful user interfaces designed in 1999, etc etc.[1] This makes integrating with them a bit more "interesting" for doing things like build system integration, etc.
Furthermore, FPGAs are inherently much more expensive for things like reprogramming, than just recompiling your application. Synthesis and place-and-route can take excessive amounts of time even for some kinds of small designs. (And for larger designs, you tend to do post-synthesis edits on the resulting gate netlists to fix problems if you can, in order to avoid the horrid turn around time of hours or days for your tools). The "flow" of debugging, synthesis, testing etc is all fundamentally different and has a substantially different kind of cost model.
This means many of the "dream" ideas -- on-the-fly reconfiguration, "Just-in-Time"-ing your logic cells based on things like hot, active datapaths -- are a bit difficult for all but the simplest designs, due to the simple overhead of the toolchain and design phase. That said, some simple designs are still incredibly useful!
What's more likely with these systems is you'll use, or develop, pre-baked sets of bitstreams to add in custom functionality for customers or yourselves on-demand, e.g. push out a whole fleet of Xeons, configure the cells of machines on the edge to do hardware VPN offload, configure backend machines that will run your databases with a different logic that your database engine can use to accelerate queries, etc.
Whether or not that will actually pan out remains to be seen, IMO. Large enterprises can definitely get over the hurdles of shitty EDA tools and very long timescales (days) for things like synthesis, but small businesses that may still be in a datacenter will probably want to stick with commercial hardware accelerators.
Note that I'm pretty new to FPGA-land, but this is my basic understanding and impressions, given some time with the tools. I do think FPGAs have not really hit their true potential, and I do think the shit tooling -- and frankly god awful learning materials -- has a bit to do with it, which drives away many people who could learn.
On the other hand: FPGAs and hardware design is so fundamentally different for most software programmers, there are simply things that, I think, cannot so easily be abstracted over at all -- and you have to accept at some level that the design, costs, and thought process is just different.
[1] The only actual open-source Verilog->Upload-to-board flow is IceStorm, here: http://www.clifford.at/icestorm/. This is only specifically for the iCE40 Lattice FPGAs; they are incredibly primitive and very puny compared to any modern Altera or Xilinx chip (only 8,000 LUTs!), but they can drive real designs, and it's all open source, at least.
Just as an example, I've been learning using IceStorm, purely because it's so much less shitty. When I tried Lattice's iCE40 software they provided, I had to immediately fix 2 bash scripts that couldn't detect my 4.x version kernel was "Linux", and THEN sed /bin/sh to /bin/bash in 20 more scripts __which used bash features__, just so that the Verilog synthesis tool they shipped actually worked. What the fuck.
Oh! Also, this was after I had to create a fake `eth0` device with a juked MAC address (an exact MAC being required for the licensing software), because systemd has different device names for ethernet interfaces based on the device (TBF, for a decent reason - hotplugging/OOO device attachment means you really want a deterministic naming scheme based on the card), but their tool can't handle that, it only expects eth0.
As a software programmer, my mind is blown by the idea of a 3GHz CPU. That's 3,000,000,000 cycles per second. The speed of light c = 299,792,458 meters per second. Which means that, in the time it takes to execute 1 cycle, 1/3,000,000,000 seconds -- light has moved approximately 0.0999m ~= 4 inches. Unbelievable. Computers are fast!
On the other hand... Once you begin to deal with FPGAs, you often run into "timing constraints" where your design cannot physically be mapped onto a board, because it would take too long for a clock signal (read: light) to traverse the silicon, along the path of the gates chosen by the tool. You figure this out after your tool took a long amount of time to synthesize. You suddenly wish the speed of light wasn't so goddamn slow.
In one aspect of life, the speed of light, and computers, are very fast, and I hate inefficiency. In another, the speed of light is just way too slow for me sometimes, and it's irritating beyond belief because my design would just look nicer if it weren't for those timing requirements requiring extra work.
How to reconcile this kind of bizarre self-realization is an exercise left to the reader.
It's not the speed of light that's the long pole in the timing constraint, it's the "switching" time of the circuits that make up the digital logic. You can't switch from zero to one or one to zero instantaneously. It's modeled (or at least it was when I went to college) as a simple RC circuit. Your clock can't tick any faster than the time it takes to charge the slowest capacitor in your system. That's why chips have gotten faster and faster over the years, we've reduced that charge/discharge time (not because we've sped up light).
"pre-baked sets of bitstreams" is a good way to describe what I expect to be the most likely scenario. In other words, spiritually very similar to ordinary configurable hardware (PCIe cards, CPUs, DIMMs, USB devices, etc)
Don't forget the niche nature of an FPGA, too. They are incredibly slow & power hungry compared to a dedicated CPU, so similar to the CPU/GPU split you find that only particular workloads are suitable.
Your comment represents some of the best of HN (detailed, illuminating, informative), but is incredibly depressing for someone with an idle curiosity in FPGAs. This is what I've long suspected, and it seems that the barrier to entry is generally a bit too high for "software" people.
I do think there's some light at the end of the tunnel. IceStorm is a thousand times better than any of the existing EDA tools if you just want to learn, and has effectively done a full reverse engineering of the iCE40 series. It only has a Verilog frontend, but it's a great starting point.
You can get an iCE40 FPGA for ~$50 USD, or as low as $20. It'll take you probably 30 minutes to compile all the tools and install them (you'll spend at least 2x that much time just fucking with stupid, traditional EDA tools, trying to make sense of them, and you'll still do that more than once probably), and you're ready.
The learning material stuff is much more difficult... Frankly, just about every Verilog tutorial out there, IMO, totally sucks. And Verilog is its own special case of "terrible language" aside from that, which will warp your mind.
Did you just use an undeclared wire (literally the equivalent of using an undeclared variable in any programming language)? Don't worry: Verilog won't throw an error if you do that. Your design just won't work (??????).
If you're a "traditional" software programmer (I don't know how else to put it?) who's mostly worked in languages like C, Java, etc. then yes: just the conceptual change of hardware design will likely be difficult to overcome, but it is not insurmountable.
The actual "write the design" part hasn't been too bad for me - but that's because I don't write Verilog. I write my circuits in Haskell :) http://www.clash-lang.org -- it turns out Haskell is actually an OK language for describing sequential and combinatoral circuits in very concise way that embodies the problem domain very well -- and I have substantial Haskell experience. So I was able to get moving quickly, despite being 100% software person.
There are also a lot of other HDLs that will probably take the edge off, although learning and dealing with Verilog is basically inevitable, but I'd rather get accustomed to the domain than inflict self-pain.
MyHDL, which is an alternative written in Python, seems to be an oft-recommended starting point, and can output Verilog. Someone seems to even have a nice tool that will combine MyHDL with IceStorm in a tiny IDE! Seems like a great way to start -- https://github.com/nturley/synthia
HDLs such as Verilog are fine when you keep in mind what the D stands for. They are not programming languages. They are hardware description languages. Languages used to describe hardware. In order to describe hardware, you need to picture it in your head (muxes, flops, etc.). Once you do, writing the HDL that describes that hardware is reasonably easy.
What doesn't work is using an HDL to write as if it were software. It's just not.
Sure, I don't think anything I said disagrees with that. This conceptual shift in "declarative description" of the HW is a big difference in moving from software to hardware design, and no language can remove that difference. It doesn't matter if you're using MyHDL or Haskell to do it! It just so happens Haskell is a pretty declarative language, so it maps onto the idea well.
But it's not just about that... Frankly, most HDLs have absolutely poor abstraction capabilities, and are quite verbose. Sure, it isn't a software language, but that's a bit aside from the point. I can't even do things like write or use higher order functions, and most HDLs don't have very cheap things like data types to help enforce correctness (VHDL is more typed, but also pretty verbose) -- even when I reasonably understand the compilation model and how the resulting design will look, and know it's all synthesizable!
On top of that, simply due to language familiarity, it's simply much easier for me to structure hardware descriptions in terms of things like Haskell concepts, than directly in Verilog. It wouldn't be impossible for me to do it in Verilog, though -- I'm just more familiar with other things!
At some level this is a bit of laziness, but at another, it's a bit of "the devil you know". I do know enough Verilog to get by of course, and do make sure the output isn't completely insane. And more advanced designs will certainly require me going deeper into these realms where necessary.
I've recently been writing a tiny 16-bit processor with Haskell, and this kind of abstraction capability has been hugely important for me in motivation, because it's simply much easier for me to remember, describe and reason about at that level. It's simply much easier for me to think about my 2-port read, 1-port write register file, mirrored across two BRAMs, when it looks as nice and simple as this: https://gist.github.com/thoughtpolice/99202729866a865806fd6d..., and that code turns into the kind of Verilog you'd expect to write by hand, for the synthesis tool to infer the BRAMs automatically (you can, of course, also use the direct cell library primitive).
That said -- HDL choice is really only one part of the overall process... I've still got to have post-synthesis and post-PNR simulation test benches, actual synthesizable test benches to image to the board, etc... Lots of things to do. I think it's an important choice (BlueSpec is another example here, which most people I've known to use it having positive praise), but only one piece of the overall process, and grasping the whole process is still something I'm grappling with.
I haven't used PyMTL but have used MyHDL, years ago. (I'm glad for aseipp's comments because that proprietary mess is exactly why I haven't bothered keeping up with FPGA development, though IceStorm has gotten me interested...) Glancing at PyMTL it seems they're comparable -- neither require that much code (https://github.com/jandecaluwe/myhdl). I'd say without digging deeper MyHDL is older, stabler, better documented, and has been used for real world projects including ASICs. PyMTL might allow for more sophisticated modeling. I'd be concerned about PyMTL's future after the students working on it graduate, whereas MyHDL is Jan's baby.
As someone who recently got into programming FPGAs from decades of usual software development, it's not that bad. Takes a totally different mindset, but it's much more approachable than what it seems to be.
Actually a couple of YouTube videos can get you up and running pretty fast. Verilog is quite simple and doing things like writing your own VGA adapter is pretty straightforward and teaches you a lot and is a lot of fun.
Anecdotal evidence, but I am currently learning FPGAs and VHDL, and find it no more difficult than "normal" C. Way more expressive, for sure: a complete 8-channel PWM controller in under 150 lines. Parallelism without locking feels very cool.
I think VHDL makes things easier. Verilog has some fundamental nondeterminism: http://insights.sigasi.com/opinion/jan/verilogs-major-flaw.h... Still, HDL in general is probably on the order of difficulty as getting used to pointers. If you've already done a bunch of work with breadboards and logic gates it's even easier to grok. Of course a lot of people just instantiate a CPU core on the FPGA and program in C...
My understanding (and please correct me if I'm wrong!) is that it's pretty easy to burn out FPGAs, electrically, in a way that it's not particularly trivial to do with other sorts of circuitry. So one downside of exposing your bitstreams is people are going to blow things up and demand new chips.
There's no evidence so far that FPGAs come anywhere close to GPUs w/r to deep learning performance. All the benchmarks so far, through Arria 10, show it to be mediocre for inference, and the lack of training benchmark data IMO implies it's a disaster for that task. See also Google flat out refusing to define what processors they measured TPU performance and efficiency against.
FPGAs are best when deployed on streaming tasks. And one would think inference would be just that, yet the published performance numbers are on par with 2012 GPUs. That said, if they had as efficient a memory architecture as GPUs do, things could get interesting down the road. But by then I suspect ASICs (including one from NVIDIA) will be the new kings.
GPUs have GDDR5 and that is primarily what allows them to dominate in so many applications. Many of them are primarily memory-bound and not computation-bound. This means that the super-fast GDDR memory and the algorithms which can do a predictable linear walk through memory get an enormous speed boost over almost anything else out there.
> But by then I suspect ASICs (including one from NVIDIA) will be the new kings.
Yes, I suspect Nvidia has been developing/prototyping a Deep Learning ASIC for some time now. The power savings from an ASIC (particularly for inference) are just too massive to ignore.
Nvidia also seems to be involved in an inference only accelerator from Stanford called EIE (excellent paper here - https://arxiv.org/pdf/1602.01528v2.pdf).
Oh, these are entirely different approaches. New instructions are something I can use immediately (well, after I get the CPU). I know how to feed data to them, I can debug them, they are predictable, my tools support them, overall — I can make good use of them from day one.
As for FPGAs, anybody who actually used one, especially for more than a toy problem, will tell you that the experience is nowhere near that. Plus, at the moment, we are comparing a solution that could possibly perhaps materialise with something that will simply appear as part of the next CPU.
FPGAs sound amazing, but if you work with them you learn they can be a real PITA. The vision of the dynamically shapeshifting coprocessor FPGA is a long way in the future.
This is my understanding from CS. CPUs are designed to be general purpose which unfortunately also means that they are slower and less efficient than dedicated logic circuits.
The implication of FPGAs is efficient and high speed hardware-coded algorithms which can be reprogrammed.
It's almost like you're refactoring the hardware using software, giving you the ability to upgrade a chip's capabilities over time.
You could program any algorithm as a set of logic gates in hardware. Hardware can run completely in parallel so this allows insane performance gains for some tasks. FPGA are very close to ASIC (regular chips, like the CPU itself) in their performance, but are completely re-programmable. Some FPGAs I believe can be even re-programmed on the fly by the FPGA itself.
ASICs often are faster than more general FPGAs, and FPGAs use a lot more energy than a comparable ASIC. But the reprogrammable nature of FPGAs is an incredible advantage.
And there are even some phones that have FPGAs integrated nowadays, so there are lower power options (Altera) which would be appropriate for mobiles/laptops.
> On Intel's part, it seems kinda... late? It's like adding Bitcoin instructions when you already know everyone's racing to make Bitcoin ASICs.
Nah. You're thinking they have to be the best, that they have to win. That is true when the choice is buying an Intel CPU vs buying a GPU. But most developers will already have a late-model Intel CPU around.
For Intel, this is a defensive move. They're bending the performance curve for particular use cases. All this has to do is to keep a certain number of developers from bothering to go the extra mile to use special hardware and software. People working at Google's scale won't care. But there will be other people who try running something and say, "Current hardware's fine. We could do better with GPUs, but not so much better that it's worth the time and money."
I'm sure it's also the start of a learning cycle for them. Some people told them their current processors were not enough, so they added this. They'll now hear why this isn't adequate either and adapt again if the business case is there.
Not exactly. AVX512 is used on the Xeon Phi, which is Intel's floating-point co-processor for deep learning / OpenCL tasks. But that being said, GPUs blow the Xeon Phi out of the water in terms of FLOPS, local RAM, and RAM bandwidth.
Isn't Phi better at dealing with branch-heavy code? GPUs absolutely kill on straight-forward compute, but they seem to flail helplessly if the logic is highly conditional.
I've read that in some cases GPUs evaluate both branches of a condition and discard the unwanted results. A CPU doesn't do this.
>>Isn't Phi better at dealing with branch-heavy code?
This may be true. Generally, code that is infinitely parallelizable so it can be run on a GPU isn't going to be branch heavy. There are no benchmarks to support this, but I would believe it is true. GPUs typically have 1 branching unit per 64-128 threads, while the Phi has 1 branching unit per 2 threads.
The real difference is that GPUs use SIMT, while the Phi uses MIMD. Each Phi thread is a POSIX thread. It can do its own thing, live its own life. With GPUs you just specify how parallel your work-group, thread block, warp, or wave front is (the name depends on the platform: in order, OpenCL, CUDA, Nvidia DX12, AMD DX12).
>>I've read that in some cases GPUs evaluate both branches of a condition and discard the unwanted results. A CPU doesn't do this.
If you mean branch prediction, then yes, it will end up inadvertently executing the wrong instructions and discard them, but it halts execution of the wrong branch at the first opportunity and reschedules to correct its mistake.
I know very little of GPU architectures, but I thought that the last few generations of GPUs were straightforward SIMD (i.e. all lanes are run in lockstep and divergence is handled at a higher level).
It's disingenuous to use a highly charged term like "blogspam" and then qualify it as "kinda". Kettle/black. Why say "blogspam" if you don't mean it?
And while I agree that the OP is not in-depth, it is not blogspam. AVX512 is a very real, very very real, order-of-magnitude performance improvement. To give you an idea, going from standard 64-bit to AVX-256 with MKL Numpy (under ipython) you will go from 6-7 seconds to under 1 second, on an identical 2012+ processor, on this code:
import numpy as np
np.random.seed(1234)
xx = np.random.rand(1000000).reshape(1000, 1000)
%timeit np.linalg.eig(xx)
The best way to test this is to run the above on bog-standard Ubuntu 16.04 (sudo apt-get install python(3)-pip / sudo pip(3) install numpy first), and then run it again after installing Anaconda Python.
AVX512 will halve this timing again, and before you tell me that Cuda will do this much faster, I'll remind you that there is a point at which 4000 cuda stream processors operating on 64-bit data can, in many algos, be a vastly too large window. Phi KL's 72 branching cores, with 4 threads and 2x512-bit AVX each, can be a very strong competitor to the flatspace GPUs in nonlinear algos. There is clear blue sky between the scalar architectures and the massively parallel architectures, which has not been targeted yet, and this is where KL fits in. After all, the GPUs are still, first and foremost, designed to rotate and translate massive amounts of triangles in 3d shooters, a task for which a vast number of very dumb little cores is optimal. As much as Nvidia would like you to believe different, that is still their primary market and it doesn't always map well to everything.
Between the current extremes of a small number of very flexible and clever cores (Xeon/i7), and 4000 very stupid but numerous cores (GPU), there is a lot of room for something in between.
Intel isn't so much late to the party as attacking an obliquely different segment of the market, which may prove very interesting indeed.
I am not so sure if it is "late". Sure, for training models it is far too late, we have semi-dedicated hardware for that at this point. But it is possible that it will become increasingly common to run these models on end-user systems and in that case these instructions could provide a bit of a speed boost.
I am not familiar with how deep learning algorithms would benefit from these instructions in a way different from how they can just use vector instructions optimized for graphics processing. So my question is, how are these "Vector instructions for deep learning enhanced word variable precision" different from other vector instruction sets?
Good question. It doesn't say! The PDF has absolutely no detail on what it actually does. Googling the instruction brings up people updating CPUID fields, but nothing to do with actually using it.
I'm guessing it means you can choose the precision (word length) of the computations, eg 8 bits, 16 bits, 32 bits, etc. This probably only applies to integer ops, but if they could let us pack FP16 operands into 512 bit vectors, that would be awesome!
I hope they're going even farther and considering minifloats. 8-bit integers are easy but limiting and unexciting. 8-bit floats are exactly what some machine learning needs.
If you're updating your weights enough, precision can just be wasted computation. A 512-bit-wide computation could be updating 64 very coarse-grained weights at a time.
Not exactly late. There's no physical law that says a CPU can't beat a GPU for a certain family of operations, and in fact in practice, small toy networks with non-dense operations (for instance, if you are doing research on optimization algorithms for DNNs or Bayesian inference algorithms) tend to run faster on the CPU than GPU.
As the other poster mentions, I assume their view is that these types of operations will become increasingly prevalent in consumer software applications. Lots will be powered by the cloud, but it would be useful if some inference can be done on a laptop CPU.
"The cloud" still has to run the models on something, and as long as Intel roughly keeps up with the cost of specialised hardware for doing inference, many people will just keep doing inference on Intel CPUs, rather than bothering with specialised hardware.
That's true... I'd been assuming they'd use specialized hardware for inference when writing that, but totally failed to think how cost may make a CPU more desirable.
Apologies if I'm dumb, but did Intel actually tell exactly what operations these "deep learning" instructions do?
I skimmed through the linked Intel manual, but it seems that it just defines two instruction family names (AVX512_4VNNIW and AVX512_4FMAPS) without actually saying what they do.
* It almost feels like a marketing term, in the same way everything was "multimedia instructions" back in the 90s.
It's been a long time since I look at machine language but, if I understand correctly, the difference is the capacity of the operation. Instead of performing an addition operation on 1 or 2 bytes at a time the operation can be performed on 64 bytes at a time. Given that 'deep learning' requires a number of parallel simple operations expanding the volume of parallel operations in a single CPU instruction boost capacity.
Seriously, just supposition until someone with better understanding chimes in.
Really curious how Intel (others?) determined that these new instructions are for deep learning. The instructions are for permutation, packed multiply+low/high accumulate, and a sort of masked byte move. Are these super common in deep learning?
Deep learning is all about GEMM operations: matrix-matrix, or matrix-vector multiplications. And the values in these matrices are typically low precision (16 or even 8 bits is enough). So if you can pack many of such low precision values into vectors, and perform a dot product in parallel, you will get a pretty much linear speedup.
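As a concrete illustration of that kind of packed low-precision multiply-accumulate (plain AVX2, not the new AVX512 extensions; the function name dot_i16 and the layout are just assumptions):

#include <immintrin.h>
#include <stdint.h>

/* 16-bit integer dot product: vpmaddwd multiplies 16-bit lanes and adds
 * adjacent products into 32-bit lanes. n is assumed to be a multiple of 16. */
int dot_i16(const int16_t *a, const int16_t *b, int n) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    int32_t tmp[8];
    _mm256_storeu_si256((__m256i *)tmp, acc);
    int sum = 0;
    for (int i = 0; i < 8; i++) sum += tmp[i]; /* horizontal sum */
    return sum;
}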
Cool stuff, but unfortunately Intel has really delayed AVX512 instructions for their main consumer processors (ffs, it was supposed to hit on Skylake). It looks like we have another die shrink to go after Kaby Lake before we get that sweet ultra-wide SIMD:
I did tests with Torch7 a few days ago to compare the OpenBLAS library on multiple cores against CUDNN.
Single core difference was around 100x.
Using multiple cores improved the situation but there still was about 15x difference that did not improve by adding more cores (I tested with up to Xeon 64 cores on a single machine).
I did not test with Intel MKL library as I did not have time for this.
I wonder how much these instructions would improve the situation, and whether anybody here has experience with the Intel MKL library.
Not holding my breath, since they haven't gotten AVX512 into desktops yet! (Kaby Lake might be worth something if they did... but nope, just more lie7's mostly...)
How much will this help? Neural networks work so well on distributed computing systems because the operations run in parallel. With a neural network, you can leverage tens of thousands of separate computing cores. Top intel processors have around a dozen cores or so.
Let's say you want to run a service that does scene classification in images using deep convolutional networks. It takes 100ms on an Intel chip today, and 3ms on a GPU. But you can't run your entire service on a GPU...you're going to need that CPU anyway. So if Intel can close the gap even a little bit (say, down to 25ms), it might be worth it to avoid the headache of running a GPU server and just buy more Intel chips, which can run anything.
There may be a market sweet spot between dedicated high-performance machine learning that's ideal for experts who are willing to buy/lease dedicated hardware, and regular developers who want to use simpler machine learning toolkits to solve easier problems. Toolsets like Julia, NumPy, R, etc. would take advantage of the new instructions and basically give your average developer a speed boost for simply having an Intel PC.
This reminds me when you had to buy crypto accelerator cards for your webservers (does anyone remember Soekris?)...
The market stalled after Intel started adding support for crypto operations on-die.
This looks like Intel protecting its server chip business by shooting at Nvidia before they start selling the Titan-X with some other server chip (what chip - I don't know; but I bet there's a business plan on a spreadsheet somewhere).
I guess training neural nets will be a GPU / custom chip task for some time; but as soon as you have developed the classifier / predictor you have to run it on whatever hardware you have, and that means you might need less parallel computing power and it doesn't make (financial) sense buying GPUs for that.
Xeon Phi have got more cores, 50 or 60, and the cores share memory. Unlike GPUs memory space is uniform, there is no need to do explicit transfers to/from GPU's memory.
> Xeon Phi have got more cores, 50 or 60, and the cores share memory
60 cores. Furthermore, the cores have multiple (4) hardware threads per core. So on the latest Phi (Knights Landing) you can have several hundred threads (240) running concurrently without contention (in principle).
> Intel is hoping that the Xeon+FPGA package is enticing enough to convince enterprises to stick with x86, rather than moving over to a competing architecture (such as Nvidia’s Tesla GPGPUs ).
Then let it prove its performance on a few deep learning benchmarks. If it is so fast and accessible, we'll be impressed.
On the Knights Landing CPU a single core is able to perform 2 x 16 operations on single-precision numbers, as a core has two AVX512 execution units. Each core can run 4 threads, and KNL has 64 physical cores.
This. Also, trying to slow the loss of customers purchasing what are essentially Nvidia GPU-heavy systems with Nvidia CPUs tacked on to act as the controllers.
TL;DR: You can basically do 8 operations at once (but of course, it's not that simple).
Think about a normal 64-bit register: you can add and subtract 64-bit numbers, OR and AND them, etc. Now think about a 64-byte register. What could you do with that? Well, suppose you're looping through contiguous memory and changing each word AND the change that your making to each word is independent of the change you make to any other word. With a normal 64-bit register you'd have to do one operation on each word. But with a 64-byte "register", you can load 8 words into it and do just one operation for the same effect as applying that operation 8 times (once to each word). Thus, some code---vectorizable code---can be sped up nearly 8x.
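A tiny sketch of the same idea with AVX512F intrinsics (illustrative only; the arrays and the function name add_arrays are assumptions, and n is assumed to be a multiple of 8):

#include <immintrin.h>

/* One instruction adds eight 64-bit doubles at a time. */
void add_arrays(const double *a, const double *b, double *out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m512d va = _mm512_loadu_pd(a + i);
        __m512d vb = _mm512_loadu_pd(b + i);
        _mm512_storeu_pd(out + i, _mm512_add_pd(va, vb));
    }
}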