I guess I'm not understanding something. How are they reversing a trend?
Apple spent a lot of time and money developing the OpenCL standard and gaining industry support. OpenCL isn't strictly about GPGPU computing; it's about making maximal use of available resources. AMD is adopting the standard; they implemented the CPU portion of it first and went ahead and released it. There is no doubt in anyone's mind that they'll follow up with support for GPU targets. So if anything, they're embracing the trend and just trying to bring their tools to completion.
So yeah, the current SDK doesn't support GPU targets, but how is this in any way, shape, or form reversing the trend?
I feel like CPUs and GPUs are converging anyway. They are both going multicore, and GPUs are becoming better suited to general-purpose processing. Sure, GPUs have lots of simple dedicated cores while CPUs have fewer, more complex cores, but over time the two will converge.
In the long run, you won't need dedicated logic for graphics processing, at least on the low-end. Processors will just dedicate a few cores to graphics processing.
If you doubt this, look at what people said a few years ago about graphics integrated into the chipset. Now you have respectable solutions like NVidia's 9400. Within a few years, I bet that some relatively high-end solutions will be integrated into the silicon, and the low-end processors will do everything on the CPU. Over the long-term (probably when we have 32-64 cores per CPU), the GPU will become irrelevant.
I don't think they are converging at all. The fundamental difference is that GPUs spend on computational units the vast amount of silicon real estate that a CPU spends on cache. That gives them much higher peak computational performance, but it also means that your memory is uncached, which will kill you in applications that do unstructured memory access.
I think of it as analogous to cars vs semitrucks. Trucks are efficient at hauling massive amounts of stuff to the same place, but not at transporting 100 commuters to their different workplaces. They are optimized for different problems, and I don't think there's a better argument for saying that GPUs and CPUs will converge than there is for saying that semitrucks and cars will converge.
But GPUs do have an internal memory hierarchy; there is a local cache with much faster access times than global memory.
Personally, I agree with the parent. Aside from the obvious solution of integrating a GPU-like accelerator on the chip, the multicore trend in general purpose CPUs has more, simpler cores. I agree with your overall characterization of what CPUs are good at versus what GPUs are good at, but I still see some convergence.
This looks to me like a publicity stunt, since ATI's original GPGPU offering was a flop. Nvidia's OpenCL SDK actually works on their graphics cards, though it was in limited beta last I checked.
Not to mention that you can compile CUDA code in "emulation" mode for execution on the CPU, too. The utility is largely for debugging, because the CPU is darned slow executing it, so I'm not sure it has any use in production.
I think that's just their way of making sure they don't lose track of where their possibly still very buggy software goes so that they can get the feedback and tell their users when it's time to upgrade.
The differences with CUDA are so small though that it is fine to develop for CUDA now and upgrade to OpenCL when they release it publicly.
Interesting development! One thing I don't get, though: if you are developing stuff for GPGPUs and you don't have one yourself, getting one is hardly a big hurdle.
The cheapest cards that will work are in the $100 range, hardly a major expense.
The GPGPU is probably one of the biggest developments in the last couple of years; it puts enormous power in the hands of individuals with a surprisingly small power bill and footprint.
I'm not sure if most people understand the kinds of threads/cores that GPUs actually have, and with that, the kinds of computations they're actually good at. A GPU's threads are extremely simple execution pipelines - nothing like the cores of an Intel Quad Core chip. Groups of GPU threads actually execute together in lock-step, executing the same code (but with different parameters).
Such devices are extremely good at handling data parallelism - where the same computation is performed on a large chunk of data - seen in areas such as graphics and scientific computing. But you're probably not going to, say, speed up your webserver with a GPU. If you don't have a computation that is extremely data parallel - that is, capable of having thousands of execution contexts working on data at the same time - a GPU won't help.
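To make the lock-step idea concrete, here is a minimal sketch in plain Python (not real GPU code). The names run_lockstep and scale_and_offset and the group size of 32 are made up for illustration; the only point is that every thread in a group runs the same function and differs in the element it is handed.

```python
def run_lockstep(kernel, data, group_size=32):
    """Apply one kernel to every element, one group of 'threads' at a time."""
    out = []
    for start in range(0, len(data), group_size):
        group = data[start:start + group_size]
        # Every thread in the group runs the same instruction stream;
        # the only thing that differs is the element each one works on.
        out.extend(kernel(x) for x in group)
    return out


def scale_and_offset(x):
    # Hypothetical per-element computation with no dependence on neighbors.
    return 2 * x + 1


result = run_lockstep(scale_and_offset, list(range(100)))
```

A computation fits this shape only if the same function really does apply to every element independently, which is why graphics and scientific code map well and a webserver does not.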
Researchers (especially in parallel computing) are already figuring out how to port all sorts of algorithms over to the GPU -- and not just scientific or graphics code -- we're talking general purpose algorithms like sorting, ranking, cryptography, networking, database operations, etc:
The biggest obstacle to widespread adoption of GPGPU techniques today is that things are still in so much flux that mature, stable APIs have not yet emerged. There is fierce competition between Intel (Larrabee), Apple/AMD (Grand Central/OpenCL), and NVIDIA (CUDA) to get their respective APIs to be the dominant ones, and no one's yet come up with a mature wrapper API that can target any of them.
This is in large part because the hardware itself is still in flux, morphing away from the graphics-specific pipeline of 10 years ago and into a general purpose SIMD architecture. This has meant a change in the kinds of operations allowed on the GPU. For example, branches used to be a big no-no in graphics hardware: the performance hit caused by a branch was disastrous on a massively-multicore system. However, all hardware makers are now adding support for branches, because it's almost impossible to port a lot of CPU algorithms without them.
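Here is a toy cost model of why branches hurt on lock-step hardware, in plain Python. It assumes (as a simplification, not a description of any particular chip) that a group of threads has to run each side of an if/else that at least one of its threads takes, so a group whose threads disagree pays for both sides. The name divergent_passes and the group size of 32 are hypothetical.

```python
def divergent_passes(conditions, group_size=32):
    """Count lock-step passes needed to run an if/else over all threads.

    conditions[i] is True if thread i takes the 'if' side.
    """
    passes = 0
    for start in range(0, len(conditions), group_size):
        group = conditions[start:start + group_size]
        takes_if = any(group)        # someone in the group needs the 'if' side
        takes_else = not all(group)  # someone in the group needs the 'else' side
        passes += int(takes_if) + int(takes_else)
    return passes


uniform = [True] * 64                    # every thread agrees
mixed = [i % 2 == 0 for i in range(64)]  # neighbors disagree

uniform_cost = divergent_passes(uniform)  # 2 groups, one pass each
mixed_cost = divergent_passes(mixed)      # 2 groups, two passes each
```

Under this model the mixed case costs twice as much as the uniform one for the same number of threads, which is the kind of penalty a branchy CPU algorithm runs into when ported naively.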
In any case, I'm confident that within a year or two, things will have settled down somewhat, at which point there's going to be a mad dash by developers to start using GPUs (probably systems- and library-level developers rather than application-programmers).
I am a researcher in parallel computing. All of these papers represent good work and are contributions to the field, but they still obey the limited model I was referring to: ship a large amount of data to the GPU, do data parallel computations on that data, and ship the results back. As I pointed out in a post below, this model does not work well when you only have small quantities of data at a time, yet there is parallelism to exploit.
I'll defer to your judgment about this stuff, then. (I'm a PhD student in computer vision, and we've recently been using a GPU SVM library that's been amazing for cutting down our processing times, so I guess I've been a little dazzled by this stuff.)
Anyways, since you're in this field, what's your feeling about the future of parallel computing, with regards to the different vendors? Which of CUDA/OpenCL/Larrabee will win out? Or none of the above? When will APIs settle down?
Honestly, I don't know, and anyone who claims to know is selling you something.
Your question is the question in parallel computing right now. And it affects all sizes and scales, from processor architecture (look at the different architectures of an Intel Quad Core, Cell, GPUs and the upcoming Larrabee and Fusion) to supercomputers (BlueGene style thousands of slow cores with fast interconnect, RoadRunner style of typical multicore processors with Cells as accelerators, Nvidia's giant GPU box, or just lots of SMPs). We don't know what the future will look like, which makes this an interesting time to be in the field. People at all levels are experimenting with different architectures. We don't know what will win, whether any one thing will win, or when we'll know.
With that said, I don't think APIs at the processor level will settle down until the hardware does. My understanding of OpenCL is that it aims to be a programming model that works on architectures as different as GPUs, Cell and Larrabee, and that it will supplant CUDA. That sounds like a great idea, but lots of great ideas haven't worked in practice before.
I think it's going to be at least several years of experimentation before the hardware settles down. My own belief (that is, opinion not based on experimental data) is that we'll end up with a heterogeneous chip with lots of simple cores for parallelism and a small number of sophisticated cores for sequential computation, all part of an integrated memory hierarchy.
The basic metric for this kind of comparison is how long can you get your 'compute' node to work on a part of a problem without any new input data and without any intermediate results that need posting for other parts of the code to continue (rendez-vous I believe these are called).
The longer that time the better suited the problem is for a massive parallel solution.
If the time is low relative to the IO that needs to be done, then you'll find very soon that the bus that carries data between the host CPU and the number cruncher is the bottleneck.
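As a back-of-the-envelope version of that metric, here is a small Python sketch. compute_to_transfer_ratio is a made-up helper, and the inputs in the example (1 GB of data, an 8 GB/s bus, the two compute times) are assumed numbers, not measurements.

```python
def compute_to_transfer_ratio(compute_seconds, bytes_moved, bus_bytes_per_second):
    """Ratio of time spent computing to time spent moving data over the bus.

    Well above 1: the compute node stays busy and the bus is not the limit.
    Below 1: the bus is the bottleneck.
    """
    transfer_seconds = bytes_moved / bus_bytes_per_second
    return compute_seconds / transfer_seconds


# Assumed example: 1 GB shipped over an 8 GB/s bus takes 0.125 s.
long_running = compute_to_transfer_ratio(1.0, 1e9, 8e9)    # compute dominates
short_running = compute_to_transfer_ratio(0.01, 1e9, 8e9)  # bus dominates
```

The same amount of data gives very different answers depending on how much work is done per byte, which is really the whole question.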
The big problem is that you can work with some fairly large amounts of data and still have the bus be your bottleneck. I was doing some GPU work a few summers ago that focused on matrix multiplications. We were sending matrices with 8k numbers on a side to the GPU for multiplication and still ending up with the bus being the slowest part of the computation.
How much RAM was on the card? 64-bit numbers * 8000^2 = 512 MB. Granted, today you can have 4GB per card, but back then you were probably stuck with a fraction of that.
Still, PCIe 2.0 x16 is limited to 8GByte/s, so I guess the real question is how many matrices were you multiplying?
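The arithmetic above, spelled out as a Python sketch. It uses the 8000 x 8000 size, 64-bit numbers and the 8 GB/s PCIe 2.0 x16 figure from this thread, with decimal megabytes. The variable names are mine, and 8 GB/s is the quoted peak; sustained rates are typically lower.

```python
n = 8000                    # matrix is n x n
bytes_per_number = 8        # 64-bit values

matrix_bytes = n * n * bytes_per_number   # 512,000,000 bytes
matrix_megabytes = matrix_bytes / 1e6     # decimal MB, as in the post above

bus_bytes_per_second = 8e9                # quoted PCIe 2.0 x16 peak
seconds_per_matrix = matrix_bytes / bus_bytes_per_second
```

So at the quoted peak, each matrix takes on the order of hundredths of a second to cross the bus in one direction, which is why the number of matrices being shipped matters so much.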
GPUs are not a solution to every problem. There are several problems currently:
a) Explicit data copy required to and from GPU makes GPGPU useless for many problems.
b) GPUs are not as general purpose as CPUs. For example, no indirect jumps (goto address x) are allowed, so no real subroutine calls. And no recursion.
c) GPUs are not good at branching. The toll on branching can be heavy.
d) Caches are fairly small. The on-chip shared memory, which is essentially a software-controlled cache, is a good idea in some cases, but it's still very small.
Well the thing is that the computational peak power of the GPUs is not free. It comes at the cost of reduced flexibility. Nothing is free in computer architecture.
Well, count me in, as far as I'm concerned the mad dash has already begun. I've decided to splurge on the largest CUDA capable card that will fit this machine without blowing up the power supply.
Stuff like this is still cutting edge but will be mainstream soon. The time to gain experience is today, not later.
The model is parallel SIMD: in other words, multiple banks of cores, and within each bank all the cores are doing the same thing at the same time.
A GPU won't help with our current programming model, so maybe there is room for a new model then?
That won't happen overnight, but one thing is for sure, if you want to get more performance than what your applications are giving you today then you'd better learn how to code for massively parallel systems.
I'm intimately familiar with the programming model that GPUs use. The difficulty is that only a limited subset of our problems are amenable to it. I think we need something a little less special-purpose than a GPU to be more widely applicable.
Certainly, the computing world is going parallel. My personal belief is that what we end up with will be a heterogeneous chip with some parts good at massively parallel computations, and some good at sequential computations.
But I think GPUs themselves represent a stop-gap; the communication cost of transferring data off the motherboard to the card is just too high for fine-grained parallelism. Once we have that sort of architecture on-chip the game changes. Intel's Larrabee and AMD's Fusion promise to do this. The Cell was close, but making the vector processors completely divorced from the normal memory hierarchy makes programming for them much more difficult. (Although it does give me many research opportunities.)
It sounds like you have an interesting work environment.
I agree that GPUs are a stopgap, but they're a pretty interesting one.
I've had all kinds of coprocessors over the years: small clusters, DSP daughter boards, transputers and so on. Nothing to date comes even close to the power of the GPU boards that you can buy right now.
The comms cost is indeed where the problems lie, but that can go two ways:
1) a more formalized form of high speed interconnect (such as infiniband directly on the GPU card)
2) a port of a general purpose OS to a GPU.
Some people are already working on (2); (1) is probably not in the cards unless somebody orders a very large number of boards.
There are motherboards that will hold 4 or even 6 PCI express cards too.
I don't have any experience with the Cell architecture. I'll go do some reading; it sounded interesting enough when it was released, but I never got around to studying it. More reading to do I guess :)
I'm a CS PhD student doing high performance systems research. My dissertation work has focused on Cell, and this summer I'm working with GPUs in an internship. (My profile has a link to my academic page where you can find my Cell related work.)
Personally, I think that GPUs as we know them won't be around long enough for 1) or 2). Moving the GPU (or a GPU-like computation accelerator) onto the chip itself changes everything.
I think it's also worth keeping in mind that GPUs are, essentially, tiny computers unto themselves. Putting a GPU in a system now is like installing a computer in a computer.
> I think it's also worth keeping in mind that GPUs are, essentially, tiny computers unto themselves. Putting a GPU in a system now is like installing a computer in a computer.
I've been doing exactly that for the longest time, first with a DSP32, then with a transputer card. Then I switched to small scale clusters (the largest had 10 nodes, see here http://www.clustercompute.com/ ), now GPUs.
Your take on 1) and 2) is noted, thanks for that insight.