Perhaps it's been addressed in the talk (still listening to it), but what kinds of programming languages/paradigms are particularly well suited for the belt-model?
I mean, it's nice to be assured the Mill is a general purpose CPU that handles most existing code really well, but what kind of code would it handle really well, if you know what I mean? And what kind of programming language would be well suited to making good use of its strengths?
They have better than usual performance on function calls, so feed it a poorly optimized Lisp and you'll probably get a large relative performance advantage. But I don't think that's what you're looking for.
Going through the most striking parts of the architecture:
- They can pipeline outer loops, even with function calls in them. I expect that will perform well on event loop patterns like you find in non-blocking IO server frameworks.
- They have cheap pipelining, vector operations, and (in the high-end models) lots of functional units. All the statistical languages like R and Julia should be able to get something out of that.
- If you have a lot of repeated pure loops (i.e., map() calls), your compiler could fuse two loops into one to try to saturate the Mill's functional units. Using their pick and smear operations, I think you can even get this to work on different-size loops (with less benefit the more their sizes differ, of course).
- They have novel memory protection stuff, but that's more OS than programming language.
I'm clearly not part of the Mill team though, so I could be wrong about any or all of those.
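To make the loop-fusion bullet above concrete, here is a minimal Python sketch of fusing two independent map-style loops of different lengths into one. It only shows the source-level transformation: the function names are hypothetical, and the `if` guards are a plain stand-in for whatever a Mill compiler would actually do with pick and smear, not a model of those operations.

```python
def unfused(xs, ys, f, g):
    """Two separate pure loops, run one after the other."""
    out_x = [f(x) for x in xs]
    out_y = [g(y) for y in ys]
    return out_x, out_y


def fused(xs, ys, f, g):
    """One loop that runs both bodies per iteration.

    The two bodies are independent of each other, so a wide machine could
    in principle issue them side by side. The guards handle inputs of
    different lengths: once the shorter input runs out, only one body does
    useful work per iteration, which is why the benefit shrinks as the
    sizes diverge.
    """
    out_x, out_y = [], []
    for i in range(max(len(xs), len(ys))):
        if i < len(xs):
            out_x.append(f(xs[i]))
        if i < len(ys):
            out_y.append(g(ys[i]))
    return out_x, out_y
```

Both functions return the same pair of lists for pure `f` and `g`; the fusion is only valid because neither body has side effects the other can observe.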
Hi Will, a recurring objection from the audience in the talks is that you will not get many benefits from such a wide architecture on general purpose code because there is not enough "available parallelism". Do you have any concrete code examples where the Mill will find stuff to do where an OoO will not?
There are lots of small ways we can get more units in use more of the time.
ILP is a bit of a misnomer on the Mill as each Mill instruction contains many operations and these execute in 'phases' over the subsequent cycles. In the extreme(ly common) case you have a single-instruction tight loop but the second phase of the first iteration is running as the first phase of the second iteration runs. Mind-bending.
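The overlap described above can be sketched as a toy schedule. This is not a model of the Mill's real phases: it assumes, purely for illustration, that one loop iteration issues per cycle and that phase `p` of an iteration runs `p` cycles after that iteration issues. The function name is hypothetical.

```python
def phased_schedule(iterations, phases):
    """Return, for each cycle, the (iteration, phase) pairs active in it.

    Assumption: iteration i issues in cycle i, and its phase p executes
    in cycle i + p.
    """
    total_cycles = iterations + phases - 1
    schedule = []
    for cycle in range(total_cycles):
        active = [
            (cycle - phase, phase)
            for phase in range(phases)
            if 0 <= cycle - phase < iterations
        ]
        schedule.append(active)
    return schedule


if __name__ == "__main__":
    for cycle, active in enumerate(phased_schedule(iterations=3, phases=2)):
        print(cycle, active)
```

With three iterations and two phases, the middle cycles each hold the second phase of one iteration together with the first phase of the next, which is the single-instruction tight loop case from the paragraph above.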
And as the Mill is a belt machine, and as the hardware manages the call stack, the units can be in use by the caller even when the PC is in the callee. If you schedule, say, a `multiply` that takes 4 cycles and then next cycle call into a function (which usually takes just a cycle) then the machine can go run the function, however long it takes. When the function returns that `multiply` has long since completed but the spiller is going to use result replay to drop the results on the belt at the right time.
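As a rough illustration of "at the right time", here is a toy Python model in which the caller's clock counts only the caller's own cycles. Everything in it is an assumption made for the sketch: the class and method names are hypothetical, the call is taken to occupy exactly one cycle of the caller's schedule, and no real spiller or belt behaviour is modelled.

```python
class CallerFrame:
    """Toy caller frame whose clock counts only the caller's own cycles."""

    def __init__(self):
        self.cycle = 0        # caller-visible cycles
        self.real_cycles = 0  # elapsed cycles, including time spent in callees
        self.pending = []     # (retire_cycle, value), in caller cycles
        self.drops = []       # (caller_cycle, value), in drop order

    def _advance(self, real=1):
        self.cycle += 1
        self.real_cycles += real
        for retire_cycle, value in list(self.pending):
            if retire_cycle == self.cycle:
                self.pending.remove((retire_cycle, value))
                self.drops.append((self.cycle, value))

    def issue(self, value, latency):
        # The result is due `latency` caller cycles after the issue cycle.
        self.pending.append((self.cycle + latency, value))
        self._advance()

    def nop(self):
        self._advance()

    def call(self, callee_cycles, result):
        # Assumption: the call occupies one cycle of the caller's schedule,
        # however many real cycles the callee runs.
        self._advance(real=callee_cycles)
        self.drops.append((self.cycle, result))


def run(callee_cycles):
    frame = CallerFrame()
    frame.issue("product", latency=4)  # issued in caller cycle 0
    frame.call(callee_cycles, "ret")   # issued in caller cycle 1
    frame.nop()
    frame.nop()
    return frame.drops, frame.real_cycles
```

In this sketch `run(1)` and `run(500)` produce the same sequence of drops in caller cycles; only `real_cycles` differs, which is the point of the paragraph above: the caller's schedule does not depend on how long the callee takes.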
A lot of time is spent in loops, though, and the Mill does really well at those! On the Mill, all levels of loop can be pipelined because each level of loop is given its own call frame, allowing the result replay to work. See http://millcomputing.com/topic/pipelining/
Sorry for just linking to so many of the talks! ;)
Thanks, I'll take a look at those links. In the meantime, I have another question. How much does the Mill team plan to share with the general public before any product release? Can we expect an emulator? It would be great to toy with.
I am curious why you folks didn't go the RISC-V route (open source) and create a new architecture for all to use and expand? With patents etc. I fear the Mill is going the Tilera route and won't be heard from much in a couple of years.
If you have working prototypes and a community to collaborate on them, getting funds for actually testing it in silicon is not that hard, esp. not if you collaborate with a university. With well-performing silicon one can then expand, and the folks who designed it are guaranteed to have a job updating and enhancing it for the next 10 years or more :) Esp. if other companies sponsor their work.
Are the Mill's creators afraid that, say, Intel will use their ideas and create their own chips, excluding them from the business?
ARM itself is beaten by RISC-V's initial Rocket prototypes in power and area, and the upcoming BOOMs are a good match for ARM Cortex A15+ (only 64 bit). And they only just recently started on OoO chips. And do you see ARM suddenly creating RISC-V chips, or borrowing their ideas?
Do I understand correctly that the load-time code specialization described in this presentation is not inherent to the Mill hardware, but instead merely a proposed OS-level mechanism for supporting binaries that are portable across the model family? Could a Mill-targeting compiler choose to bypass this mechanism and directly output model-specific concrete code?
The specialization isn't necessarily load-time. I think they're envisioning it as most likely being an install-time process, so that distributed code is still portable but they don't have to worry quite so hard about the latency of the specializer. Ahead-of-time specialization will still need to be possible for creating things like bootloaders and OS install media.
I think it's safe to say that the future will have a lot of fat binaries.
Try watching the videos at 1.5x or 2x speed on YouTube. This gives you the same information density as reading and you don't miss the nice animations travisb mentioned.
I haven't seen the video yet (2h is not that easy), but I'm curious - currently fast CPUs spend most of their time waiting on cache misses. Does the Mill CPU architecture deal with this somehow, besides the claimed "10x single-thread power/performance gain over conventional out-of-order superscalar architectures"¹?
If you go zigzagging all over main DRAM or disk unpredictably then there's no magic wand we can wave ;)
We can, however, generally make fewer memory accesses overall, as we have backless memory.
When you do a load you specify when it will retire, and it will get the value in memory at the time it retires. What happens is that the load-store unit goes off and gets the memory as soon as possible, swallowing latency, but snoops on the cache coherency traffic.
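A toy Python model of that load behaviour, as a sketch only: the class and method names are hypothetical, the "snooping" is reduced to a hook in `store`, and a store is treated as visible to a load only if it lands in a cycle strictly before the load's retire cycle (an assumption of this sketch, not a statement about the hardware).

```python
class ToyMemory:
    """Toy memory with loads that name their own retire time."""

    def __init__(self):
        self.cycle = 0
        self.cells = {}      # address -> value, unset cells read as 0
        self.in_flight = []  # loads that have been issued but not collected

    def tick(self, n=1):
        self.cycle += n

    def load(self, addr, retire_in):
        """Issue a load: fetch right away, hand the value over later."""
        entry = {
            "addr": addr,
            "value": self.cells.get(addr, 0),
            "retire_at": self.cycle + retire_in,
        }
        self.in_flight.append(entry)
        return entry

    def store(self, addr, value):
        self.cells[addr] = value
        # Stand-in for snooping: patch loads of this address that have
        # not yet reached their retire cycle.
        for entry in self.in_flight:
            if entry["addr"] == addr and self.cycle < entry["retire_at"]:
                entry["value"] = value

    def retire(self, entry):
        """Collect a load's value once its retire cycle has been reached."""
        if self.cycle < entry["retire_at"]:
            raise RuntimeError("load has not retired yet")
        self.in_flight.remove(entry)
        return entry["value"]
```

For example, after `mem.store(16, 1)`, issuing `h = mem.load(16, retire_in=3)`, ticking once, storing `2` to the same address and ticking twice more, `mem.retire(h)` hands back the later value even though the fetch started before the store.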
I am curious about how it compares to streaming loads. It is a common technique to avoid cache misses: when you have a large number of registers, you can issue loads in advance, bypassing the cache, and the CPU will manage the data dependency in HW. The idea is that, hopefully, by the time you use the register its data will already be available. Otherwise you stall as usual.
There are many facets to this. From a great height they are roughly similar, but in the details they differ.
For example, the Mill is a belt machine and each stack frame has its own belt and the hardware takes care of spilling in-flights across calls to any depth. Also, the Mill loads are orthogonal to cache management and we offer immunity to aliasing and false sharing.
I think Godard actually explicitly said that in a Q&A of one of the very first lectures - although I have to admit at that point the finer details of the discussion were way over my head.
yeah well, during mis-predictions, interrupts etc., undoing that entire state (of name-mappings, and perhaps re-creating a different one) might not be easy.
moreover, since it is running kind of a cycle ahead, this whole network seems more like a cross-bar switch for renames :)
On a conventional register-based OoO machine, interrupts require issue replay, and this is a major source of complexity and a major drain on budget.
On the Mill, which is a "belt machine", interrupts are as far as the hardware is concerned just an involuntary function call; the Mill does result replay.
Regarding mispredicts, the Mill has a shockingly short pipeline, and a mispredict (where the target is in cache) costs only around 5 cycles or so. The Mill has a novel prediction mechanism called transfer prediction, as covered in this talk: http://millcomputing.com/topic/prediction/
Any thoughts about debugger support on the Mill? It seems to me that this could be a bit of a headache. For example, the usual technique for breakpoints is to trigger an interrupt at the desired instruction, which then causes a context switch to the debugger process. How will the debugger be able to access all the bypass-network/belt/spiller data?
My off-the-cuff thought is that since the spiller has a mechanism for dumping its state in memory anyway, there is probably (or should be) a way for software to force it to dump this state, which could be used by the debugger to read all the required data from a specified (probably model-specific) structure in memory.
He says that they are going to implement all instructions that are not supported by the hardware[1].
I wonder if they will use automatic synthesis of code for each instruction, and if so, how they are going to ensure good performance. Maybe superoptimization?
Another question I have is about their stance on free firmware. Will the Mill family members require blobs?
I recall somebody asking Ivan about that in a previous talk, and he pointed out that Intel is a much larger company than ARM, so they'd rather manufacture and sell chips, and fall back on licensing the architecture if Plan A didn't work out.
Any questions, ask away, happy to explain or dodge as appropriate ;)