This article is based on a faulty premise. The A10 is still far from the performance of recent Intel CPUs (a quick browse of Geekbench shows roughly 2x the single-core and 4x the multi-core performance for Intel). Apple is closing in on the limits of Moore's law so quickly not because of a different model of computation, but because those gains had not yet been realised for mobile CPUs. As the performance gap narrows, it is likely that the year-on-year improvements in Apple's processors will slow down too.
Which is not to take away from the achievements of the A10 design team; considering performance per watt, this chip is incredible.
Yes, an A10 core is only 60% as fast as a single Intel core. But the Intel part has a 91 W TDP.
IMO 60% as fast is not "far away". Clearly, if Apple wanted to spend more power and/or transistors, they could improve performance further.
Apple is so competitive in that space that Intel is actually in retreat from mobile.
Sure, Intel is much faster in multicore, but that's because it has much more power to work with. Intel has tried to scale decent performance down to lower power levels; it hasn't worked out particularly well so far.
The more interesting question is whether Apple will have better luck scaling 64-bit ARM up than Intel has had scaling x86 down.
I don't disagree. What I found interesting is the bigger picture of hardware.
By which I mean that 'performance' is a slippery eel of a concept. The A10 fits in my pocket and the Xeon Phi does not...or rather it doesn't in a way that provides me with useful computations...and the latest i7 doesn't hold a candle to the GPU in my $40 graphics card when I want to rotate the 16 million pixels my camera stuffs into a RAW file...and if I want to build a Kubernetes cluster, I can throw Raspberry Pi's at the problem.
the latest i7 doesn't hold a candle to the GPU in my $40 graphics card
I think I understand your point (unless you meant iPhone 7?), but interestingly, recent Intel processors have impressively powerful graphics processors built in: https://software.intel.com/sites/default/files/managed/c5/9a.... So if you were using the full capacity of that i7, I think it would beat your graphics card handily. The issue is that practically no one is writing code to use the full capabilities of these modern CPUs. A friend did a write-up here:
http://lemire.me/blog/2016/09/19/the-rise-of-dark-circuits/
Even without the built-in GPU, I'd bet that the right software running on that i7 would blow away that $40 graphics card. I think the problem is that we don't really have the right tools for writing low-level multi-core software. Image rotation parallelizes and vectorizes really well, and the x64 side of those processors has excellent vector capabilities, but it still requires hand coding to get top performance.
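Something like this hand-rolled sketch (not the article's or Darktable's actual code) is what I mean: a 90-degree rotation is just an index remap and every output row is independent, so it spreads across cores trivially; real code also tiles for cache, and arbitrary-angle rotation adds interpolation but has the same shape.

    #include <stdint.h>

    /* src is w x h pixels, dst is h x w; build with -fopenmp so the
       pragma spreads the outer loop across cores */
    void rotate90_cw(const uint32_t *src, uint32_t *dst, int w, int h)
    {
        #pragma omp parallel for
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                dst[x * h + (h - 1 - y)] = src[y * w + x];
    }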
if I want to build a Kubernetes cluster, I can throw Raspberry Pi's at the problem
A thought experiment: how fast a cluster could you make from a few dozen iPhone 7s connected wirelessly? The processors are surprisingly fast, and I think they support 802.11ac at gigabit-plus speeds. Could you do distributed computing on an ad-hoc network of iPhones that happen to be nearby? An app with a sandboxed work queue that accepts local connections? There are lots of reasons it makes no practical sense, but it would be quite a demo.
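For the back-of-the-napkin version of that app: each phone could run a tiny work-queue server over plain POSIX sockets (which iOS does expose). A toy sketch, with discovery, scheduling and sandboxing all hand-waved and the port number picked arbitrarily:

    #include <stdint.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9999);           /* arbitrary demo port */
        bind(srv, (struct sockaddr *)&addr, sizeof addr);
        listen(srv, 4);

        for (;;) {
            int c = accept(srv, NULL, NULL);
            uint64_t range[2];                 /* toy task: sum [start, end) */
            if (read(c, range, sizeof range) == sizeof range) {
                uint64_t sum = 0;
                for (uint64_t i = range[0]; i < range[1]; i++)
                    sum += i;
                write(c, &sum, sizeof sum);    /* result back to the scheduler */
            }
            close(c);
        }
    }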
An Intel i7-6700 will do about 200 single precision GFLOPS[1] and costs about $300 [2]. An Nvidia GeForce 710GT will do about 350 single precision GFLOPS [3] and costs about $40 [4]. A pixel pipeline is one of those 'embarrassingly parallel' workloads and the software I use, Darktable [5] is tuned to take advantage of GPU parallelism...the tools are there in no small part due to the gaming industry.
Thinking about an iPhone cluster, the hurdle seems to be software that is designed to create friction against implementing such a thing: drivers and firmware in particular.
An Intel i7-6700 will do about 200 single precision GFLOPS[1] and costs about $300 [2]. An Nvidia GeForce 710GT will do about 350 single precision GFLOPS [3] and costs about $40 [4].
You are right, and I stand (mostly) corrected. Put another way, on Skylake with unrolled 256-bit FMA you can do close to 16 single-precision floating-point operations per cycle per (physical) core: two vector loads, one vector store, and the 8x32-bit FMA itself.
At 4 GHz that's about 64 Gflop/s per core, or roughly 250 Gflop/s for 4 cores, the same ballpark as the $40 graphics card you pointed to. And realistically, loop overhead will knock you down another 10%, and if you run for any length of time you'll probably be thermally throttled back to something closer to the 200 Gflop/s you cite.
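For the curious, the loop shape being described is basically a single-precision a*x + y. A minimal sketch; built with something like gcc -O3 -march=skylake -fopenmp, the compiler turns it into unrolled 256-bit FMAs:

    #include <stddef.h>

    /* Per element: two vector loads (x and y), one 8-wide FMA, one
       vector store -- that mix is where the ~16 flops/cycle/core
       estimate above comes from. */
    void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
    {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }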
My general question about why no one is interested in using the built-in Tflop/s capable side of the Skylake die stands, but I was wrong to think that a well-tuned desktop processor could come anywhere close to the price/performance of a cheap graphics card for raw flop/s. Thanks for pointing out the real numbers.
The Iris Pro 560 is about as fast as an ultra-low-end graphics card. It's simply memory starved, so it might do OK on some benchmarks, but in the real world it's a ~2010 graphics card.
Sure, as a standalone graphics card for gaming it's low end.
But as an accelerator to a CPU, it seems like a phenomenally underutilized resource. It has a direct connection to RAM, a 128MB cache, and the high tier can do over a TFlop/s[1]. For a use case like the one Ben mentioned, rotating a RAW file that fits in that cache, it's almost a perfect fit, yet almost no one would think of using it, so it stays dark.
Why is this? Not too many years ago it would have been thought insane to have a TFlop of unused capability on die, and now it's a non-story. I don't think it's because we have no use for the speed-up. Rather I think that the tooling just isn't there for most programmers to be able to make use of it.
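To be fair, the hardware is at least visible to standard OpenCL; finding the integrated GPU takes only a few lines (a minimal sketch, assuming an OpenCL runtime such as Intel's is installed, linked with -lOpenCL). It's everything after this point, the kernels, the memory juggling, the tuning, where the tooling gets thin:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id plats[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, plats, &nplat);

        for (cl_uint p = 0; p < nplat; p++) {
            cl_device_id devs[8];
            cl_uint ndev = 0;
            if (clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_GPU,
                               8, devs, &ndev) != CL_SUCCESS)
                continue;
            for (cl_uint d = 0; d < ndev; d++) {
                char name[256];
                clGetDeviceInfo(devs[d], CL_DEVICE_NAME,
                                sizeof name, name, NULL);
                printf("GPU: %s\n", name);     /* e.g. the Iris part */
            }
        }
        return 0;
    }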
Well, there are a bunch of problems:
1) If you are going to completely rewrite your code for an accelerator, you are likely going to use CUDA to target a MUCH better accelerator.
2) If you don't, you are likely going to ignore the decent Intel GPUs, because an astonishingly small number of Intel SKUs (let alone non-Intel) have a decent GPU. Try to find a regular product/laptop with an i5-5675C or i7-5775C in it, for instance.
3) CUDA was available first and seems to have the lion's share of the mindshare. OpenCL is a distant second.
4) Even if you have the right SKU, devices with that SKU are often not designed for heavy GPU use and will throttle heavily under thermal load.
I've played this game hoping for a Linux laptop with a non-crappy GPU, and I didn't want to play games with Nvidia feeding pixels through the Intel GPU to get to the screen. Very few laptops have the Iris chips. The Lenovo 721s has the Iris 540, but they locked Linux out with their broken BIOS that disables AHCI. The XPS 13 has it as an option, but you have to get the i7 and the 3200x1800 shiny/reflective screen that halves your battery life.
I did manage to buy an Iris 540 NUC; nice little machine, and the GPU seems to keep up quite well for the light usage I've tried: most web games, full-screen 1080p movies, and Minecraft.
GPUs and i7 CPUs have radically different design philosophies. The i7 has great per-core performance and giant caches, while the GPU has lots of (relatively) slow cores with tiny caches and, effectively, only SIMD instructions.
For simple, highly parallel floating-point number crunching, nearly any GPU will blow your i7 out of the water, no matter how you program the i7. Only a tiny fraction of the i7's die area is useful for that task, whereas GPUs are designed for exactly that kind of work. Conversely, any GPU will be absolutely terrible at running Microsoft Word.
Not clustering, but Apple has hinted that they're exploiting (at least when they're plugged in) the computational power of hundreds of millions of phones to do work like training image recognition algorithms.
This perspective fits into the top-down evolution narrative: Write highly specialised solutions in the most general way in software until essential, generic parts are discovered and stable enough to be translated into transistor logic. Repeat.
Or like the Bitcoin "evolution": CPU -> GPU -> FPGA -> ASIC (a very simple, single-purpose optimisation, but it illustrates part of the picture).
Only optimising transistor speed, transistor size, or the whole CPU/GPU package obviously has its limits and may be a dead end.
Dead end is far too negative. There is nothing wrong with improving general purpose computing. It's the very best thing you can do - when you can. Just be ready to adapt (by going for specialization) when you can't.
Huh, not where I thought this article was going from the title. I thought it'd be about software bloat eating into the gains given by hardware (which, in hindsight, is exactly backwards from the title).
One important aspect of Apple's position is that it had a new set of limitations other processor designers weren't dealing with: power consumption and size. Yes, there were small processors and there were processors that didn't use tons of power, but they weren't fast. Apple had a very specific set of needs and, Apple being Apple, was able to completely customize its design and supply chain.
My speculation: Intel doesn't want to threaten its core business in the PC market (which is heavily dependent on backwards compatibility), and the experience with Itanium in the HPC market probably left a bad taste in their mouth.
Again, that is how it was back in the day: CPU and supporting hardware, OS, compiler, standard libs etc. all co-engineered. Though in the case of the Amiga the CPU was part of the supporting hardware; the heart of the system was something else that farmed work out to the CPU or a co-pro as appropriate...
Intel seems to be captured in its own x86 world. There have been no big changes to the instruction set for ages; even the x64 extensions were designed by AMD. So while Intel still excels at manufacturing, the few projects for breaking out of their box failed (Itanium, Larrabee). And also: why are there no Intel chips yet that include all the nice things USB-C can offer, e.g. Thunderbolt and the newest DisplayPort?
That seems more like a vague rumour than anything, at least until someone leaks more info. (That it hasn't occurred makes me somewhat doubtful. Maybe the various undocumented instructions which have been found over the years are part of this...)
This approach was common in mainframes - you would buy a box that contained everything, but only the parts that you'd paid for would be accessible. If you wanted more, you could just apply an unlock code that you'd bought, and new features would come online, that were lying dormant all along. Even up to the level of more CPUs.
To add to the other comments, Intel has already integrated an FPGA directly into some of its Xeon server CPU packages to allow for custom logic in hardware:
One thing I find disconcerting lately is that all of Intel's recent innovations appear to be on items that are very hard to buy as a solitary developer (Knights Landing, Optane, FPGAs).
>Intel seems to be captured in its own x86 world.
Eh, no.
>even the x64 extensions were designed by AMD.
AMD (not working solo) published the spec in 2000. AMD, Intel, VIA and a few others each have a pretty different implementation of that spec; Intel's x64 extension is drastically different from AMD's implementation.
Intel has been adding new extensions with virtually every generation; just look at how many generations of virtualization support extensions we've had.
>the few projects for breaking out of their box failed (Itanium, Larrabee).
Itanium died because the industry didn't want to take a "RISC" (;)).
Larrabee isn't dead; in fact it is very much alive, just inside your CPU: the AVX extension set and the silicon that supports it are Larrabee's vector processing units.
Intel more or less had the insight to see that Larrabee wouldn't really go anywhere and that they could shrink it and implement it in every CPU within a few years.
Larrabee was interesting but it was neither here nor there. Intel went on to implement the good parts of it in their own CPUs, and for HPC/GPGPU-like computing it went a different way by releasing the Xeon Phi.
>And also: why are there no Intel chips yet that include all the nice things USB-C can offer, e.g. Thunderbolt and the newest DisplayPort?
This just shows your utter lack of understanding of the technology landscape.
I was tempted to cynically ask you about what version of Thunderbolt you have on your non-Intel machine to see if you fall for it, but alas I don't want to spread misinformation.
USB-C doesn't bring Thunderbolt nor DisplayPort to the market; USB-C can be mechanically compatible with those two.
Thunderbolt is Intel's proprietary technology and it's up to the system integrator to decide whether to implement it; don't buy cheap-ass laptops and you'll get the latest Thunderbolt.
As for DisplayPort, could you please tell me what the "latest" is? If it is 1.4, there isn't a single desktop GPU that "technically" supports it yet; even the GTX 1080 is only certified for 1.2 (in theory it can support 1.3/1.4, but the ink on the specs isn't dry yet and the certification process is, well, meh).
Thunderbolt 3 is the only current interface that is actually certified for DP 1.3; it should probably support DP 1.4, but that spec was only finalized in March this year, so...
The performance didn't suck that much, not for the RISC part.
The problem was that too many IA-64 applications relied on the x86 CPU emulation, which was, well, very shitty.
The Itanium is a VLIW, not a RISC (although they have some similarities). It executes "bundles" of 3 instructions in parallel. It was excellent in certain benchmarks because the compiler could schedule sequences to make full use of this; but it turns out that general-purpose code isn't really so parallel all the time, so the compiler would have to fill 2 or 3 of the slots in each bundle with NOPs, wasting space (and thus cache usage and fetch bandwidth) and leaving much of the CPU's execution units idle.
It was great at highly parallel benchmarks, but much slower than contemporary x86 (the P4, which wasn't that great either) for everything else.
As for the performance, I don't have enough experience to pinpoint exactly all the problems, but I think it really ended up being a software issue.
Most complaints I've encountered were about the "emulation" part with the dynamic translation libraries and instruction set emulation.
Overall it seems to me like they just tried to tackle too many things: keep some compatibility with x86, take on SPARC/PowerPC, and do HPC, mainframes and tons of other applications at the same time.
That said, this is still a mid-90s ISA for a very niche market; with all these issues it's surprising it lasted so long. I never understood why HP kept paying Intel to produce these chips in the first place.
The performance for native code wasn't bad but it was never the game-changer the early marketing pitch predicted and that was especially true compared with the competition after adjusting for price or power consumption.
I used to work with a group of scientists who wanted as much performance as they could eke out. They had a large C codebase which had already been ported to every major architecture, and it did a ton of floating point math along with a fair amount of integer work, etc.: basically the best-case scenario for IA-64. Unfortunately, even with Intel's compiler and tuning tools, the performance just never panned out: the best case was that if someone diverted years of staff time into tuning, IA-64 might take the performance crown, but that would be an expensive gamble and would probably lock us into a single server vendor (HP). It was much safer to simply take that time and money and buy x86-64 boxes which were faster and available from many vendors. (The same story repeated later with Cell: neat possibilities, but nowhere near enough benefit to justify the risk of locking into a single vendor architecture with an unclear future.)
Ignoring the architectural debates, Intel made two fatal implementation mistakes with IA-64: the most obvious one was never finding a deadline they couldn't miss but a less obvious problem was failing to take software, and particularly open-source software, seriously enough. IA-64 performance critically depended on using Intel's proprietary compilers. Since those were extremely expensive, almost nobody used them and thus they had compatibility issues with many large Unix or Windows codebases, and the focus was clearly on the optimizations needed to deliver good SPEC benchmark results rather than other features.
I don't know how much difference it would have made in the end, but I think having a solid open-source or at least free toolchain would have made it less expensive to try the platform. It's very easy to imagine that it would have gone considerably differently had something like LLVM existed back then and had Intel simply contributed a high-quality backend, not to mention running a broad shell-server program for open-source projects to use for porting and optimizing.
Itanium came out of a spec that HP initially designed, so that most likely had a lot to do with it being locked to a single vendor (I think SGI played with Itanium workstations, but my memory fails me at this point).
I'm not defending IA-64; overall it had issues, but it wasn't as bad as some people claim ("OMG INTEL CAN'T DO X64 THEY SUCK").
Intel also took quite a long time to bring Itanium CPUs to the forefront of Intel's own technology; Itanium only got DDR3 and QPI in 2010.
By the time Intel brought all the good things to Itanium it was pretty much already dead in the water and on life support. But that doesn't really bear on the viability of IA-64 as an architecture; it bears more on the viability of Itanium as a product line.
Part of the problem was that Intel spent the late 90s hyping Itanium and VLIW as the future of computing. The idea wasn't obviously wrong – the ancestor (PA-RISC) was competitive at the time if expensive – but the heavy sales-pitch set them up for failure if they couldn't deliver and left both early vendors and purchasers feeling betrayed. Failing to hit volume cascaded to the later feature failures you mentioned since they never managed to catch up to the x86 world.
As far as the architecture goes, the main conclusion I draw is that VLIW was a plausible idea but failed because it overestimated the amount of instruction-level parallelism and perhaps because it assumed too much of compilers. The GPU world has massive parallelism, enormous market volume, and the toolchains are more advanced, but even there AMD went from the VLIW-style TeraScale design to a RISC-style Graphics Core Next to compete with nVidia's RISC-y Tesla. It definitely seems reasonable to ask whether VLIW is an evolutionary dead-end.
SGI fell for the Itanium hype hook, line and sinker, and announced they would be discontinuing the MIPS line with the R12000 (an excellent CPU for its day) and migrating everything to Windows running on Itanium... But Itanium was late, MIPS had to rush the R14000 out and it wasn't so competitive, IRIX had had no serious work done on it in years, and customers were thinking, well, if we are going to have to switch architectures anyway, why don't we shop around... And people stopped buying SGIs.
The manager behind this decision, Rick Belluzzo, jumped ship and got a sweet job at Microsoft.
Everything promised about the Itanium relied on a "sufficiently smart compiler" which proved harder to write than anyone expected. It's being kept alive now solely because of legal agreements (e.g. see the recent HP vs Oracle case). Technologically it's a dead end, I don't think too many people dispute that in 2016. It's just a shame that it killed SGI.
The AMD Opteron introduced x64 while Intel was still going full steam on the P4 architecture and trying to push Itanium. Only some years later did Intel license the x64 instruction set; of course, on chip they then did their own implementation of it.
Itanium failed because, first, it was delayed several years and got very expensive and power hungry. For a while it had quite good performance, but due to the price it gained only a small foothold in the very expensive workstation market. The final nail in the coffin was that the Opteron really launched the x86-based server market: with the Opteron, companies would for the first time replace SPARC etc. in mid-range servers, and when Intel also offered x64 chips, the RISC server market started to fade.
I am not buying cheap-ass laptops. But a lot of Apple laptops are held back by having to include extra chips to support Thunderbolt, while this feature is promised to be built into the CPU in one of the next Intel generations. Had the plain MacBook had Thunderbolt via USB-C, it would be more useful.
The "AMD 64" spec was openly published in 2000, Intel did not license anything.
Intel was planning to license the AMD64 ISA directly from AMD for Yamhill; however, that did not happen and they ended up implementing the spec on their own.
Intel's x86-64 ISA is actually quite different from AMD's implementation, and funnily enough neither of them is 100% in line with the original spec.
The spec was published because it was an extension of the x86 instruction set, not a replacement; the x86 instruction set itself is available to AMD through the cross-license deal they have with Intel.
>I am not buying cheap-ass laptops. But a lot of Apple laptops are held back by having to include extra chips to support Thunderbolt, while this feature is promised to be built into the CPU in one of the next Intel generations. Had the plain MacBook had Thunderbolt via USB-C, it would be more useful.
Thunderbolt will always require additional chips (just like DisplayPort, USB, and any other port on your computer); that is how the spec works, it was never designed to be implemented on-die in the CPU.
USB-C has nothing to do with Thunderbolt and vice versa; there is a mechanical compatibility standard that allows you to use the same physical port for Thunderbolt and USB via the Type-C connector.
This isn't any different from Thunderbolt using the Mini DisplayPort mechanical standard.
Thunderbolt has multiple mechanical standards, including a dedicated TB electrical port, a fiber-optic port, the DP electrical port, and the USB-C electrical port.
Nothing Intel is doing is holding Apple back; the reason Thunderbolt 3 isn't available on Apple laptops yet is that Apple is 2-3 generations behind: the mid-2015 MacBook Pro is using a 4th-gen Haswell CPU.
It's worth considering, although I don't think an architecture requires backward compatibility by definition; that's more a way to keep users or earnings reliably. You'd be saying that Mac OS 9 wasn't an architecture since Mac OS X was different, or that the PS2 -> PS3 mismatch meant they weren't architectural designs. The Android API isn't backward compatible with GNOME or KDE on Linux. I'm just not seeing how a lack of backward compatibility means there's no architecture or platform.
If it did, though, we might resolve the issue with a standard API for various things where the implementation can be ignored (and improved with SW or HW). A modern example of an Amiga-like architecture might be Microsoft's DirectX technologies, where there are APIs for networking, I/O, audio, and video. Any of those could have been accelerated with dedicated coprocessors while apps stayed the same; at least one or two were in practice.
Likewise, the programming model might obviate the need for this consideration: one where work is broken into tasks called as needed by control code, with optional acceleration, might mean only the control code or OS interface needs to change per release or platform.
I think one of the things this article gets wrong is the software used to design ASICs (custom chips). Synopsys, Cadence, Mentor, and Magma (when it was around) all made pretty good tools. The thing is, they weren't free. Also you had to use Tcl, but they could take RTL (Verilog, the design language) to GDS (what gets fabbed). Heck, there always seemed to be some startup claiming it had come up with a better routing tool or a better timing tool. They weren't super complicated to use. If you had too many gates you had to script a bunch of hacks, but it isn't too hard.
Simple summary, you can use transistors in your chips to speed up software. That comes at the cost of flexibility of course but I think it was one of the secrets of the iPad's initial success.
That said, I think the ability to make inexpensive custom chips is going to power a wave of new hardware gizmos. Unfortunately all of that capacity is in China, so if you don't read Chinese datasheets you're probably not going to be able to use those chips. (Not that autotranslation is "bad", just that it doesn't always make technical documentation actionably understandable.)
Why is it underwhelming? What do people do with phones?
I exchange text, sound, and pictures with people. Sometimes I play stupid games. The sound/pictures/text sharing long ago hit a limit where there's a big question as to why a person needs a faster processor. Better performance per watt helps, sure. But what else? What besides bloat requires a Moore's law increase in a phone processor?
What do I want to do with my phone that I don't yet know I want?
> What besides bloat requires a Moore's law increase in a phone processor?
Here are some relatively boring answers:
AI stuff - including, among other things, voice recognition, other parts of virtual assistants, photo capture (which involves a lot of processing in modern smartphones to improve the image), and photo auto-tagging - is always happy to guzzle extra cycles to improve accuracy a few percent. Some of this can be (and is) done on the cloud instead, but at the cost of latency, availability (when reception is poor), and, of course, privacy. Much better to do it on-device where possible.
Games, even short of VR/AR (which is definitely a "I don't yet know I want" candidate), similarly often push the limits of the GPU to achieve graphical fidelity. True, the kinds of simple, repetitive, low production value games that tend to grace the top charts of mobile app stores today have little need for increased fidelity. But I think that phenomenon is more about market dynamics than inherent to the form factor. Dedicated portable consoles like the 3DS have many excellent games that would so benefit, and while some depend on more precise controls than available on mobile (i.e. buttons), many do not, e.g. RPGs, and of course smartphones have their own unique input methods. (Given that the 3DS has atrocious graphics, games clearly don't need good graphics to be fun, but they can still benefit from them, as shown by home consoles/PC gaming - anyway, larger and higher resolution screens increase the baseline level of detail required for acceptability.)
It's perhaps unsurprising that both of these examples tend to rely on the GPU and other specialized processors more than the CPU.
For this use case, phones are kinda powerful enough.
Not quite, though: applying photo filters (even just HDR+ to improve the picture) can take a lot of time. I am not sure whether Prisma's slowness is due to mobile SoCs or would be helped by a better implementation, though (Google Photos and Snapseed rely heavily on RenderScript and have been able to deliver very fast filters that way).
HDR+ is still slow though... and I don't think it is unrealistic to ask to see the resulting picture in less than one second.
So better SoCs would help there.
Also, for motion design, fluidity is very important: if you miss frames, the illusion is shattered (at 60 Hz you only have about 16.7 ms per frame). And that also matters for the average user: having things move into place helps in understanding what is going on. When you delete a mail (or get a new one), having the list of emails move into place instead of blinking from one state to the other helps the user understand the event. Same thing for screen transitions.
Android is not quite there yet (and to be fair neither is iOS; it is even surprisingly worse in many regards). Again, it is not entirely clear how much of this is due to software (not Java, but the OS debuted as a camera OS, then as a BlackBerry competitor, and only after that as a modern mobile OS), but surely better single-core performance would help, and Apple wrecks Android in that specific area.
> What do I want to do with my phone that I don't yet know I want?
Untethered VR and AR. Rendering views in stereo for an extremely pixel-dense screen (>4K) at the highest frequency possible (>90Hz) while processing camera feeds and depth sensors for positional tracking (>90Hz).
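Back-of-the-envelope, taking a single 4K panel at 90 Hz as the floor:

    3840 x 2160 pixels x 90 Hz ~ 750 million shaded pixels per second

and that's before the per-eye geometry passes, the camera feeds, and the tracking math, all on a battery with no fan.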
We're kind of done with this now though, right? I can pick a mobile device with anything from a tiny screen of a few inches to a monster the size of a largish book. I can get pixel density finer than my eyes are able to detect. There's not much room left for improvement.
And likewise not much else has improved since iOS/Android were released. What I mean by "improved" is real differences in the kinds of things you can do.
Windows 95 is 21 years old; my first computer ran it. It's been over two decades and I don't do anything with a shiny new MacBook Pro that I didn't do with Windows 95. I browse the web, read news and comments, watch videos, exchange text with people. Sure, the video resolution is better, but that has a lot more to do with the network than with the computer hardware. The UI is a little shinier, but I'm not enabled to do anything new that I couldn't do before.
If I don't buy a new computer every few years, what I have will be nearly worthless because software has a funny habit of finding ways to spend all available resources, even if it does most of the same things. I just don't think I would be that disappointed if I was still using Win 95.
> With greater capability comes more, even if we don't yet know what "more" will be.
If you discount network speed and pixel density, "more" hasn't amounted to much in over 20 years as far as I can see. Miniaturization means I can carry a laptop around with me, OK.
Likewise with phones. Android and iOS are about 8 years old now. What can we do that we couldn't then? _not much_ as far as I can see.
We're definitely past the point of diminishing returns when it comes to resolution, and near or already at the point where your eye physically couldn't tell if the pixel density were higher. So what's left?
Mobile devices are ideal candidates to be thin clients, because they are battery powered.
If most instances of greater capability can be implemented as a thin client, then it's not CPU performance that is the bottleneck, but mobile ISPs and aged infrastructure.
Yea, my $100 low-budget phone is fast enough. I can use it as satnav, watch 1080p videos, browse the internet, and so on. I can even play some 3D games and things like that.
I'd like the battery to last longer. 14 days - like my dumb phone. That would be a meaningful improvement.
How bad is debugging logic running purely on hardware nowadays? Is there anything more user-friendly than tapping specific lines and watching the output on an oscilloscope?
Usually you start at the Verilog level with simulators (I've done this professionally with zero hardware experience and discovered many critical bugs in a chip that's now taped out; of course the real engineers had to fix them). Once the chip goes physical, after a series of very low-level tests, they take similar software, adapt it to the hardware interface, and repeat all the tests.
Yes, you can debug on an FPGA first, for example, allowing you to put the hardware-equivalent of a printf statement in places where you need them.
Or, you can configure all of your flipflops into a barrel shift register, so you can read out their contents serially (and shift their contents back in circularly).
You're correct; honestly it seemed like Sony was expecting ALL the graphics work to be done on the SPUs. The whole design of the Cell Broadband Engine is basically one SMT PowerPC core (the PPU) and a bunch of stream processors (the SPUs). Given their choice of an extremely weak GPU compared to the Xenos in the 360, a lot of games had to move a bunch of work off the GPU and onto the SPUs instead (which were nowhere near as nice to work with: they couldn't just pull data from system memory without being handed a pointer translated to an address they could DMA from, and you couldn't just throw GL or RSX commands at them, so you were stuck writing all the nitty-gritty yourself).
Possibly because developing games for the weird PS3 Cell architecture turned out to be so complicated, particularly for games that were intended to be cross-platform with the x86 Xbox and PC platforms.
TL;DR: Moore's law can no longer be counted on for performance gains, so speeding things up will now depend on replacing general-purpose hardware with hardware specifically designed to implement specific algorithms.
Of course, work on custom chips has been going on for as long as chips have been a thing. The article just underlines that this is now pretty much the only way forward.
That, and optimizing the software we use. There is so much bloatware, so much technical debt collected through incremental improvements over decades, that we could get a decent speed boost just from optimizing it.
>Aren't optical transistors another option though?
It will take a minimum of a couple more decades of research and hundreds of billions of dollars of retooling before any of the potential replacements for silicon surpass what exists today at the current price and volume of mass production.
Optical switches are extremely large compared to contemporary transistors, and there's no obvious way to scale them smaller than the diffraction limit.
Like all research into the unknown, it might prove easy (5-10 years), it might prove hard (20 years), it might always be 50 years away (hehe, fusion), or it might just not be possible (faster-than-light travel).