Cerebras’s giant chip will smash deep learning’s speed barrier

geomark · on Jan 3, 2020

The article talks about a few things that they call inventions, like making interconnections across what would normally be scribe lines. But I personally worked on wafer scale integration about 25 years ago and we were already doing that. We called it inter-reticle stitching. The technology was ancient back then - 0.5 micron feature size on 4 inch wafers - but the wafer scale techniques are applicable to modern technologies. In particular, developing a yield model that informs your on-chip redundancy choices and designing built-in self test and selection circuitry so that you can yield large chips. The chip we developed was so large that only two would fit on a wafer. We got 50% yield on a line that was far from mature at the time. The company lacked the vision to do anything with what they had developed. To them it was just a chip for which there were few customers. The suits didn't know how to make bank with this methodology that could yield nearly arbitrarily complex chips in nearly any target process.

Edit: There were a number of papers and conference proceedings published back then but not much shows up when searching Google. Here's one discussing the issues and results of field stitching https://fdocuments.in/document/ieee-comput-soc-press-1992-in...

From 1992, so yeah, field stitching is not a recent invention.

DaniFong · on Jan 3, 2020

Great post, but I would like to add that the critical question for whether an invention because a useful innovation is usually not "is this novel" but rather "is there a currently viable project here with people who care about the thing and genuine motivation and persistence and adequate resources."

In other words, "how is this effort new to the universe?"

I would say it's certainly at a different scale and a different time. And we should be super thankful that the commercial interest is such that we can try out new chip designs in a different domain now; you can really imagine a rethink for the kinds of things that are possible once you're really at scale here.

mark_l_watson · on Jan 2, 2020

I don’t know if this mega-chip will be successful, but I like the idea. Before I retired I managed a deep learning team that had a very cool internal product for running distributed TensorFlow. Now in retirement I get by with a single 1070 GPU for experiments - not bad but having something much cheaper, much more memory, and much faster would help so much.

I tend to be optimistic, so take my prediction with a grain of salt: I bet within 7 or 8 years there will be an inexpensive device that will blow away what we have now. There are so many applications for much larger end to end models that will but pressure on the market for something much better than what we have now. BTW, the ability to efficiently run models on my new iPhone 11 Pro is impressive and I have to wonder if the market for super fast hardware for training models might match the smartphone market. For this to happen, we need a deep learning rules the world shift. BTW, off topic, but I don’t think deep learning gets us to AGI.

corporate_shi11 · on Jan 2, 2020

It's also my impression - from my modest exposure to DL over the past two years as a student taking courses - that deep learning must be overcome to reach AGI.

Specifically gradient descent is a post hoc approach to network tuning, while human neural connections are reinforced simultaneously as they fire together. The post hoc approach restricts the scope of the latent representations a network learns because such representations must serve a specific purpose (descending the gradient), while the human mind works by generating representations spontaneously at multiple levels of abstraction without any specific or immediate purpose in mind.

I believe the brain's ability to spontaneously generate latent representations capable of interacting with one another in a shared latent space is functionally enabled by the paradigm of neurons 'firing and wiring' together. I also believe it is the brain's ability to spontaneously generate hierarchically abstract representations in a shared space that is the key to AGI. We must therefore move away from gradient descent.

mantap · on Jan 3, 2020

Don't forget the human brain takes about 7 to 8 hours off every day to rejiggle itself, to use a scientific term. The brain's architecture is better than having a training stage but it's by no means able to continually learn without stops and starts.

rckoepke · on Jan 3, 2020

You see this in young puppies (3-6 month old) a lot as well. They get irritable/exhausted after 15-30 minutes of training, and usually dont seem to learn anything at all during the training activity itself. Then they pass out ("nap") for 30 minutes and when they wake up they do the trick/skill perfectly.

Same thing as humans, just more obvious/visible.

jamesblonde · on Jan 3, 2020

Commodity deep learning might be a lot closer than you think. Nvidia won't bring us there (without kicking and screaming), but AMD might. You can pick up a Radeon VII for about 600 dollars and use it in your data center without licensing issues (16 GB, about the training speed of a 2080Ti for ImageNet ConvNets). AMD use ROCm (now fully upstreamed into TensorFlow instead of Cuda - https://www.logicalclocks.com/blog/welcoming-amd-rocm-to-hop... ). Disclaimer: I worked on getting ROCm into YARN for Hopsworks.

dsign · on Jan 3, 2020

Deep learning is not going to get us to AGI. But the hardware techniques definitely are going to get us a bit closer.

I did the numbers a while ago and honestly I don't think we need smaller transistors to get the computation volume of our mushy brains -- although ofc, more and smaller transistors is always very nice. I believe the only thing stopping AGI at this point is architecture -- we really have no idea how to connect and structure something as complex as our brains -- and cognitive maturity. The last part is my way of saying "two weeks for training a NN? Wait until you have a kid and have to work on training the little human for decades....".

TBH, the ethical implications of AGI seem insurmountable to me. Life is a game --meaning the universe doesn't care about us, nor we owe anything to it--, and for now, it's our game. So, I would rather we put all that computing to improving human life -- including mind upload,-- and put AGI right there with nuclear weapons.

GaryNumanVevo · on Jan 2, 2020

Cerebras is a reaction to the recent Deep Learning trend. Larger networks, supposedly better performance. As someone doing distributed training regularly, I've seen some super inefficient models that take 3x more resources (time / compute / bandwidth) for a 2% bump. I think we'll see a big wave in NN optimization in the near future.

01100011 · on Jan 3, 2020

Artificial 'neurons' used in deep learning networks are absolutely worlds apart from real biological neurons. I don't think anyone in the field seriously believes we will get to AGI via DL or our current models.

innagadadavida · on Jan 2, 2020

> I have to wonder if the market for super fast hardware for training models might match the smartphone market.

Recently Intel acquired Habana Labs for $2B [1] and Intel could possibly integrate this into upcoming CPUs (Intel sells more CPUs than iPhone for sure). However, this was on the inference side - unlike Cerebras which is making training faster. The most likely products to benefit from this would be Azure or AWS.

1. https://newsroom.intel.com/news-releases/intel-ai-acquisitio...

sanxiyn · on Jan 3, 2020

Habana does both training and inference. Gaudi is for training, Goya is for inference.

MaxBarraclough · on Jan 5, 2020

> I get by with a single 1070 GPU

Amazon/Google/Microsoft will gladly take your money for time on their nVidia GPU instances, but they charge tens of cents per hour.

varelse · on Jan 2, 2020

I am far more excited by the underlying Wafer Scale Integration moonshot than I am by any AI benchmarks here. I know it's trendy to think there can only be one w/r to the AI Iron Throne but nope, not the case, everyone is writing bespoke code in production where the money is made. Well, almost everyone, Amazon seems to be the odd duck but they're a bunch of cheapskate thought leaders anyway (except for their offers to junior engineers in their desperate hail mary attempt to catch up with FAIR and DeepMind, but... I... digress...).

Which is to say that graphs written to run specifically on Cerebras's giant chip will smash deep learning's speed barrier for graphs written to run best on Cerebras's giant chip. And that's great, but it won't be every graph, there is no free lunch. Hear me now, believe me later(tm).

But if we can cut the cost of interconnect by putting a figurative datacenter's worth of processors on a chip, that's genuinely interesting, and it has applications far beyond the multiplies and adds of AI. But be very wary of anyone wielding the term "sparse" for it is a massively overloaded definition and every single one of those definitions is a beautiful and unique snowflake w/r to efficient execution on bespoke HW.

01100011 · on Jan 3, 2020

I just wonder about the reliability of a system that large. Sure, it's mostly used for machine learning where we don't seem to care as much, but what is the average MTBF of a chip this large? How many chips actually make it out of production?

Also, is this something that will likely scale up, or will this style of design hit a wall(power dissipation?) faster than, say, silicon-interconnect fabric?

Time will tell if this is the new path forward or just a curious footnote in the history of semiconductors.

why_only_15 · on Jan 3, 2020

They built the chip specifically so that it can tolerate failures in some of the cores. I wonder if it can do that adaptation only once or if it can automatically detect it and route around it.

foota · on Jan 3, 2020

Isn't that similar to what AMD is doing with infinity fabric? Obviously not at such a large scale.

jamesblonde · on Jan 3, 2020

Infinity fabric is closer to Nvidia's NVLink - much lower interconnect B/W. PCI 4.0 will be interesting as a commodity alternative, particularly when paired with AMD Rome chips with huge numbers of I/O lanes - for distributed training. https://wccftech.com/amd-epyc-rome-zen-2-7nm-server-cpu-162-...

bcatanzaro · on Jan 2, 2020

Reminds me of that other great prediction of a GPU killer from IEEE Spectrum back in 2009:

https://spectrum.ieee.org/computing/software/winner-multicor...

ajtulloch · on Jan 2, 2020

For the folks who are downvoting this comment, the author is absolutely a subject matter expert (and completely correct).

deepnotderp · on Jan 3, 2020

But he also works at nVidia and Larrabee versus the WSE are two entirely different things. Larrabee was an architectural approach to more general purpose parallel hardware whereas the WSE is a more special purpose and physically different than a GPU.

Google234 · on Jan 3, 2020

What did go wrong with Intel's MIC (Xeon Phi) project? I can't find a compressive account of this from HPC people. The idea seemed pretty sound: large die, simpler circuit, and much more parallelism, in the x86 line..

liuliu · on Jan 3, 2020

I vaguely remember that at the dawn of the deep learning (2013 to 2014), there were talks and hopes that Xeon Phi would smash the performance of Nvidia GPUs. However, the sample people got are too late (I believe it is at the end of 2014) and the performance figures are disappointing. It might be just the software was simply not there yet unfortunately. But then the wheels moved forward and everyone started to buy Nvidia chips in their datacenters.

desertrider12 · on Jan 3, 2020

They didn't really go that wrong, Cori and Trinity are still useful machines. But I can say for computational science, programming models have gotten way better in the last few years. Now it's as easy to get a new sparse algorithm to high occupancy on a GPU as it is to scale on a manycore CPU. So GPUs just look better now considering cost, power efficiency and software support.

raphlinus · on Jan 3, 2020

You'll probably find Tom Forsyth's blog on this to be interesting reading: https://tomforsyth1000.github.io/blog.wiki.html#%5B%5BWhy%20...

Zenst · on Jan 2, 2020

A chip that size, imagine the yield. Equally, cooling - has to be water based as a heatsink that size would be on par to a small anvil and the weight factor would be some serious issues. Though unsure as no pictures of it in-play alas and all they say is - "20 kilowatts being consumed by each blew out into the Silicon Valley streets through a hole cut into the wall", which does somewhat beg for a picture as just raises more questions.

Why would they make a chip this big with AMD showing a chiplet design approach is cheaper and more scalable on so many levels. Let alone, yields.

Equally, arms approach to utilising the back of the chip as a power delivery :- https://spectrum.ieee.org/nanoclast/semiconductors/design/ar...

Then a wafer scale chip like this, using that approach, would save so much power. But again, yeilds will be a factor and can imagine this is not the cutting edge process node as you find as nodes mature, the yields improve. So an older node size would have a better yield and be more suitable for such wafer scale chips. But again, no mention of what is used. I have read in the past that it would use Intel's 10nm, but this article mentions TSMC. Another article I read that they used a 16nm node ( https://fuse.wikichip.org/news/3010/a-look-at-cerebras-wafer... ), which as mentioned above about node maturity, understandable.

tedivm · on Jan 2, 2020

I've seen a demo of the machine. It's about 17u in size, with the vast majority (like 15u) of that being for cooling. This was over two years ago so things may have changed.

Right now I'm hosting some DGX's, and only one datacenter in the bay area had the ability to power a full rack of them. Power density is going to be a real issue for the these systems.

Zenst · on Jan 2, 2020

Wow, that really does add some perspective upon the cooling and the aspect about power requirements datacenter wise really does highlight how out-there these type of systems are over the usual rack layouts.

Equally, the cooling capacity of the datacenter comes into play with such systems. Given the power density, the amount of heat being generated would equally be above your normal rack output.

tedivm · on Jan 2, 2020

Yeah- kind of tangental but it also plays along with how datacenters are transitioning from selling space to selling power. It used to be I'd just rent space by the rack or by the U, and then maybe pay extra for the network connection. Now the space itself is pretty cheap, and the network hookups are unbelievably cheap, but datacenters are actually paying attention to power consumption.

In the case of the DGX-1 I've had datacenters tell me I couldn't put more than two in a rack. We ended up finding a datacenter the specialized in them (Colovore, who I can not recommend highly enough)- their power and cooling systems are some of the most impressive I've ever seen.

luma · on Jan 2, 2020

In most cases the cooling capacity is in fact the actual limit you are running up against. Getting more power into a rack is a simple matter of running more cable. Getting more power _out_ of the rack is a much more complicated issue to resolve.

tedivm · on Jan 2, 2020

I think it's a little more complicated than running more cables. Most datacenters have a total capacity they can handle, based on how many connections they have to their local grid (or grids, as datacenter places like Santa Clara have multiple power grids to give datacenter redundancy). You need to make sure your internal power distribution systems can actually handle the amount you want to push through, and you need to ensure that your backup power is actually enough to get you through major outages.

AWS, as an example, tends to only have 20MW to 30MW for each of their datacenters- anything above that they say isn't worth the hassle when they can just open a new datacenter. Power is definitely a limiting factor.

luma · on Jan 2, 2020

Getting more power into a datacenter is a different problem than getting more (already available) power into a rack. I suppose I could have added "if your existing power distribution system can handle the extra power capacity". That includes service entrance, transfer switching, standby and backup power sources, and distribution to the rack level.

The point I'm trying to make is that, all things being equal, it's _much_ easier to handle un-equal power load between individual racks than it is to deal with the cooling side of the equation. Adding more power to a single rack usually just means a few more whips from your distribution. Getting that one extra-hot rack in the aisle to be effectively cooled requires a lot more infrastructure than running some cables.

wbl · on Jan 3, 2020

I'm waiting for the high pressure helium filled datacenter.

Zenst · on Jan 2, 2020

Yes the whole getting more power into a datacenter is much easier to add than the extra cooling capacity to remove that power once it has transitioned into heat. But I'd imagine they would plan and monitor that aspect and may even have redundant cooling systems. But certainly a potential gotcha and one that would soon sort out the bad datacenters when they end up seeing all there hosting overheat and offline.

cepp · on Jan 2, 2020

I think this is also a paradigm problem. Modern chipset advancements are at the crossroads of power vs. cooling. The logical extension of that fight is greater power and cooling requirements in the DC which it is not necessarily equipped to provide by default.

tedivm · on Jan 2, 2020

This is why I thought what Colovore did was pretty smart- they built liquid cooling into all of their racks. They are literally the densest datacenter I've found that actually allows people to colo with them (I'm sure there's plenty of companies who own their own datacenters that might be denser), but even with their systems you'd only be able to fit two of the Cerebras systems in a single rack (and you wouldn't be able to power both up 100% at the same time).

https://www.colovore.com/data-center/

monocasa · on Jan 2, 2020

> Why would they make a chip this big with AMD showing a chiplet design approach is cheaper and more scalable on so many levels. Let alone, yields.

They're taking a radically different approach, and hoping that they'll be able to route around defects, unlike AMD where a defect in the uncore kills the whole chiplet.

tedivm · on Jan 2, 2020

A lot of the people involved in this actually come from AMD, so I imagine they're familiar with the issues AMD ran into.

wolf550e · on Jan 2, 2020

Not 100% of the chip is enabled, they disable defective parts and don't advertise a model that has 100% parts enabled, so they don't need magical zero defect wafers.

Images of the whole computer were published, you can see the massive cooling system: https://www.tomshardware.com/news/worlds-largest-chip-gets-a...

01100011 · on Jan 3, 2020

Did they come up with an architecture which can route around any defect? Probably not. Now, granted, 90% of their chip is probably dedicated to compute, but I'd bet there's some management infrastructure where they absolutely cannot tolerate a defect.

nsteel · on Jan 3, 2020

They'll simply have redundant copies of that logic. And they'll be physically located at areas of the wafer that yield well - some areas are much worse than others and I would imagine they'll make use of that.

Zenst · on Jan 3, 2020

Interesting so on a die, there are area's which are more prone to faults and they are able to factor that into the design?

Though if there are known hotspots, wouldn't that point to the process node inducing them over silicon quality? Or is it a case of silicon production produces known hotspots that are predictable? FWIW, I'm currently learning towards process node over the silicon being the source of hotspots, given what I know about silicon production.

nsteel · on Jan 3, 2020

With normal-sized dies, at the die-level, I've not seen people design around this; other than the more obvious places e.g. the corners (bad power delivery, prone to mechanical issues, normally left vacant) and the middle (gets hotter, also sometimes bad power). However, there are many test structures placed across the die to measure/check variations and also design rules that constrain the relative placements of certain things. That also goes towards increasing yield.

But at the wafer-level, yes.

> wouldn't that point to the process node inducing them over silicon quality?

I don't see why. I would only vaguely guess it's related to the manufacturing process they follow at that particular node. Maybe it's not even directly silicon related but something else.

I'm not convinced it's worthwhile separating out the process node and the silicon quality, they are entwined when looking across a large sample size.

Unfortunately, someone that actually knows why probably isn't allowed to share why.

Accujack · on Jan 2, 2020

>which does somewhat beg for a picture as just raises more questions.

There's a picture in the article.

>Why would they make a chip this big

Did you read the article?

>this article mentions TSMC. Another article I read that they used a 16nm node

Yes, 16nm/TSMC.

Zenst · on Jan 3, 2020

> There's a picture in the article.

Yes - hardly helpful ones as you get a picture of a wafer and a box, not breakdown beyond that - hence had look and found other articles with much more detail upon this that answers the questions I raised in relation to the lack of pictures - like the cooling aspect in which you snipped my quote and removed that lovely thing we call context.

>Did you read the article?

Yes and had you read what I said you would see that the article does not answer the aspects I was asking - see what you did there.

>Yes, 16nm/TSMC

Yes - I found that in another article - which I also linked, you're welcome.

gimmeThaBeet · on Jan 2, 2020

I'm really curious about the benefits of their implementation. It's far beyond my grasp to make any serious criticisms and I don't really want to doubt them, it just seems a pretty radical departure from even the direction of innovation.

The way they paint it sounds like they're putting in redundant cores to account for failure of what seems like what I would call the 'first line' cores, i.e. there's cores that are only used if some primary ones aren't working?

But sort of intuitively that doesn't make a whole lot of sense given the parallel nature. Maybe they are just putting in 101% of specified cores, and if there's a ~1% hopefully uniform-ish core failure rate then it's all gucci?

I guess my question is probably similar to yours, what are you giving up with yield-enhancing redundancy of a behemoth die vs integrating a bunch of confirmed working chiplets together?

phonon · on Jan 2, 2020

The CEO says 1-1.5%.

"Cerebras approached the problem using redundancy by adding extra cores throughout the chip that would be used as backup in the event that an error appeared in that core’s neighborhood on the wafer. “You have to hold only 1%, 1.5% of these guys aside,” Feldman explained to me. Leaving extra cores allows the chip to essentially self-heal, routing around the lithography error and making a whole-wafer silicon chip viable."

https://techcrunch.com/2019/08/19/the-five-technical-challen...

whatshisface · on Jan 2, 2020

The article claims that keeping everything on one die raises interconnect bandwidths and lowers latencies over what would be possible in a conventional supercomputing setup. Connections are made over the silicon that is normally left aside for cutting the chips apart. Apparently that is a special process that they had to collaborate with a partner in order to get working.

frankchn · on Jan 2, 2020

Chiplet designs means that you still have to route signals either onto an interposer or onto a PCB. If you have a silicon interposer you have the same issue of making a really large silicon die. If you route into the PCB, then you may need SerDes depending on what you do and bandwidth will be lower and latency will be higher due to signal integrity issues.

Maybe something like Intel's EMIB technology where they have small interposers along edges of chips rather than having a giant interposer might help here.

Yields are probably fairly good if they design for manufacturing by placing extra cores / wires to route around failures as I am sure they are.

agoodthrowaway · on Jan 2, 2020

The future of these interconnects is to make them optical. Once the interconnects are optical lots of problems get solved. Chips don’t have to be in same enclosure, simplifying cooking etc.

baybal2 · on Jan 2, 2020

I will dissent. Organic interposers are dirt cheap, and nearly as good unless all you want is density.

ivalm · on Jan 2, 2020

> A chip that size, imagine the yield

From discussion at a demo the yield is good, since they are using a large node. Their hardware rerouting also mitigates defects on most chips.

BooneJS · on Jan 2, 2020

Many single-chip processors contain redundancy or ability to route around bad units. Yield isn’t an issue if it has programmable datapaths, even at this scale.

wbhart · on Jan 2, 2020

From the perspective of an outsider, I can't see how a company like this could survive. They claim on the one hand to have done something really amazing and are at the stage where they are looking for customers. Normally, you'd expect them to be touting performance figures to secure such investment. Instead, they've decided to keep the performance secret. And they've managed to find some "expert" who says this is normal.

Does anyone here have expertise in this area? Is this the model for a successful company in this area?

recursivecaveat · on Jan 2, 2020

As someone who works for another startup in this area, building the chip is only half the battle. The other half is tooling for compiling benchmark networks onto the chip in a performant manner. With 400k cores and their 'duplicate and re-route' defect strategy, this might literally be the most challenging compilation target ever made. It probably stacks up absolutely terribly in every metric right now. That's not to say it will necessarily get better, most of the people I've talked to don't think the megachip will ultimately amount to much more than a clever marketing ploy.

Veedrac · on Jan 2, 2020

A bit baffled by this because on every axis I look this seems like a dream of a compilation target.

* No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle.

* Model parallel alone is full performance, no need for data parallel if you size to fit.

* Defects are handled in hardware; any latency differences are hidden & not in load path anyway.

* Fully asynchronous/dataflow by default, only need minimal synchronization between forward/backward passes.

I genuinely don't know how you'd build a simpler system than this.

jcranmer · on Jan 3, 2020

Having worked on compilers for pretty weird architectures, it's generally the case that the less like a regular CPU your architecture is, the more difficult it is to compile.

In particular, when you change the system from having to worry about how to optimally schedule a single state machine to having to place operations on a fixed routing grid (à la FPGA), the problem becomes radically different, and any looping control flow becomes an absolute nail-biter of an issue.

Veedrac · on Jan 3, 2020

Remember that you aren't compiling arbitrary programs. Neural nets don't really have any local looping control flow, in the sense that data goes in one end and comes out the other. You'll have large-scale loops over the whole network, and each core might have a loop over small, local arrays of data, but you shouldn't have any sort of internal looping involving different parts of the model.

tachyonbeam · on Jan 3, 2020

It's pretty common to have neural networks that have both recurrent nets processing text input and convolutional layers. A classic example would be visual question answering (is there a duck in this picture?). That would be a simple example involving looping over one part of the model. Ideally you want that looping to be done as locally as possible to avoid wasting time having a program on a CPU dispatching, waiting for results and controlling data flow.

Having talked to someone at Cerebras, I also know that they don't just want to do inference with this, they want to accelerate training as well. That can involve much more complex control flow than you think. Start reading about automatic differentiation and you will soon realize that it's complex enough to basically be its own subfield of compiler design. There have been multiple entire books written on the topic, and I can guarantee you there can be control-flow driven optimizations in there (eg: if x == 0 then don't compute this large subgraph).

Veedrac · on Jan 3, 2020

I would be surprised if Cerebras was trying to handle any recurrence inside the overall forward/backward passes. It seems like a lot of difficulty (as mentioned) for peanuts.

I don't get your point about training. Yes, it's backwards rather than forwards, and yes it often has fancy stuff intermixed (dropout, Adam, ...), but these are CPUs, they can do that as long as it fits the memory model.

IshKebab · on Jan 2, 2020

I'm afraid recursivecaveat is right. This is an insanely difficult compilation target. I think you're possibly talking about a different kind of "compilation" - i.e. the Clang/GCC bit that converts C++ to machine code. That is indeed trivial. But "compilation" for these chips includes much more than that.

The really complicated bit is converting the tensorflow model to some kind of computation plan. Where do you put all the tensor data? How do you move it around the chip. It's insanely complicated. If anything kills Cerebras it will be the software.

Veedrac · on Jan 2, 2020

It's model parallel, so the first thing you do is lay out your floorplan for the model, which looks like this.

https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpuplo...

Then you put your data next to the core that uses it. Simples.

(Optimal placement is tricky, but approximate techniques work fine.)

IshKebab · on Jan 3, 2020

When you consider the things that that diagram doesn't show, it doesn't look at all simple. Does that graph even have training? It'll have to be pipelined too. Probably will have to use recomputation due to the shortage of memory. What about within the boxes? You can't nicely separate a matmul into pieces like that.

I work on something similar but less ambitious, trust me it is crazy complicated.

Veedrac · on Jan 3, 2020

Could you be more explicit? What about the naïve approach to training (same graph but backwards, computing gradients) is going to fail?

Wrt. matmul, if you couldn't split them up, today's AI accelerators wouldn't work full stop. But regardless, even if it was much more complex on CS-1 than on all the other sea-of-multipliers accelerators, it's obviously a problem they've solved and so irrelevant to the compilation issue.

jhj · on Jan 3, 2020

It's not like there is one SRAM, there are many SRAMs, so you get the same problem as NUMA but a thousand fold. Some computations you can map to a regular grid/hypercube/whatever quite easily, but it is unclear what the interconnect between the PEs is here, or what this thing has for a NOC or NOCs, how routing is handled, etc., and further complicating the issue is compensating for any damaged PEs or damaged routes.

Veedrac · on Jan 3, 2020

No, you don't have all the issues with traditional NUMA because you aren't doing the same sort of heterogeneous workloads. You're always working on local data, and streaming your outputs to the next layer. This isn't a request-response architecture; such a thing wouldn't scale.

jhj · on Jan 11, 2020

It is more or less the same, it's just that in NUMA you have a limited number of localities, except here it is in the thousands. The issue is one of scheduling that locality. Some process still needs to determine what data is actually local and where it should "flow". Because it can't all fit in one place, the computation needs to be tiled (potentially in multiple ways) and the tiles need to be scheduled to move around in an efficient manner.

dnautics · on Jan 2, 2020

Is it not the case that the defect identification and rerouting happens at the hw level in a QA phase post production? If not I'm even less bullish on cerebras.

HereBeBeasties · on Jan 2, 2020

Yes, that's what their web site says.

joe_the_user · on Jan 2, 2020

With 400k cores and their 'duplicate and re-route' defect strategy, this might literally be the most challenging compilation target ever made.

While I'd be generally skeptical, it seems like the compilation for the rerouting could be done on a single low level, below whatever their assembler is, and so the could just look like a regular array of cores - just a single array that translates from i to the ith "real" core and similar structures seems like it could be enough.

Edit: I mean, if they're smart, it seems like they'd make the thing look as much as possible like a generic GPU capable of OpenCL. I have no idea if they'll do that but since they have size, they won't have to sell their stuff an otherwise custom approach.

Veedrac · on Jan 2, 2020

They have customers already, one (Argonne National Labs) is given explicitly.

The issue with using ‘industry standard’ benchmarks is that it's like measuring a bus' efficiency by shuttling around a single person at a time. The CS-1 is just bigger than that; the workloads that it provides the most value on are ones that are sized to fit, and specifically built for the device.

This does make it hard to evaluate as outsiders (certainly for similar reasons I never liked Graphcore), but I don't think it means anything as grim as you say. The recipe fits.

typon · on Jan 3, 2020

They could always release figures for larger networks - they don't have to target Resnet50 (which is the MLPerf standard). I don't think anyone would hold it against them if they show massive improvements in something like GPT-2 training time (a network 37000x the size of Resnet)

Veedrac · on Jan 3, 2020

GPT-2 uses attention, which is very memory hungry to train, so probably won't work well. But I agree with your overall point.

m0zg · on Jan 2, 2020

That' sounds like horseshit to me. Very large public datasets and models are available to test training on a chip or system of any size. ImageNet is large enough for this. But if that's not sufficient, OpenImages is also available.

To me as a practitioner a meaningful metric would be "it trains an ImageNet classifier to e.g. 80% top1 in a minute". If it's not suitable for CNNs, do BERT or something else non-convolutional. Even better if I can replicate this result in a public cloud somewhere. They know this, and yet all we have is a single mention of a customer under an NDA and no public benchmarks of any kind, let alone any verifiable ones. If it did excel at those, we'd already know.

tynpeddler · on Jan 2, 2020

> Cerebras hasn’t released MLPerf results or any other independently verifiable apples-to-apples comparisons. Instead the company prefers to let customers try out the CS-1 using their own neural networks and data.

> This approach is not unusual, according to analysts. “Everybody runs their own models that they developed for their own business,” says Karl Freund, an AI analyst at Moor Insights. “That’s the only thing that matters to buyers.”

Sounds like instead of benchmarks, prospective customers get a chance to run a workload of their choice on the core before purchase. Assuming support is good, that's way better than looking at benchmarks, because you're guaranteed that the performances you're comparing are for workloads you care about.

Veedrac · on Jan 2, 2020

The appropriately large models with public recognition I know of use attention, which is too memory-hungry to work effectively on the CS-1. The datasets aren't the issue.

I'm fine with skepticism. It's certainly plausible that they don't actually do all that well.

phonon · on Jan 2, 2020

There are probably only a few hundred prospective customers. (Some may buy several units). Each unit will cost millions. They can discuss the expected workloads/performance with each prospective customer individually.

jandrese · on Jan 2, 2020

Keeping the performance figures a secret is a red flag on the level of "run, don't walk, away from this company".

At best their solution is on par with GPUs in a performance per watt/dollar sense. At worst they're scammers looking for a sucker.

privateSFacct · on Jan 2, 2020

I'm also very curious about their performance per watt / dollar for the standard ML datasets out there (facial recognition etc). We have a reasonable sense of both training time and runtime for these in the cloud (and it is falling FAST).

tedivm · on Jan 2, 2020

I got a demo of this two years ago, and honestly I don't think it matters that they aren't sharing these numbers. Any company that is going to consider this is going to want to benchmark it on their own models and systems, and as long as Cerebras allows that they aren't going to have trouble finding customers (assuming their claims line up with reality).

Even if that doesn't work out most of the people on these time have built companies that were acquired by either AMD or another chip maker.

star-trek-fleet · on Jan 2, 2020

Mass market customers are just going to skip without benchmarks.

Although, at this stage, Crebras does not care about mass market yet.

michelpp · on Jan 2, 2020

The members of the GraphBLAS forum have discussed this chip a couple of times. There's a lot of research on making deep neural networks more sparse, not just by pruning a dense matrix, but by starting with a sparse matrix structure de novo. Lincoln Laboratory's Dr. Jeremy Kepner has a good paper on Radix-Net mixed radix topologies that achieve good learning ability but with far fewer neurons and memory requirements. Cited in the paper was a network constructed with these techniques that simulated the size and sparsity of the human brain:

https://arxiv.org/pdf/1905.00416.pdf

It would be cool to see the GraphBLAS API ported to this chip, which from what I can tell comes with sparse matrix processing units. As networks become bigger, deeper, but sparser, a chip like this will have some demonstrable advantages over dense numeric processors like GPUs.

giacaglia · on Jan 2, 2020

I've wrote about the challenges that Cerebras went through and what is next: https://towardsdatascience.com/why-cerebras-announcement-is-...

rsp1984 · on Jan 2, 2020

This fits perfectly into the narrative of yesterday's discussion on HN [1].

Deep Neural Nets are somewhat of a brute force approach to machine learning. Training efficiency is horrible as compared with other ML approaches, but hey, as long as we can trade +5% of classification performance for +500% of NN complexity and throw more money at the problem, who cares?

I see a dystopian future where much better and much more efficient approaches to ML exist, but nobody's paying attention because we have Deep Neural Nets in hardware and decades of infrastructure supporting it.

[1] https://news.ycombinator.com/item?id=21929709

star-trek-fleet · on Jan 2, 2020

Well, if a better algorithm cannot beat DNN in a realistic product setting, then how can you say its better after all?

If the algorithm is indeed better, how can DNN dominates and turn into a dystonia...

someguyorother · on Jan 2, 2020

What economists call path dependence.

The alternative algorithm would be better than DNN if the same amount of effort was put into creating special-purpose hardware, libraries, and so on; but in the dystopia, it's not fully refined DNN vs fully refined alternative algorithm, but fully refined DNN vs alternative algorithm with hardware and software optimized for DNN.

The alternative algorithm always looks unappealing because the playing field historically favors DNN, and so doesn't take off in the dystopia.

star-trek-fleet · on Jan 3, 2020

You are referring back to OP's own reasoning fallacy...

DNN emerges out from being an underdog. Its superiority was proven by technology and economy.

What you said is of course not wrong, but they can never be proven right. As immediately you switch the role, your argument then favors the other one.

sbierwagen · on Jan 3, 2020

One example would be ternary logic, which more efficiently represents numbers: https://en.wikipedia.org/wiki/Radix_economy

m0zg · on Jan 2, 2020

They did build some valuable tech, no question there, but be sure to account for the typical startup hyperbole. By the time you can get your hands on this (if that ever happens), the hyperbole will converge a bit closer to reality, the tradeoffs will become apparent, etc, and you'll discover that it is not, in fact, going to "smash" barriers of any kind in any practical sense.

From TFA: "Cerebras hasn’t released MLPerf results or any other independently verifiable apples-to-apples comparisons."

That's all you really need to know.

ZhuanXia · on Jan 2, 2020

Them shunning benchmarks is pretty lame.

baybal2 · on Jan 2, 2020

The guy who runs Cerebas has history of quick selling companies that then went nowhere. He bets all on wow-effect, and sells to trend chasing suckers.

Less than stellar benchmarks will ruin the "magic"

green-eclipse · on Jan 2, 2020

The Cerebras chip really stands out in terms of the chip industry's relationship to Moore's law. Look at the graphs in this article for reference:

https://medium.com/predict/cerebras-trounces-moores-law-with...

atq2119 · on Jan 2, 2020

That article is hogwash. Sure, the Cerebras "chip" is impressive. But the idea that it will accelerate Moore's law and usher in the singularity is just nonsense. Nobody has even made serious efforts to use deep learning for physical design, and its scope for improving designs is limited at best even in theory.

If this was trying to aim at solid state physics and materials research, then maybe one could be carefully optimistic about a genuine breakthrough via something like room temperature, standard pressure super-conducting. As it stands, I call blind hype.

wpietri · on Jan 2, 2020

Agreed. One of the things that fascinates me about technology is how the new thing is always treated as magic. I'm hoping we're almost out of that phase for ML, as the hype is exhausting.

And for the curious, a good example of this is radium. For a while it was a miracle cure-all, put in everything from lipstick to jock straps. That did not work out well: https://www.theatlantic.com/health/archive/2013/03/how-we-re...

Veedrac · on Jan 2, 2020

> Nobody has even made serious efforts to use deep learning for physical design

DeepMind have for place&route IIRC.

mlyle · on Jan 2, 2020

Yah, it's BS. But it may be teaching TSMC a whole lot about making larger chips with good yield, and the across-reticle interconnect technology is impressive too and may find some general applicability (e.g. it sounds like something AMD might like).

atq2119 · on Jan 2, 2020

Oh, for sure there are things to be learned from this. The responsibility for yield doesn't lie with TSMC though, but with the logic design: to make this kind of integration work, your design has to be able to tolerate a fault essentially anywhere on the wafer surface.

This isn't magic, of course: keep in mind that we already have SRAM with extra capacity for fault tolerance, and multi-core chips which are binned based on the number of functioning cores has been standard for a long time.

mlyle · on Jan 2, 2020

Design to tolerate failures is only one variable in yields. Wafer-scale integration exercises to the limit both our ability to tolerate defects and our ability to minimize them.

ThrowawayR2 · on Jan 2, 2020

That article is utter balderdash. Yes, it's obvious that you can fit more transistors on a "chip" if you make the chip be much, much larger than what we ordinarily think of as a chip. No, it does not mean that Moore's Law has been invalidated or some new "AI Moore’s Law" (quoting from the post) has come into being.

dnautics · on Jan 2, 2020

> Yes, it's obvious that you can fit more transistors on a "chip" if you make the chip be much, much larger than what we ordinarily think of as a chip.

Without defending the article, it is however the case that simply scaling a chip size has nontrivial problems. For example, Will the piece of silicon warp or shatter if one side happens to get hotter than the other?

ThrowawayR2 · on Jan 3, 2020

Possibly. Wafer scale integration has been investigated before though and there were even a couple of attempts at commercial products; it's not a brand new technology. Nevertheless, it might be interesting to examine Cerebras' patents to see if anything of significance relating to WSI is there.

gfodor · on Jan 2, 2020

I’m a know-nothing when it comes to this area, but I shouted expletives at least twice when I read this article. This is crazy.