It will be physically impossible to access that much memory in a single cycle at anything approaching reasonable clock speeds. I suppose you could do it at 5 Hz :)
A core receives data over the interconnect. It uses its fast memory and local compute to do its part of the matrix multiplication. It streams the results back out when it's done. The interconnect doesn't give you single-cycle access to the whole memory pool, but it doesn't need to.
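To make that concrete, here's a minimal sketch in plain Python/NumPy of the streaming model described above (the tile size and function names are illustrative, not anything from the actual hardware): each "core" only ever touches its own local accumulator and the operand tiles streamed to it, so no core needs single-cycle access to the whole memory pool.

```python
import numpy as np

TILE = 4  # tile edge; real hardware would size this to fit local SRAM

def streamed_matmul(A, B):
    n = A.shape[0]
    C = np.zeros((n, n))
    # Each (i, j) pair plays the role of one core owning C's (i, j) tile.
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE))  # accumulator in "local fast memory"
            for k in range(0, n, TILE):
                # Operand tiles arrive over the "interconnect" one at a time.
                a_tile = A[i:i+TILE, k:k+TILE]
                b_tile = B[k:k+TILE, j:j+TILE]
                acc += a_tile @ b_tile     # local compute on local data
            C[i:i+TILE, j:j+TILE] = acc    # stream the finished tile out
    return C

A = np.random.rand(16, 16)
B = np.random.rand(16, 16)
assert np.allclose(streamed_matmul(A, B), A @ B)
```

The point of the sketch is just that the working set per step is one accumulator plus two operand tiles, so interconnect bandwidth, not global single-cycle reach, is what matters.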
I think it is telling that one sentence claims it is faster than Nvidia, and another claims it runs TensorFlow. I do not think this architecture could do both at once: it could not run TensorFlow fast enough (not enough fast local memory) to compete with even a moderate array of GPUs.