It will be physically impossible to access that much memory in a single cycle at anything approaching reasonable clock speeds. I suppose you could do it at 5 Hz :)
A core receives data over the interconnect. It uses its fast memory and local compute to do its part of the matrix multiplication. It streams the results back out when it's done. The interconnect doesn't give you single-cycle access to the whole memory pool, but it doesn't need to.
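To make that concrete, here's a minimal sketch in plain Python/NumPy of the streaming model described above (the tile size and function names are illustrative, not anything from the actual hardware): each "core" only ever touches its own local accumulator and the operand tiles streamed to it, so no core needs single-cycle access to the whole memory pool.

```python
import numpy as np

TILE = 4  # tile edge; real hardware would size this to fit local SRAM

def streamed_matmul(A, B):
    n = A.shape[0]
    C = np.zeros((n, n))
    # Each (i, j) pair plays the role of one core owning C's (i, j) tile.
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE))  # accumulator in "local fast memory"
            for k in range(0, n, TILE):
                # Operand tiles arrive over the "interconnect" one at a time.
                a_tile = A[i:i+TILE, k:k+TILE]
                b_tile = B[k:k+TILE, j:j+TILE]
                acc += a_tile @ b_tile     # local compute on local data
            C[i:i+TILE, j:j+TILE] = acc    # stream the finished tile out
    return C

A = np.random.rand(16, 16)
B = np.random.rand(16, 16)
assert np.allclose(streamed_matmul(A, B), A @ B)
```

The point of the sketch is just that the working set per step is one accumulator plus two operand tiles, so interconnect bandwidth, not global single-cycle reach, is what matters.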
I think it is telling that one sentence claims it is faster than Nvidia, and another claims it runs TensorFlow. I do not think this architecture could do both at once: it could not run TensorFlow fast enough (not enough fast local memory) to compete with even a moderate array of GPUs.