Nonlinear Computation in Deep Linear Networks (blog.openai.com)
128 points by darwhy on Sept 29, 2017 | 31 comments



It sounds like the author is ignoring denormals?

-edit-

Yes, the author is ignoring gradual underflow and the resulting denormal numbers.

So as you move from one binade to the next, the spacing between floating point numbers doubles or halves depending on whether you are increasing or decreasing the exponent. When you reach the binade with the most negative possible exponent, you have two choices: a) round toward zero, which leads to a huge non-monotonic jump in the spacing of numbers on the floating point number line. This is a great annoyance to numerical analysts and leads to convergence instabilities. That is why any modern computer used for numerical work incorporates choice b) gradual underflow, which means you must allow non-normalized numbers in the two binades (the two being + and - sign bit) of the most negative exponent, which has the effect of creating another pair of binades around zero. This keeps the spacing of numbers on the floating point number line the same in the four binades around zero. Numerical algorithms are then much more stable.
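A tiny numpy sketch of that gradual-underflow behaviour (this assumes a CPU build of numpy with denormals left enabled, which is the usual default; on hardware with FTZ set you'd see zeros instead):

    import numpy as np

    smallest_normal = np.float32(np.finfo(np.float32).tiny)   # ~1.18e-38
    print(smallest_normal)

    x = smallest_normal
    for _ in range(3):
        x = x / np.float32(2)
        print(x)   # 5.877472e-39, 2.938736e-39, 1.469368e-39: denormals, not zero

    # The spacing stays the same in the binades around zero:
    print(np.spacing(np.float32(0.0)))      # ~1.4e-45, the smallest denormal
    print(np.spacing(smallest_normal))      # same spacing just above the denormal range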

I haven't looked at what GPUs do, but I strongly suspect that they round toward zero: first of all it doesn't matter much to graphics applications, and secondly, the typical method of handling denormals is to take a trap and drop into software-emulated floating point, because the cost of the additional hardware to handle denormals is very large and the hardware complexity is crazy-making. A GPU isn't going to want to break the pipeline for a denormal.


It looks like nvidia GPUs treat denormals as zeros for single-precision floating point math: http://developer.download.nvidia.com/assets/cuda/files/NVIDI... (sections 4.1 and 4.2)


In the context of graphics processing that trade-off totally makes sense.

Thanks for doing the homework that I was too lazy to do :)

It seems to me that in the context of NN computations, using the lack of gradual underflow as a non-linear element is going to severely limit the dynamic range of the neurons. On the plus side, the non-linear element is a computational freebie. But in addition to limited dynamic range, it makes the NN ridiculously non-portable across hardware implementations.


Actually if you read section 4.6 of that paper you'll see that denormals are the default on sm_20 and above. But you can see in that same section that this can easily be disabled with the ftz flag.

I had to give Jakob custom gemm kernels to do this research. Not sure why the denormal point was left out of this blog as it's pretty critical to the whole experiment.


So a minor correction here. We did explore placing ftz on various instructions inside the matmul ops, but it turns out you don't need anything more than what is already baked into tf by default. All tf gpu primitives are built with -nvcc_options=ftz=true. This means you have an implicit non-linearity after any non-matmul op (provided the scale of computation is near 1e-38). Matmul ops are called through cublas and have denormals enabled.
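To make that implicit nonlinearity concrete, here is a rough numpy emulation of what the flush does element-wise (a sketch of the effect, not the TF/cuBLAS kernels themselves):

    import numpy as np

    FLT_MIN = np.finfo(np.float32).tiny    # ~1.18e-38, smallest normal float32

    def ftz(x):
        """Emulated flush-to-zero: values smaller in magnitude than FLT_MIN become 0."""
        x = np.asarray(x, dtype=np.float32)
        return np.where(np.abs(x) < FLT_MIN, np.float32(0.0), x)

    # The flush only does anything when the scale of computation is near 1e-38:
    h = np.array([3e-38, 5e-39, -8e-39, -2e-38], dtype=np.float32)
    print(ftz(h))    # [ 3e-38  0.  0. -2e-38] -- the middle two got flushed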


I'm not sure why you say "ignore".

As I read this, the author claims to have created a naive "linear" network akin to regular deep learning networks but without the explicitly added non-linearity, and shows it's trainable. He acknowledges it has to operate through a non-linearity (namely underflow), so the mechanisms you mention sound compatible with his findings.

The point I'd see for the article isn't some magic non-linear to linear transformation, but that for all we know, incidental underflow effects might be operating in regular "non-linear" networks as well.


They added a note at the end clarifying that they have flush to zero mode enabled.

quote from the article: "EDIT: This blogpost assumes that we enable flush to zero (FTZ) which treats denormal numbers as zeros. It’d be interesting to see researchers try without FTZ!"


You could (in theory) do the same thing anywhere, but disabling denormals is the cheapest way (in CPU cycles) to create a big nonlinearity (in terms of relative displacement from linear) in IEEE and related FP representations.

Using evolutionary methods to trap nonlinearities is already hard; I imagine it would be even harder to find functions which exploit even more subtle nonlinearities.
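For what it's worth, here's a minimal evolution-strategies sketch (my own toy reconstruction, not the OpenAI code): the loss is treated as a black box, so a non-differentiable flush is no obstacle. The 0.5 threshold and the `es_minimize` helper are stand-ins for illustration, with the FTZ cutoff rescaled to unit magnitude for readability.

    import numpy as np

    def es_minimize(loss_fn, w, steps=300, npop=50, sigma=0.1, lr=0.05, seed=0):
        """Black-box minimization: nudge w along noise directions weighted by reward."""
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            noise = rng.standard_normal((npop,) + w.shape)
            rewards = np.array([-loss_fn(w + sigma * n) for n in noise])
            a = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
            w = w + lr / (npop * sigma) * np.tensordot(a, noise, axes=1)
        return w

    # Toy usage: a non-differentiable "flushed" quadratic.
    flush = lambda x: np.where(np.abs(x) < 0.5, 0.0, x)
    w0 = np.array([3.0, -2.0])
    w_star = es_minimize(lambda w: float(np.sum(flush(w) ** 2)), w0)
    print(w_star)    # should land inside the flushed-to-zero region around 0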


The main problem here is that you're depending on implementation-specific behaviour. If you train on a device, you have to run on a device with exactly the same behaviour. On top of that, some FPUs have very slow (trapping) denormal handling. I'm also unsure how accurate the gradient computation can be when the signal itself has numerical issues.

I don't deny it's a cool hack, but beyond that I don't think I see the point or the problem this is trying to solve.


no gradients =D of course, that makes it even harder to train.


These findings seem to be at odds. The former says that deep linear nets are useful, non-linear and trainable with gradient descent. The latter says that the non-linearity only exists due to quirks in floating point and that evolutionary strategies must be used to find extremely small activations that can exploit the non-linearities in floating point.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

https://arxiv.org/abs/1312.6120

"We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions."

Nonlinear Computation in Deep Linear Networks

https://blog.openai.com/nonlinear-computation-in-linear-netw...

"Neural networks consist of stacks of a linear layer followed by a nonlinearity like tanh or rectified linear unit. Without the nonlinearity, consecutive linear layers would be in theory mathematically equivalent to a single linear layer. So it’s a surprise that floating point arithmetic is nonlinear enough to yield trainable deep networks."


The arxiv paper here is analyzing the nonlinearities in a network's learning dynamics; exploring why training time / error rates do not vary linearly throughout the training process.

They note: "Here we provide an exact analytical theory of learning in deep linear neural networks that quantitatively answers these questions for this restricted setting. Because of its linearity, the input-output map of a deep linear network can always be rewritten as a shallow network."


It's super interesting to think that any non-linearity at all can make it work. This particular non-linearity is surprising since it's clamping to zero at the center of the response curve. I'd have thought that's right where you want the linear response, and that clamping in the middle would cause bad things to happen. Sigmoid and ReLU (and others) clamp at the foot/shoulder. Perhaps this network just learns negative weights, compared to the traditional activation functions?


there's a theorem that any nonlinearity works (for sufficiently sized networks).


The universal approximation theorem actually assumes that the nonlinearity is monotonically increasing, nonconstant and continuous. I don't think floating point nonlinearities technically satisfy that.


0) nonconstant. Yes, for most cases the floating point nonlinearities map x => x, so it is not a constant.

1) bounded. Yes, the nonlinearities are bounded by the range of the FP.

2) monotonically increasing. Yes. Consider a + b, where fp(a + b) < a + b, in other words, it's been rounded down. Then fp(a + (b - db)) cannot be rounded up to a number higher than fp(a + b), so the floating point rounding function fp must be monotonic for the operation +; a similar argument applies for multiply, and thus for any linear function.

3) continuous function. No. Well, you can't win at everything; no computer representation can be truly continuous, but it's a reasonable approximation for approximation theory, otherwise ML on computers in general would be hopeless.
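A quick brute-force sanity check of point 2 in numpy (rounding in float32 addition never lets a smaller exact sum land on a larger float):

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.float32(rng.uniform(0, 1))
    bs = np.sort(rng.uniform(0, 1, 10000)).astype(np.float32)

    sums = a + bs                        # float32 additions, each result is rounded
    print(np.all(np.diff(sums) >= 0))    # True: the rounded sums never decrease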


So this exploits the fact that floating point numbers have finite precision (and perhaps uneven spacing) to generate non-linear operations?

That's actually a really cool usage of the specification!


Well, except that the author misunderstands the specification and how it is typically implemented on modern computers.


(I work at OpenAI.)

TensorFlow by default is built with denormals off (ftz=true), so denormals aren't relevant for the applications we're interested in. We have updated the post to indicate this — thanks for the feedback!


This is one cool hack.

You can't construct a deep neural network from only linear parts because consecutive layers can always be combined into a single transformation matrix. That's why you need alternating linear and nonlinear operations.
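In exact arithmetic that collapse is just matrix multiplication being associative; a quick numpy check (equal up to floating point rounding):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal((5, 8))
    W1, W2, W3 = (rng.standard_normal((8, 8)) for _ in range(3))

    stacked   = x @ W1 @ W2 @ W3        # three "layers", no activation in between
    collapsed = x @ (W1 @ W2 @ W3)      # one layer using the product matrix

    print(np.allclose(stacked, collapsed))   # True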

I wonder if it's possible to design a special-purpose low-resolution floating point circuit that maximizes this effect while preserving enough linearity. Then you'd have a fast DNN pipeline constructed from just multiplication and addition.


It's also possible to binarize DNNs to use the faster bitwise operations

https://arxiv.org/pdf/1602.02830.pdf
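The core trick there, sketched as a toy (my own illustration of the idea, not the paper's code): with weights and activations constrained to {-1, +1}, a dot product reduces to XNOR plus popcount.

    import numpy as np

    def binary_dot(x_bits, w_bits, n):
        # x_bits, w_bits: ints whose n low bits encode +1 as 1 and -1 as 0
        xnor = ~(x_bits ^ w_bits) & ((1 << n) - 1)
        matches = bin(xnor).count("1")     # popcount
        return 2 * matches - n             # dot product over {-1, +1}^n

    x = np.array([+1, -1, +1, +1])
    w = np.array([+1, +1, -1, +1])
    to_bits = lambda v: int("".join("1" if s > 0 else "0" for s in v), 2)

    print(binary_dot(to_bits(x), to_bits(w), len(x)))   # 0
    print(int(x @ w))                                   # 0, same answer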


> I wonder if it's possible to design special purpose low resolution floating point circuit that maximizes this effect while preserving enough linearity.

At that point, you are probably better off just building circuits with "power this wire to ReLU at the end", which is not very many extra transistors.


This is cool, but it seems like a crazy hack with no benefit besides "because it's there". ReLU is already such a cheap function to compute.


I didn't understand how the gradients are produced to honor this underflow behavior. Is that the reason why they use "ES" instead of symbolic (or probably they meant automatic) differentiation?


correct.


Well, that is cool and not what I would expect. That said, for practical applications, activation functions like leaky ReLU, tanh, etc. make training easier.


I wonder if this / gradient descent would work with integers, and with mod 2^n as the nonlinearity.


I guess the hope is that a round of computation can be shaved off.


Wasn't this already obvious from simple networks which implement e.g. the XOR function?


A simple network without a nonlinear function (e.g. sigmoid) can't learn XOR.
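A quick numpy check of that claim: the best purely linear fit (with a bias) to the XOR truth table just predicts 0.5 everywhere.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0], dtype=float)

    A = np.hstack([X, np.ones((4, 1))])            # add a bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)      # best purely linear fit

    print(A @ w)    # [0.5 0.5 0.5 0.5]: no linear separation of XOR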


This is completely different. It seems to be about exploiting floating point underflow to have linear nodes perform nonlinear computation.



