Introduction to Artificial Neural Networks – Part 1

gamegoblin · on Dec 7, 2013

I have been messing with ANNs since I learned how to program. For years, understanding of ANNs eluded me. Here is the breakthrough I had some time ago that finally let me understand ANNs:

Think about them like circuits. It is easy to see that a circuit full of NAND, AND, OR, etc can carry out computations. An AND gate is a gate that requires all of its inputs to be true and then it goes high. An OR gate requires at least 1 of its inputs to be true and it goes high.

Think of a neuron as being somewhere in between that (I understand that in the article he shows how to simulate gates, but doesn't quite describe neurons as being fancy gates (which is where my breakthrough happening)). It requires N of its inputs to be true and it goes high. Then imagine that the "wires" in our neural network don't carry information in the form of bits, but instead in the form of real numbers. Each wire has a "weight" associated with it, such that when the wire is turned on, it outputs that value, rather than a simple binary 1.

Now imagine that the neurons take all of these real numbered inputs, and apply some function to them to decide if the neuron wants to turn on or not. It might simply sum them, multiply them, or something more complex, but based on its inputs, it turns on. Its on signal then gets sent to all of the neurons that it points to, etc. The same way you can extract answers from a circuit of logic gates by seeing the output of the gates, you can extract answers from an NN by examining the output of certain neurons.

This description is quite simplified and doesn't go into the architecture of ANNs, but if you are really having a lot of trouble grasping how ANNs work, this description should give you some intuition. The hardcore ML people will probably dislike it, but you have to start somewhere. After understanding it like this, I branched out quite a lot and now all of my academic research involves machine learning. But it took that initial breakthrough!

pmelendez · on Dec 7, 2013

Actually that's the traditional analogy, and if I recall correctly in the process of thinking it as a circuit was how Marvin Minsky realized that the original perceptron is unable to mimic a XOR gate.

gamegoblin · on Dec 7, 2013

Really? In all of my perusing of various tutorials online, I never encountered it explained in this intuitive fashion, that neurons are just a generalization of logic gates.

jonsen · on Dec 7, 2013

Threshold Logic - a kind of in-between.

agibsonccc · on Dec 7, 2013

What are your thoughts on the vectorized represenation? I understood the object oriented representation after a while, but I've noticed that, despite what looks like magic at first, when there's a lot less code to look at with singular weight vectors for the whole network vs different kinds of neurons, setting up graph structures, etc.it's easier to grok.

This is probably just a personal preference though.

I find if you're tackling neural networks in the first place, it's great to have your fundamentals down in the first place though. I think a lot of the problem with people learning machine learning is that they just dive in to it without going back and understanding the moving parts that make up the whole.

I made that mistake personally when I was starting out a few years ago. I've seen that with others as well.

dhammack · on Dec 7, 2013

I like to think of the vectorized representation as just a nonlinear transformation to a higher dimensional space with a classifier afterwords. If you're familiar with linear algebra, then z = Wx, where W is a matrix of weights and x is a feature vector maps x (which could be something like 5 dimensional) to a new space (which could be like 50 dimensional). z is the representation of x in that new space. After this linear mapping, we apply a nonlinear transform (sigmoid, rectifier, etc). If we didn't have the nonlinear transform, then the entire model would just be linear! This follows from the fact that the composition of linear functions is itself linear.

The final layer is just a standard logistic regression classifier in the new (usually higher dimensional) space.

agibsonccc · on Dec 7, 2013

Haha I get that! I was just saying for someone learning it at first vs approaching it from an object oriented angle with individual neurons and having it be a graph data structure. It seems easier to tackle when you can summarize it exactly as you said there vs, oh there's these neurons with these connections and you forward propagate each individual weight vector then backpropagate etc...

dhammack · on Dec 7, 2013

I've managed to get an intuition for backpropagation (the way gradients are computed for neural networks) using a similar analogy. The basic idea is that it's just a signal moving in the opposite direction in the network - it starts with computing the derivative of the loss, and goes back through each layer, like a breadth-first search.

sitkack · on Dec 7, 2013

Feynman would be proud of your explanation.

Douglas Hofstadter, "Analogy as the Core of Cognition" http://www.youtube.com/watch?v=n8m7lFQ3njk

Dewie · on Dec 7, 2013

Using neural networks to solve AND, OR etc. was how the book[1] we used in our course introduced and compared neural networks. The simplest NN were just logical circuits.

[1] http://books.google.no/books/about/Fundamentals_of_Neural_Ne...

11001 · on Dec 8, 2013

People! If you're interested in learning Neural Networks, please do yourself a favour, go learn them from a good textbook (e.g. http://www.cs.toronto.edu/~mackay/itprnn/book.html is free online), not these "online tutorials". I know, math-y books don't often make it sexy "for hackers", but it is really the only way to learn a math subject.

matsemann · on Dec 7, 2013

Does anyone here have any good guides/articles about the structure of the artificial network? At university I solved simple problems in a course with ANs, but the structure of my network (hidden layers, how they were connected etc.) were basically just try/fail until I achieved ok results.

agibsonccc · on Dec 7, 2013

I'm just going to suggest this as well. When working with neural networks, you can see the learning curve for them go down massively if you understand linear algebra.

Numpy based neural networks are a lot more digestable as far as understanding all of the moving parts due to the easy syntax.

Neural nets are usually verbose if you use an object oriented representation. Things like feedforward and backpropagation don't require loops if you know how to think about individual neurons and weight matrices.

Also, for working with them, here's something that cleared up a lot for me: when initializing neuron weights, it's randomly initialized. It seemed like magic to me at first and kept tricking me up when getting started with them.

For those who do the JVM, I'm working on finishing up a library for myself to use in various bits of work I do.

Java based, but you get access to fortran matrix routines, and it's interopable with scala and the like as well.

https://github.com/agibsonccc/sda-jblas

For those of you who need hadoop or want to scale out neural networks, I'm going to suggest a friend of mine's iterative reduce project for neural nets on hadoop.

https://github.com/jpatanooga/Metronome

I just finished up a parallel training setup I'm going to release based on this for neural nets as well. When working with them, you'll discover crazy long training times. Multicore tends to make things easy if you know how to balance out the weights though.

I'd be happy to answer any other questions as well.

bravura · on Dec 7, 2013

Quick and dirty guide:

- Use the tanh activation function.

- Use many hidden nodes, e.g. 100 or 1000 for complex tasks.

- Tune the learning rate.

- Add a penalty of k*w^2 to each hidden node. Tune the parameter k to minimize the held-out error.

- If this isn't working, plot the error on the training set as you train. It should decrease to 0.

Read Yann LeCun's efficient backprop work for more tips.

dave_sullivan · on Dec 7, 2013

There aren't any I'm aware of that tie this question together nicely, but here are the general structures in use right now. Google will lead you to more information on each.

Standard feed forward deep net: like the ones you used at university but with a few important features. One, you can stack layers on top of each other then train the whole network with back propagation. The nonlinearity you use can be important (rectified nonlinear units--Max(0,x) are popular now). Regularization can be important (with dropout being a popular method). Pre-processing your data can also be important (eg scaling the inputs, subtracting the mean, ...). How you initialize your weights is important. Using tricks like momentum and learning rate decay are important too.

Autoencoder: A standard feed forward deep net that tries to output an "uncorrupted" sample of a "corrupted" input. So basically, take an image, add noise, ask it to output the denoised image. Why you ask? Well, in doing this, the hidden units (the layers in between the input and output) tend to discover new features which can then be used elsewhere in machine learning pipelines. The big win here is automated feature engineering.

RBM: Restricted boltzman machine, sort of like an autoencoder but not. I'm not convinced one is better than the other, but there are definitely differences.

Recurrent neural networks: Like standard feed forward deep nets but extended to time series problems. The hidden units at each timestep feed in as additional input to the next timestep (along with the new input). So basically, each step it gets this as input: [whatever came before described as hidden units] + [input at T]. These are my personal favorite and currently hold state of the art in speech recognition, at least academically.

Convolutional neural networks: State of the art in vision, this one is kind of hard to explain quickly, but you can think of it kind of like this: I take some image where I want to recognize objects in it. Then I train a bunch of neural nets that represent different aspects of an image (one net might look for vertical lines, another might look for horizontal. It figures out what it's looking for automatically and distributes what it thinks are important across these little networks. You can think of this as "feature detection") You then take a window on the image (400x400 image, maybe you've got a 20x20 window) which you then slide across the image and check each section for the presence of whatever the net is looking for (vertical/horizontal lines, etc.) There's several layers of this (specifically a convolution operation followed by a pooling operation) before the result is fed to a standard fully connected feed forward net which then outputs a prediction.

While lower layers look for low level aspects of an image, each layer progressively looks for higher and higher level phenomena--for instance, when you hear about "the cat neuron" it comes from probing a neuron in the higher levels of a convolutional net and finding that pictures of cats happen to turn this neuron "on" while pictures of anything else don't. What is "high level" exactly? In this case, it really just means composed from "lower level" components... Some also say "higher level = more abstract" but I don't know this is quite the case--abstract means something different to me.

An important consideration is that the conv net architecture allows you to cut down on the total number of parameters (good when you've got something high dimensional like image data) and it also takes advantage of the fact that you can make certain assumptions about images and how objects move/appear in space basically.

Conv nets are probably the most difficult architecture to grasp and my explanation is extremely high level.

Ha, that was a bit longer than I had hoped, but I hope it's helpful to someone...

derefr · on Dec 7, 2013

> Conv nets are probably the most difficult architecture to grasp

I might be totally off-base, but is Numenta's HTM (Hierarchical Temporal Memory) model a conv net? If so, there are some really good introductory/TED-talk-like explanations out there for the idea.

dave_sullivan · on Dec 7, 2013

Heh, I still can't figure out what an HTM is exactly, but I think it's sort of similar to a recurrent convolutional network. It handles learning completely differently (allegedly in a more biologically plausible way).

Personally, I think there are two big issues with what numenta was doing: they diverged too far from mainline neural net research (which, when they started in ~2005 was just before things started getting interesting--it was also a time when "mainline neural net research" was widely assumed to be at a dead end) and they tried too hard to come up with something that was biologically plausible rather than mathematically expedient. Sort of how airplanes need wings but don't need feathers. As far as I am aware, HTMs really just don't work well in practice (and by that, I mean they are not anywhere near competitive with any of the architectures I listed above).

derefr · on Dec 7, 2013

From what I remember, I think HTM learning is isomorphic to that of a recurrent conv net. You'll get roughly the same change given the same training data, but with a parallel-agent approach to learning, instead of the kind of monolithic computation you can throw a GPU at. (In other words, you flip the "data" and "instruction" streams from SIMD to get Multiple-Instruction-Same-Data processes.)

You could, in other words, see a recurrent conv net as an opmitization of an HTM given a von Neumann architecture, or the reverse -- an HTM as an optimization of a recurrent conv net given a biological substrate (where it's much less costly to build tons of crummy processors and link them into an arbitrary graph, than it is to build a single fast processor.)

Again, though, I'm not an ML person, so I might be way off.

chmike · on Dec 7, 2013

What do you mean by things getting interresting ? Did I miss something after 2005 ? Do you have some reference on the web ?

dave_sullivan · on Dec 7, 2013

2006 is commonly cited as the year when "deep learning" started becoming practical. Pretty sure it was Hinton's group, but they used greedy unsupervised pre-training to get a good initialization of the weights, then followed by supervised finetuning of said weights. That result kicked off a lot of renewed interest in NNs, which then led to using GPUs for a 40x speedup, which then led to many more impressive results (and they just keep coming). It turns out the unsupervised pre-training isn't even necessary, go figure...

chmike · on Dec 8, 2013

Thank you: http://en.wikipedia.org/wiki/Deep_learning

nrmn · on Dec 7, 2013

Would you have any good resources to conv nets? I've done the tutorial on theanos site and the related stuff on Andrew ng random wiki (ufdl?). I'm still confused on the filters part.

howeman · on Dec 7, 2013

I think that's basically it. If you want to do a lot of work you can do somethings with cross validation (see which parameters perform the best), but that's a form of automatic try/fail

pilooch · on Dec 7, 2013

hyperparameter search is a solution. See http://jmlr.org/papers/v13/bergstra12a.html and http://jaberg.github.io/hyperopt/

TrainedMonkey · on Dec 7, 2013

I've been messing with Neural Networks for years. Including doing AI college project and taking class on them later. What I learned - unless you structure and query your neural networks in a very special way all they do is approximate higher order functions. It is not trivial to correctly structure neural network and input/outputs.

TL:DR Neural networks are super powerful, but really hard to use properly for anything slightly harder than simple functions.

cjauvin · on Dec 7, 2013

I have written a small series of posts which show (with some simple Python code and math) the relationship between the training procedure of an ANN (backpropagation) and some simpler, more basic machine learning algorithms (like logistic regression):

http://cjauvin.blogspot.ca/2013/10/neural-network-101.html

howeman · on Dec 7, 2013

If anyone is interested, I have a working neural net implementation https://github.com/btracey/nnet . I don't make any stability promises (especially on the training routines, that's a mess at the moment), but sometimes it's nice to see code. The core of the net is https://github.com/btracey/nnet/blob/master/nnet/functions.g... and https://github.com/btracey/nnet/blob/master/nnet/nnet.go

howeman · on Dec 7, 2013

On the stability, the point is that if you'd like to use it you should fork it.

rdlecler1 · on Dec 7, 2013

It becomes a lot easier to think about these circuits if you realize that many of the connections are actually spurious. That is, if you remove them and systematically test the function over a range you'll realize that many of these weights have no information bearing value. For instance, if you have a w_i_j=0.01 this may be totally insignificant for any sigmoidal function and for all intents and purposes should be w_i_j=0 in which case this complex web of neural connections is trimmed, resulting in the underlying circuitry.

If you're interested I wrote a paper in Nature Systems Biology that shows this for artificial gene networks (just another branch of ANNs): Survival of the Sparsest: Robust Gene Networks are Parsimonious

amazedsaint · on Dec 7, 2013

Cool, here is another one I wrote few years back - http://www.amazedsaint.com/2008/01/neural-networks-part-i-si...

funky_vodka · on Dec 7, 2013

That guy did another nice tutorial on a simple evolutionary algorithm and how to implement it to solve the travelling salesman problem. His explanations are useful for beginners like me.

justplay · on Dec 7, 2013

Please please do not stop writing about this. I really want to know more. thanks.

Anon84 · on Dec 7, 2013

You should check out the Coursera ML class videos.

pilooch · on Dec 7, 2013

A very good source of information, and tutorials with source code is http://deeplearning.net/tutorial/contents.html

stillsut · on Dec 7, 2013

Also, Coursera partnered with Geoffrey Hinton (the big name in Deep Learning right now) to put together a pretty good course:

https://www.coursera.org/course/neuralnets

foobarqux · on Dec 7, 2013

Anyone have information about understanding how inputs affect outputs in a trained ANN? Specifically to better understand the system, which inputs have the most impact and how they interact with out inputs.

tlarkworthy · on Dec 7, 2013

Its called the Jacobian. The differential of the outputs by the inputs. Tells you how each input affects the output vector at a specific location in input space. You can calculate it manually. Use inbuilt Matlab functions or finite difference it.

foobarqux · on Dec 7, 2013

So what is all the talk about how ANNs yield solutions that don't have a straightforward interpretation unlike regression?

tlarkworthy · on Dec 8, 2013

Well the jacobian is a numerical matrix unique at each point in the input space. So its kinda hard to visualize the changing Jacobian matrix over the input space.

People do interpret ANNs. Normally by visualising the weight matrices on the input layer, which have a 1-1 mapping to input vector attributes (so you can label them). It gets a bit hard on subsequent layers though.

decision trees are an order of magnitude easier to interpret though, they are very compact and consist of a number of binary decision in the input space. For a given classification you only need to look at log(d) decisions to work out how it got to its conclusion. Its not usually that hard to how it arrived at the tree from training.

The relationship between the Jacobian and the training is fairly convoluted, given back propagation and cross validation.

PS so back-prorogation uses weights differentiated against outputs. Jacobian is input differentiated against outputs. You can do smart things with Jacobian aware neural architectures. See "Forward models: Supervised learning with a distal teacher" which trains two networks in parallel, one a forward model, one an inverse model, and uses the Jacobian to circumnavigate the problematic non-convexity of the inverse model.

foobarqux · on Dec 8, 2013

> Well the jacobian is a numerical matrix unique at each point in the input space. So its kinda hard to visualize the changing Jacobian matrix over the input space.

Yeah it doesn't seem like it would be easy to gain much insight since the partial derivatives are a function of the (other) inputs.

> People do interpret ANNs. Normally by visualising the weight matrices on the input layer, which have a 1-1 mapping to input vector attributes (so you can label them).

What do the first-layer weights tell me?

tlarkworthy · on Dec 8, 2013

If the weight is positive that feature helps the neuron fire and vice versa.

If the neural network is processing images, the weights form and image too. You can tell what a lot of the units are "looking" for by plotting the weights as an image.

page 14, Figure 3: jmlr.org/papers/volume11/erhan10a/erhan10a.pdf

You can clearly see various digit deformations in the weights. (white is positive weight, black is negative weight typically)

Lots of papers do this, this one was happened to be the first one I managed to google.

So the first layer is normally readable because weights are in the same space as the input feature space. The second layer is normally a jumble as its randomly initialised before convergence. But in the paper's example you can imagine the 2nd layer output unit representing a final classification of a 4, is probably summing up all the 4 deformation detectors in layer 1 and negatively summing up everything else.

tlarkworthy · on Dec 8, 2013

Actually I just read that paper properly and I see they did visualise the higher level layer weights too:

"For visualizing what units do on the 2nd and 3rd layer, we used the activation maxi- mization technique described by Erhan et al. (2009)"[1]

So there you have it! You can visualise all of a neural network! But you have to implement some specific numerics to get the data out. Reading the first layer is trivial, and subsequent layers need some analysis.

[1] Erhan 2009 Visualizing higher layer features of a deep network

(good find, thanks for asking the right questions!)

foobarqux · on Dec 9, 2013

> If the weight is positive that feature helps the neuron fire and vice versa.

But that doesn't tell you much does it? It's effect could be reversed or minimized in subsequent layers.

Haven't read it yet but thanks for the paper.

tlarkworthy · on Dec 9, 2013

Just look at the one figure in the paper and you will see.

Because your final layer is positively encoded, that polarity trickles down through back propogation. I also think the positive weights thing is to do with the way features normally work. It makes sense to look for corners, not anti corners for object detection.

TrainedMonkey · on Dec 7, 2013

I would check Neural Network class on Coursera. Subject is so complex that you need to put time to get anywhere.

chmike · on Dec 7, 2013

What is so complex ? Neurons compute a sum of its input. Neurons are connected by synapse. A synapse has a oefficient that weights the input. If the synapse coefficitent is the real value c and the value of the input neuron is i, then The sum will be computed with c*i. One can represent this by vector and matrix multiplication.

In the most simple model, this sum is the output. In more sofisticated models the result of the sum is modified by a function with a sigmoid shape. One of these sigmoid function is infinitly derivable. This allows to fit any continuous function with a sum of these sigmoids. Different strategies have been suggested to do the fit.

But all these ANN don't explain the biological neural networks found in our cortex.

TrainedMonkey · on Dec 7, 2013

It is complex as soon as you try to do anything but simplest things with Neural network. What I really liked about Coursera's NN class is theory of how RBM's can be used to model associative memory.

chmike · on Dec 8, 2013

Ah yes. I agree. Doing things with these simple neurons is complex. On the other hand a 3 year old kid can sing and dance at the same time. How can it be so complex? ;)

It is a fascinating problem which is still lacking an answer.