Toward a deeper understanding of the way AI agents see things (fb.com)
79 points by muse900 on Nov 19, 2018 | 7 comments



I've been playing with the MAC network demo (https://github.com/KnetML/MAC-Network for the Julia version), and this study chimes with the conclusions it led me to. I was really impressed when I first saw it, but getting it running and experimenting with it changed my view.

The dictionary shows that the numbers in the response system are just labels, which makes the spatial reasoning hard to evaluate, and I (like other people looking at the demo) tend to credit "near" misses as good efforts rather than as mislabelling.
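
To make that scoring bias concrete, here is a toy sketch (Python; the answer pairs are invented) of exact-match scoring versus the lenient "near miss" scoring I catch myself doing:

    # Toy sketch of exact-match vs. lenient scoring (answer pairs invented).
    pairs = [("3", "3"), ("4", "3"), ("2", "2"), ("6", "4")]  # (gold, predicted)

    exact = sum(g == p for g, p in pairs) / len(pairs)

    # Lenient scoring credits off-by-one counts as "good efforts" --
    # the bias that makes the demo look better than it is.
    def near_miss(g, p):
        return g == p or (g.isdigit() and p.isdigit()
                          and abs(int(g) - int(p)) <= 1)

    lenient = sum(near_miss(g, p) for g, p in pairs) / len(pairs)
    print(exact, lenient)  # 0.5 vs 0.75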

The images in CLEVR are artificial, the questions come from a grammar generated over the scene graphs, and those graphs are what is actually being learned and mapped onto the images.
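
For anyone who hasn't looked inside CLEVR, here is a rough sketch of the shape of the data (field names approximated from memory, not the exact schema):

    # Rough sketch of a CLEVR-style scene graph (field names approximated
    # from memory, not the exact schema).  The image is rendered from this
    # structure and the questions are generated by a grammar walking it,
    # so the graph, not the pixels, is the real ground truth.
    scene = {
        "objects": [
            {"shape": "sphere", "size": "large",
             "material": "metal", "color": "gray"},
            {"shape": "cylinder", "size": "small",
             "material": "rubber", "color": "red"},
        ],
        # relationships["left"][i] lists the objects to the left of object i
        "relationships": {"left": [[], [0]], "front": [[1], []]},
    }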

There are also systematic issues: large metal spheres generate reflections that deceive the classifier, metallic cylinders are often classified as spheres, and blocks that don't present a "square" aspect are often ignored.
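
Failure modes like these are easy to surface with a per-attribute confusion matrix rather than a single accuracy number. A minimal sketch (Python; the predictions are invented to mirror the sphere/cylinder confusion):

    from collections import Counter

    # Invented (gold, predicted) shape pairs mirroring the failure mode
    # above: metallic cylinders drifting into the "sphere" bucket.
    preds = [("sphere", "sphere"), ("cylinder", "sphere"),
             ("cylinder", "sphere"), ("cylinder", "cylinder"),
             ("cube", "cube"), ("cube", "cube")]

    for (gold, pred), n in sorted(Counter(preds).items()):
        print(f"{gold:>8} -> {pred:<8} x{n}")
    # Aggregate accuracy (4/6) hides that two of three cylinders were
    # called spheres; the systematic error only shows up per cell.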

The big issue for me is that what looks like a significant capability, one that might map into a useful tool, turns out to be essentially an "exploit" of my own psychology; even a small step towards a "mind" in a machine ends up looking like yet another set of complex transformations over the intents of the humans who built it.


How well does deep learning work in Julia and how mature are the toolboxes?

There is a new VQA model for CLEVR published at Neural Information Processing Systems this year that gets better results than MAC, but they used specialized visual features. They didn't really explore the impact of using those, and I wonder if some of their performance comes from using them to overcome the systematic issues you brought up.


Interesting - I'll have a look for that paper.

There are some good DL packages in Julia. Knet is very elegant, but I have to admit (as a Julia fan) that there are challenges in getting things to run properly - I had a long battle with the linker to get the MAC demos working, resolved in the end with a single line of code, but still... I also used the MXNet wrapper to build an LSTM over the summer, and that was much smoother (the network itself didn't perform well on the task I had for it, but that's another issue).



> without determining, for example, that pictures of a Boston terrier and a Chihuahua both represent dogs.

I wonder if that is an artifact of how the training occurs. If you look at how humans generally learn, they learn what a dog is first; they don't learn about breeds for many years (possibly ever). Cats and dogs eventually earn the shared labels "pet" and "animal".

However, at least in a heterosexual unit, we learn "mama" and "dada" before "woman", "man", and "person". Eventually, though, we learn to apply many labels to people: "name", "gender", "species" and other, possibly horrible, things.

In order for networks to share representations of hierarchical labels, assuming that humans are doing nothing novel in this problem space, they would have to do one of the following (a sketch of the third option follows the list):

* Learn general labels first.

* Be trained with hierarchical labels as output.

* Be trained to output multiple labels ("Labrador", "dog", "animal").

As a guess.
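
To illustrate the third option, a minimal sketch (Python; the hierarchy and label set are toy examples, not from the paper) of expanding a leaf label into a multi-hot target so every level is trained at once:

    # Expand a leaf label into a multi-hot target covering its ancestors.
    parents = {"Labrador": "dog", "Chihuahua": "dog",
               "dog": "animal", "cat": "animal"}
    labels = ["Labrador", "Chihuahua", "dog", "cat", "animal"]

    def ancestors(leaf):
        chain = [leaf]
        while chain[-1] in parents:
            chain.append(parents[chain[-1]])
        return chain

    def multi_hot(leaf):
        active = set(ancestors(leaf))
        return [1.0 if lbl in active else 0.0 for lbl in labels]

    print(multi_hot("Labrador"))  # [1.0, 0.0, 1.0, 0.0, 1.0]
    # Train per-label sigmoids (binary cross-entropy) against these
    # targets instead of a single softmax over leaf classes.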


I'm not sure the similarity found between images of noise is unwarranted. They are pseudo-random, not purely random: each image is not independent, but a pile of sequential draws in a row, perhaps 16384 of them. There are thought to be shortcuts the NSA uses to quickly short-circuit encryption, and it is alleged to have salted public methods so as to make that job easier. Random-number generation and encryption are related to each other: a properly encrypted chunk of data looks nearly exactly like pure random noise, as does a good pseudo-random number stream. I would not be surprised if there were similarities that the mathematical methods find and the human eyeball does not.
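
As a toy illustration that pseudo-randomness carries structure a statistic can find where the eyeball can't: the low bit of a power-of-two-modulus LCG strictly alternates, which a one-line test catches even though the rendered noise looks fine (Python; the constants are the well-known Numerical Recipes ones):

    import os

    # Weak LCG: with a power-of-two modulus and odd c, its lowest bit
    # strictly alternates -- invisible in a rendered noise image.
    def lcg_low_bits(seed, n, a=1664525, c=1013904223, m=2**32):
        x, bits = seed, []
        for _ in range(n):
            x = (a * x + c) % m
            bits.append(x & 1)
        return bits

    def flip_rate(bits):
        flips = sum(b1 != b2 for b1, b2 in zip(bits, bits[1:]))
        return flips / (len(bits) - 1)  # ~0.5 for genuine noise

    print(flip_rate(lcg_low_bits(42, 10000)))             # 1.0, perfectly periodic
    print(flip_rate([b & 1 for b in os.urandom(10000)]))  # ~0.5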

This feels like a minimum description length problem. I think that if the agents had to use hierarchical descriptors - thinking of a cat as some assembly of tail, legs, body, head, eyes, ears, mouth, and so on - an internal hierarchy would show up in the communication, and a divergence between the training and communicated hierarchies would give a sharper contrast for revealing an inferred structure.
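
One way to get at that contrast is to represent both the training hierarchy and the hierarchy implied by the messages as sets of ancestor pairs and score their overlap. A toy sketch (Python; both trees are invented):

    # Compare two hierarchies via their ancestor pairs (toy trees).
    def ancestor_pairs(parents):
        pairs = set()
        for node in parents:
            p = parents.get(node)
            while p is not None:
                pairs.add((node, p))
                p = parents.get(p)
        return pairs

    training = {"tail": "cat", "legs": "cat", "cat": "animal"}
    communicated = {"tail": "cat", "legs": "animal", "cat": "animal"}

    t, c = ancestor_pairs(training), ancestor_pairs(communicated)
    print(len(t & c) / len(t | c))  # Jaccard overlap: 0.8; divergence: 0.2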


Like many studies, in retrospect it makes sense that agents would gravitate to the simplest method, 1:1 comparison, rather than "learning" what is actually depicted in the image. It's like the difference between how your brain tackles those spot-the-difference games with two nearly identical photos versus a game where you try to name all the distinct items shown in an image.
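
A toy version of that shortcut (Python with numpy; the "images" are just random arrays) shows how far raw pixel matching goes with zero notion of content:

    import numpy as np

    rng = np.random.default_rng(0)

    # Ten candidate "images" and a slightly corrupted copy of one of them.
    candidates = rng.random((10, 32, 32))
    target = 7
    query = candidates[target] + rng.normal(0, 0.01, (32, 32))

    # Raw L1 pixel distance picks the right candidate every time --
    # no notion of what the image depicts is ever needed.
    dists = [np.abs(query - c).sum() for c in candidates]
    print(int(np.argmin(dists)) == target)  # True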



