Understand what AI sees (hackernoon.com)
100 points by dirtPUNK on Aug 25, 2018 | 42 comments



To state the obvious, AI doesn't see anything. What people call AI is simply a statistical model, a system of equations. Solving it means finding values that satisfy all the equations simultaneously. The model doesn't see anything, because it doesn't exist as such. It is simply a series of calculations done on a computer or on a piece of paper.


Then biological brains don’t see anything either.


I’m not sure your conclusion follows. We agree that we “see” things because we have the conscious experience of doing so and we talk about it.

Nothing remotely analogous happens with current AI algorithms.


>We agree that we “see” things

That's precisely the epistemic problem. We agree that we see things. Since "we" have no experience of being an AI, we have no way of affirming or denying what "their" experience may or may not be (apart from mere chauvinistic dismissal, ie "'they' [and their hardware+software] aren't like us [and our brains], therefore 'they' can't be conscious").


They don't have a conscious experience.


I happen to agree with your conclusion, but since we're not part of "they," neither of us can know for sure. That's my point.

Merely by examining the hardware, the human brain doesn't look like it should support consciousness either. Our inability to identify consciousness by inspection does not deny consciousness in vivo; why should it do so in silico?

Re: burden of proof, let me be clear about what I'm saying here. In silico consciousness has not been proven or ruled out, because we have not yet developed a material definition of consciousness, ie one that can be applied merely by examining the hardware. It's not that we've disproven it (as "they don't have conscious experience" suggests), it's that we don't yet even know what we should be disproving!

It's not like we have a Consciousness Detector Box ala C&H, which flawlessly classifies all human brains as conscious merely by examining the configuration of the atoms in the box. All we have are functional definitions which look at human behavior.

If we don't even know conscious hardware+software when we see it (namely our brains), how can we say for sure whether X is or isn't conscious, for any arbitrary X?


> because we have not yet developed a material definition of consciousness, ie one that can be applied merely by examining the hardware

It's not a metaphysical puffy thing (unexplainable or inscrutable) or a property of the brain itself - it's the ability of an agent to act in an environment in a way that maximises its rewards. Biological rewards are tied to survival and self reproduction. So consciousness is what happens when there is an agent, an environment and a stream of rewards to be gained, where the agent learns to understand its situation and acts in an intelligent way, learning from its past mistakes and experiences. All these concepts are covered by unsupervised learning and reinforcement learning. That's a material definition of consciousness.
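
To be concrete about the agent/environment/reward framing (this is just a toy sketch of the RL loop, not a claim about consciousness; the little world and its parameters are made up for illustration):

    import random

    # A tiny "walk right to reach food" world: positions 0..4, food at the right edge.
    N_STATES, ACTIONS = 5, [-1, +1]
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}  # action-value estimates
    alpha, gamma = 0.5, 0.9

    for episode in range(300):
        s = 0                                # the agent starts at the left edge
        while s != N_STATES - 1:
            a = random.choice(ACTIONS)       # act in the environment (explore)
            s2 = min(max(s + a, 0), N_STATES - 1)
            r = 1.0 if s2 == N_STATES - 1 else 0.0   # the stream of rewards
            # learn from experience: nudge the estimate toward reward + discounted future value
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
            s = s2

    # After training, the greedy policy is "always move right, toward the reward".
    print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)])

Whether that loop amounts to anything like consciousness is, of course, exactly what's being argued here.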


Of course they do. Our neurons generate conscious experience, and, assuming there's nothing superphysical about our own biology, it's natural to believe that all matter has conscious experience in some shape or form.


Can you define that?


So self awareness is a prerequisite for seeing? Animals and infants can’t see?


It follows that, because insects see things and are not generally seen as conscious beings, conscious experience of seeing is not a requirement of seeing.

"Sì, abbiamo un'anima. Ma è fatta di tanti piccoli robot – "Yes, we have a soul, but it’s made of lots of tiny robots." Dan Dennet [0]

[0] https://www.goodreads.com/quotes/191761-some-years-ago-there...


Not sure what your point is. Humans can see. So can most animals. We have plenty of evidence of that. We have absolutely no evidence that inanimate objects can see. If they could, it is reasonable to expect that we would have some evidence of it by now. The fact that we haven't got any leads us to conclude pretty solidly that they can't.


What vision processing task can most animals do that no computer can?


Since seeing means perceiving things in space, for starters it requires having a notion of 'thing' and of 'space'. Do computers have such notions? Again, based on the evidence that we have, or lack thereof, regarding the cognitive capabilities of inanimate objects, I say the answer is a resounding no. Inanimate objects do not think and therefore cannot have concepts. They cannot do anything, strictly speaking, since they lack agency. We use them to perform certain tasks, which is different.


To answer your first question: yes, modern computer vision algorithms do have notions of space and object tracking. But if your definition of "seeing" requires metacognition and agency... I suppose no algorithm I know of does that. Although that rules out most animal visual systems from "seeing" too.
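
For a concrete (and entirely illustrative, library-free) sketch of what "object tracking" means here: the system keeps persistent identities for detected things by matching each new detection to the nearest previously tracked position, which is at least a primitive notion of "thing" and "space".

    import numpy as np

    class CentroidTracker:
        # Assign persistent IDs to detections by nearest-centroid matching.
        def __init__(self, max_dist=50.0):
            self.next_id, self.objects, self.max_dist = 0, {}, max_dist

        def update(self, centroids):         # centroids: list of (x, y) detections
            for c in centroids:
                c = np.asarray(c, dtype=float)
                if self.objects:
                    oid, prev = min(self.objects.items(),
                                    key=lambda kv: np.linalg.norm(kv[1] - c))
                    if np.linalg.norm(prev - c) < self.max_dist:
                        self.objects[oid] = c        # same "thing", new position
                        continue
                self.objects[self.next_id] = c       # a new "thing" entered the scene
                self.next_id += 1
            return self.objects

    tracker = CentroidTracker()
    tracker.update([(10, 10)])    # frame 1: object 0 appears
    tracker.update([(14, 12)])    # frame 2: matched as the same object 0

Whether that bookkeeping counts as "having a notion" of things is the philosophical question, but the bookkeeping itself is real.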


This is absolutely true.


I was actually giving my friend an example of ML vision applications using the hotdog app. The underlying network probably only recognises a red shape surrounded by lighter parts, as opposed to understanding conceptually what a hotdog is. The USB example is of course an extreme case of that.


Eh? What does conceptually mean?


A human sees a 3D scene with objects, creatures, texture and lighting (and it evaluates the scene based on these concepts and how they relate to each other, even if it's never seen green fields, sheep, dry stone walls or fog before).

The computer generally sees a set of pixel values, and it takes plenty of training to distinguish between "sheep" and "field the same shade of green as usually found in images containing sheep", because it doesn't have an innate concept of animate objects and habitats, how they relate to each other and which is more important. Whilst the computer's busy seeing a white pile of stones as a false positive for the presence of sheep, the human's looking at the way the stones are piled as possible evidence of human activity, and noting that the droppings in the foreground might mean sheep were here recently.

(Of course, it's not entirely impossible for computer vision systems to deal with higher levels of abstraction: autonomous vehicles model the world in 3D and classify objects as vehicles in order to predict their near-future behaviour, and as signals in order to regulate their own behaviour, but that goes well beyond mere learning processes. And of course a pixel-by-pixel understanding of the world has its use cases in spotting changes in colour and texture so subtle that humans abstract them away, like crop discolouration on satellite images or cracks in rough surfaces.)

We're much better at abstraction than other animals, too: show us a 20,000-year-old cave painting and we'll easily grasp that it was produced by humans and that the lines represent shapes of animals broadly similar to today's livestock. The same goes for 2,000-year-old marble bas-reliefs. You might well be able to train an algorithm to recognise "paintings of animals" and "carvings of animals", but you'll struggle with a training set consisting purely of photographs of real-world livestock.


The human visual system is a lot more hacky than you might intuitively expect. It's really, really easy to mess with.

https://en.wikipedia.org/wiki/Optical_illusion


"A human sees a 3D scene objects, creatures, texture and lighting (and it evaluates the scene based on these concepts and how they related to each other even if it's never seen green fields, sheep, dry stone walls or fog before)."

Our eyes are not that different from cameras - they also have a set of pixels that can take on values; they are not as regular and maybe the values are not as discrete, but it is not the retina that sees objects or textures - there are just some neural layers that do the pixels->objects computation.


That's actually not at all how the eye works. We saccade a tiny spot around the scene based on our semantic intent (it's how, for example, you can see a hole to the sunny outside from inside a cave, while no camera will be able to manage the white balance). Then we have specific hardware doing feature extraction in the early visual system and feeding into the vision process.

Finally the semantic interpretation feeds deeply into the vision system. For example, though we have binocular vision you can only get stereopsis via parallax basically as far as you can reach -- after that you use semantic clues in the scene to understand that a barn is bigger than a person so that the person must be closer.


> Our eyes are not that different from cameras

Our eyes are plenty different. Above all, they are driven by the neurons behind them to scan the scene as the brain tries to figure out the details, whereas a neural network passively takes whatever feed the camera captures.

See, for example, an owl's head movements as it triangulates a prey's distance.

There's a lot more going on than just the vision part: a cascade of neural structures rather than one big uniform net, with regions dedicated to detecting edges and understanding depth, separate from and feeding into the classification region.

And we have structures somewhere to pick up differences from one scene to another, and dedicated neurons that react to change and movement in a scene independently of the brain's classification.

Oh, and it is also apparent that some superstructure does innate detection and supersedes learning, i.e. tests show mammals are scared by snakes even if they were never exposed to one, while the same doesn't happen with spiders, hinting that snake detection and fear are hardwired and not learned. Or at least learned by evolution rather than by the brain's neuronal plasticity.


The part about the movements - agreed (maybe we need to add this to machines to improve them) - but the rest is just about additional layers, and machines are not restricted to just one layer either.


Yeah, obviously human visual inputs are a finite set of data points from rods and cones, which might be considered roughly akin to pixels. But by "seeing" I'm clearly referring to what takes place in the visual cortex, which is incredibly efficient at converting those inputs into geometry and objects/creatures/expressions with qualitative associations, in a lossy manner, making heavy use of hardwired priors which are evolved rather than learned through evaluation against past sensory input (whilst at the same time apparently being entirely incapable of processing or storing the original sensory input values in a sufficiently discrete manner to replicate the pixel-by-pixel evaluation a computer vision system can achieve).


Nope, we have two eyes and the ability to change focus. This essentially means we are dealing with video with added depth perception. Recent motion really pops out because we are comparing what we see with what we just saw.

Self-driving cars with lidar are a much better representation of human vision than a single image. We also do well with photos, but that's a significant step down.


The AI sees data/markers/patterns that look like something it's seen before, as opposed to actually comprehending that it sees a tube of meat that people call a hot dog.

The best metaphor I can think of is the cognitive difference between navigating a transit station that has signs in your native language, and one that you spent a couple of hours learning on Duolingo - with the latter, you aren't really understanding anything, just associating a:b::x:y.


This might be another formulation of the "Chinese Room" argument: https://en.wikipedia.org/wiki/Chinese_room

If every action is the same -- that is, if you produce some actions which would have been produced if you "conceptualized" it rather than merely "memorized" it -- isn't that identical?

The only thing we can do in life is make decisions. Regardless of how they're derived, if those decisions are identical to yours, isn't that entity "you" in some sense?


>if those decisions are identical to yours, isn't that entity "you" in some sense?

If by "decisions" you mean every single nerve impulse in response to every possible set of stimuli, then that's pretty exacting. Every wobble while standing, every mouth movement answering any possible question, etc.

Also, how do you determine if the responses are "identical?" It's not like we can rewind reality and play it back, substituting yourself for an AI. And due to quantum nondeterminism, even if you played it back with no substitution your actions will diverge over time! If you're not considered identical to yourself, how is that a useful definition/test of "identicality"?

At the required fidelity, this thought-experiment is problematic both in theory and in practice. It obscures more than it illuminates imo.


Transit is probably not extreme enough because a:b::x:y is fine.

Figures of speech are probably a better example, at least if translated literally (Duolingo teaches the equivalent phrases, so it's easy to forget it doesn't teach the meaning).

“Der Tropfen, der das Fass zum Überlaufen brachte” (literally, "the drop that made the barrel overflow"). What is the origin story that makes the English equivalent about camels, anyway?


Anyone can try to step into a neural network's shoes at https://rach0012.github.io/humanRL_website/

It is still much easier for us, because we just need to connect new data to existing concepts.


I don't know what "conceptually" means on such a level that I can program it. But it probably has something to do with a network of differing representations we have for a hotdog. A low-level visual cortex representation (probably not too much different from artificial NNs). Representation as parts arranged in particular spacial order. A word. Related representations, like a process of making a hotdog. And so on.


I was always puzzled by the human experience of "seeing" - when you "see redness", for example. It was very hard for me to explain it to other people before I found that there is a special term for it, "qualia":

https://en.wikipedia.org/wiki/Qualia

And I have absolutely no idea how to make machines experience "qualia". Any sophisticated image/motion recognition is rather trivial stuff compared to achieving mysterious "qualia".


I believe models with attention are more appropriate for 'telling' what they're focusing on, at least if the question is which part of the image they're focusing on.


Attention is only (edit: typically) for sequential models (e.g. time series), not image-based convolution models.


Untrue! There are a number of valid strategies for using attention in non-sequential models. Attention is really just generating a mask to apply to a feature representation; it generalizes perfectly well to classical models.
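
To make that concrete, here's a minimal sketch (names and shapes are mine, not from any particular paper) of spatial attention over a CNN feature map: score every location, softmax into a mask, and reweight the features. The mask itself is the "which part of the image it's focusing on" signal the grandparent wants.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialAttention(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.score = nn.Conv2d(channels, 1, kernel_size=1)  # one score per location

        def forward(self, feats):                     # feats: (B, C, H, W)
            b, c, h, w = feats.shape
            logits = self.score(feats).view(b, -1)    # (B, H*W)
            mask = F.softmax(logits, dim=-1).view(b, 1, h, w)
            return feats * mask, mask                 # reweighted features + inspectable mask

    feats = torch.randn(1, 64, 14, 14)
    attended, mask = SpatialAttention(64)(feats)      # mask sums to 1 over locations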


Isn't this pretty far behind the state of the art for white-box neural nets, which would use the first layer's activations to do something similar?


Could you give some references for an ML novice to learn these techniques? (Papers, books, anything.) I've been studying ML, but I'm not advanced enough yet to know effective ways to learn state of the art methods.


I just try to follow along, I don't really get into the details. The bit I was referencing was from a blog post that was on here a while ago: https://ai.googleblog.com/2018/03/the-building-blocks-of-int.... That said, I've seen the following listed as good resources:

fast.ai

The Google ML crash course

Andrew Ng's Coursera course

If reddit's your thing, it also looks like there's a sub, https://www.reddit.com/r/learnmachinelearning/, that might be helpful.


We need to invent a quad-type machine (as opposed to the current binary model) where we have a "maybe" type of data structure.

A Yes and a No and a MaybeYes and a MaybeNo.
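
Something like this, maybe (purely illustrative; you don't need new hardware for it, just a four-valued type, and the thresholds here are made up):

    from enum import Enum

    class Truth(Enum):
        YES = "yes"
        MAYBE_YES = "maybe yes"
        MAYBE_NO = "maybe no"
        NO = "no"

    def classify(confidence: float) -> Truth:
        # Map a model's confidence score onto the four truth values.
        if confidence >= 0.9: return Truth.YES
        if confidence >= 0.5: return Truth.MAYBE_YES
        if confidence >= 0.1: return Truth.MAYBE_NO
        return Truth.NO

    print(classify(0.73))   # Truth.MAYBE_YES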


Other kinds have existed.

Ternary computers [0], where the most famous examples come from the Soviet Union.

And whilst I'm not aware of any computer that used the quaternary numerical system, it should be possible.

However, none of that necessarily means it's a good idea or necessary to perform quaternary logic.

[0] https://en.m.wikipedia.org/wiki/Ternary_computer


Fuzzy logic is a superset of conventional (Boolean) logic that has been extended to handle the concept of partial truth.
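
A minimal sketch of what that looks like in practice (using the standard min/max/complement operators; the example values are made up):

    # Truth values live in [0, 1] instead of {0, 1}.
    def f_and(a, b): return min(a, b)
    def f_or(a, b):  return max(a, b)
    def f_not(a):    return 1.0 - a

    looks_like_hotdog = 0.75   # "maybe yes"
    bun_visible = 0.5          # halfway
    print(f_and(looks_like_hotdog, bun_visible))   # 0.5 - partially true
    print(f_not(looks_like_hotdog))                # 0.25

    # Restricting values to {0, 1} recovers ordinary Boolean logic,
    # which is why fuzzy logic is a superset of it.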



