The article makes a big unfounded assertion in one of its first few sentences: "We learn them at a very early age, without being explicitly instructed by anyone and just by observing the world." We do not learn these inferences "just by observing the world"; we learn them by acting on the world and observing the results, and arguably they are simply not learnable by pure observation alone. There's some theoretical grounding for that (IIRC Judea Pearl's book Causality might be a good source) and some experimental grounding (e.g. the two-kitten experiment, for which I can't quickly find the actual paper, but it's described at https://io9.gizmodo.com/the-seriously-creepy-two-kitten-expe... ). Other fields like medicine also have a good understanding of the limits of what knowledge can be gained from purely observational studies versus experimental interventions. So it seems a bit misleading to write a whole article about "why machine learning struggles with causality" without even addressing the key difference between learning from observation and learning from interaction, which IMHO is a much more fundamental obstacle than everything the article mentions.
> we do not learn these inferences "just by observing the world"
We most certainly can, just not as well or as strongly as when we're able to influence the system under observation. You're speaking way too strongly and simplifying a complex mechanism down past anyone's expertise.
I don't think they're speaking too strongly. I think a lot of the time when we correctly infer causality without empirically interacting with the system, it's because we have built up significant categorical experience about more atomic systems we were able to interact with.
In my view, a lot of things that are noninteractively inferred are compositions of more fundamental things that required empirical experience. When you've had the causality of gravity thoroughly beaten into you at a young age, a lot of other things seem intuitive that would otherwise completely fall outside a framework for being unempirically learned.
Do you have a specific counterexample of causality you can infer without interaction or empirical experience of something related?
Caveats: I'm not a neurologist or psychologist, so this is mostly philosophical speculation on my part.
We think we're so smart because we causally understand the world, but it took us a very long time to collectively discover these principles. A human alone would not be so smart.
I think this confuses (fact) knowledge and the ability to recognize causal relations.
An individual human can never be smarter than that individual: our brains are not connected and we can't share capacity with others.
So, discovering causality is always an individual experience. And that likely happens by "playing" with "the world".
I think it's noticeable that smarter animals are more playful. Which is also a hint that points to the fundamental importance of interaction with the world as a prerequisite for "smartness". Additionally the capabilities of the "sensors" and "actors" that make interaction with the world possible in the first place seem to be crucial to develop "smart behavior".
The part about the "sensors" seems quite obvious. I think one can gain a better general understanding of some thing if one can "experience" it in more than one "dimension".
And the "actors" allow one to perform "experiments" with the things around one, and find out this way how that thing "works" or is supposed to be "used".
That's actually the behavior that can be observed in children of all kinds of "smarter" species. So it seems to be at least linked somehow to "smartness".
Empirical evidence is a special case of observation. If you observe the whole universe in its entirety, you could separate out moments that effectively followed whatever conditions you might set in a lab. You can't act on the system, but you can forever tune your models to match the infinite observations you could record. At that point the separation between observation-derived knowledge and experiential knowledge is meaningless (it's just hard to imagine a universal model manifesting without having used experiential knowledge along the way).
Whether you swing the bat or just watch it hit the ball into the sky, you have the prerequisites needed to reason about the interaction.
A more entertaining question is how a system comes to believe causality (i.e. comes to believe that things can and must have causes)
I agree. To say we're able to influence the system is to assume free will, which I believe is an undecidable problem. Ultimately this is just a matter of semantics, which makes this thread rather pointless.
It does not need to assume free will, which indeed is a much harder philosophical issue - it needs to assume that the decisions of the intervention are caused by factors outside of the system you're studying (which usually isn't the whole world), which is easy to get if you have some actual external influence, and very hard to determine if you're purely observing.
For example, if you're allocating patients into control and treatment groups based on a roll of dice - the dice don't have free will, but they are influencing the "system" of the patients and their treatment, while that system is not causally influencing the dice rolls.
For another example, if a baby is "experimenting" by babbling (a key part of language acquisition, https://en.wikipedia.org/wiki/Babbling), it's not necessary or relevant to decide whether free will is involved; the baby obtains useful experimental results about which sound experiences are caused by which attempts to move the tongue/mouth/etc, even if there's no free will and the attempts to move the tongue/mouth/etc are themselves deterministically caused by the sensory experiences of the baby.
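To make the dice example concrete, here's a toy sketch (all numbers invented, not any real trial): the die roll sits outside the patient "system", so the allocation ends up independent of anything about the patients.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    baseline_health = rng.normal(size=n)   # part of the "system" being studied
    die = rng.integers(1, 7, size=n)       # external to that system
    treated = die >= 4                     # allocate by the roll

    # Baseline health is statistically identical in both groups, so any later
    # difference in outcomes can be attributed to the treatment itself.
    print(baseline_health[treated].mean(), baseline_health[~treated].mean())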
I think we have a hard time disconnecting our personal experience with the observations we make. When we look at a photograph of a person riding a bicycle down a path, even if we've never ridden a bicycle we've likely been outdoors, stood on a path, felt wind when we moved, etc. We may not be able to accurately simulate the experience in our mind but we can get close.
On the other hand, the starting point for an ML system interpreting that same image is essentially a stream of scalar values that tend to demonstrate multiple layers of periodicity (3-4 byte intervals for RGBA and then another layer per line of rasterized image data and yet another per frame if it's a video).
Here's a quick experiment. Let the video linked below play for a five count (sound is essential but a bit intense so maybe moderate the volume first) so you have some confidence it's not just me playing a rude trick, then close your eyes for a five count. There's going to be a major change in the sound when you get near 'five', now try to imagine how the scene changed before opening your eyes again:
I think without any reference from an embodied perspective, we're asking ML systems to understand the sounds (which are also streams of scalar values that demonstrate periodicity) the same way we interpret the representation visually.
(Also if you enjoyed the example above check out these two channels, some of them are mindblowing)
An interesting related idea is inverse reinforcement learning. We watch how other people interact with a system to estimate a reward function and then we later test it out ourselves. Either directly with an environment or inside our own mental model of the environment. Simply "observing the world" can give us data about how to learn in an environment we've never interacted with before; only by watching others interact with it.
Maybe it's semantics, but interacting with a simulation of the world is actually more important. In a pure sense, this doesn't require any actual real-world interaction. This concept is usually referred to as offline and off-policy reinforcement learning.
If you're saying that interaction in any sense is important, I'd very much agree that unsupervised learning and supervised learning aren't equipped to handle reinforcement learning problems. Correct framing of a problem is necessary to achieve a desired property like causality.
That is a very, very strong statement, and such a strong statement of certainty requires some proof to go with it.
While we can learn about the ball's change of movement just fine without swinging the bat, we can only do that because we are generalizing from a large body of knowledge that we developed by experimenting on the world using our body.
I am not aware of a single piece of evidence that an agent can use purely observational learning to ever acquire causal knowledge of the real world to a sufficient level to make those sorts of inferences with any reasonable accuracy.
I'd link studies about children aping things they have only seen through a window, but I suspect you'll quickly argue out of the bounds of those studies and so I won't.
Right. "World" includes us. When we perform actions we observe ourselves doing them and then we observe the consequences.
That is the only way we can learn anything, by observing. And what else is there to observe than the "world"? We can observe our own thinking process but that is part of the world too. I would say. It is definitely not "out of this world" :-)
After interacting with the world, we can learn from passive observation. No human has ever learned anything about causation without first spending a while experimenting in the world.
I've read Pearl's book too and it wasn't clear if it was possible to systematically "observe interventions" instead of doing them yourself and what the rules would be for that.
I mean there was some stuff about instrumental variables but it seems the theory is a bit incomplete in that area.
What distinguishes an intervention vs just normal observable randomness? Does it have to do with the complexity of the entity performing the "intervention", with the fact that this entity can observe and act on knowledge? I guess it's kind of the debate about where determinism ends and free will begins. Are there mathematical bounds to help us sort it out though? Maybe there is something information theoretic? It's very unclear in my mind.
This is all about fitting generative models that are robust to counterfactual changes, that remain predictive even if you run your models with data you've never observed - beyond simple interpolation/extrapolation. Are there priors on model structure that tend to naturally make models much more robust to counterfactual changes and that make models work well beyond the data? Do these priors get more effective when they include some latent variables that distinguish between observations and interventions? How do you train these latent variables?
I think the problem with observing interventions is it will result in impractically large sample complexity to derive the same causal conclusion as causal calculus armed with assumptions of DAG structure.
I think metaphysically speaking the approaches (observing interventions vs causal calc) aren’t meaningfully different in terms of inferences you can make with infinite data, see my similar observations to yours: https://vladfeinberg.com/2019/12/01/metaphysics-of-causality...
But if you can presume a fixed DAG you can get away with fewer observations bc then you can derive some minimal/cheap set of vars to randomize over such that the resulting experiment measures a causal effect. All causal calc does is give you a framework for clarifying assumptions necessary to derive such a set.
In high dims performing randomization is exponentially costly.
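As a toy illustration of what the assumed DAG buys you (variable names and coefficients are made up): under Z -> X, Z -> Y, X -> Y, the backdoor criterion says {Z} is a sufficient adjustment set, so a plain regression on observational data already recovers the causal coefficient - no randomization needed.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    z = rng.normal(size=n)                   # observed confounder
    x = 2.0 * z + rng.normal(size=n)
    y = 1.0 * x + 3.0 * z + rng.normal(size=n)

    # Adjust for Z by including it in the regression alongside X.
    design = np.column_stack([x, z, np.ones(n)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    print(coef[0])   # ~1.0, the causal effect of X under the assumed DAG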
> What distinguishes an intervention vs just normal observable randomness?
The difference is your knowledge that the value of some variable is controlled and independent from everything else in this situation. If a causal model is some kind of puzzle, then you can see how it would work when some parts of the model are made dysfunctional by your intervention.
X causes Y means that changes in X lead to changes in Y: if you changed X, then Y would change as well. By observation you can witness correlation, which says nothing about causes. You could add a temporal element, like "X happened earlier than Y", which means that Y couldn't be the cause of X. But that doesn't answer the question about the cause, it just rejects some hypotheses about it, and even that is uncertain: we could easily devise an example where such a rejection would be a mistake. For example, people can act because they are able to predict something will happen - so something that has not happened yet can already be the cause of some events.
But when the changes to X were made by you, then you know exactly when X was changed, and you know that it was not changed in response to changes in Y or in some confounder Z. Here we can also make mistakes - for example, because we do not know exactly how we make decisions, our own behavior could be a confounder or mediator or something like that. So we devised randomized controlled trials and double-blind experiments to make the experiment really independent.
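To put numbers on it, a toy simulation (all coefficients invented): with a confounder Z in play, the slope you see observationally and the slope under do(X) come apart.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    z = rng.normal(size=n)                       # the confounder

    # Observational regime: X is partly driven by Z
    x_obs = 2.0 * z + rng.normal(size=n)
    y_obs = 1.0 * x_obs + 3.0 * z + rng.normal(size=n)

    # Interventional regime: we set X ourselves, cutting the Z -> X arrow
    x_do = rng.normal(size=n)
    y_do = 1.0 * x_do + 3.0 * z + rng.normal(size=n)

    print(np.polyfit(x_obs, y_obs, 1)[0])   # ~2.2, biased by the confounder
    print(np.polyfit(x_do, y_do, 1)[0])     # ~1.0, the true causal effect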
> I guess it's kind of the debate about where determinism ends and free will begins.
No need to go to that length. If your behavior is independent from the process you're researching, then you get a "practical free will": free from influences of the studied variables. This practical and situational definition of free will is enough for learning about causes.
It also depends on your model. If you do not need to estimate the entire distribution, but are fine with getting some moment estimates (aka - an average effect, the marginal of the expected output and so forth) then you do not even need independence. What you need is no correlation with confounders. Since this includes latent factors, the philosophical issue is, of course, the same.
Pearl has a particular pet model for causality, but there are other approaches.
I have yet to see someone actually apply Pearl's model in prominent research, but we are probably getting there.
I think the question that is really on your mind is the one of (causal) identification.
For ML researchers, this may be explained as follows:
How, and to what degree, can we find a "deep" parameter that is a general or constant relationship between our inputs and outputs?
Is this possible? When? Under which assumptions?
If we do this, THEN we can run counterfactuals with our model.
But as it turns out, to convince yourself that your model really generalizes, you will have to prove to yourself that there exists a set of general relationships (parameters) in the data - which may for example be general because they are causal parameters - and that your model is sufficient to "identify" these parameters given the data source.
You will quickly find that you need to examine both the model and the data (and the assumptions that you have about it). Generally, no "model" itself is ever causal. It's always the combination of model and DGP (data-generating process).
One way to have a causal model is to simply assume some physical relationship exists, and estimate parameters from data. This is called structural modeling in econometrics and scientific ML in, well, physics mostly.
You bring in prior science and knowledge - like how weather fronts move, or how prices relate to costs - and use this structure to constrain your model. If you do this well, then this may be enough to identify causal parameters! Or ranges thereof.
Here is a technical treatment of such issues:
Another way is to find some measure of interest that you can derive without a complete model. For example, you may be able to "identify" so-called average marginal/treatment effects of some input variable without a parametric model. Then, as long as the changes in your inputs are "exogenous" in some sense, you can get a causal parameter.
In experiments, the differences in inputs are randomized. What remains as group differences between placebo and treatment are the causal effects.
In observational data, you may look for "natural experiments". Here, some exogenous variation can be found that identifies a similar treatment effect between individuals that are affected, and those that are not. Depending on your issue at hand, further techniques may be necessary to identify causal effects.
Let's make an example:
You built your start up to a huge and successful company, made up of programming teams. As CEO, you finally think about realizing your dream: Firing every non-technical team lead (say, everyone with an MBA). What effect would this have on productivity in the short term?
Well, you have lots of data about your teams, and have certainly fired a lot of people, so you build a model.
That model tells you that firing such team leads is super good for productivity.
However, after you do the firing, you find that productivity drops. What happened?
Your model did not identify a causal effect. In this case, the firings you have observed probably occurred because the team lead was simply a bad manager. However, that data does not identify the "causal effect" of firing team leads. The "intervention" or the "treatment" you observed was not "exogenous".
Okay, let's do more science here. Let's say you try to find out the effect of technical ability of the team lead on team productivity. You run your fancy ML models, and again conclude that more technical ability of the manager makes for a better team.
But when you implement your effective training measures (or hiring measures, for that matter), the benefit is less than you expect. Again, what happened?
Well, you might have a selection effect. Technically competent team leads may select into productive teams. Or there is endogeneity: productive teams mean that the team lead learns technical stuff, instead of managing chaos. The effect is actually reversed: good teams make leaders more competent. Your estimation - your model - overestimates the causal effect. It is not sufficient to create a counterfactual, simply because you did not find the real causal chain.
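To put toy numbers on that selection story (everything below is invented): in the logged data, leads were fired because they were bad, so fired teams improve afterwards; fire leads at random - the actual intervention - and the measured effect flips sign.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    q_old = rng.normal(size=n)        # quality of the current lead
    q_new = rng.normal(size=n)        # quality of the replacement
    disruption = -0.5                 # true causal cost of a firing

    def productivity_change(fired):
        return fired * (q_new - q_old + disruption) + rng.normal(scale=0.1, size=n)

    # Observational regime: only bad leads get fired
    fired_obs = q_old < -1.0
    print(productivity_change(fired_obs)[fired_obs].mean())     # ~ +1.0, "firing looks great"

    # Interventional regime: fire leads at random
    fired_do = rng.random(n) < 0.5
    delta = productivity_change(fired_do)
    print(delta[fired_do].mean() - delta[~fired_do].mean())     # ~ -0.5, it actually hurts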
Some key words to look into that are probably simpler than Pearl's full framework (and more practical, as they are applied every day):
As someone who has only read 1/3rd of Pearl's book (for fun, just an engineer by day), this was a great answer and convinced me to pick it back up after surveying the non-Pearl world - thank you!
I should add that if you are interested in some other perspective, but maybe not in mostly technical discussions, then the authors Angrist and Pischke have a couple of books - one is even a comic book - about causal analysis.
TBH the idea that humans learn anything on a blank slate pure learning architecture is messed up. Our brains are literally evolved to interpret causality into the world. It isn't randomly just "learned" any more than walking is.
exactly, besides ignoring the innate structures, heuristics and biases hardcoded via evolution, the whole notion of "learning" became highly intertwined with reinforcement kind of learning, i.e trial & error, stimulus and response behaviorist terms popularized by Pavlov and Skinner a century ago, which is just one type in a large repertoire of adaptation mechanisms.
Memory in these models is used as an afterthought, or as some side utility for complex iterative routines based on the calculus of function optimization. In living organisms, by contrast, memory and its "hardcoded" shortcuts allow cutting through the search space quickly, like a large database index.
Speaking in database terms we have something like "materialized views" on acquired and genetically inherited knowledge, built from compressed and hierarchically organized sensory data and prior related actions and associations, including causal links. Causality is just a way to associate items in the memory graph.
Error correction doesn't play as much of a role in storing and retrieving information and in pattern recognition as current machine learning models may lead you to believe.
Instead, something akin to self-organized clustering is going on, with new info embedded in the existing "concept" graph via associations and generalizations, through simple LINK and JOIN mechanisms on massive scale.[1] The formation of this graph in long term memory is tightly coupled with sleep cycles and memory consolidation, while short term memory serves as a kind of cache.
Knowledge is organized hierarchically starting from principal components [2] of sensory data from e.g. visual receptive fields, and increasing in level of abstraction via "chunking", connecting objects A and B to form a new object C via JOIN mechanism, or associating objects A and B via LINK mechanism. Both LINK and JOIN outputs are "persisted" to memory via Hebbian plasticity.
All knowledge, including causal links, is expressed via this simple mechanism. Generating a prediction given a new sensory signal is just LINKing the signal with an existing cluster by similarity.
Navigation in this abstract space is facilitated by a coordinate system similar, or perhaps identical, to the one hippocampal place & grid cells provide for spatial navigation. Similarity between objects is determined as similarity between their "embeddings" in this abstract concept space.
It's possible that innate structures are genetically pre-wired in this graph which represent high level "schemas", such as innate language grammar which distinguishes e.g. verb from noun, visual object grammar which distinguishes "up" from "down", etc. It is also possible these are embodied, i.e. connected to some representation of motor and sensory embeddings. And serve to bootstrap the graph structure for subsequent knowledge acquisition. I.e. no blank slate.
The information is passed, stored and retrieved via several (analogue) means, in both point-to-point and broadcast communication, with electromagnetic oscillations playing the primary role in synchronization within neural assemblies - facilitating e.g. speech segmentation (or boundary detection in general) and coupling an input signal's "embedding" to existing knowledge embeddings in short-term memory - while neural plasticity/LTP/STDP serve as storage mechanisms at the single-neuron level.
This is something I discussed with a friend recently. I talked about how the key to unlocking more potential for AI was to stop sandboxing it away from the world and let the AI start interacting with it. And the immediate pushback to that is that it would cause immediate chaos. Can you imagine AI-driven cars that out of nowhere decide to brake-check just to see what happens?
Kids are often an unpleasant annoyance in restaurants, and many people that don't like that annoyance try to convince restaurants and lawmakers to ban them from restaurants. The problem with those ideas is that by banning kids from restaurants, you are just going to create annoying adults in restaurants over time. Kids are annoying in restaurants, but they are also learning how to interact with the world. If you don't find a way to let them explore boundaries, they never learn, and they'll become obnoxious restaurant patrons even as fully grown adults.
Which kind of goes back to ethical AI. You can't unleash unbounded AI on the world, or else you'll cause chaos. And you can't sandbox AI, or it will never truly learn. What are you supposed to do then? I don't know, but the answer isn't firing the ethical AI department because you don't want them criticizing your ad empire ;)
Only, AI doesn't even have object permanence, something that babies pick up in a few months, years before they can make causal inferences longer than a few seconds. We don't let babies drive cars, we give them toys that they can't hurt themselves with. We only allow that when they're smart enough to learn from verbal/written instruction, have the fine motor skills to operate a vehicle, and the situational awareness to operate a car safely. People making self-driving cars are basically putting infants in the driver's seat.
Makes you wonder how life operated near the beginning, before "how to behave sustainably" evolved. Before individual death, before sex, before speciation, before children, maybe even before genes.
The world must have seen some wild, explosive action in its day.
Immortal organisms still exist (two-headed planaria!), but on the whole the ecosystem seems pretty well calibrated by now.
Except AI isn't being 'sandboxed away' in any discernible way. A child being an unpleasant annoyance isn't comparable to an AI that can't drive, in either risks of failure or capability, because the child is far more capable of moderating his behavior to conform to social norms than an AI is.
There are a lot of simulators out there. It would not be out of the realm of possibility to set up a learning AI to play them over and over again and record the results.
I think, besides the sheer amount of work put in to program something like that, the main limitation would be processing power; that sort of thing would take an immense amount of it.
Now that I think about it, isn't this exactly what Tesla and other self-driving companies are trying to do?
So, it's not like it's sandboxed, it's just very hard to make this "kid" play with things.
I saw a study a while back showing that kids only a few months old will preferentially go to a person who helped their mum open a jar rather than one who refused.
There’s a hell of a lot of observation happening in kids minds very early on.
No doubt it’s easier once you can pick up a baseball bat yourself, but I have no doubt a young kid would understand basic objects connected together without ever having used them.
Exactly. People aren't born a blank slate. We have brain structures for understanding objects in the world and also language. A baby can automatically detect speech from other sounds. Evolution is a learning process it turns out.
I suspect that we have some of it built in, even before we observe or act. Something in us starts us on the road of the hypothesis that time is a thing we can reason about.
The illusion of causality is incredibly strong -- so much so that it's really hard to get people to give it up, even when faced with paradoxes (First Mover, quantum mechanics, multiple causation, etc.)
We don't come into the world with a fully formed sense of causality, and it appears over the course of months. But at least some of it may be wired into the hardware, like the language instinct. It's just that the wiring isn't done the day we leave the womb, and is influenced by what comes after.
Indeed, but it comes really early in development. A 2-month-old baby understands that if you smile at other people they will interact with you. And they'll try to use that to get more interaction / playing (in a very simple way), which also makes sense evolutionarily, because more attention from adults means a higher chance of survival.
Of course it makes a difference whether you observe someone acting or you are the actor yourself, but why couldn't you learn from someone else acting and observing the results?
You can think of "acting on the world" as resolving questions about the world, like "what happens when I do X"?
You could observe the results of others acting, but it means that the questions you're getting answers to are outside of your control. So if you need to know the answer to a particular question, you either need to test it yourself, or hope that whoever you're watching will test it for you.
It's possible to learn from someone else acting and observing the results only if you understand what the observed person is doing and can map it to yourself. Which babies, for example, cannot do. They like you making crazy faces, but when very young they don't learn how to do the same thing from that. I think for two reasons: first, they don't know what you did to make that face (which in the "experimenting yourself" case they would have known, even if it was a random action), and second, they can't map what they see to actions with their own body. Both need to be learned first before you can learn from observation, while learning from your own (random) interactions with the world is possible from day 1.
A key part there is agentive intent. Pure passive observation lacks the information about which parts of the observed result were intended and expected (by the actor) and which are irrelevant noise or side-effects. In some sense the equivalent of looking at the data of a study and knowing which column represents the "intervention" and which column represents the outcome of intervention - as opposed to pure observation of those data points without knowing which is which.
We can learn well from a demonstration, where someone else is acting and we correctly understand the intent and expected results (perhaps they've explicitly communicated that, perhaps we just know implicitly through our previous shared experience) and observe what happens.
However, if we observe someone else - at an entirely different skill level - who is essentially testing a hypothesis (and we don't know that, since we have no observation of the agent's internal state) and getting useful data about where their world model diverges from reality, then it's not nearly as helpful for us in determining where our own world model diverges from reality; that would require different actions and different data.
"...we do not learn these inferences "just by observing the world", we learn these inferences by acting on the world and observing the results..."
The article: "“Machine learning often disregards information that animals use heavily: interventions in the world, domain shifts, temporal structure — by and large, we consider these factors a nuisance and try to engineer them away,” write the authors of the causal representation learning paper. “In accordance with this, the majority of current successes of machine learning boil down to large scale pattern recognition on suitably collected independent and identically distributed (i.i.d.) data.”"
The key words are "interventions in the world." The article goes on to say, "“Generalizing well outside the i.i.d. setting requires learning not mere statistical associations between variables, but an underlying causal model,” the AI researchers write." The point being that, whether or not acting in the world is an essential condition for learning causality, current machine learning approaches are not even trying for causality.
> we learn these inferences by acting on the world and observing the results, and arguably they are simply not learnable by pure observation alone.
Also crucially, we learn these inferences by acting on the world and knowing something about why we acted. "I was playing with it" is a conditional independence statement that we use all the time while learning how things work, we just usually don't use the mathy language to describe it. We're running randomized controlled trials constantly, but implicitly.
Coincidentally, it's a common anecdote that people who seem to learn things quickly and deeply have this curiosity and will play with a thing/twiddle the knobs while they're learning how it works. When you're playing a video game and someone says "hang on, let me figure out the controls for a sec" they're changing the conditional independence structure of their observations and running an RCT.
Another thing which is often forgotten, we have evolutionarily adapted biases built into our learning machinery. That helps us to learn some things that tended to be essential for survival (much) more quickly, but it can also hinder us in learning some other things that we are not adapted for.
Held, R., & Hein, A. (1963). Movement-produced stimulation in the development of visually guided behavior. Journal of Comparative and Physiological Psychology, 56(5), 872–876. https://doi.org/10.1037/h0040546
Not saying you are wrong but I learned to ride a bike by watching my classmates be taught in preschool.
Maybe direct acting is a quicker learning method. Or maybe seeing others' learning processes and instructions allows one to leapfrog ahead by not duplicating mistakes.
Maybe always learn off a good dataset if it exists first?
> I learned to ride a bike by watching my classmates be taught in preschool
I'm confused, you're saying you watched others learn and then managed to get on a bike for the first time and properly ride it like an experienced rider with no practice whatsoever?
> First, you make a mental model. Then, you test it.
Except that isn't how that really works right? A mental model explains part of the reasoning that led to an outcome but never all. After all, the map is not the territory. A mental model is grounded in trivial assumptions. Ultimately, your brain produces inferences in a way that is incredibly hard to couple to a specific logical processes. There has been a lot of research on how experts think and reason and none of it is compatible with having a mental model whatsoever.
I'm not sure what you mean by "without having it coupled to the system under observation" - could you clarify?
I do agree that observation is a type of experience, but a model that is meant to guide action (basically any useful model) needs to be tested in action. I can't learn to juggle only by watching other people juggle. I can only develop a hypothesis about how one juggles, but to test (and refine) it is to try the hypothesis out.
A model being coupled to a system == a model that can influence the system's state through some means
> a model that is meant to guide action (basically any useful model) needs to be tested in action
No, it doesn't. For example, the vast majority of work on modeling the stock market is done on machines completely sandboxed from any ability to make trades, owned by companies that will never make a trade themselves but instead return an API response with a yes/no. Whether that is fed directly into some sort of automated action is largely irrelevant, as the ability of an individual trade to cause a measurable impact on the market is negligible until it isn't. So, these systems are built separate from the system they model and learn entirely through observation.
tl;dr: weather forecasting models don't have an action to take and also can't influence their system. And yet they learn and grow more accurate.
OK that's a fair criticism. Then perhaps we can divide models into those that influence the system they observe (regulatory systems) versus those that only measure, or whose influence is negligible. Models that aim to influence a system do indeed need to be used to test their efficacy.
It's not just about testing their efficacy it's about the theoretical limits of pure observation when doing causal reasoning. We know that we are better served by avoiding causal certainty when using purely observational studies. It seems like the base assumption should be that similar epistemic constraints apply to machine learning.
Yes the best way to understand a system is to interact with it. But there are scenarios where that simply isn't possible and yet we can still model causality, like the weather example musingsole gave.
And you can gain experience/understanding without needing to either observe or directly experience something, merely by thinking about it or being told about it. Not every kid has to suffer, or see, a burn caused by touching a hot stove to learn it's a bad idea to try it.
If you require actual experience or direct observation to learn, then you're not using your brain to its full potential.
You can certainly generalize previously gained knowledge without direct experience in that particular instance.
Would a child who has never experienced the human body's pain response be able to infer the causal connection between the heat of a stove and the response after touching it?
Arguably, language is a tool that allows us to generalize the direct experience of other agents. It is unclear if it is possible to remove direct interaction from a learning system and still reach the same level of understanding.
We do learn these by recognizing what actions usually come before an outcome. That's how we learn causality. That's also how we learn the wrong causality (the required outcome arriving after praying - that's how we created all the Gods).
Do we really see what we did to get this result? We often don't have any idea, and we put on whatever label we find most appealing and plausible based on our previous assumptions.
Machines can learn to do the same by going back one step by answering "What could have happened before" in terms of probability.
Both the article and the paper mention interventions and refer to Pearl’s work, so I think we can assume this is just a matter of poor phrasing and a child’s observations of its own interventions aren’t meant to be excluded.
But for the purposes of machine learning, it seems like it should be possible to learn from observations of someone else’s interventions?
Yeah, Judea Pearl's book is the definitive work. His formalism also has an aspect that you could say is similar to imagining an intervention in a counterfactual way, and computing its likelihood given the data. His paper "Why I'm only half Bayesian" lays out how he sees the epistemology of this.
I disagree here. We DO "learn them at a very early age, without being explicitly instructed by anyone and just by observing the world". The majority of information that a human ingests doesn't involve direct actions upon the world (in any more than a philosophical sense, anyway). We may feel as though actions are more memorable, but I claim that we observe, then take actions, not the other way around. We learn things all the time that are not "consciously learned". Addition (e.g. 2 apples are more than one apple) is learned by most humans in an unsupervised manner far before labels are added later. I claim that early childhood development, and thus much of our "foundation", is primarily rooted in having no explicit information AND not enough power to directly label data ourselves (through experiments, play, etc).
This is important because "learning without explicit instructions" in ML speak is unsupervised learning (clustering and dimensionality reduction). There are no labels except the ones that you decide upon yourself (cluster membership). Unsupervised learning is still in its infancy in terms of effectiveness compared to supervised systems, and it's no surprise that its algorithms are generally extremely easy to implement from scratch (e.g. K-means or DBSCAN) compared to relatively difficult work like automatic differentiation in neural networks.
Learning by reading information in a book or by direct didactic teaching would be supervised learning. Learning through a dialectical format would be reinforcement learning. Self-supervised learning would be equivalent to autodidactic learning and the creative act upon the world. (Maybe the distinction between self-supervised and reinforcement learning is arbitrary.)
The point is that we want to learn as much as we can given the information available to us. We should not rule out the role that the biological analogue to unsupervised learning plays in human development.
It is for all of these reasons that I become far more excited when a new clustering or dimensionality reduction algorithm comes out than I am when a new neural network architecture becomes state of the art.
Children do not learn completely "unsupervised" and receive frequent feedback from the agents around them. I would argue that a significant amount of childhood development (especially around "labels") is due to our hardwired attachment to human faces.
I have always felt that the significance of the "social software of human culture" in our general intelligence and learning capacity was underestimated by the AGI community.
So personally, I see more potential in communities of learning agents than any developments in the underpinnings.
> we learn these inferences by acting on the world and observing the results, and arguably they are simply not learnable by pure observation alone.
Tangentially, wouldn't sampling based ML methods like particle filters / kalman filters or other randomized state space exploration algorithms be analogous to the person learning by acting on the world? In this case, the "action" would be bouncing the radar off the object being tracked.
Of course these models are far more limited than a child in the way they can act on the world, and in the number of aspects of reality they can model.
And furthermore, they have no concept of causality, and represent only the current state of knowledge they are modeling.
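For what it's worth, here's a minimal 1-D Kalman update (toy numbers, not a real tracker), just to make the analogy concrete: each radar "ping" is an action whose echo tightens the state estimate.

    def kalman_1d(measurements, q=0.01, r=1.0):
        x, p = 0.0, 1.0            # state estimate and its variance
        for z in measurements:
            p += q                 # predict: uncertainty grows between pings
            k = p / (p + r)        # Kalman gain
            x += k * (z - x)       # update with the new echo
            p *= 1 - k
        return x

    print(kalman_1d([1.2, 0.9, 1.1, 1.0]))   # settles near 1.0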
Humans struggle with causality. Even those whose professions are dedicated to understanding causality struggle with it. That's the reason we have heuristics like the "5 whys" that only occasionally work. Consider the following philosophical problem:
An inattentive headphone-laden jaywalker wearing black crossed a road at night and was killed by a drunk driver speeding in a large SUV. What was the root cause?
The amusing thing about this question is that if you actually have an answer, it reveals more about your biases than it does about the situation and potential solution(s). Various people will chime in about the latest thing that annoys them...whether it is inattentive pedestrians, jaywalkers, pedestrians wearing black at night, drunk drivers, speeders, or people driving cars that are too big and dangerous. But all of them are wrong because there is no discernible root cause.
The problem is that we don't have multiple realities in which we can control each factor involved. And even if we did, it is entirely possible that if you can isolate and control each factor individually, that the accident still wouldn't have happened. Sometimes it takes a confluence of factors for an event to actually occur. And people's obsession with finding a single cause for complex phenomena hinders their ability to actually find fixable solutions.
The problem isn't the lack of multiple realities for counterfactual testing - it's that the idea that there's a basic root cause for any particular outcome is ill-founded.
The basic idea of 'look beyond immediate causes' is reasonable, but the cult of the root cause analysis is a bit out of control.
Perhaps you'd be happier with an idea like "when looking at a bad outcome, invest some effort looking at the proximal causes of the bad outcome to see if there's a place you can invest less net effort to fix the problem and for greater gain."
Obviously all root cause analysis terminates at "because the Big Bang and subsequent quantum fluctuations had this result" or something similarly utterly unactionable. But if you use the metric above, it is a common observation that such analysis can reveal higher bang-for-the-buck engineering outcomes than simply fixing the immediately obvious. There are also typical patterns that emerge, such as the root cause analysis eventually getting back to things that are infeasibly expensive or impossible to fix (e.g. "because human culture" will show up in a lot of them at some point, but you aren't going to fix that just because two holes were misaligned on the factory line), meaning such analysis also meaningfully terminates.
I tend to operate on this myself because you don't actually get "a" root cause analysis. I can always find a tree of causes as I go back, not "a" series of causes. But it's a fairly frequent occurrence that if you look over such a tree, there's at least one node with a highly favorable cost/benefit tradeoff that you can find with surprisingly minimal effort.
The problem is that there are atomic causes, not a root cause. The problem is we're acting like a singular force causes an event instead of several factors and a long chain. We need to dispel the myth that "x leads to y". It is only true in the simplest of cases and in restricted domains.
Funnily enough, this is what our "collective consciousness" is useful for. Humans act a lot like an ant colony, except individuals have much more autonomy.
So, you ask, say, 1000 people what the root cause is. The result will be decided either by majority or the most plausible argumentation. Say 700 people agree the jaywalker was at fault. That will become "reality" for the group, which will then likely spread throughout the hive.
The main lesson will be "don't jaywalk at night", the secondary one likely "don't wear black clothing at night" and probably a third one "beware of drunk drivers".
Sorry if that sounds weird, I'm also trying to wrap my head around what human intelligence is and how it can be applied to AI (Approximate Intelligence :)).
The problem with this is that there isn't a "root" cause. But there are atomic causes. If you do a component analysis you'll find that some components contribute more than others, but we humans LOVE to pin things down to singular causes. This gets even more difficult because some people are talking about the primary cause and just naturally shortening it to "the cause", while other people don't recognize that there are multiple causes and actually mean "_the_ cause". This is one of the fun limitations of language and of how bad we are at encoding.
Ironically, I would argue that just about everything that happens is an objective event, but any and every interpretation/understanding of causality is itself, subjective.
In the hypothetical scenario the objective reality is: A pedestrian and a car collided.
Other factors come into play, which all have potential influence on the causality of this objective reality, and said factors are frequently subjectively asserted as more relevant to causality than others.
Strong assertions to causality may include: Driver was drunk, pedestrian was jaywalking, pedestrian was wearing dark clothing, driver was speeding, driver was texting, driver and pedestrian had a prior altercation, driver was tired, etc
Weaker assertions to causality may include: It was a Tuesday, someone sneezed 3 miles away, a butterfly flapped its wings, pedestrian knocked over salt during dinner earlier, etc
Causality is hard... We're getting better at it, but superstitions are a thing that exist, even if they are a bit odd.
Yeah, I'd say that's true of most things that are too complicated to be understood by a single individual.
In the end, it's all about the collective. It needs a consensus to move forward, and that doesn't need to be perfect, just good enough.
The whole human civilization works mostly on subjective conclusions, sometimes a minority brings up enough facts/proof to change the established consensus, but often we just pile layers upon layers on top of things that are not objectively true or not completely true. But they're good enough, and we're very adaptable.
The article is about much simpler forms of causality though, like what happens when you hit a ball with a bat. Learning enough about causality to handle everyday physics would be an important advance.
> The amusing thing about this question is that if you actually have an answer, it reveals more about your biases than it does about the situation and potential solution(s). Various people will chime in about the latest thing that annoys them...whether it is inattentive pedestrians, jaywalkers, pedestrians wearing black at night, drunk drivers, speeders, or people driving cars that are too big and dangerous. But all of them are wrong because there is no discernible root cause.
Causality is usually complex and often complicated. Let's consider another case:
An attentive person without any sensory impairment crossed at a crosswalk in broad daylight and was killed by a sober bicyclist going ten miles an hour[1]. What was the root cause?
All of the same objections apply. If you want an ultimate root cause you need to turn to theology, in which case as a Christian I can say that all death is ultimately caused by sin. While I personally find that to be a philosophically sound position, it isn't especially useful for answering the pragmatic question of how to reduce preventable evils like (some?) pedestrian deaths. My personal bias in this case is to take a pragmatic approach. That means identifying factors that can be changed with a high rate of compliance at a cost that is less than that of the evil being prevented.
Mature safety systems take this into account. As an example, take basic firearms safety[2]: 1) Treat all guns as if they are loaded. 2) Never point a gun at anything you don't want to destroy. 3) Keep your finger off the trigger until ready to shoot. 4) Be sure of your target and what's behind it. In order to negligently discharge a firearm and cause harm, all four of these rules must be disregarded. In this case, even though any number of events from mechanical failure to muscle spasm could have caused the discharge, if someone causes unintentional harm I can say the local root cause was failing to observe the safety rules and hold said person morally responsible[3]. Many other examples of safety systems should come to mind, including relevant to the pedestrian safety scenarios.
[1] One way that this could plausibly happen is that the pedestrian is knocked over and hits his head. Falling and hitting one's head is a surprisingly common way to die.
[2] There are variants, but they all aim to achieve safety in much the same way.
[3] This holds regardless of one's position on civilian ownership of firearms, because police and military personnel are also human and need to follow a safety system.
The parent is clearly invoking the "uncaused cause" model of God. Physicalism doesn't have an originating cause except maybe the big bang depending on what you think could have happened before it.
So your options in terms of the originating cause are either to turn to religion for a philosophical answer, or to say "I don't know and we may never know".
That's not quite what the uncaused cause means. The uncaused cause is the answer to the question "if every effect has a cause, then where did the cause-effect chain begin?" - or in simpler terms, "what was the first thing that happened?"
Christians would say God, scientists would say the big bang (or something even before that, it's basically unknowable). They're not even mutually exclusive as you could say that the big bang is the means by which God created the universe, or any other origination story you can come up with.
The uncaused cause says nothing about later cause-effect pairs, only the originating cause.
I mean I don't want to put words in somebody else's mouth, but my interpretation as a non-practicing Christian knowing just a bit about Christian theology would be that death originates from humans being cast out of the garden of Eden, which was due to the original sin of eating from the Tree of Knowledge.
I think the parent edited their original comment as there was some stuff in there directly referencing God as the original cause before and that seems to be gone now.
Causality overall is hard. Humans fairly often fail to deal with causality perfectly. But humans tend to do far better than computers (failing partially rather than totally, etc.).
But your examples involve humans' linguistic expression of causality, which is an entirely different question.
> The amusing thing about this question is that if you actually have an answer, it reveals more about your biases than it does about the situation and potential solution(s). Various people will chime in about the latest thing that annoys them...whether it is inattentive pedestrians, jaywalkers, pedestrians wearing black at night, drunk drivers, speeders, or people driving cars that are too big and dangerous.
A lot of human language is "socially significant noise", which some might object usually isn't true. But it often serves an entirely different purpose than an accurate modeling of reality.
People walk through complex society, putting their socks on before their shoes and otherwise doing the basic things, but articulating positions that ... other people would consider "nuts". But this situation has nothing to do with a failure of causal modeling, which happens on a different logical level entirely.
We can make progress in solving this specific inference problem with a carefully constructed experimental design & human challenge trials. E.g. repeat variations of the scenario where we keep all factors the same but intervene and force one or more factors to something else, e.g. force the pedestrian to listen to a boombox, not headphones. Or force the driver to drive a trolleybus not an SUV.
Consider a full factorial design across the following dimensions:
inattentive | attentive ; headphone | none | boombox ; jaywalk | non jaywalk ; wearing black | wearing dark blue | wearing high viz reflective gear ; night | day | dawn | dusk ; driver sober | drunk | half as drunk | twice as drunk | on meth ; driver speeding | driving at speed limit | driving at half limit ; driving large SUV | hatchback | pushbike | trolleybus | road train .
A mere 2 * 3 * 2 * 3 * 4 * 5 * 3 * 5 = 10,800 different scenarios. Since the outcome in each scenario is stochastic we might need to do 10-fold replication per scenario. So a total of 108,000 trials. Approval from research ethics review committee pending.
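For anyone who wants to sanity-check the arithmetic, here's a quick sketch enumerating that (entirely hypothetical) design - the factor names are just my paraphrase of the list above:

    from itertools import product

    factors = {
        "attention": ["inattentive", "attentive"],
        "audio": ["headphones", "none", "boombox"],
        "crossing": ["jaywalk", "no jaywalk"],
        "clothing": ["black", "dark blue", "high-viz reflective"],
        "light": ["night", "day", "dawn", "dusk"],
        "driver": ["sober", "drunk", "half as drunk", "twice as drunk", "on meth"],
        "speed": ["speeding", "at limit", "half limit"],
        "vehicle": ["large SUV", "hatchback", "pushbike", "trolleybus", "road train"],
    }

    scenarios = list(product(*factors.values()))
    print(len(scenarios))        # 10800 scenarios
    print(len(scenarios) * 10)   # 108000 trials with 10-fold replication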
Arguably one reason causality research in the machine learning community hasn't boomed is there is no framework/ease of access to quickly code up the current heuristics/patterns/graph models like you can do with a deep learning idea using pytorch. Deep learning has reached today's stage because of early frameworks like Theano and Caffe. Access for beginners to a field is crucial for SOTA development, which although it feels a little counter-intuitive is nevertheless true. If you search for causality you get a bunch of books and papers from Pearl and Scholkopf, which are fun to read, but what can I actually do with that quickly?
> Arguably one reason causality research in the machine learning community hasn't boomed is there is no framework/ease of access to quickly code up the current heuristics/patterns/graph models like you can do with a deep learning idea using pytorch.
Ironically enough, it seems like you're confusing cause and effect here. The reason that there's little causation based reasoning isn't because there's no automation for it. Rather, the reason there's no automation for it is because it hasn't boomed.
The reason deep learning based ML for image recognition boomed is because you could take a fairly large database of images and categorizations and produce an impressive and testable system using straightforward if challenging optimization procedures. Because this approach has boomed, huge amounts of money have flowed to it, lots of people have been hired, and it's been semi-automated, so you have a combination of data and frameworks that let you do things quickly. Some high percentage of all the achievements of current ML is leveraging that original static ability to sort images (or sort buckets of bits) into different areas (AlphaGo sorts moves into "good" and "bad" and adds tree pruning, etc). Which isn't to discount it; it's the first sort of system that can seem "as good as human" in certain areas.
But when there's no similarly easy and impressive procedure for taking, say, a time series and getting the next result better than a human or traditional statistics can predict, there's no boom, no gathering of public data sets, no easy automation of the standard procedures, and so forth.
The 2012 Imagenet results which jumpstarted DL did not benefit from Theano or Torch frameworks. Alex Krizhevsky had developed his own GPU accelerated framework (cuda-convnet), and it remained quite popular for a couple of years after the competition, until Theano and Torch caught up with it.
You could argue that reinforcement learning policies are already causal models insofar as they relate state-action pairs to the rewards and penalties that they lead to. The trial and error that RL performs in a simulation is an exploration of counterfactuals to establish cause.
But the policies lack introspection. One of the most powerful things we could do is somehow extract causal models from those policies, to see what they learned that led them to behave more intelligently. That would increase both our knowledge and our trust in applying RL.
There's a paper that used a random convolutional filter when training the agent, and they found that it managed to generalize very well; when they evaluated the CNN they found that the model put emphasis on where the enemies were in each frame, which indicates some form of understanding.
However, I don't think that there is any form of causal relationship to be extracted for model-free agents. I don't believe that what we are seeing is anything more than changing action likelihoods in some very high dimensional function.
^^ If you really want to understand the issue from both the historical AI research & philosophical perspective, this is the article to read, and it has the benefit of being an entertaining read. Much of Dennett's work is similarly entertaining, and he's absolutely brilliant.
(Sorry for commenting on my own comment, I should have added this detail in the first place but it was too late to edit.)
I wonder if there are intermediary steps to getting better at causality in ML. Causality is an abstraction over a whole set of problems at a lot of different levels.
In terms of concrete problems that come earlier than causality: toddlers, I think, get object permanence before causality, and I think ML might struggle with that too.
Edit: then the next interesting thing after permanence is maybe obj path prediction, then you have an interesting basis for some level of causality inference because you have a prediction and some set of conditions that might disrupt the prediction.
Isn't it simple? There is no single root cause, ever. What causes something is elementary particles moving in certain ways together, always affecting each other. There is never a single "root cause".
A separate question is "Who is guilty?". "Who deserves credit?"
Here's my usual rant about ML not doing it because we don't give it the right data. You can't learn causation from a table of floats. You can learn causation from a sufficiently annotated table of floats.
If I tell you that here's a column of people's weights and a column of their diets, no model can learn the causal connection between the two. If I tell you that here's a column of people's weights and here's a column of their diets which were randomly assigned under intervention, then suddenly a model can do it.
All causal inference with observational data requires assumptions about conditional independence structure. It's so crucial that we always explain it all in prose in any writeup of any given causal investigation. Yet we put none of that in the tables themselves, despite it being entirely crucial. If we started making postgres tables that stopped looking like "height: float, weight: float" and started looking like "height: float, weight: do(float)" (as in Pearl's do(...)) then we could start to automate causal inference much, much more easily.
Not to say the types would be nearly so simple. You'd need a full DAG for your database, and even then, it's not that simple: our AB testing platform (v3.1) intervened here according to this 1k line python script (git commit 191284794) that took in these columns and employed a model trained on this other entire table as of date X, before we migrated the db. Also this one column's meaning changed in november when we removed a button from the home page.
But without some structured encoding of the structure that an analyst is going to need (structure they're absolutely going to be encoding in natural language in their writeup), we're trying to do it with one arm tied behind our backs.
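As a sketch of what that extra structure might look like (the Column class and its fields here are hypothetical, not any real library's API):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Column:
        name: str
        dtype: str
        intervened: bool = False            # was this column set by do(...)?
        assignment: Optional[str] = None    # how the intervention was assigned

    diet_table = [
        Column("weight", "float"),                                  # observed
        Column("diet", "float", intervened=True,
               assignment="randomized by AB platform v3.1"),        # do(diet)
    ]

    # A downstream tool could now tell from the schema alone that weight ~ diet
    # is an experimental contrast rather than a mere correlation.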
If normal humans struggle to tell the difference between correlation and causation, you can expect ML to struggle even harder.
Same goes for intuition, inference, induction, deduction.
I would surmise that scientists trying to implement causality are aiming to accomplish general intelligence.
I remember that in math, a teacher would use the "imply" relationship to explain causality, with rain, grass and a garden hose.
Rain or using the garden hose would always wet the grass, so the grass cannot then be dry. The grass being wet would imply rain or the hose, but it's impossible to know which.
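That ambiguity is easy to see if you just enumerate the worlds consistent with the observation - a toy sketch of the classroom example:

    from itertools import product

    consistent = []
    for rain, hose in product([False, True], repeat=2):
        wet = rain or hose                  # the assumed causal rule
        if wet:                             # all we observe: the grass is wet
            consistent.append({"rain": rain, "hose": hose})

    print(consistent)   # three possible worlds remain - rain, hose, or both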
There is some deep philosophy behind this. It actually reminds me of Anathem, a science fiction novel by Neal Stephenson about alternate realities. He goes out of his way to create a world that has a very similar mathematical and philosophical history to Earth. But everything is of course very different. Philosophers have different names, history is completely different, etc. A lot of the fun of reading the book is figuring out what is what.
It touches a lot of topics; including causality, quantum physics (and some popular interpretations of that like parallel universes), which is of course a fun topic to explore in a novel like this.
One of the premises in the book is that consciousness is basically a form of quantum computing guided by causal relations between "nearby" universes. Which would explain why the brain is so quick at taking decisions. Probably not true but it's a fun thing to explore in a novel like this.
But the notion that events are connected in time and that our brains are very good at spotting causal patterns between past, current, and future state of the observable world is a useful mental model.
Most programmers are familiar with the notion of solving problems in their sleep or have experienced the magical "solved it in the shower" kind of moment that comes from having immersed yourself in the problem and your brain adapting to fit the problem and produce a solution while you were sleeping. We don't understand how it happens; it seems to be connected to our brains just doing some kind of garbage collecting and processing while we sleep. But somehow after that happens we have a clear notion of what caused what and why and what to do to rectify the situation.
Once computers are able to do this, the singularity might be a little bit closer.
How well do differential neural nets perform in causal inference? There seems to be a pretty good library from SciML that claims to learn models from limited data, but I'm wondering if they generalize well.
>For instance, convolutional neural networks trained on millions of images can fail when they see objects under new lighting conditions or from slightly different angles or against new backgrounds.
The big picture is humans use a multi-task network for depth, segmentation (and background removal), lighting source estimation (and shadow removal), material extraction, SLAM (and geometry reconstruction), optical flow, etc.
Papers and their networks only look at a small part of what humans do; we are not just using a single "neural network."
This was a surprisingly well-informed article. It covers the i.i.d. assumption, in-sample vs out-of-sample data, and causal modeling. And it avoids "AI" hyperbole.