By "intervention" I mean interacting with the environment: propose a hypothesis, test it, modify it, test again. You can frame RL this way, though RL usually generates hypotheses that are far too naïve.
Yes, you need to let the robot play (interact with the environment) to learn the vision-versus-touch correlations, but you can do so in an unsupervised way (as long as you choose the environment wisely).
This looks like a good brief overview (I only skimmed it, but wanted to give you more than "lol, google it"): http://smithamilli.com/blog/causal-ladder/
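If it helps to see why interacting matters, here is a toy sketch of the difference between watching and intervening. Everything in it is made up for illustration (the hidden variable `z`, the 0.9 noise level, the function names); it is not from the linked post. A hidden cause drives both `x` and `y`, so passive observation shows a strong correlation, but setting `x` yourself reveals that `x` does nothing to `y`.

```python
import random

def sample(rng, do_x=None):
    """Draw one (x, y) pair from a toy world with a hidden common cause z."""
    z = rng.random() < 0.5                  # hidden confounder, never observed
    x = z if do_x is None else do_x         # observing: x follows z; intervening: we set x
    y = z if rng.random() < 0.9 else not z  # y depends only on z, never on x
    return x, y

def p_y_given_x(samples, x_val):
    """Fraction of samples with y true among those where x == x_val."""
    ys = [y for x, y in samples if x == x_val]
    return sum(ys) / len(ys)

rng = random.Random(0)
n = 20000

# Passive observation: just watch the world.
observed = [sample(rng) for _ in range(n)]
obs_gap = p_y_given_x(observed, True) - p_y_given_x(observed, False)

# Intervention: force x ourselves, alternating True/False.
intervened = [sample(rng, do_x=(i % 2 == 0)) for i in range(n)]
int_gap = p_y_given_x(intervened, True) - p_y_given_x(intervened, False)

print(f"observed gap:   {obs_gap:.2f}")  # large: x looks like it causes y
print(f"intervened gap: {int_gap:.2f}")  # near zero: it does not
```

The hypothesis "x causes y" survives any amount of passive data here and only falls apart once you poke the system, which is the propose/test/modify loop in miniature.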