You can usually come up with an explanation for why you did something. You don't have to "get to the bottom" of your own thought processes to do this: you just need to be able to reconstruct the symbolic manipulation part. This seems like a good thing for an AI to be able to do -- especially a truly "hard" AI that you'd trust to run things at a high level.
Humans often come up with an explanation for why they did something that is simple, consistent, and completely untrue. They sometimes even believe their own explanations. We may run into the same problem with an AI: intentionally or not, it may provide us with untrue explanations.
> participants fail to notice mismatches between their intended choice and the outcome they are presented with, while nevertheless offering introspectively derived reasons for why they chose the way they did
I, for one, think that would be a much more fascinating problem than implementing neural nets where the representation of the underlying data is heavily obscured.
If only it were possible. The only sufficiently powerful model for reasoning that comes to mind is Bayesian networks, and those suffer from the same problems as neural nets: nodes, edges, and values may not correspond to any meaningful symbolic content that would be useful to report. Again, this seems in line with the human mind; we invent symbols to describe groups of similar concepts, with borders that are naturally fuzzy and fluid.
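To make that concrete, here is a minimal sketch in plain Python (the numbers are made up and the variable names X0/X1 are deliberately arbitrary): a tiny Bayesian network answers a query perfectly well, yet nothing about its nodes, edges, or conditional probabilities has to correspond to a concept a human could name.

```python
# A tiny hand-built Bayesian network over two binary variables, X0 -> X1.
# The probabilities are invented; the point is that inference works fine
# even though nothing ties X0 or X1 to a human-nameable concept.
P_X0 = {True: 0.3, False: 0.7}                    # prior over the parent node
P_X1_given_X0 = {True:  {True: 0.9, False: 0.1},  # CPT: P(X1 | X0)
                 False: {True: 0.2, False: 0.8}}

def posterior_x0(x1_observed: bool) -> float:
    """P(X0=True | X1=x1_observed), by enumeration and normalization."""
    joint = {x0: P_X0[x0] * P_X1_given_X0[x0][x1_observed]
             for x0 in (True, False)}
    return joint[True] / sum(joint.values())

print(posterior_x0(True))  # ~0.66 -- a correct answer about... what, exactly?
```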
If the AI tells us: "I did it because #:G042 and #:G4285 belong to the same #:G3346 #:G4216, while #:G1556 and #:G48592 #:G4499 #:G22461 #:G48118", I don't think we'll have learned anything useful.
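Here is a small sketch of that failure mode in plain Python (the `gensym` helper and the example labels "dog", "wolf", etc. are made up for illustration): the same "reason" rendered twice, once over freshly minted machine identifiers and once over symbols we already share.

```python
import itertools

# Mint fresh, uninterned-style identifiers, much like Lisp's gensym
# (#:G042, #:G4285, ...): unique and internally consistent, but
# meaningless to a human reader.
_ids = itertools.count(42)

def gensym(prefix: str = "G") -> str:
    return f"#:{prefix}{next(_ids)}"

def explain(a: str, b: str, category: str, kind: str) -> str:
    """Render the model's 'reason' over whatever symbols it happens to have."""
    return f"I did it because {a} and {b} belong to the same {category} {kind}."

# Internal vocabulary: a private symbol per concept the model has formed.
internal = {name: gensym() for name in ("dog", "wolf", "taxonomic", "family")}

print(explain(*internal.values()))
# I did it because #:G42 and #:G43 belong to the same #:G44 #:G45.

print(explain("dog", "wolf", "taxonomic", "family"))
# I did it because dog and wolf belong to the same taxonomic family.
```

The structure of the two outputs is identical; the only difference is whether the symbols are grounded in concepts we already share, which is exactly what the gensym-style report is missing.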