Yeah, the Mixture of Experts might not have been called out by name, but it was pretty obvious you were getting different models depending on the question.
It goes to show how LLMs are nothing like AGI. I think combining one with a calculator is just a band-aid. A useful band-aid, but it's never going to be able to do science.
Sparse architectures are a way to, in theory, use only a small portion of a general model's parameters at any given time. All the "experts" are trained on the exact same data. They're not experts in the way you seem to think they are, and they're certainly not wholly different models. The "experts" work at the token level: the expert chosen for one token can be different from the expert chosen for the very next one.
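To make that concrete, here's a rough sketch of token-level routing, assuming a top-1 router in the style of Switch Transformer. All names and dimensions are made up for illustration, and real systems add gate scaling, load balancing, and capacity limits on top of this:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Minimal sketch of token-level top-1 expert routing (illustrative only)."""

    def __init__(self, d_model=64, n_experts=8):
        super().__init__()
        # Each "expert" is just a feed-forward block with its own weights,
        # trained jointly on the same data as every other expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                  # x: (n_tokens, d_model)
        scores = self.router(x)            # (n_tokens, n_experts)
        choice = scores.argmax(dim=-1)     # one expert picked per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = choice == e             # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

layer = MoELayer()
tokens = torch.randn(5, 64)
# Adjacent tokens can easily land on different experts:
print(layer.router(tokens).argmax(dim=-1))  # e.g. tensor([3, 0, 3, 7, 1])
```

The point of the sketch: the routing decision happens per token inside a single model, not per question across separate models.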
GPT-4 isn't "nothing like AGI" any more than its dense equivalent would be.
I don't see how an LLM using many experts makes it very different from AGI. Why would anyone assume that human general intelligence isn't built on multiple models running in a similar architecture? At minimum, humans operate with a left and a right hemisphere, which process information very differently.