I am fairly certain that agents are the same thing as prompts (but also could be different).
Only the chat prompt/agent/whatever is connected to RAG; the image generator is DALLE and the router is a one-off call each time.
eg, it could be the same model with a different prompt or a different model + different prompt all together. AFAIU it’s just serving a different purpose than the other calls
Currently all the tools on the market just use a different prompt and call it an agent.
I've build a tool using different models for each agent, e.g. whisper for audio decoding, llava to detect slides on the screen, open cv to crop the image, ocr to read the slide content, llm to summarize everything that's happening during the earnings call.