They don't remember all the previous scenes, just ones which are "novel" enough.
But how do we decide whether the agent is seeing the same thing as an existing memory? Checking for an exact match could be meaningless: in a realistic environment, the agent rarely sees exactly the same thing twice. For example, even if the agent returned to exactly the same room, it would still see this room under a different angle compared to its memories.
Instead of checking for an exact match in memory, we use a deep neural network that is trained to measure how similar two experiences are.