This is basically how I respond to requests myself. Sometimes a single short sentence will cause me to slowly spit out a few words. Other times I can respond instantly to paragraphs of technical information with high accuracy and detailed explanations. There seems to be no way to predict my performance.



Early on, I noticed that if I ask ChatGPT a unique question that might not have been asked before, it'll spit out a response slowly, but repeating the same question results in a much quicker response.

Is it possible that you have a caching system too, so you're able to respond instantly with paragraphs of technical information to some types of requests you've seen before?


Yes, search for LLM caching and semantic search. They must be using something like that.
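
For the curious, a semantic cache works roughly like this: embed the incoming prompt, compare it against embeddings of previously answered prompts, and if the similarity clears a threshold, return the stored answer instead of running the model. A minimal sketch in Python, where embed() and run_model() are stand-ins for a real embedding model and a real LLM call (the hash-based embed() below only matches exact repeats; a real embedding would also catch paraphrases):

    import numpy as np

    cache = []  # list of (embedding, cached_response) pairs

    def embed(text):
        # Stand-in for a real embedding model: deterministic random
        # unit vector per string, so identical prompts match exactly.
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        v = rng.standard_normal(128)
        return v / np.linalg.norm(v)

    def run_model(prompt):
        # Stand-in for the actual (slow) LLM call.
        return "model output for: " + prompt

    def answer(prompt, threshold=0.9):
        q = embed(prompt)
        for emb, resp in cache:
            if float(np.dot(q, emb)) >= threshold:
                return resp           # cache hit: near-instant reply
        resp = run_model(prompt)      # cache miss: slow generation
        cache.append((q, resp))
        return resp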


I cannot tell if this comment was made in jest or in earnest.

As far as I understand, the earlier GPT generations required a fixed amount of compute per token inferred.
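
If that's right, response time roughly decomposes into a prefill pass over the prompt (cheap per token, since the prompt is processed in parallel) plus a fixed per-token cost for each generated token. A toy model, with made-up numbers rather than OpenAI's actual figures:

    def latency(prompt_tokens, output_tokens,
                prefill_s_per_tok=0.25e-3, decode_s_per_tok=30e-3):
        # Prefill handles the whole prompt in one parallel pass;
        # decode generates output one token at a time.
        return (prompt_tokens * prefill_s_per_tok
                + output_tokens * decode_s_per_tok)

    latency(1000, 50)   # ~1.75 s: long prompt, short answer
    latency(20, 500)    # ~15 s: short question, long answer

Under that model, the grandparent's observation isn't so mysterious: output length dominates latency, so paragraphs of input with a short answer can come back faster than a one-liner that triggers a long explanation.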

But given the tremendous load on their systems, I wouldn’t be surprised if OpenAI is playing games with running a smaller model when they predict they can get away with it. (Is there evidence for this?)
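
If they were doing that, the routing could be as simple as a cheap heuristic or classifier choosing which model serves each request. A purely hypothetical sketch (the model names and the difficulty() heuristic are invented for illustration):

    SMALL, LARGE = "cheap-model", "expensive-model"

    def difficulty(prompt):
        # Invented heuristic: treat long or code-heavy prompts as harder.
        score = len(prompt) / 1000.0
        if "def " in prompt or "```" in prompt:
            score += 0.5
        return score

    def route(prompt):
        # Send easy-looking requests to the small model to save compute.
        return LARGE if difficulty(prompt) > 0.5 else SMALL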



