This is basically how I respond to requests myself. Sometimes a single short sentence will cause me to slowly spit out a few words. Other times I can respond instantly to paragraphs of technical information with high accuracy and detailed explanations. There seems to be no way to predict my performance.



Early on, I noticed that if I ask ChatGPT a unique question that might not have been asked before, it'll spit out a response slowly, but repeating the same question results in a much quicker response.

Is it possible that you have a caching system too, so you're able to respond instantly with paragraphs of technical information to some types of requests you've seen before?


Yes, search for LLM caching and semantic search. They must be using something like that.
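
For the curious, a semantic cache works roughly like this: embed the incoming prompt, compare it against embeddings of previously answered prompts, and if the similarity clears a threshold, return the stored answer instead of running the model. A minimal sketch in Python, where embed() and run_model() are stand-ins for a real embedding model and a real LLM call (the hash-based embed() below only matches exact repeats; a real embedding would also catch paraphrases):

    import numpy as np

    cache = []  # list of (embedding, cached_response) pairs

    def embed(text):
        # Stand-in for a real embedding model: deterministic random
        # unit vector per string, so identical prompts match exactly.
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        v = rng.standard_normal(128)
        return v / np.linalg.norm(v)

    def run_model(prompt):
        # Stand-in for the actual (slow) LLM call.
        return "model output for: " + prompt

    def answer(prompt, threshold=0.9):
        q = embed(prompt)
        for emb, resp in cache:
            if float(np.dot(q, emb)) >= threshold:
                return resp           # cache hit: near-instant reply
        resp = run_model(prompt)      # cache miss: slow generation
        cache.append((q, resp))
        return resp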


I cannot tell if this comment was made in jest or in earnest.

As far as I understand, the earlier GPT generations required a fixed amount of compute per token inferred.
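
If that's right, response time roughly decomposes into a prefill pass over the prompt (cheap per token, since the prompt is processed in parallel) plus a fixed per-token cost for each generated token. A toy model, with made-up numbers rather than OpenAI's actual figures:

    def latency(prompt_tokens, output_tokens,
                prefill_s_per_tok=0.25e-3, decode_s_per_tok=30e-3):
        # Prefill handles the whole prompt in one parallel pass;
        # decode generates output one token at a time.
        return (prompt_tokens * prefill_s_per_tok
                + output_tokens * decode_s_per_tok)

    latency(1000, 50)   # ~1.75 s: long prompt, short answer
    latency(20, 500)    # ~15 s: short question, long answer

Under that model, the grandparent's observation isn't so mysterious: output length dominates latency, so paragraphs of input with a short answer can come back faster than a one-liner that triggers a long explanation.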

But given the tremendous load on their systems, I wouldn’t be surprised if OpenAI is playing games with running a smaller model when they predict they can get away with it. (Is there evidence for this?)
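
If they were doing that, the routing could be as simple as a cheap heuristic or classifier choosing which model serves each request. A purely hypothetical sketch (the model names and the difficulty() heuristic are invented for illustration):

    SMALL, LARGE = "cheap-model", "expensive-model"

    def difficulty(prompt):
        # Invented heuristic: treat long or code-heavy prompts as harder.
        score = len(prompt) / 1000.0
        if "def " in prompt or "```" in prompt:
            score += 0.5
        return score

    def route(prompt):
        # Send easy-looking requests to the small model to save compute.
        return LARGE if difficulty(prompt) > 0.5 else SMALL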



