
The token limit is 100% an artificial limitation. When ChatGPT first released last November, I took the opportunity to try pasting 3k-line codebases into it to get it to walk me through them, and it worked perfectly fine; putting that same code into the OpenAI tokenizer tells me it's ~33k tokens, way above the limits today. The reason they do this is that every token takes up ~1 MB of video memory, and that adds up real quick. If you had infinite video memory, there would be no "fundamental limit" to how long an LLM could output.

OpenAI then has two limits on inputs. The first, artificial one ensures that people don't get overzealous inputting too much; otherwise they'd hit the second, hard limit of how much VRAM their cards have. To the LLM itself there is no difference between characters from the chatbot and from the human; the only hard limiter is the total number of tokens. I tried this out by inputting a 4k-token string into ChatGPT as many times as I could, and it failed on the 20th input, meaning the hard limit is around 80k tokens. Converting this to VRAM gives us ~80 GB, which is exactly how much memory the Nvidia A100 card has.
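A quick back-of-the-envelope version of that arithmetic, using this comment's own rough assumptions (~1 MB of VRAM per token, 4k tokens per paste), not any published numbers:

    # Reproduces the estimate above; both figures are assumptions from this comment.
    tokens_per_paste = 4_000        # size of the repeated input string
    pastes = 20                     # it failed on the 20th paste
    vram_per_token_mb = 1           # assumed ~1 MB of VRAM per token

    context_tokens = tokens_per_paste * pastes              # 80,000 tokens
    vram_gb = context_tokens * vram_per_token_mb / 1_000
    print(context_tokens, "tokens ->", vram_gb, "GB")       # 80000 tokens -> 80.0 GB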




> When ChatGPT first released last November I took the opportunity to try pasting 3k-line codebases into it to get it to walk me through them and it worked perfectly fine.

A common technique to work around context-length limits is to simply keep only the most recent context that fits within the limit. It can be hard to notice when this happens, because oftentimes the full context isn't actually necessary. However, specific details from the truncated portion are genuinely lost. For example, if you ask the model to list the filenames back in the same order and the context was truncated, it would start from the first non-truncated file and drop the others.
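A minimal sketch of that kind of truncation, using OpenAI's open-source tiktoken tokenizer; the function name and the 4096-token window here are just illustrative assumptions, not what ChatGPT actually does internally:

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def truncate_to_window(conversation: str, max_tokens: int = 4096) -> str:
        # Keep only the most recent max_tokens tokens of the conversation.
        tokens = enc.encode(conversation)
        if len(tokens) <= max_tokens:
            return conversation
        # Everything before this point is simply dropped, which is why early
        # details (e.g. the first filenames you pasted) silently disappear.
        return enc.decode(tokens[-max_tokens:])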

> If you had infinite video memory there would be no "fundamental limit" to how long a LLM can output.

Well, you've certainly got me there. One of the big limits with the transformer architecture today is that the memory usage grows quadratically with context length due to the attention mechanism. This is why there's so much interest in alternatives like RWKV <https://news.ycombinator.com/item?id=36038868>, and why scaling them is hard <https://news.ycombinator.com/item?id=35948742>.
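For concreteness, here's a toy numpy version of vanilla attention; the (n, n) score matrix is where the quadratic memory comes from (a sketch, not how any production model is actually implemented):

    import numpy as np

    def naive_attention(Q, K, V):
        # Q, K, V: (n, d) for n tokens
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)   # (n, n) matrix -- the quadratic part
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V              # (n, d)

    # At n = 32,000 tokens, a 32k x 32k float32 score matrix is ~4 GB per head.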


FlashAttention has memory linear in sequence length. https://github.com/HazyResearch/flash-attention
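Roughly, it streams over blocks of keys/values with an online softmax, so the (n, n) score matrix is never materialized. A toy single-query sketch of that idea (not the actual fused CUDA kernel):

    import numpy as np

    def online_softmax_attention(q, K, V, block=64):
        # q: (d,), K and V: (n, d); extra memory is O(block), not O(n^2)
        m, l = -np.inf, 0.0                  # running max and normalizer
        acc = np.zeros(V.shape[1])           # running weighted sum of values
        for start in range(0, K.shape[0], block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            s = Kb @ q / np.sqrt(q.shape[0])        # scores for this block
            m_new = max(m, s.max())
            scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
            p = np.exp(s - m_new)
            l = l * scale + p.sum()
            acc = acc * scale + p @ Vb
            m = m_new
        return acc / l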


> The token limit is 100% an artificial limitation.

My understanding is that the token limit is an immutable property of the neural network once it has been trained, so it is definitely not an artificial limitation - unless you're suggesting OpenAI trained the NN with a higher token count and then released it with a limit that only allows smaller inputs? Which I guess is plausible, but I'm not sure why they'd do it, as they'd still be "executing" the same NN for every input, so it wouldn't save any compute.


I think you misunderstood the token limit. The chat interface doesn't block your input once the buffer fills up; it simply takes the final n tokens of everything you've shared. Plenty of users have observed this.

It will still function, but it loses context on anything you previously shared above the cutoff. And if you ask it about that earlier content, it will do its best to hallucinate a plausible answer about what might have been in your buffer before the cutoff.

Separately, you may have found a physical hard limit via a bug that crashes the system, but that's not what's meant by a token limit in LLMs. The token limit is a limitation of the architecture of the LLM itself.


> When ChatGPT first released last November I took the opportunity to try pasting 3k-line codebases into it to get it to walk me through them and it worked perfectly fine
Are these private codebases, or open source ones? If public, would you mind sharing a link to the ChatGPT session(s)?


The token limit depends on how positions were encoded during training; usually it's some sort of sin/cos function, which has the problem that inputs longer than what the model was trained on cause accuracy to plummet. It's also very likely that it just took the last part of your input as context.
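For reference, a sketch of the sin/cos encoding from "Attention Is All You Need"; the formula itself extends to arbitrary positions, but the model never learned to use positions beyond the lengths it saw during training:

    import numpy as np

    def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
        # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
        # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
        positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model/2)
        angles = positions / np.power(10000, dims / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe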



