Actually, I think adding the numerical value is not the right way to do it, because of what happens when you matmul those values - usually the right way would be to have low-dimensional character embeddings that are learnable in addition to the token embeddings.
The problem with pure numerical values representing a class as input to a neural network layer is that the byte-encoding number is going to be very hard for the transformer to memorize exactly, especially when numbers that are close to each other often don't share much meaning. Categories are usually encoded somehow, like a one-hot embedding layer or, more recently, a learned embedding, so that the different categories can be easily distinguished (different categories are close to orthogonal).
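To make that concrete, here is a tiny PyTorch sketch (my own illustration, not something from this thread): feeding the raw byte value into a linear layer makes 'a' and 'b' nearly indistinguishable, while a learned character table gives each id its own roughly orthogonal vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(8, 8)

# Raw numeric input: 'a' (byte 97) and 'b' (byte 98) produce nearly identical
# activations, so the network has to memorize tiny numeric differences.
x_a = torch.full((8,), 97.0)
x_b = torch.full((8,), 98.0)
print(torch.dist(linear(x_a), linear(x_b)) / linear(x_a).norm())  # roughly 1% difference

# Learned embedding: each character id gets its own trainable vector; even at
# random init, distinct characters are roughly orthogonal and easy to separate.
char_emb = nn.Embedding(256, 16)
e_a, e_b = char_emb(torch.tensor(97)), char_emb(torch.tensor(98))
print(torch.cosine_similarity(e_a, e_b, dim=0))  # close to 0 in expectation
```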
My prediction would be that using the numerical value directly would not work at all, and that learnable embeddings would work, but you would have to reserve that part of the token embedding for every token, which would hurt performance a lot on non-spelling tasks relative to just letting the whole embedding represent the token however the model sees fit.
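As a rough sketch of what "reserving part of the token embedding" could look like (all module names, dimensions, and the byte-padding scheme here are my own assumptions, not an existing implementation):

```python
import torch
import torch.nn as nn

class CharAugmentedEmbedding(nn.Module):
    """Reserve part of each token embedding for small, re-usable character
    embeddings of that token's spelling; the rest is a normal learned
    token embedding the model can use however it likes."""

    def __init__(self, vocab_size, n_embd, token_chars, char_dim=4, max_chars=16):
        super().__init__()
        # the "free" part of the embedding
        self.tok_emb = nn.Embedding(vocab_size, n_embd - char_dim * max_chars)
        # a single shared character table (256 byte values + 1 padding id)
        self.char_emb = nn.Embedding(257, char_dim, padding_idx=256)
        # precomputed (vocab_size, max_chars) table of byte ids per token, padded with 256
        self.register_buffer("token_chars", token_chars)

    def forward(self, idx):                           # idx: (B, T) token ids
        tok = self.tok_emb(idx)                       # (B, T, n_embd - char_dim*max_chars)
        chars = self.char_emb(self.token_chars[idx])  # (B, T, max_chars, char_dim)
        chars = chars.flatten(-2)                     # (B, T, max_chars*char_dim)
        return torch.cat([tok, chars], dim=-1)        # (B, T, n_embd)
```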
But, IDK! It would be easy enough to try! You should try it on a small toy model, then try a small learnable, re-usable character embedding, and write a blog post. I would be happy to coach / offer some GPU time / answer any other questions you have while building it.
Maybe make a metric of "count letter $L in word $word" - if you want to game it, you can choose words so that they all tokenize into 1-2 tokens and each token contains multiple occurrences of the letter $L.
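A quick sketch of how you might generate those eval examples with the GPT-2 tokenizer that nanoGPT uses (the helper name and prompt format are just placeholders I made up):

```python
import tiktoken  # GPT-2 BPE tokenizer, same as nanoGPT's default

enc = tiktoken.get_encoding("gpt2")

def make_spelling_examples(words, letter):
    """Build 'count letter L in word W' examples, keeping only words that
    tokenize into 1-2 tokens so the spelling isn't trivially visible."""
    examples = []
    for w in words:
        if len(enc.encode(w)) <= 2:
            prompt = f'How many times does the letter "{letter}" appear in "{w}"?'
            examples.append((prompt, w.lower().count(letter)))
    return examples

print(make_spelling_examples(["strawberry", "balloon", "letter"], "l"))
```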
And then use something like HellaSwag to measure how much you've lost on general text completion compared to a vanilla LLM with the same embedding size dedicated entirely to the token.
You would have to train the new model from scratch, since it would be all new token embeddings with whatever character encoding scheme you come up with. It would probably make sense to train a vanilla GPT from scratch with the same total embedding size as your control. I would start with https://github.com/karpathy/nanoGPT as a baseline, since you can train a toy (GPT-2-sized) LLM in a couple of days on an A100, which are pretty easy to come by.
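If you go with the re-usable character embedding, the per-token character table assumed in the earlier sketch could be precomputed once from the tokenizer vocabulary, roughly like this (again, just my assumed approach):

```python
import torch
import tiktoken

enc = tiktoken.get_encoding("gpt2")           # the tokenizer nanoGPT uses
vocab_size, max_chars, PAD = enc.n_vocab, 16, 256

# (vocab_size, max_chars) table of byte ids per token, padded with a spare id
token_chars = torch.full((vocab_size, max_chars), PAD, dtype=torch.long)
for tok_id in range(vocab_size):
    raw = enc.decode_single_token_bytes(tok_id)[:max_chars]  # the token's raw bytes
    if len(raw):
        token_chars[tok_id, :len(raw)] = torch.tensor(list(raw), dtype=torch.long)
```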