Actually, I think adding the numerical value is not the right way to do it, because of what happens when you matmul those values - usually the right way would be to have low-dimensional character embeddings that are learnable in addition to the token embeddings.
The problem with pure numerical values representing a class as input to a neural network layer is that the byte-encoding number is going to be very hard for the transformer to memorize exactly, especially when numbers that are close to each other often don't share much meaning. Categories are usually encoded somehow, like a one-hot embedding layer or, more recently, a learned embedding, so that the different categories can be easily distinguished (different categories are close to orthogonal).
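To make that concrete, here is a tiny PyTorch sketch (my own illustration, not something from this thread): feeding the raw byte value into a linear layer makes 'a' and 'b' nearly indistinguishable, while a learned character table gives each id its own roughly orthogonal vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
linear = nn.Linear(8, 8)

# Raw numeric input: 'a' (byte 97) and 'b' (byte 98) produce nearly identical
# activations, so the network has to memorize tiny numeric differences.
x_a = torch.full((8,), 97.0)
x_b = torch.full((8,), 98.0)
print(torch.dist(linear(x_a), linear(x_b)) / linear(x_a).norm())  # roughly 1% difference

# Learned embedding: each character id gets its own trainable vector; even at
# random init, distinct characters are roughly orthogonal and easy to separate.
char_emb = nn.Embedding(256, 16)
e_a, e_b = char_emb(torch.tensor(97)), char_emb(torch.tensor(98))
print(torch.cosine_similarity(e_a, e_b, dim=0))  # close to 0 in expectation
```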
My prediction would be that using the numerical value directly would not work at all, and that learnable embeddings would work, but you would have to reserve that part of the token embedding for every token, which would hurt performance a lot on non-spelling tasks relative to just letting the whole embedding represent the token however the model sees fit.
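As a rough sketch of what "reserving part of the token embedding" could look like (all module names, dimensions, and the byte-padding scheme here are my own assumptions, not an existing implementation):

```python
import torch
import torch.nn as nn

class CharAugmentedEmbedding(nn.Module):
    """Reserve part of each token embedding for small, re-usable character
    embeddings of that token's spelling; the rest is a normal learned
    token embedding the model can use however it likes."""

    def __init__(self, vocab_size, n_embd, token_chars, char_dim=4, max_chars=16):
        super().__init__()
        # the "free" part of the embedding
        self.tok_emb = nn.Embedding(vocab_size, n_embd - char_dim * max_chars)
        # a single shared character table (256 byte values + 1 padding id)
        self.char_emb = nn.Embedding(257, char_dim, padding_idx=256)
        # precomputed (vocab_size, max_chars) table of byte ids per token, padded with 256
        self.register_buffer("token_chars", token_chars)

    def forward(self, idx):                           # idx: (B, T) token ids
        tok = self.tok_emb(idx)                       # (B, T, n_embd - char_dim*max_chars)
        chars = self.char_emb(self.token_chars[idx])  # (B, T, max_chars, char_dim)
        chars = chars.flatten(-2)                     # (B, T, max_chars*char_dim)
        return torch.cat([tok, chars], dim=-1)        # (B, T, n_embd)
```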
But, IDK! It would be easy enough to try! You should try it on a small toy model, then try a small learnable, re-usable character embedding, and write a blog post. I would be happy to coach / offer some GPU time / answer any other questions you have while building it.
Maybe make a metric of "count letter $L in word $word" - if you want to game it, you can choose words so that they all tokenize into 1-2 tokens and each token contains multiple occurrences of the letter $L.
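A quick sketch of how you might generate those eval examples with the GPT-2 tokenizer that nanoGPT uses (the helper name and prompt format are just placeholders I made up):

```python
import tiktoken  # GPT-2 BPE tokenizer, same as nanoGPT's default

enc = tiktoken.get_encoding("gpt2")

def make_spelling_examples(words, letter):
    """Build 'count letter L in word W' examples, keeping only words that
    tokenize into 1-2 tokens so the spelling isn't trivially visible."""
    examples = []
    for w in words:
        if len(enc.encode(w)) <= 2:
            prompt = f'How many times does the letter "{letter}" appear in "{w}"?'
            examples.append((prompt, w.lower().count(letter)))
    return examples

print(make_spelling_examples(["strawberry", "balloon", "letter"], "l"))
```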
And then use something like HellaSwag to measure how much you've lost on general text completion compared to a vanilla LLM with the same embedding size dedicated entirely to the token.
You would have to train the new model from scratch, since it would be all new token embeddings with whatever character encoding scheme you come up with. It would probably make sense to train a vanilla GPT from scratch with the same total embedding size as your control. I would start with https://github.com/karpathy/nanoGPT as a baseline, since you can train a toy (GPT-2-sized) LLM in a couple of days on an A100, which are pretty easy to come by.
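If you go with the re-usable character embedding, the per-token character table assumed in the earlier sketch could be precomputed once from the tokenizer vocabulary, roughly like this (again, just my assumed approach):

```python
import torch
import tiktoken

enc = tiktoken.get_encoding("gpt2")           # the tokenizer nanoGPT uses
vocab_size, max_chars, PAD = enc.n_vocab, 16, 256

# (vocab_size, max_chars) table of byte ids per token, padded with a spare id
token_chars = torch.full((vocab_size, max_chars), PAD, dtype=torch.long)
for tok_id in range(vocab_size):
    raw = enc.decode_single_token_bytes(tok_id)[:max_chars]  # the token's raw bytes
    if len(raw):
        token_chars[tok_id, :len(raw)] = torch.tensor(list(raw), dtype=torch.long)
```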