Has anyone tried to combine a token embedding with some representation of the ch...

mattnewton · 2024-10-23T21:45:03 1729719903

I'm not following - spell out the word how? Like put the actual bytes as numerical input to the transformer layer?

p1esk · 2024-10-23T23:19:53 1729725593

mattnewton · 2024-10-24T18:22:42 1729794162

Actually, I think adding the numerical value is not the right way to do it because of what happens when you matmul those values. Usually the right way would be to have low-dimensional, learnable character embeddings in addition to the token embeddings.

The problem with feeding a raw numerical value that represents a class into a neural network layer is that the byte value is going to be very hard for the transformer to memorize exactly, especially when numbers that are close to each other often do not share much meaning. Categories are usually encoded somehow, with a one-hot encoding or, more recently, a learned embedding, so that the different categories can be easily distinguished (different categories are close to orthogonal).
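To make the contrast concrete, here is a minimal sketch (plain Python, with a lowercase-only alphabet assumed purely for illustration). It compares raw byte values, where "b" and "c" look nearly identical, with one-hot vectors, where every pair of distinct letters is exactly orthogonal. The helper names `one_hot` and `dot` are made up for this example.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase ASCII only


def one_hot(ch):
    """Return a one-hot vector for a single lowercase letter."""
    vec = [0.0] * len(ALPHABET)
    vec[ALPHABET.index(ch)] = 1.0
    return vec


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Raw byte values: closeness in value says nothing about meaning.
gap_b_c = abs(ord("b") - ord("c"))  # neighbouring codes
gap_b_z = abs(ord("b") - ord("z"))  # distant codes

# One-hot vectors: distinct letters are orthogonal, identical letters match.
same = dot(one_hot("b"), one_hot("b"))
different = dot(one_hot("b"), one_hot("c"))

print(gap_b_c, gap_b_z, same, different)
```

A learned embedding starts from the same idea (one row per category) but lets training move the rows, so they end up only approximately orthogonal.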

My prediction would be that using the numerical value directly would not work at all, and that using learnable embeddings would work, but you would have to reserve that part of the token embedding for each token, which would hurt performance a lot on non-spelling tasks relative to just letting the whole embedding represent the token however the model sees fit.
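A rough sketch of what "reserving part of the embedding" could look like, in plain Python with random vectors standing in for learned parameters. All sizes and names here (`D_TOKEN`, `D_CHAR`, `MAX_CHARS`, `embed`) are hypothetical choices for illustration, not something proposed in the thread.

```python
import random

D_TOKEN = 8    # hypothetical: dimensions left for the free token embedding
D_CHAR = 2     # hypothetical: dimensions per character embedding
MAX_CHARS = 4  # hypothetical: characters kept per token (pad or truncate)
PAD = "\0"     # padding symbol for tokens shorter than MAX_CHARS

rng = random.Random(0)
token_table = {}  # one vector per token (stand-in for learned weights)
char_table = {}   # one vector per character, shared by all tokens


def lookup(table, key, dim):
    """Fetch a vector, creating a random one on first use."""
    if key not in table:
        table[key] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return table[key]


def embed(token):
    """Concatenate the token vector with per-position character vectors."""
    chars = list(token[:MAX_CHARS])
    chars += [PAD] * (MAX_CHARS - len(chars))
    vec = list(lookup(token_table, token, D_TOKEN))  # copy, do not alias
    for ch in chars:
        vec += lookup(char_table, ch, D_CHAR)
    return vec


v = embed("straw")
print(len(v))  # D_TOKEN + MAX_CHARS * D_CHAR
```

With these numbers, half of the 16 dimensions are spent on spelling, which is the cost the comment above is worried about.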

But, IDK! It would be easy enough to try! You should try it on a small toy model, and then try a small, learnable, reusable character embedding. And write a blog post. I would be happy to coach, offer some GPU time, or answer any other questions you have while building it.

kadushka · 2024-10-24T18:32:35 1729794755

Which tasks would you test the expected improvements on from this addition?

mattnewton · 2024-10-24T18:44:10 1729795450

Maybe make a metric of "count letter $L in word $word". If you want to game it, you can choose words so that they are all tokenized into 1-2 tokens and each token has multiple occurrences of letter $L.
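A minimal sketch of such a metric in plain Python. The prompt wording, the word list, and the `predict` callable are all assumptions for illustration; a real evaluation would plug in the model's answer parser in place of `predict`.

```python
def make_item(word, letter):
    """Build one evaluation item with its ground-truth count."""
    prompt = f"How many times does the letter '{letter}' appear in '{word}'?"
    return {"prompt": prompt, "answer": word.count(letter)}


def exact_match(items, predict):
    """Fraction of items where predict(prompt) equals the true count."""
    correct = sum(1 for it in items if predict(it["prompt"]) == it["answer"])
    return correct / len(items)


# Hypothetical word list; choose words to suit your tokenizer.
items = [
    make_item("strawberry", "r"),
    make_item("banana", "a"),
    make_item("letter", "t"),
    make_item("cat", "c"),
]


def always_one(prompt):
    """Trivial baseline that answers 1 for every prompt."""
    return 1


print(exact_match(items, always_one))
```

A trivial baseline like `always_one` is worth reporting alongside the model, since many letter counts in short words are 1.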

And then use something like HellaSwag to measure how much you've lost on general text completion compared to a vanilla LLM with the same embedding size, all dedicated to just the token.

kadushka · 2024-10-24T18:52:49 1729795969

Which model would you recommend to try it on? Would you train it from scratch, or finetune an existing one?

mattnewton · 2024-10-24T19:38:14 1729798694

You would have to train the new model from scratch, since it would be all new token embeddings with whatever character encoding scheme you come up with. It would probably make sense to train the vanilla GPT from scratch with the same total embedding size as your control. I would start with https://github.com/karpathy/nanoGPT as a baseline, since you can train a toy (GPT-2-sized) LLM in a couple of days on an A100, which is pretty easy to come by.

stephantul · 2024-10-24T04:09:59 1729742999

Not that I know of, but encoding orthography in a fixed width vector usually carries the assumption that words with the same prefix are more similar. So there’s an alignment problem. You usually solve this using dynamic programming, but that doesn’t work in a vector.

For example “parent” and “parents” are aligned, they share letters in the same position, but “skew” and “askew” share no letters in the same position.
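A small sketch of that alignment point in plain Python: a position-by-position comparison (what a fixed-width vector effectively gives you) next to a longest-common-subsequence count computed with dynamic programming. The function names are made up for this example.

```python
def positional_matches(a, b):
    """Count positions where both words have the same letter."""
    return sum(1 for x, y in zip(a, b) if x == y)


def lcs_length(a, b):
    """Longest common subsequence length via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


for a, b in [("parent", "parents"), ("skew", "askew")]:
    print(a, b, positional_matches(a, b), lcs_length(a, b))
```

For "skew" and "askew" the positional count is 0 while the aligned count is 4, which is the gap between the two views.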

p1esk · 2024-10-24T05:12:39 1729746759

The other 500 values in the skew/askew vectors will be similar though. The 12 character values don’t need to be aligned; their function is to provide spelling. Adding such info will probably help an LLM answer questions requiring character-level knowledge (e.g. counting ‘r’s in ‘strawberry’).

RicoElectrico · 2024-10-23T21:25:08 1729718708

Well, fastText uses character n-grams to compute embeddings for out-of-vocabulary words. This is pre-transformers work BTW.

p1esk · 2024-10-24T04:07:40 1729742860

IIRC, overlapping ngram vectors are summed to form the token embedding - doesn’t it effectively destroy any character level representation of the token? Doesn’t really make sense to me.

stephantul · 2024-10-24T04:12:31 1729743151

It works because they use really large ngram values, up to 6. So most character-level information is in these subwords.
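For reference, a minimal sketch of the n-gram extraction step in plain Python. It assumes the fastText convention of wrapping the word in `<` and `>` boundary markers and using lengths 3 to 6. It leaves out the hashing of n-grams into buckets and the extra whole-word entry, so treat it as an illustration rather than the library's code.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """List character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))
print(len(char_ngrams("because")))
```

With lengths up to 6, a short token is covered by n-grams that each contain most of its letters in order.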

p1esk · 2024-10-24T05:25:25 1729747525

Let’s say we want to use 6-grams and build an embedding vector for the word “because”: we add integer vectors for “becaus” and “ecause”, right? For example: [1,2,3,4,5,6] + [2,3,4,5,6,2] = [3,5,7,9,11,8]. Obviously we cannot use this resulting numerical vector to spell the input word. Pretty much all character level info is lost.
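The arithmetic in that example can be checked with a few lines of Python. This follows the comment's toy framing, where each letter is an integer code (b=1, e=2, c=3, a=4, u=5, s=6); actual fastText n-gram vectors are learned dense floats, so this sketch only illustrates the summation argument, and the second pair of strings is a made-up counterexample.

```python
# Letter codes implied by the example: "becaus" -> [1, 2, 3, 4, 5, 6]
CODES = {"b": 1, "e": 2, "c": 3, "a": 4, "u": 5, "s": 6}


def encode(gram):
    """Map each letter of an n-gram to its integer code."""
    return [CODES[ch] for ch in gram]


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


summed = add(encode("becaus"), encode("ecause"))
print(summed)

# A different pair of 6-grams that produces the same sum, so the sum
# alone cannot tell the two inputs apart in this toy encoding.
other = add(encode("eecaus"), encode("bcause"))
print(other == summed)
```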