
Well, fastText uses character n-grams to compute embeddings for out-of-vocabulary words. This is pre-transformers work BTW.
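Rough sketch of what that looks like through gensim's FastText wrapper (gensim 4.x parameter names as I recall them; toy corpus and hyperparameters purely for illustration):

    from gensim.models import FastText

    # Tiny toy corpus; real training data would be much larger.
    corpus = [
        ["the", "cause", "was", "unclear"],
        ["she", "paused", "because", "of", "the", "noise"],
    ]

    model = FastText(
        sentences=corpus,
        vector_size=32,   # embedding dimension
        min_count=1,
        min_n=3,          # shortest character n-gram
        max_n=6,          # longest character n-gram
        epochs=10,
    )

    # "becauses" never appears in the corpus, but it still gets a vector,
    # assembled from the character n-grams it shares with trained subwords.
    oov_vec = model.wv["becauses"]
    print(oov_vec.shape)                        # (32,)
    print("becauses" in model.wv.key_to_index)  # False: not in the vocabulary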


IIRC, overlapping n-gram vectors are summed to form the token embedding - doesn’t that effectively destroy any character-level representation of the token? Doesn’t really make sense to me.


It works because they use fairly long n-grams, 3 to 6 characters by default, so most of the character-level information ends up in these subwords.
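For reference, with those default lengths and fastText's "<" / ">" word-boundary markers, a single word already yields a lot of overlapping subwords. A quick sketch (the real implementation additionally hashes each subword into a fixed bucket table):

    # Roughly fastText's subword extraction: wrap the word in boundary markers
    # and take every character n-gram with 3 <= n <= 6.
    def char_ngrams(word, min_n=3, max_n=6):
        wrapped = "<" + word + ">"
        return [
            wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)
        ]

    print(char_ngrams("because"))
    # 22 subwords, including '<be', 'bec', ... and the 6-grams
    # '<becau', 'becaus', 'ecause', 'cause>'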


Let’s say we want to use 6-grams to build an embedding vector for the word “because”: we add the integer vectors for “becaus” and “ecause”, right? For example: [1,2,3,4,5,6] + [2,3,4,5,6,2] = [3,5,7,9,11,8]. Obviously we cannot use the resulting numerical vector to spell the input word - pretty much all character-level info is lost.
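Here is that sum written out over the full 3-6-gram inventory rather than just the two 6-grams (the vectors are random integers purely to mirror the toy arithmetic above; real fastText learns real-valued subword vectors and, IIRC, averages them after hashing each n-gram into a fixed bucket table):

    import numpy as np

    rng = np.random.default_rng(0)
    w = "<because>"  # fastText wraps words in boundary markers
    grams = [w[i:i + n] for n in range(3, 7) for i in range(len(w) - n + 1)]

    # One toy vector per n-gram, then a plain element-wise sum.
    vecs = {g: rng.integers(0, 10, size=6) for g in grams}
    word_vec = np.sum([vecs[g] for g in grams], axis=0)

    print(len(grams))  # 22 subwords feed into the sum
    print(word_vec)    # a single dense vector; the spelling is not directly readable off it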



