
Well, fastText uses character n-grams to compute embeddings for out-of-vocabulary words. This is pre-transformers work BTW.
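Rough sketch of what that looks like through gensim's FastText wrapper (gensim 4.x parameter names as I recall them; toy corpus and hyperparameters purely for illustration):

    from gensim.models import FastText

    # Tiny toy corpus; real training data would be much larger.
    corpus = [
        ["the", "cause", "was", "unclear"],
        ["she", "paused", "because", "of", "the", "noise"],
    ]

    model = FastText(
        sentences=corpus,
        vector_size=32,   # embedding dimension
        min_count=1,
        min_n=3,          # shortest character n-gram
        max_n=6,          # longest character n-gram
        epochs=10,
    )

    # "becauses" never appears in the corpus, but it still gets a vector,
    # assembled from the character n-grams it shares with trained subwords.
    oov_vec = model.wv["becauses"]
    print(oov_vec.shape)                        # (32,)
    print("becauses" in model.wv.key_to_index)  # False: not in the vocabulary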


IIRC, overlapping n-gram vectors are summed to form the token embedding - doesn’t that effectively destroy any character-level representation of the token? Doesn’t really make sense to me.


It works because they use fairly long n-grams, 3 to 6 characters by default, so most of the character-level information ends up in these subwords.
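For reference, with those default lengths and fastText's "<" / ">" word-boundary markers, a single word already yields a lot of overlapping subwords. A quick sketch (the real implementation additionally hashes each subword into a fixed bucket table):

    # Roughly fastText's subword extraction: wrap the word in boundary markers
    # and take every character n-gram with 3 <= n <= 6.
    def char_ngrams(word, min_n=3, max_n=6):
        wrapped = "<" + word + ">"
        return [
            wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)
        ]

    print(char_ngrams("because"))
    # 22 subwords, including '<be', 'bec', ... and the 6-grams
    # '<becau', 'becaus', 'ecause', 'cause>'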


Let’s say we want to use 6-grams to build an embedding vector for the word “because”: we add the integer vectors for “becaus” and “ecause”, right? For example: [1,2,3,4,5,6] + [2,3,4,5,6,2] = [3,5,7,9,11,8]. Obviously we cannot use the resulting numerical vector to spell the input word - pretty much all character-level info is lost.
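Here is that sum written out over the full 3-6-gram inventory rather than just the two 6-grams (the vectors are random integers purely to mirror the toy arithmetic above; real fastText learns real-valued subword vectors and, IIRC, averages them after hashing each n-gram into a fixed bucket table):

    import numpy as np

    rng = np.random.default_rng(0)
    w = "<because>"  # fastText wraps words in boundary markers
    grams = [w[i:i + n] for n in range(3, 7) for i in range(len(w) - n + 1)]

    # One toy vector per n-gram, then a plain element-wise sum.
    vecs = {g: rng.integers(0, 10, size=6) for g in grams}
    word_vec = np.sum([vecs[g] for g in grams], axis=0)

    print(len(grams))  # 22 subwords feed into the sum
    print(word_vec)    # a single dense vector; the spelling is not directly readable off it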



