
With all the eyeballs on word2vec and Gensim, how did this not get picked up before?



Gensim treats the `word2vec.c` code from the original authors of the word2vec paper as canonical and seeks to match its behavior exactly, even where it deviates from some interpretations of the paper.

If there's an actual benefit to be had here, Gensim could add it as an option, but it would likely keep defaulting to the same CBOW behavior as `word2vec.c` (and, similarly, FastText) rather than this 'koan' variant.



While I still need to read this paper in detail, I'm not sure their only change is to this scaling of the update.

The `koan` CBOW change has mixed effects on benchmarks, and it makes their implementation no longer match the canonical `word2vec.c` release from the original Google authors of the word2vec paper (or, by my understanding, the CBOW mode of the FastText code).

So all the reasoning in that issue for why Gensim didn't want to make any change still stands. Of course, if there's an alternate mode that offers proven benefits, it'd be a welcome suggestion/addition. (At this point, it's possible that simply using the `cbow_mean=0` sum-rather-than-average mode, or a different starting `alpha`, matches any claimed benefits of the koan CBOW variant.)
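
For anyone who wants to test that themselves, here's a minimal sketch of trying both Gensim modes side by side. The corpus and parameter values are toy placeholders, not a recommendation; a real comparison would use the benchmarks from the paper.

    # Minimal sketch: CBOW with averaged vs. summed context vectors in Gensim.
    from gensim.models import Word2Vec

    sentences = [["the", "quick", "brown", "fox"],
                 ["jumps", "over", "the", "lazy", "dog"]]  # toy corpus

    # Default CBOW: context vectors are averaged (cbow_mean=1).
    avg_model = Word2Vec(sentences, sg=0, cbow_mean=1, vector_size=100,
                         window=5, min_count=1, alpha=0.025, epochs=5)

    # Sum-rather-than-average CBOW: cbow_mean=0, here with a smaller alpha
    # since the summed context tends to produce larger updates.
    sum_model = Word2Vec(sentences, sg=0, cbow_mean=0, vector_size=100,
                         window=5, min_count=1, alpha=0.0125, epochs=5)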


The paper itself says the only change is normalizing by the context window size C.
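
Concretely, as I read it, the difference is whether the accumulated error is scaled back down by C before being added to each context vector. A rough sketch of one negative-sampling CBOW step (variable names are mine, not from either codebase):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cbow_step(syn0, syn1neg, context_ids, target_id, neg_ids, lr,
                  divide_by_C=False):
        """One CBOW negative-sampling update.

        divide_by_C=False mimics the word2vec.c behavior (the full accumulated
        gradient is added to every context vector); divide_by_C=True mimics
        the koan-style correction, scaling that update by 1/C.
        """
        C = len(context_ids)
        h = syn0[context_ids].mean(axis=0)   # forward pass: mean of context vectors
        neu1e = np.zeros_like(h)             # accumulated error for the context

        for wid, label in [(target_id, 1.0)] + [(n, 0.0) for n in neg_ids]:
            score = sigmoid(np.dot(h, syn1neg[wid]))
            g = (label - score) * lr
            neu1e += g * syn1neg[wid]
            syn1neg[wid] += g * h

        if divide_by_C:
            neu1e /= C                       # the paper's normalization by C
        for cid in context_ids:
            syn0[cid] += neu1e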


Ah, but I've now looked at their code, and it's not the only change! They've also eliminated the `reduced_window` method of weighting-by-distance that's present in `word2vec.c`, Gensim, and FastText.
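
The `reduced_window` trick is easy to miss: instead of explicit distance weights, `word2vec.c` draws a random shrunken window per target position, so nearer words are simply sampled more often. Roughly (a paraphrase in Python, not the actual C):

    import random

    def sample_context(sentence, pos, window):
        # word2vec.c-style dynamic window: draw b uniformly in [0, window),
        # then use an effective window of (window - b). Over many draws, a
        # word at distance d is included with probability (window - d + 1) /
        # window, so closer words are implicitly weighted more heavily.
        b = random.randint(0, window - 1)
        eff = window - b
        left = max(0, pos - eff)
        right = min(len(sentence), pos + eff + 1)
        return [sentence[i] for i in range(left, right) if i != pos]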

What if that's the real reason for their sometimes slightly-better, sometimes slightly-worse performance on some benchmarks? Perhaps there are other changes, too.

This is why I continue to think Gensim's policy of matching the reference implementations from the original authors, at least by default, is usually the right one, rather than adopting an alternate interpretation of the often-underspecified papers.


The word2vec implementation has many details that are unmentioned, or at least not emphasized much, in the paper. The source is also sparsely commented, if memory serves.

Here's another paper that's essentially about these kinds of implementation details in word2vec and GloVe and their effects on the results:

Improving Distributional Similarity with Lessons Learned from Word Embeddings: https://www.aclweb.org/anthology/Q15-1016/



