Koan: A word2vec negative sampling implementation with correct CBOW update

polm23 · on Jan 2, 2021

This is an interesting surprise about good old word vectors. From the README:

> Although continuous bag of word (CBOW) embeddings can be trained more quickly than skipgram (SG) embeddings, it is a common belief that SG embeddings tend to perform better in practice. This was observed by the original authors of Word2Vec [1] and also in subsequent work [2]. However, we found that popular implementations of word2vec with negative sampling such as word2vec and gensim do not implement the CBOW update correctly, thus potentially leading to misconceptions about the performance of CBOW embeddings when trained correctly.

The upshot is that they get similar results with CBOW while training three times faster than skipgram.

Given the popularity of Transformers, and that Fasttext exists, I'm curious as to what inspired them to even try this, but it's certainly an interesting result. There's so much word vector research that relies on quirks of the word2vec implementation.

nmfisher · on Jan 2, 2021

I'm not surprised that industry prefers W2V over Transformers, given how heavy-duty the latter can be at inference time.

It's been a few years since I looked at it, but IIRC fastText is basically just w2v with subwords, so it's also possible this negative sampling fix applies to w2v and fastText equally.

riku_iki · on Jan 2, 2021

One can also use shallow transformer models if inference throughput is important.

jarym · on Jan 2, 2021

With all the eyeballs on word2vec and gensim how did this not get picked up before?

gojomo · on Jan 2, 2021

Gensim considers the `word2vec.c` code, from the original authors of the Word2Vec paper, as canonical and seeks to match its behavior exactly, even in ways it might deviate from some interpretations of the paper.

If there's an actual benefit to be had here, Gensim could add it as an option - but would likely always default to the same CBOW behavior as in `word2vec.c` (& similarly, FastText) - rather than this 'koan' variant.

logram · on Jan 2, 2021

Apparently it did: https://github.com/RaRe-Technologies/gensim/issues/1873

gojomo · on Jan 2, 2021

While I still need to read this paper in detail, I'm not sure their only change is to this scaling of the update.

The `koan` CBOW change has mixed effects on benchmarks, and makes their implementation no longer match the choices of the original, canonical `word2vec.c` release from the original Google authors of the word2vec paper. (Or, by my understanding, the CBOW mode of the FastText code.)

So all the reasoning in that issue for why Gensim didn't want to make any change stands. Of course, if there's an alternate mode that offers proven benefits, it'd be a welcome suggestion/addition. (At this point, it's possible that simply using the `cbow_mean=0` sum-rather-than-average mode, or a different starting `alpha`, matches any claimed benefits of koan_CBOW.)

mlthoughts2018 · on Jan 2, 2021

The paper itself says the only change is normalizing by the context window size C.

gojomo · on Jan 2, 2021

Ah, but I've now looked at their code, and it's not the only change! They've also eliminated the `reduced_window` method of weighting-by-distance that's present in `word2vec.c`, Gensim, and FastText.

What if that's the real reason for their sometimes slightly-better, sometimes slightly-worse performance on some benchmarks? Perhaps there are other changes, too.

This is why I continue to think Gensim's policy of matching the reference implementations from the original authors, at least by default, is usually the best policy – rather than using an alternate interpretation of the often-underspecified papers.

polm23 · on Jan 3, 2021

The word2vec implementation has many details that are unmentioned or at least not emphasized much in the paper. The source is also not very commented if memory serves.

This is another paper that's basically just about some details of word2vec and GloVe and their effects on the results:

Improving Distributional Similarity with Lessons Learned from Word Embeddings - ACL Anthology https://www.aclweb.org/anthology/Q15-1016/

steve_g · on Jan 2, 2021

To save a few clicks, here's the paper that describes the fix and gives some comparisons with the supposedly broken implementation.

https://arxiv.org/pdf/2012.15332.pdf

gojomo · on Jan 2, 2021

While I'm unsure of this paper/implementation's main claims without a closer reading, the Appendix D 'alias method for negative sampling' looks like it might be a nice standalone performance improvement to Gensim (& others') negative-sampling code.

utopcell · on Jan 2, 2021

Indeed it is a significant performance boost. I believe I was the first to suggest using the alias method for w2v. I had extended it to work in a distributed setting a few years back [1] which allowed me to scale the problem to a dataset of 1 trillion words on a lexicon of 1.4BN words (roughly, the top 40 billion web pages of Yahoo's search engine).

[1] https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14956

mlthoughts2018 · on Jan 2, 2021

I’m not convinced by this paper. I’ve trained a lot of from-scratch word, sentence and query embeddings in my career, it’s probably the single main thing I’ve done. I’ve never observed rescaling the average context vector to have an impact on application performance. It amounts to rescaling gradient terms, but most of those are being backprop’d from layers with batch normalization, strict activation functions, clipping, etc. There are many, many non-linear effects contributing to how that rescaling constant plays a role, and in anything other a completely shallow word2vec model with no further layers and where you just want to extract the embeddings in some application-agnostic way, that normalizing constant is not going to matter.

Der_Einzige · on Jan 2, 2021

This is awesome! Especially because word2vec and it's derivatives are much more useful in some cases than transformers are.

For instance, they store the vocabulary. I can query for similar words, or do vector math and convert it back to words. That is much harder to do with transformers.

Also, not surprised at all that this kind of bug made it through inspire of how popular word2vec is. NLP is chalk full of tiny bugs like this and there is all sorts of low hanging fruit for interested enough researchers...

piker · on Jan 2, 2021

Noticed in the code no Huffman tree. Then FN2 in the paper: "In this work, we always use the negative sampling formulations of Word2vec objectives which are consis-tently more efficient and effective than the hierarchical softmax formulations." Is that consensus?

gojomo · on Jan 2, 2021

It's the default most places. Especially with larger training corpora or vocabularies, the negative-sampling tends perform better - & I don't recall notable situations where the HS mode is better.

(I recall seeing some hints in early word2vec code of an HS-based vocabulary that wasn't based on mere word-frequency, but some earlier or perhaps iterated semantic-clustering steps, that I think managed to give similar words shared codes. But I've not seen more on that recently.)

gwenzek · on Jan 2, 2021

Why not a pull request?

polm23 · on Jan 3, 2021

Several reasons come to mind.

1. The original word2vec is considered a reference implementation, used for benchmarks, so some people might not want it to change much.

2. The original hasn't been updated since 2017, and development was never very interactive.

3. word2vec has been widely reimplemented, and some reimplementations may be more widely used than the original (particularly Gensim).

4. The original had a paper from the very start (rather than being an open source project where a paper came later), so other papers reference it. For future papers that use the koan form, having a paper makes it easy to use in a similar way.

mlthoughts2018 · on Jan 2, 2021

Why do work when you can milk it for status?