It’s not pure chance that the above arithmetic works out, but it isn’t guaranteed either. If you’re embedding at the word level it can happen; if the unit is a bit smaller or larger than a word, it’s not immediately clear what the calculation is actually doing.
But the main difference here is that you get one embedding for the whole document, not an embedding per word like word2vec. So it’s something more like “document about OS/2 Warp” - “wiki page for IBM” + “wiki page for Microsoft” = “document on Windows 3.1”.
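To make that concrete, here’s a minimal sketch of the analogy arithmetic with document-level embeddings. I’m using sentence-transformers as the encoder, but the model name and the toy corpus are just placeholders, and whether the analogy actually resolves to the “right” document depends entirely on how linear the model’s embedding space happens to be:

```python
# A minimal sketch, assuming sentence-transformers is installed and that the
# chosen model's space is linear enough for analogy arithmetic to mean anything.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

# Toy stand-ins for real documents.
corpus = [
    "OS/2 Warp was IBM's 32-bit operating system for PCs.",
    "Windows 3.1 was Microsoft's 16-bit operating environment for PCs.",
    "The IBM PC was introduced in 1981.",
    "Microsoft was founded by Bill Gates and Paul Allen.",
]

# One embedding per document, not per word.
doc_vecs = model.encode(corpus, normalize_embeddings=True)

os2 = model.encode("document about OS/2 Warp", normalize_embeddings=True)
ibm = model.encode("wiki page for IBM", normalize_embeddings=True)
msft = model.encode("wiki page for Microsoft", normalize_embeddings=True)

# "OS/2 Warp" - "IBM" + "Microsoft": the hope is this lands near Windows 3.1.
query = os2 - ibm + msft
query /= np.linalg.norm(query)

# On unit vectors, cosine similarity reduces to a dot product.
scores = doc_vecs @ query
best = int(np.argmax(scores))
print(corpus[best], scores[best])
```

No guarantee the argmax is the Windows 3.1 document for any given model; the point is just what the calculation looks like when the unit is a whole document instead of a word.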