You're right that what I described isn't what people commonly think of as embeddings (given how much more advanced things are now than the above description), but broadly an embedding is anything (in NLP at least) that maps text into a fixed-length vector. When you make embeddings like this, the nice thing is that cosine similarity has an easy-to-understand meaning: count the number of words two documents have in common (subject to some normalization constant).
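To make that concrete, here's a minimal sketch (the toy vocabulary and documents are my own, not the commenter's): with binary bag-of-words vectors, the dot product counts the shared words, and cosine similarity is just that count divided by the two vector lengths.

```python
import numpy as np

# Toy vocabulary defining the vector positions: ["the", "cat", "sat", "dog", "ran"]
doc_a = np.array([1, 1, 1, 0, 0])  # "the cat sat"
doc_b = np.array([1, 0, 0, 1, 1])  # "the dog ran"

shared = doc_a @ doc_b  # dot product = number of shared words (1, namely "the")
cosine = shared / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(shared, cosine)
```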
Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.
A lot of people here are trying to describe to you that no, this is not at all the starting point of modern embeddings. This has none of the properties of embeddings.
What you're describing is an idea from the 90s that was a dead end: bag-of-words representations.
It has no relationship to modern methods. It's based on totally different theory (BoW instead of the distributional hypothesis).
There is no conceptual or practical path from what you describe to what modern embeddings are. It's horribly misleading.
> There is no conceptual or practical path from what you describe to what modern embeddings are.
There certainly is. At least, there is a strong relation between bag-of-words representations and methods like word2vec. I am sure you know all of this, but I think it's worth expanding on a bit, since the top-level comment describes things in a rather confusing way.
In traditional Information Retrieval, two kinds of vectors were typically used: document vectors and term vectors. If you make a |D| x |T| matrix (where |D| is the number of documents and |T| is the number of distinct terms that occur across all documents), you can go through a corpus and note, in the |T|-length row for a particular document, the frequency of each term in that document (frequency here meaning raw counts or something like TF-IDF). Each row is a document vector, each column a term vector. The cosine similarity between two document vectors will tell you whether two documents are similar, because similar documents are likely to have similar terms. The cosine similarity between two term vectors will tell you whether two terms are similar, because similar terms tend to occur in similar documents. The top-level comment seems to have explained document vectors in a clumsy way.
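A rough sketch of that |D| x |T| matrix on a toy corpus (the corpus and the use of scikit-learn's CountVectorizer are my choices for illustration, not part of the original comment):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "information retrieval ranks documents",
    "retrieval of documents uses term frequencies",
    "neural networks learn word embeddings",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # |D| x |T| matrix of raw term counts

# Rows are document vectors: similar documents share terms.
print(cosine_similarity(X))

# Columns are term vectors: similar terms occur in similar documents.
print(cosine_similarity(X.T))
```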
Over time (we are talking 70s-90s), people found that term vectors did not really work well, because documents are often too coarse-grained as context. So, term vectors were instead derived from |T| x |T| matrices: if you have such a matrix C, then C[i][j] contains how often the j-th term occurs in the context of the i-th term. Since this type of matrix is not bound to documents, you can choose the context size based on the goals you have in mind. For instance, you could count only terms that occur within a window of 10 tokens around each occurrence of term i.
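A hedged sketch of such a term-context count matrix, assuming a symmetric window and whitespace tokenization (both arbitrary choices here):

```python
import numpy as np

tokens = "information retrieval systems rank documents for information needs".split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
window = 2  # count co-occurrences within 2 tokens on either side

C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            C[idx[w], idx[tokens[j]]] += 1  # C[i][j]: how often term j occurs near term i

print(vocab)
print(C)
```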
One refinement is that rather than raw frequencies, we can use some other measure. One issue with raw frequencies is that a frequent word like "the" will co-occur with pretty much every word, so its frequency in a term vector is not particularly informative, but its large counts will have an outsized influence on e.g. dot products. So, people would typically use pointwise mutual information (PMI) instead. It's beyond the scope of a comment to explain PMI in full, but intuitively you can think of the PMI of two words as: how much more often do the words co-occur than by chance? This results in a low value for e.g. PMI(information, the) but a high value for PMI(information, retrieval). It's also common practice to replace negative PMI values by zero, which gives PPMI (positive PMI).
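Continuing the sketch above, one possible way to turn the raw count matrix C into PPMI (the small epsilon is only there to avoid taking the log of zero):

```python
import numpy as np

def ppmi(C, eps=1e-12):
    total = C.sum()
    p_ij = C / total                          # joint probability of (term, context)
    p_i = p_ij.sum(axis=1, keepdims=True)     # marginal over rows
    p_j = p_ij.sum(axis=0, keepdims=True)     # marginal over columns
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
    return np.maximum(pmi, 0.0)               # clip negatives -> positive PMI

# PPMI downweights pairs like (information, the) and boosts pairs like (information, retrieval).
M = ppmi(C)
```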
So, what do we have now? A |T| x |T| matrix of PPMI scores, where each row (or column) can be used as a word vector. However, it's a bit unwieldy, because the vectors are long (length |T|) and typically quite sparse. So people started to apply dimensionality reduction, e.g. with Singular Value Decomposition (SVD; I'll skip the details of how to use it for dimensionality reduction). Suppose we use SVD to reduce the dimensionality to 300: we are left with a |T| x 300 matrix, and we finally have dense vectors, similar to e.g. word2vec.
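And a sketch of that SVD step on the toy PPMI matrix M from above (for a real vocabulary you would keep something like 300 dimensions; a toy matrix has far fewer):

```python
import numpy as np

k = min(300, min(M.shape) - 1)     # toy vocabularies have far fewer than 300 terms
U, S, Vt = np.linalg.svd(M)
word_vectors = U[:, :k] * S[:k]    # |T| x k dense word vectors

# Cosine similarity between rows now plays the same role as with word2vec vectors.
```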
Now, the interesting thing is that people have found that word2vec's skip-gram with negative sampling (SGNS) is implicitly factorizing a PMI-based word-context matrix [1], exactly like the IR folks were doing before. Conversely, if you matrix-multiply the word and context embedding matrices that come out of word2vec SGNS, you get an approximation of the |T| x |T| PMI matrix (or |T| x |C| if a different vocabulary is used for the context).
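For the curious, a rough gensim-based illustration of that correspondence (the toy corpus is mine, and syn1neg is a gensim internal rather than a documented API, so treat this as a sketch rather than a recipe):

```python
from gensim.models import Word2Vec

sentences = [s.split() for s in [
    "information retrieval ranks documents",
    "neural networks learn word embeddings",
    "retrieval systems use term frequencies",
]]

# Skip-gram with negative sampling (SGNS).
model = Word2Vec(sentences, vector_size=50, sg=1, negative=5, min_count=1, window=5)

W = model.wv.vectors           # word (input) embeddings, |T| x 50
C = model.syn1neg              # context (output) embeddings, |T| x 50
approx_shifted_pmi = W @ C.T   # approximates PMI(w, c) shifted by log of the negative-sample count
```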
Summarized, there is a strong conceptual relation between the bag-of-words representations of the old days and word2vec.
Whether it's an interesting route didactically for understanding embeddings is up for debate. It's not like the mathematics behind word2vec are complex (understanding the dot product and the logistic function goes a long way) and understanding word2vec in terms of 'neural net building blocks' makes it easier to go from word2vec to modern architectures. But in an exhaustive course about word representations, it certainly makes sense to link word embeddings to prior work in IR.
Eh, I disagree. When I began working in ML, everything was about word2vec and GloVe, and the state of the art for embedding documents was adding together all the word embeddings. It made no sense to me, but it worked.
Learning about BoW and simple ways of converting text to fixed-length vectors that can be used in ML algos clarified a whole lot for me, especially the fact that embeddings aren't magic: they are just a way to convert text to a fixed-length vector.
BoW and tf-idf vectors are still workhorses for routine text classification tasks despite their limitations, so they aren't really a dead end. Similarly, a lot of things that followed BoW make a whole lot more sense if you think of them as addressing the limitations of BoW.
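For instance, a minimal sketch of that workhorse setup with scikit-learn (the texts and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great product, works well",
    "terrible, broke after a day",
    "very happy with this",
    "would not recommend",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# tf-idf features plus a linear classifier: still a strong baseline for routine text classification.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["happy with the product"]))
```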
Well, you've fooled yourself into thinking you understand something when you don't. I say this as someone with a PhD in the topic, who has taught many students, and published dozens of papers in the space.
The operation of adding BoW vectors together has nothing to do with the operation of adding together word embeddings. Well, aside from both nominally being addition.
It's like saying you understand what's happening because you can add velocity vectors and then you go on to add the binary vectors that represent two binary programs and expect the result to give you a program with the average behavior of both. Obviously that doesn't happen, you get a nonsense binary.
They may both be arrays of numbers but mathematically there's no relationship between the two. Thinking that there's a relationship between them leads to countless nonsense conclusions: the idea that you can keep adding word embeddings to create document embeddings like you keep adding BoWs, the notion that average BoWs mean the same thing as average word embeddings, the notion that normalizing BoWs is the same as normalizing word embeddings and will lead to the same kind of search results, etc. The errors you get with BoWs are totally different from the errors you get with word or sentence or document embeddings. And how you fix those errors is totally different.
No. Nothing at all makes sense about word embeddings from the point of BoW.
Also, yes, BoW representations are a total dead end. They have been completely supplanted. There's never any case where someone should use them.
> as someone with a PhD in the topic, who has taught many students, and published dozens of papers
:joy: How about a repo I can run that proves you know what you're talking about? :D Just put together examples and you won't have to throw around authority fallacies.
> if you add 2 vectors together nothing happens and its meaningless
In game dev you move a character in 3D space by taking their current [x, y, z] vector and adding a different [x, y, z] to it. Even though left/right has nothing to do with up/down, because all the axes share foundational concepts like increase/decrease relative to a [0, 0, 0] origin, the addition affects every axis and gives you a valuable result.
Take that same basic idea and apply it to text-embedding-ada-002 with its 1536-dimensional embedding vectors, and you can similarly "navigate words" by doing similar math on different vectors. That's what is meant by the king - man + woman ≈ queen type concepts.
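A rough sketch of that kind of arithmetic, assuming gensim's downloader and a small pretrained GloVe model (any set of word vectors with a most_similar-style lookup would work the same way):

```python
import gensim.downloader as api

# Downloads a small pretrained GloVe model (~66 MB) on first use.
kv = api.load("glove-wiki-gigaword-50")

# king - man + woman: "queen" typically appears near the top of the results.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```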
I think what the person before you meant about it not making sense is that it's strange that something like a dot product gives you similarity. To your point (I think), it only would if the vectors were scored in a meaningful way: of course, if you did math on nonsense vectors you'd get nonsense results, but if they are modeled appropriately it should be a useful way to find nodes with certain qualities.