Hard to say, as a native speaker eloquence here doesn’t seem to refer to the authors own eloquence but rather the amount of words and roundabout explanations they’ve dedicated to the topic. Ie, in contrast to a simple explanation for the obvious things. If anything it reads a touch self deprecating.
Tbh starting the history doesn't at 1948 doesn't really help much. Around 1948 is when the Jewish and Palestinian narratives are already in full swing and have some of the strongest evidence for each side. On the Jewish side you'd have their invasion by the broader Arab world and their subsequent mentality that they have to defend themselves at all costs. On the Palestinian side you have the Nakba and the resulting trauma that forms their identity.
If you go back earlier you start to see the roots of the conflict more clearly. From Russian Jews being expelled out of Russia and looking towards the Ottoman empire and historic Judea as a place from their home, Ottoman policies around 1900s when they are hyper paranoid about quashing nationalism in their multi-ethnic empire and how that plays into their treatment of Jews they allow to immigrate into their lands, then the British control over Ottoman lands and how they tried to administer to it and designated Judea as a vague "Jewish National Home" which gave neither Jews nor Arabs clarity on the political reality of the region. Then there was the early immigration of Jews to Israel in the 1910s and 1920s which meets some Arab resistance and sparks violence, but the migration stops in the mid 1920s, so tensions cool a bit. Then in the 1930s the Nazis rise to power and immigration of Jews to Palestine _really_ starts. At this point the US has imposed strict quotas on Jews immigrating to the US (in around 1924), so for many Jews Palestine isn't their first choice, but it is a choice for them and the British seem serious about this "Jewish National Home" thing, so they go for it. Then comes the great Arab revolt in 1936 where Arabs in Palestine respond to the large influx of Jewish migrants, demand independence from Britain, and seek to end the Jewish National Home project. Somewhere around the early 1940s WW2 is in swing and the Nazis invade Egypt and establish a SS unit there whose purpose is the genocide the Jews in Palestine. This was a pretty close call, if it wasn't for a British victory in Egypt that held the Nazi's back, odds are the Middle East would look very different today.
And that brings us to 1948 when the British withdraw and the stage is set for the resulting chaos. So there you have it. On one side a population struggling to determine their national identity after the fall of a 600 year old empire and dealing with a large influx of immigrants that they don't want to their lands (and also still struggling to figure out what "their" means). On the other hand you have a group of people trying to flee their home country and having no option but this weird "Jewish National Home" that totally isn't a state according to the British despite it being maybe possible a state for Jews? Then in the middle of a clearly bad situation, Britain withdraws and the Jews and Palestinians have been dealing with the situation sense.
None of this is to say that one side is right or wrong. Rather, if you look around 1948 you're already missing a lot of the buildup to the conflict and the reason there is no trust between the two groups in 1948.
Agreed, the history by it self is not enough, except for the context it is put in. In the case of Israel Palestine this context matters. I like to put it in the context of colonialism and colonial warefare/resistance. Knowing the history is important, and 1948 is certainly a pivotal moment in history, but if you list events as if one thing leads to another (like my parent did) you, at best, end up with a simplistic view, a weird nonsensical justification and, at worst, wrong timeline and a confusing conclusion. Lying with history is after all a pretty known device to provide justification which otherwise wouldn’t be possible.
I like the context of colonization here because in 1948 Europe is in the middle of recolonizing regions lost during WW2 and the whole period after WW2 has been dominated by a) the cold war and b) decolonization. The cold war here seems irrelevant but decolonization seems important. Many post WW2 decolonization efforts were done via colonial warfare and resistance. Many of the colonial nations spent most of their military activity fighting resistance movements in their newly re-established colonies. Examples include the British army fighting the Mau Mau in Kenya, and the IRA in Northern Ireland, the French fighting FLN in Algeria, and the Americans fighting the Viet Cong in Vietnam. Even before WW2 the British were fighting Irish Republicans in Ireland.
In Palestine we have the British handing the colony of Palestine over to the UN in the hopes of a peaceful decolonization, which became moot when Israel declared independence in 1948. After 1967 it becomes obvious that Israel is the colonizer and Palestinians are the colonized. In 1967 Europe is desperately trying to keep colonialism alive so it makes sense they support another western colonizer in the Middle East. After 1967 we also have established resistance movements in Palestine such as the PLA (which later became superseded by Hamas and PIJ). By the 1990s, after the Cold war has ended, it becomes obvious that European colonialism is all but dead. However Israel remains as a beacon of hope, the last remaining colony of European settlers. But everyone knows it can’t stay like that. Decolonization is inevitable. And I think the hope is that the two-state solution is a way for the settlers to keep theirs while Palestine remains subjugated while not a proper colony (not learning the lesson from the partition of Ireland). This however failed, largely because Israel wished to keep colonizing Palestinians.
Of course this is simplistic. However this is the context in which I like to put this history.
I think you are reading too much into my response. The list of events isn't an attempt to justify anything. It is mainly to establish that what happens in 1948 isn't coming out of nowhere, but rather has been brewing and building up for a while in the region. If someone is only familiar with 1948, they're going to really struggle to understand the motivations of either side and will likely end up with a simplistic view of one side or the other as evil depending on what information they get.
As for the colonialist interpretation... that confuses more than it clarifies. There is enough substance to the conflict itself that we don't need to resort to grand sweeping historical narratives.
One straightforward way to get started is to understand embedding without any AI/deep learning magic. Just pick a vocabulary of words (say, some 50k words), pick a unique index between 0 and 49,999 for each of the words, and then produce embedding by adding +1 to the given index for a given word each time it occurs in a text. Then normalize the embedding so it adds up to one.
Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.
The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.
Those would really just be identifiers. I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.
The classic example is word embeddings such as word2vec, or GloVE, where due to the embeddings being meaningful in this way, one can see vector relationships such as "man - woman" = "king - queen".
> I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.
In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts you are effectively counting the number of words the two texts have in common (subject to some normalization constants depending on how you normalize the embedding). Cosine similarity still works for even these super naive embeddings which makes it slightly easier to understand before getting into any mathy stuff.
You are 100% right this won't give you the word embedding analogies like king - man = queen or stuff like that. This embedding has no concept of relationships between words.
But that doesn't seem to be what you are describing in terms of using incrementing indices and adding occurrence counts.
If you want to create a bag of words text embedding then you set the number of embedding dimensions to the vocabulary size and the value of each dimension to the global count of the corresponding word.
Heh -- my explanation isn't the clearest I realize, but yes, it is BoW.
Eg fix your vocab of 50k words (or whatever) and enumerate it.
Then to make an embedding for some piece of text
1. initialize an all zero vector of size 50k
2. for each word in the text, add one to the index of the corresponding word (per our enumeration). If the word isn't in the 50k words in your vocabulary, then discard it
3. (optionally), normalize the embedding to 1 (though you don't really need this and can leave it off for the toy example).
initialize an embedding (for a single text) as an all zero vector of size 50k
Are you talking about sentence/text chunk embeddings, or just embeddings in general?
If you need high quality text embeddings (e. g to use with a vector DB for text chunk retrieval), they they are going to come from the output of a language model, either a local one or using an embeddings API.
Other embeddings are normally going to be learnt in end-to-end fashion.
I disagree. In most subjects, recapitulating the historical development of a thing helps motivate modern developments. Eg
1. Start with bag of words. Represent words as all zero except one index that is not zero. Then a document is the sum (or average) of all the words in the document. We now have a method of embedding a variable length piece of text into a fixed size vector and we start to see how "similar" is approximately "close", though clearly there are some issues. We're somewhere at the start of nlp now.
2. One big issue is that there are a lot of common noisy words (like "good", "think", "said", etc.) that can make the embedding more similar than we feel they should be. So now we develop strategies for reducing the impact of those words one our vector. Remember how we just summed up the individual word vectors in 1? Now we'll scale each word vector by its frequency so that the more frequent the word in our corpus, the smaller we'll make the corresponding word vector. That brings us to tf-idf embeddings.
3. Another big issue is that our representation of words don't capture word similarity at all. The sentences "I hate when it rains" and "I dislike when it rains" should be more similar than "I hate when it rains" and "I like when it rains", but in our embeddings from (2) the similarity of the two pairs is going to be similar. So now we revisit our method of constructing word vectors and start to explore ways to "smear" words out. This is where things like word2vec and glove pop up as methods of creating distributed representation of words. Now we can represent documents by summing/averaging/tf-idfing our word vectors the same as we did in 2.
4. Now we notice there is an issue where words can have multiple meanings depending on their surrounding context. Think of things like irony, metaphor, humor, etc. Consider "She rolled her eyes and said, 'Don't you love it here?"'" and "She rolled the dough and said, 'Don't you love it here?'". Odds our, the similarity per (3) is going to be pretty similar, despite the fact that its clear these are wildly different meanings. The issue is that our model in (3) just uses a static operation for combining our words, and because of that we aren't capturing the fact that "Don't you love it here" shouldn't mean the same thing in the first and second sentences. So now we start to consider ways in which we can combine our word vectors differently and let the context affect the way in which we combine them.
5. And that brings us to now where we have a lot more compute than we did before and access to way bigger corpora so we can do some really interesting things, but its all still the basic steps of breaking down text into its constituent parts, representing those numerically, and then defining a method to combine various parts to produce a final representation for a document. The above steps greatly help by showing the motivation for each change and understanding why we do the things we do today.
This explanation is better, because it puts things into perspective, but you don't seem to realize that your 1 and 2 are almost trivial compared to 3 and 4.
At the heart of it are "methods of creating distributed representation of words", that's where the magic happens. So I'd focus on helping people understand those methods. Should probably also mention subword embedding methods like BPE, since that's what everyone uses today.
I noticed that many educators make this mistake: spend a lot of time on explaining very basic trivial things, then rush over difficult to grasp concepts or details.
> We now have a method of embedding a variable length piece of text into a fixed size vector
Question: Is it a rule that the embedding vector must be higher dimensional than the source text? Ideally 1 token -> a 1000+ length vector? The reason I ask is because it seems like it would lose value as a mechanism if I sent in a 1000 character long string and only got say a 4-length vector embedding for it. Because only 4 metrics/features can't possibly describe such a complex statement, I thought it was necessary that the dimensionality of the embedding be higher than the source?
OK, sounds counter-intuitive, but I'll take your word for it!
It seems odd since the basis of word similarity captured in this type of way is that word meanings are associated with local context, which doesn't seem related to these global occurrence counts.
Perhaps it works because two words with similar occurrence counts are more likely to often appear close to each other than two words where one has a high count, and another a small count? But this wouldn't seem to work for small counts, and anyways the counts are just being added to the base index rather than making similar-count words closer in the embedding space.
Do you have any explanation for why this captures any similarity in meaning?
> rather than making similar-count words closer in the embedding space.
Ah I think I see the confusion here. They are describing creating an embedding of a document or piece of text. At the base, the embedding of a single word would just be a single 1. There is absolutely no help with word similarity.
The problem of multiple meanings isn't solved by this approach at all, at least not directly.
Talking about the "gravity of a situation" in a political piece makes the text a bit more similar to physics discussions about gravity. But most of the words won't match as well, so your document vector is still more similar to other political pieces than physics.
Going up the scale, here's a few basic starting points that were (are?) the backbone of many production text AI/ML systems.
1. Bag of words. Here your vector has a 1 for words that are present, and 0 for ones that aren't.
2. Bag of words with a count. A little better, now we've got the information that you said "gravity" fifty times not once. Normalise it so text length doesn't matter and everything fits into 0-1.
3. TF-IDF. It's not very useful to know that you said a common word a lot. Most texts do, what we care about is ones that say it more than you'd expect so we take into account how often the words appear in the entire corpus.
These don't help with words, but given how simple they are they are shockingly useful. They have their stupid moments, although one benefit is that it's very easy to debug why they cause a problem.
It’s not pure chance that the above calculus shakes out, but it doesn’t have to be that way. If you are embedding on a word by word level then it can happen, if it’s a little smaller or larger than word by word it’s not immediately clear what the calculation is doing.
But the main difference here is you get 1 embedding for the document in question, not an embedding per word like word2vec. So it’s something more like “document about OS/2 warp” - “wiki page for ibm” + “wiki page for Microsoft” = “document on windows 3.1”
I'm trying to understand this approach. Maybe I am expecting too much out of this basic approach, but how does this create a similarity between words with indices close to each other? Wouldn't it just be a popularity contest - the more common words have higher indices and vice versa? For instance, "king" and "prince" wouldn't necessarily have similar indices, but they are semantically very similar.
You are expecting too much out of this basic approach. The "simple" similarity search in word2vec (used in https://semantle.com/ if you haven't seen it) is based on _multiple_ embeddings like this one (it's a simple neural network not a simple embedding).
This is a simple example where it scores their frequency. If you scored every word by their frequency only you might have embeddings like this:
act: [0.1]
as: [0.4]
at: [0.3]
...
That's a very simple 1D embedding, and like you said would only give you popularity. But say you wanted other stuff like its: Vulgarity, prevalence over time, whether its slang or not, how likely it is to start or end a sentence, etc. you would need more than 1 number. In text-embedding-ada-002 there are 1536 numbers in the array (vector), so it's like:
The numbers don't mean anything in-and-of-themselves. The values don't represent qualities of the words, they're just numbers in relation to others in the training data. They're different numbers in different training data because all the words are scored in relation to each other, like a graph. So when you compute them you arrive at words and meanings in the training data as you would arrive at a point in a coordinate space if you subtracted one [x,y,z] from another [x,y,z] in 3D.
So the rage about a vector db is that it's a database for arrays of numbers (vectors) designed for computing them against each other, optimized for that instead of say a SQL or NoSQL which are all about retrieval etc.
So king vs prince etc. - When you take into account the 1536 numbers, you can imagine how compared to other words in training data they would actually be similar, always used in the same kinds of ways, and are indeed semantically similar - you'd be able to "arrive" at that fact, and arrive at antonyms, synonyms, their French alternatives, etc. but the system doesn't "know" that stuff. Throw in Burger King training data and talk about French fries a lot though, and you'd mess up the embeddings when it comes arriving at the French version of a king! You might get "pomme de terre".
King doesn’t need to appear commonly with prince. It just needs to appear in the same context as prince.
It also leaves out the old “tf idf” normalization of considering how common a word is broadly (less interesting) vs in that particular document. Kind of like a shittier attention. Used to make a big difference.
Yeah, that is a poorly written description. I think they meant that each word gets a unique index location into an array, and the value at that word's index location is incremented whenever the word occurs.
Is that really an embedding? I normally think of an embedding as an approximate lower-dimensional matrix of coefficients that operate on a reduced set of composite variables that map the data from a nonlinear to linear space.
You're right that what I described isn't what people commonly think about as embeddings (given we are more advanced now the above description), but broadly an embedding is anything (in nlp at least) that maps text into a fixed length vector. When you make embedding like this, the nice thing is that cosine similarity has an easy to understand similarity meaning: count the number of words two documents have in common (subject to some normalization constant).
Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.
A lot of people here are trying to describe to you that no, this is not at all the starting point of modern embeddings. This has none of the properties of embeddings.
What you're describing is an idea from the 90s that was a dead end. Bag of words representations.
It has no relationship to modern methods. It's based on totally different theory (bow instead of the distributional hypothesis).
There is no conceptual or practical path from what you describe to what modern embeddings are. It's horribly misleading.
There is no conceptual or practical path from what you describe to what modern embeddings are.
There certainly is. At least there is a strong relation between bag of word representations and methods like word2vec. I am sure you know all of this, but I think it's worth expanding a bit on this, since the top-level comment describes things in a rather confusing way.
In traditional Information Retrieval, two kinds of vectors were typically used: document vectors and term vectors. If you make a |D| x |T| matrix (where |D| is the number of documents and |T| is the number of terms that occur in all documents), we can go through a corpus and note in each |T|-length row for a particular the frequency of each term in that document (frequency here means the raw counts or something like TF-IDF). Each row is a document vector, each column a term vector. The cosine similarity between two document vectors will tell you whether two documents are similar, because similar documents are likely to have similar terms. The cosine similarity between two term vectors will tell you whether two terms are similar, because similar terms tend to occur in similar documents. The top-level comment seems to have explained document vectors in a clumsy way.
Over time (we are talking 70-90s), people have found that term vectors did not really work well, because documents are often too coarse-grained as context. So, term vectors were defined as |T| x |T| matrices where if you have such a matrix C, C[i][j] contains the frequency of how often the j-th term occurs in the context of the i-th term. Since this type of matrix is not bound to documents, you can choose the context size based on the goals you have in mind. For instance, you could only count terms that are within 10 (text) distance of the occurrences of the term i.
One refinement is that rather than raw frequencies, we can use some other measure. One issue with raw frequencies is that a frequent word like the will co-occur with pretty much every word, so it's frequency in the term vector is not particularly informative, but it's large frequency will have an outsized influence on e.g. dot products. So, people would typically use pointwise mutual information (PMI) instead. It's beyond the scope of a comment to explain PMI, but intuitively you can think of the PMI of two words to mean: how much more often do the words cooccur than chance? This will result in low PMIs for e.g. PMI(information, the) but a high PMI for PMI(information, retrieval). Then it's also common practice to replace negative PMI values by zero, which leads to PPMI (positive PMI).
So, what do we have now? A |T|x|T| matrix with PPMI scores, where each row (or column) can be used as a word vector. However, it's a bit unwieldy, because the vectors are large (|T|) and typically somewhat sparse. So people started to apply dimensionality reduction, e.g. by applying Singular Value Decomposition (SVD, I'll skip the details here of how to use it for dimensionality reduction). So, suppose that we use SVD to reduce the vector dimensionality to 300, we are left with a |T|x300 matrix and we finally have dense vectors, similar to e.g. word2vec.
Now, the interesting thing is that people have found that word2vec's skipgram with negative sampling (SGNS) is implicitly vectorizing a PMI-based word-context matrix [1], exactly like the IR folks were doing before. Conversely, if you matrix-multiply the word and context embedding matrices that come out of word2vec SGNS, you get an approximation of the |T|x|T| PMI matrix (or |T|x|C| if a different vocab is used for the context).
Summarized, there is a strong conceptual relation between bag-of-word representations of old days and word2vec.
Whether it's an interesting route didactically for understanding embeddings is up for debate. It's not like the mathematics behind word2vec are complex (understanding the dot product and the logistic function goes a long way) and understanding word2vec in terms of 'neural net building blocks' makes it easier to go from word2vec to modern architectures. But in an exhaustive course about word representations, it certainly makes sense to link word embeddings to prior work in IR.
Eh, I disagree. When I began working in ML everything was about word2vec and glove and the state of the art for embedding documents was adding together all the word embeddings and it made no sense to me but it worked.
Learning about BoW and simple ways of convert text to fixed length vectors that can be used in ML algos clarified a whole for me, especially the fact that embeddings aren’t magic they are just a way to convert text to a fixed length vector.
BoW and tf-idf vectors are still workhorses for routine text classification tasks despite their limitations, so they aren’t really a dead end. Similarity a lot of things that follow BoW make a whole lot more sense if you think of them as addressing limitations of BoW.
Well, you've fooled yourself into thinking you understand something when you don't. I say this as someone with a PhD in the topic, who has taught many students, and published dozens of papers in the space.
The operation of adding BoW vectors together has nothing to do with the operation of adding together word embeddings. Well, aside from both nominally being addition.
It's like saying you understand what's happening because you can add velocity vectors and then you go on to add the binary vectors that represent two binary programs and expect the result to give you a program with the average behavior of both. Obviously that doesn't happen, you get a nonsense binary.
They may both be arrays of numbers but mathematically there's no relationship between the two. Thinking that there's a relationship between them leads to countless nonsense conclusions: the idea that you can keep adding word embeddings to create document embeddings like you keep adding BoWs, the notion that average BoWs mean the same thing as average word embeddings, the notion that normalizing BoWs is the same as normalizing word embeddings and will lead to the same kind of search results, etc. The errors you get with BoWs are totally different from the errors you get with word or sentence or document embeddings. And how you fix those errors is totally different.
No. Nothing at all makes sense about word embeddings from the point of BoW.
Also, yes BoW is a total dead end. They have been completely supplanted. There's never any case where someone should use them.
> as someone with a PhD in the topic, who has taught many students, and published dozens of papers
:joy: How about a repo I can run that proves you know what you're talking about? :D Just put together examples and you won't have to throw around authority fallacies.
> if you add 2 vectors together nothing happens and its meaningless
In game dev you move a character in a 3D space by taking their current [x, y, z] vector and adding a different [x, y, z] to it. Even though left/right has nothing to do with up/down, because there are foundational concepts like increase/decrease in relation to a [0, 0, 0] origin, it still affects all the axes and gives you a valuable result.
Take that same basic idea and apply it to text-embedding-ada-002 with its 1536 dimensional embedding arrays, and you can similarly "navigate words" by doing similar math on different vectors. That's what is meant by the king - man = queen type concepts.
I think what the person before you meant about it not making sense is that it's strange that something like a dot product gives you similarity. To your point (I think) it only would assuming the vectors were scored in a meaningful way, of course if you did math on nonsense vectors you'd get nonsense results, but if they are modeled appropriately it should be a useful way to find nodes of certain qualities.
Yes! And the follow up that cosine similarity (for BoW) is a super simple similarity metric based on counting up the number of words the two vectors have in common.
How does this enable cosine similarity usage? I don't get the link between incrementing a word's index by it's count in a text and how this ends up with words that have similar meaning to have a high cosine similarity value
I think they are talking about bag-of-words. If you apply a dimensionality reduction technique like SVD or even random projection on bag-of-words, you can effectively create a basic embedding. Check out latent semantic indexing / latent semantic analysis.
You're right, that approach doesn't enable getting embeddings for an individual word. But it would work for comparing similarity of documents - not that well of course, but it's a toy example that might feel more intuitive
I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.
A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
> I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.
You can simplify this with a map, and only store non-zero values, but also you can be in-efficient: this is for learning. You can choose to store more valuable information than just word count. You can store any "feature" you want - various tags on a post, cohort topics for advertising, bucketed time stamps, etc.
For learning just storing word count gives you the mechanics you need of understanding vectors without actually involving neural networks and weights.
> I also doubt your claim “and the results aren't totally terrible”.
> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc
(1) the comment suggested filtering out these words, and
(2) the results aren't terrible. This is literally the first assignment in Stanfords AI class [1], and the results aren't terrible.
> A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
Wow that seems a lot more complicated for something that was supposed to be a learning exercise.
>> I also doubt your claim “and the results aren't totally terrible”.
>> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc
> (1) the comment suggested filtering out these words,
It mentions that as a possible improvement over the version whose results it claims aren’t terrible.
Really appreciate you explaining this idea, I want to try this! It wasn't clear to me until I read the discussion that you meant that you'd have similarity of entire documents, not among words.
Yes! And that’s an oversight on my part — word embeddings are interesting but I usually deal with documents when doing nlp work and only deal with word embeddings when thinking about how to combine them into a document embedding.
Give it a shot! I’d grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It’s not going to be amazing, but it’s a great way to build some baseline intuition for nlp work with text that you can do on a laptop.
Depends on what "AI" means here. There is a spectrum of "we have a bunch of data in a database and some folks hand tuning queries" to "we built a deep learning network to predict XYZ." In the middle of that spectrum lie things like decision trees which provide explainable results.
It's the former, but from what I read in the article, a shady version of that is my take.
The computer basically appeared to randomise a high number of people to kill based on a very shallow dataset, weak data linking and a high desire to kill people who are Palestinian.
It's hard to read the details from the Guardian article and think of it as anything other than a randomised Israeli state murder machine. I can't envisage it being accurate to a point any reasonable person would accept it be used against someone they know, in any circumstances.
That it targeted 10's of thousands is utterly horrifying. That it was involved in any actual battleground scenario makes me think those involved in it's creation and sale are culpable.
Not an answer to your question, more bolstering your comment, hoping someone else will come along and be encouraged to answer. Trying to learn about this on the internet is a nauseating series of blogspam.
If by "conventional neural network" you mean a stack off fully connected layers, then yes, in theory one of those could learn a similar mechanism because of the universal approximation theorem. However, training one might be intractable.
It's good to ignore self-attention for a moment and take a look at a convolutional network (a CNN). Why is a CNN more effective than just stacks of fully connected layers? Well, instead of just throwing data at the network and telling it to figure out what to do with it, we've built in some prior knowledge into the network. We tell it "you know, a cup is going to be a cup even if it is 10 pixels up or 10 pixels down; even if it is in the upper right of the image or the lower left." We also tell it, "you know, the pixels near a given pixel are going to be pretty correlated with that pixel, much more so than pixels far away." Convolutions help us express that kind of knowledge in the form of a neural network.
Self-attention plays a similar role. We are imbuing our network with an architecture that is aware of the data it is about to receive. We tell it "hey, elements in this sequence have a relation with one another, and that relative location might be important for the answer". Similar to convolutions, we also tell it that the location of various tokens in a sequence is going to vary: there shouldn't be much difference between "Today the dog went to the park" and "The dog went to the park today." Like convolutions, self-attention builds in certain assumptions we have about the data we are going to train the network on.
So yes, you are right that fully-connected layers can emulate similar behavior, but training them to do that isn't easy. With self-attention, we've started with more prior knowledge about the problem at hand, so it is easier to solve.
Great answer. Imbuing a deep learning model with well thought out inductive biases is one of the strongest ways of guiding your model to interpret the data the way you want it to. Otherwise it’s kind of shooting in the dark and hoping to get lucky.
I can’t stand it when people lazily personify ML models, but it’s akin to giving someone with no experience some wood and then pointing to a shed and saying “make one of those from this”. Instead you’d expect them to be much more successful if you also give them a saw, a drill, some screws etc.
Good explanation. Which is why the success of transformers, LLMs etc. is still not the final word in Rich Sutton's "The Bitter Lesson" -- no learning method is free of inductive biases.
Inductive biases can work even if they're wrong, because they allow for simple and quick action, simpler reasoning. They don't need to be correct for that to pay off, they just need a positive expected value.
You can verify the reduction of the problem space. Think of it this way, if data has some property, for example, it's mirrored on an axis. If X is a data point, then so is -X.
Well, a model that is aware of this symmetry only has half as much data to look at and one less thing to learn.
But that's only half of it. Truth is, assuming symmetries works pretty well even if the assumption is wrong. Why? Generalization. A model with less data will generalize more (better is perhaps debatable, but it will definitely generalize more)
This is the basic idea behind "geometric deep learning". There's loads of papers, but here's a presentation.
If anything, it seems like this was an effective means to introduce a lot of people to possible liabilities they have under GDPR/CCPA (or why they are not applicable to them).
Fine, but I had no desire to be introduced to the intricacies of the CCPA that afternoon. I was off minding my own business and didn’t ask for an “Are You Compliant For Dummies” course to be dropped in my lap.
This is partly because the first course in real analysis (and usually the first upper division linear algebra course) is the first proof oriented course most students face. The material isn't necessarily the hardest, but the proof stuff is new to many people.
Chongli's experience is much more similar to mine, we covered basic real analysis topics in calculus classes, and the class titled real analysis was a third-year pure math course. It was designed for students who had taken multiple proof oriented classes already and were pursuing a pure math major. It's just one of those things that varies between colleges I guess.
The linked graph shows Sweden at 7x norway's rate, 5x finland's rate and 3x denmarks rate. Am I reading that right? How are you concluding that Sweden's strategy was clearly superior?
> The basic problem with the Remarkable is that it's useless without the $8/month "cloud" plan because that's the only plan that enables its primary feature: handwriting recognition.
Are you saying the key feature for the remarkable is handwriting recognition?
I have a remarkable 2 and I love it, but the whole appeal is to replace the endless paper notebooks and loose sheets that I use. It is great for that. Since I've bought it they've added a bunch of new stuff and the subscription stuff as well. I'm not sure how I feel about it, but then again I have no need for those features. It sounds like most people want a general purpose e-ink tablet whereas the remarkable is basically e-ink paper.
Even as paper, it's subpar, because I can index paper and perform index lookups in sublinear time. On the reMarkable, not so much, and it has not yet stopped being wild to me that for all its evidently brilliant hardware integration, its software falls short on functionality that was an established standard centuries before computing machines were even imagined.
That's a fair criticism. I always rewrite & revise the important notes as I go forward, so I never end up looking back. I can imagine that it is super frustrating to find things that aren't near the end or on a known page number -- their page scrolling functionality is clunky to say the least.