Show HN: Using word vectors to classify spam messages (github.com/doodyparizada)
104 points by doody_parizada on Dec 17, 2017 | 21 comments



Sounds like a fun project. However, I doubt that word vectors buy you anything more than, say, good old Nilsimsa from 2001 (https://en.wikipedia.org/wiki/Nilsimsa_Hash). Side note: py-nilsimsa should iterate over Unicode code points instead of UTF-8 bytes. As it stands now, the similarity of any two texts in the same language using a non-Latin script is ~80 rather than ~0.
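
To illustrate the side note about bytes versus code points, here is a minimal sketch. It is not py-nilsimsa's code and not the Nilsimsa algorithm; it only compares plain trigram sets with Jaccard similarity, and the two sample sentences are made up. The point is that byte-level windows over UTF-8 pick up the shared lead bytes of a script, while code-point windows do not.

```python
def trigrams(seq):
    """Return the set of all length-3 sliding windows over a sequence."""
    return {tuple(seq[i:i + 3]) for i in range(len(seq) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two unrelated Russian sentences (hypothetical sample data).
text_a = "привет как дела"
text_b = "очень хороший фильм"

# Byte level: each Cyrillic letter in these strings is 2 bytes in UTF-8,
# with lead byte 0xD0 or 0xD1, so unrelated texts still share byte trigrams.
byte_sim = jaccard(trigrams(text_a.encode("utf-8")),
                   trigrams(text_b.encode("utf-8")))

# Code-point level: windows cover whole characters, so nothing overlaps here.
char_sim = jaccard(trigrams(text_a), trigrams(text_b))

print(f"bytes: {byte_sim:.2f}  code points: {char_sim:.2f}")
```

The byte-level score comes out above zero purely because of the encoding, which is the kind of inflation the comment describes.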


word2vec has the advantage that you could potentially identify spam messages that are paraphrases rather than exact copies of the ones in the training set.


1. Pedantically: it's GloVe, not word2vec. 2. Nilsimsa, or any locality-sensitive hash, detects changed messages too, whether or not the changes are synonyms. 3. I don't think OP's GloVe contains words like v1agra.
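
For readers unfamiliar with point 2, a rough sketch of how a locality-sensitive hash tolerates edits. SimHash over character trigrams is swapped in here because it is short to write; it is not Nilsimsa, and the sample messages and the choice of MD5 as the feature hash are assumptions for illustration only.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8 bytes taken from each MD5 digest


def simhash(text):
    """Compute a 64-bit SimHash fingerprint over character trigrams."""
    counts = [0] * BITS
    for i in range(max(len(text) - 2, 1)):
        gram = text[i:i + 3]
        # Hash each trigram to a 64-bit integer.
        h = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest()[:8], "big")
        # Each trigram votes +1 or -1 on every bit position.
        for b in range(BITS):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A fingerprint bit is set when the votes for it are positive.
    return sum(1 << b for b in range(BITS) if counts[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical messages: an original, a lightly edited copy, and unrelated text.
original = "Buy cheap pills online today, limited offer"
edited = "Buy cheap pills online now, limited offer"
unrelated = "Meeting moved to Thursday, agenda attached"

print("edited:   ", hamming(simhash(original), simhash(edited)))
print("unrelated:", hamming(simhash(original), simhash(unrelated)))
```

A small Hamming distance suggests near-duplicate text; where to put the threshold is a tuning decision this sketch does not address.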


We don't have words like v1agra. As I mentioned in the README, we took vectors pretrained on Wikipedia. One possible improvement would be to train the vectors on our own dataset.


Suggestion for better title:

"Collaboratively Filtering Spam with Word Vectors while Respecting Privacy"


I was hoping to learn about word2vec by reading the source code, but am I right in saying this has nothing to do with word2vec?


Looks like it uses GloVe and not word2vec. They're both algorithms for generating word vectors but they are different.


Not by much


We started out with word2vec, but discovered it was easier to work with GloVe for our purpose. There are ways to convert GloVe vectors to word2vec format, such as: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
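
For context on what such a conversion involves: the word2vec text format starts with a header line giving the vocabulary size and vector dimension, while GloVe text files have no header. The sketch below shows that idea in plain Python. It is not the gensim script itself, the function name is hypothetical, and it assumes each line is a token followed by space-separated floats.

```python
def glove_to_word2vec_lines(glove_lines):
    """Prepend a '<vocab_size> <dim>' header to GloVe-style text lines.

    Hypothetical helper for illustration; assumes no token contains spaces.
    """
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Everything after the first field on a line is a vector component.
    dim = len(rows[0].split()) - 1
    return [f"{len(rows)} {dim}"] + rows


# Tiny made-up sample with 3-dimensional vectors.
sample = [
    "spam 0.1 0.2 0.3\n",
    "ham 0.4 0.5 0.6\n",
]
converted = glove_to_word2vec_lines(sample)
print("\n".join(converted))
```

For real files, streaming line by line rather than holding everything in memory would be the safer choice.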


Slightly tangential, but does anyone know if word2vec can be used in a compound form to build up "concepts"? I'm interested to know if it could be used to identify parallelism in works of literature, e.g. identifying plagiarism, parallels between the Old and New Testament, or intertextual works like Ulysses by Joyce and the Odyssey.



One of the things we ran into while working on this project is the problem of turning word vectors into a paragraph vector. Apparently there are many ways to do so, and each one yields different results depending on the length and content of the text. We used a weighted average of the word vectors, with weights based on each word's frequency in a corpus.
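
A minimal sketch of a frequency-weighted average. The comment above does not say which weighting the project uses, so the formula a / (a + frequency), which down-weights common words, is an assumption here, as are the function names, the toy 2-dimensional vectors, and the frequencies.

```python
import math


def paragraph_vector(tokens, vectors, freq, a=1e-3):
    """Weighted average of word vectors; rarer words get larger weights.

    The weight a / (a + freq) is an assumed scheme, not the project's.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in vectors:
            continue  # skip out-of-vocabulary tokens
        w = a / (a + freq.get(tok, 0.0))
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known tokens: zero vector
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy data (made up): relative corpus frequencies and 2-d vectors.
vectors = {"buy": [1.0, 0.0], "cheap": [0.8, 0.2],
           "pills": [0.9, 0.1], "the": [0.0, 1.0]}
freq = {"the": 0.05, "buy": 0.001, "cheap": 0.0005, "pills": 0.0001}

msg_a = paragraph_vector(["buy", "the", "cheap", "pills"], vectors, freq)
msg_b = paragraph_vector(["cheap", "pills"], vectors, freq)
print(cosine(msg_a, msg_b))
```

With this weighting, a frequent word like "the" barely moves the average, so the two messages end up pointing in nearly the same direction.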


This may be off topic, but could this be used for classifying the trustworthiness of Amazon/App Store/etc. reviews? Or does anyone know of an open source project that someone who doesn't know anything about machine learning could use to achieve this?



This sounds like an early version of DCC: https://www.rhyolite.com/dcc/

At first glance, I don't see anything that DCC didn't do. What did I miss?


It seems DCC isn't using word vectors at all? Using word vectors you can know that viagra and v14gr4 are the same word, because they are used in the same way in messages. That in turn means you don't need word lists, and can instead build from huge knowledge bases like GloVe.


That, and the fact that a message is sent in bulk isn't actually a very strong indicator that the message is spam, at least in the email world. As one input to a filtering system, it can be useful, but not as a rule applied on its own without consideration for other factors.


Why is Silicon Valley so interested in censoring certain kinds of speech?


The HN community is international and overwhelmingly not based in Silicon Valley. From his GitHub profile it looks like the author of this project isn't either. So what you said is considerably off the mark. Either way, though, please don't post flamebait here.

https://news.ycombinator.com/newsguidelines.html


'Twas a joke based on a series of other highly active threads on this site. I assumed it would be taken as such. My error!


As a recipient of email, choosing not to read a message is not an offense against free speech, especially when email has become unusable to the individual without doing so.



