Show HN: Using word vectors to classify spam messages

mci · on Dec 17, 2017

Sounds like a fun project. However, I doubt that word vectors buy you anything more than, say, good old Nilsimsa from 2001 (https://en.wikipedia.org/wiki/Nilsimsa_Hash). Side note: py-nilsimsa should iterate over Unicode code points instead of UTF-8 bytes. As it stands now, the similarity of any two texts in the same language using a non-Latin script is ~80 rather than ~0.
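
To illustrate the bytes-versus-code-points issue with something much simpler than Nilsimsa: the sketch below compares two strings by the Jaccard overlap of their trigrams, once over UTF-8 bytes and once over code points. It is a toy stand-in, not Nilsimsa and not py-nilsimsa's API, and the function names are made up for the example.

```python
def ngrams(seq, n=3):
    """Set of all length-n windows of a str or bytes sequence."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def byte_similarity(s, t):
    # Trigrams over the UTF-8 encoding: what a byte-oriented hash sees.
    return jaccard(ngrams(s.encode("utf-8")), ngrams(t.encode("utf-8")))


def codepoint_similarity(s, t):
    # Trigrams over Unicode code points: what the text actually contains.
    return jaccard(ngrams(s), ngrams(t))


# Two different Cyrillic words: no code-point trigram in common, but their
# UTF-8 bytes share the trigram (0xD0, 0xB0, 0xD0) because every letter
# here is encoded with the same lead byte 0xD0.
print(codepoint_similarity("мама", "папа"))
print(byte_similarity("мама", "папа"))
```

The code-point score is zero while the byte score is not, which is the same kind of inflation, on a small scale, that shared lead bytes would cause in a byte-oriented hash.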

laretluval · on Dec 17, 2017

word2vec has the advantage that you could potentially identify spam messages that are paraphrases rather than exact copies of the ones in the training set.
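
A minimal sketch of that idea, assuming made-up 3-dimensional vectors instead of real pretrained ones: average the word vectors of each message and compare the averages by cosine similarity. `TOY_VECTORS`, `mean_vector` and `cosine` are hypothetical names for this example, not anything from the project.

```python
import math

# Hypothetical vectors, invented for illustration only. Real GloVe or
# word2vec vectors have tens to hundreds of dimensions.
TOY_VECTORS = {
    "cheap": [0.9, 0.1, 0.0],
    "inexpensive": [0.85, 0.15, 0.05],
    "pills": [0.1, 0.9, 0.0],
    "tablets": [0.15, 0.85, 0.05],
    "meeting": [0.0, 0.1, 0.9],
    "tomorrow": [0.05, 0.0, 0.95],
}


def mean_vector(tokens, vectors):
    """Unweighted mean of the vectors of all known tokens, or None."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


spam = mean_vector(["cheap", "pills"], TOY_VECTORS)
paraphrase = mean_vector(["inexpensive", "tablets"], TOY_VECTORS)
ham = mean_vector(["meeting", "tomorrow"], TOY_VECTORS)

print(cosine(spam, paraphrase))  # close to 1: no shared words, similar meaning
print(cosine(spam, ham))         # close to 0: unrelated message
```

With real embeddings the separation is much less clean than with these hand-picked numbers, so treat this only as a picture of the mechanism.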

mci · on Dec 18, 2017

1. Pedantically: it's GloVe, not word2vec. 2. Nilsimsa, or any locality-sensitive hash, detects changed messages too, be the changes synonyms or not. 3. I don't think OP's GloVe contains words like v1agra.

doody_parizada · on Dec 18, 2017

We don't have words like v1agra. As I mentioned in the README, we took vectors pretrained on Wikipedia. One possible improvement would be to train the vectors on our own dataset.
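
For anyone wanting to see how much of a message falls outside the pretrained vocabulary, a small check like the following would do. It is a sketch only: `split_by_vocabulary` is a hypothetical helper, and the vocabulary below is a tiny made-up set rather than the real GloVe word list.

```python
def split_by_vocabulary(tokens, vocabulary):
    """Split tokens into (known, unknown) relative to a vocabulary set."""
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown


# Made-up vocabulary standing in for the keys of a pretrained embedding.
vocabulary = {"buy", "cheap", "viagra", "now"}

known, unknown = split_by_vocabulary(["buy", "cheap", "v1agra", "now"], vocabulary)
print(known)    # tokens that have a vector
print(unknown)  # tokens that would be dropped, e.g. obfuscated spellings
```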

amelius · on Dec 17, 2017

Suggestion for better title:

"Collaboratively Filtering Spam with Word Vectors while Respecting Privacy"

chrbarrol · on Dec 17, 2017

I was hoping to learn about word2vec by reading the source code, but am I right in saying this has nothing to do with word2vec?

drwl · on Dec 17, 2017

Looks like it uses GloVe and not word2vec. They're both algorithms for generating word vectors but they are different.

RHSman2 · on Dec 17, 2017

Not by much

doody_parizada · on Dec 18, 2017

We started out with word2vec, but discovered it was easier to work with GloVe for our purpose. There are ways to convert GloVe vectors to word2vec format, such as: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
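
As far as I understand it, the two text formats differ mainly in that the word2vec text format starts with a header line giving the number of vectors and their dimensionality, which the GloVe text format lacks. A minimal sketch of that conversion without gensim, assuming whitespace-separated lines of the form `word v1 v2 ...` (`glove_lines_to_word2vec` is a hypothetical name, not the gensim script):

```python
def glove_lines_to_word2vec(lines):
    """Prepend a '<count> <dim>' header to GloVe-style text lines."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if not rows:
        return ["0 0"]
    # Each row is a token followed by its vector components.
    dim = len(rows[0].split()) - 1
    return [f"{len(rows)} {dim}"] + rows


glove = ["the 0.1 0.2\n", "cat 0.3 0.4\n"]
for line in glove_lines_to_word2vec(glove):
    print(line)
```

For real files the gensim script linked above is the safer choice, since it handles file I/O for you.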

programmarchy · on Dec 17, 2017

Slightly tangential, but does anyone know if word2vec vectors can be combined to build up "concepts"? I'm interested to know whether it could be used to identify parallelism in works of literature, e.g. identifying plagiarism, parallels between the Old and New Testament, or intertextual works like Ulysses by Joyce and the Odyssey.

physicsyogi · on Dec 17, 2017

Maybe look into ConceptNet Numberbatch: https://github.com/commonsense/conceptnet-numberbatch/blob/m...

doody_parizada · on Dec 18, 2017

One of the things we ran into while working on this project is the problem of combining word vectors into a paragraph vector. Apparently there are many ways to do so, and each one yields different results depending on the length and content of the text. We used a weighted average of the word vectors, with weights based on each word's frequency in a corpus.
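
One common way to do a frequency-weighted average is to give each word the weight a / (a + p(w)), where p(w) is the word's relative frequency in a corpus, so that very common words count for less. The sketch below assumes that scheme purely for illustration; the exact weighting used in the project may differ, and the function name, the constant `a` and the toy numbers are all assumptions.

```python
def weighted_message_vector(tokens, vectors, word_freq, a=1e-3):
    """Frequency-weighted average of word vectors, or None if no token is known.

    Assumed weighting: a / (a + p(w)), with p(w) the relative corpus
    frequency of w. Tokens without a vector are skipped.
    """
    acc = None
    total_weight = 0.0
    for tok in tokens:
        vec = vectors.get(tok)
        if vec is None:
            continue
        weight = a / (a + word_freq.get(tok, 0.0))
        if acc is None:
            acc = [0.0] * len(vec)
        for i, x in enumerate(vec):
            acc[i] += weight * x
        total_weight += weight
    if acc is None:
        return None
    return [x / total_weight for x in acc]


# Toy 2-dimensional vectors and made-up relative frequencies.
vectors = {"the": [1.0, 0.0], "viagra": [0.0, 1.0]}
word_freq = {"the": 0.05, "viagra": 0.00001}

# The rare word dominates the result; the frequent word barely moves it.
print(weighted_message_vector(["the", "viagra"], vectors, word_freq))
```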

abc-xyz · on Dec 18, 2017

This may be off topic, but could this be used for classifying the trustworthiness of Amazon/App Store/etc. reviews? Or does anyone know of an open source project that someone who doesn't know anything about machine learning could use to achieve this?

codegladiator · on Dec 18, 2017

> https://thereviewindex.com/blog/hello-world

Arnt · on Dec 17, 2017

This sounds like an early version of DCC: https://www.rhyolite.com/dcc/

At first glance, I don't see anything that DCC didn't do. What did I miss?

EmilStenstrom · on Dec 17, 2017

It seems DCC isn't using word vectors at all? Using word vectors you can know that viagra and v14gr4 are the same word, because they are used in the same way in messages. That in turn means you don't need word lists, and can instead build on huge knowledge bases like GloVe.

massaman_yams · on Dec 17, 2017

That, and the fact that a message is sent in bulk isn't actually a very strong indicator that the message is spam, at least in the email world. As one input to a filtering system, it can be useful, but not as a rule applied on its own without consideration for other factors.

chasing · on Dec 17, 2017

Why is Silicon Valley so interested in censoring certain kinds of speech?

dang · on Dec 17, 2017

The HN community is international and overwhelmingly not based in Silicon Valley. From his GitHub profile it looks like the author of this project isn't either. So what you said is considerably off the mark. Either way, though, please don't post flamebait here.

https://news.ycombinator.com/newsguidelines.html

chasing · on Dec 17, 2017

'Twas a joke based on a series of other highly active threads on this site. I assumed it would be taken as such. My error!

erik_seaberg · on Dec 17, 2017

As a recipient of email, choosing not to read a message is not an offense against free speech, especially when email has become unusable to the individual without doing so.