"Once the dictionary is created, we can average all the word vectors for a given headline to get the numeric representation of the headline itself."
Can someone explain to me why this is useful? Aren't we losing a lot of precision from the word2vec results?
And as a general question: is there any useful knowledge we can extract from this visualization apart from "most of the time, news channels write about different things"?
Don't get me wrong, I think the techniques displayed here are really cool, but I have the feeling the conclusions are either absent or trite.
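For anyone unclear on the quoted averaging step: it can be sketched like this. The toy vectors stand in for a real word2vec model (in practice you'd load something like gensim's KeyedVectors), and `headline_vector` is my own name for the helper, not anything from the article:

```python
import numpy as np

# Toy stand-in for a trained word2vec model: word -> vector.
# Real embeddings would be 100-300 dimensional and come from a trained model.
word_vectors = {
    "breaking": np.array([0.9, 0.1, 0.0]),
    "news":     np.array([0.8, 0.2, 0.1]),
    "cats":     np.array([0.1, 0.9, 0.4]),
}

def headline_vector(headline, vectors):
    """Represent a headline as the mean of its known words' vectors."""
    words = [w for w in headline.lower().split() if w in vectors]
    if not words:
        # No known words: fall back to the zero vector.
        return np.zeros(next(iter(vectors.values())).shape)
    return np.mean([vectors[w] for w in words], axis=0)

vec = headline_vector("Breaking news cats", word_vectors)
```

The precision concern is real: averaging throws away word order and lets opposing words cancel out, but it's a cheap baseline that gives every headline a fixed-length vector you can feed to downstream tools.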
All that in-depth explanation, and he forgot the first rule of data visualization: always label your axes!
This should have been at the top of the article, but was buried at the bottom: "The left side of the 2D representation represents the more serious headlines, while the right side represents the more silly headlines."
The axes are intentionally unlabeled. They don't represent anything, just spatial coordinates.
The fact that the X-axis had a parseable interpretation was due to the randomness of the clustering algorithm (and the sliding scale of seriousness is not a hard rule; there are many counterexamples. I noted the scale as a quick observation). It is possible that the chart could end up rotated if the algorithm is run again.
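To make the rotation point concrete: if the 2D layout came from something like t-SNE (the article doesn't name the algorithm, so this is an assumption), reruns with different seeds can flip or rotate the map, which is exactly why the axes carry no fixed meaning. A minimal sketch with scikit-learn, using random data in place of the real headline vectors:

```python
import numpy as np
from sklearn.manifold import TSNE

# 50 random 10-d points standing in for averaged headline vectors.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 10))

def project(seed):
    """Project X to 2D with t-SNE using the given random seed."""
    return TSNE(n_components=2, perplexity=5, init="random",
                random_state=seed).fit_transform(X)

emb_a = project(0)   # one run
emb_b = project(1)   # same data, different seed: layout can rotate/flip
emb_a2 = project(0)  # same seed: reproducible layout
```

Fixing `random_state` makes a single chart reproducible, but it doesn't give the axes any semantic direction; the "serious vs. silly" gradient is an artifact of one particular run.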
>Coincidentally around the same time Facebook announced their anticlickbait initiative, Facebook open-sourced their fasttext project, which can quickly build models to classify text using some of the above example techniques. Hmmmmmm…
There's an interesting notion, a clickbait filter.