Algorithmic tagging of Hacker News or any other site (algorithmia.com)
105 points by doppenhe on May 21, 2014 | hide | past | favorite | 65 comments



Looking at the HN demo, I'm impressed. There are definitely relevant tags being generated. Unfortunately, there are also some noisy tags which clutter the results. Taking one example, the post "DevOps? Join us in the fight against the Big Telcos" was given the tags "phone tools sendhub we're news experience customers comfortable"; I would say that "we're" is unarguably noise. Another example: "Questions for Donald Knuth", with tags "computer programming don i've knuth taocp algorithms i'm", where I would call out "i've" and "i'm".

There are other words in both examples that I personally would not use as tags, but I can't really say they would be universally useless. I think a vast improvement could be made just by having a dictionary blacklist filled with things like these; from this tiny sampling, contractions seem to be a big loser.
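A minimal sketch of that kind of dictionary blacklist as a post-filter; the word list here is just the noisy examples from this thread (plus a rule that drops anything containing an apostrophe), not a real stopword list:

```python
# Illustrative blacklist: contractions and overly generic words seen in
# the demo output. A real list would be much larger.
BLACKLIST = {"we're", "i've", "i'm", "news", "experience", "comfortable"}

def filter_tags(tags):
    """Drop blacklisted words and anything containing an apostrophe."""
    return [t for t in tags if t.lower() not in BLACKLIST and "'" not in t]

print(filter_tags("phone tools sendhub we're news".split()))
# ['phone', 'tools', 'sendhub']
```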


Agreed. Actually, we could turn up the number of iterations the LDA algorithm runs and those would get fixed up, but it affects performance. This was just a quick and dirty example (with an expectation of high traffic).

You can also seed LDA with a whitelist of words, which we didn't do either; again, all in the name of a quick and dirty solution to show.

Glad you liked it!
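For intuition about the iteration knob being discussed, here is a toy collapsed-Gibbs LDA sampler in plain Python; this is only an illustration (a production system would use a library such as gensim), and all names and parameters here are made up:

```python
import random

def lda_tags(docs, k, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampling for LDA; returns top-3 words per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    vi = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    nzw = [[0] * V for _ in range(k)]          # topic-word counts
    ndz = [[0] * k for _ in range(len(docs))]  # doc-topic counts
    nz = [0] * k                               # tokens per topic
    z = []                                     # topic assignment per token
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            nzw[t][vi[w]] += 1; ndz[d][t] += 1; nz[t] += 1
        z.append(zs)
    for _ in range(iters):  # more sweeps -> cleaner topics, slower runs
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                nzw[t][vi[w]] -= 1; ndz[d][t] -= 1; nz[t] -= 1
                # resample topic proportional to p(topic | everything else)
                weights = [(nzw[j][vi[w]] + beta) / (nz[j] + V * beta)
                           * (ndz[d][j] + alpha) for j in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[d][i] = t
                nzw[t][vi[w]] += 1; ndz[d][t] += 1; nz[t] += 1
    # the top words of each topic serve as "tags"
    return [sorted(vocab, key=lambda w: -nzw[j][vi[w]])[:3] for j in range(k)]
```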


How about the plain old traditional 'stop words' list to solve this issue?


Try using tf-idf instead of raw word frequencies.


We're not using raw word frequencies but LDA: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Didn't know about tf-idf; thanks for the tip.
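For reference, a minimal tf-idf sketch in plain Python (the suggested alternative weighting, not what the demo uses). Words frequent in one document but common across the whole corpus score near zero, which naturally suppresses fillers like "i'm":

```python
import math
from collections import Counter

def tfidf(doc, corpus):
    """Score each word of `doc` against a corpus of tokenized documents."""
    tf = Counter(doc)
    n = len(corpus)
    return {w: (tf[w] / len(doc)) * math.log(n / sum(w in d for d in corpus))
            for w in tf}

corpus = [
    ["i'm", "reading", "taocp"],
    ["i'm", "a", "knuth", "fan"],
    ["i'm", "learning", "algorithms"],
]
scores = tfidf(corpus[1], corpus)
# "i'm" appears in every document, so its idf (and hence score) is zero,
# while "knuth" is distinctive and scores positive.
```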


It's hard to imagine someone knowing LDA without also knowing about TF.IDF (it's a dot product, not a hyphen).


That and the questionable use of stopwords makes it sound like they're just slapping some marketing on an out-of-the-box LDA implementation (not that I blame them, it's a dense algorithm).


The OP is pushing "Algorithmia", whose manifesto is here: http://blog.algorithmia.com/post/75680476188/algorithm-devel...

and this doesn't really strike me as much of a victory for the idea that the implementation of an algorithm is the main sticking point in practice.


We are just showing the versatility of the platform through a real-world use case. LDA is hard to implement/scale for the untrained, same as many other machine learning, optimization, graph traversal, etc. algorithms. What we are building is a crowd-sourced, generalized API where all these algorithms can be combined and used together to really make any application smarter.

The demo we show here is a version of how we used our platform to generate tags for all entries in our API by combining algorithms that already existed in Algorithmia (modified for performance over quality due to the volume that HN would bring).

Cheers.


It's only hard because most libraries I've seen have so little documentation available. It's simple once you understand the library. We need people picking these libraries up, implementing them on weekends in fun projects, documenting their work and code, and publishing it for everyone to learn from.


What about tag size (i.e., word length)? For the example of the Knuth article, it'd be good if length were > 3.


I think you'd need to whitelist some useful acronyms if you implemented that rule.

USA

NSA

DOJ

TCP

POS

etc.
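A sketch of that combination, keeping tags that are longer than 3 characters or are on an acronym whitelist (the whitelist contents here are just the examples above, not exhaustive):

```python
# Whitelisted short acronyms that should survive the length filter.
ACRONYMS = {"USA", "NSA", "DOJ", "TCP", "POS"}

def keep_tag(tag):
    """Keep tags longer than 3 characters, or short whitelisted acronyms."""
    return len(tag) > 3 or tag.upper() in ACRONYMS

tags = ["knuth", "tcp", "i'm", "algorithms", "don"]
print([t for t in tags if keep_tag(t)])  # ['knuth', 'tcp', 'algorithms']
```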


Very interesting.

I have been doing some research towards automatic tagging lately, and I found several Python project coming close to this goal : https://pypi.python.org/pypi/topia.termextract/ , https://github.com/aneesha/RAKE , https://github.com/ednapiranha/auto-tagify

but none of them is satisfying, whereas Algorithmic Tagging of HN looks pretty good.

I have been trying to implement a similar feature for http://reSRC.io, to automagically tag articles for easy retrieval through the tag search engine.


Got your email; will be responding later today. We enable automated tagging for any site directly from our API, no need to implement anything else.


Are you sure? RAKE seems to perform much better than the LDA that is used in the OP.


Well, it's not that easy. The algorithms are very primitive and too full of noise to be useful.

For example, try this on restaurant reviews like http://www.yelp.com/biz/el-gaucho-seattle. I get these tags:

steak reviews seattle food service gaucho restaurant review

Not useful, right?

The current state of the art would use much more sophisticated NLP for generating POS tags and use sentiment analysis. For example, check out MSR Splat at http://research.microsoft.com/en-us/projects/msrsplat/defaul....


This does well on the 'T Shirt test' on some sites, e.g. http://www.riverisland.com/men/t-shirts--vests

This could be really useful in ecommerce for creating search keywords for category pages. The noise in the results doesn't matter; so long as it gets 'T-Shirt' and someone searches for 'T-shirt', all is well and good.

Are you looking to plug what you have into something such as the Magento e-commerce platform? The right clients could pay proper money for this functionality. It is something I would quite like to speak to you about.


Definitely interested; please email me at diego at algorithmia dot com.


LDA is very impressive. But it might be better to have an iterative algorithm that forms a linear-algebraic basis from several tags (and let people add more tags as vectors into the mix) and then every time people upvote something, you update their interests (points in the linear algebraic space) and then every time an article gets upvoted you update ITS tags ...

after a while the system converges to a very useful structure and new members can see correctly tagged articles and the system learns their interests by itself
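A rough sketch of the update step being described, treating tags as dimensions of a shared vector space; the learning rate `lr` and the symmetric update rule are my assumptions about how such a scheme might work, not an established algorithm:

```python
def upvote(user_vec, article_vec, lr=0.1):
    """Hypothetical update rule: on each upvote, nudge the user's
    interest vector and the article's tag vector toward each other."""
    new_user = [u + lr * (a - u) for u, a in zip(user_vec, article_vec)]
    new_article = [a + lr * (u - a) for u, a in zip(user_vec, article_vec)]
    return new_user, new_article
```

Repeated over many votes, user and article vectors would drift toward the content the user actually engages with, which is the convergence the parent describes.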

do you know anything like this already existing?


slashdot.org tag system ?


Should be easy enough to implement; our API will be in public beta very soon, and I can show you how to build it.


Please do. Can you contact me at http://qbix.com/about? Shoot me an email there, please.


Is that how slashdot tags work??


This has real poetic potential:

    "Erlang and code style" 

    process erlang undefined
    file write data
    true code


After watching "Enough Machine Learning to Make Hacker News Readable Again"[1], I thought of a recommendation-engine/machine-learning-based linkshare/discussion system (e.g. HN/reddit style). Your front page would be continuously shaped by your up/down-votes. I'm not sure if the same could be applied to comment threads too, essentially creating automatic moderation. Algorithmic tagging would certainly be useful for that kind of site.

[1] https://news.ycombinator.com/item?id=7712297


Not too impressed, to be honest; singular/plural forms are not treated equally. I'm not familiar with LDA, but I've written an LSA implementation in the past, and it did a lot better than what is shown here.


I'm sure there is an amusing LSD joke in there somewhere.


Lol, this seriously took me by surprise. I'm currently developing a HackerNews with tags (you can self-host it). I quickly generated this Google Form, if you are interested in being a beta user in the near future:

https://docs.google.com/forms/d/1UeSD11hrjwhsVbbPiv63VZBrEcz...

PS. Screenshot included + it's already in alpha at a company with 100 users.


Do you know https://lobste.rs/, which is essentially HN with tags?


Yeah, I know lobste.rs. But I go much further than HN or lobste.rs (not being limited to only URLs or texts is just one feature). It's more a "Document Management System" with HN influence for larger businesses (or public websites) with > 30 users than an HN copy.

Call it lobste.rs 2.0


I did that in 2012 for a pet project with a friend https://github.com/SnippyHolloW/HN_stats

Here is the trained topic model (Nov. 30, 2012) with only 40 topics (for file-size mainly) https://dl.dropboxusercontent.com/u/14035465/hn40_lemmatized...

You can load it with Python:

  from gensim.models import ldamodel
  lda = ldamodel.LdaModel.load("hn40_lemmatized.ldamodel")
  lda.alpha = [lda.alpha for _ in range(40)]  # because there was a change since 2012
  lda.show_topics()
Now if you can figure out what this file is: https://dl.dropboxusercontent.com/u/14035465/pg40.params I'll buy you a beer next time you're in Paris or I'm in the Valley. ;-)


I'll take a look, thanks!


[0.006791154078718692, 0.004654721825624361, 0.004632114646875695, 0.011976800134937546, 0.01799155954072435, 0.009181094647452455, 0.0345230793213232, 0.005232498042562552, 0.011402661654834138, 0.009024477282685779, 0.007034922780349653, 0.0031922239118904504, 0.007097905854058182, 0.004999249488505551, 0.016499595508879424, 0.024729974527642036, 0.004985711413178751, 0.03119529793092641, 0.015437847520669401, 0.2948424084650949, 0.06912364384956156, 0.004776347484051836, 0.0893067258386264, 0.018226129712679208, 0.0315656235097838, 0.006267920316323028, 0.01240414536928756, 0.005343403072840281, 0.006566103139036195, 0.009403510178615212, 0.009875448474490003, 0.0038449507757111973, 0.007531241580033292, 0.0014680865916836157, 0.00767397135040071, 0.002118254148078463, 0.02605710351719099, 0.034716581721697254, 0.002314474872742398, 0.12599103592023264]


Yes but what does it mean?


Having not read the code, I'd wager it is a 40 element vector of term similarities.


Looks like cosine similarity output


It's PG's parameters for a naive Bayes model based on these LDA topics, learned by treating his comments on HN as upvotes for the articles' content.
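Whichever interpretation is right, pulling the dominant entries out of a 40-element weight vector like the one pasted above is a one-liner (shown here on a short made-up vector):

```python
def top_topics(weights, n=3):
    """Indices of the n largest weights, biggest first."""
    return sorted(range(len(weights)), key=lambda i: -weights[i])[:n]

print(top_topics([0.1, 0.5, 0.05, 0.35]))  # [1, 3, 0]
```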


After receiving 42 comments, I ran their tagging algorithm on this page and got:

tags tagging hours link doppenhe reply ago lda

looks pretty promising!


LDA/topic modeling is interesting stuff. I always feel like the way this data gets surfaced as "tags" is very ineffective. Any non-tech person would look at this and generally be confused. So this item is triggering my rants against tagging:

- Tagging is like trying to predict the future: what word will help some future person get to this content?

- Tagging often tries to fill the hole left by bad search.

- There is no evaluation method to measure how good a set of tags is.

- Tags make very bad UI clutter.

Some of these points are related to encouraging users to tag content, but auto-tagging also seems problematic.

To me something more along the lines of entity extraction is more useful because it is a well defined problem, and can be used to improve a lot of other applications.


It seems like you would want to run k-means over the comments and the tags to pull out semantically meaningful words for tags, and then reduce the total number of tags over the corpus. Then, say, use Wikipedia to generate an automatic taxonomy where those extracted words occur.
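A bare-bones k-means sketch in plain Python (the points here stand in for embedded comments/tags, which you would first have to vectorize somehow, e.g. via topic proportions):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm over equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # assign each point to its nearest center (squared distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # move each center to the mean of its cluster (keep it if empty)
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl
                   else centers[j] for j, cl in enumerate(clusters)]
    return centers, clusters
```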


We do have k-means in our API as well; this wouldn't be hard to do.


Tagging is useful to summarize content. It's like saying: describe this article in 3 nouns. Lots of HN articles are cryptic, and if you can pull out good tags it can be helpful in prioritizing reading. It's even more helpful when there are 100s of comments and you want to know the key topics of discussion. The problem is that generated tags are often very poor quality.

To understand the utility of tagging, look at some article, read it, and then put down the 3 words that best describe the topics. I bet most others would find human-generated tags very useful. Machine-generated tags are usually nowhere close to what humans would generate.


I would like to see this kind of tagging used to improve search results while simply hiding the tags themselves. It could increase rank when there are more tag hits. Although I guess that is essentially what good search is.

Tags, while cluttering the UI, do help you find similar content. Still not as good as a good recommendation system, but a decent stopgap measure in some instances.


I like this project (I am creating something like this, so I'm pretty serious).

But doesn't the auto-tagging feature make too much noise for a business use case? For example, it tags an article about Amazon and includes Google in the tags. Whitelisting words wouldn't fix this (Google is a whitelisted word if Amazon is).

I don't know about LDA, though. Perhaps proper tag administration would fix this, but then you'd have to remove tags on the go.


Would love to chat more: diego at algorithmia dot com.


Direct link to HN with tags: http://algorithmia.com/demo/hn


I always wondered -- how are some sites able to get an up-to-date mirror of HN, when HN blocks usage of its API after a while?

Are they using some alternative API that was blessed by HN?


We just use their RSS.


And they don't block that after a while?

How does iHackerNews show all the comments and everything?

Where are these RSS feeds? I doubt that's how "the pros" do it.


There is also an api:

https://hn.algolia.com/api

I would think lots of apps are still scraping pages.



What good is a blocked RSS to anybody?


Has anyone seen Open Calais [1]? It does tagging and categorization. It's been around for years and seems pretty powerful. It's a bit lower-level than Algorithmia (not href aware), but it seems more powerful, and a system like Algorithmia could be built on it.

[1] http://www.opencalais.com/about


With German sites it does not work so well. Is there no blacklist of too-generic terms for languages other than English?


Doesn't seem to handle PDFs properly. For the MtGox link it comes up with

> stream rotate type/page font structparents endobj obj endstream


In the demo version, if we can't scrape the text from the HTML we can't really run the topic analysis against it. PDFs, images, etc. won't work.


PDF is easy to support if you use pdftk. You can even grab just the first page for large docs.

http://www.pdflabs.com/tools/pdftk-server/
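For example, pdftk can slice out just the first page; pairing it with pdftotext (a separate tool, from poppler-utils) to feed the tagger is my assumption about the pipeline, not the parent's exact workflow:

```shell
# Keep only page 1 of a large PDF, then extract its text for tagging
pdftk big.pdf cat 1 output first-page.pdf
pdftotext first-page.pdf first-page.txt
```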


Thanks for the tip.


I signed up. Not sure if I would use it, but the Algorithmia concept is pretty interesting.


@doppenhe - any hunch as to how well it would work on transcripts?


You can try it yourself at the bottom of the blog post, or you can send me a URL and I can try a bunch for you: diego at algorithmia dot com or @doppenhe.


error: failed to find worker for algorithm


HN took us down for a second. Back up and running. Thanks for reporting.


Maybe also take a look at AlchemyAPI


looks cool.



