High-reproducibility and high-accuracy method for automated topic classification (amaral-lab.org)
142 points by jestinjoy1 on Jan 31, 2015 | 36 comments



I had some sort of violent dopamine release just reading the headline.

I'm working on a project to make (EU-) law more accessible. So if anybody here knows good methods to visualise/summarise long legal texts (30-300 pages) you could do something for humanity by posting a reply.

(Word clouds just don't cut it in these cases.)


A classic summarization method is:

1. Split your text into sentences

2. Remove stopwords (keeping a copy of the original sentence)

3. For the remaining words, calculate a synthetic TF-IDF score for the sentence (TF-IDF each word, then sum the scores).

4. Keep the n-highest scoring sentences (in the order they appear), or all the sentences with synthetic TF-IDF scores above some threshold.

There's your summary.
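Here's a minimal sketch of that pipeline in Python. A toy stopword list and a naive regex sentence splitter stand in for real components, and since there's only one document, each sentence serves as a "document" for the IDF term:

    import math
    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that"}  # toy list

    def summarize(text, n=3):
        # 1. Split into sentences (naive regex split on end punctuation)
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        # 2. Tokenize and drop stopwords, keeping the original sentences around
        tokenized = [[w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
                     for s in sentences]
        # Sentence frequency of each word, used for the IDF-style term
        sf = Counter(w for toks in tokenized for w in set(toks))
        total = len(sentences)
        # 3. Synthetic score: TF-IDF each word in the sentence, then sum
        def score(toks):
            tf = Counter(toks)
            return sum(count * math.log(total / sf[w]) for w, count in tf.items())
        # 4. Keep the n highest-scoring sentences, in the order they appear
        top = sorted(sorted(range(total), key=lambda i: score(tokenized[i]), reverse=True)[:n])
        return " ".join(sentences[i] for i in top)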


I built this once! If anyone wants to play around with it, the website is http://bookshrink.com and the code is on GitHub (https://github.com/peterldowns/bookshrink)


How did it work? Any interesting tweaks you can share?

BTW, I love the highlight-in-the-original-text option.


    > How did it work? Any interesting tweaks you can share?
For being such a naïve approach, I think it works fantastically. It works better on some types of texts (newspaper articles, research papers) than on others (poems, fiction). It was my first real programming project, so the code has a certain sentimental meaning for me :)

The analysis code (https://github.com/peterldowns/bookshrink/blob/master/analys...) is very short and the comments include my thoughts on certain tweaks / approaches. Some quirks of the implementation: uses regex for sentence splitting (really!), doesn't perform any stemming, and weights proper nouns heavier than other words.
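For illustration, the proper-noun weighting could be as simple as something like this (a hypothetical sketch, not the actual bookshrink code):

    import re

    def weighted_score(word, tf_idf, proper_noun_boost=2.0):
        # Crude heuristic: treat capitalized words as proper nouns
        # and weight them more heavily than other words.
        if re.fullmatch(r"[A-Z][a-z]+", word):
            return tf_idf * proper_noun_boost
        return tf_idf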

    > BTW, I love the highlight-in-the-original-text option.
Thanks! The eventual goal was to build a tool that would make essay grading easier for teachers, although I never got around to it.


I find it works better with an additional step 2.5): lemmatize the remaining words, using, for example, Python's NLTK library.
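A minimal version of that step with NLTK's WordNet lemmatizer might look like this (note it defaults to treating words as nouns, so a part-of-speech tag helps for verbs, and the WordNet data must be downloaded first):

    from nltk.stem import WordNetLemmatizer  # may require nltk.download("wordnet")

    lemmatizer = WordNetLemmatizer()
    # Step 2.5: reduce each remaining word to its dictionary form
    print(lemmatizer.lemmatize("geese"))            # -> "goose"
    print(lemmatizer.lemmatize("walked", pos="v"))  # -> "walk"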


Oh yeah, good idea. That should generally improve the results.

There's definitely a bit of art to it that can improve the results. You may also have to do some pronoun substitution in the summarized sentences (and then decide whether to do that before or after calculating the synthetic score) so they make more sense.

I find it's hard to tell how well all this will work until you just do it. Not every kind of text works equally well and the only proof that it's really working is "does the summary make sense or not?"

It's also possible that for long, structured documents like laws or contracts you don't want to summarize the whole thing, but instead treat major sections as separate documents and summarize each on its own to maintain understandability.

Here's one that does something kinda like what I was writing about above.

http://textcompactor.com/

Note: I think the correct measure for single-document summarization is not TF-IDF but TF-ISF (inverse sentence frequency), though I might be wrong.
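Concretely, the only change for the single-document case is the denominator of the weighting term:

    TF-ISF(w, s) = tf(w, s) * log(N / sf(w))

where N is the number of sentences in the document and sf(w) is the number of sentences containing w, instead of document counts from a background corpus.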


When you lemmatize, what are you doing exactly? Are you merely reducing words to a more common form, thereby reducing IDF? For example, are you reducing "walking" or "walked" to "walk", and then using the IDF of walk?


Yes, exactly.

Related is the idea of "stemming", which uses an algorithm to try to strip inflection and find the common form that the various versions of a word come from. Porter's algorithm is a well-known stemming algorithm. However, sometimes you end up with weird "non-inflected" tokens at the end (e.g. 'enhancement' might become 'enhanc').

However, lemmatization is considered "better" in that it uses a dictionary of inflected forms that map back to the non-inflected form. So in theory, if the dictionary is comprehensive, you can properly replace inflected forms with their correct non-inflected forms (e.g. 'enhancement' -> 'enhance').

If your dictionary isn't comprehensive and comes across a token it doesn't recognize, you can try falling back onto a stemming algorithm.
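A rough sketch of both approaches with NLTK, including that fallback (the unchanged-output check is only a heuristic, since a word that's already a lemma also comes back unchanged):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("enhancement"))    # -> "enhanc" (a non-word stem)
    print(lemmatizer.lemmatize("geese"))  # -> "goose" (dictionary lookup)

    def normalize(word, pos="n"):
        # Prefer the lemmatizer; if it returns the word unchanged
        # (a rough sign it wasn't recognized), fall back to stemming.
        lemma = lemmatizer.lemmatize(word, pos=pos)
        return lemma if lemma != word else stemmer.stem(word)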


This might be helpful:

http://www.cs.ubc.ca/labs/imager/tr/2014/Overview/

It includes a number of the data mining techniques others have mentioned and implements some nice design guidelines from visual analytics.


Is there a link to your project? I've recently been thinking of trying something similar for UK law.


It's not online yet. But check out popvox.com, govtrack.us and http://openbylaws.org.za/ for some inspiration.


Awesome thanks! Look forward to seeing your project when it's online.


What is the user supposed to learn from the visualization? Is it high level understanding what the text is about? Is it some support to make the text easier to read and understand (like syntax highlighting is for programmers)? Is it something in between, something else?


Check out http://casetext.com. Their "ReCites" extraction is a pretty useful way of summarizing a legal case text.


Excellent! It seems to heavily rely on the community and moderation, something we're exploring as well. Kinda like (rap)genius.


Haven't looked at EU law in a year or two, but Europa and EUR-Lex used to provide useful and simple summaries of both legislation and Court of Justice case law.


Contact Kasian Franks (kasian.franks@gmail.com); he has a good summarization system I'm using.


I find it interesting that this appears to be written by a group of physicists rather than NLP or ML researchers, and I think you can kind of see that in the way they approach the problem. A bunch of the work done after LDA among ML and NLP people tended towards (a) using Hierarchical Dirichlet Process models as a platform from which to explore Bayesian nonparametrics more generally, (b) better inference algorithms for topic models, and (c) somewhat richer models (e.g. author topic models, syntax-aware topic models, etc.).

And it's not like the people in this field haven't been aware of network-oriented methods. But rather than using community detection as a mechanism for topic discovery, people focused either on networks among topics to see how topics are related, on networks among authors so that social network information informed topic discovery, or on networks among documents where link/reference information was explicitly part of the model.

These authors seem to get solid results in part by having totally different values/aesthetics. Unlike the Bayesian nonparametrics people, they clearly don't mind picking arbitrary, inflexible parameters (e.g. the 5% threshold), nor do they want their model to have a clean, generative form, nor are they particularly concerned with offering a new algorithmic insight (they hand the hard work to InfoMap and discuss none of its details), nor do they attempt to advance the expressiveness of their topic model (they proceed with the most basic bag-of-words model available). But it does seem like they get good results on the basic task with a very pragmatic, pipeline approach.


Two of the authors (Kording and Acuna) are definitely not physicists; you might describe much of their previous work as psychology with a strong math modeling background. Interesting that the pub is in a physics journal, though.


It was interesting to see a take on the problem from researchers outside the NLP and ML fields, but the authors only considered classic LDA and PLSA for comparison. I'm not currently involved in topic modeling, but I know there are techniques and modifications to the classic models that improve topic discovery (like TF-IDF weighting). Can you suggest any modern methods from the NLP and ML communities that address the same issues and can rival the authors' findings?


Modeling the word co-occurrence graph and then pruning "weak" edges (or achieving similar pruning by using community detection to find clusters) works kind of like "feature selection" based on something that resembles bare mutual information or TF-IDF.

I'm not entirely familiar with LDA, but from what I was able to understand from their intro, it feels like their LDA application could have used some feature selection.
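To make that concrete, here's a rough sketch of the idea: count co-occurrences in a sliding window, score edges with pointwise mutual information, and prune those below a (hypothetical) threshold. Words that keep at least one edge survive as "features":

    import math
    from collections import Counter

    def pruned_cooccurrence_graph(tokens, window=5, pmi_threshold=1.0):
        # Count unigrams and within-window co-occurrence pairs
        unigrams = Counter(tokens)
        pairs = Counter()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + window]:
                if v != w:
                    pairs[tuple(sorted((w, v)))] += 1
        total = len(tokens)
        # Keep only "strong" edges: those whose PMI clears the threshold
        edges = {}
        for (a, b), n_ab in pairs.items():
            pmi = math.log(n_ab * total / (unigrams[a] * unigrams[b]))
            if pmi >= pmi_threshold:
                edges[(a, b)] = pmi
        return edges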


You can see the source code of a previous iteration of the algorithm here: https://bitbucket.org/andrealanci/topicmapping/src


I'm confused by the discussion of multi-lingual corpora. Is it common in topic modeling to consider documents drawn from disjoint vocabularies, or is it just a kind of thought experiment?


Pretty common when you don't control the data source, or for multi-language government agencies (for example, in Canada you may have your court case in French if you desire).


I haven't dug into the details of the paper yet, but I want to commend the authors for 1.) making it possible to actually download the PDF and 2.) giving some indication, within the actual document, when the paper was published. I'm being a little bit snarky, but I'm very sincere in thanking them.


> making it possible to actually download the PDF

The journal they published in, Physical Review X, is a newer open-access journal from APS (along the same lines as PLOS ONE or Nature Scientific Reports). I think it's great, but not everyone agrees. To read more on the debate around the open-access phenomenon, look at http://blogs.berkeley.edu/2013/10/04/open-access-is-not-the-... and http://www.sciencemag.org/content/342/6154/60.full


It's usually not that the authors don't want to provide PDFs, but with scientific journals you're often asked to sign a form that transfers the copyright to the journal. In other words, you're not allowed to make the paper available even though you're the author.

In my experience, finding out when a paper was published is often not too difficult. In most cases, Google makes it easy to find the Journal issue and/or the conference proceedings of the paper. Or you find some third paper that contains a reference with date information that you can then use to double check.



Is there an open source implementation I can use?

What about that sentiment analysis NLP tool that someone posted on HN last year? That was also very good.


[deleted]


Whether this is "Strong AI" or not is a discussion that might make a good paper in a philosophy journal. Computer science alone probably can't tell you if this method can truly "understand" the text.

Science works by separating out the disciplines. Frankly, I think "defense against the Terminator scenario" could and should be a scientific field of its own at this point, on the level of solutions to global warming.

This is interpreted as a side effect for now. That is, until you tell the computer to do something based on the topics.


Strong AI usually refers to artificial general intelligence. It should be capable of pretty much solving new kinds of problems the way humans can. It's safe to conclude this is not it - no philosophy journal paper is needed.


The question of whether LDA as a method "understands" anything could, though.


I think it'd be a short paper. It's hard to imagine a probabilistic model that just measures correlations among tokens, without even any real locality sensitivity, capturing something that could be called understanding without stretching the meaning of the word to the breaking point.


The question of understanding is a philosophical one that doesn't affect outcomes.

If the output of AGI demonstrates intelligence in finding a solution, whether it had "understanding" or not doesn't matter. The only thing that matters is its power to turn inputs into outputs.

The Turing test crystallizes this WRT human language interaction. Assessment of the intelligence of a machine doesn't depend on understanding, consciousness or any of the other baggage dragged in.


> To correctly and properly classify texts you need to understand them.

Who needs to understand them? The algorithm? What does that even mean? And if you mean the authors need to understand how the algorithm works: wrong again. They probably do, but even if they didn't, their algorithm might still classify correctly.



