Show HN: TLDR This – Auto summarize any article or webpage in a click (hackeryogi.com)
235 points by radhakrsna on July 28, 2019 | 114 comments



Hi everyone

I am delighted to share my new project "TLDR This" with you guys.

Problem: There's so much content out there but too little time to read it. Often there's a long article we're interested in, but we don't feel like scouring through it to extract the relevant information.

Hence, I created "TLDR This" to help you navigate relevant content quickly and easily, without having to read the whole thing.

Steps to get cracking - 1. Copy and paste either the URL or the text of the article you'd like summarized. 2. Press the "Process Text" button, and you are good to go.

TLDR This also comes with a chrome extension, allowing you to summarize any webpage at the click of a button.

How to use the Chrome extension? Just click the “tl;dr” button in Chrome's toolbar on a webpage you'd like summarized, and within a few seconds you'll get a summary of fewer than 5 sentences right there.

Please let me know if you have any feedback/suggestions.

Thanks a lot


I fed it this article: https://johnnyrodgers.is/burnout

Each section subtitle is displayed as a sentence. Could you maybe use ML to generate a one-sentence summary for each section? I don't think there is an ML model that could extract the takeaways from each section and make a summary correctly, but that was my original expectation when I first saw it.


> I don't think there is an ML model that could extract the takeaways from each section

There are plenty of abstractive and extractive methods for text summarization. In fact, that's what I'm working on right now: developing a news article summarization system for longer articles.

If you're interested in learning more, there's an excellent survey paper on ArXiv: https://arxiv.org/abs/1812.02303


Is there any working demo to play with? I am not into ML and won't read the whole paper... I expect these sorts of mechanisms to weight the morphemes and sentences and reproduce the top-scoring ones, whereas a human-made summary is based on context and semantics supplied by humans (for example, the relevant part might only be relevant during the current zeitgeist, or to some people's biases).


I'm not aware of any working demos. In terms of the weighting, your intuition is correct: there's a way for a machine learning algorithm to automatically learn which tokens (what you call morphemes) are significant. That's called an attention mechanism. It learns which tokens are important and condenses that into a "context vector" of sorts, which is then passed into the generator network (some form of recurrent network).

The disadvantage of using only token-based attention is that it still doesn't provide enough context for large spans of text (> 1000 tokens), and I'm actually running into this issue now in my work. That's where things like applying attention over chunks of text help.

I'm still trying to fully understand attention mechanisms, so I hope this comment made sense.
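
To make that a bit more concrete, here's a minimal sketch of dot-product attention over token embeddings, reducing them to a single context vector (plain NumPy, just to illustrate the mechanism, not any particular model):

    import numpy as np

    def attention_context(token_embeddings, query):
        # score each token by how relevant it is to the query vector
        scores = token_embeddings @ query              # shape: (num_tokens,)
        # softmax turns raw scores into attention weights that sum to 1
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        # the context vector is the attention-weighted average of the embeddings
        return weights @ token_embeddings              # shape: (embedding_dim,)

    # toy example: 5 tokens with 4-dimensional embeddings
    tokens = np.random.randn(5, 4)
    query = np.random.randn(4)
    print(attention_context(tokens, query))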


But approaching such a problem with an absolute lack of abstraction over the underlying ideas obviously cannot work on any non-straightforward example, can it? There must be some other steps or algorithms or additions to this process; language is a protocol for encoding a message, a representation of abstract concepts and ideas referenced semantically. HTTP server configuration values are not the same thing as a statistical analysis of the XML text encoding them; and even that would be straightforward communication, not metaphorical speech or indirect references.

As a long-time language learner I can tell you that the most significant words in a text are the less common ones (that's why learning a language from lists of common words does not work out of the box). In the sort of process you mention, the words that carry more of the idea being communicated might easily get less weight than other, more common tokens in the same text, and that depends heavily on the communication style of the writer.

To me this way of making a summary, looking at the language as a sum of interconnected tokens, sounds like trying to replicate a painting by measuring brush-stroke directions and pigmentation (vectors of values) rather than trying to understand the abstract concept or idea behind it (a vase with flowers, fruit) and recreating it (using the skill extracted with the vectors to recreate the abstract idea that was encoded).

I think I might end up reading the paper you mentioned.


To address how computers establish context, usually the tokens are sent through an embedding layer (basically turning a word into a dense vector). Word embeddings, surprisingly, reflect the real-world semantic relationships behind words (for instance, king - man + woman = queen). It also happens that the algorithms for creating word embeddings group words with similar meanings/semantics together (for example, king and queen would be closer together in n-dimensional space). Overall, embeddings allow a computer to look at a language as more than a bunch of interconnected tokens and actually incorporate the semantics of each word into its predictions. Of course, this isn't perfect, which is why we need attention mechanisms and various other methods to help a computer try to understand context long-term.
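
If you want to see this for yourself, here's a small sketch with pretrained GloVe vectors via gensim (the model name is just one of gensim's downloadable options; any pretrained embedding set would do):

    import gensim.downloader as api

    # load a small pretrained GloVe model (downloaded on first use)
    vectors = api.load("glove-wiki-gigaword-50")

    # "king" - "man" + "woman" lands near "queen"
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

    # semantically related words sit closer together than unrelated ones
    print(vectors.similarity("king", "queen"))
    print(vectors.similarity("king", "banana"))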


This is a very good article. Thanks for sharing!


Some info about it would be great? Do you use an AI model? Do you do semantic analysis? Is it open source?

I tried it on a wikipedia article and it didn't work too well.


> TLDR This also comes with a chrome extension

Firefox uses the exact same extension format, and except in non-trivial cases, “porting” an extension to Firefox involves literally no code changes.

Consider publishing your extension for Firefox too. You just need a Mozilla-account to publish it on addons.mozilla.org. That’s pretty much it.


Hi,

That's a very good suggestion. I didn't know it was that easy to build an extension for Firefox. I will publish it on the Mozilla store as soon as possible.

Thank you very much for taking the time to use the app and providing your feedback.


Hi! We have added the link for Firefox extension. Please check it out and provide your feedback. Thank you!


Congrats! It worked well on a few sample articles that I tried. Can you provide background on how the text is summarized?


Thanks a lot. Sure, first we create a list of all the individual sentences in the article. Then, we use the TextRank algorithm to rank each of the text sentences. Finally, we select the top 5 most representative sentences from the article.
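
For anyone curious what that pipeline looks like in code, here's a rough sketch (not the author's actual implementation) using NLTK for sentence splitting, TF-IDF cosine similarity between sentences, and networkx's PageRank for the TextRank step:

    import nltk
    import numpy as np
    import networkx as nx
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    nltk.download("punkt", quiet=True)

    def summarize(text, n=5):
        # 1. split the article into individual sentences
        sentences = nltk.sent_tokenize(text)
        if len(sentences) <= n:
            return sentences
        # 2. build a sentence-similarity graph and run PageRank over it (TextRank)
        tfidf = TfidfVectorizer().fit_transform(sentences)
        sim = cosine_similarity(tfidf)
        np.fill_diagonal(sim, 0)
        scores = nx.pagerank(nx.from_numpy_array(sim))
        # 3. take the n highest-scoring sentences, in their original order
        top = sorted(sorted(scores, key=scores.get, reverse=True)[:n])
        return [sentences[i] for i in top]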


How do you create the list of sentences from the article? How do you know where the article starts and what is just navigation or fluff?


Firefox's Reader View is able to do that.

Check out their GitHub repo here - https://github.com/mozilla/readability

>You need at least one <p> tag around the text, you want to see in Reader View and at least 516 characters in 7 words inside the text.

Source - https://stackoverflow.com/questions/30661650/how-does-firefo...


Hi, we used a Python library called newspaper to do it.
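
For reference, the basic newspaper (newspaper3k) flow looks roughly like this:

    from newspaper import Article

    article = Article("https://example.com/some-article")
    article.download()        # fetch the HTML
    article.parse()           # strip navigation/ads/fluff, keep the body text
    print(article.title)
    print(article.text[:500]) # the extracted article body, ready for sentence splitting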


How does this compare to newspaper's built-in summary feature? What is different?


Unless you make this painfully transparent, a rational person would have to assume some intent of bias. No offense intended - I'm stoked you made this, as it's one of those tools I've always intended to make myself. But honestly, given the revelations of the last few years, unless we can verify as a community that there isn't bias, we have to assume there is.


Sorry,

I've got some experience in this field:

- This is text extraction, NOT text generation - the TextRank algorithm is fine as far as it goes, but it does not write a "summary"; instead it ranks the components of a text according to some "metrics" (simply put)

- Using this approach will still leave you open to copyright claims from copyright owners

- Which stuff gets summarized ("put into the final output") is not always clear to me in your implementation; I tried it on some newspaper & blog articles; on some it worked well, on others it didn't.

Funny thing is, I'm currently working on something similar with a slightly different twist - I will post it here when it's finished, then we can go into a battle :-)


Hi,

- Yes, you are totally right. TextRank is an extractive method.

- Ah right. But we aren't storing any information on our servers, just showing selected sentences to the user.

- We select the top 5 sentences that have the highest relevance to the article. I am not an expert in this field, so I'm not too sure if that's the best way. I just started with NLP a few days back and wanted to test it out by developing a small application.

Yes, it works on quite a few articles, but there are also some articles where it fails to give accurate results.

Ah nice, I would like to hear more about what you are working on. Let me know if I could contribute to it in some way.

Thank you again for your feedback.


> But we aren't storing any information on our servers, just showing selected sentences to the user.

If this is in reference to the copyright comment, it doesn't matter -- you're still transmitting/redistributing the content, which is what matters. One way to get around this is to ship the code and have the code execute on the user's machine (i.e. what you're presumably doing with the extension).


Ah right, thank you for the detailed explanation. Currently, we are processing the text using a Python backend. In order to process it on the user's side, I guess we'll have to use JavaScript. I will try to fix that in the next version. Thank you very much.


You don't need to move anything to the client side, what you're doing is covered under fair use doctrine.


Maybe. Almost nothing is straightforward about fair use.


Agreed, but this is about as close as you can get to safe enough


Being that it's the internet, you should think beyond whatever country's law you are referring to. For example, Spain blocked Google News because it aggregated the news as-is with little to no transformation.

Plus moving it to the client side would free up whatever resources they are currently using to feed summary info to us.


Yup, switching to JS would fix this issue.


That's such a ridiculous consequence of our field/times.

Edit: How about the backend just returns pointers into the text (word #x through word #y) and the JS just (re)assembles it?
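
For illustration, the server-side half of that could be as simple as returning character offsets instead of the sentences themselves (purely a sketch of the idea, not how TLDR This works today):

    # server side: return (start, end) character offsets instead of the text itself
    def sentence_offsets(full_text, selected_sentences):
        offsets = []
        for s in selected_sentences:
            start = full_text.find(s)
            if start != -1:
                offsets.append((start, start + len(s)))
        return offsets

    # the client, which already has full_text, rebuilds the summary locally:
    #   summary = [full_text[a:b] for a, b in offsets]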


If I understand correctly, that would still require information to be transmitted to the server, ergo copyright infringement.


Which information? The server is then not reproducing/transmitting/redistributing the content, only indices into the content. I don't see why this would be copyright infringement.


Hacker news is not a good place to get legal advice. Best to ignore anyone offering it. Talk to a lawyer for legal advice.


Sure, it's not, so do not take it as advice, especially not ultimate advice. However, I find the inevitable intellectual shutdown when discussing matters like this even more repugnant and unwarranted.


This comment is so poignant it should be part of the site guidelines.


Right, thanks a lot.


    you're still transmitting/redistributing the content

Parts of it. Google does the same in their search results. The user can even decide which parts, because they show you the part that contains the search term.

So they provide a service that includes storing your content in its entirety.

Has this ever been tested in court?


Google is doing something different in regular search results.

They are showing a small extract for context OR a summary specified by the publisher.

That’s completely legit and fair use.


Yes, it has been tested in Spanish courts, and Google News is blocked there.


Are you a lawyer?

Because there is plenty of precedent for this in available APIs and I've never heard of a case claiming this.


Well, my copyright comment was targeted at a distinct case (like "redistributing the summary" on another website or in a book).

Though, just by copying & summarizing with your current implementation, there would be NO ONE to sue you, since you are just grabbing it and displaying it in the browser (sure, depending on the jurisdiction, one may rate this simple step already as some type of copyright issue)

In reality, this will not happen. (Except in North Korea ;-)

My comment regarding copyright was really about grabbing, summarizing and re-distributing it on another webpage, like a news aggregator.


> If this is in reference to the copyright comment, it doesn't matter -- you're still transmitting/redistributing the content, which is what matters.

I'm pretty sure this would be covered by fair use.


I think Google was only retransmitting lyrics, and they are getting sued now. Can’t imagine Google was actually storing the lyrics, although I may be wrong [1]. If someone could clarify this I would really appreciate it, as it has implications for a project I’m currently working on.

[1] https://www.theverge.com/platform/amp/2019/6/16/18681225/gen...


That's not a snippet, though, it's wholesale copying.

When you Google a newspaper article you get a verbatim snippet, same concept.


If this were true then Evernote's web clipper would also be infringing copyright... (it is transmitting and redistributing the content)


I'm interested in learning about true generation algorithms. Can you point me in the right direction?


Google “gpt-2”.


Thanks, but that is not what the OP is claiming. That generates text from a seed; the OP is talking about an algorithm that generates a summary of an article, but without using existing sentences.


Did you read the paper? https://d4mucfpksywv.cloudfront.net/better-language-models/l...

You don’t need any seed, and can generate summaries (section 3.6).

GPT-2 is the model to learn about if you’re interested in NLP.
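
If you want to try the zero-shot trick from section 3.6, it amounts to appending "TL;DR:" to the article and letting the model continue. A rough sketch, assuming the Hugging Face transformers library (model size and sampling settings here are only illustrative):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    article = "..."  # article text, truncated to fit GPT-2's 1024-token context
    prompt = article + "\nTL;DR:"

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=900)
    outputs = model.generate(
        **inputs,
        max_new_tokens=60,   # length budget for the generated summary
        do_sample=True,      # the paper samples rather than greedy-decoding
        top_k=2,             # top-k sampling with k=2, as in section 3.6
    )
    summary = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])
    print(summary)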


Are there any papers benchmarking a transformer NN architecture in comparison to something like a pointer-generator network? I'm doing a bit of work in this area (i.e. reimplementing papers), and I'm curious if GPT2-like models can derive greater semantic meaning.


Both GPT-2 and pointer-generator network are open source, and pretrained models are available, so it should be straightforward to compare them.


I tried some of my articles and it's not really helpful. I even have a conclusion in my articles which is completely ignored.

For example: https://www.karoly.io/amazon-lightsail-review-2018/


Thank you for your feedback.

I am sorry to hear it didn't work on the article you tried.

I have personally tried it on a number of articles and it seems to provide good results. Also, I have received feedback from a number of users who say that it worked for them. But yes there are some articles where it fails to give accurate results since the "article summarisation" technology is still in its development stage.

We will try to improve it in the next version.

Thank you again for your feedback.


Can you provide any working examples?


Shameless self-plug, but for anyone interested in the SOTA of extractive text summarization, I wrote a tool for this that allows folks to use any type of word embeddings to run a modified, queryable TextRank. This leads to SOTA text extraction.

https://github.com/Hellisotherpeople/CX_DB8

CX_DB8 also supports word and sentence level extraction. I plan on adding paragraph extraction in a future update.


I like the name. This sounds similar to (or in the same space as) Summly, which was sold to Yahoo for millions by a British teenager a few years back and morphed into News Digest, a former Yahoo service.

Good luck!

https://en.m.wikipedia.org/wiki/Nick_D%27Aloisio#Summly

https://medium.com/amp/p/686a32ac1af8


Hi,

Thank you very much for taking the time to use the app and providing your feedback.


Mankind needs this. Cutting off the BS factory: no more copywriting, flashy headlines with unrelated news bodies, copy-rewriting, or attention wasters, and... the yet-to-appear ML-generated article time wasters that are mash-ups of others.

If your idea had a company behind it, I would invest in it. I wish it were a product that generated a decent enough executive summary, so I could decide whether to invest more time in reading the whole article or just move on and filter out all the BS floating around the internet.

HOWEVER, the current state is far from those expectations; try inputting the "burnout" entry from the HN front page (or any other article) and look at the output. I am afraid the concept will stay science fiction for a very long time, since the amount of intelligence required to write executive summaries is sometimes not even met by fresh university graduates.


With the way content on the web is going down the shitter, I can't wait to have a program that I run on a remote server, one that puppets a headless browser, executes all the usual JavaScript cancer, and reinterprets the final pages into a sane, calm, summarised document for local rendering.


Romeo and Juliet out of Project Gutenberg: all sentences were extracted from Project Gutenberg's own introduction.

http://www.gutenberg.org/cache/epub/1777/pg1777.html > https://i.imgur.com/pRX5eeP.png

so I looked for something that was just the play http://shakespeare.mit.edu/romeo_juliet/full.html > https://i.imgur.com/f2bM8yW.png

and there are only a handful of character entrances.

We seem to be quite a way off from summarizing 'anything'.


How do you test whether it accurately removes irrelevant information? If you insert an irrelevant piece of text such as "yesterday it rained" into the text, will the summarizer always delete it?
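
One way to turn that into an automated check (summarize here stands for whichever summarizer is under test, e.g. the TextRank sketch elsewhere in this thread):

    def check_irrelevant_sentence_dropped(summarize, article_text):
        # inject an off-topic sentence and verify it never survives into the summary
        noise = "Yesterday it rained."
        summary = summarize(article_text + " " + noise)
        assert noise not in " ".join(summary), "summarizer kept the irrelevant sentence"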


When I use it on our landing page it simply extracts one sentence from the page:

Use state-of-the-art pseudonymization and anonymization methods to secure your data in real-time.

I think it's a good summary of what we do, but I wouldn't call this summarization, as I'd think that implies synthesizing new content from the existing content; it's more of a highlighting service, I'd say. For other pages it doesn't work so well, so maybe I just got lucky.


Hi,

It is mainly designed for articles so it wouldn't give any useful info when used on landing pages etc.

There are two kinds of article summarization. We use the extractive approach, wherein the top n most relevant sentences from the article are selected. If the document is too short, it won't select many sentences.

Thank you very much for taking the time to use the app and provide your feedback.


https://en.m.wikipedia.org/wiki/Emma_Goldman

I ran it on this article to test it out; it only featured her personal life and none of her accomplishments. Love the idea.


Good luck! I tried to launch something like this years ago [0], specifically to summarize HN. It's not easy, but I think there's value in this type of service.

In my case, the summarization was manual - I wanted to see if people liked the service, before building the summarization engine.

I also created a Chrome extension to be able to summarize with just one click from HN itself.

[0]: https://github.com/simonebrunozzi/MNMN


This would be a great type of service for news publishers. Digital producers often have to type out summaries (bullets) in stories for those who don't bother reading full articles. If it could be automated by sending the story copy via an API and getting back summary bullets, it would certainly be valuable.


Yes, that sounds like a valuable application of article summarization but it would be hard to generate high-quality summaries that are very readable.


Wow, thanks for sharing. Yes, it's certainly a very valuable service but it is very hard to get it to work on every kind of article. I am just getting into NLP and would definitely like to explore other approaches to solving this problem.


Well, I tried it with the following article on HN and it doesn’t work: https://www.nbcnews.com/news/us-news/alaska-defunds-scholars...


I had a similar thing on my website for ages (probably 10 years or so) based on the Open Text Summarizer tool. It works pretty well and is completely open source: https://www.splitbrain.org/services/ots


Ah right, thank you for letting us know.


Is there a premium version where I can have text summarized by actual living, breathing, human beings?


This would be a useful service. My first thought was “use something like Mechanical Turk”.

That led me to what I believe is the harder problem: in order to provide a useful text summary of articles on any topic, you need to have people who understand the article content.

To make a summary service which is useful, we would need to have “summarizers” who can not only understand a myriad of topics but also possess a deeper understanding of the article domain so the summary is a faithful “distillation” of the original. As we all know, this is not easy, unless you have some expertise in that particular field.

This sort of reminds me of on-demand translation services. In a past life I needed to use translators for technical and legal documents. There were services that offered translators who specialized in certain areas, such as law or technical software writing, for example.


Hi, Currently, we don't provide that but that's a very good suggestion.


It is the only way I would use such a service. Text extraction is a neat trick but by no means useful for serious comprehension. Even if the service wasn’t in real time, having a place I could dump a bunch of articles into and get back a quick summary of what they’re about at some later time would be useful. Sometimes I read the comments for Hacker News articles to look for a tl;dr and never bother even opening the article.


I think it is a good summarizer; I also saw a few others cited in the comments. Has anyone seen one that is a browser add-in, where if I am surfing around and see a long article but don't have time to read it, I can click a button without copying and pasting the link?


Hi,

We are glad that you liked it.

We also have a Chrome extension which does exactly what you say - https://chrome.google.com/webstore/detail/tldr-this-free-aut...

Thank you very much for taking the time to use the app and provide your feedback.


Oh cool. I will try it for sure, thanks !


I tried this on my website where gallery images are interspersed with the text.

Unfortunately, this selected almost only the gallery credits lines (By (name) with (person one, person two))

How can I tell automated tools: this HTML is not important for extraction?


Any open-source libraries that do this, or anything similar to this?


I wrote a tool for this called CX_DB8. It's better than any other "TextRank" based extractive summarizer in that it utilizes the latest in word embeddings, and supports queryability. Furthermore, it is the only word-level extractive summarizer to support all of this. Mostly aimed at the competitive debate community but anyone can use it.

https://github.com/Hellisotherpeople/CX_DB8

Hopefully you'll see my paper about this in EMNLP 2019 System Demos...


Ah sounds good. I might give it a try later on. Thanks for sharing.


I actually commented[1] something similar to this. I may end up eventually writing one, though I have no idea what is involved. What are your use cases? Are you aware of anything even close? I could definitely use jumping-off points for implementation, as I've got no idea how to write something like this.

[1]: https://news.ycombinator.com/item?id=20547875


This is very similar to https://smmry.com/

Enjoy


NLTK (Python) will do it. I'm working on a similar project and the text summaries are great, even for non-English articles.
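
NLTK doesn't ship a one-call summarizer, but a basic frequency-based extractive pass on top of its tokenizers and stopword list is only a few lines; a sketch (not the parent's actual project):

    from collections import Counter
    import nltk
    from nltk.corpus import stopwords

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    def freq_summary(text, n=5, lang="english"):
        # score each word by raw frequency, ignoring stopwords and punctuation
        words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
        freq = Counter(w for w in words if w not in stopwords.words(lang))
        # score each sentence by the summed frequency of its words
        sentences = nltk.sent_tokenize(text)
        ranked = sorted(
            range(len(sentences)),
            key=lambda i: sum(freq[w.lower()] for w in nltk.word_tokenize(sentences[i])),
            reverse=True,
        )
        # return the top-n sentences in their original order
        return [sentences[i] for i in sorted(ranked[:n])]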


Something like this is built into macOS.


But is it FOSS?


Tried to use this service on the HN comments discussing it. It could not generate a summary of the HN comments, so now I need to read people's opinions the hard way.


Hi, It is mainly designed for articles so it wouldn't give any useful info when used on a page like HN. Thank you very much for taking the time to use the app and provide your feedback.


I tried it on the HN guidelines and then on the output. So the summary of the summary is:

>What to Submit On-Topic: Anything that good hackers would find interesting.

which isn't bad really.


We are glad that you liked it.

Yes, if the webpage doesn't have that much content then it would return the most relevant line.

Thank you very much for taking the time to use the app and provide your feedback.


How is it different from this? https://ilazytoread.herokuapp.com/


I love it!

One important feature to help improve accuracy would be to display two big buttons at the bottom, a thumbs up and a thumbs down, meaning "this TL;DR is reflective of the actual content" or "this TL;DR is not reflective of the actual content".


We are glad that you liked it.

That's a very good suggestion. We will try to incorporate it in our next update.

Thank you very much for taking the time to use the app and provide your feedback.


I found this really useful for my articles. The short snippets would work well as tweets; a tweet bot for article summaries would be cool!


We are glad that you liked it.

That's a very good suggestion. By tweet bot, you mean when someone tweets an article link to the bot, it would respond back with the summary of that article, right?

Thank you very much for taking the time to use the app and provide your feedback.


Yeah, or I could comment on a tweet linking to an article with "@TLDR" and your bot would comment the summary.


Is there a way to loop in user feedback on summarization to improve the algo? How could you incentivize users in the process?


I guess we would have to use Machine Learning to incorporate that.


On this note - anyone familiar with a similar implementation, preferably one like Outline.com that copies the whole content, but in a local, non-web-based format?

I'm writing a Rust library that, for archival purposes, needs to refer immutably to source content. So in short, I need to download it and store it immutably. Yet I don't want to grab all the HTML, UI images, ads, etc. - I just want the content. I've found Outline.com amazing, but the tool I'm writing is "distributed", so I don't want to depend on a service.

Anyone familiar with local tooling for these types of services? TLDR, Outline, etc?


There is a Python library called Newspaper that is designed to do that. I believe this is what outline.com uses.

There is also a JS library called readability which is what is used by Firefox's reader mode.

https://newspaper.readthedocs.io/en/latest/

https://github.com/mozilla/readability


Pray, Mr. Radhakrsna, if I put into the machine an article which contains errors, will a corrected summary come out?


I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

(Charles Babbage)


How is this different than https://smmry.com/ ?

The autotldr bot has been active for a number of years on reddit, summarizing articles quite well.


Hi,

The goal is the same.

TLDR This uses a different algorithm. It is a side project I developed to test out NLP.

Although there are some web apps out there that allow you to summarize an article, there really isn't any useful Chrome extension that does the job well. TLDR This comes with an accompanying Chrome extension as well.

Thank you again for your feedback.


Does anyone have links to papers on text summarization that you would recommend for research?


Just tried on one of my articles. Good job! It works like a charm!

Any high level info on how it works?


OP replied in a thread above.

> Thanks a lot. Sure, first we create a list of all the individual sentences in the article. Then, we use the TextRank algorithm to rank each of the text sentences. Finally, we select the top 5 most representative sentences from the article.


Summarization used to be a built-in feature of Microsoft Word, but it was taken out.


I would like a bookmarklet for iOS Safari. Nice work!


We are glad that you liked it.

I will try to port it to Safari.

Thank you very much for taking the time to use the app and provide your feedback.


How does this work? Is it Free Software?


Hi, we use a version of the TextRank algorithm to rank each of the sentences in the article and then select the top 5 most representative ones. Yes, it is free, and we plan to make it open source soon as well. Thank you very much for taking the time to use the app and provide your feedback.


It would be great if you could offer a parameter in your URL so I could make a GET request like https://tldr.hackeryogi.com/?url=somearticleurl.com - I would embed that in https://uptopnews.com/


Hi, Sure, I will create an API endpoint that will allow you to do that as soon as possible. Thanks
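
Once that exists, calling it from the embedding side could look something like this (hypothetical - the url parameter isn't live yet):

    import requests

    # hypothetical query parameter, as suggested above; not available yet
    resp = requests.get(
        "https://tldr.hackeryogi.com/",
        params={"url": "https://somearticleurl.com"},
        timeout=10,
    )
    print(resp.status_code)
    print(resp.text[:500])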



