- This is text-extraction, NOT text-generation
- The TextRank algorithm is fine as far as it goes, but it does not write a "summary"; instead it ranks the components of a text according to some "metrics" (simply put)
- Using this approach will still leave you open to copyright claims from copyright owners
- Which parts end up summarized ("put into the final output") is not always clear to me in your implementation. I tried it on some newspaper & blog articles; on some it worked well, on others it didn't.
Funny thing is, I'm currently working on something similar with a slightly different twist - I will post it here once it's finished, then we can go into a battle :-)
- Yes, you are totally right. TextRank is an extractive method.
- Ah right. But we aren't storing any information on our servers, just showing selected sentences to the user.
- We select the top 5 sentences which have the highest relevancy to the article. I am not an expert in this field so not too sure if that's the best way. Just started with NLP a few days back and wanted to test it out by developing a small application.
Yes, it works on quite a few articles, but there are also some articles where it fails to give accurate results.
Ah nice, I would like to hear more about what you are working on. Let me know if I could contribute to it in some way.
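For readers following along, here is a minimal sketch of the "rank the sentences, keep the top 5" idea discussed above. It is not the project's actual code: the naive sentence splitter, the word-overlap similarity, and all function names are assumptions made for illustration, and the scoring is a simplified TextRank-style power iteration.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    # Shared-word count normalised by the log of each sentence's vocabulary size.
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) == 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    # Weighted PageRank over the sentence-similarity graph.
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def top_sentence_indices(text, k=5):
    # Returns the sentences and the indices of the k best ones, in document order.
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sentences, sorted(best)
```

Note that nothing here generates new text: the output is always a subset of the input sentences, which is why results depend so heavily on how well the article splits into self-contained sentences.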
> But we aren't storing any information on our servers, just showing selected sentences to the user.
If this is in reference to the copyright comment, it doesn't matter -- you're still transmitting/redistributing the content, which is what matters. One way to get around this is to ship the code and have the code execute on the user's machine (i.e. what you're presumably doing with the extension).
Ah right, thank you for the detailed explanation.
Currently, we are processing the text using a Python backend. In order to process it on the user's side, I guess we'll have to use JavaScript.
I will try to fix that in the next version.
Thank you very much.
Since this is the internet, you should think beyond whichever country's law you are referring to. For example, Google News was shut down in Spain over aggregating the news as-is with little to no transformation.
Plus moving it to the client side would free up whatever resources they are currently using to feed summary info to us.
Which information? The server is then not reproducing/transmitting/redistributing the content, only indices into the content. I don't see why this would be copyright infringement.
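To make the "only indices" idea concrete, here is a rough sketch under some loud assumptions: both sides run the same sentence splitter, the response is a small JSON object, and the ranking is a placeholder (sentence length) standing in for whatever the real backend computes. The function names and the `indices` field are hypothetical.

```python
import json
import re


def split_sentences(text):
    # Assumption: server and client use this exact splitter, so indices line up.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def server_response(article_text, k=5):
    # Placeholder ranking (hypothetical): prefer longer sentences.
    # A real backend would plug in its TextRank scores here.
    sentences = split_sentences(article_text)
    ranked = sorted(
        range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True
    )
    # Only sentence positions leave the server, never the sentence text.
    return json.dumps({"indices": sorted(ranked[:k])})


def client_render(page_text, response_body):
    # The client already has the page; it looks the sentences up locally.
    sentences = split_sentences(page_text)
    indices = json.loads(response_body)["indices"]
    return [sentences[i] for i in indices if 0 <= i < len(sentences)]
```

The response body carries no article text at all, so what the user sees is assembled entirely from the copy of the page their own browser fetched. Whether that changes the legal picture is exactly what is being argued in this thread; the sketch only shows the mechanics.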
Sure, it's not, so do not take it as advice, especially not ultimate advice. However, I find the inevitable intellectual shutdown when discussing matters like this even more repugnant and unwarranted.
> you're still transmitting/redistributing the content
Parts of it. Google does the same in their search results. The user can even decide which parts, because they show you the part that contains the search term.
So they provide a service that includes storing your content in its entirety.
Well, my copyright comment was targeted at a distinct case (like "redistributing the summary" on another website or in a book).
Though, just by copying & summarizing with your current implementation, there would be NO ONE to sue you, since you are just grabbing it and displaying it in the browser (sure, depending on the jurisdiction, one may rate this simple step already as some type of copyright issue)
In reality, this will not happen. (Except in North Korea ;-)
My comment regarding copyright was really about grabbing, summarizing and re-distributing it on another webpage, like a news aggregator.
I think Google was only retransmitting lyrics, and they are getting sued now. I can't imagine Google was actually storing the lyrics, although I may be wrong [1]. If someone could clarify this I would really appreciate it, as it has implications for a project I'm currently working on.
Thanks but that is not what the OP is claiming. That generates text from a seed, the OP is talking about an article that generates a summary of an article, but without using existing sentences.
Are there any papers benchmarking a transformer NN architecture against something like a pointer-generator network? I'm doing a bit of work in this area (i.e. reimplementing papers), and I'm curious whether GPT-2-like models can derive greater semantic meaning.