
Sorry,

I have some experience in this field:

- This is text extraction, NOT text generation. The TextRank algorithm is fine as far as it goes, but it does not write a "summary"; instead it ranks the components of a text according to some "metrics" (simply put).

- Using this approach will still leave you open to copyright claims from copyright owners.

- It is not always clear to me which content ends up in the final output in your implementation. I tried it on some newspaper and blog articles; on some it worked well, on others it didn't.
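To make the "ranks, does not write" point concrete, here is a minimal TextRank-style sketch. It is not the implementation under discussion: the function names, the naive sentence splitter, and the word-overlap similarity (computed on unique words, a simplification) are all assumptions chosen for illustration.

```python
import math
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    # Shared unique words, normalised by the log of each sentence's vocabulary size.
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    # Build a weighted graph: sentences are nodes, similarities are edge weights.
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    # PageRank-style power iteration over the sentence graph.
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping
            * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

The output is only a score per existing sentence; no new text is produced at any point, which is why the result is an extract rather than a written summary.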

Funny thing is, I'm currently working on something similar with a slightly different twist. I will post it here when it's finished, then we can do battle :-)




Hi,

- Yes, you are totally right. TextRank is an extractive method.

- Ah right. But we aren't storing any information on our servers, just showing selected sentences to the user.

- We select the top 5 sentences which have the highest relevancy to the article. I am not an expert in this field so not too sure if that's the best way. Just started with NLP a few days back and wanted to test it out by developing a small application.
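For readers following along, the "top 5" step can be sketched roughly like this. It assumes each sentence already has a relevancy score; `top_k_in_order` is a hypothetical helper name, not the function used in the application.

```python
def top_k_in_order(sentences, scores, k=5):
    # Indices of the k highest-scoring sentences.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Re-sort the chosen indices so the extract reads in document order.
    return [sentences[i] for i in sorted(ranked)]
```

Returning the picks in document order rather than score order is a design choice that tends to make the extract easier to read.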

Yes, it works on quite a few articles, but there are also some articles where it fails to give accurate results.

Ah nice, I would like to hear more about what you are working on. Let me know if I could contribute to it in some way.

Thank you again for your feedback.


> But we aren't storing any information on our servers, just showing selected sentences to the user.

If this is in reference to the copyright comment, it doesn't matter -- you're still transmitting/redistributing the content, which is what matters. One way to get around this is to ship the code and have the code execute on the user's machine (i.e. what you're presumably doing with the extension).


Ah right, thank you for the detailed explanation. Currently, we are processing the text using a Python backend. In order to process it on the user's side, I guess we'll have to use JavaScript. I will try to fix that in the next version. Thank you very much.


You don't need to move anything to the client side, what you're doing is covered under fair use doctrine.


Maybe. Almost nothing is straightforward about fair use.


Agreed, but this is about as close as you can get to safe enough


Since this is the internet, you should think beyond whichever country's law you are referring to. For example, Spain blocked Google News for aggregating the news as-is with little to no transformation.

Plus moving it to the client side would free up whatever resources they are currently using to feed summary info to us.


Yup, switching to JS would fix this issue.


That's such a ridiculous consequence of our field/times.

Edit: How about the backend just returns pointers into the text (word #x to word #y) and the JS just (re)assembles it?
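A rough sketch of that idea, under stated assumptions: both sides split the text on whitespace in exactly the same way, and the function names and the `keep_sentence` callback are hypothetical. Python is used for both halves purely for illustration; in the extension the second function would be the JS part.

```python
def server_select_spans(text, keep_sentence):
    # Backend side: return (start, end) word-index pairs, end exclusive.
    # Only indices leave this function, never the article text itself.
    words = text.split()
    spans, start = [], 0
    for i, word in enumerate(words):
        if word.endswith((".", "!", "?")) or i == len(words) - 1:
            if keep_sentence(" ".join(words[start:i + 1])):
                spans.append((start, i + 1))
            start = i + 1
    return spans


def client_reassemble(text, spans):
    # Client side: rebuild the selected sentences from its own copy of the text.
    words = text.split()
    return [" ".join(words[a:b]) for a, b in spans]
```

If the two sides tokenize differently, the indices will point at the wrong words, so the splitting rule has to be part of the contract between backend and client.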


If I understand correctly, that would still require information to be transmitted to the server, ergo copyright infringement.


Which information? The server is then not reproducing/transmitting/redistributing the content, only indices into the content. I don't see why this would be copyright infringement.


Hacker News is not a good place to get legal advice. Best to ignore anyone offering it. Talk to a lawyer for legal advice.


Sure, it's not, so do not take it as advice, especially not ultimate advice. However, I find the inevitable intellectual shutdown when discussing matters like this even more repugnant and unwarranted.


This comment is so poignant it should be part of the site guidelines.


Right, thanks a lot.


> you're still transmitting/redistributing the content

Parts of it. Google does the same in their search results. The user can even decide which parts, because they show you the part that contains the search term.

So they provide a service that includes storing your content in its entirety.

Has this ever been tested in court?


Google is doing something different in regular search results.

They are showing a small extract for context OR a summary specified by the publisher.

That’s completely legit and fair use.


Yes, it has been tested in Spanish courts, and Google News is blocked there.


Are you a lawyer?

Because there is plenty of precedent for this in available APIs and I've never heard of a case claiming this.


Well, my copyright comment was targeted at a specific case (like "redistributing the summary" on another website or in a book).

Though, just by copying and summarizing with your current implementation, there would be NO ONE to sue you, since you are just grabbing the text and displaying it in the browser (sure, depending on the jurisdiction, some may already consider this simple step a copyright issue).

In reality, this will not happen. (Except in North Korea ;-)

My comment regarding copyright was really about grabbing, summarizing and re-distributing it on another webpage, like a news aggregator.


> If this is in reference to the copyright comment, it doesn't matter -- you're still transmitting/redistributing the content, which is what matters.

I'm pretty sure this would be covered by fair use.


I think Google was only retransmitting lyrics, and they are getting sued now. I can’t imagine Google was actually storing the lyrics, although I may be wrong [1]. If someone could clarify this I would really appreciate it, as it has implications for a project I’m currently working on.

[1] https://www.theverge.com/platform/amp/2019/6/16/18681225/gen...


That's not a snippet, though, it's wholesale copying.

When you Google a newspaper article you get a verbatim snippet, same concept.


If this were true, then Evernote's web clipper would also be infringing copyright... (it is transmitting and redistributing the content)


I'm interested in learning about true generation algorithms. Can you point me in the right direction?


Google “gpt-2”.


Thanks, but that is not what the OP is describing. That generates text from a seed; the OP is talking about an algorithm that generates a summary of an article without reusing existing sentences.


Did you read the paper? https://d4mucfpksywv.cloudfront.net/better-language-models/l...

You don’t need any seed, and can generate summaries (section 3.6).

GPT-2 is the model to learn about if you’re interested in NLP.


Are there any papers benchmarking a transformer NN architecture against something like a pointer-generator network? I'm doing a bit of work in this area (i.e. reimplementing papers), and I'm curious whether GPT-2-like models can derive greater semantic meaning.


Both GPT-2 and the pointer-generator network are open source, and pretrained models are available, so it should be straightforward to compare them.



