
OP here. I had a lot of difficulty finding info on Books1 and Books2, even on HN. If there's a better source of info, please link or post.

What's the value of these scant few thousand unpublished romance and fantasy novels in the context of the rest of the corpus -- vast scrapings, all of Wikipedia, etc.? A sample of how people write? Why aren't more public domain works included?




> What's the value of these

The Pile (the 800GB dataset by EleutherAI) contains BookCorpus2, along with two much larger datasets of books (and a whole lot of not-book stuff). From their paper [0], the reasoning for the book datasets is that they "are invaluable for long-range context modeling research and coherent storytelling". The reasoning for including BookCorpus2 next to Books3 and Project Gutenberg boils down to "no significant overlap with the other datasets, and others use it".

In general, books are a great source of extremely high quality long-form content. They are longer than most content found on the web, and are generally of high quality, having gone through many revision rounds between editor and author. But neither of these is really true of BookCorpus. Even a dump of highly rated stories from fanfiction.org might be better.

0: https://arxiv.org/pdf/2101.00027.pdf


Are these primarily fiction books, or a mix of fiction and non-fiction?


The books in the books3 collection aren't categorized. The source, however, is currently at a ratio of 2:1 nonfiction to fiction, and from what I've seen, whoever created the books3 archive simply attempted to gather all the EPUBs they could, with their only criterion being availability.


[flagged]


The bots are not capable of reflection; they do not know and cannot check what is in their training data.


Why is that list ^ not interesting? Illustrative?


It's not useful. You can ask ChatGPT again in a new session, and you'll get a different list. You can then ask it about them and find out it's making it up. For example, "Wuthering Heights" is on your list; when I ask it, "Pride and Prejudice" is on it. In another session, I can ask it for opening lines, character lists, etc. from those works and then ask for a crossover story where the characters from each meet each other. The model isn't likely to regurgitate the original text in its entirety, but it does know them.


Additionally, I believe a general norm has been developing to not post raw output of ChatGPT without commentary/editing, because it's low value. If anyone wanted similar output, they could just ask ChatGPT themselves. It also often gives false info, as this one demonstrates. You need to fact check it and demonstrate that before posting it if you're posting it as a source of information, IMO.


[flagged]


With the way copyright now lasts roughly forever and covers roughly everything, it can't expect to be respected. Copyright owners have used their power to tilt the bargain entirely in their favour. They have only themselves to blame if people increasingly don't want to obey their rules.


Yeah, you're right, it's totally justified to steal the work of still-living artists without their permission to build massive for-profit systems to automate their jobs away because some corporations have abused copyright.


Not sure why they are complaining. Even if having an AI read their work is "stealing", they'll still be making money from their copyright for the next 100 years at least.

The "artists" seem to be as happy to abuse copyright as the corporations.


> A lot of people, especially on hacker news, believe something being on the internet is a license to do whatever the fuck they want with that data.

A license to do whatever the fuck they want? No. To do things other than make copies? Sure. Copyright is the right to copy, no more than that. Learning from something is not copying it. If you want to complain about memorisation, that’s fair. But learning from something is not something copyright was intended to restrict, so no, plenty of people absolutely will not care about copyright when it comes to this kind of thing, and rightly so.


I think you're reading too much into AI specific stuff, I would happily violate copyright for no reason at all.


ChatGPT told me it doesn't have the text of the US Declaration of Independence due to copyright. It does not have the English Magna Carta within its accessible text. That seems unexpected. It does have the US Constitution.


Large language models don’t know what they know. Asking them what they know is often going to give you an incorrect answer.

  $ llm -m 4 "Quote the declaration of independence:"
IN CONGRESS, JULY 4, 1776

The unanimous Declaration of the thirteen united States of America

When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed...

(Note: The full text of the Declaration of Independence is quite lengthy, so I have included the most well-known portion here. The full text, including a list of grievances against King George III and the signatures of the signers, can be found elsewhere.)


It made that answer up.



Perhaps there's part of the prompt that tells GPT not to tell users details about written works, to avoid embarrassing leaks of copyrighted text, but the prompt is slightly too strict and makes GPT also refuse to talk about 'open' texts? Pure speculation.


I don't think that's the case, although there was an interesting bug a while back where it would freeze after each word when asked to quote the opening to, e.g., Moby Dick.



