
The person you are replying to said it clearly: "there is no such corpus of text that contains only accurate knowledge"

Deep learning learns a model of the world, and that model can be arbitrarily inaccurate. As far as a DL model is concerned, Earth may as well have 10 moons. For the model to learn that Earth has only 1 moon, there has to be a dataset which encodes only that fact, and never more moons. A drunk person who stares at the moon, sees more than one, and writes about it on the internet has to be excluded from the training data.
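To make that concrete, here is a toy sketch in Python (the corpus and proportions are made up for illustration) of what a statistical learner does: it recovers the distribution of its training data, errors included, so the drunk observer's extra moons end up in the model with non-zero probability:

    from collections import Counter

    # Hypothetical corpus: most sources say Earth has 1 moon,
    # a few noisy ones claim otherwise (numbers are invented).
    corpus = ["1 moon"] * 950 + ["2 moons"] * 40 + ["10 moons"] * 10

    # A maximum-likelihood model simply reproduces the data's proportions.
    counts = Counter(corpus)
    total = sum(counts.values())
    model = {answer: n / total for answer, n in counts.items()}

    print(model)  # {'1 moon': 0.95, '2 moons': 0.04, '10 moons': 0.01}

Sampling from that model occasionally yields the wrong answer, because the wrong answer is in the data.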

Also, the model of the Othello world is very different from a model of the real world. I don't know about Othello, but in chess it is pretty well known that there are more possible chess games than atoms in the universe. For all practical purposes, the set of all possible chess games is infinite.
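For scale, the usual Shannon-style back-of-the-envelope (the branching factor and game length are assumed round numbers, not exact values):

    # Rough estimate: ~35 legal moves per position, ~80 plies per game.
    branching, plies = 35, 80
    games = branching ** plies   # ~3e123 possible games
    atoms = 10 ** 80             # common estimate of atoms in the universe

    print(games > atoms)        # True
    print(len(str(games)) - 1)  # ~123 orders of magnitude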

The dataset of every possible event on Earth, every second, is also larger than the number of atoms in the universe. For all practical purposes, it is infinite as well.

Is one of these datasets more infinite than the other? Does modern DL state that all infinities are the same?




Wrong again. When you apply statistical learning over a large enough dataset, the wrong answers simply become random noise (a consequence of the central limit theorem), the kind of noise which deep learning has always excelled at filtering out, long before LLMs were a thing, and the truth becomes a constant offset. If you have thousands of pictures of dogs and cats and some were incorrectly labelled, you can still train a perfectly good classifier that achieves close to 100% accuracy (and even beats humans) on validation sets. It doesn't matter if a bunch of drunk labellers tainted the ground truth, as long as the dataset is big enough. That was the state of DL 10 years ago. Today's models can do a lot more than that. You don't need infinite datasets, they just need to be large enough and cover your domain well.
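A minimal sketch of that claim with scikit-learn (the synthetic data, the 10% flip rate, and the linear model are assumptions for illustration, not a benchmark):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for "cats vs dogs".
    X, y = make_classification(n_samples=20000, n_features=20,
                               n_informative=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=0)

    # Drunk labellers: flip 10% of the training labels at random.
    rng = np.random.default_rng(0)
    flip = rng.random(len(y_tr)) < 0.10
    y_noisy = np.where(flip, 1 - y_tr, y_tr)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_noisy)

    # Score on a clean test set: the random label noise largely
    # averages out, the underlying signal survives.
    print(clf.score(X_te, y_te))

With noise this benign, the clean-test accuracy stays close to the noise-free baseline, which is the "noise averages out" effect described above.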


> You don't need infinite datasets, they just need to be large enough and cover your domain well.

When you are talking about distinguishing noise from signal, or truth from not-quite-truth, and the domain is sufficiently small, e.g. a game like Othello or data from a corporation, then I agree with everything in your comment.

When the domain is huge, distinguishing truth from lies/non-truth/not-quite-truth is impossible. There will never be such a high-quality dataset, because everything changes over time; truth and lies are a moving target.

If we humans cannot distinguish between truth and non-truth, but the A.I. can, then we are talking about AGI. Then we can put the machines to work discovering new laws of physics. I am all for it, I just don't see it happening anytime soon.


What you're talking about is by definition no longer fact but opinion. Even AGI won't be able to turn opinions into facts. But LLMs are already very good at giving opinions rather than facts, thanks to alignment training.




