
My personal 5 cents: reasoning will be there when an LLM gives you some kind of outcome and, when questioned about it, can explain every bit of the result it produced.

For example, if we asked an LLM to produce an image of a "human woman photorealistic", it produces a result. After that you should be able to ask it "tell me about her background", and it should be able to explain: "Since the user didn't specify a background in the query, I randomly decided to draw her standing in front of a fantasy backdrop of Amsterdam's iconic houses. Amsterdam houses are usually 3 stories tall, attached to each other, and 10 meters wide. They usually have cranes on the top floor, which help bring goods up, since the doors are too narrow for any object wider than 1 m. The woman stands approximately 25 meters in front of the houses. She is 1.59 m tall, which gives us the correct perspective. It is 11:16 am on August 22nd, which I used to calculate the correct position of the sun and align all shadows with the projected lighting conditions. The color of her skin is set at RGB:xxxxxx, chosen randomly", etc.

And it is not too much to ask of LLMs. They have access to all the information above, since they have read the whole internet. So there is definitely a description of Amsterdam architecture, of what a human body looks like, and of how to correctly estimate the time of day from shadows (and vice versa). The only thing missing is the logic that connects all this information and applies it correctly to generate the final image.

I like to think of LLMs as fancy genius compression engines. They took all the information on the internet, compressed it, and can cleverly query this information for the end user. That is tremendously valuable, but whether intelligence emerges out of it, I'm not sure. Digital information doesn't necessarily contain everything needed to understand how it was generated and why.


I see two approaches for explaining the outcome: 1. Reasoning back over the result and justifying it. 2. Explainability: justifying the output by looking at which neurons fired. The first can lead to lying; think of a high schooler explaining copied homework. The second does access the paths that influenced the decision, but it is a hard task due to the inherent way neural networks work.

> if we asked an LLM to produce an image of a "human woman photorealistic" it produces a result

Large language models don't do that. You'd want an image model.

Or did you mean "multi-model AI system" rather than "LLM"?


It might be possible for a language model to paint a photorealistic picture though.

It is not.

You are confusing LLMs with generative AI.


No, I'm not confusing them. I realize that LLMs sometimes connect with diffusion models to produce images. I'm talking about language models actually describing the pixel data of the image.

Can an LLM use tools like humans do? Could it use an image model as a tool to query the image?

No, an LLM is a Large Language Model.

It can language.


You could teach it to emit patterns that (through other code) invoke tools, and loop the results back to the LLM.
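To sketch what that loop could look like (everything here is made up for illustration: chat() stands in for any chat-completion API, the TOOL: convention is invented, and describe_image is a hypothetical image-model tool):

    import json

    # Scripted stand-in for a real chat-completion API, so the loop
    # below runs end to end without a model. Both replies are made up.
    SCRIPT = iter([
        'TOOL:describe_image {"path": "woman.png"}',
        "She is standing in front of Amsterdam canal houses.",
    ])

    def chat(messages):
        return next(SCRIPT)

    # Tools the model may invoke; describe_image stands in for an
    # image model queried as a tool.
    TOOLS = {
        "describe_image": lambda path: f"(caption for {path})",
    }

    def run(prompt):
        messages = [{"role": "user", "content": prompt}]
        while True:
            reply = chat(messages)
            # Convention: the model requests a tool by emitting
            # TOOL:<name> <json-args> on its own.
            if reply.startswith("TOOL:"):
                name, _, args = reply[5:].partition(" ")
                result = TOOLS[name](**json.loads(args))
                # Loop the tool result back to the model.
                messages.append({"role": "assistant", "content": reply})
                messages.append({"role": "user", "content": f"RESULT: {result}"})
            else:
                return reply

    print(run("Tell me about the background of this image."))

The LLM itself only ever emits and consumes text; the surrounding code is what actually invokes the tool.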

Bad times create strong people.

To clarify, I really think that is what's happening. People feel that their future success is not guaranteed, and they make safer choices, staying clear-minded and focused to achieve it. Probably it's just my bias talking...

Very cool.

However, I can't fully agree that generating a 3D scene "on the fly" is the future of maps and of many other AR use cases.

The thing with geospatial objects (buildings, roads, signs, etc.) is that they are very static: few changes are made to them, and many of those changes are irrelevant to the majority of use cases. For example: today your house is white, and in 3 years it has stains and a yellowish tint due to age, but everything else is the same.

Given that storage is cheap and getting cheaper, that 5G and local network bandwidth is already faster than most current use cases need, and that graphics compute is still bound by GPU performance, I'd say it would be much more useful to identify the location and the building you are looking at and pull an accurate model from the cloud (with further optimisations as needed, like pulling only the data the user has, or needs, access to for the task at hand). Most importantly, users only need access to a small subset of 3D space on a daily basis, so you can keep a local cache on end devices for best performance and rendering. Or stream the rendered result from the cloud, as NVIDIA GDN does.
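As a rough sketch of that "identify, pull from the cloud, cache locally" path (the registry URL, the tile naming, and the .usdz payload are all hypothetical):

    import hashlib, pathlib, urllib.request

    # Hypothetical tile-addressed model registry; not a real service.
    REGISTRY = "https://models.example.com"
    CACHE = pathlib.Path("~/.cache/ar-models").expanduser()

    def model_for(lat, lon, lod=2):
        """Return a local path to the 3D model covering (lat, lon)."""
        # Assumed tile naming scheme, for illustration only.
        tile = f"{lat:.4f}_{lon:.4f}_lod{lod}"
        cached = CACHE / (hashlib.sha1(tile.encode()).hexdigest() + ".usdz")
        if cached.exists():
            # Serve from the on-device cache: the world is static
            # enough that this hits almost every day.
            return cached
        CACHE.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(f"{REGISTRY}/{tile}.usdz", cached)
        return cached

The point of the sketch is the order of operations: the network is only touched on a cache miss, which for static geometry should be rare.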

The most precise models will come from CAD files for newly built buildings, then retrospectively from CAD files of buildings built in the last 20-30 years (I would bet most of them had some sort of computer model made beforehand), and finally, going back even further, from AI looking at old 2D construction plans and reconstructing the buildings in 3D.

Once a building is reconstructed (or a concrete pole, like the one shown in the article), you can pull its 3D model from the cloud and place it in front of the user; this covers 95% of AR use cases. For the other 5% you might want real-time recognition of the current state of surfaces or of changes in geometry (like tracking changes in road quality against previous scans or a reference model), but those cases can be tackled separately, and having a precise 3D model will only help; it won't need to be reconstructed from scratch.

This is a good first step towards a 3D map; however, there should be an option to go to the real location and have an expert edit the 3D plan, so that the model is precise and not just "kind of" precise.


Very interesting. What is the current state of this tech?


No, it should be owned by the owners of the land on which these objects are located. You should be able to grant access, at different levels of detail, to public or private entities that need it, and revoke it at will. Maybe make some money out of it.

A 3D artist can create a model of a space and offer the rights to the owner of the land, who in turn can choose to create their own model or use the one provided by the artist.


This is amazing.


Super interesting! Thank you!


I've heard that there is a Library Genesis extension for Google Chrome that lets you search the catalog right from the browser window. It has everything Sci-Hub has to offer. The article from the post was found without any issue.

Rumor has it the extension has been working just fine for the past year or more, so it is quite reliable.


Why would you spread such unverified rumours on an online forum? Whoever acts on them might get themselves into trouble...


Great post, and Elon Musk has a similar rule in his thinking.

For anyone who liked the trick in the post, consider checking out TRIZ: https://en.m.wikipedia.org/wiki/TRIZ

There are many interesting ideas in this framework, but according to TRIZ, one of the first steps in solving a problem is to "formulate the ideal final result".

Now the "Ideal final result" has a specific definition: The part of the system doesn't exist but the function is still performed.

I'm having a lot of fun with this and other TRIZ tools when solving everyday problems. You might like it as well!

As for A/B testing and getting unexpected results: TRIZ has an explanation for why this works, called "psychological inertia". I.e., when an engineer gets a problem, it is usually already formulated in a certain way, and the engineer carries all kinds of assumptions before he even starts solving it. This leads him to think along specific "rails" and not get out of the box. An algorithm like TRIZ lets you break through the psychological inertia and look at the problem with fresh eyes.

Another trick you might use to find interesting solutions to the problem from the post: "make the problem more difficult". I.e. instead of "how to make the calculator simple and understandable", formulate it differently: "how to make the calculator simple, understandable, visual, fun to use and interact with, and something you'd want to share with your colleagues?"

"Turn harm into benefit". calculator in the post is treated as a necessary evil. Flip it. Now we have a calculator, but we could show some extra info next to prices, which our competitors can't do. We can compare with competitors and show that our prices are better, calculator can serve as a demo of how customer is always in control of their spending as the same interface is available after they become customer to control their spend etc.

Formulating it this way has already given me some ideas for what could be added to the calculator to make it work.

Hope it helps someone.


This is super cool!

As a marketing spin, consider packaging it in NVIDIA's NIM format and making it generate the 3D graph as an OpenUSD scene. From where I'm standing, this route has a lot of potential.

Also, if you've never looked into it, there is a project called Wikidata, where each object has a unique ID and a place in a hierarchy, which helps build the semantic web. Exploring their data through your interface might be effective. (Please check for similar projects first, though, as the idea seems straightforward and someone might have already done it.)
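If it helps, here is a minimal sketch of pulling a slice of that hierarchy from Wikidata's public SPARQL endpoint (P279 is the "subclass of" property; Q571 should be "book", but double-check the QID, and the query shape is just an illustration):

    import json, urllib.parse, urllib.request

    # Wikidata's public SPARQL endpoint.
    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?child ?childLabel ?parent ?parentLabel WHERE {
      ?parent wdt:P279* wd:Q571 .
      ?child wdt:P279 ?parent .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 50
    """

    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY, "format": "json"})
    # WDQS rejects requests without a User-Agent.
    req = urllib.request.Request(url, headers={"User-Agent": "graph-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        rows = json.load(resp)["results"]["bindings"]

    # Each (child, parent) pair is an edge of the hierarchy graph
    # you could feed into the 3D layout.
    edges = [(r["childLabel"]["value"], r["parentLabel"]["value"]) for r in rows]
    print(edges[:5])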


You're a legend, mate. I know nothing of what you say, but it is very interesting. It is time I brushed up my skills, and I thank you very much for the clues! I'll look it up.

