I don't see how this is mind blowing, or even mildly surprising! It's essentiall...

simonw · 2025-04-26T15:22:15 1745680935

If you want to be impressed I suggest trying this yourself on your own photos.

I don't consider it my job to impress or mind-blow people: I try to present as realistic as possible a representation of what this stuff can do.

That's why I picked an example where its first guess was 200 miles off!

cyral · 2025-04-27T04:14:40 1745727280

Reading the replies to this is funny. It's like the classic dropbox thread. "But this could be done with a nearest neighbor search and feature detection!" If this isn't mind blowing to someone I don't know if any amount of explaining will help them get it.

casey2 · 2025-04-27T09:15:05 1745745305

It's not mindblowing because there were public systems doing performing much better years earlier. Using the exact same tech. This is less like rsync vs drop box and more like you are freaking out over Origin or Uplay when Steam has been around for years.

simonw · 2025-04-27T13:54:34 1745762074

Which public systems were those?

HarHarVeryFunny · 2025-04-26T16:58:23 1745686703

I'm not a computer. I expect a computer to also do better than me at memorizing the phone book, but I'm not impressed by it.

simonw · 2025-04-26T17:10:32 1745687432

In that case, are you at all surprised that this technology did not exist two years ago?

HarHarVeryFunny · 2025-04-26T18:08:45 1745690925

I'm not sure what you're getting at. What's useful about LLMs, and especially multi-modal ones, is that that you can ask them anything and they'll answer to best of their ability (especially if well prompted). I'm not sure that o3, as a "reasoning" model is adding much value here - since there is not a whole lot of reasoning going on.

This is basically fine-grained image captioning followed by nearest neighbor search, which is certainly something you could have built as soon as decent NN-based image captioning became available, at least 10 years ago. Did anyone do it? I've no idea, although it'd seem surprising if not.

As noted, what's useful about LLMs is that they are a "generic solution", so one doesn't need to create a custom ML-based app to be able to do things like this, but I don't find much of a surprise factor in them doing well at geoguessing since this type of "fuzzy lookup" is exactly what a predict-next-token engine is designed to do.

simonw · 2025-04-26T18:13:15 1745691195

How does nearest neighbor search relate to this?

HarHarVeryFunny · 2025-04-26T18:35:43 1745692543

If you forget the LLM implementation, fundamentally what you are trying to do here is first detect a bunch of features in the photo (i.e. fine-grain image captioning "in foreground a firepit with safety warning on glass, in background a model XX car parked in front of a bungalow, in distance rolling hills" etc) then do a fuzzy match of this feature set with other photos you have seen - which ones have the greatest number of things in common to the photo you are looking up? You could implement this in a custom app by creating a high-dimensional feature space embedding then looking for nearest neighbors, similar to how face recognition works.

Of course an LLM is performing this a bit differently, and with a bit more flexibility, but the starting point is going to be the same - image feature/caption extraction, which in combination then recall related training samples (both text-only, and perhaps multi-model) which are used to predict the location answer you have asked for. The flexibility of the LLM is that it isn't just treating each feature ("fire pit", "CA licence plate") as independent, but will naturally recall contexts where multiple of these occur together, but IMO not so different in that regard to high dimensional nearest neighbor search.

simonw · 2025-04-26T18:41:05 1745692865

Thanks, that's a good explanation.

My hunch is that the way the latest o3/o4-mini "reasoning" models work is different enough to be notable.

If you read through their thought traces they're tackling the problem in a pretty interesting way, including running additional web searches for extra contextual clues.

HarHarVeryFunny · 2025-04-26T21:04:09 1745701449

It's not clear how much the reasoning helped, especially since the reasoning OpenAI display is more post-hoc summary of what it did that the actual reasoning process itself, although after the interest in DeepSeek-R's traces they did say they would show more. You would think that potentially it could do things like image search to try to verify/reject any initial clue-based hunches, but not obvious whether it did that or not.

The "initial" response of the model is interesting:

"The image shows a residential neighborhood with small houses, one of which is light green with a white picket fence and a grey roof. The fire pit and signposts hint at a restaurant or cafe, possibly near the coast. The environment, with olive trees and California poppies, suggests a coastal California location, perhaps Central Coast like Cambria or Morro Bay. The pastel-colored houses and the hills in the background resemble areas like Big Sur. A license plate could offer more, but it's hard to read."

Where did all that come from?! The leap from fire pit & signposts to possible coastal location is wild (& lucky) if that is really the logic it used. The comment on potential licence plate utility, without having first noted that a licence plate is visible is odd, seemingly either an indication that we are seeing a summary of some unknown initial response, and/or perhaps that the model was trained on a mass of geoguessing data where photos were paired not with descriptions but rather commentary such as this.

The model doesn't seem to realize the conflict between this being a residential neighborhood, and there being a presumed restaurant across the road from a residence!

casey2 · 2025-04-27T09:17:50 1745745470

So you admit that this tech is at least 2 years old publicly and likely much older privately?

skydhash · 2025-04-26T17:22:55 1745688175

Did it not, or no one was interested enough to build one? I’m pretty certain there’s a database of portraits somewhere where they search id details from photograph. Automatic tagging exists for photo software. I don’t see why that can be extrapolated to landmarks with enough data.

XenophileJKO · 2025-04-26T18:28:23 1745692103

I think you are underestimating the importance of a "world model" in the process. It is the modeling of how all these details are related to each other that is critical here.

The LLM will have an edge by being able to draw on higher level abstract concepts.

casey2 · 2025-04-27T09:25:19 1745745919

I think you are overestimating how much knowledge is o3s world model. Just because it can output something doesn't mean it's likely that it will substantially affect it's future outputs. Even just talking to it about college level algebra it seems to not understand these abstract concepts at all. I definitely don't feel the AGI I feel like it's a teenager trying to BS it's way through an essay with massive amounts of plagiarism.

simonw · 2025-04-26T17:31:03 1745688663

If it existed two years ago I certainly couldn't play with it on my phone.

skydhash · 2025-04-26T17:38:53 1745689133

You’re not playing with it on your phone. You’re accesing a service with your phone. Like saying you can use emacs on iOS when you are just ssh-ing to a remote linux box.