To those who argue that LLMs might cheat by using EXIF data: I recently saw a post on Twitter (https://x.com/tszzl/status/1915212958755676350) and, out of curiosity, screen-captured the photo and passed it to o3. So no EXIF.
It took 4.5 minutes to guess the location. The guess was accurate (checked using Google Street View).
What was amazing about it:
1. The photo did not have ANY text
2. It picked out elements of the image and inferred from those, like a fountain in a courtyard or the shape of the buildings.
All in all, it's just mind-blowing how this works!
o3 burns through what I assume is single-digit dollars just to do some performative tool use to justify and slightly narrow down its initial intuition from the base model.
I don't see how this is mind blowing, or even mildly surprising! It's essentially going to use the set of features detected in the photo as a filter to find matching photos in the training set, and report the most frequent matches. Sometimes it'll get it right, sometimes not.
It'd be interesting to see the photo in the linked story at the same resolution as provided to o3, since the licence plate in the photo in the story is at way lower resolution than the zoomed-in version that o3 had access to. It's not a great piece of primary evidence to focus on, though, since a CA plate doesn't have to mean the car is in CA.
The clues that o3 doesn't seem to be paying attention to seem just as notable as the ones it does. Why is it not talking about car models, felt roof tiles, sash windows, mini blinds, the fire pit (with a warning on the glass, in English), etc?
Being location-doxxed by a computer trained on a massive set of photos is unsurprising, but the example given doesn't seem a great example of why this could/will be a game changer in terms of privacy. There's not much detective work going on here - just narrowing the possibilities based on some of the available information, and happening to get it right in this case.
Reading the replies to this is funny. It's like the classic Dropbox thread. "But this could be done with a nearest neighbor search and feature detection!" If this isn't mind-blowing to someone, I don't know if any amount of explaining will help them get it.
It's not mind-blowing because there were public systems performing much better years earlier, using the exact same tech. This is less like rsync vs Dropbox and more like freaking out over Origin or Uplay when Steam has been around for years.
I'm not sure what you're getting at. What's useful about LLMs, and especially multi-modal ones, is that you can ask them anything and they'll answer to the best of their ability (especially if well prompted). I'm not sure that o3, as a "reasoning" model, is adding much value here - since there is not a whole lot of reasoning going on.
This is basically fine-grained image captioning followed by nearest neighbor search, which is certainly something you could have built as soon as decent NN-based image captioning became available, at least 10 years ago. Did anyone do it? I've no idea, although it'd seem surprising if not.
As noted, what's useful about LLMs is that they are a "generic solution", so one doesn't need to create a custom ML-based app to be able to do things like this, but I don't find much of a surprise factor in them doing well at geoguessing since this type of "fuzzy lookup" is exactly what a predict-next-token engine is designed to do.
If you forget the LLM implementation, fundamentally what you are trying to do here is first detect a bunch of features in the photo (i.e. fine-grained image captioning: "in foreground a fire pit with safety warning on glass, in background a model XX car parked in front of a bungalow, in distance rolling hills", etc.), then do a fuzzy match of this feature set against other photos you have seen - which ones have the greatest number of things in common with the photo you are looking up? You could implement this in a custom app by creating a high-dimensional feature space embedding and then looking for nearest neighbors, similar to how face recognition works.
Of course an LLM performs this a bit differently, and with a bit more flexibility, but the starting point is going to be the same - image feature/caption extraction, the combination of which then recalls related training samples (both text-only and perhaps multi-modal) that are used to predict the location answer you have asked for. The flexibility of the LLM is that it isn't treating each feature ("fire pit", "CA licence plate") as independent, but will naturally recall contexts where several of them occur together - though IMO that's not so different from high-dimensional nearest neighbor search.
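For what it's worth, here's a minimal sketch of the "custom app" version of this, scoring candidate locations by feature overlap (Jaccard) between the query photo and a reference set. The reference data and feature strings are made up for illustration; a real system would swap the set overlap for a learned embedding (e.g. CLIP) plus approximate nearest neighbor search.

    # Score reference photos by feature overlap with the query photo (toy data).
    REFERENCE = [
        {"location": "Cambria, CA", "features": {"fire pit", "pastel bungalow", "rolling hills", "CA licence plate"}},
        {"location": "Lisbon, PT",  "features": {"tiled facade", "tram tracks", "steep street"}},
        {"location": "Kyoto, JP",   "features": {"wooden machiya", "vending machine", "narrow lane"}},
    ]

    def guess_location(query: set[str], k: int = 2) -> list[tuple[str, float]]:
        """Rank reference locations by Jaccard overlap with the query's features."""
        def score(ref: set[str]) -> float:
            union = query | ref
            return len(query & ref) / len(union) if union else 0.0
        ranked = sorted(REFERENCE, key=lambda r: score(r["features"]), reverse=True)
        return [(r["location"], round(score(r["features"]), 2)) for r in ranked[:k]]

    # Features as they might come out of a fine-grained image captioning model.
    print(guess_location({"fire pit", "white picket fence", "rolling hills", "CA licence plate"}))
    # -> [('Cambria, CA', 0.6), ('Lisbon, PT', 0.0)]

The LLM's advantage, as noted above, is that the "features" and their co-occurrence statistics come for free from training rather than from a hand-built reference index.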
My hunch is that the way the latest o3/o4-mini "reasoning" models work is different enough to be notable.
If you read through their thought traces they're tackling the problem in a pretty interesting way, including running additional web searches for extra contextual clues.
It's not clear how much the reasoning helped, especially since the reasoning OpenAI displays is more a post-hoc summary of what it did than the actual reasoning process itself, although after the interest in DeepSeek-R1's traces they did say they would show more. You would think it could potentially do things like image search to try to verify or reject any initial clue-based hunches, but it's not obvious whether it did that or not.
The "initial" response of the model is interesting:
"The image shows a residential neighborhood with small houses, one of which is light green with a white picket fence and a grey roof. The fire pit and signposts hint at a restaurant or cafe, possibly near the coast. The environment, with olive trees and California poppies, suggests a coastal California location, perhaps Central Coast like Cambria or Morro Bay. The pastel-colored houses and the hills in the background resemble areas like Big Sur. A license plate could offer more, but it's hard to read."
Where did all that come from?! The leap from fire pit & signposts to a possible coastal location is wild (& lucky) if that is really the logic it used. The comment on the potential utility of the licence plate, without having first noted that a licence plate is visible, is odd - seemingly an indication either that we are seeing a summary of some unknown initial response, and/or that the model was trained on a mass of geoguessing data where photos were paired not with descriptions but with commentary like this.
The model doesn't seem to realize the conflict between this being a residential neighborhood, and there being a presumed restaurant across the road from a residence!
Did it not exist, or was no one interested enough to build one? I'm pretty certain there's a database of portraits somewhere where they look up ID details from a photograph. Automatic tagging exists in photo software. I don't see why that can't be extrapolated to landmarks, given enough data.
I think you are underestimating the importance of a "world model" in the process. It is the modeling of how all these details are related to each other that is critical here.
The LLM will have an edge by being able to draw on higher level abstract concepts.
I think you are overestimating how much knowledge is in o3's world model. Just because it can output something doesn't mean it's likely to substantially affect its future outputs. Even just talking to it about college-level algebra, it seems not to understand these abstract concepts at all. I definitely don't "feel the AGI"; I feel like it's a teenager trying to BS its way through an essay with massive amounts of plagiarism.
You're not playing with it on your phone. You're accessing a service with your phone. It's like saying you can use Emacs on iOS when you're just SSHing to a remote Linux box.
I have friends in faculty positions at well-known universities who were very unhappy about these practices, but could not publicly discuss it for fear of repercussions, prior to these events.
To be clear, I am not supporting any of the things happening. I do think the DEI thing went too far, but what the new administration is doing could be much worse.
If you are interested in this topic, I suggest watching the conversation between Edward Gibson and Lex Fridman. In the middle of the conversation [1] Edward talks about how there is a "human language comprehension network" within the human brain, which gets activated only when we read or speak a human language, and by nothing else. For example, for those who speak multiple languages, reading or writing in any of those languages activates the network, but reading gibberish or computer code does not activate it.
Thanks for the links. I watched some of his videos where he explains how DataColada did their forensic investigation into data manipulation.
What amazes me is how simple the fraud was (or at least the cases reported by Pete!). They basically just opened an Excel file, started from the top, and changed some random numbers until they reached the effect they were aiming for! Really? What about those who can do more sophisticated data manipulation?
> Really? What about those who can do more sophisticated data manipulation?
They win. They are rewarded with the power to control governments and whole populations, they are feted by the media, and they accumulate an army of staunch defenders who Believe The Science. When the true believers find their way onto juries or into the courts, they are able to destroy their opponents' lives (see the recent case of Mann vs Steyn, which will certainly be very encouraging and helpful for Gino in her quest to bankrupt her critics).
(author) Please feel free to open an issue if you try again. Poetry can be painful; I might just switch to a requirements.txt file in the future. (You can also skip Poetry if you want by just pulling everything in pyproject.toml into a requirements.txt file.)
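For anyone who wants that route, a rough sketch of the conversion. It assumes a [tool.poetry.dependencies] table and drops the version constraints, so it's a convenience hack rather than a proper lock-file export (`poetry export -f requirements.txt` does it faithfully, if the export plugin is available):

    # Pull dependency names out of pyproject.toml into an unpinned requirements.txt.
    import tomllib  # Python 3.11+; use the third-party `tomli` on older versions

    with open("pyproject.toml", "rb") as f:
        deps = tomllib.load(f)["tool"]["poetry"]["dependencies"]

    with open("requirements.txt", "w") as f:
        for name in deps:
            if name != "python":  # the python entry is a version constraint, not a package
                f.write(name + "\n")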
I found the use of Poetry a breath of fresh air compared to the usual Python silliness. Painless, as opposed to getting the CUDA stuff working, which took a lot longer.
I have a set of PDF files, and this week was thinking how I can link them to an LLM and be able to ask questions about them. So this was very timely.
I did a quick side-by-side test against Nougat, and Marker clearly works better. On the handful of PDFs I tested (academic papers without any math), Marker extracted considerably more text, finished the job faster, and did not crash on any PDF, while Nougat took a lot longer to finish and sometimes crashed with an out-of-memory error (could not allocate more than 7GB of RAM!).
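In case anyone wants to reproduce that kind of comparison, this is the sort of throwaway harness I mean: time each extractor over the same set of PDFs and count how much text comes out. The command templates and output file extensions are assumptions on my part; check them against the marker and nougat versions you actually have installed.

    # Crude side-by-side benchmark: wall-clock time and extracted-text size per tool.
    import glob, pathlib, subprocess, time

    TOOLS = {
        # tool name -> command template; {pdf} and {out} are filled in per file.
        # These invocations are assumptions -- adjust to your installed versions.
        "marker": ["marker_single", "{pdf}", "{out}"],
        "nougat": ["nougat", "{pdf}", "-o", "{out}"],
    }

    for pdf in glob.glob("papers/*.pdf"):
        for name, template in TOOLS.items():
            out = pathlib.Path("out") / name / pathlib.Path(pdf).stem
            out.mkdir(parents=True, exist_ok=True)
            cmd = [part.format(pdf=pdf, out=out) for part in template]
            start = time.perf_counter()
            result = subprocess.run(cmd, capture_output=True)
            elapsed = time.perf_counter() - start
            # Count characters in whatever text/markdown files the tool produced.
            chars = sum(len(p.read_text(errors="ignore"))
                        for p in out.rglob("*") if p.suffix in {".md", ".mmd", ".txt"})
            status = "ok" if result.returncode == 0 else f"exit {result.returncode}"
            print(f"{name:7s} {pathlib.Path(pdf).name:35s} {elapsed:7.1f}s {chars:9d} chars  {status}")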
I don't think tweeting and blogging can really be compared. I would see tweeting as a form of "talking to a group of people". You often don't research, proofread, and rewrite what you say when talking. The tweeting UI reinforces this too (mostly mobile devices, I assume, with a small text box).
Blogging, on the other hand, is for writing an essay. You may write the essay and just publish it, but in most cases you do some research and at least proofread it once. And again, the blogging UI is optimized for this: you have an empty page, nothing other than your written content.
And they really complement each other: you talk to people to get ideas and use them as the basis for your writing, and you write essays to share those ideas with people. I don't think you can ever replace Twitter (or similar services) with blogging.
Yeah I agree with this. Tweeting is more akin to posting to a forum or mailing list than writing a blog entry. Heck, social media in general is basically the forum/messenger equivalent of the current era.
Could some people use a social media service like a blog? Sure, in the same way some people used internet forums as the equivalent of a blog or article posting site in the 90s and 00s. But that's not really their purpose, and they're not going to be replacements for most people.
Some people write long form tweets or threads, and they can take hours to research/compose etc. Not as long as blogs, but still a decent amount of work. I personally prefer to follow people like that.
It sounds like this only applies to Google's own exposure notification service and would not apply to standalone contact-tracing apps such as the one being proposed by the UK government.
You can read the chat here: https://chatgpt.com/share/680a449f-d8dc-8001-88f4-60023323c7...