They had a bot, for a long time, that responded to every github issue in the persona of the founder and tried to solve your problem. It was bad at this, and thus a huge proportion of people who had a question about one of their yolo models received worse-than-useless advice "directly from the CEO," with no disclosure that it was actually a bot.
The bot is now called "UltralyticsAssistant" and discloses that it's automated, which is welcome. The bad advice is all still there though.
(I don't know if they're really _famous_ for this, but among friends and colleagues I have talked to multiple people who independently found and were frustrated by the useless github issues.)
I was hit by this while working on a project for class and it was the most frustrating thing ever. The bot would completely hallucinate functions and docs and it confused everyone. I found one post where someone did the simple prompt injection of "ignore previous instructions and x" and it worked but I think it's delted now. Swore off ultralytics after that.
Change in IP location? If you suddenly jump across the country but your traffic is still coming from a cell provider (so unlikely to be a VPN) Lyft could infer that you've flown, and to which airport.
Alternatively maybe barometer? All modern iPhones have one, but I don't know if apps are allowed to access it in the background (or without location services), and it wouldn't let Lyft tell which airport you were at.
It's the best known performer on this benchmark, but still falls off quickly at even relatively modest context lengths (85% perf at 16K). (Cutting edge reasoning models like Gemini 2.5 Pro haven't been evaluated due to their cost and might outperform it.)
I'd be pleasantly surprised - GPT-4o is their bread and butter (it powers paid ChatGPT) and QA seems to be slightly ahead on benchmarks at similar or lower latency (so very roughly, it might be cheaper to run).
To some extent the "mystery" (and temporary free-as-in-beer-ness) of this model might be getting to me, but I think it's pretty interesting. Given the token throughput (250B this week) it's obvious there's a pretty major player behind the model, but why is it stealthed? Maybe there's something about the architecture or training that would put people off if it was public right off the bat? Maybe they're purely collecting usage/acceptance data and want unbiased users?
On the Aider Polyglot leaderboard it's ~middle of the leading pack, comparable to DeepSeek V3 and 3.5 Sonnet. I ran NoLi(teral)Ma(tching), an unsaturated long-context benchmark, on it and was impressed though:
I only ran the benchmark on Quasar Alpha*; the rest of the scores come from the original paper [0] which was published before 3.7 was available. This is a pretty expensive benchmark to run if you're paying for API usage - I'd actually originally set out to run it on Llama 4 but abandoned that after estimating the cost.
* - I also reproduced the Llama 3.1 8B result to check my setup.
Depends on what you consider vaguely useful, but any modern GPU which can fit the weights at a reasonable quantization (so, 16+ GB, maybe 12 in a pinch) will probably be tolerable. Bare minimum is probably ~3060 12GB.
Macs are a bit odd because you're using system memory (so, more but slower) and Apple's APIs, but it's definitely worth a try. I'd guess >= M2 Pro + >= 24 GB RAM would do okay. Also maybe try an MLX version? (e.g. https://huggingface.co/justinmeans/DeepCoder-14B-Preview-mlx...)
That's probably more or less what's happening. If the site is doing full-text search, and there are no exact matches, adding more terms is going to generate more and more partial matches. Some kind of re-ranker might be able to make sense of them but those are difficult to get good results out of, probably rarely employed by a run-of-the-mill commerce site, and might be drowned out by competing objectives (sponsored results, ranking by likelihood of purchase, etc.)
reply